
Initial Rollout

The serverless engine learns the cost-vs-performance profile of each GPU class in your search_params from real workers running real traffic (see Choosing GPUs). How quickly it settles into the most cost-effective mix depends on how quickly workers are recruited and released, so it helps to apply a test load during the first day of operation to give the engine enough signal to converge. A good practice is to scale load up to roughly double the expected number of required workers, then back down, repeating this cycle three times.
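The ramp pattern above can be sketched as a simple schedule of client-side concurrency targets. The `expected_workers` figure is an assumption about your deployment; the engine itself decides actual worker counts, and this only shapes the test load you drive at the endpoint:

```python
def rollout_ramp(expected_workers: int, cycles: int = 3) -> list[int]:
    """Concurrency target per phase: up to ~2x expected, then back to baseline.

    Three up/down cycles give the engine repeated recruit/release signal
    during the first day of operation.
    """
    schedule = []
    for _ in range(cycles):
        schedule.append(2 * expected_workers)  # push load to roughly double
        schedule.append(expected_workers)      # settle back to baseline
    return schedule

print(rollout_ramp(4))  # → [8, 4, 8, 4, 8, 4]
```

Each entry is a concurrency level to hold for some dwell period (long enough for workers to be recruited and released) before moving to the next.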

Simulating Load

For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository: https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py
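If you prefer not to pull in the SDK, a minimal load generator can be built from the standard library alone. This is a hedged sketch: `ENDPOINT_URL` and the request payload are placeholders you must adapt to your deployment, and the linked vLLM example shows a fuller client:

```python
# Minimal load-simulation sketch using only the Python standard library.
# ENDPOINT_URL and the JSON body are placeholders, not real Vast endpoints.
import concurrent.futures
import json
import urllib.request

ENDPOINT_URL = "https://example.invalid/v1/completions"  # placeholder

def send_request(prompt: str) -> int:
    """POST one request to the endpoint and return the HTTP status code."""
    body = json.dumps({"prompt": prompt, "max_tokens": 32}).encode()
    req = urllib.request.Request(
        ENDPOINT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status

def run_load(n_requests: int, concurrency: int, send=send_request) -> list:
    """Fire n_requests at a fixed concurrency level.

    The `send` callable is injectable so the loop can be exercised
    without a live endpoint.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send, f"request {i}") for i in range(n_requests)]
        return [f.result() for f in futures]
```

Combine this with a ramp schedule (hold each concurrency level for several minutes) to give the engine sustained signal rather than a single spike.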

Managing for Bursty Load

  • Adjust min_workers: This changes the number of managed inactive workers, increasing capacity for peak demand
  • Check max_workers: Ensure this parameter is set high enough for the serverless engine to create the necessary number of workers during a burst
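The two parameters above can be expressed as a small configuration fragment. The parameter names come from this page; the numeric values are illustrative, and the call that applies them to an endpoint is not shown here (see the SDK/CLI docs for the exact invocation):

```python
# Illustrative scaling parameters for a bursty workload (values are examples).
bursty_params = {
    "min_workers": 5,    # keep a pool of managed inactive workers ready
    "max_workers": 50,   # headroom so the engine can absorb traffic peaks
}

# The engine can only scale within [min_workers, max_workers], so
# max_workers must leave room above the warm pool.
assert bursty_params["max_workers"] >= bursty_params["min_workers"]
```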

Managing for Low Demand or Idle Periods

  • Adjust min_load: Reducing min_load reduces the minimum number of active workers. Set it to 1 to keep a single active worker at minimum, or to 0 to allow all workers to go inactive.
  • Adjust min_workers: This changes the number of managed inactive workers
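As a configuration fragment, a low-demand setup might look like the following. Parameter names are from this page; values are illustrative examples, not recommendations:

```python
# Illustrative scaling parameters for idle or low-demand periods.
idle_params = {
    "min_load": 0,      # allow every worker to drop into an inactive state
    "min_workers": 1,   # still keep one managed inactive worker around
}
```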

Scaling to Zero

To allow your endpoint to fully scale to zero during idle periods, configure inactivity_timeout alongside your other scaling parameters. The inactivity_timeout value (in seconds) determines how long the endpoint must be idle before scaling down is permitted.
  • To scale to zero active workers (while keeping cold workers available): set min_load = 0 and configure a positive inactivity_timeout. Workers in the cold_workers pool will remain available for fast reactivation.
  • To scale to zero total workers: set min_load = 0, cold_workers = 0, and configure a positive inactivity_timeout. This minimizes cost during extended idle periods but incurs cold-start latency when traffic resumes.
  • To prevent scaling to zero regardless of other settings: set inactivity_timeout to a negative value (e.g., -1).
A value of 0 for inactivity_timeout disables inactivity-based gating entirely; the endpoint will rely solely on normal autoscaling decisions.
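The three scenarios above can be written out as configuration fragments. The parameter names come from this page; the 300-second timeout is an illustrative value, not a recommendation:

```python
# Scale to zero ACTIVE workers; cold workers stay available for fast restart.
scale_to_zero_active = {"min_load": 0, "inactivity_timeout": 300}

# Scale to zero TOTAL workers; cheapest when idle, but cold-start latency
# applies when traffic resumes.
scale_to_zero_total = {"min_load": 0, "cold_workers": 0, "inactivity_timeout": 300}

# Never scale to zero, regardless of other settings.
never_scale_to_zero = {"inactivity_timeout": -1}
```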

Managing Queue Time

Use max_queue_time and target_queue_time to control how the autoscaler responds to request queuing:
  • Increase max_queue_time to allow more requests to buffer on each worker before the system holds them in the global queue. This is useful for workloads with predictable, longer processing times.
  • Decrease target_queue_time to trigger more aggressive scale-up when queue times rise, reducing latency at the cost of potentially higher worker counts.
  • Increase target_queue_time to tolerate higher queue times before scaling up, reducing costs when some latency is acceptable.
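A latency-sensitive tuning of these two parameters might look like the fragment below. Parameter names are from this page; the values (assumed to be in seconds) are illustrative only:

```python
# Illustrative queue-time tuning for a latency-sensitive workload.
queue_params = {
    "target_queue_time": 2.0,   # lower => scale up more aggressively
    "max_queue_time": 10.0,     # per-worker buffering limit before requests
                                # are held in the global queue
}

# target_queue_time should sit below max_queue_time, so scale-up triggers
# before requests pile up against the buffering limit.
assert queue_params["target_queue_time"] <= queue_params["max_queue_time"]
```

Raising target_queue_time instead (say, toward max_queue_time) trades latency for lower worker counts, as described above.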