How to Serve with Ollama in Production

There are important considerations when an LLM's lifecycle depends on Ollama. To minimize errors and ensure scalability, certain environment variables should be set correctly during deployment.

Environment Variables Recap

The following list covers the environment variables to consider before hosting and selecting your cloud compute services; a minimal example of setting a few of them is shown after the list.

  • CUDA_VISIBLE_DEVICES

  • GPU_DEVICE_ORDINAL

  • HIP_VISIBLE_DEVICES

  • HSA_OVERRIDE_GFX_VERSION

  • HTTPS_PROXY

  • HTTP_PROXY

  • NO_PROXY

  • OLLAMA_CONTEXT_LENGTH=4096

  • OLLAMA_DEBUG=INFO

  • OLLAMA_FLASH_ATTENTION=false

  • OLLAMA_GPU_OVERHEAD=0

  • OLLAMA_HOST=http://0.0.0.0:11434

  • OLLAMA_INTEL_GPU=false

  • OLLAMA_KEEP_ALIVE=15m0s

  • OLLAMA_KV_CACHE_TYPE

  • OLLAMA_LLM_LIBRARY

  • OLLAMA_LOAD_TIMEOUT=5m0s

  • OLLAMA_MAX_LOADED_MODELS=0

  • OLLAMA_MAX_QUEUE=512

  • OLLAMA_MODELS=/root/.ollama/models

  • OLLAMA_MULTIUSER_CACHE=false

  • OLLAMA_NEW_ENGINE=false

  • OLLAMA_NOHISTORY=false

  • OLLAMA_NOPRUNE=false

  • OLLAMA_NUM_PARALLEL=1

  • OLLAMA_ORIGINS=[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*]

  • OLLAMA_REMOTES=[ollama.com]

  • OLLAMA_SCHED_SPREAD=false

  • ROCR_VISIBLE_DEVICES

  • http_proxy

  • https_proxy

  • no_proxy
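
As a minimal sketch, these variables can be exported in the shell (or service unit) that launches the server before running ollama serve. The values below mirror the defaults shown above and are illustrative rather than recommendations:

export OLLAMA_HOST=http://0.0.0.0:11434      # bind to all interfaces on port 11434
export OLLAMA_MODELS=/root/.ollama/models    # directory where model weights are stored
export OLLAMA_KEEP_ALIVE=15m0s               # unload a model after 15 idle minutes
export OLLAMA_NUM_PARALLEL=1                 # requests a loaded model serves in parallel
export OLLAMA_MAX_QUEUE=512                  # maximum queued requests before new ones are rejected
ollama serve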

Load-balancing your weights

To ensure a persisted model doesn't consume all your compute resources overnight, make sure this is set correctly:

OLLAMA_KEEP_ALIVE=15m0s
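
If a particular workload needs a different lifetime, the keep_alive field on an API request can override the server default. The sketch below assumes a model named llama3 is already pulled and uses an illustrative five-minute value:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello",
  "keep_alive": "5m"
}'

A keep_alive of 0 unloads the model immediately after the response, while -1 keeps it loaded indefinitely.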
