How to Serve with Ollama in Production

There are important considerations when an LLM's lifecycle depends on Ollama. To minimize errors and ensure scalability, certain environment variables should be set correctly during deployment.

Environment Variables Recap

The following list covers the environment variables to consider before hosting and selecting your cloud compute services; a minimal example of setting a few of them is shown after the list.

  • CUDA_VISIBLE_DEVICES

  • GPU_DEVICE_ORDINAL

  • HIP_VISIBLE_DEVICES

  • HSA_OVERRIDE_GFX_VERSION

  • HTTPS_PROXY

  • HTTP_PROXY

  • NO_PROXY

  • OLLAMA_CONTEXT_LENGTH=4096

  • OLLAMA_DEBUG=INFO

  • OLLAMA_FLASH_ATTENTION=false

  • OLLAMA_GPU_OVERHEAD=0

  • OLLAMA_HOST=http://0.0.0.0:11434

  • OLLAMA_INTEL_GPU=false

  • OLLAMA_KEEP_ALIVE=15m0s

  • OLLAMA_KV_CACHE_TYPE

  • OLLAMA_LLM_LIBRARY

  • OLLAMA_LOAD_TIMEOUT=5m0s

  • OLLAMA_MAX_LOADED_MODELS=0

  • OLLAMA_MAX_QUEUE=512

  • OLLAMA_MODELS=/root/.ollama/models

  • OLLAMA_MULTIUSER_CACHE=false

  • OLLAMA_NEW_ENGINE=false

  • OLLAMA_NOHISTORY=false

  • OLLAMA_NOPRUNE=false

  • OLLAMA_NUM_PARALLEL=1

  • OLLAMA_ORIGINS=[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*]

  • OLLAMA_REMOTES=[ollama.com]

  • OLLAMA_SCHED_SPREAD=false

  • ROCR_VISIBLE_DEVICES

  • http_proxy

  • https_proxy

  • no_proxy
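
As a minimal sketch, these variables can be exported in the shell (or service unit) that launches the server before running ollama serve. The values below mirror the defaults shown above and are illustrative rather than recommendations:

export OLLAMA_HOST=http://0.0.0.0:11434      # bind to all interfaces on port 11434
export OLLAMA_MODELS=/root/.ollama/models    # directory where model weights are stored
export OLLAMA_KEEP_ALIVE=15m0s               # unload a model after 15 idle minutes
export OLLAMA_NUM_PARALLEL=1                 # requests a loaded model serves in parallel
export OLLAMA_MAX_QUEUE=512                  # maximum queued requests before new ones are rejected
ollama serve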

Load-balancing your weights

To ensure a persisted model doesn't consume all your compute resources overnight, make sure this is set correctly:

OLLAMA_KEEP_ALIVE=15m0s
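
If a particular workload needs a different lifetime, the keep_alive field on an API request can override the server default. The sketch below assumes a model named llama3 is already pulled and uses an illustrative five-minute value:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello",
  "keep_alive": "5m"
}'

A keep_alive of 0 unloads the model immediately after the response, while -1 keeps it loaded indefinitely.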
