Deep Learning Models Providers in Production

Hosting servers / platforms that support large models or deep neural networks e.g. LLM, vision models. Utilizing

When you invoke LLMs, they retain previous inputs and outputs via caches during a chat session. This works fine for locally hosted models, but becomes a real challenge when supporting multiple users. Many platforms offer built-in services that solve this issue for you, but their marketing and cost structures can make you reconsider whether it’s worth it.

If I had to draw an analogy, the solution would look something like a cloud load balancer—except instead of simply routing traffic, it’s responsible for running and re-running large models. The twist is that I want my agent to remember more than just the current session; for example, I’d want Tom to recall important highlights and memories from the very beginning of his creation.

As of 2025, the current available AI hosting tools are displayed in the following table.

Pros

Cons

ollama

Oddly, any model served with ollama is faster than other tools (e.g. llmcpp)

More restrictive i.e. simplified or user-friendly which can be annoying sometimes.

llmcpp

Easier to customize the prompts and the model's architecture.

huggingface-cli

If you prefer traditional methods used in serving applications e.g. off-loading models on gpu can be manually managed.

Too much cloud development related requirements that it can bore a lazy person

PreviousText to speech (TTS)NextHow to serve with Ollama in Production

Last updated 5 days ago

Was this helpful?