# Deep Learning Models Providers in Production

When you invoke LLMs, they retain previous inputs and outputs via caches during a chat session. This works fine for locally hosted models, but becomes a real challenge when supporting multiple users. Many platforms offer built-in services that solve this issue for you, but their marketing and cost structures can make you reconsider whether it’s worth it.

If I had to draw an analogy, the solution would look something like a cloud load balancer—except instead of simply routing traffic, it’s responsible for running and re-running large models. The twist is that I want my agent to remember more than just the current session; for example, I’d want Tom to recall important highlights and memories from the very beginning of his creation.

As of 2025, the current available AI hosting tools are displayed in the following table.

|                   | Pros                                                                                                                                     | Cons                                                                                                 |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `ollama`          | <ul><li>Oddly, any model served with ollama is faster than other tools (e.g. <code>llmcpp</code>)</li></ul>                              | <ul><li>More restrictive i.e. simplified or user-friendly which can be annoying sometimes.</li></ul> |
| `llmcpp`          | <ul><li>Easier to customize the prompts and the model's architecture. </li></ul>                                                         | <ul><li></li></ul>                                                                                   |
| `huggingface-cli` | <ul><li>If you prefer traditional methods used in serving applications e.g. off-loading models on gpu can be manually managed.</li></ul> | <ul><li>Too much cloud development related requirements that it can bore a lazy person</li></ul>     |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://whoamimi.gitbook.io/blog/ai-ml-and-data-science-development/deep-neural-network-servers-for-local-hosting-or-prod.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
