# Serving AI/LLMops In Production

{% hint style="warning" %}
**NOTE**

No longer active in web and mobile apps.
{% endhint %}

## Background

This post documents my experience containerizing a tarot reading application powered by local LLMs served through Ollama. Although this setup is now **archived** (moved to the `_taro` directory), the way this Docker Compose file serves the MLOps stack is tidy enough to deserve a write-up of its own.

This architecture was intended to be served on [Lightning Studio](https://lightning.ai/), with MLOps jobs scheduled via AWS Lambda functions and LLM/ML models downloaded from a [Hugging Face repository](https://huggingface.co/docs/huggingface_hub/en/guides/repository).

Serving models with Lightning Studio doesn't require containerizing projects with Docker; the platform is designed to abstract that away. It is built to take care of cloud development problems so that more time can go into experimenting with machine learning models.

My intent with this architecture was to be able to move away from Lightning Studio once the short-term objectives were achieved.

## Architecture Overview

Requirements:

* Hugging Face repository
* Any cloud container host, e.g. AWS EC2, Google Cloud Run, etc.
* Any cloud scheduler, e.g. AWS Lambda, Google Cloud Functions, etc.

The system uses a multi-container Docker Compose setup with two main services:

{% @mermaid/diagram content="graph LR
A\[Frontend FullStack] --> B\[FastAPI App Container on Lightning Studio]
B --> C\[Ollama Server :11434]
C --> D\[Stored Local LLM Models: Hugging Face Repos and AWS S3 Bucket]
B -.-> E\[Shared Volume: ollama-data]
C -.-> E" %}

### Service Communication

Both containers communicate over a custom bridge network (`tarotarot-network`), allowing them to discover each other by service name rather than IP address.
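As a minimal sketch of what name-based discovery looks like from the `app` side, the snippet below builds the Ollama URL from the Compose service name `ollama` and the default port 11434. The environment variable `OLLAMA_BASE_URL` and both helper functions are assumptions made for illustration; they are not taken from the archived project.

```python
import os

# Compose service name plus Ollama's default API port.
DEFAULT_OLLAMA_URL = "http://ollama:11434"


def ollama_base_url(env=None):
    """Return the Ollama base URL, preferring an override from the environment.

    OLLAMA_BASE_URL is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL)
    return url.rstrip("/")


def ollama_endpoint(path, env=None):
    """Join the base URL with an API path such as /api/tags."""
    return f"{ollama_base_url(env)}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(ollama_endpoint("/api/tags"))
```

Keeping the default tied to the service name means the same code works inside the Compose network, while an override such as `http://localhost:11434` can be supplied through `.env` when running the app directly on the host.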

### Service Breakdown

#### 1. Application Container (`app`)

**Base Image**: `python:3.12-slim`

**Key Features**:

* Non-root user execution (UID 1000) for security
* Minimal system dependencies (curl, gcc, build tools)
* FastAPI server via Uvicorn on port 8080
* Environment variables loaded from `.env`

**Build Process**:

```dockerfile
FROM python:3.12-slim
# Create the non-root user that will run the app
RUN useradd -m -u 1000 user
RUN apt-get update && apt-get install -y curl gcc make build-essential
COPY requirements.txt .
RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
# Keep the package directory so that `taro.app:app` is importable from /code
COPY taro /code/taro
WORKDIR /code
USER user
```

**Runtime Command**:

```bash
uvicorn taro.app:app --host 0.0.0.0 --port 8080
```

#### 2. Ollama Server Container (`ollama`)

**Base Image**: `ollama/ollama:latest`

**Purpose**: Hosts the local LLM inference engine

**Initialization**:

* Custom startup script (`start_ollama.sh`) handles model loading
* Persistent model storage via Docker volume
* Always pulls latest image to stay updated

**Port Exposure**: 11434 (Ollama's default API port)
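The contents of `start_ollama.sh` are not reproduced in this post. As a rough sketch of how the app could confirm that model loading has finished, the snippet below inspects the response of Ollama's `GET /api/tags` endpoint, which lists locally available models. The helper names are hypothetical, and the model names used in the usage note are placeholders rather than the models the project actually pulled.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` matches a local model, with or without an explicit tag."""
    for name in model_names(tags_payload):
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def fetch_tags(base_url="http://ollama:11434", timeout=5.0):
    """Call GET /api/tags on a running Ollama server and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

For example, `has_model(fetch_tags(), "llama3")` would report whether any tag of a placeholder `llama3` model is present once the server is up; the matching logic is kept separate from the HTTP call so it can be exercised without a running container.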

## Maintaining Databases & Volumes

Two volumes ensure data persistence:

1. **`ollama-data`**: Shared between both containers
   * Stores downloaded LLM models
   * Caches model weights
   * Persists across container restarts
2. **`.lightning`**: Application-specific data
   * Lightning AI checkpoints
   * Training artifacts

## Docker Compose File Code

Before running this Docker Compose file, your Python project directory is assumed to be organised as follows:

```text
├── .docker-compose.yml
├── .dockerignore
├── .env
├── requirements.txt
│
├── taro/                        # Main application package (copied into the image under /code)
│   ├── __init__.py
│   ├── app.py                   # Exposes `app` for: uvicorn taro.app:app
│   │
│   ├── core/
│   │   ├── config.py
│   │   ├── settings.py
│   │   └── logging.py
│   │
│   ├── api/
│   │   ├── routes.py
│   │   ├── dependencies.py
│   │   └── schemas.py
│   │
│   ├── services/
│   │   ├── tarot_engine.py
│   │   ├── spread_generator.py
│   │   └── llm_service.py       # Wraps Ollama interaction
│   │
│   ├── ollama/
│   │   ├── build_class.py       # Your archived Ollama Build Class
│   │   └── client.py
│   │
│   ├── models/
│   │   ├── tarot_cards.py
│   │   └── prompts.py
│   │
│   └── utils/
│       ├── io.py
│       └── helpers.py
│
├── scripts/
│   └── ollama_root/             # Copied into Ollama container image
│       ├── start_ollama.sh
│       ├── Modelfile
│       └── preload_models.sh
│
├── ollama_root/                 # Persistent host directory (bind-mounted)
│   ├── models/
│   ├── manifests/
│   └── logs/
│
├── _taro/                       # Archived production snapshot
│   └── (legacy implementation)
│
├── tests/
│   ├── test_api.py
│   ├── test_spreads.py
│   └── test_llm_service.py
│
└── .lightning/                  # Lightning Studio local metadata (mounted)
```

<details>

<summary>Docker Compose File</summary>

```dockercompose
# .docker-compose.yml
# This stack was intended to run on any cloud or local host with Docker available, e.g. Lightning Studio.
# It should only be used with Ollama, which is no longer in production and is archived in the _taro directory, because it relies on the Ollama Build Class stored there.
# To run it, copy this file and .dockerignore into the project root, then change the main directory to _taro either in the inline Dockerfile or at the root folder.

services:
  app:
    build:
      # https://www.docker.com/blog/llm-docker-for-local-and-hugging-face-hosting/
      dockerfile_inline: |
        FROM python:3.12-slim

        RUN useradd -m -u 1000 user
        RUN chsh -s /bin/bash user

        RUN apt-get update && apt-get install -y \
            curl \
            gcc \
            make \
            build-essential \
            ca-certificates \
            bash \
        && rm -rf /var/lib/apt/lists/*

        ENV PYTHONPATH=/code

        COPY /requirements.txt .
        RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
        # Keep the package directory so that `taro.app:app` resolves with PYTHONPATH=/code
        COPY /taro /code/taro
        RUN chown -R user:user /code

        WORKDIR /code
        USER user
    container_name: app
    entrypoint:
      ["/usr/bin/bash", "-c", "uvicorn taro.app:app --host 0.0.0.0 --port 8080"]
    depends_on:
      - ollama
    env_file:
      - .env
    ports:
      - 8080:8080
    expose:
      - 8080
    networks:
      - tarotarot-network
    volumes:
      - ollama-data:/code/ollama_root
      - ./.lightning:/code/.lightning

  ollama:
    build:
      dockerfile_inline: |
        FROM ollama/ollama:latest
        WORKDIR /root
        COPY ./scripts/ollama_root /root
        RUN chmod +x /root/start_ollama.sh
    entrypoint: ["/usr/bin/bash", "./start_ollama.sh"]
    container_name: ollama_server
    ports:
      - 11434:11434
    expose:
      - 11434
    volumes:
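      # Only one mount may target /root/.ollama, so the shared named volume is used.
      # To persist models in the host directory instead, swap in: ./ollama_root:/root/.ollama
      - ollama-data:/root/.ollama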
    pull_policy: always
    tty: true
    networks:
      - tarotarot-network
    restart: unless-stopped

volumes:
  ollama-data:

networks:
  tarotarot-network:
    driver: bridge

```

</details>

**Development Notes**

Docker Compose CLI commands that helped me while tinkering with the containers:

```bash
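# The file name is non-standard, so pass it explicitly with -f

# Start services
docker-compose -f .docker-compose.yml up -d

# View logs
docker-compose -f .docker-compose.yml logs -f app

# Stop services
docker-compose -f .docker-compose.yml down

# Clean volumes (removes models)
docker-compose -f .docker-compose.yml down -v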
```
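After bringing the stack up, a quick way to confirm that both published ports (8080 for the app, 11434 for Ollama) accept connections is a small TCP probe. This is a sketch that assumes the stack runs on `localhost`; the function names are illustrative and not part of the project.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host="localhost", ports=(8080, 11434)):
    """Map each published port to whether it currently accepts connections."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    for port, ok in check_stack().items():
        print(f"{port}: {'up' if ok else 'down'}")
```

An open port only shows that something is listening; it does not prove that the model has finished loading, so this check complements the service logs rather than replacing them.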

## Improvements

* Add an `ngrok` container to the Docker Compose file
* FastAPI could be swapped out for Lightning

## Decision Making

Given my current short- and long-term objectives for this app, I have decided to use [Google Vertex AI](https://cloud.google.com/vertex-ai) to serve models to my frontend. The following sections outline my assessment of the long-term merits and trade-offs inherent in this architecture, which ultimately guided that decision.

### Advantages & Disadvantages

*Table 1.1 highlights the pros and cons of this architecture.*

<table data-full-width="true"><thead><tr><th width="148.9951171875" align="right">DOMAIN</th><th>PROS</th><th>CONS</th></tr></thead><tbody><tr><td align="right">Infrastructure Control</td><td>Full sovereignty over runtime, model weights, networking, and execution environment. No third-party API dependency.</td><td>You assume responsibility for uptime, orchestration, hardware provisioning, and incident response.</td></tr><tr><td align="right">Model Determinism</td><td>Explicit control over model versions via Modelfile and local artifacts. Eliminates silent provider-side upgrades.</td><td>Manual upgrades required. Security patches and model refresh cycles must be actively managed.</td></tr><tr><td align="right">Cost Structure</td><td>No per-token billing. Predictable infrastructure cost once GPU/CPU resources are provisioned. Economically favorable at sustained inference volume.</td><td>High upfront infrastructure cost. Idle GPU resources still incur expense. Cost efficiency depends on utilization rate.</td></tr><tr><td align="right">Latency</td><td>Low-latency inference when deployed close to users or within same VPC. No external API round trips.</td><td>Performance constrained by local hardware. Without GPU acceleration, response times degrade significantly.</td></tr><tr><td align="right">Data Privacy</td><td>Sensitive prompts and user data remain within controlled infrastructure. Useful for compliance-sensitive deployments.</td><td>Requires secure network configuration, encryption at rest, and internal access control. Compliance burden shifts to you.</td></tr><tr><td align="right">Offline Capability</td><td>Can operate without internet once models are pulled. Useful for restricted or air-gapped environments.</td><td>Initial model downloads are large. 
Ongoing updates require manual intervention.</td></tr><tr><td align="right">Customization</td><td>Deep workflow control: prompt pipelines, embeddings, retrieval augmentation, orchestration layers, wrapper abstractions.</td><td>Increased architectural complexity. Higher cognitive load for maintaining modular boundaries.</td></tr><tr><td align="right">Volume Sharing</td><td>Shared persistent volume enables multi-container access to model artifacts. Efficient reuse across services.</td><td>Requires robust persistent storage provisioning. Cloud environments must support high-throughput shared volumes.</td></tr><tr><td align="right">Dev–Prod Parity</td><td>Docker Compose ensures reproducible environments across local and cloud deployments.</td><td>Compose is not a true production orchestrator. For horizontal scaling, migration to Kubernetes or ECS may be necessary.</td></tr><tr><td align="right">Observability</td><td>Direct access to logs, runtime telemetry, and system-level diagnostics.</td><td>Requires explicit monitoring stack (Prometheus, Grafana, etc.). Not included by default.</td></tr><tr><td align="right">Vendor Lock-In</td><td>Reduced dependency on API providers. Models can be swapped or fine-tuned without API contract constraints.</td><td>Implicit dependence on Ollama runtime compatibility and supported model ecosystem.</td></tr><tr><td align="right">Scalability</td><td>Can scale vertically (larger GPU instances) and horizontally with orchestration layer extension.</td><td>Compose alone does not auto-scale. Requires additional infrastructure (load balancer, orchestration controller).</td></tr><tr><td align="right">Security Surface</td><td>Reduced exposure to public API endpoints when deployed privately.</td><td>Larger attack surface internally if container hardening, network isolation, and secrets management are misconfigured.</td></tr><tr><td align="right">Experimental Flexibility</td><td>Ideal for research-driven iteration and AI wrapper experimentation. 
Enables custom inference strategies.</td><td>Slower iteration compared to simply swapping API parameters in a hosted service.</td></tr></tbody></table>

### Comparison

*The table below compares this architecture with externally hosted cloud LLM services such as Google Cloud Vertex AI, and gives my reasoning for production use.*

<table><thead><tr><th width="125.896484375" align="right">DOMAIN</th><th>External Cloud-Hosted LLM APIs (OpenAI, Anthropic)</th><th>Current Build</th><th>My Reasoning In Production</th></tr></thead><tbody><tr><td align="right">Setup Time</td><td>Minimal setup: (i) create account, (ii) obtain API key, (iii) test via Postman or HTTP client. Deployment-ready within hours.</td><td>Requires container orchestration, environment configuration, model pulling, volume management, and service networking.</td><td>For short-term experimentation or rapid MVP release, cloud APIs dominate. However, for controlled infrastructure ownership and long-term extensibility, the Compose stack provides architectural sovereignty.</td></tr><tr><td align="right">Scalability</td><td>Elastic scaling handled by provider. No infrastructure burden.</td><td>Scaling requires explicit container replication, GPU provisioning, load balancing, and orchestration layer (e.g., Kubernetes).</td><td>If projected traffic is uncertain or burst-heavy, cloud APIs reduce operational risk. Self-hosting is appropriate when demand patterns are predictable or constrained.</td></tr><tr><td align="right">Cost Model</td><td>Pay-per-token pricing. Ongoing operational expenditure. Context window expansion increases cost exposure.</td><td>Fixed infrastructure cost (compute + storage). No marginal token billing.</td><td>Token-based billing penalizes long contextual priors and iterative prompting. Self-hosting becomes economically favorable when inference volume is sustained and predictable.</td></tr><tr><td align="right">Maintenance &#x26; Versioning</td><td>Provider handles model updates. 
Limited control over model drift and deprecations.</td><td>Full control over model version, Modelfile configuration, and upgrade cadence.</td><td>Self-hosting ensures model transparency and deterministic reproducibility—critical for workflow customization and long-lived AI wrapper architectures.</td></tr><tr><td align="right">Observability &#x26; Transparency</td><td>Limited introspection into model internals or inference stack.</td><td>Direct access to runtime logs, model artifacts, and execution environment.</td><td>Greater system visibility supports experimentation, wrapper abstraction layers, and research-driven customization.</td></tr><tr><td align="right">Workflow Customization</td><td>Constrained by provider API abstraction.</td><td>Full freedom to modify pipelines, embeddings, prompt templates, and orchestration logic.</td><td>Aligns with long-term ambitions of building AI tooling layers rather than merely consuming APIs.</td></tr><tr><td align="right">Volume &#x26; Storage Requirements</td><td>No model storage responsibility.</td><td>Requires persistent volume mounting for model weights and manifests across containers.</td><td>Production deployment must provision high-capacity persistent volumes (e.g., attached storage or object-backed synchronization workflows).</td></tr><tr><td align="right">Infrastructure Dependency</td><td>Dependent on third-party uptime and policy.</td><td>Dependent on internal infrastructure reliability.</td><td>Self-hosting reduces vendor lock-in but increases infrastructure accountability.</td></tr><tr><td align="right">Resource Constraints</td><td>No local GPU/CPU burden.</td><td>LLM inference requires significant GPU or high-CPU allocation.</td><td>If GPU access is secured and predictable, Compose is viable. 
Without hardware guarantees, cloud APIs are more stable.</td></tr><tr><td align="right">Time Horizon Fit</td><td>Ideal for short-term engagement and rapid iteration.</td><td>Ideal for medium- to long-term infrastructure ownership and research-driven evolution.</td><td>Given a bounded product lifespan with moderate traffic, cloud APIs may be pragmatically sufficient; for architectural independence and experimentation, Compose justifies itself.</td></tr></tbody></table>
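The Cost Model row can be made concrete with a break-even calculation: self-hosting wins once the monthly pay-per-token bill exceeds the fixed infrastructure bill. The sketch below only encodes that arithmetic; the figures in the usage note are illustrative placeholders, not real provider prices or measured costs from this project.

```python
def monthly_api_cost(tokens_per_month, price_per_1k_tokens):
    """Pay-per-token cost for one month of traffic."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def break_even_tokens(fixed_monthly_cost, price_per_1k_tokens):
    """Monthly token volume at which the API bill equals the fixed infrastructure bill."""
    if price_per_1k_tokens <= 0:
        raise ValueError("price_per_1k_tokens must be positive")
    return fixed_monthly_cost / price_per_1k_tokens * 1000


def cheaper_option(tokens_per_month, fixed_monthly_cost, price_per_1k_tokens):
    """Name the cheaper option for a given monthly volume (ties go to the API)."""
    api = monthly_api_cost(tokens_per_month, price_per_1k_tokens)
    return "self-hosted" if api > fixed_monthly_cost else "api"
```

With a placeholder fixed cost of 400 per month and a placeholder price of 0.5 per 1,000 tokens, `break_even_tokens(400, 0.5)` gives 800,000 tokens per month; below that volume the hosted API is cheaper, above it the self-hosted stack is. The model ignores engineering time and idle GPU capacity, both of which the tables above count against self-hosting.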


