"Expert-level guide for Ollama LLM runtime - CLI, API, Modelfile, embeddings, OpenAI compatibility, and troubleshooting"
Ollama is the easiest way to run large language models locally. It supports models such as Llama 3.2, Qwen 2.5, Gemma, Phi, DeepSeek-R1, and LLaVA, among many others.
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Or via Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Default endpoint: http://localhost:11434
Models directory: ~/.ollama/models
Server log (macOS): ~/.ollama/logs/server.log
# Pull (download) a model
ollama pull <model>
ollama pull llama3.2
ollama pull qwen2.5:7b
# List downloaded models
ollama list
ollama ls
# Remove a model
ollama rm <model>
ollama rm llama3.2
# Show model details
ollama show <model>
ollama show --modelfile llama3.2 # View Modelfile
# Copy a model
ollama cp <source> <destination>
ollama cp llama3.2 my-custom-llama
# Interactive chat
ollama run <model>
ollama run llama3.2
# Single prompt
ollama run <model> "your prompt"
ollama run llama3.2 "Why is the sky blue?"
# Multimodal (with image)
ollama run llava "What's in this image? /path/to/image.png"
# Embeddings (embedding-only models do not support `ollama run`; use the API)
curl http://localhost:11434/api/embed -d '{"model": "nomic-embed-text", "input": "Hello world"}'
# Multiline input
ollama run llama3.2 """
This is a
multiline prompt.
"""
# List running models
ollama ps
# Stop a running model
ollama stop <model>
ollama stop llama3.2
# Start Ollama server
ollama serve
# View environment variables
ollama serve --help
# Create from Modelfile
ollama create <name> -f ./Modelfile
ollama create my-assistant -f ./Modelfile
# Push to registry (requires login)
ollama push <username>/<model>
# Sign in/out
ollama signin
ollama signout
# Launch with IDE integration
ollama launch
ollama launch claude
ollama launch claude --model qwen3-coder
ollama launch droid --config
Base URL: http://localhost:11434/api
POST /api/generate
Request Body:
{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "suffix": "",              // Text after completion
  "images": ["base64..."],   // For vision models
  "format": "json",          // or a JSON schema object
  "system": "You are a helper.",
  "stream": true,
  "think": false,            // Enable thinking output
  "raw": false,              // Skip prompt templating
  "keep_alive": "5m",        // Model memory duration
  "logprobs": false,
  "top_logprobs": 0,
  "options": {
    "num_ctx": 4096,         // Context window
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "min_p": 0.0,
    "seed": 42,
    "num_predict": -1,       // Max tokens (-1 = infinite)
    "stop": ["<|end|>"],
    "repeat_penalty": 1.1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "num_gpu": 1,
    "num_thread": 4,
    "use_mmap": true
  }
}
Response (streaming):
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "The sky appears",
  "thinking": "",
  "done": false
}
Response (final):
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "total_duration": 1500000000,
  "load_duration": 500000000,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 300000000,
  "eval_count": 150,
  "eval_duration": 700000000
}
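These responses can be consumed from Python with a few lines. A minimal sketch (assuming the default host, the `requests` package, and a pulled llama3.2); note that all durations are reported in nanoseconds, so generation speed is `eval_count / (eval_duration / 1e9)`:

```python
import json
import requests

def iter_ndjson(lines):
    """Ollama streams one JSON object per line (NDJSON)."""
    for line in lines:
        if line:
            yield json.loads(line)

def tokens_per_second(final_chunk):
    """Durations are nanoseconds: 150 tokens in 700000000 ns is ~214 tok/s."""
    return final_chunk["eval_count"] / (final_chunk["eval_duration"] / 1e9)

def stream_generate(prompt, model="llama3.2", host="http://localhost:11434"):
    """Print fragments as they arrive; return the final stats chunk."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    )
    for chunk in iter_ndjson(resp.iter_lines()):
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            return chunk

# Usage (requires a running server):
#   final = stream_generate("Why is the sky blue?")
#   print(f"\n{tokens_per_second(final):.1f} tokens/s")
```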
POST /api/chat
Request Body:
{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?", "images": ["base64..."]}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "format": "json",
  "stream": true,
  "think": false,
  "keep_alive": "5m",
  "options": {}
}
Response:
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "message": {
    "role": "assistant",
    "content": "Hello! How can I help you?",
    "tool_calls": []
  },
  "done": false
}
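When the model decides to call a tool, `message.tool_calls` carries the function name and arguments; your code runs the function locally and sends the result back as a `tool` message. A hedged sketch of the dispatch step (`get_weather` is the hypothetical tool from the request above; whether `arguments` arrives as an object or a JSON string can vary, so both are handled):

```python
import json

def dispatch_tool_calls(message, handlers):
    """Run each requested tool locally and build the follow-up messages
    to append to the conversation before calling /api/chat again."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some models return arguments as a JSON string
            args = json.loads(args)
        output = handlers[fn["name"]](**args)
        results.append({"role": "tool", "content": str(output)})
    return results

# Usage sketch (requires a running server):
#   reply = requests.post(host + "/api/chat", json=payload).json()["message"]
#   messages.append(reply)
#   messages.extend(dispatch_tool_calls(reply, {"get_weather": get_weather}))
#   ...then POST /api/chat again with the extended messages.
```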
POST /api/embed
Request Body:
{
  "model": "nomic-embed-text",
  "input": "Hello world",    // or a list for batch: ["Hello world", "Another text"]
  "truncate": true,
  "dimensions": 768,
  "keep_alive": "5m",
  "options": {}
}
Response:
{
  "embeddings": [[0.123, 0.456, ...]],
  "total_duration": 500000000,
  "prompt_eval_count": 5
}
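For large corpora it is far cheaper to batch inputs than to embed one text per request. A sketch (helper names and the batch size are illustrative; assumes the `requests` package and the default host):

```python
import requests

def batched(items, size):
    """Split texts into chunks sized for /api/embed's list input."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(texts, model="nomic-embed-text",
              host="http://localhost:11434", batch_size=32):
    """Embed a list of texts in batches; returns one vector per input text."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = requests.post(f"{host}/api/embed",
                             json={"model": model, "input": batch})
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors
```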
# List local models
GET /api/tags
# List running models
GET /api/ps
# Show model details
POST /api/show
{"model": "llama3.2", "verbose": false}
# Create model
POST /api/create
{"model": "my-model", "from": "llama3.2", "system": "You are...", "quantize": "q4_K_M"}
# Copy model
POST /api/copy
{"source": "llama3.2", "destination": "my-copy"}
# Delete model
DELETE /api/delete
{"model": "my-model"}
# Pull model
POST /api/pull
{"model": "llama3.2", "insecure": false, "stream": true}
# Push model
POST /api/push
{"model": "username/model", "insecure": false, "stream": true}
# Get version
GET /api/version
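The management endpoints compose into simple automation, such as pulling a model only when it is missing. A sketch (assumes `requests` and the default host; note that a model pulled as a bare name like `llama3.2` is listed as `llama3.2:latest`):

```python
import requests

def has_model(local_models, name):
    """Check /api/tags output for a model; bare names imply the :latest tag."""
    wanted = name if ":" in name else f"{name}:latest"
    return any(m["name"] == wanted for m in local_models)

def ensure_model(name, host="http://localhost:11434"):
    """Pull `name` only if it is not already installed locally."""
    models = requests.get(f"{host}/api/tags").json()["models"]
    if not has_model(models, name):
        requests.post(f"{host}/api/pull", json={"model": name, "stream": False})
```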
A Modelfile defines a custom model configuration.
| Instruction | Required | Description |
|---|---|---|
| FROM | ✅ | Base model or file path |
| PARAMETER | ❌ | Model runtime parameters |
| TEMPLATE | ❌ | Prompt template |
| SYSTEM | ❌ | System message |
| ADAPTER | ❌ | LoRA adapter path |
| LICENSE | ❌ | License text |
| MESSAGE | ❌ | Conversation history |
| REQUIRES | ❌ | Minimum Ollama version |
# Modelfile
# Base model (required)
FROM llama3.2
# Or from GGUF file
# FROM ./model.gguf
# Or from Safetensors directory
# FROM ./safetensors-model/
# System prompt
SYSTEM """You are a helpful AI assistant specialized in Python programming.
You provide clear, concise answers with code examples."""
# Runtime parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
# Prompt template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>
{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>
{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
# Few-shot examples
MESSAGE user How do I read a file in Python?
MESSAGE assistant Use the `open()` function or `pathlib`:
```python
# Using open()
with open('file.txt', 'r') as f:
    content = f.read()

# Using pathlib
from pathlib import Path
content = Path('file.txt').read_text()
```
LICENSE """MIT License"""
REQUIRES 0.1.26
### Parameter Reference
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_ctx` | int | 2048 | Context window size |
| `num_predict` | int | -1 | Max tokens to generate |
| `temperature` | float | 0.8 | Randomness (0-2) |
| `top_k` | int | 40 | Top-k sampling |
| `top_p` | float | 0.9 | Nucleus sampling |
| `min_p` | float | 0.0 | Minimum probability |
| `repeat_penalty` | float | 1.1 | Repetition penalty |
| `repeat_last_n` | int | 64 | Lookback for repeat penalty |
| `presence_penalty` | float | 0.0 | Presence penalty |
| `frequency_penalty` | float | 0.0 | Frequency penalty |
| `seed` | int | 0 | Random seed |
| `stop` | string | - | Stop sequences |
### Template Variables
| Variable | Description |
|----------|-------------|
| `{{ .System }}` | System message |
| `{{ .Prompt }}` | User prompt |
| `{{ .Response }}` | Model response |
### Creating Models
```bash
# Create from Modelfile
ollama create my-assistant -f ./Modelfile
# Create via API
curl -X POST http://localhost:11434/api/create -d '{
  "model": "my-assistant",
  "from": "llama3.2",
  "system": "You are a helpful assistant.",
  "parameters": {
    "temperature": 0.7
  },
  "quantize": "q4_K_M"
}'
```
Ollama provides an OpenAI-compatible API at `/v1/` endpoints.
# Chat completions
POST http://localhost:11434/v1/chat/completions
# Completions (legacy)
POST http://localhost:11434/v1/completions
# Embeddings
POST http://localhost:11434/v1/embeddings
# List models
GET http://localhost:11434/v1/models
# Model info
GET http://localhost:11434/v1/models/{model}
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    stream=True
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
# LLM
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")
response = llm.invoke("Why is the sky blue?")
# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)
vector = embeddings.embed_query("Hello world")
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)
Popular embedding models in Ollama:
- nomic-embed-text - General purpose
- all-minilm - Lightweight
- mxbai-embed-large - High quality
- qwen3-embedding - Multilingual

Embedding-only models do not support `ollama run`; generate embeddings through the API:
# API - Single text
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'
# API - Batch
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello", "World"]
}'
# API - Custom dimensions
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world",
  "dimensions": 512
}'
import requests
import numpy as np
def get_embedding(text, model="nomic-embed-text"):
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": text}
    )
    return np.array(response.json()["embeddings"][0])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare texts
text1 = "The cat sat on the mat"
text2 = "A feline rested on a rug"
text3 = "Python is a programming language"
emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)
print(f"text1 vs text2: {cosine_similarity(emb1, emb2):.3f}") # High
print(f"text1 vs text3: {cosine_similarity(emb1, emb3):.3f}") # Low
Some models (like DeepSeek-R1) support separate "thinking" output.
# CLI
ollama run deepseek-r1 "Solve: 2+2"
# API
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Solve: 2+2",
  "think": true
}'
{
  "model": "deepseek-r1",
  "response": "The answer is 4.",
  "thinking": "Let me think about this step by step...",
  "done": true
}
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1",
  "messages": [
    {"role": "user", "content": "What is 15 * 7?"}
  ],
  "think": true
}'
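When streaming with `"think": true`, each chunk's `message` may carry a `thinking` fragment, a `content` fragment, or both. A small helper (a sketch against the chunk shape shown above) can separate the two streams:

```python
def split_thinking(chunks):
    """Aggregate streamed /api/chat chunks into (thinking, answer) strings."""
    thinking, answer = [], []
    for chunk in chunks:
        msg = chunk.get("message", {})
        thinking.append(msg.get("thinking") or "")
        answer.append(msg.get("content") or "")
    return "".join(thinking), "".join(answer)

# Usage sketch: feed it the parsed NDJSON chunks from a streaming request.
```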
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address |
| OLLAMA_ORIGINS | * | CORS origins |
| OLLAMA_MODELS | ~/.ollama/models | Models directory |
| OLLAMA_KEEP_ALIVE | 5m | Default keep-alive |
| OLLAMA_DEBUG | 0 | Debug logging |
| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | NVIDIA GPU selection |
| ROCR_VISIBLE_DEVICES | AMD GPU selection |
| OLLAMA_MAX_GPU_MEMORY | Max GPU memory per GPU |
| OLLAMA_MAX_LOADED_MODELS | Max concurrent models |
| Variable | Description |
|---|---|
| OLLAMA_NUM_PARALLEL | Parallel requests per model |
| OLLAMA_MAX_QUEUE | Max queued requests |
| OLLAMA_MAX_VRAM | Max VRAM to use |
# Systemd service override
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/models"
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_MAX_GPU_MEMORY=16GB"
# Docker
docker run -d \
-e OLLAMA_HOST=0.0.0.0:11434 \
-e CUDA_VISIBLE_DEVICES=0 \
-v ollama:/root/.ollama \
-p 11434:11434 \
--gpus all \
ollama/ollama
| Platform | Location |
|---|---|
| macOS | ~/.ollama/logs/server.log |
| Linux | journalctl -u ollama --no-pager -f |
| Docker | docker logs <container> |
| Windows | %LOCALAPPDATA%\Ollama\server.log |
# NVIDIA - Check drivers
nvidia-smi
# NVIDIA - Check container toolkit
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
# AMD - Check drivers
dmesg | grep -i amdgpu
# Force specific GPU library
OLLAMA_LLM_LIBRARY=cuda_v11 ollama run llama3.2
OLLAMA_LLM_LIBRARY=rocm_v6 ollama run llama3.2
OLLAMA_LLM_LIBRARY=cpu ollama run llama3.2
# Reduce context window
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_ctx": 2048}
}'
# Limit GPU memory
OLLAMA_MAX_GPU_MEMORY=8GB ollama serve
# Use CPU fallback
OLLAMA_LLM_LIBRARY=cpu ollama run llama3.2
# Check if Ollama is running
curl http://localhost:11434/api/version
# Check listen address
ss -tlnp | grep 11434
# Bind to all interfaces
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve
# Check disk I/O
iostat -x 1
# Use mmap (default, ensure enabled)
# In Modelfile: PARAMETER use_mmap true
On Windows 10 21H1 and older, terminal output may appear garbled; update to 22H1 or later to fix it.
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve
# AMD additional logging
AMD_LOG_LEVEL=3 ollama serve
# Check model loading details
ollama run llama3.2 --verbose
| Use Case | Recommended Model |
|---|---|
| General chat | llama3.2, gemma3 |
| Code generation | qwen2.5-coder, deepseek-coder |
| Reasoning | deepseek-r1 |
| Embeddings | nomic-embed-text, qwen3-embedding |
| Vision | llava, gemma3 (multimodal) |
| Fast/cheap | phi3, gemma2:2b |
# Use quantized models (smaller, faster; exact quantization tags vary per model)
ollama pull llama3.2:q4_K_M # 4-bit
ollama pull llama3.2:q8_0 # 8-bit
# Adjust context to minimum needed
PARAMETER num_ctx 4096 # Not 32000 unless needed
# Use keep_alive wisely - "1m" unloads the model shortly after use
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "1m"
}'
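An empty generate request simply loads (or unloads) a model, which is handy for pre-warming before traffic arrives: `keep_alive: -1` keeps the model resident indefinitely, while `0` unloads it immediately. A sketch (assumes `requests` and the default host):

```python
import requests

def keep_alive_payload(model, keep_alive):
    """Body of an empty /api/generate call: no prompt, just (un)load the model.
    keep_alive -1 keeps it loaded indefinitely; 0 unloads it immediately."""
    return {"model": model, "keep_alive": keep_alive}

def preload(model, host="http://localhost:11434"):
    requests.post(f"{host}/api/generate", json=keep_alive_payload(model, -1))

def unload(model, host="http://localhost:11434"):
    requests.post(f"{host}/api/generate", json=keep_alive_payload(model, 0))
```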
# 1. Bind to specific interface
OLLAMA_HOST=127.0.0.1:11434 ollama serve
# 2. Set CORS for web apps
OLLAMA_ORIGINS="https://myapp.com" ollama serve
# 3. Pre-pull models
ollama pull llama3.2
ollama pull nomic-embed-text
# 4. Monitor GPU memory
watch -n 1 nvidia-smi
# 5. Use systemd for auto-restart
sudo systemctl enable ollama
sudo systemctl start ollama
# In Agent Zero settings:
Chat Model:
provider: ollama
model: llama3.2
base_url: http://localhost:11434
Embedding Model:
provider: ollama
model: nomic-embed-text
base_url: http://localhost:11434
For Docker Agent Zero:
- Use http://host.docker.internal:11434 if Ollama runs on the host
- Use http://ollama:11434 if both run in the same Docker network
# Essential commands
ollama pull <model> # Download
ollama run <model> # Chat
ollama list # List local
ollama ps # List running
ollama stop <model> # Unload
ollama rm <model> # Delete
ollama create <name> -f Modelfile # Custom model
# API endpoints
POST /api/generate # Text generation
POST /api/chat # Chat completion
POST /api/embed # Embeddings
GET /api/tags # List models
GET /api/ps # Running models
POST /api/pull # Download model
DELETE /api/delete # Delete model
# OpenAI compatible
POST /v1/chat/completions
POST /v1/embeddings
GET /v1/models
Skill created by King Arthur for the Agent Zero kingdom.