"Expert-level guide for Ollama LLM runtime - CLI, API, Modelfile, embeddings, OpenAI compatibility, and troubleshooting"
Ollama is the easiest way to run large language models locally. It supports models such as Llama 3.2, Qwen 2.5, Gemma, Phi, DeepSeek-R1, and LLaVA, among many others.
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Or via Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Default endpoint: http://localhost:11434
Models directory: ~/.ollama/models
Server log (macOS): ~/.ollama/logs/server.log
# Pull (download) a model
ollama pull <model>
ollama pull llama3.2
ollama pull qwen2.5:7b
# List downloaded models
ollama list
ollama ls
# Remove a model
ollama rm <model>
ollama rm llama3.2
# Show model details
ollama show <model>
ollama show --modelfile llama3.2 # View Modelfile
# Copy a model
ollama cp <source> <destination>
ollama cp llama3.2 my-custom-llama
# Interactive chat
ollama run <model>
ollama run llama3.2
# Single prompt
ollama run <model> "your prompt"
ollama run llama3.2 "Why is the sky blue?"
# Multimodal (with image)
ollama run llava "What's in this image? /path/to/image.png"
# Embeddings (embedding-only models do not support `ollama run`; use the API)
curl http://localhost:11434/api/embed -d '{"model": "nomic-embed-text", "input": "Hello world"}'
# Multiline input
ollama run llama3.2 """
This is a
multiline prompt.
"""
# List running models
ollama ps
# Stop a running model
ollama stop <model>
ollama stop llama3.2
# Start Ollama server
ollama serve
# View environment variables
ollama serve --help
# Create from Modelfile
ollama create <name> -f ./Modelfile
ollama create my-assistant -f ./Modelfile
# Push to registry (requires login)
ollama push <username>/<model>
# Sign in/out
ollama signin
ollama signout
# Launch with IDE integration
ollama launch
ollama launch claude
ollama launch claude --model qwen3-coder
ollama launch droid --config
Base URL: http://localhost:11434/api
POST /api/generate
Request Body:
{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "suffix": "",              // Text after completion
  "images": ["base64..."],   // For vision models
  "format": "json",          // or a JSON schema object
  "system": "You are a helper.",
  "stream": true,
  "think": false,            // Enable thinking output
  "raw": false,              // Skip prompt templating
  "keep_alive": "5m",        // Model memory duration
  "logprobs": false,
  "top_logprobs": 0,
  "options": {
    "num_ctx": 4096,         // Context window
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "min_p": 0.0,
    "seed": 42,
    "num_predict": -1,       // Max tokens (-1 = infinite)
    "stop": ["<|end|>"],
    "repeat_penalty": 1.1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "num_gpu": 1,
    "num_thread": 4,
    "use_mmap": true
  }
}
Response (streaming):
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "The sky appears",
  "thinking": "",
  "done": false
}
Response (final):
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "total_duration": 1500000000,
  "load_duration": 500000000,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 300000000,
  "eval_count": 150,
  "eval_duration": 700000000
}
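These responses can be consumed from Python with a few lines. A minimal sketch (assuming the default host, the `requests` package, and a pulled llama3.2); note that all durations are reported in nanoseconds, so generation speed is `eval_count / (eval_duration / 1e9)`:

```python
import json
import requests

def iter_ndjson(lines):
    """Ollama streams one JSON object per line (NDJSON)."""
    for line in lines:
        if line:
            yield json.loads(line)

def tokens_per_second(final_chunk):
    """Durations are nanoseconds: 150 tokens in 700000000 ns is ~214 tok/s."""
    return final_chunk["eval_count"] / (final_chunk["eval_duration"] / 1e9)

def stream_generate(prompt, model="llama3.2", host="http://localhost:11434"):
    """Print fragments as they arrive; return the final stats chunk."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    )
    for chunk in iter_ndjson(resp.iter_lines()):
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            return chunk

# Usage (requires a running server):
#   final = stream_generate("Why is the sky blue?")
#   print(f"\n{tokens_per_second(final):.1f} tokens/s")
```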
POST /api/chat
Request Body:
{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?", "images": ["base64..."]}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }
  ],
  "format": "json",
  "stream": true,
  "think": false,
  "keep_alive": "5m",
  "options": {}
}
Response:
{
  "model": "llama3.2",
  "created_at": "2024-01-01T00:00:00Z",
  "message": {
    "role": "assistant",
    "content": "Hello! How can I help you?",
    "tool_calls": []
  },
  "done": false
}
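When the model decides to call a tool, `message.tool_calls` carries the function name and arguments; your code runs the function locally and sends the result back as a `tool` message. A hedged sketch of the dispatch step (`get_weather` is the hypothetical tool from the request above; whether `arguments` arrives as an object or a JSON string can vary, so both are handled):

```python
import json

def dispatch_tool_calls(message, handlers):
    """Run each requested tool locally and build the follow-up messages
    to append to the conversation before calling /api/chat again."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some models return arguments as a JSON string
            args = json.loads(args)
        output = handlers[fn["name"]](**args)
        results.append({"role": "tool", "content": str(output)})
    return results

# Usage sketch (requires a running server):
#   reply = requests.post(host + "/api/chat", json=payload).json()["message"]
#   messages.append(reply)
#   messages.extend(dispatch_tool_calls(reply, {"get_weather": get_weather}))
#   ...then POST /api/chat again with the extended messages.
```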
POST /api/embed
Request Body:
{
  "model": "nomic-embed-text",
  "input": "Hello world",    // or a list for batch: ["Hello world", "Another text"]
  "truncate": true,
  "dimensions": 768,
  "keep_alive": "5m",
  "options": {}
}
Response:
{
  "embeddings": [[0.123, 0.456, ...]],
  "total_duration": 500000000,
  "prompt_eval_count": 5
}
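For large corpora it is far cheaper to batch inputs than to embed one text per request. A sketch (helper names and the batch size are illustrative; assumes the `requests` package and the default host):

```python
import requests

def batched(items, size):
    """Split texts into chunks sized for /api/embed's list input."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(texts, model="nomic-embed-text",
              host="http://localhost:11434", batch_size=32):
    """Embed a list of texts in batches; returns one vector per input text."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = requests.post(f"{host}/api/embed",
                             json={"model": model, "input": batch})
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors
```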
# List local models
GET /api/tags
# List running models
GET /api/ps
# Show model details
POST /api/show
{"model": "llama3.2", "verbose": false}
# Create model
POST /api/create
{"model": "my-model", "from": "llama3.2", "system": "You are...", "quantize": "q4_K_M"}
# Copy model
POST /api/copy
{"source": "llama3.2", "destination": "my-copy"}
# Delete model
DELETE /api/delete
{"model": "my-model"}
# Pull model
POST /api/pull
{"model": "llama3.2", "insecure": false, "stream": true}
# Push model
POST /api/push
{"model": "username/model", "insecure": false, "stream": true}
# Get version
GET /api/version
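The management endpoints compose into simple automation, such as pulling a model only when it is missing. A sketch (assumes `requests` and the default host; note that a model pulled as a bare name like `llama3.2` is listed as `llama3.2:latest`):

```python
import requests

def has_model(local_models, name):
    """Check /api/tags output for a model; bare names imply the :latest tag."""
    wanted = name if ":" in name else f"{name}:latest"
    return any(m["name"] == wanted for m in local_models)

def ensure_model(name, host="http://localhost:11434"):
    """Pull `name` only if it is not already installed locally."""
    models = requests.get(f"{host}/api/tags").json()["models"]
    if not has_model(models, name):
        requests.post(f"{host}/api/pull", json={"model": name, "stream": False})
```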
A Modelfile defines a custom model configuration.
| Instruction | Required | Description |
|---|---|---|
| FROM | ✅ | Base model or file path |
| PARAMETER | ❌ | Model runtime parameters |
| TEMPLATE | ❌ | Prompt template |
| SYSTEM | ❌ | System message |
| ADAPTER | ❌ | LoRA adapter path |
| LICENSE | ❌ | License text |
| MESSAGE | ❌ | Conversation history |
| REQUIRES | ❌ | Minimum Ollama version |
# Modelfile
# Base model (required)
FROM llama3.2
# Or from GGUF file
# FROM ./model.gguf
# Or from Safetensors directory
# FROM ./safetensors-model/
# System prompt
SYSTEM """You are a helpful AI assistant specialized in Python programming.
You provide clear, concise answers with code examples."""
# Runtime parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
# Prompt template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>
{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>
{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
# Few-shot examples
MESSAGE user How do I read a file in Python?
MESSAGE assistant Use the `open()` function or `pathlib`:
```python
# Using open()
with open('file.txt', 'r') as f:
    content = f.read()

# Using pathlib
from pathlib import Path
content = Path('file.txt').read_text()
```
LICENSE """MIT License"""
REQUIRES 0.1.26
### Parameter Reference
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_ctx` | int | 2048 | Context window size |
| `num_predict` | int | -1 | Max tokens to generate |
| `temperature` | float | 0.8 | Randomness (0-2) |
| `top_k` | int | 40 | Top-k sampling |
| `top_p` | float | 0.9 | Nucleus sampling |
| `min_p` | float | 0.0 | Minimum probability |
| `repeat_penalty` | float | 1.1 | Repetition penalty |
| `repeat_last_n` | int | 64 | Lookback for repeat penalty |
| `presence_penalty` | float | 0.0 | Presence penalty |
| `frequency_penalty` | float | 0.0 | Frequency penalty |
| `seed` | int | 0 | Random seed |
| `stop` | string | - | Stop sequences |
### Template Variables
| Variable | Description |
|----------|-------------|
| `{{ .System }}` | System message |
| `{{ .Prompt }}` | User prompt |
| `{{ .Response }}` | Model response |
### Creating Models
```bash
# Create from Modelfile
ollama create my-assistant -f ./Modelfile
# Create via API
curl -X POST http://localhost:11434/api/create -d '{
  "model": "my-assistant",
  "from": "llama3.2",
  "system": "You are a helpful assistant.",
  "parameters": {
    "temperature": 0.7
  },
  "quantize": "q4_K_M"
}'
```
Ollama provides an OpenAI-compatible API at `/v1/` endpoints.
# Chat completions
POST http://localhost:11434/v1/chat/completions
# Completions (legacy)
POST http://localhost:11434/v1/completions
# Embeddings
POST http://localhost:11434/v1/embeddings
# List models
GET http://localhost:11434/v1/models
# Model info
GET http://localhost:11434/v1/models/{model}
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    stream=True
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
# LLM
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")
response = llm.invoke("Why is the sky blue?")
# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)
vector = embeddings.embed_query("Hello world")
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)
Popular embedding models in Ollama:
- nomic-embed-text - General purpose
- all-minilm - Lightweight
- mxbai-embed-large - High quality
- qwen3-embedding - Multilingual

Embedding-only models do not support `ollama run`; generate embeddings through the API:
# API - Single text
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'
# API - Batch
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello", "World"]
}'
# API - Custom dimensions
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world",
  "dimensions": 512
}'
import requests
import numpy as np
def get_embedding(text, model="nomic-embed-text"):
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": text}
    )
    return np.array(response.json()["embeddings"][0])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare texts
text1 = "The cat sat on the mat"
text2 = "A feline rested on a rug"
text3 = "Python is a programming language"
emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)
print(f"text1 vs text2: {cosine_similarity(emb1, emb2):.3f}") # High
print(f"text1 vs text3: {cosine_similarity(emb1, emb3):.3f}") # Low
Some models (like DeepSeek-R1) support separate "thinking" output.
# CLI
ollama run deepseek-r1 "Solve: 2+2"
# API
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Solve: 2+2",
  "think": true
}'
{
  "model": "deepseek-r1",
  "response": "The answer is 4.",
  "thinking": "Let me think about this step by step...",
  "done": true
}
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1",
  "messages": [
    {"role": "user", "content": "What is 15 * 7?"}
  ],
  "think": true
}'
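When streaming with `"think": true`, each chunk's `message` may carry a `thinking` fragment, a `content` fragment, or both. A small helper (a sketch against the chunk shape shown above) can separate the two streams:

```python
def split_thinking(chunks):
    """Aggregate streamed /api/chat chunks into (thinking, answer) strings."""
    thinking, answer = [], []
    for chunk in chunks:
        msg = chunk.get("message", {})
        thinking.append(msg.get("thinking") or "")
        answer.append(msg.get("content") or "")
    return "".join(thinking), "".join(answer)

# Usage sketch: feed it the parsed NDJSON chunks from a streaming request.
```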
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address |
| OLLAMA_ORIGINS | * | CORS origins |
| OLLAMA_MODELS | ~/.ollama/models | Models directory |
| OLLAMA_KEEP_ALIVE | 5m | Default keep-alive |
| OLLAMA_DEBUG | 0 | Debug logging |
| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | NVIDIA GPU selection |
| ROCR_VISIBLE_DEVICES | AMD GPU selection |
| OLLAMA_MAX_GPU_MEMORY | Max GPU memory per GPU |
| OLLAMA_MAX_LOADED_MODELS | Max concurrent models |
| Variable | Description |
|---|---|
| OLLAMA_NUM_PARALLEL | Parallel requests per model |
| OLLAMA_MAX_QUEUE | Max queued requests |
| OLLAMA_MAX_VRAM | Max VRAM to use |
# Systemd service override
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/models"
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_MAX_GPU_MEMORY=16GB"
# Docker
docker run -d \
-e OLLAMA_HOST=0.0.0.0:11434 \
-e CUDA_VISIBLE_DEVICES=0 \
-v ollama:/root/.ollama \
-p 11434:11434 \
--gpus all \
ollama/ollama
| Platform | Location |
|---|---|
| macOS | ~/.ollama/logs/server.log |
| Linux | journalctl -u ollama --no-pager -f |
| Docker | docker logs <container> |
| Windows | %LOCALAPPDATA%\Ollama\server.log |
# NVIDIA - Check drivers
nvidia-smi
# NVIDIA - Check container toolkit
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
# AMD - Check drivers
dmesg | grep -i amdgpu
# Force specific GPU library
OLLAMA_LLM_LIBRARY=cuda_v11 ollama run llama3.2
OLLAMA_LLM_LIBRARY=rocm_v6 ollama run llama3.2
OLLAMA_LLM_LIBRARY=cpu ollama run llama3.2
# Reduce context window
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_ctx": 2048}
}'
# Limit GPU memory
OLLAMA_MAX_GPU_MEMORY=8GB ollama serve
# Use CPU fallback
OLLAMA_LLM_LIBRARY=cpu ollama run llama3.2
# Check if Ollama is running
curl http://localhost:11434/api/version
# Check listen address
ss -tlnp | grep 11434
# Bind to all interfaces
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve
# Check disk I/O
iostat -x 1
# Use mmap (default, ensure enabled)
# In Modelfile: PARAMETER use_mmap true
On Windows 10 21H1 and older, terminal output may appear garbled; update to 22H1 or later to fix it.
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve
# AMD additional logging
AMD_LOG_LEVEL=3 ollama serve
# Check model loading details
ollama run llama3.2 --verbose
| Use Case | Recommended Model |
|---|---|
| General chat | llama3.2, gemma3 |
| Code generation | qwen2.5-coder, deepseek-coder |
| Reasoning | deepseek-r1 |
| Embeddings | nomic-embed-text, qwen3-embedding |
| Vision | llava, gemma3 (multimodal) |
| Fast/cheap | phi3, gemma2:2b |
# Use quantized models (smaller, faster; exact quantization tags vary per model)
ollama pull llama3.2:q4_K_M # 4-bit
ollama pull llama3.2:q8_0 # 8-bit
# Adjust context to minimum needed
PARAMETER num_ctx 4096 # Not 32000 unless needed
# Use keep_alive wisely - "1m" unloads the model shortly after use
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "1m"
}'
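An empty generate request simply loads (or unloads) a model, which is handy for pre-warming before traffic arrives: `keep_alive: -1` keeps the model resident indefinitely, while `0` unloads it immediately. A sketch (assumes `requests` and the default host):

```python
import requests

def keep_alive_payload(model, keep_alive):
    """Body of an empty /api/generate call: no prompt, just (un)load the model.
    keep_alive -1 keeps it loaded indefinitely; 0 unloads it immediately."""
    return {"model": model, "keep_alive": keep_alive}

def preload(model, host="http://localhost:11434"):
    requests.post(f"{host}/api/generate", json=keep_alive_payload(model, -1))

def unload(model, host="http://localhost:11434"):
    requests.post(f"{host}/api/generate", json=keep_alive_payload(model, 0))
```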
# 1. Bind to specific interface
OLLAMA_HOST=127.0.0.1:11434 ollama serve
# 2. Set CORS for web apps
OLLAMA_ORIGINS="https://myapp.com" ollama serve
# 3. Pre-pull models
ollama pull llama3.2
ollama pull nomic-embed-text
# 4. Monitor GPU memory
watch -n 1 nvidia-smi
# 5. Use systemd for auto-restart
sudo systemctl enable ollama
sudo systemctl start ollama
# In Agent Zero settings:
Chat Model:
provider: ollama
model: llama3.2
base_url: http://localhost:11434
Embedding Model:
provider: ollama
model: nomic-embed-text
base_url: http://localhost:11434
For Docker Agent Zero:
- Use http://host.docker.internal:11434 if Ollama runs on the host
- Use http://ollama:11434 if both run in the same Docker network
# Essential commands
ollama pull <model> # Download
ollama run <model> # Chat
ollama list # List local
ollama ps # List running
ollama stop <model> # Unload
ollama rm <model> # Delete
ollama create <name> -f Modelfile # Custom model
# API endpoints
POST /api/generate # Text generation
POST /api/chat # Chat completion
POST /api/embed # Embeddings
GET /api/tags # List models
GET /api/ps # Running models
POST /api/pull # Download model
DELETE /api/delete # Delete model
# OpenAI compatible
POST /v1/chat/completions
POST /v1/embeddings
GET /v1/models
Skill created by King Arthur for the Agent Zero kingdom.