Deploy LLMs with Ollama on GPU Cloud (2026 Guide)

Deploy Ollama on GPU cloud infrastructure and test large language models with interactive model selection. Complete workflow from VM provisioning to automated teardown.

Ollama
LLM
GPU
Cloud Computing
AI Deployment

AI Agent Shortcut: Skip this manual setup. The complete workflow below exists as a tested recipe in the Massed Compute MCP: provision a GPU VM, install Ollama, pull your model, and verify it, all with one natural language request.

Running large language models requires GPU infrastructure that’s both powerful and cost-effective. This guide walks through deploying Ollama on Massed Compute GPU VMs with interactive model selection and proper teardown.

You’ll provision a GPU instance, bootstrap Ollama with NVIDIA drivers, select from popular model options based on your VRAM constraints, and run verification tests before cleanup.

Technology Stack

| Component | Purpose | Version |
|---|---|---|
| Ollama | LLM inference server | Latest stable |
| Ubuntu 24.04 | Base OS with NVIDIA drivers | Image 184 |
| NVIDIA GPU | Model acceleration | L40, A30, A6000, etc. |
| CUDA Toolkit | GPU compute framework | Pre-installed |

Requirements

| Requirement | Specification | Notes |
|---|---|---|
| Massed Compute Account | API credentials configured | For VM provisioning |
| GPU VM | 8+ vCPU, 32+ GB RAM | Minimum for 7B models |
| VRAM | 8+ GB for 7B, 48+ GB for 70B | Model-dependent |
| Storage | 256+ GB | For model downloads |
| SSH Client | Terminal or SSH app | For VM access |

Massed Compute VM Pricing

Pricing fetched from the Massed Compute inventory API on June 10, 2026.

| SKU | Description | vCPU | RAM | Storage | Price | Capacity |
|---|---|---|---|---|---|---|
| gpu_1x_A30 | 1x A30 (24GB) | 16 | 48 GiB | 256 GB | $0.35/hr | 0 |
| gpu_1x_a5000 | 1x RTX A5000 (24GB) | 10 | 32 GiB | 256 GB | $0.44/hr | 0 |
| gpu_2x_A30 | 2x A30 (24GB) | 30 | 96 GiB | 512 GB | $0.70/hr | 0 |
| gpu_1x_l40_spot | 1x L40 (48GB) [Spot] | 14 | 72 GiB | 625 GB | $0.78/hr | 19 |
| gpu_1x_6000_ada | 1x RTX 6000 ADA (48GB) | 12 | 72 GiB | 350 GB | $0.79/hr | 0 |
| gpu_1x_l40 | 1x L40 (48GB) | 14 | 72 GiB | 625 GB | $0.86/hr | 19 |

Model Selection Strategy: Choose your model based on VRAM constraints. The L40 (48GB) handles most 7B models comfortably, while 70B models need careful memory management or larger GPUs.

Step-by-Step Deployment

1. Launch GPU VM

Provision a GPU instance using the Massed Compute API or dashboard. Select image 184 (Ubuntu 24.04 with NVIDIA drivers pre-installed):

curl -X POST https://api.massedcompute.com/v1/instances \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sku": "gpu_1x_l40",
    "imageId": 184,
    "name": "ollama-llm-test"
  }'

Wait for the instance status to show running and note the SSH connection details.
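
The wait step above can be scripted as a polling loop. The following is a minimal sketch, not the provider's tooling: wait_until is a hypothetical helper, and the commented-out get_status function assumes a GET endpoint and a "status" field that this guide does not document, so adjust both to the real API response.

```shell
# Poll a status command until it prints the expected value.
# Usage: wait_until EXPECTED MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local expected="$1" max_tries="$2" delay="$3"
  shift 3
  local attempt=0 status
  while [ "$attempt" -lt "$max_tries" ]; do
    # A failing status command counts as "not ready yet" instead of aborting.
    status=$("$@" 2>/dev/null) || status=""
    if [ "$status" = "$expected" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical status command (endpoint and field name are assumptions):
# get_status() {
#   curl -fsS "https://api.massedcompute.com/v1/instances/$INSTANCE_ID" \
#     -H "Authorization: Bearer $API_TOKEN" |
#     sed -n 's/.*"status": *"\([^"]*\)".*/\1/p'
# }
# wait_until running 60 10 get_status || echo "instance never reached running" >&2
```

Bounding the number of attempts keeps a stuck provisioning request from blocking the script indefinitely while the VM may already be billing.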

2. Connect and Bootstrap

SSH into the VM and install Ollama with GPU support:

ssh ubuntu@YOUR_VM_IP

# Verify GPU is available
nvidia-smi

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
sudo systemctl enable ollama
sudo systemctl start ollama

# Verify installation
ollama --version

3. Select Model

Choose an appropriate model based on your GPU’s VRAM capacity:

| Model Tag | Parameters | ~VRAM | Use Case | Pull Size |
|---|---|---|---|---|
| llama3.2:3b | 3B | ~4-7 GB | Quick testing only | Small/fastest |
| llama3.1:8b | 8B | ~8-10 GB | General chat + coding | Medium/fast |
| llama3.1:70b | 70B | ~40+ GB (Q4) | Advanced reasoning | Large/slow |
| qwen2.5:7b | 7B | ~8 GB | Multilingual assistant | Medium/fast |
| qwen2.5-coder:7b | 7B | ~8 GB | Code-focused tasks | Medium/fast |
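
The selection rule in the table can be expressed as a small filter. This sketch hard-codes the upper VRAM estimate from each row (the 70B entry uses the 40 GB floor, so treat a match there as tight rather than comfortable); models_for_vram is a hypothetical helper name, not an Ollama command.

```shell
# Print the model tags from the table whose VRAM estimate fits the given budget.
# Usage: models_for_vram VRAM_GB
models_for_vram() {
  local vram_gb="$1" tag need_gb
  # Each line: model tag, then the upper VRAM estimate in GB from the table.
  while read -r tag need_gb; do
    if [ "$need_gb" -le "$vram_gb" ]; then
      echo "$tag"
    fi
  done <<'EOF'
llama3.2:3b 7
llama3.1:8b 10
llama3.1:70b 40
qwen2.5:7b 8
qwen2.5-coder:7b 8
EOF
}

# Example: list candidates for a 24 GB card
# models_for_vram 24
```

On the VM, the budget can be read from the driver instead of typed by hand: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits reports total memory in MiB, which you would divide by 1024 before passing it in.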

4. Pull and Test Model

Download your chosen model and run a verification test:

# Replace llama3.1:8b with the model tag you selected
ollama pull llama3.1:8b

# Verify model is available
ollama list

# Run smoke test
ollama run llama3.1:8b "Hello, can you confirm you're working properly?"

# Check GPU utilization
nvidia-smi

The model should respond normally, and nvidia-smi should show VRAM usage.
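
To turn the VRAM check into a number you can compare or log, the query form of nvidia-smi can be piped through a short filter. This is a sketch under the assumption that the output has the "used, total" CSV shape produced by the flags shown in the comment; vram_used_pct is a hypothetical helper name.

```shell
# Read "used, total" MiB pairs on stdin (one GPU per line) and
# print the integer percentage of VRAM in use for each GPU.
vram_used_pct() {
  awk -F', *' 'NF == 2 && $2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# On the VM, shortly after the smoke test:
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_used_pct
```

A value near zero right after the smoke test suggests the model ran on CPU or has already been unloaded; ollama ps shows which models are currently loaded.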

5. Optional: Setup SSH Tunnel

Create a local tunnel for API access or Open WebUI integration:

# From your local machine
ssh -L 11434:127.0.0.1:11434 ubuntu@YOUR_VM_IP

# Test local API access
curl http://127.0.0.1:11434/api/tags

This allows you to use Ollama from local applications or install Open WebUI for a chat interface.
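
Once the tunnel is up, a local application talks to Ollama with plain HTTP requests. As a hedged illustration, the helper below builds a request body for Ollama's /api/generate endpoint; generate_payload is a hypothetical name, and its escaping covers only backslashes and double quotes, so prompts containing newlines or control characters need a real JSON encoder such as jq.

```shell
# Build a non-streaming JSON request body for Ollama's /api/generate endpoint.
# Usage: generate_payload MODEL_TAG PROMPT
generate_payload() {
  local model="$1" prompt="$2" escaped
  # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
  escaped=$(printf '%s' "$prompt" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$model" "$escaped"
}

# With the tunnel open, from your local machine:
# curl -s http://127.0.0.1:11434/api/generate \
#   -d "$(generate_payload llama3.1:8b 'Reply with the word ok.')"
```

Setting "stream" to false makes the server return one JSON object instead of a sequence of chunks, which is easier to inspect in a terminal.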

6. Cleanup Resources

When finished testing, terminate the VM to stop billing:

# Via API
curl -X DELETE https://api.massedcompute.com/v1/instances/INSTANCE_ID \
  -H "Authorization: Bearer YOUR_API_TOKEN"

# Or via dashboard
# Navigate to Instances → Actions → Terminate

Verify termination in your dashboard to ensure billing has stopped.
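
If you script the whole session, it helps to guarantee that teardown runs even when an earlier step fails. The wrapper below is one way to sketch that; run_with_teardown and the terminate_vm function in the comment are hypothetical names, and the wrapper does not cover the script being killed by a signal.

```shell
# Run a command, then always run the teardown command,
# and return the exit status of the main command.
# Usage: run_with_teardown TEARDOWN_COMMAND COMMAND [ARGS...]
run_with_teardown() {
  local teardown="$1"
  shift
  local rc=0
  "$@" || rc=$?
  "$teardown"
  return "$rc"
}

# Hypothetical usage, wrapping the DELETE request shown above:
# terminate_vm() {
#   curl -fsS -X DELETE "https://api.massedcompute.com/v1/instances/$INSTANCE_ID" \
#     -H "Authorization: Bearer $API_TOKEN"
# }
# run_with_teardown terminate_vm ssh ubuntu@YOUR_VM_IP 'ollama run llama3.1:8b "Hello"'
```

Returning the main command's status rather than the teardown's keeps a failed smoke test visible to whatever invoked the script.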

Troubleshooting

SSH Connection Issues: If you get host key warnings, refresh your known hosts with ssh-keygen -R YOUR_VM_IP. Confirm the VM is still running in your dashboard.

No GPU Detected: Verify you launched with image 184. Check GPU status with nvidia-smi. If no GPU is visible, the instance may need a restart or driver installation.

Out of Memory When Loading a Model: Choose a smaller model (3B or 7B) or upgrade to a larger GPU SKU. Pulling a model only writes to disk; VRAM is allocated when the model loads on first run, so check free VRAM with nvidia-smi before running large models.

Empty or Hanging Responses: Confirm the model pull completed fully with ollama list (should show non-zero size). Restart Ollama with sudo systemctl restart ollama.

API Binding Issues: Ollama listens only on 127.0.0.1:11434 by default. If you need external access, set the OLLAMA_HOST environment variable in a drop-in under /etc/systemd/system/ollama.service.d/ and restart the service (not recommended without authentication).

Skip All of This: Deploy with an AI Agent

This entire workflow is available as a tested, machine-readable recipe in the Massed Compute MCP. Instead of manual setup, configure the MCP server and request deployment with natural language.

Add to your MCP settings:

{
  "mcpServers": {
    "massed-compute": {
      "type": "http",
      "url": "https://vm.massedcompute.com/api/mcp",
      "headers": { "Authorization": "Bearer MC_TOKEN" }
    }
  }
}

Then say:

“Launch a GPU VM and set up Ollama with Llama 3.1 8B for testing. I want to verify it works with a simple conversation, then tear down the VM when done.”

The agent matches your request against the recipe catalog, provisions the appropriate GPU instance, runs the setup and verification steps above, and reports back with connection details and test results. If any step fails, it stops and reports the specific error rather than continuing with broken infrastructure.

This recipe was last tested on June 10, 2026.

Quick Setup Reference

For experienced users, here’s the condensed deployment sequence:

# 1. Launch GPU VM
curl -X POST https://api.massedcompute.com/v1/instances \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sku":"gpu_1x_l40","imageId":184}'

# 2. SSH and install
ssh ubuntu@VM_IP
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama

# 3. Pull model and test
ollama pull llama3.1:8b
ollama run llama3.1:8b "Test message"

# 4. Optional: tunnel for local access
ssh -L 11434:127.0.0.1:11434 ubuntu@VM_IP

# 5. Cleanup
curl -X DELETE https://api.massedcompute.com/v1/instances/ID \
  -H "Authorization: Bearer YOUR_TOKEN"

Frequently Asked Questions

1. Which GPU should I choose for different model sizes?

For 7B-8B models (Llama 3.1 8B, Qwen2.5 7B), the L40 (48GB VRAM) at $0.86/hr provides ample headroom. A 70B model at Q4 needs roughly 40+ GB, so it uses nearly all of the L40's 48GB; monitor memory usage closely. For quick testing with 3B models, the RTX A5000 (24GB) at $0.44/hr is enough.

2. Can I run multiple models simultaneously on one GPU?

Yes, but VRAM is additive. Two 7B models will use ~16GB combined, leaving 32GB free on an L40. Use nvidia-smi to monitor usage and ollama ps to see active models. Models auto-unload after 5 minutes of inactivity by default.
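
The additive arithmetic is simple enough to check before loading a second model. In this sketch, vram_headroom is a hypothetical helper that takes whole-GB estimates such as the ~VRAM figures from the model table.

```shell
# Subtract the estimated VRAM of each loaded model from the GPU total.
# Usage: vram_headroom GPU_VRAM_GB MODEL_GB [MODEL_GB...]
vram_headroom() {
  local total="$1" used=0 m
  shift
  for m in "$@"; do
    used=$((used + m))
  done
  echo $((total - used))
}

# Two ~8 GB models on a 48 GB L40:
# vram_headroom 48 8 8
```

With the figures from the answer above, two 8 GB models on a 48 GB card leave 32 GB. A negative result means the combination will not fit in VRAM at once.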

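3. How do I connect Open WebUI or other frontends?

Set up an SSH tunnel with ssh -L 11434:127.0.0.1:11434 ubuntu@VM_IP, then point your frontend to http://127.0.0.1:11434. For Open WebUI, install it locally with Docker and point it at the tunneled Ollama API. Because 127.0.0.1 inside a container is not the host's loopback address, you may need host networking or an explicit OLLAMA_BASE_URL setting, depending on your Docker setup.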

4. What happens if I forget to terminate the VM?

You continue getting billed hourly until termination. Set calendar reminders or use the Massed Compute dashboard’s auto-shutdown feature. The API also supports scheduled termination when creating instances with the terminateAt parameter.
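
To put a number on a forgotten VM, multiply the hourly rate from the pricing table by the hours elapsed. The helper name session_cost is hypothetical, and the sketch ignores any per-minute rounding or storage charges the provider may apply.

```shell
# Estimate the cost of a session in USD, rounded to cents.
# Usage: session_cost HOURLY_RATE_USD HOURS
session_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# An L40 at $0.86/hr left running for a full day:
# session_cost 0.86 24
```

At the listed on-demand L40 rate, one forgotten day comes to $20.64, which is roughly 24 times the cost of a one-hour test.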

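5. Can I save models between VM sessions?

Models are stored on the VM disk and lost on termination. For persistent storage, mount a Massed Compute block volume at Ollama's model directory before pulling models. With the systemd install used in this guide, the service runs as the ollama user and that directory is typically /usr/share/ollama/.ollama/models; /root/.ollama applies only if you run ollama serve manually as root. This preserves downloads across VM recreations but adds storage costs.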