Deploy Ollama on GPU cloud infrastructure and test large language models with interactive model selection. Complete workflow from VM provisioning to automated teardown.
Skip this manual setup. The complete workflow below exists as a tested recipe in the Massed Compute MCP: provision a GPU VM, install Ollama, pull your model, and verify, all with one natural-language request.
Running large language models requires GPU infrastructure that’s both powerful and cost-effective. This guide walks through deploying Ollama on Massed Compute GPU VMs with interactive model selection and proper teardown.
You’ll provision a GPU instance, bootstrap Ollama with NVIDIA drivers, select from popular model options based on your VRAM constraints, and run verification tests before cleanup.
Technology Stack
| Component | Purpose | Version |
|---|---|---|
| Ollama | LLM inference server | Latest stable |
| Ubuntu 24.04 | Base OS with NVIDIA drivers | Image 184 |
| NVIDIA GPU | Model acceleration | L40, A30, A6000, etc. |
| CUDA Toolkit | GPU compute framework | Pre-installed |
Requirements
| Requirement | Specification | Notes |
|---|---|---|
| Massed Compute Account | API credentials configured | For VM provisioning |
| GPU VM | 8+ vCPU, 32+ GB RAM | Minimum for 7B models |
| VRAM | 8+ GB for 7B, 48+ GB for 70B | Model-dependent |
| Storage | 256+ GB | For model downloads |
| SSH Client | Terminal or SSH app | For VM access |
Massed Compute VM Pricing
| SKU | Description | vCPU | RAM | Storage | Price | Capacity |
|---|---|---|---|---|---|---|
| gpu_1x_A30 | 1x A30 (24GB) | 16 | 48 GiB | 256 GB | $0.35/hr | 0 |
| gpu_1x_a5000 | 1x RTX A5000 (24GB) | 10 | 32 GiB | 256 GB | $0.44/hr | 0 |
| gpu_2x_A30 | 2x A30 (24GB) | 30 | 96 GiB | 512 GB | $0.70/hr | 0 |
| gpu_1x_l40_spot | 1x L40 (48GB) [Spot] | 14 | 72 GiB | 625 GB | $0.78/hr | 19 |
| gpu_1x_6000_ada | 1x RTX 6000 ADA (48GB) | 12 | 72 GiB | 350 GB | $0.79/hr | 0 |
| gpu_1x_l40 | 1x L40 (48GB) | 14 | 72 GiB | 625 GB | $0.86/hr | 19 |
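To turn the hourly prices above into a budget, multiply the price by the number of hours you expect the VM to run. The helper below is a minimal sketch of that arithmetic; the function name `session_cost` is illustrative, and the hour counts are your own estimates rather than anything the pricing table guarantees.

```shell
# Estimate the cost of a session: hourly price (USD) x hours.
# Prices come from the pricing table; hours is your own estimate.
session_cost() {
  local price_per_hour="$1" hours="$2"
  awk -v p="$price_per_hour" -v h="$hours" 'BEGIN { printf "%.2f\n", p * h }'
}

# Example: a 3-hour test on gpu_1x_l40 at $0.86/hr
session_cost 0.86 3   # prints 2.58
```

Billing continues until the VM is terminated, so count the whole time the instance exists, not just the time the model is answering prompts.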
Step-by-Step Deployment
Launch GPU VM
Provision a GPU instance using the Massed Compute API or dashboard. Select image 184 (Ubuntu 24.04 with NVIDIA drivers pre-installed):
curl -X POST https://api.massedcompute.com/v1/instances \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sku": "gpu_1x_l40",
"imageId": 184,
"name": "ollama-llm-test"
}'
Wait for the instance status to show running and note the SSH connection details.
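The wait in this step can be scripted. The sketch below polls any command that prints the instance state as a single word and stops once it prints `running`. This guide does not document a status endpoint, so `get_instance_status` in the usage comment is a hypothetical placeholder for whatever you use to read the state (an API call, a CLI, or a dashboard export).

```shell
# Poll a status command until it prints "running" or attempts run out.
# Arguments: max attempts, delay in seconds, then the command to run.
wait_for_running() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1 status=""
  while [ "$attempt" -le "$max_attempts" ]; do
    # The status command's stderr is discarded; a failing command counts as "not running yet".
    status="$("$@" 2>/dev/null || true)"
    if [ "$status" = "running" ]; then
      echo "running after $attempt attempt(s)"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out after $max_attempts attempts (last status: ${status:-unknown})" >&2
  return 1
}

# Usage (get_instance_status is hypothetical; supply your own command):
#   wait_for_running 30 10 get_instance_status INSTANCE_ID
```

A bounded number of attempts keeps a stuck provisioning request from leaving a script waiting forever.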
Connect and Bootstrap
SSH into the VM and install Ollama with GPU support:
ssh ubuntu@YOUR_VM_IP
# Verify GPU is available
nvidia-smi
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
sudo systemctl enable ollama
sudo systemctl start ollama
# Verify installation
ollama --version
Select Model
Choose an appropriate model based on your GPU’s VRAM capacity:
| Model Tag | Parameters | ~VRAM | Use Case | Pull Size |
|---|---|---|---|---|
| llama3.2:3b | 3B | ~4-7 GB | Quick testing only | Small/fastest |
| llama3.1:8b | 8B | ~8-10 GB | General chat + coding | Medium/fast |
| llama3.1:70b | 70B | ~40+ GB Q4 | Advanced reasoning | Large/slow |
| qwen2.5:7b | 7B | ~8 GB | Multilingual assistant | Medium/fast |
| qwen2.5-coder:7b | 7B | ~8 GB | Code-focused tasks | Medium/fast |
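The selection can be automated from the VRAM that `nvidia-smi` reports. The sketch below picks the largest Llama tag from the table that fits, using the upper end of each VRAM estimate as a threshold. Those thresholds (40, 10, and 7 GiB expressed in MiB) are rough assumptions taken from the table, not measured requirements, and the function name `pick_model` is illustrative.

```shell
# Map total VRAM in MiB to the largest Llama tag from the table that should fit.
# Thresholds are assumptions based on the table's approximate VRAM column.
pick_model() {
  local vram_mib="$1"
  if   [ "$vram_mib" -ge 40960 ]; then echo "llama3.1:70b"
  elif [ "$vram_mib" -ge 10240 ]; then echo "llama3.1:8b"
  elif [ "$vram_mib" -ge 7168 ];  then echo "llama3.2:3b"
  else
    echo "no model in the table fits ${vram_mib} MiB" >&2
    return 1
  fi
}

# Usage on the VM (nvidia-smi reports memory.total in MiB with nounits):
#   vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
#   ollama pull "$(pick_model "$vram")"
```

Treat the result as a starting point: context length and concurrent requests add memory on top of the model weights, so a model that barely fits may still run out of VRAM under load.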
Pull and Test Model
Download your chosen model and run a verification test:
# Replace llama3.1:8b with your selected model tag
ollama pull llama3.1:8b
# Verify model is available
ollama list
# Run smoke test
ollama run llama3.1:8b "Hello, can you confirm you're working properly?"
# Check GPU utilization
nvidia-smi
The model should respond normally, and nvidia-smi should show VRAM usage.
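Besides `nvidia-smi`, `ollama ps` lists loaded models and where they run. The sketch below checks that output for full GPU offload. It assumes the PROCESSOR column reads `100% GPU` for a fully offloaded model, as in Ollama's documented examples; the exact column layout can differ between versions, so treat the match as a heuristic.

```shell
# Reads `ollama ps` output on stdin. Succeeds only if at least one model
# is loaded and every loaded model line contains "100% GPU".
all_on_gpu() {
  awk 'NR > 1 && NF { n++; if ($0 !~ /100% GPU/) bad++ }
       END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Usage on the VM, shortly after a prompt so the model is still loaded:
#   ollama ps | all_on_gpu && echo "fully offloaded to GPU"
```

A split such as CPU/GPU in that column usually means the model does not fit in VRAM, which is a cue to pick a smaller tag or a larger GPU SKU.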
Optional: Setup SSH Tunnel
Create a local tunnel for API access or Open WebUI integration:
# From your local machine (this session stays open while the tunnel is active)
ssh -L 11434:127.0.0.1:11434 ubuntu@YOUR_VM_IP
# In a second local terminal, test local API access
curl http://127.0.0.1:11434/api/tags
This allows you to use Ollama from local applications or install Open WebUI for a chat interface.
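With the tunnel open, local scripts can call the Ollama HTTP API directly. The sketch below builds a request body for the `/api/generate` endpoint with streaming disabled and sends it with `curl`. The function names are illustrative, and the payload builder assumes the model tag and prompt contain no double quotes, backslashes, or newlines, since it does no JSON escaping.

```shell
# Build a JSON body for Ollama's /api/generate endpoint (non-streaming).
# Assumption: neither argument contains characters that need JSON escaping.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Send a prompt through the tunnel on 127.0.0.1:11434 and print the raw JSON reply.
ask() {
  curl -fsS http://127.0.0.1:11434/api/generate -d "$(generate_payload "$1" "$2")"
}

# Usage from your local machine, with the tunnel running:
#   ssh -fN -L 11434:127.0.0.1:11434 ubuntu@YOUR_VM_IP   # -f backgrounds, -N runs no remote command
#   ask llama3.1:8b "Say hi in five words"
```

For prompts with arbitrary text, build the JSON with a tool that escapes strings properly instead of `printf`.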
Cleanup Resources
When finished testing, terminate the VM to stop billing:
# Via API
curl -X DELETE https://api.massedcompute.com/v1/instances/INSTANCE_ID \
-H "Authorization: Bearer YOUR_API_TOKEN"
# Or via dashboard
# Navigate to Instances → Actions → Terminate
Verify termination in your dashboard to ensure billing has stopped.
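If you script the whole test, one way to avoid a forgotten VM is to tie the teardown to the script's exit. The sketch below runs a command and then always runs a teardown command, even when the first one fails. `with_teardown` is an illustrative name, and the usage comment reuses the DELETE call and placeholders from this step.

```shell
# Run a command, then always run the teardown command afterwards,
# preserving the exit status of the main command.
with_teardown() {
  local teardown="$1"
  shift
  (
    trap "$teardown" EXIT   # fires when this subshell exits, on success or failure
    "$@"
    exit $?
  )
}

# Usage (INSTANCE_ID, YOUR_API_TOKEN and YOUR_VM_IP are placeholders):
#   with_teardown \
#     'curl -X DELETE https://api.massedcompute.com/v1/instances/INSTANCE_ID -H "Authorization: Bearer YOUR_API_TOKEN"' \
#     ssh ubuntu@YOUR_VM_IP 'ollama run llama3.1:8b "Test message"'
```

The trap only covers exits of the local script. If your machine loses power or network mid-run, the DELETE call never happens, so still confirm termination in the dashboard.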
Troubleshooting
SSH Connection Issues: If you get host key warnings, refresh your known hosts with ssh-keygen -R YOUR_VM_IP. Confirm the VM is still running in your dashboard.
No GPU Detected: Verify you launched with image 184. Check GPU status with nvidia-smi. If no GPU is visible, the instance may need a restart or driver installation.
Model Pull Out of Memory: Choose a smaller model (3B or 7B) or upgrade to a larger GPU SKU. Check available VRAM with nvidia-smi before pulling large models.
Empty or Hanging Responses: Confirm the model pull completed fully with ollama list (should show non-zero size). Restart Ollama with sudo systemctl restart ollama.
API Binding Issues: Ollama defaults to local-only binding. Check service configuration in /etc/systemd/system/ollama.service.d/ if you need external access (not recommended without authentication).
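To check how the service is bound, read the `OLLAMA_HOST` value from the service environment and test whether it is a loopback address. The helper below is a minimal sketch: it treats an empty value as Ollama's default of 127.0.0.1:11434 and recognizes only the common loopback spellings, so anything unusual is reported as not local-only. The name `is_local_only` is illustrative.

```shell
# Succeeds if the given OLLAMA_HOST value keeps the API on loopback only.
# An empty value means the variable is unset, so the local-only default applies.
is_local_only() {
  local host="${1#*://}"   # drop an optional scheme such as http://
  case "$host" in
    ""|127.*|localhost|localhost:*|"[::1]"|"[::1]:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the VM: inspect the service environment, then test the value you find.
#   systemctl show ollama --property=Environment
#   is_local_only "0.0.0.0:11434" || echo "API is reachable from outside the VM"
```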
Skip All of This: Deploy with an AI Agent
This entire workflow is available as a tested, machine-readable recipe in the Massed Compute MCP. Instead of manual setup, configure the MCP server and request deployment with natural language.
Add to your MCP settings:
{
"mcpServers": {
"massed-compute": {
"type": "http",
"url": "https://vm.massedcompute.com/api/mcp",
"headers": { "Authorization": "Bearer MC_TOKEN" }
}
}
}
Then ask your agent, in natural language, to deploy Ollama with the model you want.
The agent matches your request against the recipe catalog, provisions the appropriate GPU instance, runs the setup and verification steps above, and reports back with connection details and test results. If any step fails, it stops and reports the specific error rather than continuing with broken infrastructure.
This recipe was last tested on June 10, 2026.
Quick Setup Reference
For experienced users, here’s the condensed deployment sequence:
# 1. Launch GPU VM
curl -X POST https://api.massedcompute.com/v1/instances \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"sku":"gpu_1x_l40","imageId":184}'
# 2. SSH and install
ssh ubuntu@VM_IP
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
# 3. Pull model and test
ollama pull llama3.1:8b
ollama run llama3.1:8b "Test message"
# 4. Optional: tunnel for local access
ssh -L 11434:127.0.0.1:11434 ubuntu@VM_IP
# 5. Cleanup
curl -X DELETE https://api.massedcompute.com/v1/instances/ID \
-H "Authorization: Bearer YOUR_TOKEN"
Frequently Asked Questions
01. Which GPU should I choose for different model sizes?
For 7B and 8B models (Qwen2.5 7B, Llama 3.1 8B), the L40 (48GB VRAM) at $0.86/hr provides ample headroom. For 70B models, you need the full 48GB and should monitor memory usage closely. For quick testing with 3B models, even the A5000 (24GB) works fine at $0.44/hr.
02. Can I run multiple models simultaneously on one GPU?
Yes, but VRAM is additive. Two 7B models will use ~16GB combined, leaving 32GB free on an L40. Use nvidia-smi to monitor usage and ollama ps to see active models. Models auto-unload after 5 minutes of inactivity by default.
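The additive arithmetic is easy to check before loading a second model. The sketch below subtracts each model's approximate VRAM footprint from the GPU total. It works in whole gigabytes only, and the per-model figures are the rough estimates used in this guide, not measured values.

```shell
# Remaining VRAM in GB: total minus the sum of each loaded model's footprint.
# Integer arithmetic only; pass whole-GB estimates.
vram_free_gb() {
  local total="$1"
  shift
  local used=0 m
  for m in "$@"; do
    used=$((used + m))
  done
  echo $((total - used))
}

# Two ~8 GB models on a 48 GB L40
vram_free_gb 48 8 8   # prints 32
```

A negative result means the set of models will not fit together, and Ollama would need to unload one or spill to CPU.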
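03. How do I connect Open WebUI or other frontends?
Set up an SSH tunnel with ssh -L 11434:127.0.0.1:11434 ubuntu@VM_IP, then point your frontend to http://127.0.0.1:11434. For Open WebUI, install it locally with Docker. If it does not detect the tunneled Ollama API automatically, set the Ollama URL explicitly (the OLLAMA_BASE_URL setting), because 127.0.0.1 inside a container refers to the container itself rather than your machine.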
04. What happens if I forget to terminate the VM?
You continue getting billed hourly until termination. Set calendar reminders or use the Massed Compute dashboard’s auto-shutdown feature. The API also supports scheduled termination when creating instances with the terminateAt parameter.
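05. Can I save models between VM sessions?
Models are stored on the VM disk and are lost on termination. For persistent storage, mount a Massed Compute block volume at Ollama's model directory before pulling models. With the systemd service set up by the install script, that directory is typically /usr/share/ollama/.ollama/models, because the service runs as the ollama user; /root/.ollama applies only if you run ollama serve as root. This preserves downloads across VM recreations but adds storage costs.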











