Train LLM LoRA Models with QLoRA on GPU Cloud (2026 Guide)

Set up a complete QLoRA training environment with Hugging Face PEFT and bitsandbytes on a GPU VM. This guide walks you through launching a VM, installing the full training stack, and running efficient LoRA fine-tuning with 4-bit quantization.

💡 AI AGENT SHORTCUT

This entire setup is available as a tested recipe in the Massed Compute MCP. Skip the manual steps and let an AI agent provision the VM and run the complete training pipeline for you.

QLoRA pairs LoRA (Low-Rank Adaptation) with 4-bit quantization of the frozen base model, which lets you fine-tune large language models on a single GPU. This guide sets up a complete training environment using Hugging Face's PEFT library and bitsandbytes for 4-bit quantization, then validates the setup with a quick smoke test.

The smoke test uses the ungated Qwen/Qwen2.5-0.5B-Instruct model and a synthetic dataset to verify your training stack works without downloading large models or hitting API limits.

Required Stack

| Component    | Version           | Purpose                         |
|--------------|-------------------|---------------------------------|
| PyTorch      | 2.6.0 (CUDA 12.4) | Neural network framework        |
| Transformers | 4.49.0            | Hugging Face model library      |
| PEFT         | 0.14.0            | Parameter-efficient fine-tuning |
| bitsandbytes | 0.45.3            | 4-bit quantization              |
| TRL          | 0.15.2            | Training utilities              |

System Requirements

| Resource   | Minimum      | Recommended                     |
|------------|--------------|---------------------------------|
| GPU Memory | 24 GB        | 48 GB+ for larger models        |
| System RAM | 32 GB        | 64 GB+ for batch processing     |
| vCPUs      | 8            | 12+ for data loading            |
| Storage    | 256 GB       | 500 GB+ for model caching       |
| OS         | Ubuntu 24.04 | With NVIDIA drivers pre-installed |

Massed Compute VM Pricing

Here are the current GPU VM options that meet the requirements for LoRA training:

Pricing fetched from the Massed Compute inventory API on June 11, 2026.

| SKU             | Description                 | vCPU | RAM    | Storage | Price    | Capacity |
|-----------------|-----------------------------|------|--------|---------|----------|----------|
| gpu_1x_A30      | 1x A30 (24GB)               | 16   | 48 GiB | 256 GB  | $0.35/hr | 0        |
| gpu_1x_a5000    | 1x RTX A5000 (24GB)         | 10   | 32 GiB | 256 GB  | $0.44/hr | 0        |
| gpu_2x_A30      | 2x A30 (24GB)               | 30   | 96 GiB | 512 GB  | $0.70/hr | 0        |
| gpu_1x_l40_spot | 1x L40 (48GB) [Spot]        | 14   | 72 GiB | 625 GB  | $0.78/hr | 15       |
| gpu_1x_6000_ada | 1x RTX 6000 ADA (48GB)      | 12   | 72 GiB | 350 GB  | $0.79/hr | 14       |
| gpu_1x_l40      | 1x L40 (48GB)               | 14   | 72 GiB | 625 GB  | $0.86/hr | 15       |

Recommended: Start with gpu_1x_l40 for the extra VRAM and compute headroom. The L40 spot option offers significant savings for experimental workloads.
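To compare options programmatically, the sketch below picks the cheapest SKU that currently has capacity and estimates the cost of a short run. The rows are copied from the table above; treating the Capacity column as "units currently available" is an assumption, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    name: str
    vram_gb: int           # memory per GPU, from the Description column
    price_per_hour: float  # USD
    capacity: int          # assumed to mean units currently available


# Rows copied from the pricing table above (June 11, 2026 snapshot).
SKUS = [
    Sku("gpu_1x_A30", 24, 0.35, 0),
    Sku("gpu_1x_a5000", 24, 0.44, 0),
    Sku("gpu_2x_A30", 24, 0.70, 0),
    Sku("gpu_1x_l40_spot", 48, 0.78, 15),
    Sku("gpu_1x_6000_ada", 48, 0.79, 14),
    Sku("gpu_1x_l40", 48, 0.86, 15),
]


def cheapest_available(skus, min_vram_gb):
    """Return the lowest-priced SKU with capacity and enough VRAM, or None."""
    candidates = [s for s in skus if s.capacity > 0 and s.vram_gb >= min_vram_gb]
    return min(candidates, key=lambda s: s.price_per_hour) if candidates else None


def run_cost(sku, hours):
    """Estimated cost in USD for running one instance for the given hours."""
    return round(sku.price_per_hour * hours, 2)


best = cheapest_available(SKUS, min_vram_gb=24)
print(best.name, run_cost(best, hours=3))  # gpu_1x_l40_spot 2.34
```

Spot instances can be reclaimed, so the cheapest row is not always the right choice for a run you cannot easily restart.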

Step-by-Step Deployment

1. Launch GPU VM

Create a new VM with GPU support and pre-installed NVIDIA drivers:

  • SKU: gpu_1x_l40 or similar
  • Image ID: 184 (Ubuntu Server 24.04 w/ Drivers)
  • Instance name: lora-train
  • Attach your SSH key at launch

Wait for the VM to reach running status and become SSH accessible.
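If you want to script the wait, one option is to poll the SSH port until it accepts a TCP connection. This is a minimal sketch using only the standard library; the function name, timeout, and polling interval are illustrative choices, and an open port 22 only suggests that sshd is up, not that your key is accepted.

```python
import socket
import time


def wait_for_port(host, port=22, timeout_s=300.0, interval_s=5.0):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Returns True once a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling wait_for_port("YOUR_VM_IP") before the ssh command in the next step avoids retrying the connection by hand.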

2. Verify GPU Access

Connect to your VM and confirm the GPU is visible:

ssh ubuntu@YOUR_VM_IP
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader

You should see output like:

NVIDIA L40, 45634 MiB, 45628 MiB
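For automated checks, that CSV line is easy to parse. The sketch below is a hypothetical helper that splits the line into fields and confirms the GPU is mostly idle; the 90% free threshold is an arbitrary assumption, not a requirement from this guide.

```python
def parse_gpu_line(line):
    """Parse one line of
    `nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader`.
    """
    name, total, free = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "total_mib": int(total.split()[0]),  # "45634 MiB" -> 45634
        "free_mib": int(free.split()[0]),
    }


sample = "NVIDIA L40, 45634 MiB, 45628 MiB"
gpu = parse_gpu_line(sample)
print(gpu)

# Assumed threshold: treat the GPU as idle if more than 90% of memory is free.
is_idle = gpu["free_mib"] / gpu["total_mib"] > 0.9
print("idle:", is_idle)
```

Note that the reported total (45634 MiB here) is lower than the nominal 48 GB, so compare against reported values rather than marketing sizes.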

3. Bootstrap Training Environment

Install system dependencies and create the Python environment:

sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
  python3-venv python3-pip git curl

mkdir -p ~/lora-training/scripts ~/lora-training/outputs
python3 -m venv ~/lora-training-env
source ~/lora-training-env/bin/activate

python -m pip install --upgrade pip setuptools wheel

4. Install Training Stack

Install the pinned versions of PyTorch and training libraries:

python -m pip install \
  --extra-index-url https://download.pytorch.org/whl/cu124 \
  'torch==2.6.0' \
  'transformers==4.49.0' \
  'datasets==3.3.2' \
  'accelerate==1.4.0' \
  'peft==0.14.0' \
  'trl==0.15.2' \
  'bitsandbytes==0.45.3' \
  sentencepiece protobuf packaging

5. Create Environment Script

Set up environment activation and configuration:

cat > ~/lora-training/env.sh <<'ENV'
source ~/lora-training-env/bin/activate
export TOKENIZERS_PARALLELISM=false
ENV

source ~/lora-training/env.sh

6. Verify Installation

Test that all components are working correctly:

python -c "
import torch, transformers, peft, trl, bitsandbytes
print('torch', torch.__version__, 'cuda', torch.cuda.is_available(), torch.cuda.get_device_name(0))
print('transformers', transformers.__version__)
print('peft', peft.__version__)
print('trl', trl.__version__)
print('bitsandbytes', bitsandbytes.__version__)
"
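Printing versions confirms the imports work, but it does not tell you whether the versions match the pins from step 4. A possible follow-up check, sketched with the standard library's importlib.metadata, is shown here; the helper names are hypothetical, and stripping a local build tag such as "+cu124" before comparing is a design choice of this sketch.

```python
from importlib import metadata

# Pins copied from the install command in step 4.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.49.0",
    "datasets": "3.3.2",
    "accelerate": "1.4.0",
    "peft": "0.14.0",
    "trl": "0.15.2",
    "bitsandbytes": "0.45.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Drop a local build tag such as "+cu124" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{package}: wanted {wanted}, found {found}")
```

No output means every pinned package is installed at the expected version.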

7. Create Training Script

Create the QLoRA training script with proper configuration:

cat > ~/lora-training/scripts/smoke_lora_train.py <<'PY'
import os
from pathlib import Path
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = os.environ.get('BASE_MODEL', 'Qwen/Qwen2.5-0.5B-Instruct')
output_dir = Path(os.environ.get('OUTPUT_DIR', '~/lora-training/outputs/smoke')).expanduser()
max_seq_length = int(os.environ.get('MAX_SEQ_LENGTH', '512'))
max_steps = int(os.environ.get('SMOKE_MAX_STEPS', '10'))
output_dir.mkdir(parents=True, exist_ok=True)

# Create synthetic training data
texts = [
    '### Instruction: Say hello in one short sentence.\n### Response: Hello from a tiny LoRA smoke test.',
    '### Instruction: Name one GPU vendor.\n### Response: NVIDIA.',
    '### Instruction: What is QLoRA used for?\n### Response: Efficient fine tuning of language models.',
    '### Instruction: Reply with the word adapter.\n### Response: adapter',
] * 16
raw_dataset = Dataset.from_dict({'text': texts})

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize dataset
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=max_seq_length)

train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=['text'])

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map='auto',
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
)
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Configure training
training_args = TrainingArguments(
    output_dir=str(output_dir),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    max_steps=max_steps,
    logging_steps=1,
    save_steps=max_steps,
    save_total_limit=1,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    optim='paged_adamw_8bit',
    report_to=[],
    remove_unused_columns=False,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collator,
)

# Run training
trainer.train()
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Verify output
assert (output_dir / 'adapter_config.json').exists()
print('adapter_config', output_dir / 'adapter_config.json')
print('LORA_SMOKE_OK')
PY
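The script calls model.print_trainable_parameters(), and it can help to know roughly what number to expect. The arithmetic below is a sketch: each adapted linear layer gains r * (d_in + d_out) trainable weights. The layer dimensions are assumptions taken from the published Qwen2.5-0.5B configuration as best recalled here, so check them against the model's config.json before relying on the total.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted linear layer."""
    return r * (d_in + d_out)


# ASSUMPTION: dimensions for Qwen/Qwen2.5-0.5B-Instruct; verify in config.json.
HIDDEN = 896          # hidden_size
INTERMEDIATE = 4864   # intermediate_size
LAYERS = 24           # num_hidden_layers
KV_DIM = 128          # 2 key/value heads x head dimension 64

R = 8  # matches r=8 in the LoraConfig of the training script

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)          # q_proj
    + lora_params(HIDDEN, KV_DIM, R)        # k_proj
    + lora_params(HIDDEN, KV_DIM, R)        # v_proj
    + lora_params(HIDDEN, HIDDEN, R)        # o_proj
    + lora_params(HIDDEN, INTERMEDIATE, R)  # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE, R)  # up_proj
    + lora_params(INTERMEDIATE, HIDDEN, R)  # down_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters under the assumed dimensions")
```

If the figure reported during training differs, the assumed dimensions or the target_modules list are the first things to re-check.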

8. Run Training

Execute the smoke test to validate your setup:

source ~/lora-training/env.sh
rm -rf ~/lora-training/outputs/smoke
export BASE_MODEL="Qwen/Qwen2.5-0.5B-Instruct"
export OUTPUT_DIR="$HOME/lora-training/outputs/smoke"
export MAX_SEQ_LENGTH="512"
export SMOKE_MAX_STEPS="10"
python ~/lora-training/scripts/smoke_lora_train.py

The training should complete in 2-3 minutes and print LORA_SMOKE_OK.

9. Verify Adapter Files

Check that the LoRA adapter was created successfully:

ls -lah ~/lora-training/outputs/smoke
test -f ~/lora-training/outputs/smoke/adapter_config.json && echo "Adapter created successfully"

You should see files like adapter_config.json and adapter_model.safetensors.
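Beyond checking that the files exist, you can confirm the saved adapter reflects the configuration you intended. This sketch reads adapter_config.json with the standard library; the helper name is hypothetical, and the keys shown are the ones PEFT normally writes for a LoRA adapter.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir):
    """Return the key LoRA settings recorded in adapter_config.json."""
    config_path = Path(adapter_dir).expanduser() / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return {
        "peft_type": config.get("peft_type"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        # Sorted so the output is stable regardless of how the list was saved.
        "target_modules": sorted(config.get("target_modules") or []),
        "base_model": config.get("base_model_name_or_path"),
    }
```

For the smoke test, summarize_adapter("~/lora-training/outputs/smoke") should report r=8, lora_alpha=16, and the seven projection modules listed in the training script.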

Troubleshooting

CUDA Out of Memory

If you hit OOM errors, try these adjustments:

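  • Reduce MAX_SEQ_LENGTH from 512 to 256
  • Lower the LoRA rank (r) or trim target_modules in the script to shrink the adapter and its optimizer state
  • Use a smaller base model. Models with a different architecture need different target_modules; for example, the GPT-2-based microsoft/DialoGPT-small uses c_attn and c_proj rather than q_proj, k_proj, and so on.

Lowering SMOKE_MAX_STEPS shortens the run but does not reduce peak memory.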

Gated Model Access

If using gated models, set up authentication:

pip install huggingface_hub
huggingface-cli login

TRL API Changes

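This guide uses explicit tokenization with the Transformers Trainer rather than TRL's SFTTrainer, whose arguments have changed between TRL releases. TRL is still installed (pinned to trl==0.15.2) so you can experiment with it, but the smoke test does not depend on it.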

Mixed Precision Issues

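The script selects bf16 when torch.cuda.is_bf16_supported() returns True and falls back to fp16 otherwise. If you encounter gradient scaler errors, they most likely come from the fp16 fallback path; confirm that your GPU supports bf16 and that the script is actually selecting it.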

Skip All of This: Deploy with an AI Agent

This entire guide exists as a tested, machine-readable recipe in the Massed Compute MCP. Instead of running each step manually, you can have an AI agent provision the right VM shape and execute the complete setup for you.

Add this server config to your MCP client:

{
  "mcpServers": {
    "massed-compute": {
      "type": "http",
      "url": "https://vm.massedcompute.com/api/mcp",
      "headers": { "Authorization": "Bearer MC_TOKEN" }
    }
  }
}
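To keep the token out of files you might commit, one approach is to generate the config at setup time. The sketch below builds the same structure in Python; reading the token from an environment variable named MC_TOKEN is an assumption made for illustration, and where the resulting JSON must be saved depends on your MCP client.

```python
import json
import os


def build_mcp_config(token):
    """Build the MCP server config shown above with the given bearer token."""
    return {
        "mcpServers": {
            "massed-compute": {
                "type": "http",
                "url": "https://vm.massedcompute.com/api/mcp",
                "headers": {"Authorization": f"Bearer {token}"},
            }
        }
    }


# ASSUMPTION: the token is exported as MC_TOKEN; falls back to the placeholder.
token = os.environ.get("MC_TOKEN", "MC_TOKEN")
print(json.dumps(build_mcp_config(token), indent=2))
```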

Then say:

"Set up a complete LoRA training environment with QLoRA on a GPU VM. I need Hugging Face PEFT, bitsandbytes, and PyTorch 2.6 with CUDA 12.4. Run the smoke test with Qwen2.5-0.5B to verify everything works."

The agent will match your request against the recipe catalog, provision the right VM shape, install the complete training stack, run the verification steps above, and report back with connection details and confirmation that the LoRA adapter was created successfully. The process stops immediately if any step fails, so you get reliable deployment or clear error details.

Recipe last tested: June 10, 2026


Quick Setup Reference

For experienced users, here's the complete setup in one script:

#!/bin/bash
set -euxo pipefail

# Install system deps
sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y python3-venv python3-pip git

# Create environment
# Same venv path as step 3, so ~/lora-training/env.sh from step 5 still works
python3 -m venv ~/lora-training-env && source ~/lora-training-env/bin/activate

# Install training stack
pip install --upgrade pip
pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
  torch==2.6.0 transformers==4.49.0 datasets==3.3.2 \
  accelerate==1.4.0 peft==0.14.0 trl==0.15.2 bitsandbytes==0.45.3

# Verify installation
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"

echo "LoRA training environment ready"

Frequently Asked Questions

01 What's the difference between LoRA and QLoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small low-rank matrices alongside the frozen weights of a base model. QLoRA extends LoRA by storing those frozen base weights in 4-bit precision, which cuts memory usage and lets you train larger models on smaller GPUs while largely preserving quality.
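The low-rank update itself is small enough to write out by hand. This pure-Python sketch computes y = W x + (alpha / r) * B (A x) for a toy layer; the matrices and values are made up for illustration, and real implementations use tensors rather than lists. In QLoRA the only difference at this level is that W is stored in 4-bit form and dequantized for the multiplication.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight, d_out=2, d_in=3
A = [[0.5, 0.5, 0.5]]                   # r x d_in with r = 1
B_zero = [[0.0], [0.0]]                 # d_out x r, zero at initialization
B_trained = [[1.0], [-1.0]]             # toy values standing in for training
x = [1.0, 2.0, 3.0]

# With B at zero the adapter is a no-op, so the output equals W x.
print(lora_forward(W, A, B_zero, x, alpha=2, r=1))     # [7.0, 2.0]
# After training, the low-rank term shifts the output.
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # [13.0, -4.0]
```

Starting B at zero is why a freshly wrapped model behaves exactly like the base model before the first optimizer step.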

02 How much GPU memory do I need for different model sizes?

With QLoRA 4-bit quantization: 7B models need ~12GB, 13B models need ~20GB, and 30B+ models need 40GB+. The smoke test uses a 0.5B model that runs comfortably in 8GB. Always account for batch size and sequence length in your memory calculations.
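As a rough cross-check on those figures, the weights alone can be estimated from the parameter count. This sketch counts only the base weights (ignoring quantization constants, activations, adapter weights, optimizer state, and CUDA overhead), which is why the totals quoted above are several gigabytes higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


for label, n in [("0.5B", 0.5e9), ("7B", 7e9), ("13B", 13e9), ("30B", 30e9)]:
    fp16 = weight_memory_gb(n, 16)
    nf4 = weight_memory_gb(n, 4)
    print(f"{label}: {fp16:.2f} GB at 16-bit, {nf4:.2f} GB at 4-bit")
```

For a 7B model this gives 14 GB at 16-bit versus 3.5 GB at 4-bit for the weights, leaving the remainder of the budget for everything the estimate leaves out.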

03 Can I use custom datasets instead of the smoke test data?

Yes. Replace the synthetic texts list in the training script with your own instruction-response pairs. For larger datasets, load from files using the Hugging Face datasets library. Format your data as instruction-following pairs for best results with chat models.
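One way to prepare your own data is to keep it in a JSON Lines file and render each record with the same template the smoke test uses. In this sketch the file layout and the "instruction" and "response" key names are assumptions; adapt them to however your data is stored.

```python
import json

# Same layout as the synthetic examples in the training script.
TEMPLATE = "### Instruction: {instruction}\n### Response: {response}"


def load_pairs(jsonl_path):
    """Read one JSON object per line and render it with TEMPLATE.

    Each object is assumed to have 'instruction' and 'response' keys.
    """
    texts = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            texts.append(
                TEMPLATE.format(
                    instruction=record["instruction"].strip(),
                    response=record["response"].strip(),
                )
            )
    return texts
```

In the training script, texts = load_pairs("train.jsonl") (a hypothetical file name) can then stand in for the hard-coded list before Dataset.from_dict is called.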

04 How do I use the trained adapter for inference?

Load the base model, then attach the adapter with PEFT's PeftModel class (imported from peft): model = AutoModelForCausalLM.from_pretrained(base_model); model = PeftModel.from_pretrained(model, adapter_path). The adapter files are in the output directory specified during training (OUTPUT_DIR in this guide).
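A fuller version of that snippet might look like the following. It is a sketch rather than a tuned inference setup: the function names are hypothetical, the prompt reuses the template from the smoke test data, and the heavy imports sit inside the function so the prompt helper can be used on a machine without the GPU stack.

```python
def build_prompt(instruction):
    """Render an instruction in the template used by the smoke test data."""
    return f"### Instruction: {instruction}\n### Response:"


def generate(base_model, adapter_path, instruction, max_new_tokens=32):
    """Load the base model plus LoRA adapter and generate a completion."""
    # Imported lazily: these need the training stack from step 4.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The training script saved the tokenizer next to the adapter.
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    model = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype="auto", device_map="auto"
    )
    model = PeftModel.from_pretrained(model, adapter_path)
    model.eval()

    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the smoke test output in place, a call such as generate("Qwen/Qwen2.5-0.5B-Instruct", "/home/ubuntu/lora-training/outputs/smoke", "Name one GPU vendor.") exercises the adapter; after only 10 training steps, do not expect the answers to differ much from the base model.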

05 What if I need to train on multiple GPUs?

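Use Accelerate for multi-GPU training: start the script with accelerate launch instead of python. For data-parallel runs, each process typically needs the whole model on its own GPU, so replace device_map='auto' in the training script with a per-process mapping. Multi-GPU SKUs (gpu_2x_A30 in the pricing table above, or a dual-L40 SKU where available) suit distributed training of larger models.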
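A common pattern for data-parallel QLoRA is to place the whole quantized model on the GPU that belongs to the current process, instead of letting device_map='auto' spread it across devices. The helper below is a hypothetical sketch; it assumes the launcher exports LOCAL_RANK for each process, which is the usual behavior for accelerate launch and torchrun but worth verifying in your environment.

```python
import os


def per_process_device_map(environ=os.environ):
    """Map the whole model (the "" root module) to this process's GPU.

    ASSUMPTION: the launcher sets LOCAL_RANK; defaults to GPU 0 otherwise.
    """
    local_rank = int(environ.get("LOCAL_RANK", "0"))
    return {"": local_rank}


print(per_process_device_map())
```

In the training script, passing device_map=per_process_device_map() to AutoModelForCausalLM.from_pretrained in place of device_map='auto' is the change this sketch has in mind; on a single GPU it resolves to device 0, so the script keeps working unchanged there.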