Set up a complete QLoRA training environment with Hugging Face PEFT and bitsandbytes on a GPU VM. This guide walks you through launching a VM, installing the full training stack, and running efficient LoRA fine-tuning with 4-bit quantization.
This entire setup is available as a tested recipe in the Massed Compute MCP. Skip the manual steps and let an AI agent provision the VM and run the complete training pipeline for you.
LoRA (Low-Rank Adaptation) training with QLoRA quantization lets you fine-tune large language models efficiently on single GPUs. This guide sets up a complete training environment using Hugging Face’s PEFT library, bitsandbytes for 4-bit quantization, and validates the setup with a quick smoke test.
The smoke test uses the ungated Qwen/Qwen2.5-0.5B-Instruct model and a synthetic dataset to verify your training stack works without downloading large models or hitting API limits.
| Component | Version | Purpose |
|---|---|---|
| PyTorch | 2.6.0 (CUDA 12.4) | Neural network framework |
| Transformers | 4.49.0 | Hugging Face model library |
| PEFT | 0.14.0 | Parameter-efficient fine-tuning |
| bitsandbytes | 0.45.3 | 4-bit quantization |
| TRL | 0.15.2 | Training utilities |
| Resource | Minimum | Recommended |
|---|---|---|
| GPU Memory | 24 GB | 48 GB+ for larger models |
| System RAM | 32 GB | 64 GB+ for batch processing |
| vCPUs | 8 | 12+ for data loading |
| Storage | 256 GB | 500 GB+ for model caching |
| OS | Ubuntu 24.04 | With NVIDIA drivers pre-installed |
Massed Compute VM Pricing
Here are the current GPU VM options that meet the requirements for LoRA training:
| SKU | Description | vCPU | RAM | Storage | Price | Capacity |
|---|---|---|---|---|---|---|
| gpu_1x_A30 | 1x A30 (24GB) | 16 | 48 GiB | 256 GB | $0.35/hr | 0 |
| gpu_1x_a5000 | 1x RTX A5000 (24GB) | 10 | 32 GiB | 256 GB | $0.44/hr | 0 |
| gpu_2x_A30 | 2x A30 (24GB) | 30 | 96 GiB | 512 GB | $0.70/hr | 0 |
| gpu_1x_l40_spot | 1x L40 (48GB) [Spot] | 14 | 72 GiB | 625 GB | $0.78/hr | 15 |
| gpu_1x_6000_ada | 1x RTX 6000 ADA (48GB) | 12 | 72 GiB | 350 GB | $0.79/hr | 14 |
| gpu_1x_l40 | 1x L40 (48GB) | 14 | 72 GiB | 625 GB | $0.86/hr | 15 |
Recommended: gpu_1x_l40 for the extra VRAM and compute headroom. The gpu_1x_l40_spot option costs about 9% less ($0.78/hr versus $0.86/hr), which makes it a reasonable choice for experimental workloads.
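To compare options before launching, the hourly prices in the table can be turned into a rough per-run cost. This is a minimal sketch: the prices are copied from the table and may change, and the helper names `run_cost` and `saving_fraction` are made up for illustration.

```python
# Hourly prices copied from the pricing table in this guide (USD per hour).
# They are a snapshot and may not match current pricing.
PRICE_PER_HOUR = {
    "gpu_1x_A30": 0.35,
    "gpu_1x_a5000": 0.44,
    "gpu_2x_A30": 0.70,
    "gpu_1x_l40_spot": 0.78,
    "gpu_1x_6000_ada": 0.79,
    "gpu_1x_l40": 0.86,
}


def run_cost(sku: str, hours: float) -> float:
    """Estimated cost in USD of keeping one VM running for `hours`."""
    return round(PRICE_PER_HOUR[sku] * hours, 2)


def saving_fraction(baseline_sku: str, cheaper_sku: str) -> float:
    """Fraction saved by choosing `cheaper_sku` over `baseline_sku`."""
    baseline = PRICE_PER_HOUR[baseline_sku]
    return (baseline - PRICE_PER_HOUR[cheaper_sku]) / baseline


print("3 hours on gpu_1x_l40:", run_cost("gpu_1x_l40", 3))
print(f"spot saving: {saving_fraction('gpu_1x_l40', 'gpu_1x_l40_spot'):.1%}")
```

The estimate covers VM time only. Include setup and download time in `hours`, not just the training run itself.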
Step-by-Step Deployment
Launch GPU VM
Create a new VM with GPU support and pre-installed NVIDIA drivers:
- SKU: gpu_1x_l40 or similar
- Image ID: 184 (Ubuntu Server 24.04 w/ Drivers)
- Instance name: lora-train
- Attach your SSH key at launch
Wait for the VM to reach running status and become SSH accessible.
Verify GPU Access
Connect to your VM and confirm the GPU is visible:
ssh ubuntu@YOUR_VM_IP
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
You should see output like:
NVIDIA L40, 45634 MiB, 45628 MiB
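If you script the check, the CSV line can be parsed and compared against the 24 GB minimum from the requirements table. The sketch below is illustrative only: `parse_gpu_line` and `meets_minimum` are hypothetical helper names, and the 10% tolerance is an assumption to cover cards that report slightly less than their nominal size.

```python
# Assumed threshold: the guide's 24 GB minimum, with ~10% slack because
# cards often report a little less than their nominal capacity.
MIN_VRAM_MIB = 24 * 1024 * 0.9


def parse_gpu_line(line: str) -> dict:
    """Split one 'name, total MiB, free MiB' CSV line into a dict."""
    name, total, free = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "total_mib": int(total.split()[0]),  # '45634 MiB' -> 45634
        "free_mib": int(free.split()[0]),
    }


def meets_minimum(gpu: dict, min_mib: float = MIN_VRAM_MIB) -> bool:
    """True when the GPU's total memory reaches the assumed threshold."""
    return gpu["total_mib"] >= min_mib


gpu = parse_gpu_line("NVIDIA L40, 45634 MiB, 45628 MiB")
print(gpu["name"], gpu["total_mib"], meets_minimum(gpu))
```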
Bootstrap Training Environment
Install system dependencies and create the Python environment:
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
  python3-venv python3-pip git curl
mkdir -p ~/lora-training/scripts ~/lora-training/outputs
python3 -m venv ~/lora-training-env
source ~/lora-training-env/bin/activate
python -m pip install --upgrade pip setuptools wheel
Install Training Stack
Install the pinned versions of PyTorch and training libraries:
python -m pip install \
  --extra-index-url https://download.pytorch.org/whl/cu124 \
  'torch==2.6.0' \
  'transformers==4.49.0' \
  'datasets==3.3.2' \
  'accelerate==1.4.0' \
  'peft==0.14.0' \
  'trl==0.15.2' \
  'bitsandbytes==0.45.3' \
  sentencepiece protobuf packaging
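Once the install finishes, one way to confirm the pins were honored is to compare installed versions against the list above. This is a sketch rather than part of the tested recipe: `find_mismatches` and `installed_versions` are hypothetical helpers, and the comparison assumes that local build tags such as `+cu124` can be ignored.

```python
from importlib import metadata

# Pins copied from the install command in this guide.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.49.0",
    "datasets": "3.3.2",
    "accelerate": "1.4.0",
    "peft": "0.14.0",
    "trl": "0.15.2",
    "bitsandbytes": "0.45.3",
}


def find_mismatches(installed: dict, pinned: dict = PINNED) -> list:
    """Return (package, expected, found) for every pin that is not satisfied."""
    problems = []
    for name, expected in pinned.items():
        found = installed.get(name)
        # Assumption: a local build tag like '+cu124' does not count as a mismatch.
        if found is None or found.split("+")[0] != expected:
            problems.append((name, expected, found))
    return problems


def installed_versions(names) -> dict:
    """Look up installed versions; packages that are missing are left out."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return versions


if __name__ == "__main__":
    for name, expected, found in find_mismatches(installed_versions(PINNED)):
        print(f"{name}: expected {expected}, found {found}")
```

Run it inside the activated virtual environment. No output means every pin matched.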
Create Environment Script
Set up environment activation and configuration:
cat > ~/lora-training/env.sh <<'ENV'
source ~/lora-training-env/bin/activate
export TOKENIZERS_PARALLELISM=false
ENV
source ~/lora-training/env.sh
Verify Installation
Test that all components are working correctly:
python -c "
import torch, transformers, peft, trl, bitsandbytes
print('torch', torch.__version__, 'cuda', torch.cuda.is_available(), torch.cuda.get_device_name(0))
print('transformers', transformers.__version__)
print('peft', peft.__version__)
print('trl', trl.__version__)
print('bitsandbytes', bitsandbytes.__version__)
"
Create Training Script
Create the QLoRA training script with proper configuration:
cat > ~/lora-training/scripts/smoke_lora_train.py <<'PY'
import os
from pathlib import Path
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
base_model = os.environ.get('BASE_MODEL', 'Qwen/Qwen2.5-0.5B-Instruct')
output_dir = Path(os.environ.get('OUTPUT_DIR', '~/lora-training/outputs/smoke')).expanduser()
max_seq_length = int(os.environ.get('MAX_SEQ_LENGTH', '512'))
max_steps = int(os.environ.get('SMOKE_MAX_STEPS', '10'))
output_dir.mkdir(parents=True, exist_ok=True)
# Create synthetic training data
texts = [
    '### Instruction: Say hello in one short sentence.\n### Response: Hello from a tiny LoRA smoke test.',
    '### Instruction: Name one GPU vendor.\n### Response: NVIDIA.',
    '### Instruction: What is QLoRA used for?\n### Response: Efficient fine tuning of language models.',
    '### Instruction: Reply with the word adapter.\n### Response: adapter',
] * 16
raw_dataset = Dataset.from_dict({'text': texts})
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Tokenize dataset
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=max_seq_length)
train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=['text'])
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=True,
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map='auto',
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
)
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)
# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Configure training
training_args = TrainingArguments(
    output_dir=str(output_dir),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    max_steps=max_steps,
    logging_steps=1,
    save_steps=max_steps,
    save_total_limit=1,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    optim='paged_adamw_8bit',
    report_to=[],
    remove_unused_columns=False,
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collator,
)
# Run training
trainer.train()
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# Verify output
assert (output_dir / 'adapter_config.json').exists()
print('adapter_config', output_dir / 'adapter_config.json')
print('LORA_SMOKE_OK')
PY
Run Training
Execute the smoke test to validate your setup:
source ~/lora-training/env.sh
rm -rf ~/lora-training/outputs/smoke
export BASE_MODEL="Qwen/Qwen2.5-0.5B-Instruct"
export OUTPUT_DIR="$HOME/lora-training/outputs/smoke"
export MAX_SEQ_LENGTH="512"
export SMOKE_MAX_STEPS="10"
python ~/lora-training/scripts/smoke_lora_train.py
The training should complete in 2-3 minutes and print LORA_SMOKE_OK.
Verify Adapter Files
Check that the LoRA adapter was created successfully:
ls -lah ~/lora-training/outputs/smoke
test -f ~/lora-training/outputs/smoke/adapter_config.json && echo "Adapter created successfully"
You should see files like adapter_config.json and adapter_model.safetensors.
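Beyond checking that the files exist, you can read adapter_config.json and confirm it reflects the LoraConfig from the training script. The sketch below assumes the file stores `r`, `lora_alpha`, and `target_modules` under those keys, which is how PEFT typically writes it. `summarize_adapter` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_adapter(output_dir) -> dict:
    """Read adapter_config.json and return the main LoRA settings it records."""
    config_path = Path(output_dir).expanduser() / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return {
        # .get() is used because the exact key set is an assumption.
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(config.get("target_modules") or []),
    }


if __name__ == "__main__":
    smoke_dir = Path("~/lora-training/outputs/smoke").expanduser()
    if (smoke_dir / "adapter_config.json").exists():
        print(summarize_adapter(smoke_dir))
```

For the smoke test, the summary should show r=8 and lora_alpha=16, matching the values set in the script.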
Troubleshooting
CUDA Out of Memory
If you hit OOM errors, try these adjustments:
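- Reduce MAX_SEQ_LENGTH from 512 to 256
- Run nvidia-smi to check whether another process is holding GPU memory. Lowering SMOKE_MAX_STEPS shortens the run but does not reduce peak memory
- Use a smaller base model. If you switch model families, update target_modules in the script to match the new model's layer names. For example, microsoft/DialoGPT-small is GPT-2 based and does not use the q_proj/k_proj/v_proj names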
Gated Model Access
If using gated models, set up authentication:
pip install huggingface_hub
huggingface-cli login
TRL API Changes
This guide uses explicit tokenization with Transformers Trainer instead of TRL's SFTTrainer to avoid API drift issues in trl==0.15.2.
Mixed Precision Issues
The script automatically detects bf16 support. If you encounter gradient scaler errors, ensure bf16 is enabled on supported hardware.
Skip All of This: Deploy with an AI Agent
This entire guide exists as a tested, machine-readable recipe in the Massed Compute MCP. Instead of running each step manually, you can have an AI agent provision the right VM shape and execute the complete setup for you.
Add this server config to your MCP client:
{
  "mcpServers": {
    "massed-compute": {
      "type": "http",
      "url": "https://vm.massedcompute.com/api/mcp",
      "headers": { "Authorization": "Bearer MC_TOKEN" }
    }
  }
}
Then say:
The agent will match your request against the recipe catalog, provision the right VM shape, install the complete training stack, run the verification steps above, and report back with connection details and confirmation that the LoRA adapter was created successfully. The process stops immediately if any step fails, so you get reliable deployment or clear error details.
Recipe last tested: June 10, 2026
Quick Setup Reference
For experienced users, here's the complete setup in one script:
#!/bin/bash
set -euxo pipefail
# Install system deps
sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y python3-venv python3-pip git
# Create environment
python3 -m venv ~/lora-env && source ~/lora-env/bin/activate
# Install training stack
pip install --upgrade pip
pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
  torch==2.6.0 transformers==4.49.0 datasets==3.3.2 \
  accelerate==1.4.0 peft==0.14.0 trl==0.15.2 bitsandbytes==0.45.3
# Verify installation
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
echo "LoRA training environment ready"
Frequently Asked Questions
01 What's the difference between LoRA and QLoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small adapter layers to a frozen base model. QLoRA extends LoRA by using 4-bit quantization to reduce memory usage, allowing you to train larger models on smaller GPUs while maintaining quality.
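To see why the adapters are small, count the parameters. For one linear layer of shape d_in x d_out, LoRA trains two low-rank matrices with r * (d_in + d_out) entries in total, while the original d_in * d_out weights stay frozen. The 4096 x 4096 layer below is a hypothetical example, not a dimension taken from the smoke-test model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    A has shape r x d_in and B has shape d_out x r, so the total is
    r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


# Hypothetical 4096 x 4096 projection with the r=8 used in this guide's script.
added = lora_params(4096, 4096, r=8)
frozen = full_params(4096, 4096)
print(added, frozen, f"{added / frozen:.2%}")
```

In this example the adapter adds 65,536 trainable parameters against 16,777,216 frozen ones, or roughly 0.4% of the layer.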
02 How much GPU memory do I need for different model sizes?
With QLoRA 4-bit quantization: 7B models need ~12GB, 13B models need ~20GB, and 30B+ models need 40GB+. The smoke test uses a 0.5B model that runs comfortably in 8GB. Always account for batch size and sequence length in your memory calculations.
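As a rough cross-check on these figures, the memory for the weights alone follows from parameter count times bits per parameter. This is a back-of-the-envelope sketch with a hypothetical helper. It leaves out activations, adapter gradients, optimizer state, quantization constants, and the CUDA context, which is why the totals quoted above are higher than the numbers it prints.

```python
def weight_memory_gib(params_billions: float, bits_per_param: float) -> float:
    """Memory needed to hold the model weights only, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


for size in (0.5, 7, 13, 30):
    four_bit = weight_memory_gib(size, 4)
    sixteen_bit = weight_memory_gib(size, 16)
    print(f"{size}B params: ~{four_bit:.1f} GiB at 4-bit, ~{sixteen_bit:.1f} GiB at 16-bit")
```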
03 Can I use custom datasets instead of the smoke test data?
Yes, replace the synthetic texts list in the training script with your own instruction-response pairs. For larger datasets, load from files using Hugging Face datasets library. Format your data as instruction-following pairs for best results with chat models.
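One way to prepare your own data is to render each pair in the same layout as the smoke-test strings, then pass the resulting list to Dataset.from_dict({'text': texts}) as the script already does. The sketch below assumes a JSONL file whose records have `instruction` and `response` fields. Those field names and the helper names are illustrative.

```python
import json

# Same layout as the synthetic examples in the training script.
TEMPLATE = "### Instruction: {instruction}\n### Response: {response}"


def format_example(instruction: str, response: str) -> str:
    """Render one instruction-response pair as a single training string."""
    return TEMPLATE.format(instruction=instruction.strip(), response=response.strip())


def load_texts(jsonl_path: str) -> list:
    """Read a JSONL file of {"instruction": ..., "response": ...} records."""
    texts = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                texts.append(format_example(record["instruction"], record["response"]))
    return texts


print(format_example("Name one GPU vendor.", "NVIDIA."))
```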
04 How do I use the trained adapter for inference?
Load the base model and adapter using PEFT: model = AutoModelForCausalLM.from_pretrained(base_model); model = PeftModel.from_pretrained(model, adapter_path). The adapter files are saved in the output directory specified during training.
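A fuller version of that snippet might look like the sketch below. It assumes the adapter directory also holds the tokenizer (the training script saves it there), that the GPU supports bfloat16, and that 32 new tokens is enough for a quick check. `generate_with_adapter` is a hypothetical helper name.

```python
def generate_with_adapter(base_model: str, adapter_path: str, prompt: str) -> str:
    """Load the base model, attach the LoRA adapter, and generate a reply."""
    # Imported inside the function so defining it does not require the
    # training stack to be importable.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,  # assumption: bf16-capable GPU
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter_path)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (run inside the activated environment on the GPU VM):
# generate_with_adapter(
#     "Qwen/Qwen2.5-0.5B-Instruct",
#     "/home/ubuntu/lora-training/outputs/smoke",
#     "### Instruction: Name one GPU vendor.\n### Response:",
# )
```

After ten steps on synthetic data the output will not differ meaningfully from the base model. The call mainly confirms that the adapter loads.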
05 What if I need to train on multiple GPUs?
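Use Accelerate for multi-GPU training: start the script with accelerate launch instead of python. For data-parallel training, load a full copy of the model onto each process's own GPU rather than using device_map='auto', which spreads a single copy across all visible GPUs. The L40 dual-GPU SKUs work well for distributed training of larger models.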
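For data-parallel runs, a common pattern is to pin the whole model to the GPU that belongs to the current process. The sketch below assumes the launcher exports a LOCAL_RANK environment variable for each worker, as torchrun-style launchers generally do. `per_process_device_map` is a hypothetical helper, and its result would replace device_map='auto' in the from_pretrained call.

```python
import os


def per_process_device_map(environ=os.environ) -> dict:
    """Map the entire model onto this process's GPU.

    Assumption: the launcher sets LOCAL_RANK per worker. A plain `python`
    run has no LOCAL_RANK and falls back to GPU 0.
    """
    local_rank = int(environ.get("LOCAL_RANK", "0"))
    # The empty-string key means "the whole model".
    return {"": local_rank}


print(per_process_device_map({"LOCAL_RANK": "1"}))
print(per_process_device_map({}))
```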











