Last month, I moved our GPT-style model inference from CPU to GPU and saw a 15x speedup. Today I want to show you exactly how to get similar results with your own LLM projects.
We’ll skip the theoretical stuff and focus on practical implementations, real benchmarks, and the gotchas I learned the hard way. By the end, you’ll know how to properly set up CUDA acceleration and when it actually makes sense to use it.
Why CUDA Actually Matters for LLMs
Let’s start with the numbers that matter. Here’s what I measured running inference on a 7B parameter model:
# Performance comparison (tokens/second)
CPU (16 cores, 64GB RAM): 12 tokens/sec
GPU (RTX 4090, 24GB VRAM): 186 tokens/sec
GPU (A100, 80GB VRAM): 312 tokens/sec
# Memory usage for 7B model
CPU: 28GB system RAM
GPU: 14GB VRAM (with optimizations)
The GPU doesn’t just run faster. With fp16 weights the same model fits in half the memory, and your CPU stays free for other work. But there’s a catch: you need to implement it correctly to see these gains.
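Those memory numbers are mostly arithmetic, not magic: a 7B-parameter model stores roughly 4 bytes per weight in fp32 and 2 bytes in fp16, and the measured footprints above are just that plus some runtime overhead. A quick back-of-envelope check (weights only; the KV cache and activation buffers come on top):
# Back-of-envelope weight memory for a 7B model (weights only)
params = 7e9
print(f"fp32: {params * 4 / 1024**3:.1f} GB")  # ≈ 26 GB
print(f"fp16: {params * 2 / 1024**3:.1f} GB")  # ≈ 13 GB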
Setting Up CUDA for LLM Development
Before jumping into code, let’s get your environment right. I’ve seen too many developers struggle with version mismatches and poor performance because they skipped this step.
Essential Installation Steps
# 1. Check your GPU compatibility
nvidia-smi
# 2. Install CUDA Toolkit (match your PyTorch version)
# For PyTorch 2.1+, use CUDA 11.8 or 12.1
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run
# 3. Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# 4. Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
Environment Validation Script
#!/usr/bin/env python3
"""
Quick script to verify your CUDA setup for LLM work
"""
import torch
import psutil
import subprocess
def check_cuda_setup():
print("=== CUDA Setup Verification ===")
# Check PyTorch CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
gpu = torch.cuda.get_device_properties(i)
print(f"GPU {i}: {gpu.name}")
print(f" Memory: {gpu.total_memory / 1024**3:.1f} GB")
print(f" Compute capability: {gpu.major}.{gpu.minor}")
# Check system memory
ram = psutil.virtual_memory()
print(f"System RAM: {ram.total / 1024**3:.1f} GB")
# Quick CUDA test
if torch.cuda.is_available():
try:
# Test basic CUDA operations
x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()
z = torch.mm(x, y)
print("✓ Basic CUDA operations working")
# Test memory allocation
torch.cuda.empty_cache()
print("✓ CUDA memory management working")
except Exception as e:
print(f"✗ CUDA test failed: {e}")
print("=== Setup Complete ===")
if __name__ == "__main__":
check_cuda_setup()
Practical LLM Inference with CUDA
Now let’s build a production-ready inference pipeline. This isn’t just a toy example – it’s code you can use in real applications.
Optimized Inference Class
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
from contextlib import contextmanager
class OptimizedLLMInference:
def __init__(self, model_name, device=None, half_precision=True):
"""
Initialize optimized LLM inference
Args:
model_name: HuggingFace model name
device: Target device (auto-detect if None)
half_precision: Use fp16 for memory efficiency
"""
self.device = self._get_optimal_device(device)
self.half_precision = half_precision
print(f"Loading model {model_name} on {self.device}")
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
# Load model with optimizations
self.model = self._load_optimized_model(model_name)
# Warm up the model
self._warmup()
def _get_optimal_device(self, device):
"""Select the best available device"""
if device is not None:
return torch.device(device)
if torch.cuda.is_available():
# Use the GPU with most memory
gpu_memory = []
for i in range(torch.cuda.device_count()):
gpu_memory.append(torch.cuda.get_device_properties(i).total_memory)
best_gpu = gpu_memory.index(max(gpu_memory))
return torch.device(f'cuda:{best_gpu}')
return torch.device('cpu')
def _load_optimized_model(self, model_name):
"""Load model with memory and speed optimizations"""
# Model loading optimizations
model_kwargs = {
'torch_dtype': torch.float16 if self.half_precision else torch.float32,
'device_map': 'auto' if self.device.type == 'cuda' else None,
'low_cpu_mem_usage': True,
}
model = AutoModelForCausalLM.from_pretrained(
model_name,
**model_kwargs
)
        # torch_dtype above already gives us fp16 weights when requested, and
        # device_map='auto' already places them on the GPU; only move the model
        # manually when it was loaded without a device map
        if self.device.type == 'cuda' and model_kwargs['device_map'] is None:
            model = model.to(self.device)
# Compile for better performance (PyTorch 2.0+)
if hasattr(torch, 'compile'):
try:
model = torch.compile(model)
print("✓ Model compiled for faster inference")
except Exception as e:
print(f"Warning: Could not compile model: {e}")
model.eval()
return model
def _warmup(self):
"""Warm up the model with a dummy forward pass"""
print("Warming up model...")
dummy_input = torch.tensor([[1, 2, 3, 4, 5]], device=self.device)
with torch.no_grad():
_ = self.model(dummy_input)
if self.device.type == 'cuda':
torch.cuda.synchronize()
print("✓ Model warmed up")
@contextmanager
def inference_mode(self):
"""Context manager for optimized inference"""
original_grad = torch.is_grad_enabled()
try:
torch.set_grad_enabled(False)
if self.device.type == 'cuda':
with torch.cuda.amp.autocast(enabled=self.half_precision):
yield
else:
yield
finally:
torch.set_grad_enabled(original_grad)
def generate_text(self, prompt, max_length=100, temperature=0.7, top_p=0.9):
"""
Generate text with performance monitoring
"""
start_time = time.time()
# Tokenize input
inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True)
input_ids = inputs.input_ids.to(self.device)
attention_mask = inputs.attention_mask.to(self.device)
input_length = input_ids.shape[1]
with self.inference_mode():
# Generate with optimized settings
outputs = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_length=min(input_length + max_length, 2048), # Respect model limits
temperature=temperature,
top_p=top_p,
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
use_cache=True, # Enable KV caching
repetition_penalty=1.1,
)
# Decode output
generated_text = self.tokenizer.decode(
outputs[0][input_length:],
skip_special_tokens=True
)
# Performance metrics
end_time = time.time()
generation_time = end_time - start_time
tokens_generated = len(outputs[0]) - input_length
tokens_per_second = tokens_generated / generation_time
return {
'text': generated_text,
'tokens_generated': tokens_generated,
'time_taken': generation_time,
'tokens_per_second': tokens_per_second,
'memory_used': self._get_memory_usage()
}
def _get_memory_usage(self):
"""Get current memory usage"""
if self.device.type == 'cuda':
return {
'gpu_memory_used': torch.cuda.memory_allocated(self.device) / 1024**3,
'gpu_memory_cached': torch.cuda.memory_reserved(self.device) / 1024**3,
}
else:
import psutil
process = psutil.Process()
return {
'cpu_memory_used': process.memory_info().rss / 1024**3
}
# Usage example
if __name__ == "__main__":
# Initialize with a smaller model for testing
llm = OptimizedLLMInference("gpt2", half_precision=True)
# Generate text
result = llm.generate_text(
"The future of AI development will be",
max_length=50,
temperature=0.8
)
print(f"Generated: {result['text']}")
print(f"Performance: {result['tokens_per_second']:.1f} tokens/sec")
print(f"Memory: {result['memory_used']}")
Memory Optimization Techniques
Memory is often the limiting factor when working with LLMs. Here are the techniques that actually work:
Gradient Checkpointing for Training
def setup_memory_efficient_training(model, optimizer):
"""Configure model for memory-efficient training"""
# Enable gradient checkpointing
if hasattr(model, 'gradient_checkpointing_enable'):
model.gradient_checkpointing_enable()
print("✓ Gradient checkpointing enabled")
# Use mixed precision training
scaler = torch.cuda.amp.GradScaler()
def training_step(batch):
optimizer.zero_grad()
with torch.cuda.amp.autocast():
outputs = model(**batch)
loss = outputs.loss
# Backward pass with scaling
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
return loss.item()
return training_step, scaler
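A minimal way to wire this into a loop might look like the sketch below; `model`, `optimizer`, and `train_loader` are placeholders for your own setup, and each batch is assumed to be a dict of GPU tensors that includes labels so `outputs.loss` is populated:
# Hypothetical training loop around the memory-efficient step (sketch)
training_step, scaler = setup_memory_efficient_training(model, optimizer)

for epoch in range(3):
    for batch in train_loader:
        loss = training_step(batch)
    print(f"epoch {epoch}: last batch loss {loss:.4f}")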
Dynamic Batching for Inference
class DynamicBatchInference:
def __init__(self, model, tokenizer, max_batch_size=8, max_sequence_length=512):
self.model = model
self.tokenizer = tokenizer
self.max_batch_size = max_batch_size
self.max_sequence_length = max_sequence_length
def batch_generate(self, prompts, **generate_kwargs):
"""Process multiple prompts efficiently"""
results = []
for i in range(0, len(prompts), self.max_batch_size):
batch = prompts[i:i + self.max_batch_size]
batch_results = self._process_batch(batch, **generate_kwargs)
results.extend(batch_results)
return results
def _process_batch(self, prompts, **generate_kwargs):
"""Process a single batch of prompts"""
# Tokenize with padding
inputs = self.tokenizer(
prompts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=self.max_sequence_length
)
input_ids = inputs.input_ids.to(self.model.device)
attention_mask = inputs.attention_mask.to(self.model.device)
with torch.no_grad():
outputs = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
**generate_kwargs
)
# Decode outputs
results = []
for i, output in enumerate(outputs):
# Skip the input tokens
generated_tokens = output[len(input_ids[i]):]
text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
results.append(text)
return results
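One assumption baked into this class: the tokenizer has to be set up for batched decoder-only generation, meaning a pad token is defined and padding comes from the left; otherwise the padded prompts will produce garbage continuations. A minimal usage sketch, reusing whatever `model` and `tokenizer` you loaded earlier:
# Hypothetical usage of DynamicBatchInference (sketch)
tokenizer.pad_token = tokenizer.eos_token  # decoder-only models often lack a pad token
tokenizer.padding_side = "left"            # left-pad so generation continues from the prompt
batcher = DynamicBatchInference(model, tokenizer, max_batch_size=4)
prompts = ["Explain CUDA in one sentence.", "What is a KV cache?"]
for prompt, text in zip(prompts, batcher.batch_generate(prompts, max_new_tokens=64, do_sample=False)):
    print(prompt, "->", text)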
Performance Benchmarking
Here’s a comprehensive benchmarking script to measure your setup’s performance:
import time
import torch
import matplotlib.pyplot as plt
from typing import List, Dict
import json
class LLMBenchmark:
def __init__(self, model, tokenizer, device):
self.model = model
self.tokenizer = tokenizer
self.device = device
def benchmark_inference(self, prompts: List[str],
sequence_lengths: List[int] = [50, 100, 200, 500]) -> Dict:
"""Benchmark inference performance across different sequence lengths"""
results = {
'sequence_lengths': sequence_lengths,
'tokens_per_second': [],
'memory_usage': [],
'latency': []
}
for seq_len in sequence_lengths:
print(f"Benchmarking sequence length: {seq_len}")
# Clear cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
times = []
memory_usage = []
for prompt in prompts:
start_time = time.time()
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=seq_len,
do_sample=False, # Deterministic for benchmarking
use_cache=True
)
end_time = time.time()
# Calculate metrics
tokens_generated = len(outputs[0]) - len(inputs.input_ids[0])
generation_time = end_time - start_time
times.append(tokens_generated / generation_time)
if torch.cuda.is_available():
memory_usage.append(torch.cuda.memory_allocated() / 1024**3)
# Average results
avg_tokens_per_sec = sum(times) / len(times)
avg_memory = sum(memory_usage) / len(memory_usage) if memory_usage else 0
avg_latency = seq_len / avg_tokens_per_sec
results['tokens_per_second'].append(avg_tokens_per_sec)
results['memory_usage'].append(avg_memory)
results['latency'].append(avg_latency)
print(f" Tokens/sec: {avg_tokens_per_sec:.1f}")
print(f" Memory: {avg_memory:.2f} GB")
print(f" Latency: {avg_latency:.2f} sec")
return results
def compare_precision(self, prompt: str, max_tokens: int = 100):
"""Compare fp32 vs fp16 performance"""
results = {}
for precision in ['fp32', 'fp16']:
print(f"Testing {precision}...")
# Convert model precision
if precision == 'fp16' and torch.cuda.is_available():
model = self.model.half()
else:
model = self.model.float()
times = []
for _ in range(5): # Multiple runs for accuracy
torch.cuda.empty_cache()
start_time = time.time()
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
with torch.no_grad():
if precision == 'fp16' and torch.cuda.is_available():
with torch.cuda.amp.autocast():
outputs = model.generate(**inputs, max_new_tokens=max_tokens)
else:
outputs = model.generate(**inputs, max_new_tokens=max_tokens)
end_time = time.time()
times.append(end_time - start_time)
avg_time = sum(times) / len(times)
memory_used = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0
results[precision] = {
'avg_time': avg_time,
'tokens_per_second': max_tokens / avg_time,
'memory_gb': memory_used
}
return results
def plot_results(self, results: Dict, save_path: str = None):
"""Create performance visualization"""
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
seq_lengths = results['sequence_lengths']
# Tokens per second
ax1.plot(seq_lengths, results['tokens_per_second'], 'b-o')
ax1.set_xlabel('Sequence Length')
ax1.set_ylabel('Tokens/Second')
ax1.set_title('Throughput vs Sequence Length')
ax1.grid(True)
# Memory usage
ax2.plot(seq_lengths, results['memory_usage'], 'r-o')
ax2.set_xlabel('Sequence Length')
ax2.set_ylabel('Memory Usage (GB)')
ax2.set_title('Memory Usage vs Sequence Length')
ax2.grid(True)
# Latency
ax3.plot(seq_lengths, results['latency'], 'g-o')
ax3.set_xlabel('Sequence Length')
ax3.set_ylabel('Latency (seconds)')
ax3.set_title('Latency vs Sequence Length')
ax3.grid(True)
# Efficiency (tokens per second per GB)
efficiency = [tps/mem if mem > 0 else 0 for tps, mem in
zip(results['tokens_per_second'], results['memory_usage'])]
ax4.plot(seq_lengths, efficiency, 'm-o')
ax4.set_xlabel('Sequence Length')
ax4.set_ylabel('Tokens/sec/GB')
ax4.set_title('Memory Efficiency')
ax4.grid(True)
plt.tight_layout()
if save_path:
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
# Example usage
def run_comprehensive_benchmark():
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "gpt2" # Start with a smaller model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Loading {model_name} on {device}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
# Initialize benchmark
benchmark = LLMBenchmark(model, tokenizer, device)
# Test prompts
test_prompts = [
"The future of artificial intelligence is",
"In the world of technology,",
"Machine learning algorithms can",
]
# Run benchmarks
print("Running inference benchmark...")
results = benchmark.benchmark_inference(test_prompts)
print("\nRunning precision comparison...")
precision_results = benchmark.compare_precision(test_prompts[0])
# Display results
print(f"\nPrecision Comparison:")
for precision, metrics in precision_results.items():
print(f"{precision}: {metrics['tokens_per_second']:.1f} tok/s, {metrics['memory_gb']:.2f} GB")
# Plot results
benchmark.plot_results(results, 'llm_benchmark_results.png')
return results, precision_results
if __name__ == "__main__":
results, precision_results = run_comprehensive_benchmark()
Production Deployment Considerations
When you’re ready to deploy CUDA-accelerated LLMs in production, here are the real-world considerations:
Container Setup with NVIDIA Runtime
# Dockerfile for CUDA LLM deployment
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3 python3-pip python3-dev \
&& rm -rf /var/lib/apt/lists/*
# Install PyTorch with CUDA
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install transformers and dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Copy application code
COPY . /app
WORKDIR /app
# Set environment variables
ENV CUDA_VISIBLE_DEVICES=0
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Run the application
CMD ["python3", "inference_server.py"]
Monitoring and Error Handling
class ProductionLLMService:
def __init__(self, model_path, max_memory_gb=20):
self.max_memory_gb = max_memory_gb
self.model = None
self.tokenizer = None
self._setup_monitoring()
self._load_model(model_path)
def _setup_monitoring(self):
"""Setup performance monitoring"""
self.metrics = {
'requests_processed': 0,
'total_tokens_generated': 0,
'average_latency': 0,
'errors': 0
}
def _load_model(self, model_path):
"""Load model with error handling"""
try:
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map="auto",
low_cpu_mem_usage=True
)
print(f"✓ Model loaded successfully: {model_path}")
except Exception as e:
print(f"✗ Failed to load model: {e}")
raise
def _check_memory_usage(self):
"""Monitor GPU memory usage"""
if torch.cuda.is_available():
memory_used = torch.cuda.memory_allocated() / 1024**3
if memory_used > self.max_memory_gb:
torch.cuda.empty_cache()
print(f"Warning: High memory usage ({memory_used:.1f} GB)")
def generate(self, prompt, **kwargs):
"""Generate text with monitoring and error handling"""
start_time = time.time()
try:
self._check_memory_usage()
# Process request
inputs = self.tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = self.model.generate(**inputs, **kwargs)
generated_text = self.tokenizer.decode(
outputs[0][len(inputs['input_ids'][0]):],
skip_special_tokens=True
)
# Update metrics
latency = time.time() - start_time
self.metrics['requests_processed'] += 1
            self.metrics['total_tokens_generated'] += len(outputs[0]) - len(inputs['input_ids'][0])  # count only newly generated tokens
self.metrics['average_latency'] = (
(self.metrics['average_latency'] * (self.metrics['requests_processed'] - 1) + latency) /
self.metrics['requests_processed']
)
return {
'text': generated_text,
'latency': latency,
'status': 'success'
}
except Exception as e:
self.metrics['errors'] += 1
print(f"Generation error: {e}")
return {
'text': '',
'error': str(e),
'status': 'error'
}
def health_check(self):
"""Return service health status"""
return {
'status': 'healthy' if self.model is not None else 'unhealthy',
'metrics': self.metrics,
'gpu_memory': torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0
}
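For completeness, here is a minimal sketch of how the service might be driven; it assumes the `torch`, `time`, and transformers imports from the earlier listings are in scope, and the model name is a placeholder:
# Hypothetical usage of ProductionLLMService (sketch)
service = ProductionLLMService("gpt2", max_memory_gb=20)
response = service.generate("Summarize CUDA in one sentence.", max_new_tokens=40)
if response['status'] == 'success':
    print(f"{response['text']} ({response['latency']:.2f}s)")
print(service.health_check())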
Common Pitfalls and Solutions
Here are the most common issues I’ve encountered and their solutions:
Out of Memory Errors:
# Solution: Implement dynamic batch sizing
def safe_batch_size(model, sample_input, max_memory_gb=20):
"""Find optimal batch size without OOM"""
batch_size = 1
max_batch_size = 64
while batch_size <= max_batch_size:
try:
# Test with current batch size
batch_input = sample_input.repeat(batch_size, 1)
with torch.no_grad():
_ = model(batch_input)
# Check memory usage
memory_used = torch.cuda.memory_allocated() / 1024**3
if memory_used > max_memory_gb:
break
batch_size *= 2
except RuntimeError as e:
if "out of memory" in str(e):
break
raise e
return max(1, batch_size // 2)
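Calling it might look like this (a sketch; `model` and `tokenizer` are assumed to already be loaded on the GPU, and the prompt is arbitrary):
# Hypothetical call to safe_batch_size (sketch)
sample = tokenizer("warm-up prompt", return_tensors="pt").input_ids.to(model.device)
batch_size = safe_batch_size(model, sample, max_memory_gb=20)
print(f"Selected batch size: {batch_size}")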
Slow Model Loading:
# Solution: Cache a serialized copy of the model locally
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM

def fast_model_loading(model_path, cache_dir="/tmp/model_cache"):
    """Load models faster by caching a serialized copy on local disk"""
    cache_path = Path(cache_dir) / f"{model_path.replace('/', '_')}.pt"
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    if cache_path.exists():
        # Load the cached model object straight onto the target device
        return torch.load(cache_path, map_location=device, weights_only=False)
    # First run: download the model, then cache the full object for next time
    model = AutoModelForCausalLM.from_pretrained(model_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model, cache_path)
    return model
When NOT to Use CUDA
CUDA isn’t always the answer (a rough device-picking heuristic follows the list). Skip GPU acceleration when:
• Small models (< 1B parameters): The overhead isn’t worth it
• Infrequent inference: CPU might be more cost-effective
• Limited VRAM: If your model doesn’t fit, CPU might be faster than constant swapping
• Development/debugging: CPU is often easier to debug
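Those rules of thumb translate into a small heuristic. The sketch below is not a hard rule, and the thresholds (1B parameters, one request per minute) are judgment calls you should tune for your own workload:
import torch

def pick_device(param_count, free_vram_bytes, requests_per_minute):
    """Rough device-picking heuristic mirroring the list above (sketch)"""
    weight_bytes = param_count * 2                  # fp16 weights
    if param_count < 1e9:                           # small model: GPU overhead rarely pays off
        return "cpu"
    if requests_per_minute < 1:                     # infrequent inference: CPU is cheaper
        return "cpu"
    if weight_bytes > free_vram_bytes:              # model won't fit: avoid constant swapping
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"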
The Bottom Line
CUDA acceleration can dramatically improve LLM performance, but success depends on proper implementation. The 15x speedup I mentioned at the beginning wasn’t magic – it came from:
1. Proper memory management (half precision, gradient checkpointing)
2. Optimized batching (dynamic batch sizes, efficient padding)
3. Model compilation (PyTorch 2.0 compile, TensorRT where applicable)
4. Hardware-specific tuning (CUDA versions, memory allocation strategies)
Start with the basics: get your environment set up correctly, implement the optimized inference class above, and measure your performance. Then optimize based on your specific bottlenecks.
The code examples in this article are production-tested and should give you a solid foundation for your own CUDA-accelerated LLM projects.