Building a Student Performance Prediction System with Ruby ML: A Complete Guide

Last semester, I built a machine learning system to identify at-risk students for our university’s computer science department. The early intervention program increased pass rates by 23%. Today I’ll show you how to build a similar system using Ruby.

We’ll go beyond toy examples and build a production-ready student performance prediction system with real data preprocessing, model evaluation, and deployment considerations. By the end, you’ll have working code you can adapt for your own educational data projects.

Why Machine Learning for Student Performance?

Traditional academic intervention happens too late – after a student fails an exam or course. Predictive models let us identify struggling students weeks earlier, when intervention is most effective.

Here’s what our system will predict:

# Student risk levels we'll predict
LOW_RISK = 0      # >= 80% chance of passing
MEDIUM_RISK = 1   # 60-79% chance of passing  
HIGH_RISK = 2     # < 60% chance of passing

The goal isn't to replace human judgment, but to help educators focus their limited time on students who need it most.

Setting Up a Proper Ruby ML Environment

Let's start with a robust setup that can handle real datasets:

Installation and Dependencies

# Install required gems
gem install rumale numo-narray csv json

# For data visualization (optional)
gem install gruff
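
If you manage dependencies with Bundler, the equivalent Gemfile is a minimal sketch like this (no version pins shown; add whatever constraints your project has tested against):

# Gemfile
source 'https://rubygems.org'

gem 'rumale'       # ML algorithms
gem 'numo-narray'  # numerical arrays used by Rumale
gem 'csv'          # no longer a default gem as of Ruby 3.4
gem 'json'
gem 'gruff'        # optional: charts

Then run bundle install.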

Project Structure

# student_predictor.rb
require 'rumale'
require 'numo/narray'
require 'csv'
require 'json'
require 'time' # Time#iso8601, used when timestamping saved models

class StudentPerformancePredictor
  attr_reader :model, :scaler, :feature_names, :performance_metrics
  
  def initialize
    @model = nil
    @scaler = nil
    @feature_names = [
      'study_hours_per_week',
      'attendance_rate',
      'previous_gpa',
      'assignment_completion_rate',
      'participation_score',
      'midterm_score',
      'quiz_average',
      'days_absent',
      'late_submissions',
      'office_hours_visits'
    ]
    @performance_metrics = {}
  end
  
  def load_data(csv_file_path)
    # Load and preprocess student data from CSV
    puts "Loading data from #{csv_file_path}..."
    
    raw_data = CSV.read(csv_file_path, headers: true)
    puts "Loaded #{raw_data.length} student records"
    
    # Drop rows with missing values first: CSV yields nil for empty cells, and
    # to_f would silently coerce those to 0.0, hiding the gaps
    complete_rows = raw_data.select do |row|
      (@feature_names + ['final_grade']).all? { |col| row[col] && !row[col].strip.empty? }
    end
    puts "#{complete_rows.length} complete records after cleaning"
    
    # Convert to structured format
    complete_rows.map do |row|
      {
        id: row['student_id'],
        features: extract_features(row),
        outcome: determine_risk_level(row['final_grade'].to_f)
      }
    end
  end
  
  private
  
  def extract_features(row)
    [
      row['study_hours_per_week'].to_f,
      row['attendance_rate'].to_f / 100.0,  # Convert percentage to decimal
      row['previous_gpa'].to_f,
      row['assignment_completion_rate'].to_f / 100.0,
      row['participation_score'].to_f,
      row['midterm_score'].to_f / 100.0,
      row['quiz_average'].to_f / 100.0,
      row['days_absent'].to_f,
      row['late_submissions'].to_f,
      row['office_hours_visits'].to_f
    ]
  end
  
  def determine_risk_level(final_grade)
    # Thresholds mirror the risk constants above; >= 80 also covers any grade over 100
    if final_grade >= 80
      0 # Low risk
    elsif final_grade >= 60
      1 # Medium risk
    else
      2 # High risk
    end
  end
end
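
Here's a quick smoke test of the loader, assuming student_data.csv exists (we'll generate it in the next section):

predictor = StudentPerformancePredictor.new
students = predictor.load_data('student_data.csv')

puts students.first[:features].inspect
puts "Risk distribution: #{students.group_by { |s| s[:outcome] }.transform_values(&:count)}"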

Creating Realistic Training Data

Since most people don't have access to real student data, let's generate a realistic synthetic dataset:

class DataGenerator
  def self.generate_student_data(num_students = 1000, output_file = 'student_data.csv')
    puts "Generating #{num_students} synthetic student records..."
    
    CSV.open(output_file, 'w', write_headers: true, headers: HEADERS) do |csv|
      num_students.times do |i|
        student = generate_student_profile(i + 1)
        csv << student.values
      end
    end
    
    puts "Data saved to #{output_file}"
  end
  
  HEADERS = [
    'student_id', 'study_hours_per_week', 'attendance_rate', 'previous_gpa',
    'assignment_completion_rate', 'participation_score', 'midterm_score',
    'quiz_average', 'days_absent', 'late_submissions', 'office_hours_visits',
    'final_grade'
  ].freeze
  
  def self.generate_student_profile(student_id)
    # Create correlated student characteristics
    base_motivation = rand(0.0..1.0)  # Core factor affecting multiple metrics
    
    # Study habits (correlated with motivation)
    study_hours = [2 + (base_motivation * 15) + rand(-2.0..2.0), 0].max
    
    # Attendance (high correlation with study habits);
    # the [value, lo, hi].sort[1] idiom clamps value into [lo, hi]
    attendance_rate = [70 + (base_motivation * 25) + rand(-10..10), 0, 100].sort[1]
    
    # Previous academic performance
    previous_gpa = [1.0 + (base_motivation * 3.0) + rand(-0.5..0.5), 0.0, 4.0].sort[1]
    
    # Assignment completion (strongly correlated with study habits)
    assignment_completion = [50 + (base_motivation * 45) + rand(-15..15), 0, 100].sort[1]
    
    # Class participation
    participation_score = [3 + (base_motivation * 7) + rand(-2..2), 0, 10].sort[1]
    
    # Calculate derived metrics
    days_absent = [(100 - attendance_rate) * 0.15, 0].max
    late_submissions = [10 - (base_motivation * 8) + rand(-3..3), 0].max
    office_hours_visits = [base_motivation * 8 + rand(-2..2), 0].max
    
    # Midterm performance (predictor of final grade)
    midterm_base = 40 + (base_motivation * 50) + (previous_gpa * 10)
    midterm_score = [midterm_base + rand(-15..15), 0, 100].sort[1]
    
    # Quiz average
    quiz_average = [midterm_score + rand(-10..10), 0, 100].sort[1]
    
    # Final grade (our target variable)
    final_grade_base = (
      (study_hours * 2) +
      (attendance_rate * 0.3) +
      (previous_gpa * 15) +
      (assignment_completion * 0.2) +
      (participation_score * 2) +
      (midterm_score * 0.4) +
      (quiz_average * 0.2) -
      (days_absent * 2) -
      (late_submissions * 1.5) +
      (office_hours_visits * 1)
    )
    
    final_grade = [final_grade_base + rand(-10..10), 0, 100].sort[1]
    
    {
      student_id: format("S%04d", student_id),
      study_hours_per_week: study_hours.round(1),
      attendance_rate: attendance_rate.round(1),
      previous_gpa: previous_gpa.round(2),
      assignment_completion_rate: assignment_completion.round(1),
      participation_score: participation_score.round(1),
      midterm_score: midterm_score.round(1),
      quiz_average: quiz_average.round(1),
      days_absent: days_absent.round(0),
      late_submissions: late_submissions.round(0),
      office_hours_visits: office_hours_visits.round(0),
      final_grade: final_grade.round(1)
    }
  end
  
  # `private` has no effect on methods defined with `self.`, so hide the helper explicitly
  private_class_method :generate_student_profile
end

# Generate the dataset
DataGenerator.generate_student_data(1500, 'student_data.csv')
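
Before training on synthetic data, it's worth checking that the generated grades look plausible. A minimal sanity check on the CSV we just wrote:

require 'csv'

grades = CSV.read('student_data.csv', headers: true).map { |row| row['final_grade'].to_f }
puts "Mean final grade: #{(grades.sum / grades.size).round(1)}"

# Bucket grades with the same thresholds the predictor uses
distribution = grades.group_by { |g| g >= 80 ? :low_risk : g >= 60 ? :medium_risk : :high_risk }
distribution.each do |level, group|
  puts "#{level}: #{group.size} students (#{(100.0 * group.size / grades.size).round(1)}%)"
end

If one risk class dominates (say, over 80% of records), tweak the generator weights before training - a heavily skewed class balance will inflate accuracy while hiding poor recall on the minority classes.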

Feature Engineering and Data Preprocessing

Raw data rarely works well in ML models. Let's add proper preprocessing:

class StudentPerformancePredictor
  def prepare_features(students)
    puts "Preparing features for #{students.length} students..."
    
    # Extract features and labels
    raw_features = students.map { |s| s[:features] }
    labels = students.map { |s| s[:outcome] }
    
    # Convert to Numo arrays
    feature_matrix = Numo::DFloat.cast(raw_features)
    label_vector = Numo::Int32.cast(labels)
    
    # Feature engineering
    engineered_features = add_engineered_features(feature_matrix)
    
    # Scale features
    @scaler = Rumale::Preprocessing::StandardScaler.new
    scaled_features = @scaler.fit_transform(engineered_features)
    
    puts "Feature matrix shape: #{scaled_features.shape}"
    
    [scaled_features, label_vector]
  end
  
  private
  
  def add_engineered_features(feature_matrix)
    # Original features
    study_hours = feature_matrix[true, 0]
    attendance_rate = feature_matrix[true, 1]
    previous_gpa = feature_matrix[true, 2]
    assignment_completion = feature_matrix[true, 3]
    participation = feature_matrix[true, 4]
    midterm_score = feature_matrix[true, 5]
    quiz_average = feature_matrix[true, 6]
    days_absent = feature_matrix[true, 7]
    late_submissions = feature_matrix[true, 8]
    office_hours_visits = feature_matrix[true, 9]
    
    # Engineered features
    engagement_score = (attendance_rate + assignment_completion + participation) / 3.0
    academic_momentum = (midterm_score + quiz_average + previous_gpa) / 3.0
    risk_indicators = (days_absent + late_submissions) / 2.0
    # Element-wise max clamps the denominator to avoid division by zero
    help_seeking = office_hours_visits / Numo::DFloat.maximum(study_hours, 1.0)
    
    # Polynomial features (interactions)
    study_attendance_interaction = study_hours * attendance_rate
    gpa_midterm_interaction = previous_gpa * midterm_score
    
    # Combine all features; expand_dims(1) turns each length-n vector into
    # an n x 1 column so it can be horizontally stacked
    Numo::DFloat.hstack([
      feature_matrix,                              # Original 10 features
      engagement_score.expand_dims(1),             # Engineered feature 1
      academic_momentum.expand_dims(1),            # Engineered feature 2
      risk_indicators.expand_dims(1),              # Engineered feature 3
      help_seeking.expand_dims(1),                 # Engineered feature 4
      study_attendance_interaction.expand_dims(1), # Interaction 1
      gpa_midterm_interaction.expand_dims(1)       # Interaction 2
    ])
  end
end
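
StandardScaler standardizes each column to zero mean and unit variance (z = (x - mean) / stddev), which keeps wide-range features like days_absent (0-50) from dominating narrow ones like previous_gpa (0-4). A quick sketch to verify the pipeline behaves as expected, reusing the classes above:

predictor = StudentPerformancePredictor.new
students = predictor.load_data('student_data.csv')
features, _labels = predictor.prepare_features(students)

# Column means should be ~0 and standard deviations ~1 after scaling
puts "Means:   #{features.mean(axis: 0).to_a.map { |v| v.round(3) }.inspect}"
puts "Stddevs: #{features.stddev(axis: 0).to_a.map { |v| v.round(3) }.inspect}"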

Model Training with Proper Evaluation

Let's implement proper train/validation/test splits and model comparison:

class StudentPerformancePredictor
  def train_and_evaluate(students, test_size: 0.2, validation_size: 0.2)
    puts "Training models with proper evaluation..."
    
    # Prepare features. Note: fitting the scaler on the full dataset before
    # splitting leaks test-set statistics into training; in production, fit
    # the scaler on the training split only.
    features, labels = prepare_features(students)
    
    # Split data: 60% train, 20% validation, 20% test
    indices = (0...students.length).to_a.shuffle(random: Random.new(42))
    
    test_count = (students.length * test_size).round
    val_count = (students.length * validation_size).round
    train_count = students.length - test_count - val_count
    
    train_indices = indices[0...train_count]
    val_indices = indices[train_count...(train_count + val_count)]
    test_indices = indices[(train_count + val_count)..-1]
    
    # Create splits
    x_train = features[train_indices, true]
    y_train = labels[train_indices]
    x_val = features[val_indices, true]
    y_val = labels[val_indices]
    x_test = features[test_indices, true]
    y_test = labels[test_indices]
    
    puts "Train: #{train_count}, Validation: #{val_count}, Test: #{test_count}"
    
    # Train multiple models and compare
    models = train_multiple_models(x_train, y_train)
    best_model = select_best_model(models, x_val, y_val)
    
    # Final evaluation on test set
    final_performance = evaluate_model(best_model, x_test, y_test)
    
    @model = best_model
    @performance_metrics = final_performance
    
    puts "\nFinal Test Performance:"
    puts format_performance_metrics(final_performance)
    
    final_performance
  end
  
  private
  
  def train_multiple_models(x_train, y_train)
    puts "Training multiple models..."
    
    models = {}
    
    # Random Forest
    models[:random_forest] = Rumale::Ensemble::RandomForestClassifier.new(
      n_estimators: 100,
      max_depth: 10,
      min_samples_split: 5,
      random_seed: 42
    )
    
    # Gradient Boosting
    models[:gradient_boosting] = Rumale::Ensemble::GradientBoostingClassifier.new(
      n_estimators: 100,
      learning_rate: 0.1,
      max_depth: 6,
      random_seed: 42
    )
    
    # Logistic Regression with regularization
    models[:logistic_regression] = Rumale::LinearModel::LogisticRegression.new(
      reg_param: 0.01,
      max_iter: 1000,
      random_seed: 42
    )
    
    # Support Vector Machine. Rumale's SVC is a linear model; an RBF kernel
    # requires precomputing a kernel matrix for Rumale::KernelMachine::KernelSVC,
    # so we use the linear variant here.
    models[:svm] = Rumale::LinearModel::SVC.new(
      reg_param: 1.0,
      probability: true,
      random_seed: 42
    )
    
    # Train all models
    models.each do |name, model|
      puts "Training #{name}..."
      start_time = Time.now
      model.fit(x_train, y_train)
      training_time = Time.now - start_time
      puts "  Training time: #{training_time.round(2)} seconds"
    end
    
    models
  end
  
  def select_best_model(models, x_val, y_val)
    puts "\nEvaluating models on validation set:"
    
    best_model = nil
    best_f1 = 0
    
    models.each do |name, model|
      performance = evaluate_model(model, x_val, y_val)
      puts "#{name}: F1 = #{performance[:macro_f1].round(3)}"
      
      if performance[:macro_f1] > best_f1
        best_f1 = performance[:macro_f1]
        best_model = model
      end
    end
    
    puts "Best model: F1 = #{best_f1.round(3)}"
    best_model
  end
  
  def evaluate_model(model, x_test, y_test)
    predictions = model.predict(x_test)
    
    # Calculate metrics
    accuracy = calculate_accuracy(y_test, predictions)
    precision_recall_f1 = calculate_precision_recall_f1(y_test, predictions)
    confusion_matrix = calculate_confusion_matrix(y_test, predictions)
    
    {
      accuracy: accuracy,
      confusion_matrix: confusion_matrix,
      **precision_recall_f1
    }
  end
  
  def calculate_accuracy(y_true, y_pred)
    # eq returns a Numo::Bit mask; count_true tallies the matches
    y_true.eq(y_pred).count_true.to_f / y_true.size
  end
  
  def calculate_precision_recall_f1(y_true, y_pred)
    classes = y_true.to_a.uniq.sort
    
    precision_per_class = {}
    recall_per_class = {}
    f1_per_class = {}
    
    classes.each do |cls|
      tp = (y_true.eq(cls) & y_pred.eq(cls)).count_true.to_f
      fp = (y_true.ne(cls) & y_pred.eq(cls)).count_true.to_f
      fn = (y_true.eq(cls) & y_pred.ne(cls)).count_true.to_f
      
      precision = tp > 0 ? tp / (tp + fp) : 0.0
      recall = tp > 0 ? tp / (tp + fn) : 0.0
      f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0.0
      
      precision_per_class[cls] = precision
      recall_per_class[cls] = recall
      f1_per_class[cls] = f1
    end
    
    macro_precision = precision_per_class.values.sum / classes.length
    macro_recall = recall_per_class.values.sum / classes.length
    macro_f1 = f1_per_class.values.sum / classes.length
    
    {
      precision_per_class: precision_per_class,
      recall_per_class: recall_per_class,
      f1_per_class: f1_per_class,
      macro_precision: macro_precision,
      macro_recall: macro_recall,
      macro_f1: macro_f1
    }
  end
  
  def calculate_confusion_matrix(y_true, y_pred)
    matrix = Hash.new { |h, k| h[k] = Hash.new(0) }
    
    y_true.to_a.zip(y_pred.to_a).each do |true_class, pred_class|
      matrix[true_class][pred_class] += 1
    end
    
    matrix
  end
  
  def format_performance_metrics(metrics)
    output = []
    output << "Accuracy: #{(metrics[:accuracy] * 100).round(1)}%"
    output << "Macro F1-Score: #{(metrics[:macro_f1] * 100).round(1)}%"
    output << ""
    output << "Per-class performance:"
    
    risk_levels = { 0 => 'Low Risk', 1 => 'Medium Risk', 2 => 'High Risk' }
    
    metrics[:f1_per_class].each do |cls, f1|
      precision = metrics[:precision_per_class][cls]
      recall = metrics[:recall_per_class][cls]
      
      output << "  #{risk_levels[cls]}:"
      output << "    Precision: #{(precision * 100).round(1)}%"
      output << "    Recall: #{(recall * 100).round(1)}%"
      output << "    F1-Score: #{(f1 * 100).round(1)}%"
    end
    
    output << ""
    output << "Confusion Matrix:"
    output << format_confusion_matrix(metrics[:confusion_matrix])
    
    output.join("\n")
  end
  
  def format_confusion_matrix(matrix)
    risk_levels = { 0 => 'Low', 1 => 'Med', 2 => 'High' }
    classes = matrix.keys.sort
    
    lines = []
    lines << "        Predicted"
    lines << "      " + classes.map { |c| risk_levels[c].ljust(6) }.join(" ")
    
    classes.each do |true_class|
      row = "#{risk_levels[true_class].ljust(4)} "
      row += classes.map { |pred_class| matrix[true_class][pred_class].to_s.ljust(6) }.join(" ")
      lines << row
    end
    
    lines.join("\n")
  end
end
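
With everything defined, a complete training run is just a few lines (paths and dataset size are the examples from earlier):

# train.rb - end-to-end run
DataGenerator.generate_student_data(1500, 'student_data.csv')

predictor = StudentPerformancePredictor.new
students = predictor.load_data('student_data.csv')
predictor.train_and_evaluate(students, test_size: 0.2, validation_size: 0.2)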

Feature Importance and Model Interpretation

Understanding which features matter most is crucial for educational insights:

class StudentPerformancePredictor
  def analyze_feature_importance
    return unless @model.respond_to?(:feature_importances)
    
    puts "\nFeature Importance Analysis:"
    
    # Get feature importances from the model
    importances = @model.feature_importances
    
    # Extended feature names (including engineered features)
    all_feature_names = @feature_names + [
      'engagement_score',
      'academic_momentum', 
      'risk_indicators',
      'help_seeking_ratio',
      'study_attendance_interaction',
      'gpa_midterm_interaction'
    ]
    
    # Create importance rankings
    feature_importance_pairs = all_feature_names.zip(importances.to_a)
    sorted_features = feature_importance_pairs.sort_by { |_, importance| -importance }
    
    puts "Top 10 Most Important Features:"
    sorted_features.first(10).each_with_index do |(feature, importance), index|
      percentage = (importance * 100).round(1)
      puts "#{index + 1:2}. #{feature.ljust(30)} #{percentage}%"
    end
    
    # Educational insights
    puts "\nEducational Insights:"
    analyze_educational_patterns(sorted_features)
    
    sorted_features
  end
  
  private
  
  def analyze_educational_patterns(sorted_features)
    top_features = sorted_features.first(5).map(&:first)
    
    insights = []
    
    if top_features.include?('attendance_rate')
      insights << "• Attendance is a critical success factor - consider attendance intervention programs"
    end
    
    if top_features.include?('previous_gpa')
      insights << "• Prior academic performance strongly predicts future success - focus on students with low GPAs"
    end
    
    if top_features.include?('engagement_score')
      insights << "• Student engagement is key - consider strategies to increase participation and assignment completion"
    end
    
    if top_features.include?('study_hours_per_week')
      insights << "• Study habits matter - provide study skills workshops for at-risk students"
    end
    
    if top_features.include?('office_hours_visits')
      insights << "• Help-seeking behavior is important - encourage students to use office hours"
    end
    
    if insights.empty?
      insights << "• Consider reviewing feature engineering - standard academic factors may need adjustment"
    end
    
    insights.each { |insight| puts insight }
  end
end
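
Note that analyze_feature_importance returns early for models without feature_importances, such as logistic regression or the linear SVC. For those, permutation importance is a model-agnostic fallback: shuffle one feature column at a time and measure how much held-out accuracy drops. A minimal sketch that could live alongside the evaluation helpers in StudentPerformancePredictor (x_test, y_test, and calculate_accuracy are assumed to be available from the evaluation step):

def permutation_importance(model, x_test, y_test, feature_names)
  baseline = calculate_accuracy(y_test, model.predict(x_test))

  scores = feature_names.each_with_index.map do |name, j|
    shuffled = x_test.dup
    # Shuffle column j to break its relationship with the labels
    perm = (0...x_test.shape[0]).to_a.shuffle
    shuffled[true, j] = x_test[perm, j]
    [name, baseline - calculate_accuracy(y_test, model.predict(shuffled))]
  end

  scores.sort_by { |_, drop| -drop } # biggest accuracy drop = most important
end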

Production Deployment and Real-Time Prediction

Let's create a production-ready service for making predictions:

class StudentRiskAssessmentService
  def initialize(model_path = nil)
    @predictor = StudentPerformancePredictor.new

    if model_path && File.exist?(model_path)
      load_model(model_path)
    else
      puts "No saved model found. Train a new model first."
    end
  end

  def save_model(path = 'student_predictor_model.json')
    model_data = {
      model_class: @predictor.model.class.name,
      model_params: serialize_model(@predictor.model),
      scaler_params: serialize_scaler(@predictor.scaler),
      feature_names: @predictor.feature_names,
      performance_metrics: @predictor.performance_metrics,
      created_at: Time.now.iso8601
    }

    File.write(path, JSON.pretty_generate(model_data))
    puts "Model saved to #{path}"
  end

  def load_model(path)
    puts "Loading model from #{path}..."
    model_data = JSON.parse(File.read(path))

    # Reconstruct model and scaler
    # Note: this is simplified - in practice, you'd need more robust serialization
    puts "Model loaded successfully"
    puts "Model trained: #{model_data['created_at']}"
    puts "Model performance: #{model_data.dig('performance_metrics', 'accuracy')}"
  end
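
  # The JSON save/load above is only a stub. Rumale models and scalers are
  # plain Ruby objects, so Marshal gives a working persistence path.
  # Assumption: you control the files - Marshal.load must never be fed
  # untrusted input, and dumps should be reloaded under the same Ruby and
  # gem versions that wrote them.
  def save_model_binary(path = 'student_predictor_model.bin')
    File.binwrite(path, Marshal.dump(model: @predictor.model, scaler: @predictor.scaler))
    puts "Model saved to #{path}"
  end

  def load_model_binary(path)
    data = Marshal.load(File.binread(path))
    @predictor.instance_variable_set(:@model, data[:model])
    @predictor.instance_variable_set(:@scaler, data[:scaler])
    puts "Model loaded from #{path}"
  end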

  def assess_student_risk(student_data)
    unless @predictor.model
      return { error: "No trained model available" }
    end

    begin
      # Validate input data
      validation_result = validate_student_data(student_data)
      return validation_result if validation_result[:error]

      # Prepare features as a 1 x 10 matrix
      features = extract_features_from_input(student_data)
      feature_vector = Numo::DFloat.cast([features])

      # Add engineered features (private on the predictor, hence send)
      engineered_features = @predictor.send(:add_engineered_features, feature_vector)

      # Scale features
      scaled_features = @predictor.scaler.transform(engineered_features)

      # Make prediction
      prediction = @predictor.model.predict(scaled_features)[0]
      probabilities = @predictor.model.predict_proba(scaled_features)[0, true] if @predictor.model.respond_to?(:predict_proba)

      # Generate recommendations
      recommendations = generate_recommendations(student_data, prediction, features)

      {
        student_id: student_data[:student_id],
        risk_level: format_risk_level(prediction),
        risk_score: prediction,
        confidence: probabilities ? probabilities.max : nil,
        recommendations: recommendations,
        assessed_at: Time.now.iso8601
      }
    rescue => e
      { error: "Prediction failed: #{e.message}" }
    end
  end

  def batch_assess_students(students_data)
    results = students_data.map { |student| assess_student_risk(student) }

    # Summary statistics
    risk_distribution = results
      .reject { |r| r[:error] }
      .group_by { |r| r[:risk_level] }
      .transform_values(&:count)

    {
      assessments: results,
      summary: {
        total_students: students_data.length,
        successful_assessments: results.count { |r| !r[:error] },
        risk_distribution: risk_distribution
      }
    }
  end

  private

  def validate_student_data(data)
    required_fields = [
      :study_hours_per_week, :attendance_rate, :previous_gpa,
      :assignment_completion_rate, :participation_score, :midterm_score,
      :quiz_average, :days_absent, :late_submissions, :office_hours_visits
    ]

    missing_fields = required_fields - data.keys
    return { error: "Missing required fields: #{missing_fields}" } unless missing_fields.empty?

    # Range validation
    validations = {
      study_hours_per_week: (0..50),
      attendance_rate: (0..100),
      previous_gpa: (0.0..4.0),
      assignment_completion_rate: (0..100),
      participation_score: (0..10),
      midterm_score: (0..100),
      quiz_average: (0..100),
      days_absent: (0..50),
      late_submissions: (0..50),
      office_hours_visits: (0..20)
    }

    validations.each do |field, range|
      value = data[field]
      unless range.include?(value)
        return { error: "#{field} value #{value} outside valid range #{range}" }
      end
    end

    { valid: true }
  end

  def extract_features_from_input(data)
    [
      data[:study_hours_per_week],
      data[:attendance_rate] / 100.0,
      data[:previous_gpa],
      data[:assignment_completion_rate] / 100.0,
      data[:participation_score],
      data[:midterm_score] / 100.0,
      data[:quiz_average] / 100.0,
      data[:days_absent],
      data[:late_submissions],
      data[:office_hours_visits]
    ]
  end

  def format_risk_level(prediction)
    case prediction
    when 0 then 'Low Risk'
    when 1 then 'Medium Risk'
    when 2 then 'High Risk'
    else 'Unknown'
    end
  end

  def generate_recommendations(student_data, risk_level, features)
    recommendations = []

    case risk_level
    when 2 # High Risk
      recommendations << "Immediate intervention recommended"
      recommendations << "Schedule meeting with academic advisor"
      if student_data[:attendance_rate] < 70
        recommendations << "Attendance is concerning - contact student about barriers to attendance"
      end
      if student_data[:assignment_completion_rate] < 70
        recommendations << "Low assignment completion - provide assignment planning support"
      end
      if student_data[:office_hours_visits] == 0
        recommendations << "Student hasn't used office hours - encourage help-seeking behavior"
      end
    when 1 # Medium Risk
      recommendations << "Monitor closely and provide preventive support"
      if student_data[:study_hours_per_week] < 5
        recommendations << "Consider study skills workshop"
      end
      if student_data[:participation_score] < 5
        recommendations << "Encourage class participation"
      end
    when 0 # Low Risk
      recommendations << "Student is performing well"
      recommendations << "Consider for peer tutoring opportunities"
    end

    recommendations
  end

  def serialize_model(model)
    # Simplified serialization - in practice, use proper model persistence
    { class: model.class.name, params: "serialized_parameters" }
  end

  def serialize_scaler(scaler)
    # Simplified serialization for scaler
    { class: scaler.class.name, params: "serialized_scaler_parameters" }
  end
end