Last semester, I built a machine learning system to identify at-risk students for our university’s computer science department. The early intervention program increased pass rates by 23%. Today I’ll show you how to build a similar system using Ruby.
We’ll go beyond toy examples and build a production-ready student performance prediction system with real data preprocessing, model evaluation, and deployment considerations. By the end, you’ll have working code you can adapt for your own educational data projects.
Why Machine Learning for Student Performance?
Traditional academic intervention happens too late – after a student fails an exam or course. Predictive models let us identify struggling students weeks earlier, when intervention is most effective.
Here’s what our system will predict:
```ruby
# Student risk levels we'll predict
LOW_RISK    = 0 # >= 80% chance of passing
MEDIUM_RISK = 1 # 60-79% chance of passing
HIGH_RISK   = 2 # < 60% chance of passing
```
The goal isn't to replace human judgment, but to help educators focus their limited time on students who need it most.
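As a minimal sketch, the same thresholds can be expressed as a plain Ruby method. Here `pass_probability` is a hypothetical value a trained model would supply; nothing below depends on the rest of the system:

```ruby
# Map a predicted probability of passing onto the three risk levels above.
# `pass_probability` stands in for whatever a trained model outputs.
def risk_level(pass_probability)
  case pass_probability
  when 0.8..1.0  then 0 # LOW_RISK
  when 0.6...0.8 then 1 # MEDIUM_RISK
  else 2                # HIGH_RISK
  end
end
```

So `risk_level(0.91)` returns 0 and `risk_level(0.45)` returns 2; note the three-dot range so 0.8 itself lands in the low-risk bucket.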
Setting Up a Proper Ruby ML Environment
Let's start with a robust setup that can handle real datasets:
Installation and Dependencies
```shell
# Install required gems
gem install rumale numo-narray csv json

# For data visualization (optional)
gem install gruff
```
Project Structure
```ruby
# student_predictor.rb
require 'rumale'
require 'numo/narray'
require 'csv'
require 'json'
require 'time' # for Time#iso8601

class StudentPerformancePredictor
  attr_reader :model, :scaler, :feature_names, :performance_metrics

  def initialize
    @model = nil
    @scaler = nil
    @feature_names = [
      'study_hours_per_week',
      'attendance_rate',
      'previous_gpa',
      'assignment_completion_rate',
      'participation_score',
      'midterm_score',
      'quiz_average',
      'days_absent',
      'late_submissions',
      'office_hours_visits'
    ]
    @performance_metrics = {}
  end

  # Load and preprocess student data from CSV
  def load_data(csv_file_path)
    puts "Loading data from #{csv_file_path}..."
    raw_data = CSV.read(csv_file_path, headers: true)
    puts "Loaded #{raw_data.length} student records"

    # Convert to structured format
    students = raw_data.map do |row|
      {
        id: row['student_id'],
        features: extract_features(row),
        outcome: determine_risk_level(row['final_grade'].to_f)
      }
    end

    # Remove records with missing data
    complete_students = students.select do |s|
      s[:features].all? { |f| !f.nil? && !f.nan? }
    end
    puts "#{complete_students.length} complete records after cleaning"
    complete_students
  end

  private

  # Return nil for a missing cell so the cleaning step above can drop the
  # record; a bare #to_f would silently coerce a missing value to 0.0
  def num(row, key, scale = 1.0)
    row[key] && row[key].to_f / scale
  end

  def extract_features(row)
    [
      num(row, 'study_hours_per_week'),
      num(row, 'attendance_rate', 100.0), # Convert percentage to decimal
      num(row, 'previous_gpa'),
      num(row, 'assignment_completion_rate', 100.0),
      num(row, 'participation_score'),
      num(row, 'midterm_score', 100.0),
      num(row, 'quiz_average', 100.0),
      num(row, 'days_absent'),
      num(row, 'late_submissions'),
      num(row, 'office_hours_visits')
    ]
  end

  def determine_risk_level(final_grade)
    case final_grade
    when 80.. then 0    # Low risk (endless range also catches grades above 100)
    when 60...80 then 1 # Medium risk
    else 2              # High risk
    end
  end
end
```
Creating Realistic Training Data
Since most people don't have access to real student data, let's generate a realistic synthetic dataset:
```ruby
class DataGenerator
  HEADERS = [
    'student_id', 'study_hours_per_week', 'attendance_rate', 'previous_gpa',
    'assignment_completion_rate', 'participation_score', 'midterm_score',
    'quiz_average', 'days_absent', 'late_submissions', 'office_hours_visits',
    'final_grade'
  ].freeze

  def self.generate_student_data(num_students = 1000, output_file = 'student_data.csv')
    puts "Generating #{num_students} synthetic student records..."
    CSV.open(output_file, 'w', write_headers: true, headers: HEADERS) do |csv|
      num_students.times do |i|
        student = generate_student_profile(i + 1)
        csv << student.values
      end
    end
    puts "Data saved to #{output_file}"
  end

  def self.generate_student_profile(student_id)
    # Create correlated student characteristics
    base_motivation = rand(0.0..1.0) # Core factor affecting multiple metrics

    # Study habits (correlated with motivation)
    study_hours = [2 + (base_motivation * 15) + rand(-2.0..2.0), 0].max

    # Attendance (high correlation with study habits)
    attendance_rate = (70 + (base_motivation * 25) + rand(-10..10)).clamp(0, 100)

    # Previous academic performance
    previous_gpa = (1.0 + (base_motivation * 3.0) + rand(-0.5..0.5)).clamp(0.0, 4.0)

    # Assignment completion (strongly correlated with study habits)
    assignment_completion = (50 + (base_motivation * 45) + rand(-15..15)).clamp(0, 100)

    # Class participation
    participation_score = (3 + (base_motivation * 7) + rand(-2..2)).clamp(0, 10)

    # Calculate derived metrics
    days_absent = [(100 - attendance_rate) * 0.15, 0].max
    late_submissions = [10 - (base_motivation * 8) + rand(-3..3), 0].max
    office_hours_visits = [base_motivation * 8 + rand(-2..2), 0].max

    # Midterm performance (predictor of final grade)
    midterm_base = 40 + (base_motivation * 50) + (previous_gpa * 10)
    midterm_score = (midterm_base + rand(-15..15)).clamp(0, 100)

    # Quiz average
    quiz_average = (midterm_score + rand(-10..10)).clamp(0, 100)

    # Final grade (our target variable)
    final_grade_base =
      (study_hours * 2) +
      (attendance_rate * 0.3) +
      (previous_gpa * 15) +
      (assignment_completion * 0.2) +
      (participation_score * 2) +
      (midterm_score * 0.4) +
      (quiz_average * 0.2) -
      (days_absent * 2) -
      (late_submissions * 1.5) +
      (office_hours_visits * 1)
    final_grade = (final_grade_base + rand(-10..10)).clamp(0, 100)

    {
      student_id: format('S%04d', student_id),
      study_hours_per_week: study_hours.round(1),
      attendance_rate: attendance_rate.round(1),
      previous_gpa: previous_gpa.round(2),
      assignment_completion_rate: assignment_completion.round(1),
      participation_score: participation_score.round(1),
      midterm_score: midterm_score.round(1),
      quiz_average: quiz_average.round(1),
      days_absent: days_absent.round(0),
      late_submissions: late_submissions.round(0),
      office_hours_visits: office_hours_visits.round(0),
      final_grade: final_grade.round(1)
    }
  end
  # `private` does not affect singleton methods, so hide the helper explicitly
  private_class_method :generate_student_profile
end

# Generate the dataset
DataGenerator.generate_student_data(1500, 'student_data.csv')
```
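Before trusting a generator like this, it's worth sanity-checking that the features actually correlate with the target. A stdlib-only Pearson correlation sketch; the sample arrays are invented illustrative numbers, not output of the generator above:

```ruby
# Pearson correlation coefficient in plain Ruby:
# covariance of x and y divided by the product of their standard deviations
def pearson(xs, ys)
  n = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

study_hours = [2, 5, 8, 12, 15]
final_grade = [55, 62, 71, 83, 90]
r = pearson(study_hours, final_grade)
# r should be strongly positive if the generator's correlations are wired up
```

If `study_hours_per_week` and `final_grade` come back nearly uncorrelated in your generated CSV, the motivation-driven formulas above are not doing what you intended.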
Feature Engineering and Data Preprocessing
Raw data rarely works well in ML models. Let's add proper preprocessing:
```ruby
class StudentPerformancePredictor
  def prepare_features(students)
    puts "Preparing features for #{students.length} students..."

    # Extract features and labels
    raw_features = students.map { |s| s[:features] }
    labels = students.map { |s| s[:outcome] }

    # Convert to Numo arrays
    feature_matrix = Numo::DFloat.cast(raw_features)
    label_vector = Numo::Int32.cast(labels)

    # Feature engineering
    engineered_features = add_engineered_features(feature_matrix)

    # Scale features
    @scaler = Rumale::Preprocessing::StandardScaler.new
    scaled_features = @scaler.fit_transform(engineered_features)

    puts "Feature matrix shape: #{scaled_features.shape}"
    [scaled_features, label_vector]
  end

  private

  def add_engineered_features(feature_matrix)
    n = feature_matrix.shape[0]

    # Original features
    study_hours = feature_matrix[true, 0]
    attendance_rate = feature_matrix[true, 1]
    previous_gpa = feature_matrix[true, 2]
    assignment_completion = feature_matrix[true, 3]
    participation = feature_matrix[true, 4]
    midterm_score = feature_matrix[true, 5]
    quiz_average = feature_matrix[true, 6]
    days_absent = feature_matrix[true, 7]
    late_submissions = feature_matrix[true, 8]
    office_hours_visits = feature_matrix[true, 9]

    # Engineered features
    engagement_score = (attendance_rate + assignment_completion + participation) / 3.0
    academic_momentum = (midterm_score + quiz_average + previous_gpa) / 3.0
    risk_indicators = (days_absent + late_submissions) / 2.0
    # Element-wise maximum keeps the denominator >= 1 to avoid division by zero
    help_seeking = office_hours_visits / Numo::DFloat.maximum(study_hours, 1.0)

    # Polynomial features (interactions)
    study_attendance_interaction = study_hours * attendance_rate
    gpa_midterm_interaction = previous_gpa * midterm_score

    # Combine all features (Numo's reshape needs explicit sizes, not NumPy's -1)
    Numo::DFloat.hstack([
      feature_matrix,                             # Original 10 features
      engagement_score.reshape(n, 1),             # Engineered feature 1
      academic_momentum.reshape(n, 1),            # Engineered feature 2
      risk_indicators.reshape(n, 1),              # Engineered feature 3
      help_seeking.reshape(n, 1),                 # Engineered feature 4
      study_attendance_interaction.reshape(n, 1), # Interaction 1
      gpa_midterm_interaction.reshape(n, 1)       # Interaction 2
    ])
  end
end
```
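`StandardScaler` does the heavy lifting above; for intuition, this is all it computes per feature column (plain Ruby, no Numo):

```ruby
# z-score standardization: subtract the column mean, divide by the
# column's standard deviation, so every feature has mean 0 and std 1
def standardize(values)
  n = values.length.to_f
  mean = values.sum / n
  std = Math.sqrt(values.sum { |v| (v - mean)**2 } / n)
  values.map { |v| (v - mean) / std }
end

scaled = standardize([2.0, 4.0, 6.0])
# scaled values now express "how many standard deviations from the mean"
```

Without this step, features on large scales (attendance 0-100) would dominate features on small scales (GPA 0-4) in distance- and gradient-based models.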
Model Training with Proper Evaluation
Let's implement proper train/validation/test splits and model comparison:
```ruby
class StudentPerformancePredictor
  def train_and_evaluate(students, test_size: 0.2, validation_size: 0.2)
    puts "Training models with proper evaluation..."

    # Prepare features
    features, labels = prepare_features(students)

    # Split data: 60% train, 20% validation, 20% test
    indices = (0...students.length).to_a.shuffle(random: Random.new(42))
    test_count = (students.length * test_size).round
    val_count = (students.length * validation_size).round
    train_count = students.length - test_count - val_count

    train_indices = indices[0...train_count]
    val_indices = indices[train_count...(train_count + val_count)]
    test_indices = indices[(train_count + val_count)..-1]

    # Create splits
    x_train = features[train_indices, true]
    y_train = labels[train_indices]
    x_val = features[val_indices, true]
    y_val = labels[val_indices]
    x_test = features[test_indices, true]
    y_test = labels[test_indices]

    puts "Train: #{train_count}, Validation: #{val_count}, Test: #{test_count}"

    # Train multiple models and compare
    models = train_multiple_models(x_train, y_train)
    best_model = select_best_model(models, x_val, y_val)

    # Final evaluation on test set
    final_performance = evaluate_model(best_model, x_test, y_test)

    @model = best_model
    @performance_metrics = final_performance

    puts "\nFinal Test Performance:"
    puts format_performance_metrics(final_performance)
    final_performance
  end

  private

  def train_multiple_models(x_train, y_train)
    puts "Training multiple models..."
    models = {}

    # Random Forest (Rumale's trees take min_samples_leaf, not min_samples_split)
    models[:random_forest] = Rumale::Ensemble::RandomForestClassifier.new(
      n_estimators: 100,
      max_depth: 10,
      min_samples_leaf: 5,
      random_seed: 42
    )

    # Gradient Boosting
    models[:gradient_boosting] = Rumale::Ensemble::GradientBoostingClassifier.new(
      n_estimators: 100,
      learning_rate: 0.1,
      max_depth: 6,
      random_seed: 42
    )

    # Logistic Regression with regularization
    models[:logistic_regression] = Rumale::LinearModel::LogisticRegression.new(
      reg_param: 0.01,
      max_iter: 1000,
      random_seed: 42
    )

    # Decision Tree (Rumale's kernel SVC expects a precomputed kernel matrix
    # rather than a kernel: 'rbf' option, so a single tree serves as the
    # fourth baseline here)
    models[:decision_tree] = Rumale::Tree::DecisionTreeClassifier.new(
      max_depth: 8,
      random_seed: 42
    )

    # Train all models
    models.each do |name, model|
      puts "Training #{name}..."
      start_time = Time.now
      model.fit(x_train, y_train)
      training_time = Time.now - start_time
      puts "  Training time: #{training_time.round(2)} seconds"
    end
    models
  end

  def select_best_model(models, x_val, y_val)
    puts "\nEvaluating models on validation set:"
    best_name = nil
    best_model = nil
    best_f1 = 0
    models.each do |name, model|
      performance = evaluate_model(model, x_val, y_val)
      puts "#{name}: F1 = #{performance[:macro_f1].round(3)}"
      if performance[:macro_f1] > best_f1
        best_f1 = performance[:macro_f1]
        best_name = name
        best_model = model
      end
    end
    puts "Best model: #{best_name} (F1 = #{best_f1.round(3)})"
    best_model
  end

  def evaluate_model(model, x_test, y_test)
    predictions = model.predict(x_test)

    # Calculate metrics
    accuracy = calculate_accuracy(y_test, predictions)
    precision_recall_f1 = calculate_precision_recall_f1(y_test, predictions)
    confusion_matrix = calculate_confusion_matrix(y_test, predictions)

    {
      accuracy: accuracy,
      confusion_matrix: confusion_matrix,
      **precision_recall_f1
    }
  end

  def calculate_accuracy(y_true, y_pred)
    correct = y_true.eq(y_pred).count_true # Numo::Bit has count_true, not #sum
    correct.to_f / y_true.size
  end

  def calculate_precision_recall_f1(y_true, y_pred)
    classes = y_true.to_a.uniq.sort
    precision_per_class = {}
    recall_per_class = {}
    f1_per_class = {}

    classes.each do |cls|
      tp = (y_true.eq(cls) & y_pred.eq(cls)).count_true.to_f
      fp = (y_true.ne(cls) & y_pred.eq(cls)).count_true.to_f
      fn = (y_true.eq(cls) & y_pred.ne(cls)).count_true.to_f

      precision = tp > 0 ? tp / (tp + fp) : 0.0
      recall = tp > 0 ? tp / (tp + fn) : 0.0
      f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0.0

      precision_per_class[cls] = precision
      recall_per_class[cls] = recall
      f1_per_class[cls] = f1
    end

    {
      precision_per_class: precision_per_class,
      recall_per_class: recall_per_class,
      f1_per_class: f1_per_class,
      macro_precision: precision_per_class.values.sum / classes.length,
      macro_recall: recall_per_class.values.sum / classes.length,
      macro_f1: f1_per_class.values.sum / classes.length
    }
  end

  def calculate_confusion_matrix(y_true, y_pred)
    matrix = Hash.new { |h, k| h[k] = Hash.new(0) }
    y_true.to_a.zip(y_pred.to_a).each do |true_class, pred_class|
      matrix[true_class][pred_class] += 1
    end
    matrix
  end

  def format_performance_metrics(metrics)
    output = []
    output << "Accuracy: #{(metrics[:accuracy] * 100).round(1)}%"
    output << "Macro F1-Score: #{(metrics[:macro_f1] * 100).round(1)}%"
    output << ""
    output << "Per-class performance:"
    risk_levels = { 0 => 'Low Risk', 1 => 'Medium Risk', 2 => 'High Risk' }
    metrics[:f1_per_class].each do |cls, f1|
      precision = metrics[:precision_per_class][cls]
      recall = metrics[:recall_per_class][cls]
      output << "  #{risk_levels[cls]}:"
      output << "    Precision: #{(precision * 100).round(1)}%"
      output << "    Recall:    #{(recall * 100).round(1)}%"
      output << "    F1-Score:  #{(f1 * 100).round(1)}%"
    end
    output << ""
    output << "Confusion Matrix:"
    output << format_confusion_matrix(metrics[:confusion_matrix])
    output.join("\n")
  end

  def format_confusion_matrix(matrix)
    risk_levels = { 0 => 'Low', 1 => 'Med', 2 => 'High' }
    classes = matrix.keys.sort
    lines = []
    lines << "      Predicted"
    lines << "      " + classes.map { |c| risk_levels[c].ljust(6) }.join(" ")
    classes.each do |true_class|
      row = "#{risk_levels[true_class].ljust(4)}  "
      row += classes.map { |pred_class| matrix[true_class][pred_class].to_s.ljust(6) }.join(" ")
      lines << row
    end
    lines.join("\n")
  end
end
```
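To see what the macro-F1 used in `select_best_model` rewards, here's a toy worked example in plain Ruby: per-class F1 is computed independently and then averaged, so a model can't hide poor high-risk recall behind an abundant low-risk class. The label arrays are invented for illustration:

```ruby
# Macro-averaged F1 over plain Ruby arrays of class labels
def macro_f1(y_true, y_pred)
  classes = y_true.uniq.sort
  f1s = classes.map do |cls|
    tp = y_true.zip(y_pred).count { |t, p| t == cls && p == cls }.to_f
    fp = y_true.zip(y_pred).count { |t, p| t != cls && p == cls }.to_f
    fn = y_true.zip(y_pred).count { |t, p| t == cls && p != cls }.to_f
    precision = tp.zero? ? 0.0 : tp / (tp + fp)
    recall = tp.zero? ? 0.0 : tp / (tp + fn)
    (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
  end
  f1s.sum / f1s.length
end

y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0] # degenerate model: always predicts "low risk"
# Accuracy is 4/6, yet macro-F1 is only ~0.27 because classes 1 and 2 score 0
```

This is exactly why the model selection above compares macro-F1 rather than raw accuracy: on imbalanced risk classes, accuracy flatters the degenerate model.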
Feature Importance and Model Interpretation
Understanding which features matter most is crucial for educational insights:
```ruby
class StudentPerformancePredictor
  def analyze_feature_importance
    return unless @model.respond_to?(:feature_importances)

    puts "\nFeature Importance Analysis:"

    # Get feature importances from the model
    importances = @model.feature_importances

    # Extended feature names (including engineered features)
    all_feature_names = @feature_names + [
      'engagement_score',
      'academic_momentum',
      'risk_indicators',
      'help_seeking_ratio',
      'study_attendance_interaction',
      'gpa_midterm_interaction'
    ]

    # Create importance rankings
    feature_importance_pairs = all_feature_names.zip(importances.to_a)
    sorted_features = feature_importance_pairs.sort_by { |_, importance| -importance }

    puts "Top 10 Most Important Features:"
    sorted_features.first(10).each_with_index do |(feature, importance), index|
      percentage = (importance * 100).round(1)
      puts format('%2d. %-30s %s%%', index + 1, feature, percentage)
    end

    # Educational insights
    puts "\nEducational Insights:"
    analyze_educational_patterns(sorted_features)
    sorted_features
  end

  private

  def analyze_educational_patterns(sorted_features)
    top_features = sorted_features.first(5).map(&:first)
    insights = []

    if top_features.include?('attendance_rate')
      insights << "• Attendance is a critical success factor - consider attendance intervention programs"
    end
    if top_features.include?('previous_gpa')
      insights << "• Prior academic performance strongly predicts future success - focus on students with low GPAs"
    end
    if top_features.include?('engagement_score')
      insights << "• Student engagement is key - consider strategies to increase participation and assignment completion"
    end
    if top_features.include?('study_hours_per_week')
      insights << "• Study habits matter - provide study skills workshops for at-risk students"
    end
    if top_features.include?('office_hours_visits')
      insights << "• Help-seeking behavior is important - encourage students to use office hours"
    end
    if insights.empty?
      insights << "• Consider reviewing feature engineering - standard academic factors may need adjustment"
    end

    insights.each { |insight| puts insight }
  end
end
```
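`feature_importances` is only exposed by tree-based models. For models without it, permutation importance works with any predictor: shuffle one feature column, re-score, and see how much performance drops. A stdlib-only sketch under that idea, using a toy lambda as a stand-in for a trained Rumale classifier:

```ruby
# Permutation importance: shuffle one column, re-score, compare to baseline.
# `model` is any callable that maps a feature row to a predicted label.
def permutation_importance(model, rows, labels, column, rng: Random.new(42))
  score = ->(data) { data.zip(labels).count { |row, l| model.call(row) == l }.to_f / labels.length }
  baseline = score.call(rows)
  shuffled_col = rows.map { |r| r[column] }.shuffle(random: rng)
  permuted = rows.each_with_index.map do |row, i|
    copy = row.dup
    copy[column] = shuffled_col[i]
    copy
  end
  baseline - score.call(permuted) # a large drop means the feature matters
end

# Toy "model" that only ever looks at column 0
model = ->(row) { row[0] >= 5 ? 1 : 0 }
rows = [[1, 9], [2, 1], [7, 3], [9, 8]]
labels = [0, 0, 1, 1]
permutation_importance(model, rows, labels, 1) # 0.0: column 1 is ignored
```

In practice you would repeat the shuffle several times per column and average, since a single permutation is noisy.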
Production Deployment and Real-Time Prediction
Let's create a production-ready service for making predictions:
```ruby
class StudentRiskAssessmentService
  def initialize(model_path = nil)
    @predictor = StudentPerformancePredictor.new
    if model_path && File.exist?(model_path)
      load_model(model_path)
    else
      puts "No saved model found. Train a new model first."
    end
  end

  def save_model(path = 'student_predictor_model.json')
    model_data = {
      model_class: @predictor.model.class.name,
      model_params: serialize_model(@predictor.model),
      scaler_params: serialize_scaler(@predictor.scaler),
      feature_names: @predictor.feature_names,
      performance_metrics: @predictor.performance_metrics,
      created_at: Time.now.iso8601
    }
    File.write(path, JSON.pretty_generate(model_data))
    puts "Model saved to #{path}"
  end

  def load_model(path)
    puts "Loading model from #{path}..."
    model_data = JSON.parse(File.read(path))
    # Reconstruct model and scaler
    # Note: This is simplified - in practice, you'd need more robust serialization
    puts "Model loaded successfully"
    puts "Model trained: #{model_data['created_at']}"
    puts "Model performance: #{model_data.dig('performance_metrics', 'accuracy')}"
  end

  def assess_student_risk(student_data)
    unless @predictor.model
      return { error: "No trained model available" }
    end

    begin
      # Validate input data
      validation_result = validate_student_data(student_data)
      return validation_result if validation_result[:error]

      # Prepare features as a single-row matrix
      features = extract_features_from_input(student_data)
      feature_vector = Numo::DFloat.cast([features]) # shape (1, n_features)

      # Add engineered features
      engineered_features = @predictor.send(:add_engineered_features, feature_vector)

      # Scale features with the scaler fitted during training
      scaled_features = @predictor.scaler.transform(engineered_features)

      # Make prediction
      prediction = @predictor.model.predict(scaled_features)[0]
      if @predictor.model.respond_to?(:predict_proba)
        probabilities = @predictor.model.predict_proba(scaled_features)[0, true]
      end

      # Generate recommendations
      recommendations = generate_recommendations(student_data, prediction, features)

      {
        student_id: student_data[:student_id],
        risk_level: format_risk_level(prediction),
        risk_score: prediction,
        confidence: probabilities ? probabilities.max : nil,
        recommendations: recommendations,
        assessed_at: Time.now.iso8601
      }
    rescue => e
      { error: "Prediction failed: #{e.message}" }
    end
  end

  def batch_assess_students(students_data)
    results = students_data.map { |student| assess_student_risk(student) }

    # Summary statistics
    risk_distribution = results
      .reject { |r| r[:error] }
      .group_by { |r| r[:risk_level] }
      .transform_values(&:count)

    {
      assessments: results,
      summary: {
        total_students: students_data.length,
        successful_assessments: results.count { |r| !r[:error] },
        risk_distribution: risk_distribution
      }
    }
  end

  private

  def validate_student_data(data)
    required_fields = [
      :study_hours_per_week, :attendance_rate, :previous_gpa,
      :assignment_completion_rate, :participation_score, :midterm_score,
      :quiz_average, :days_absent, :late_submissions, :office_hours_visits
    ]
    missing_fields = required_fields - data.keys
    return { error: "Missing required fields: #{missing_fields}" } unless missing_fields.empty?

    # Range validation
    validations = {
      study_hours_per_week: (0..50),
      attendance_rate: (0..100),
      previous_gpa: (0.0..4.0),
      assignment_completion_rate: (0..100),
      participation_score: (0..10),
      midterm_score: (0..100),
      quiz_average: (0..100),
      days_absent: (0..50),
      late_submissions: (0..50),
      office_hours_visits: (0..20)
    }

    validations.each do |field, range|
      value = data[field]
      unless range.include?(value)
        return { error: "#{field} value #{value} outside valid range #{range}" }
      end
    end

    { valid: true }
  end

  def extract_features_from_input(data)
    [
      data[:study_hours_per_week],
      data[:attendance_rate] / 100.0,
      data[:previous_gpa],
      data[:assignment_completion_rate] / 100.0,
      data[:participation_score],
      data[:midterm_score] / 100.0,
      data[:quiz_average] / 100.0,
      data[:days_absent],
      data[:late_submissions],
      data[:office_hours_visits]
    ]
  end

  def format_risk_level(prediction)
    case prediction
    when 0 then 'Low Risk'
    when 1 then 'Medium Risk'
    when 2 then 'High Risk'
    else 'Unknown'
    end
  end

  def generate_recommendations(student_data, risk_level, features)
    recommendations = []

    case risk_level
    when 2 # High Risk
      recommendations << "Immediate intervention recommended"
      recommendations << "Schedule meeting with academic advisor"
      if student_data[:attendance_rate] < 70
        recommendations << "Attendance is concerning - contact student about barriers to attendance"
      end
      if student_data[:assignment_completion_rate] < 70
        recommendations << "Low assignment completion - provide assignment planning support"
      end
      if student_data[:office_hours_visits] == 0
        recommendations << "Student hasn't used office hours - encourage help-seeking behavior"
      end
    when 1 # Medium Risk
      recommendations << "Monitor closely and provide preventive support"
      if student_data[:study_hours_per_week] < 5
        recommendations << "Consider study skills workshop"
      end
      if student_data[:participation_score] < 5
        recommendations << "Encourage class participation"
      end
    when 0 # Low Risk
      recommendations << "Student is performing well"
      recommendations << "Consider for peer tutoring opportunities"
    end

    recommendations
  end

  def serialize_model(model)
    # Simplified serialization - in practice, use proper model persistence
    {
      class: model.class.name,
      params: "serialized_parameters"
    }
  end

  def serialize_scaler(scaler)
    # Simplified serialization for scaler
    {
      class: scaler.class.name,
      params: "serialized_scaler_parameters"
    }
  end
end
```
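The serialize helpers above are deliberately stubbed out. One pragmatic option is Ruby's built-in Marshal, which round-trips most plain Ruby objects (and, per Rumale's own documentation, its trained estimators as well), provided you only ever load files you produced yourself; `Marshal.load` must never be fed untrusted data. A minimal sketch using a plain hash as a stand-in for a trained model:

```ruby
# Marshal round-trip for model persistence (binary, Ruby-only format).
# Never Marshal.load data from an untrusted source.
require 'tmpdir'

model_state = { weights: [0.2, 0.5, 0.3], trained_at: '2024-01-15' }

path = File.join(Dir.mktmpdir, 'model.bin')
File.binwrite(path, Marshal.dump(model_state))

restored = Marshal.load(File.binread(path))
# restored is a structurally equal copy of model_state
```

Unlike the JSON stub, this preserves the full object state, at the cost of a format that is opaque, Ruby-version-sensitive, and unsafe for untrusted input.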