
dslr

A logistic regression project that classifies Hogwarts students into houses based on course scores.

Usage

# Describe the dataset
python3 describe.py datasets/dataset_train.csv

# Generate histograms for each course, colored by house
python3 histogram.py datasets/dataset_train.csv

# Generate scatter plots for each pair of courses
python3 scatter_plot.py datasets/dataset_train.csv

# Generate a scatter matrix of all courses
python3 scatter_matrix.py datasets/dataset_train.csv

# Train the logistic regression model (learning rate: 1.0, iterations: 20000)
python3 logreg_train.py datasets/dataset_train.csv 1.0 20000

Output images are saved to output/<graph_type>/.

Data Visualization

  • Histogram: Shows the score distribution for each course, split by house. Useful for identifying which courses have similar distributions across houses (homogeneous features) and which ones separate houses well.
  • Scatter Plot: Plots one course against another for every pair. Useful for spotting correlations between courses and identifying which feature pairs best separate the houses.
  • Scatter Matrix: A grid of all pairwise scatter plots. Gives a high-level overview of relationships between all features at once, helping select the best features for the logistic regression model.

Logistic Regression

The model uses one-vs-all logistic regression to classify students into one of four Hogwarts houses. Since logistic regression is a binary classifier (it predicts yes/no), we train four separate models — one per house. Each model learns to answer: "does this student belong to house X?"

For a given student, all four models produce a probability. The house whose model outputs the highest probability wins.
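A single house model's hypothesis can be sketched as follows. This is a minimal illustration, not the project's actual code: the feature values and weights below are made up, and the real model would use one such weight vector per house.

```python
import numpy as np

def sigmoid(z):
    # Map the linear combination to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(weights, bias, features):
    # Probability that the student belongs to this model's house
    z = np.dot(weights, features) + bias
    return sigmoid(z)

# Hypothetical standardized scores for one student, with made-up weights
features = np.array([0.8, -1.2, 0.3])
weights = np.array([1.5, -0.7, 0.2])
p = hypothesis(weights, bias=0.1, features=features)  # probability for "house X"
```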

Training

  1. Feature selection: Pick courses whose score distributions differ across houses (visible on histograms). Avoid redundant features (correlated pairs visible on scatter plots).
  2. Preprocessing: Standardize feature values and impute missing data.
  3. Gradient descent: For each house, iteratively adjust weights to minimize the log-loss cost function. The result is a weight vector per house, saved to output/logreg/weights.csv.
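The gradient-descent step above can be sketched as batch gradient descent on the log-loss for one house. This is a hedged sketch, not the repository's implementation: the function name and the choice to fold the bias into the weight vector via a column of ones are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, lr=1.0, iterations=20000):
    """Batch gradient descent on the log-loss for one house.

    X: (m, n) standardized feature matrix; y: (m,) vector with 1 if the
    student is in the target house, 0 otherwise. Returns the learned
    weight vector (bias folded in as the first weight).
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])        # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        predictions = sigmoid(Xb @ w)
        gradient = Xb.T @ (predictions - y) / m  # gradient of the log-loss
        w -= lr * gradient
    return w
```

Training four houses means calling this once per house with a different binary label vector y.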

Prediction

Load the saved weights, compute each house model's probability for a student's scores, and assign the house with the highest probability.
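The prediction step can be sketched as an argmax over the four models' probabilities. A minimal sketch, assuming the weights are stored one row per house with the bias first; the function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

HOUSES = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

def predict_house(weight_rows, features):
    """weight_rows: one weight vector per house (bias term first);
    features: one student's standardized course scores."""
    x = np.concatenate([[1.0], features])         # match the bias term
    probabilities = [sigmoid(w @ x) for w in weight_rows]
    return HOUSES[int(np.argmax(probabilities))]
```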

Standardization and Imputation

Standardizing data transforms features to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating the model.

Imputing data replaces missing values with substitutes (e.g., the mean or median) so that no observation is excluded from training.
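Both steps can be sketched per feature column with NumPy. A minimal illustration using mean imputation (the project could equally use the median); the sample scores are made up.

```python
import numpy as np

def impute(column):
    # Replace NaNs with the column mean so no observation is dropped
    mean = np.nanmean(column)
    return np.where(np.isnan(column), mean, column)

def standardize(column):
    # Shift to mean 0 and scale to standard deviation 1
    return (column - column.mean()) / column.std()

scores = np.array([10.0, 12.0, np.nan, 14.0])
clean = impute(scores)        # NaN becomes 12.0, the mean of the rest
scaled = standardize(clean)   # now mean 0, std 1
```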

Definition of Terms

  • Feature: A measurable property used as input to the model (e.g., a course score).
  • Column Vector: All values of a single feature across all observations.
  • Feature Vector: All feature values for a single observation, e.g., [x₁, x₂, x₃].
  • Feature Matrix: A matrix of dimensions m × n (m observations, n features), where each row is an observation’s feature vector and each column is a feature.
  • Class: One of the possible categories the model predicts (e.g., one of the four Hogwarts houses).
  • Label: The actual known outcome for a given observation (the ground truth).
  • Hypothesis/Prediction: The predicted probability that an input belongs to a class. Thresholded (e.g., at 0.5) to produce a final classification.
  • Weights: Coefficients learned during training, reflecting each feature’s importance. Positive weights increase the likelihood of the positive class, negative weights decrease it.
  • Linear Combination: The weighted sum z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, passed through the sigmoid function to produce the prediction.
  • Gradient: The vector of partial derivatives of the cost function with respect to the weights, used by gradient descent to iteratively update weights and minimize the cost.