# dslr
A logistic regression project that classifies Hogwarts students into houses based on course scores.
## Usage
```bash
# Describe the dataset
python3 describe.py datasets/dataset_train.csv
# Generate histograms for each course, colored by house
python3 histogram.py datasets/dataset_train.csv
# Generate scatter plots for each pair of courses
python3 scatter_plot.py datasets/dataset_train.csv
# Generate a scatter matrix of all courses
python3 scatter_matrix.py datasets/dataset_train.csv
# Train the logistic regression model (learning rate: 1.0, iterations: 20000)
python3 logreg_train.py datasets/dataset_train.csv 1.0 20000
```
Output images are saved to `output/<graph_type>/`.
## Data Visualization
+ **Histogram**: Shows the score distribution for each course, split by house. Useful for identifying which courses have similar distributions across houses (homogeneous features) and which ones separate houses well.
+ **Scatter Plot**: Plots one course against another for every pair. Useful for spotting correlations between courses and identifying which feature pairs best separate the houses.
+ **Scatter Matrix**: A grid of all pairwise scatter plots. Gives a high-level overview of relationships between all features at once, helping select the best features for the logistic regression model.
## Logistic Regression
The model uses **one-vs-all logistic regression** to classify students into one of four Hogwarts houses. Since logistic regression is a binary classifier (it predicts yes/no), we train four separate models — one per house. Each model learns to answer: "does this student belong to house X?"
For a given student, all four models produce a probability. The house whose model outputs the highest probability wins.
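The one-vs-all decision rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the `weights` layout (house name mapped to a weight vector and bias) is a hypothetical structure chosen for the example:

```python
import numpy as np

def sigmoid(z):
    """Map a linear combination to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_house(x, weights):
    """One-vs-all: score x with each house's binary model, pick the max.

    x: feature vector for one student.
    weights: dict mapping house name -> (w, b); hypothetical layout.
    """
    probs = {house: sigmoid(np.dot(w, x) + b)
             for house, (w, b) in weights.items()}
    return max(probs, key=probs.get)
```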
### Training
1. **Feature selection**: Pick courses whose score distributions differ across houses (visible on histograms). Avoid redundant features (correlated pairs visible on scatter plots).
2. **Preprocessing**: Standardize feature values and impute missing data.
3. **Gradient descent**: For each house, iteratively adjust weights to minimize the log-loss cost function. The result is a weight vector per house, saved to `output/logreg/weights.csv`.
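The gradient-descent step above can be sketched for a single binary (one house vs. the rest) model. This is a simplified sketch, not the project's implementation; it assumes `X` is already standardized and imputed, and uses batch gradient descent on the log-loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, lr=1.0, iters=20000):
    """Batch gradient descent on log-loss for one binary model.

    X: (m, n) standardized feature matrix (no bias column).
    y: (m,) labels -- 1 if the student is in the target house, else 0.
    Returns a weight vector of length n + 1 (bias first).
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a bias column
    theta = np.zeros(n + 1)
    for _ in range(iters):
        h = sigmoid(Xb @ theta)            # predictions for all rows
        grad = Xb.T @ (h - y) / m          # gradient of the log-loss
        theta -= lr * grad
    return theta
```

Training four such models (one per house) yields the four weight vectors that get written to disk.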
### Prediction
Load the saved weights, compute each house model's probability for a student's scores, and assign the house with the highest probability.
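Vectorized over all students, the prediction step might look like the sketch below. The `(k, n+1)` weight-array layout (one row per house, bias term first) is an assumption made for the example; the actual `weights.csv` layout may differ:

```python
import numpy as np

def predict(X, thetas, houses):
    """Score every student against every house model, take the argmax.

    X: (m, n) standardized feature matrix.
    thetas: (k, n+1) array, one weight row per house, bias first
            (an assumed layout for this sketch).
    houses: list of k house names, in the same order as thetas rows.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    probs = 1.0 / (1.0 + np.exp(-(Xb @ thetas.T)))   # (m, k) probabilities
    return [houses[i] for i in probs.argmax(axis=1)]
```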
## Standardization and Imputation
**Standardizing** data transforms features to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating the model.
**Imputing** data replaces missing values with substitutes (e.g., the mean or median) so that no observation is excluded from training.
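Both steps can be done in a few lines. A minimal sketch using mean imputation (median imputation is an equally valid choice); it assumes no feature column is constant, since a zero standard deviation would divide by zero:

```python
import numpy as np

def impute_and_standardize(X):
    """Replace NaNs with the column mean, then scale each column
    to mean 0 and standard deviation 1."""
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)    # impute missing values
    return (X - X.mean(axis=0)) / X.std(axis=0)
```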
## Definition of Terms
+ **Feature**: A measurable property used as input to the model (e.g., a course score).
+ **Column Vector**: All values of a single feature across all observations.
+ **Feature Vector**: All feature values for a single observation, e.g., [x₁, x₂, x₃].
+ **Feature Matrix**: A matrix of dimensions m x n, where each of the m rows is an observation’s feature vector and each of the n columns is a feature.
+ **Class**: One of the possible categories the model predicts (e.g., one of the four Hogwarts houses).
+ **Label**: The actual known outcome for a given observation (the ground truth).
+ **Hypothesis/Prediction**: The predicted probability that an input belongs to a class. Thresholded (e.g., at 0.5) to produce a final classification.
+ **Weights**: Coefficients learned during training, reflecting each feature’s importance. Positive weights increase the likelihood of the positive class, negative weights decrease it.
+ **Linear Combination**: The weighted sum z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, passed through the sigmoid function to produce the prediction.
+ **Gradient**: The vector of partial derivatives of the cost function with respect to the weights, used by gradient descent to iteratively update weights and minimize the cost.
## Useful Links
+ [Matplotlib documentation](https://matplotlib.org/stable/contents.html)
+ [pandas scatter_matrix](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html)