| author | Thomas Vanbesien <tvanbesi@proton.me> | 2026-04-01 17:42:04 +0200 |
|---|---|---|
| commit | 32cd9b2be1763f872c800b17e1fa63f852fe91c1 | |

README.md: 1 file changed, 70 insertions, 0 deletions
# dslr

A logistic regression project that classifies Hogwarts students into houses based on course scores.

## Usage

```bash
# Describe the dataset
python3 describe.py datasets/dataset_train.csv

# Generate histograms for each course, colored by house
python3 histogram.py datasets/dataset_train.csv

# Generate scatter plots for each pair of courses
python3 scatter_plot.py datasets/dataset_train.csv

# Generate a scatter matrix of all courses
python3 scatter_matrix.py datasets/dataset_train.csv

# Train the logistic regression model (learning rate: 1.0, iterations: 20000)
python3 logreg_train.py datasets/dataset_train.csv 1.0 20000
```

Output images are saved to `output/<graph_type>/`.

## Data Visualization

+ **Histogram**: Shows the score distribution for each course, split by house. Useful for identifying which courses have similar distributions across houses (homogeneous features) and which ones separate houses well.
+ **Scatter Plot**: Plots one course against another for every pair. Useful for spotting correlations between courses and identifying which feature pairs best separate the houses.
+ **Scatter Matrix**: A grid of all pairwise scatter plots. Gives a high-level overview of relationships between all features at once, helping select the best features for the logistic regression model.

## Logistic Regression

The model uses **one-vs-all logistic regression** to classify students into one of four Hogwarts houses. Since logistic regression is a binary classifier (it predicts yes/no), we train four separate models — one per house. Each model learns to answer: "does this student belong to house X?"

For a given student, all four models produce a probability. The house whose model outputs the highest probability wins.

### Training

1. **Feature selection**: Pick courses whose score distributions differ across houses (visible on histograms). Avoid redundant features (correlated pairs visible on scatter plots).
2. **Preprocessing**: Standardize feature values and impute missing data.
3. **Gradient descent**: For each house, iteratively adjust weights to minimize the log-loss cost function. The result is a weight vector per house, saved to `output/logreg/weights.csv`.

### Prediction

Load the saved weights, compute each house model's probability for a student's scores, and assign the house with the highest probability.

## Standardization and Imputation

**Standardizing** data transforms features to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating the model.

**Imputing** data replaces missing values with substitutes (e.g., the mean or median) so that no observation is excluded from training.

## Definition of Terms

+ **Feature**: A measurable property used as input to the model (e.g., a course score).
+ **Column Vector**: All values of a single feature across all observations.
+ **Feature Vector**: All feature values for a single observation, e.g., [x₁, x₂, x₃].
+ **Feature Matrix**: A matrix of dimensions m × n, where each row is an observation’s feature vector and each column is a feature.
+ **Class**: One of the possible categories the model predicts (e.g., one of the four Hogwarts houses).
+ **Label**: The actual known outcome for a given observation (the ground truth).
+ **Hypothesis/Prediction**: The predicted probability that an input belongs to a class. Thresholded (e.g., at 0.5) to produce a final classification.
+ **Weights**: Coefficients learned during training, reflecting each feature’s importance. Positive weights increase the likelihood of the positive class, negative weights decrease it.
+ **Linear Combination**: The weighted sum z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, passed through the sigmoid function to produce the prediction.
+ **Gradient**: The vector of partial derivatives of the cost function with respect to the weights, used by gradient descent to iteratively update weights and minimize the cost.

## Useful Links

+ [Matplotlib documentation](https://matplotlib.org/stable/contents.html)
+ [pandas scatter_matrix](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html)
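The preprocessing step (standardization and imputation) described above might look roughly like this. This is a minimal sketch with NumPy, not the project's actual code; the function names `impute_median` and `standardize` are illustrative, and it assumes median imputation, while the project may use the mean instead:

```python
import numpy as np

def impute_median(X):
    """Replace NaNs in each column with that column's median."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmedian(col)  # median ignoring NaNs
    return X

def standardize(X):
    """Scale each feature to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Imputation runs first so that every observation has a value in every column before the mean and standard deviation are computed.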
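The one-vs-all training and prediction loop described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions (batch gradient descent on the log-loss, a bias column prepended to the feature matrix), not the project's actual implementation; `train_one_vs_all` and `predict` are hypothetical names:

```python
import numpy as np

def sigmoid(z):
    """Map the linear combination z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, houses, lr=1.0, iters=20000):
    """Train one weight vector per house with batch gradient descent."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])       # prepend a bias column
    weights = {}
    for house in houses:
        t = (y == house).astype(float)         # 1 if student is in this house
        w = np.zeros(n + 1)
        for _ in range(iters):
            h = sigmoid(Xb @ w)                # predicted probabilities
            w -= (lr / m) * (Xb.T @ (h - t))   # log-loss gradient step
        weights[house] = w
    return weights

def predict(X, weights):
    """Assign each observation the house whose model gives the highest probability."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    houses = list(weights)
    probs = np.column_stack([sigmoid(Xb @ weights[h]) for h in houses])
    return [houses[i] for i in probs.argmax(axis=1)]
```

Each binary model only sees a 0/1 target ("this house or not"); the multi-class decision happens at prediction time, when the four probabilities are compared.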
