author     Thomas Vanbesien <tvanbesi@proton.me>  2026-04-01 17:42:04 +0200
committer  Thomas Vanbesien <tvanbesi@proton.me>  2026-04-01 17:42:04 +0200
commit     32cd9b2be1763f872c800b17e1fa63f852fe91c1 (patch)
tree       8aee9bd7e81d8204faca701c0a852bcf7dc45de6 /README.md
Import from github.com (HEAD, master)
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  70
1 file changed, 70 insertions(+), 0 deletions(-)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..2723786
--- /dev/null
+++ b/README.md
@@ -0,0 +1,70 @@
+# dslr
+
+A logistic regression project that classifies Hogwarts students into houses based on course scores.
+
+## Usage
+
+```bash
+# Describe the dataset
+python3 describe.py datasets/dataset_train.csv
+
+# Generate histograms for each course, colored by house
+python3 histogram.py datasets/dataset_train.csv
+
+# Generate scatter plots for each pair of courses
+python3 scatter_plot.py datasets/dataset_train.csv
+
+# Generate a scatter matrix of all courses
+python3 scatter_matrix.py datasets/dataset_train.csv
+
+# Train the logistic regression model (learning rate: 1.0, iterations: 20000)
+python3 logreg_train.py datasets/dataset_train.csv 1.0 20000
+```
+
+Output images are saved to `output/<graph_type>/`.
+
+## Data Visualization
+
++ **Histogram**: Shows the score distribution for each course, split by house. Useful for identifying which courses have similar distributions across houses (homogeneous features) and which ones separate houses well.
++ **Scatter Plot**: Plots one course against another for every pair. Useful for spotting correlations between courses and identifying which feature pairs best separate the houses.
++ **Scatter Matrix**: A grid of all pairwise scatter plots. Gives a high-level overview of relationships between all features at once, helping select the best features for the logistic regression model.
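As a sketch of how such a scatter matrix could be produced with `pandas.plotting.scatter_matrix` (the function linked below); the course names and data here are synthetic stand-ins, since the real dataset is not shown:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; the real scripts may display or save differently
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for dataset_train.csv; course names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Astronomy": rng.normal(50, 10, 100),
    "Herbology": rng.normal(60, 5, 100),
    "Charms": rng.normal(70, 8, 100),
})

# One pairwise scatter plot per pair of courses; histograms on the diagonal.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.savefig("scatter_matrix.png")
```

With three courses this yields a 3 × 3 grid of axes: scatter plots off the diagonal, per-course histograms on it.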
+
+## Logistic Regression
+
+The model uses **one-vs-all logistic regression** to classify students into one of four Hogwarts houses. Since logistic regression is a binary classifier (it predicts yes/no), we train four separate models — one per house. Each model learns to answer: "does this student belong to house X?"
+
+For a given student, all four models produce a probability. The house whose model outputs the highest probability wins.
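The decision rule can be sketched in a few lines of NumPy; the weights and biases below are made-up toy values, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    # Squash the linear combination into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_house(x, weights, biases, houses):
    # One probability per house from its binary "is this house?" model;
    # the house whose model is most confident wins.
    probs = [sigmoid(np.dot(w, x) + b) for w, b in zip(weights, biases)]
    return houses[int(np.argmax(probs))]

houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
# Toy parameters: one weight row and one bias per house (two features).
weights = np.array([[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5], [-0.5, -0.5]])
biases = np.array([0.0, 0.0, 0.0, 0.0])

print(predict_house(np.array([1.0, 0.0]), weights, biases, houses))  # prints Gryffindor
```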
+
+### Training
+
+1. **Feature selection**: Pick courses whose score distributions differ across houses (visible on histograms). Avoid redundant features (correlated pairs visible on scatter plots).
+2. **Preprocessing**: Standardize feature values and impute missing data.
+3. **Gradient descent**: For each house, iteratively adjust weights to minimize the log-loss cost function. The result is a weight vector per house, saved to `output/logreg/weights.csv`.
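Step 3 can be sketched as a plain batch gradient-descent loop on the log-loss; this is a minimal illustration, not the project's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, house, lr=1.0, iters=20000):
    """Fit one binary model: `house` vs. everything else."""
    m, n = X.shape
    t = (y == house).astype(float)      # target: 1 if the student is in `house`
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)          # current predicted probabilities
        w -= lr * (X.T @ (p - t)) / m   # gradient of the log-loss w.r.t. w
        b -= lr * np.mean(p - t)        # ... and w.r.t. the bias
    return w, b
```

Running this once per house yields the four weight vectors that the training script saves.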
+
+### Prediction
+
+Load the saved weights, compute each house model's probability for a student's scores, and assign the house with the highest probability.
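A minimal sketch of that round trip; the `weights.csv` layout used here (one row per house, one column per feature plus a bias column) is an assumption, and all weight values are made up:

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights.csv layout; the real file format may differ.
weights = pd.DataFrame({
    "House": ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"],
    "Astronomy": [1.5, -0.5, 0.2, -1.0],
    "Herbology": [-0.3, 1.2, 0.4, -0.8],
    "Bias": [0.1, 0.0, -0.2, 0.3],
})
weights.to_csv("weights.csv", index=False)

# Prediction: load the weights, score one student, keep the best house.
loaded = pd.read_csv("weights.csv")
x = np.array([0.8, -0.4])   # one student's standardized course scores
z = loaded[["Astronomy", "Herbology"]].to_numpy() @ x + loaded["Bias"].to_numpy()
probs = sigmoid(z)
print(loaded["House"][int(np.argmax(probs))])  # prints Gryffindor
```

Since the sigmoid is monotonic, taking the argmax over `z` directly would give the same answer.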
+
+## Standardization and Imputation
+
+**Standardizing** data transforms features to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating the model.
+
+**Imputing** data replaces missing values with substitutes (e.g., the mean or median) so that no observation is excluded from training.
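Both steps can be illustrated on a toy column with NumPy (mean imputation and z-score standardization; the real scripts may use the median or handle columns differently):

```python
import numpy as np

def impute_mean(col):
    # Replace missing values (NaN) with the mean of the observed values.
    filled = col.copy()
    filled[np.isnan(filled)] = np.nanmean(col)
    return filled

def standardize(col):
    # Shift to mean 0, scale to standard deviation 1.
    return (col - col.mean()) / col.std()

scores = np.array([80.0, 90.0, np.nan, 100.0])
clean = standardize(impute_mean(scores))   # no NaNs, mean 0, std 1
```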
+
+## Definition of Terms
+
++ **Feature**: A measurable property used as input to the model (e.g., a course score).
++ **Column Vector**: All values of a single feature across all observations.
++ **Feature Vector**: All feature values for a single observation, e.g., [x₁, x₂, x₃].
++ **Feature Matrix**: A matrix of dimensions m × n, where each row is an observation’s feature vector and each column is a feature.
++ **Class**: One of the possible categories the model predicts (e.g., one of the four Hogwarts houses).
++ **Label**: The actual known outcome for a given observation (the ground truth).
++ **Hypothesis/Prediction**: The predicted probability that an input belongs to a class. Thresholded (e.g., at 0.5) to produce a final classification.
++ **Weights**: Coefficients learned during training, reflecting each feature’s importance. Positive weights increase the likelihood of the positive class, negative weights decrease it.
++ **Linear Combination**: The weighted sum z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, passed through the sigmoid function to produce the prediction.
++ **Gradient**: The vector of partial derivatives of the cost function with respect to the weights, used by gradient descent to iteratively update weights and minimize the cost.
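The linear combination and sigmoid from the definitions above can be checked with a tiny worked example (all numbers are made up):

```python
import math

w = [0.8, -0.5]   # weights
x = [1.2, 0.4]    # standardized feature vector
b = 0.1           # bias

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w1*x1 + w2*x2 + b = 0.86
p = 1.0 / (1.0 + math.exp(-z))                 # sigmoid squashes z into (0, 1)
print(round(p, 3))  # prints 0.703
```

With a 0.5 threshold, this observation would be classified as the positive class.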
+
+## Useful Links
+
++ [Matplotlib documentation](https://matplotlib.org/stable/contents.html)
++ [pandas scatter_matrix](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html)