# dslr

A logistic regression project that classifies Hogwarts students into houses based on course scores.

## Usage

```bash
# Describe the dataset
python3 describe.py datasets/dataset_train.csv

# Generate histograms for each course, colored by house
python3 histogram.py datasets/dataset_train.csv

# Generate scatter plots for each pair of courses
python3 scatter_plot.py datasets/dataset_train.csv

# Generate a scatter matrix of all courses
python3 scatter_matrix.py datasets/dataset_train.csv

# Train the logistic regression model (learning rate: 1.0, iterations: 20000)
python3 logreg_train.py datasets/dataset_train.csv 1.0 20000
```

Output images are saved to `output/`.

## Data Visualization

+ **Histogram**: Shows the score distribution for each course, split by house. Useful for identifying which courses have similar distributions across houses (homogeneous features) and which ones separate houses well.
+ **Scatter Plot**: Plots one course against another for every pair. Useful for spotting correlations between courses and identifying which feature pairs best separate the houses.
+ **Scatter Matrix**: A grid of all pairwise scatter plots. Gives a high-level overview of relationships between all features at once, helping select the best features for the logistic regression model.

## Logistic Regression

The model uses **one-vs-all logistic regression** to classify students into one of four Hogwarts houses. Since logistic regression is a binary classifier (it predicts yes/no), we train four separate models — one per house. Each model learns to answer: "does this student belong to house X?" For a given student, all four models produce a probability. The house whose model outputs the highest probability wins.

### Training

1. **Feature selection**: Pick courses whose score distributions differ across houses (visible on histograms). Avoid redundant features (correlated pairs visible on scatter plots).
2. 
   **Preprocessing**: Standardize feature values and impute missing data.
3. **Gradient descent**: For each house, iteratively adjust weights to minimize the log-loss cost function. The result is a weight vector per house, saved to `output/logreg/weights.csv`.

### Prediction

Load the saved weights, compute each house model's probability for a student's scores, and assign the house with the highest probability.

## Standardization and Imputation

**Standardizing** data transforms features to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating the model.

**Imputing** data replaces missing values with substitutes (e.g., the mean or median) so that no observation is excluded from training.

## Definition of Terms

+ **Feature**: A measurable property used as input to the model (e.g., a course score).
+ **Column Vector**: All values of a single feature across all observations.
+ **Feature Vector**: All feature values for a single observation, e.g., [x₁, x₂, x₃].
+ **Feature Matrix**: A matrix of dimensions m x n, where each row is an observation's feature vector and each column is a feature.
+ **Class**: One of the possible categories the model predicts (e.g., one of the four Hogwarts houses).
+ **Label**: The actual known outcome for a given observation (the ground truth).
+ **Hypothesis/Prediction**: The predicted probability that an input belongs to a class. Thresholded (e.g., at 0.5) to produce a final classification.
+ **Weights**: Coefficients learned during training, reflecting each feature's importance. Positive weights increase the likelihood of the positive class, negative weights decrease it.
+ **Linear Combination**: The weighted sum z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, passed through the sigmoid function to produce the prediction.
+ **Gradient**: The vector of partial derivatives of the cost function with respect to the weights, used by gradient descent to iteratively update weights and minimize the cost.
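The pipeline described above — standardize and impute, train one binary model per house with gradient descent, then predict by taking the highest probability — can be sketched as follows. This is a minimal illustration using NumPy, not the project's actual code; all function names and hyperparameter values here are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # Map the linear combination z = X·w + b to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def standardize(X):
    # Transform each feature to mean 0 and standard deviation 1.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def impute(X):
    # Replace missing values (NaN) with the column median.
    med = np.nanmedian(X, axis=0)
    return np.where(np.isnan(X), med, X)

def train_one_vs_all(X, y, classes, lr=1.0, iterations=20000):
    # Train one binary logistic regression model per class (one-vs-all).
    m, n = X.shape
    weights = {}
    for cls in classes:
        target = (y == cls).astype(float)  # 1 if this class, else 0
        w, b = np.zeros(n), 0.0
        for _ in range(iterations):
            p = sigmoid(X @ w + b)          # predicted probabilities
            error = p - target
            w -= lr * (X.T @ error) / m     # gradient of the log-loss w.r.t. w
            b -= lr * error.mean()          # gradient w.r.t. the bias
        weights[cls] = (w, b)
    return weights

def predict(X, weights):
    # Run every class model and pick the class with the highest probability.
    classes = list(weights)
    probs = np.column_stack([sigmoid(X @ w + b) for w, b in weights.values()])
    return [classes[i] for i in probs.argmax(axis=1)]
```

Note the one-vs-all structure: each class gets its own independent weight vector and bias, and prediction compares the four probabilities rather than thresholding a single one.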
## Useful Links

+ [Matplotlib documentation](https://matplotlib.org/stable/contents.html)
+ [pandas scatter_matrix](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html)