Monday, June 2, 2025

Detailed ML Learning Journey

6-Month Maths-Focused Machine Learning Program with Enhanced Resources

This program spreads the material from an intensive plan over 6 months, aiming for roughly 1-2 hours of focused study/coding per day on weekdays, with optional longer sessions on weekends for deeper dives or project work.

Month 1: Linear Algebra & Foundational Math

Goal: Build a solid understanding of vectors, matrices, and basic linear algebra operations crucial for ML.

Week 1: Introduction to Vectors
- Day 1: What are Vectors?
  - MMLL: Chapter 2.1 (Vectors)
  - SIA: Chapter 1.1 (Vectors and Linear Combinations)
  - KA: Vectors introduction
  - 3Blue1Brown: What is a vector?
  - Medium: Vectors for Machine Learning Explained Simply
  - Assignment: KA - Vector intro questions
- Day 2: Vector Operations
  - MMLL: Chapter 2.1 (Vector Addition, Scalar Multiplication)
  - SIA: Chapter 1.2 (Lengths and Dot Products)
  - KA: Adding and subtracting vectors
  - YouTube: Vector operations (Khan Academy)
  - Assignment: KA - Performing vector operations
- Day 3: Dot Product & Projections (Basic)
  - MMLL: Chapter 2.1.2 (Inner Product/Dot Product), Chapter 2.2.3 (Orthogonal Projection - Basic idea)
  - SIA: Chapter 1.2 (Lengths and Dot Products)
  - KA: Dot product
  - 3Blue1Brown: Dot products and duality
  - YouTube: Dot products and duality | Chapter 9, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Dot product & vector projections
- Day 4: Linear Combinations & Span
  - MMLL: Chapter 2.1.3 (Linear Combination)
  - SIA: Chapter 1.3 (Matrices) - focus on how columns combine.
  - KA: Linear combinations and span
  - Medium: Linear Combinations and Span (CK-12)
  - YouTube: Linear combinations, span, and basis vectors | Chapter 2, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Determine if a vector is in a given span
- Day 5: Hands-on Vector Operations
  - Assignment: Implement basic vector addition, scalar multiplication, and dot product using NumPy arrays. Create a function for vector magnitude.
- Day 6: Review & Practice
  - Review notes from Week 1. Redo any challenging KA problems.
  - Assignment: Search for "linear algebra vector problems" online, solve 5-10.
- Day 7: Rest/Catch-up
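
The Day 5 assignment could start from something like the sketch below. The vectors `u` and `w` and the helper name `magnitude` are made up for illustration; the projection line ties back to Day 3.

```python
import numpy as np

def magnitude(v):
    """Euclidean length of a vector: sqrt(v . v)."""
    return np.sqrt(np.dot(v, v))

# Example vectors (arbitrary values chosen for easy hand-checking).
u = np.array([1.0, 2.0, 2.0])
w = np.array([3.0, 0.0, -4.0])

total = u + w            # element-wise vector addition
scaled = 2.5 * u         # scalar multiplication
dot_uw = np.dot(u, w)    # 1*3 + 2*0 + 2*(-4) = -5
length_u = magnitude(u)  # sqrt(1 + 4 + 4) = 3

# Projection of u onto w: (u . w / w . w) * w
proj_u_on_w = (dot_uw / np.dot(w, w)) * w

print(total, scaled, dot_uw, length_u, proj_u_on_w)
```

Comparing `magnitude(u)` with `np.linalg.norm(u)` is a quick sanity check on the hand-written version.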
Week 2: Matrices - The Basics
- Day 8: Matrix Definition & Types
  - MMLL: Chapter 2.2 (Matrices)
  - SIA: Chapter 1.4 (Multiplying Matrices) - focus on basic definition.
  - KA: Introduction to matrices
  - Medium: A Complete Guide to Matrices for Machine Learning with Python (Machine Learning Mastery)
  - YouTube: Matrices (and how to multiply them) | Chapter 4, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Matrix dimensions
- Day 9: Matrix Addition & Scalar Multiplication
  - MMLL: Chapter 2.2.1 (Matrix Operations)
  - SIA: Chapter 2.1 (Solving Linear Equations by Elimination) - focus on matrix representation.
  - KA: Adding & subtracting matrices
  - YouTube: Scalar multiplication and addition of matrices (Khan Academy)
  - Assignment: KA - Add & subtract matrices
- Day 10: Matrix Multiplication (The Core)
  - MMLL: Chapter 2.2.1 (Matrix Operations) - Understand row-column dot product.
  - SIA: Chapter 1.4 (Multiplying Matrices)
  - KA: Matrix multiplication
  - 3Blue1Brown: Matrix multiplication as transformations
  - YouTube: Matrix multiplication as composition | Chapter 4, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Multiplying matrices
- Day 11: Transpose & Identity Matrix
  - MMLL: Chapter 2.2.1 (Matrix Operations), 2.2.2 (Identity Matrix)
  - SIA: Chapter 1.4 (Transpose)
  - KA: Transpose of a matrix
  - YouTube: Introduction to the identity matrix (Khan Academy)
  - Assignment: KA - Transposing a matrix
- Day 12: Hands-on Matrix Operations
  - Assignment: Implement matrix addition, scalar multiplication, and matrix multiplication using NumPy. Verify your results against np.dot and np.transpose.
- Day 13: Review & Practice
  - Review notes from Week 2. Focus on matrix multiplication intuition.
  - Assignment: Search for "linear algebra matrix multiplication problems," solve 5-10.
- Day 14: Rest/Catch-up
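
For Days 10 and 12, one way to make the row-by-column rule concrete is to write the loops by hand and check them against NumPy. This is a minimal sketch; the function name `matmul` and the example matrices are illustrative choices.

```python
import numpy as np

def matmul(A, B):
    """Multiply A (m x n) by B (n x p) using row-by-column dot products."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            # Entry (i, j) is the dot product of row i of A and column j of B.
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 x 2
B = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]])     # 2 x 3

C = matmul(A, B)
print(C)
print(np.allclose(C, np.dot(A, B)))                                 # matches NumPy
print(np.allclose(np.transpose(C), np.dot(B.T, A.T)))               # (AB)^T = B^T A^T
```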
Week 3: Systems of Linear Equations & Inverses
- Day 15: Systems of Linear Equations (Matrix Form)
  - MMLL: Chapter 2.3 (Solving Systems of Linear Equations)
  - SIA: Chapter 2.1 (Solving Linear Equations by Elimination)
  - KA: Solving systems of linear equations
  - YouTube: Solving linear systems with matrices (Khan Academy)
  - Assignment: KA - Solutions to systems of equations
- Day 16: Determinants (2x2, 3x3)
  - MMLL: Chapter 2.2.4 (Determinant)
  - SIA: Chapter 5.1 (The Properties of Determinants)
  - KA: Determinant of a 2x2 matrix
  - 3Blue1Brown: The determinant
  - YouTube: The determinant | Chapter 6, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Determinant of a 2x2, 3x3 matrix
- Day 17: Inverse Matrices (Conceptual & Calculation)
  - MMLL: Chapter 2.2.5 (Matrix Inverse)
  - SIA: Chapter 2.2 (Matrix Inverse)
  - KA: Inverse of a 2x2 matrix
  - YouTube: Inverse matrices, column space and null space | Chapter 7, Essence of linear algebra (3Blue1Brown)
  - Assignment: KA - Inverse of a 2x2 matrix
- Day 18: Solving Systems with Inverses
  - MMLL: Chapter 2.3 (Using Matrix Inverse to Solve Systems)
  - SIA: Chapter 2.2 (Solving $A x = b$ with $A^{-1}$)
  - Assignment: Use np.linalg.solve to solve a system of linear equations in Python. Practice calculating inverses with np.linalg.inv.
- Day 19: Hands-on Determinants & Inverses
  - Assignment: Write Python functions to calculate the determinant of a 2x2 matrix and the inverse of a 2x2 matrix from scratch (without np.linalg). Compare with NumPy.
- Day 20: Review & Practice
  - Review notes. Ensure you understand when an inverse exists (non-zero determinant).
  - Assignment: MIT OpenCourseware (MIT 18.06SC Linear Algebra) Problem Set 3, Problem 1 (or similar problems related to inverses).
- Day 21: Rest/Catch-up
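
The Day 18 and Day 19 assignments can be combined in a short sketch like the following. The helper names `det2` and `inv2`, the tolerance `1e-12`, and the example system are assumptions made for illustration.

```python
import numpy as np

def det2(M):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]: ad - bc."""
    (a, b), (c, d) = M
    return a * d - b * c

def inv2(M):
    """Inverse of a 2x2 matrix; it exists only when the determinant is non-zero."""
    det = det2(M)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular (determinant is zero)")
    (a, b), (c, d) = M
    return np.array([[d, -b], [-c, a]]) / det

A = np.array([[4.0, 7.0], [2.0, 6.0]])
rhs = np.array([1.0, 2.0])

x_via_inverse = inv2(A) @ rhs          # x = A^{-1} b
x_via_solve = np.linalg.solve(A, rhs)  # solves A x = b without forming A^{-1}

print(det2(A), np.linalg.det(A))
print(x_via_inverse, x_via_solve)
```

In practice `np.linalg.solve` is usually preferred over multiplying by an explicit inverse, since it avoids an extra source of rounding error.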
Week 4: Eigenvalues, Eigenvectors & Review
- Day 22: Eigenvalues & Eigenvectors - Intuition
  - MMLL: Chapter 2.4 (Eigenvalues and Eigenvectors)
  - SIA: Chapter 6.1 (Introduction to Eigenvalues)
  - 3Blue1Brown: Eigenvectors and eigenvalues
  - Medium: Eigenvalues and Eigenvectors: The Most Intuitive Explanation
  - YouTube: Eigenvectors and eigenvalues | Chapter 14, Essence of linear algebra (3Blue1Brown)
  - Assignment: Watch the 3Blue1Brown video more than once, focusing on the geometric intuition.
- Day 23: Calculating Eigenvalues & Eigenvectors (2x2)
  - MMLL: Chapter 2.4 (Calculation examples)
  - SIA: Chapter 6.1 (Finding Eigenvalues and Eigenvectors)
  - KA: Eigenvalues and eigenvectors
  - Assignment: KA - Find eigenvalues and eigenvectors of a 2x2 matrix
- Day 24: Hands-on Eigen Decomposition
  - Assignment: Use np.linalg.eig to find eigenvalues and eigenvectors of a matrix. Verify $A v = \lambda v$.
- Day 25: Orthogonality (Conceptual)
  - MMLL: Chapter 2.1.2 (Orthogonal Vectors)
  - SIA: Chapter 4.1 (Orthogonal Vectors and Subspaces)
  - KA: Orthogonal vectors
  - YouTube: Orthogonal vectors (Khan Academy)
  - Assignment: Identify orthogonal vectors.
- Day 26: Linear Algebra Review & Mini-Project Prep
  - Review: All concepts from Month 1. Revisit problem areas.
  - Project Prep: Understand the high-level goal of Principal Component Analysis (PCA) as a dimensionality reduction technique (you'll implement a basic version in Month 5). Focus on how it uses eigenvectors.
- Day 27: Rest/Catch-up
- Day 28: Monthly Review & Catch-up
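
A possible starting point for the Day 24 check is sketched below. The symmetric example matrix is an arbitrary choice; because it is symmetric with distinct eigenvalues, its eigenvectors also happen to be orthogonal, which connects to Day 25.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]   # np.linalg.eig stores eigenvectors as columns
    # Verify the defining relation A v = lambda v.
    assert np.allclose(A @ v, lam * v)
    print(f"lambda = {lam:.3f}, v = {v}")

# Orthogonal vectors have a dot product of zero.
print(np.dot(eigenvectors[:, 0], eigenvectors[:, 1]))
```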

Month 2: Calculus, Probability & First ML Algorithm

Goal: Grasp essential calculus concepts for optimization, fundamental probability/statistics, and apply them to your first ML model.

Week 5: Calculus - Derivatives
- Day 29: Functions, Limits, Continuity
  - KA: Limits introduction
  - YouTube: Limits (full playlist) (Khan Academy)
  - Assignment: KA - Evaluating limits from graphs & functions, Continuity problems
- Day 30: Derivatives - Intuition & Power Rule
  - KA: Derivatives introduction
  - 3Blue1Brown: What is a derivative?
  - YouTube: What's a derivative? (3Blue1Brown)
  - Assignment: KA - Power rule problems
- Day 31: Product, Quotient, Chain Rule
  - KA: Product rule
  - YouTube: The Chain Rule (Khan Academy)
  - Assignment: KA - Practice applying product, quotient, and chain rules.
- Day 32: Critical Points & Local Min/Max
  - KA: Maxima & minima on an interval
  - YouTube: Finding critical points (Khan Academy)
  - Assignment: KA - Find critical points and determine local extrema.
- Day 33: Hands-on Symbolic Differentiation (Optional)
  - Assignment: Experiment with SymPy in Python to symbolically differentiate simple functions. (e.g., import sympy; x = sympy.symbols('x'); f = x**2 + 3*x; sympy.diff(f, x))
- Day 34: Review & Practice
  - Assignment: Search for "univariate calculus differentiation problems," solve 5-10.
- Day 35: Rest/Catch-up
Week 6: Calculus - Gradients & Optimization Intro
- Day 36: Functions of Multiple Variables & Partial Derivatives
  - MMLL: Chapter 3.1.2 (Partial Derivatives)
  - KA: Partial derivatives introduction
  - Medium: Derivatives for Multivariable Functions (CodeSignal Learn)
  - YouTube: Partial derivatives (Khan Academy)
  - Assignment: KA - Compute partial derivatives
- Day 37: Gradient Vector
  - MMLL: Chapter 3.1.3 (Gradient)
  - KA: The gradient
  - 3Blue1Brown: The gradient
  - YouTube: The gradient, explained | Chapter 10, Essence of calculus (3Blue1Brown)
  - Assignment: Compute gradient vectors for simple functions (e.g., $f (x, y) = x^{2} y + y^{3}$ ).
- Day 38: Hessian Matrix (Conceptual)
  - MMLL: Chapter 3.1.4 (Hessian) - Understand it represents curvature.
  - KA: The Hessian matrix
  - YouTube: Hessian Matrix (mathematicalmonk)
  - Assignment: No explicit problem; focus on a conceptual understanding of second partial derivatives and the Hessian's purpose.
- Day 39: Introduction to Optimization
  - MMLL: Chapter 4 (Optimization) - Focus on 4.1 (Introduction) and 4.2 (Conditions for Optima).
  - KA: Optimization problems (conceptual)
  - Medium: Machine Learning Optimization: Best Techniques and Algorithms (Neural Concept)
  - Assignment: Understand the goal of optimization in ML (minimizing loss functions).
- Day 40: Hands-on Gradient Calculation
  - Assignment: Write a Python function that calculates the gradient of a simple multivariate function (e.g., $f (x, y) = (x - 2)^{2} + (y - 3)^{2}$ ).
- Day 41: Review & Practice
  - Assignment: Review partial derivatives and gradients. MIT OpenCourseware (MIT 18.02 Multivariable Calculus) Problem Set 1, Problem 1 (or similar).
- Day 42: Rest/Catch-up
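
For Day 40, a hand-derived gradient can be checked against a finite-difference approximation. The sketch below uses the function from the assignment; the helper `numerical_grad` and its step size `h` are illustrative assumptions, not part of the plan.

```python
import numpy as np

def f(p):
    """f(x, y) = (x - 2)^2 + (y - 3)^2, minimized at (2, 3)."""
    x, y = p
    return (x - 2) ** 2 + (y - 3) ** 2

def grad_f(p):
    """Analytic gradient: [df/dx, df/dy] = [2(x - 2), 2(y - 3)]."""
    x, y = p
    return np.array([2 * (x - 2), 2 * (y - 3)])

def numerical_grad(func, p, h=1e-6):
    """Central-difference approximation of the gradient, one coordinate at a time."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

point = np.array([0.0, 0.0])
analytic = grad_f(point)
numeric = numerical_grad(f, point)
print(analytic, numeric)
```

The gradient is zero at (2, 3), which is the minimum that an optimizer would be looking for.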
Week 7: Probability Fundamentals
- Day 43: Basic Probability & Events
  - MMLL: Chapter 5.1 (Probability)
  - KA: Basic probability
  - YouTube: Basic Probability (Khan Academy)
  - Assignment: KA - Simple probability
- Day 44: Conditional Probability & Bayes' Theorem
  - MMLL: Chapter 5.1.2 (Conditional Probability), 5.1.3 (Bayes' Theorem)
  - KA: Conditional probability
  - YouTube: Conditional Probability and Bayes' Theorem (3Blue1Brown - highly visual)
  - Assignment: KA - Conditional probability, Bayes' theorem
- Day 45: Random Variables & Distributions Intro
  - MMLL: Chapter 5.2 (Random Variables)
  - KA: Random variables
  - YouTube: Random Variables (Khan Academy)
  - Assignment: Distinguish between discrete and continuous random variables.
- Day 46: PMF, PDF, CDF
  - MMLL: Chapter 5.2.1 (Probability Mass Function), 5.2.2 (Probability Density Function), 5.2.3 (Cumulative Distribution Function)
  - KA: PMF, PDF, CDF (conceptual)
  - Medium: Understanding Probability Distributions for Machine Learning with Python (Machine Learning Mastery)
  - YouTube: Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) (StatQuest with Josh Starmer)
  - Assignment: Match descriptions to the correct function type (PMF, PDF, CDF).
- Day 47: Hands-on Probability Simulations
  - Assignment: Write Python code to simulate coin flips, dice rolls. Calculate empirical probabilities and compare to theoretical.
- Day 48: Review & Practice
  - Assignment: Review probability concepts. Search for "probability problems with Bayes' theorem" and solve 2-3.
- Day 49: Rest/Catch-up
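
The Day 47 simulation might look like the sketch below. The seed, the number of trials, and the choice of the event "two dice sum to 7" are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_trials = 100_000

# Coin flips: 1 = heads, 0 = tails. Theoretical P(heads) = 1/2.
flips = rng.integers(0, 2, size=n_trials)
p_heads = flips.mean()

# Two fair dice (upper bound of rng.integers is exclusive).
# Theoretical P(sum == 7) = 6/36 = 1/6.
die_a = rng.integers(1, 7, size=n_trials)
die_b = rng.integers(1, 7, size=n_trials)
p_sum_seven = np.mean(die_a + die_b == 7)

print(f"P(heads): empirical {p_heads:.4f}, theoretical {0.5:.4f}")
print(f"P(sum=7): empirical {p_sum_seven:.4f}, theoretical {1 / 6:.4f}")
```

Rerunning with a smaller `n_trials` shows how the empirical estimates wander further from the theoretical values.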
Week 8: Probability Distributions & Linear Regression (Math)
- Day 50: Common Discrete Distributions (Bernoulli, Binomial)
  - MMLL: Chapter 5.2.5 (Bernoulli, Binomial)
  - KA: Bernoulli
  - YouTube: The Binomial Distribution (StatQuest with Josh Starmer)
  - Assignment: KA - Bernoulli, Binomial distribution problems
- Day 51: Normal (Gaussian) Distribution
  - MMLL: Chapter 5.2.6 (Gaussian Distribution)
  - KA: Normal distribution introduction
  - YouTube: The Normal Distribution (StatQuest with Josh Starmer)
  - Assignment: KA - Z-scores and normal distribution probabilities
- Day 52: Expectation & Variance
  - MMLL: Chapter 5.3 (Expectation), 5.4 (Variance and Covariance)
  - KA: Expected value
  - YouTube: Expected Value and Variance Explained (The Organic Chemistry Tutor)
  - Assignment: KA - Calculate expected value and variance for simple distributions.
- Day 53: Simple Linear Regression - Mathematical Formulation
  - MMLL: Chapter 6.1 (Linear Regression) - Focus on 6.1.1 (Model Definition) and 6.1.2 (Squared Error Loss).
  - Medium: Linear regression | Machine Learning (Google for Developers)
  - YouTube: Linear Regression - Fun and Easy Machine Learning (StatQuest with Josh Starmer)
  - Assignment: Understand the equation $y = \beta_0 + \beta_1 x$ and the concept of minimizing the Sum of Squared Errors (SSE).
- Day 54: Least Squares Method - Conceptual
  - MMLL: Chapter 6.1.3 (Least Squares Estimation) - Focus on the intuition of finding the "best fit" line.
  - YouTube: Least Squares Regression (Khan Academy)
  - Assignment: No coding, just understanding the conceptual goal of Least Squares.
- Day 55: Monthly Review & Catch-up
  - Assignment: Review all concepts from Month 2. Focus on the intuition of derivatives, gradients, and how probability distributions describe data.
- Day 56: Rest/Catch-up
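
As a reference for Days 53 and 54, the quantity being minimized can be written out explicitly. The notation here is an assumption added for illustration: $n$ data points $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$.

$$
\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
$$

Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero gives the least-squares estimates:

$$
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}
$$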

Month 3: Regression & Core ML Concepts

Goal: Deepen understanding of regression, explore optimization for ML, and grasp bias-variance.

Week 9: Linear Regression Implementation
- Day 57: Derivation of Coefficients (Calculus)
  - MMLL: Chapter 6.1.3 (Least Squares Estimation) - Understand the partial derivatives of the SSE.
  - YouTube: Deriving the Normal Equation for Linear Regression (StatQuest with Josh Starmer)
  - Assignment: Walk through the derivation of the coefficients for simple linear regression (or watch a video explaining it).
- Day 58: Multiple Linear Regression & Normal Equation (Matrix Form)
  - MMLL: Chapter 6.1.4 (Multiple Linear Regression), 6.1.5 (Normal Equation).
  - YouTube: The Normal Equation (Andrew Ng's ML Course)
  - Assignment: Understand $y = X\beta$ and the Normal Equation $\beta = (X^{T} X)^{-1} X^{T} y$.
- Day 59: Hands-on Simple Linear Regression from Scratch
  - Assignment: Implement simple linear regression from scratch using NumPy. Plot the regression line on a small dataset.
- Day 60: Hands-on Multiple Linear Regression (Normal Equation)
  - Assignment: Implement multiple linear regression using the Normal Equation with NumPy. Test on a synthetic dataset.
- Day 61: Assumptions of Linear Regression
  - MMLL: Chapter 6.1.6 (Assumptions)
  - YouTube: Assumptions of Linear Regression (StatQuest with Josh Starmer)
  - Assignment: List and understand the key assumptions (linearity, independence, homoscedasticity, normality of errors).
- Day 62: Review & Practice
  - Assignment: Review linear regression. Search for "linear regression normal equation problems" and solve one.
- Day 63: Rest/Catch-up
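
For Day 60, a minimal sketch of the Normal Equation on synthetic data is shown below. The coefficient values, noise level, and sample size are invented for the example. The code solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something the plan prescribes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 200

# Synthetic data: two features plus an intercept column of ones.
X = rng.normal(size=(n_samples, 2))
true_beta = np.array([1.5, -2.0, 0.5])      # intercept, then two slopes
X_design = np.column_stack([np.ones(n_samples), X])
y = X_design @ true_beta + rng.normal(scale=0.1, size=n_samples)

# Normal Equation: solve (X^T X) beta = X^T y for beta.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print("true beta:     ", true_beta)
print("estimated beta:", beta_hat)
```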
Week 10: Gradient Descent in Depth
- Day 64: Gradient Descent for Linear Regression
  - MMLL: Chapter 6.1.7 (Gradient Descent)
  - YouTube: Gradient Descent, Step-by-Step (StatQuest with Josh Starmer)
  - Assignment: Understand the update rule $\beta_{\text{new}} = \beta_{\text{old}} - \alpha \nabla J(\beta_{\text{old}})$.
- Day 65: Learning Rate & Convergence
  - MMLL: Chapter 4.3 (Gradient Descent) - Focus on learning rate.
  - YouTube: How to choose a learning rate for gradient descent (sentdex)
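  - Assignment: Implement batch gradient descent for the linear regression model from Days 59-60, then experiment with different learning rates. Observe divergence when the rate is too large and slow convergence when it is too small.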
- Day 66: Stochastic Gradient Descent (SGD) - Intuition
  - MMLL: Chapter 4.3.2 (Stochastic Gradient Descent)
  - Medium: Different Variants of Gradient Descent (GeeksforGeeks)
  - YouTube: Stochastic Gradient Descent, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand why SGD is faster on large datasets: each update uses a single example, so it costs far less than a full pass over the data.
- Day 67: Mini-batch Gradient Descent
  - MMLL: Chapter 4.3.3 (Mini-batch Gradient Descent)
  - YouTube: Mini-Batch Gradient Descent (DeepLearning.AI)
  - Assignment: Understand how mini-batch GD trades off the stable but expensive updates of batch GD against the cheap but noisy updates of SGD.
- Day 68: Hands-on SGD for Linear Regression
  - Assignment: Implement SGD for linear regression in NumPy. Compare its performance to batch GD on a slightly larger dataset.
- Day 69: Review & Practice
  - Assignment: Review Gradient Descent variants. MIT 6.036 (Introduction to Machine Learning) problem set on Gradient Descent (search for recent versions).
- Day 70: Rest/Catch-up
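
The Week 10 hands-on work (batch GD and SGD for linear regression) could be sketched as follows. The learning rates, iteration counts, and synthetic data are assumptions picked so that both variants converge on this particular problem; other data will likely need different values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 500
X = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n_samples)

def mse_gradient(X_batch, y_batch, beta):
    """Gradient of J(beta) = mean((X beta - y)^2) on the given batch."""
    residual = X_batch @ beta - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

def batch_gd(X, y, alpha=0.1, n_iters=500):
    """Batch GD: every update uses the full dataset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        beta = beta - alpha * mse_gradient(X, y, beta)
    return beta

def sgd(X, y, alpha=0.01, n_epochs=20, seed=1):
    """SGD: every update uses one example, visited in a shuffled order."""
    order_rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in order_rng.permutation(len(y)):
            beta = beta - alpha * mse_gradient(X[i:i + 1], y[i:i + 1], beta)
    return beta

beta_gd = batch_gd(X, y)
beta_sgd = sgd(X, y)
print("batch GD:", beta_gd)
print("SGD:     ", beta_sgd)
```

Raising `alpha` in `batch_gd` well above the default is a simple way to observe the divergence mentioned on Day 65. The SGD estimate keeps jittering around the optimum because each single-example gradient is noisy.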
Week 11: Regularization
- Day 71: Overfitting & Underfitting
  - MMLL: Chapter 6.3 (Regularization) - Intro.
  - YouTube: Overfitting vs. Underfitting (StatQuest with Josh Starmer)
  - Assignment: Understand the concepts of overfitting and underfitting. Identify them visually.
- Day 72: Ridge Regression (L2 Regularization) - Math
  - MMLL: Chapter 6.3.1 (Ridge Regression)
  - Medium: Ridge Regression: L2 Regularization Explained with Examples
  - YouTube: L1 and L2 Regularization, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the added $\lambda \lVert \beta \rVert_2^2$ term in the loss function and its effect on coefficients.
- Day 73: Lasso Regression (L1 Regularization) - Math
  - MMLL: Chapter 6.3.2 (Lasso Regression)
  - Medium: Lasso Regression Explained with Examples
  - YouTube: L1 and L2 Regularization (StatQuest with Josh Starmer)
  - Assignment: Understand the added $\lambda \lVert \beta \rVert_1$ term and its ability to induce sparsity (feature selection).
- Day 74: Hands-on Regularization (Scikit-learn)
  - Assignment: Use sklearn.linear_model.Ridge and sklearn.linear_model.Lasso. Experiment with the alpha parameter on a dataset prone to overfitting.
- Day 75: Choosing Lambda (Regularization Strength)
  - Assignment: Research cross-validation as a method for selecting hyperparameters like lambda. (No implementation yet, just conceptual).
- Day 76: Review & Practice
  - Assignment: Search for "Ridge vs Lasso explained" or "regularization in linear regression problems."
- Day 77: Rest/Catch-up
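
To see the shrinkage effect from Day 72 without a library, ridge regression has a closed form that extends the Normal Equation. The sketch below is a NumPy stand-in rather than the scikit-learn workflow from Day 74. It assumes no intercept term, and the data and the helper name `ridge_fit` are made up for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda = {lam:6.1f}   ||beta||_2 = {norms[-1]:.3f}")
```

The coefficient norm shrinks as the penalty grows; with `lam = 0.0` the result is the ordinary least-squares fit.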
Week 12: Polynomial Regression & Bias-Variance
- Day 78: Polynomial Regression
  - MMLL: Chapter 6.2 (Polynomial Regression)
  - YouTube: Polynomial Regression Explained (StatQuest with Josh Starmer)
  - Assignment: Understand how polynomial features are created. Implement polynomial regression using sklearn.preprocessing.PolynomialFeatures and LinearRegression.
- Day 79: Bias-Variance Trade-off - Intuition
  - MMLL: Chapter 6.4 (Bias-Variance Decomposition) - Focus on 6.4.1 (Introduction).
  - Medium: Bias Variance Tradeoff (MLU-Explain - interactive)
  - YouTube: Bias and Variance (StatQuest with Josh Starmer)
  - Assignment: Conceptualize bias (model's simplifying assumptions) and variance (model's sensitivity to training data).
- Day 80: Mathematical Breakdown of Bias-Variance
  - MMLL: Chapter 6.4.2 (Derivation) - Go through the derivation of the MSE decomposition (if comfortable, otherwise understand the terms).
  - Assignment: Understand how the expected MSE decomposes as Bias² + Variance + Noise, where the noise term is the irreducible error.
- Day 81: Visualizing Bias-Variance
  - Assignment: Search for online visualizations of the bias-variance trade-off (e.g., using target practice analogy).
- Day 82: Hands-on Bias-Variance Example
  - Assignment: Create a synthetic dataset. Fit a low-degree polynomial (high bias) and a high-degree polynomial (high variance) to it. Plot and observe the fit and generalization.
- Day 83: Monthly Review & Project Prep
  - Review: All concepts from Month 3. Focus on regression, optimization, regularization, and bias-variance.
  - Project Prep: Brainstorm simple regression datasets you could use for a mini-project (e.g., house price prediction, car mileage).
- Day 84: Rest/Catch-up
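
The Day 82 experiment might be set up along these lines. The sine target, noise level, sample sizes, and polynomial degrees are all illustrative assumptions; `np.polyfit` is used here in place of the scikit-learn pipeline from Day 78 to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_data(n):
    """Noisy samples from y = sin(pi * x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Training error can only go down as the degree increases, since each higher-degree model contains the lower-degree ones. The gap between training and test error is the thing to watch: a large gap at high degree is the typical signature of high variance.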

Month 4: Classification Algorithms

Goal: Understand the mathematical underpinnings of key classification models.

Week 13: Logistic Regression
- Day 85: Logistic Regression - Concept & Sigmoid
  - MMLL: Chapter 7.1 (Logistic Regression) - Focus on 7.1.1 (Binary Classification) and 7.1.2 (Sigmoid function).
  - YouTube: Logistic Regression, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand how the sigmoid function maps linear output to a probability between 0 and 1. Plot the sigmoid.
- Day 86: Cross-Entropy Loss Function
  - MMLL: Chapter 7.1.3 (Likelihood and Loss Function) - Understand why squared error isn't suitable and why cross-entropy is used.
  - Medium: Cross-Entropy Loss Function in Machine Learning: Enhancing Model Accuracy (DataCamp)
  - YouTube: Cross Entropy Demystified (A.I. in 5)
  - Assignment: Understand the formula for binary cross-entropy loss.
- Day 87: Gradient Descent for Logistic Regression
  - MMLL: Chapter 7.1.4 (Gradient Descent) - Understand the update rule (similar to linear regression, but with different derivatives).
  - YouTube: Logistic Regression Details: Calculating the Gradient (Andrew Ng's ML Course)
  - Assignment: Walk through the derivation of the gradients (or watch a detailed explanation).
- Day 88: Hands-on Logistic Regression from Scratch
  - Assignment: Implement binary logistic regression from scratch using NumPy (forward pass, loss, and gradient descent update). Test on a simple synthetic dataset.
- Day 89: Hands-on Logistic Regression (Scikit-learn)
  - Assignment: Use sklearn.linear_model.LogisticRegression. Compare results with your custom implementation. Understand predict_proba.
- Day 90: Review & Practice
  - Assignment: Review logistic regression. Search for "logistic regression explained math" and re-read.
- Day 91: Rest/Catch-up
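
For Day 88, the forward pass, loss, and gradient descent update fit in a few lines of NumPy. This is a minimal sketch under stated assumptions: two well-separated Gaussian blobs as data, and a learning rate and iteration count chosen by hand for this example.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fit_logistic(X, y, alpha=0.1, n_iters=2000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)             # forward pass
        error = p - y                      # gradient of the loss w.r.t. the linear output
        w -= alpha * X.T @ error / len(y)  # gradient descent update for the weights
        b -= alpha * error.mean()          # gradient descent update for the bias
    return w, b

rng = np.random.default_rng(seed=0)
X = np.vstack([rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=2.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = fit_logistic(X, y)
probabilities = sigmoid(X @ w + b)
accuracy = np.mean((probabilities >= 0.5) == y)
loss = binary_cross_entropy(y, probabilities)
print(f"accuracy = {accuracy:.3f}, loss = {loss:.4f}")
```

The `probabilities` array plays the same role as the positive-class column of `predict_proba` in scikit-learn, which makes the Day 89 comparison straightforward.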
Week 14: Softmax & SVMs
- Day 92: Softmax Regression (Multinomial Logistic Regression)
  - MMLL: Chapter 7.2 (Softmax Regression)
  - YouTube: Softmax Regression (Andrew Ng's ML Course)
  - Assignment: Understand the softmax function and its use for multi-class classification.
- Day 93: Categorical Cross-Entropy Loss
  - MMLL: Chapter 7.2 (Loss function).
  - YouTube: Categorical Cross Entropy Explained (StatQuest with Josh Starmer - same as binary, just extended)
  - Assignment: Understand the extension of cross-entropy to multiple classes.
- Day 94: Support Vector Machines (SVMs) - Hyperplane & Margin
  - MMLL: Chapter 8.1 (Support Vector Machines) - Focus on 8.1.1 (The Linear Classifier).
  - YouTube: Support Vector Machines (SVM) — The math of intelligence (mathematics for machine learning)
  - Assignment: Understand the goal of SVM: finding the maximum margin hyperplane.
- Day 95: Kernel Trick (Conceptual)
  - MMLL: Chapter 8.2 (Non-linear SVM) - Focus on the idea of mapping data to higher dimensions implicitly.
  - Medium: Kernel Trick in Support Vector Classification (GeeksforGeeks)
  - YouTube: SVMs and the Kernel Trick (StatQuest with Josh Starmer)
  - Assignment: No explicit math problem; focus on understanding how the kernel trick makes linearly inseparable data separable.
- Day 96: Hands-on SVM (Scikit-learn)
  - Assignment: Use sklearn.svm.SVC. Experiment with different kernels (linear, rbf, poly) on a dataset like Iris or circles/moons.
- Day 97: Review & Practice
  - Assignment: Review Softmax and SVMs. Search for "SVM kernel trick explained" if needed.
- Day 98: Rest/Catch-up
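
The softmax function and categorical cross-entropy from Days 92 and 93 can be sketched as below. Subtracting the row maximum before exponentiating is a standard stability trick that does not change the result; the example logits and labels are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability of the true class (labels are class indices)."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(np.clip(true_class_probs, eps, 1.0)))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0],
                   [1000.0, 0.0, -1000.0]])   # extreme values to exercise stability
labels = np.array([0, 2, 0])

probs = softmax(logits)
loss = categorical_cross_entropy(probs, labels)
print(probs)
print(loss)
```

With two classes, softmax reduces to the sigmoid used in logistic regression, which is why the loss is the same formula extended to more classes.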
Week 15: Decision Trees & Ensembles Intro
- Day 99: Decision Trees - Basics & Splitting
  - MMLL: Chapter 10.1 (Decision Trees)
  - YouTube: Decision Trees, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand how decision trees make predictions by splitting data.
- Day 100: Gini Impurity & Entropy (Mathematical Definitions)
  - MMLL: Chapter 10.1.1 (Impurity Measures)
  - YouTube: Decision Trees Part 2: Gini Impurity and Information Gain (StatQuest with Josh Starmer)
  - Assignment: Understand the formulas for Gini impurity and entropy. Calculate them for simple data splits.
- Day 101: Information Gain
  - MMLL: Chapter 10.1.2 (Information Gain)
  - YouTube: Information Gain in Decision Tree (Machine Learning with Phil)
  - Assignment: Understand how information gain is used to choose the best split.
- Day 102: Introduction to Ensemble Methods (Bagging)
  - MMLL: Chapter 10.2 (Ensemble Methods) - Focus on 10.2.1 (Bagging/Random Forests).
  - Medium: Random Forest Algorithm in Machine Learning (GeeksforGeeks)
  - YouTube: Bagging and Random Forests, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the concept of "wisdom of crowds" and how bagging works.
- Day 103: Boosting (Conceptual)
  - MMLL: Chapter 10.2.2 (Boosting).
  - YouTube: Boosting (AdaBoost), Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the sequential nature of boosting (correcting previous errors).
- Day 104: Hands-on Decision Tree & Random Forest (Scikit-learn)
  - Assignment: Use sklearn.tree.DecisionTreeClassifier and sklearn.ensemble.RandomForestClassifier. Compare their performance.
- Day 105: Rest/Catch-up
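
The impurity calculations from Days 100 and 101 are small enough to do in plain Python. The sketch below uses a made-up split of ten labels; the function names are illustrative.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Fraction of samples belonging to each class present in labels."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * log2(p) for p in class_proportions(labels))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(gini(parent), entropy(parent))
print(information_gain(parent, left, right))
```

A tree builder would evaluate `information_gain` for every candidate split and keep the one with the largest value.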
Week 16: KNN & Naive Bayes, Classification Project
- Day 106: K-Nearest Neighbors (KNN)
  - Medium: K-Nearest Neighbors Algorithm: An Intuitive Guide
  - YouTube: K-nearest Neighbors (KNN), Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the algorithm: "lazy learner," distance metrics (Euclidean, Manhattan). Implement KNN from scratch for a tiny dataset.
- Day 107: Naive Bayes - Intuition
  - MMLL: Chapter 5.1.3 (Bayes' Theorem application - conceptually)
  - YouTube: Naive Bayes, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the "naive" assumption of conditional independence.
- Day 108: Different Naive Bayes Variants (Conceptual)
  - Assignment: Research Gaussian, Multinomial, and Bernoulli Naive Bayes and when to use them.
- Day 109: Hands-on Naive Bayes (Scikit-learn)
  - Assignment: Use sklearn.naive_bayes.GaussianNB or MultinomialNB.
- Day 110: Monthly Review & Project Prep
  - Review: All classification algorithms.
  - Project Prep: Prepare for a classification project.
- Day 111: Classification Project
  - Assignment: Choose a classification dataset (e.g., Pima Indians Diabetes, Titanic). Apply at least 3 different classification algorithms learned (e.g., Logistic Regression, SVM, Random Forest). Evaluate performance using appropriate metrics (accuracy, precision, recall, F1-score).
- Day 112: Rest/Catch-up
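
For the Day 106 assignment, KNN on a tiny dataset might look like this sketch. The six training points, the default `k=3`, and the function name `knn_predict` are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, metric="euclidean"):
    """Classify one query point by majority vote among its k nearest neighbors."""
    diffs = X_train - query
    if metric == "euclidean":
        distances = np.sqrt(np.sum(diffs ** 2, axis=1))
    elif metric == "manhattan":
        distances = np.sum(np.abs(diffs), axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))
print(knn_predict(X_train, y_train, np.array([6.4, 6.4]), metric="manhattan"))
```

There is no training step at all, which is what "lazy learner" refers to: all the work happens at prediction time.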

Month 5: Unsupervised Learning & Optimization Deep Dive

Goal: Explore methods for finding patterns in unlabeled data and delve deeper into optimization techniques.

Week 17: Clustering
- Day 113: K-Means Clustering - Objective Function
  - MMLL: Chapter 9.1 (K-Means Clustering)
  - YouTube: K-Means Clustering, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the goal: minimizing within-cluster sum of squares.
- Day 114: Lloyd's Algorithm
  - MMLL: Chapter 9.1.1 (Algorithm).
  - Assignment: Walk through the steps of Lloyd's algorithm.
- Day 115: Hands-on K-Means from Scratch
  - Assignment: Implement K-Means from scratch using NumPy. Test on a simple 2D dataset and visualize clusters.
- Day 116: Hierarchical Clustering (Conceptual)
  - MMLL: Chapter 9.2 (Hierarchical Clustering) - Focus on agglomerative and dendrograms.
  - YouTube: Hierarchical Clustering, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand how dendrograms are formed.
- Day 117: Hands-on Hierarchical Clustering (SciPy)
  - Assignment: Use scipy.cluster.hierarchy to perform hierarchical clustering and plot a dendrogram.
- Day 118: Review & Practice
  - Assignment: Review clustering algorithms. Search for "K-Means problems."
- Day 119: Rest/Catch-up
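
Lloyd's algorithm from Days 114 and 115 alternates an assignment step and an update step. The sketch below assumes random initialization from the data points and keeps an old centroid if its cluster becomes empty; both are simple choices among several possible ones. The two-blob dataset is synthetic.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break   # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

data_rng = np.random.default_rng(seed=42)
X = np.vstack([
    data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    data_rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(X, k=2)
# Within-cluster sum of squares: the objective K-Means tries to minimize.
wcss = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(2))
print(centroids)
print(wcss)
```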
Week 18: Dimensionality Reduction (PCA)
- Day 120: PCA - Recap & Covariance Matrix
  - MMLL: Chapter 11.1 (Principal Component Analysis) - Revisit from Linear Algebra section. Focus on the role of the covariance matrix.
  - Medium: Principal component analysis (Wikipedia - good overview)
  - YouTube: PCA - Principal Component Analysis, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the definition of covariance and the covariance matrix.
- Day 121: PCA - Eigenvalues & Eigenvectors for Reduction
  - MMLL: Chapter 11.1.1 (PCA Algorithm)
  - YouTube: PCA Algorithm (Mathematicalmonk)
  - Assignment: Understand how eigenvectors correspond to principal components and eigenvalues to explained variance.
- Day 122: Singular Value Decomposition (SVD) for PCA
  - MMLL: Chapter 11.1.2 (SVD for PCA)
  - SIA: Chapter 7.2 (Singular Value Decomposition)
  - YouTube: Singular Value Decomposition (SVD) and PCA (StatQuest with Josh Starmer)
  - Assignment: Understand that SVD provides a numerically stable way to compute PCA without explicitly forming the covariance matrix.
- Day 123: Hands-on PCA from Scratch (using SVD)
  - Assignment: Implement PCA from scratch using NumPy's np.linalg.svd. Apply it to a high-dimensional dataset (e.g., MNIST digits) and visualize the first 2 components.
- Day 124: Hands-on PCA (Scikit-learn)
  - Assignment: Use sklearn.decomposition.PCA and compare results with your custom implementation. Understand explained_variance_ratio_.
- Day 125: Review & Practice
  - Assignment: Review PCA. Search for "PCA explained visually."
- Day 126: Rest/Catch-up
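
The Day 123 implementation could follow the outline below. To stay self-contained, this sketch uses a small synthetic dataset instead of MNIST: three correlated features driven by one latent variable, which is an assumption made purely for the example.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the centered data matrix."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                  # rows are principal directions
    explained_variance = S ** 2 / (len(X) - 1)      # eigenvalues of the covariance matrix
    ratio = explained_variance / explained_variance.sum()
    projected = X_centered @ components.T           # coordinates in the reduced space
    return projected, components, ratio[:n_components]

rng = np.random.default_rng(seed=0)
latent = rng.normal(size=(300, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + rng.normal(scale=0.05, size=(300, 3))

projected, components, ratio = pca_svd(X, n_components=2)
print("explained variance ratio:", ratio)
print("first principal direction:", components[0])
```

The returned `ratio` is intended to correspond to `explained_variance_ratio_` in scikit-learn's PCA, so it gives a direct point of comparison for Day 124. Individual components may differ in sign between implementations, since a principal direction is only defined up to sign.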
Week 19: Advanced Optimization
- Day 127: Limitations of Basic Gradient Descent
  - Assignment: Understand local minima, saddle points, and issues with learning rate (e.g., slow convergence, oscillations).
- Day 128: Momentum
  - MMLL: Chapter 4.3.4 (Momentum)
  - YouTube: Gradient Descent with Momentum, Clearly Explained!!! (StatQuest with Josh Starmer)
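  - Assignment: Understand how momentum accelerates GD along consistent gradient directions, dampens oscillations, and can help the optimizer roll through shallow local minima and plateaus.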
- Day 129: Adagrad & RMSprop (Conceptual)
  - MMLL: Chapter 4.3.5 (AdaGrad), 4.3.6 (RMSprop)
  - Medium: Deep Learning Optimization Algorithms (Neptune.ai)
  - YouTube: Adagrad, RMSProp, Adam, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the concept of adaptive learning rates per parameter.
- Day 130: Adam Optimizer (Conceptual)
  - MMLL: Chapter 4.3.7 (Adam)
  - YouTube: Adam Optimizer, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand Adam as a combination of Momentum and RMSprop ideas.
- Day 131: Hands-on Optimizers (TensorFlow/PyTorch)
  - Assignment: Build a simple linear regression model using TensorFlow or PyTorch. Experiment with SGD, Adam, and RMSprop optimizers, observing their convergence behavior.
- Day 132: Review & Practice
  - Assignment: Search for "deep learning optimizers explained" videos/articles. Focus on their mathematical update rules.
- Day 133: Rest/Catch-up
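
Before moving to a framework on Day 131, the effect of momentum from Day 128 can be seen on a toy problem. The sketch below uses an ill-conditioned quadratic bowl; the curvatures, learning rate, momentum coefficient, and step count are assumptions chosen so the difference is easy to see.

```python
import numpy as np

# f(x) = 0.5 * sum(curvatures * x**2): steep in one direction, flat in the other.
curvatures = np.array([1.0, 100.0])

def loss(x):
    return 0.5 * np.sum(curvatures * x ** 2)

def grad(x):
    return curvatures * x

def gradient_descent(x0, alpha, n_steps):
    x = x0.copy()
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

def momentum_descent(x0, alpha, beta, n_steps):
    x = x0.copy()
    velocity = np.zeros_like(x)
    for _ in range(n_steps):
        velocity = beta * velocity + grad(x)   # decaying sum of past gradients
        x = x - alpha * velocity
    return x

x0 = np.array([10.0, 1.0])
x_gd = gradient_descent(x0, alpha=0.01, n_steps=200)
x_mom = momentum_descent(x0, alpha=0.01, beta=0.9, n_steps=200)

print("plain GD loss:", loss(x_gd))
print("momentum loss:", loss(x_mom))
```

The learning rate is limited by the steep direction, so plain GD crawls along the flat one. Momentum accumulates the small but consistent gradients in that flat direction and makes faster progress with the same learning rate.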
Week 20: Convex Optimization (Conceptual) & Unsupervised Project
- Day 134: Convex Sets & Functions (Conceptual)
  - MMLL: Chapter 4.2 (Conditions for Optima) - Focus on the definition of convex functions.
  - Medium: Convex Optimization: Why it is so important in Machine Learning?
  - YouTube: Convexity in Machine Learning (Andrew Ng's ML Course)
  - Assignment: Understand what makes a function convex and why it's desirable for optimization (any local minimum is also a global minimum).
- Day 135: Why Convexity Matters in ML
  - Assignment: Understand that many traditional ML models (linear regression, logistic regression with cross-entropy) have convex loss functions, so gradient descent with a suitable learning rate converges to a global optimum rather than getting stuck in a poor local one.
- Day 136: Anomaly Detection (Brief Introduction)
  - MMLL: Chapter 9.3 (Anomaly Detection) - Basic overview.
  - YouTube: Anomaly Detection, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the goal of anomaly detection. Explore simple statistical methods (e.g., Z-score).
- Day 137: Monthly Review & Project Prep
  - Review: All concepts from Month 5.
  - Project Prep: Prepare for an unsupervised learning project.
- Day 138: Unsupervised Learning Project
  - Assignment: Take a dataset (e.g., customers with features, or gene expression data). Perform K-Means clustering and PCA. Visualize the clusters in reduced dimensions. Interpret the results.
- Day 139: Rest/Catch-up
- Day 140: Monthly Review & Catch-up
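
The Z-score method mentioned on Day 136 can be sketched as follows. The helper name `zscore_outliers`, the threshold of 3 and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    # Flag points whose distance from the mean exceeds `threshold` standard deviations
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold), z

# Made-up readings: 19 values near 10 and one obvious anomaly at the end
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
outlier_idx, z = zscore_outliers(readings)
print(outlier_idx, z[outlier_idx])
```

One caveat worth remembering: the outlier itself inflates the mean and standard deviation, so with 10 or fewer points no value can ever exceed |z| = 3.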

Month 6: Deep Learning Fundamentals

Goal: Understand the core mathematical principles behind neural networks and popular architectures.

Week 21: Neural Network Basics & Forward Pass
- Day 141: Perceptrons & Biological Analogy
  - MMLL: Chapter 12.1 (Feedforward Neural Networks) - Focus on the basic unit.
  - YouTube: The Perceptron (Andrew Ng's ML Course)
  - Assignment: Understand how a single perceptron works.
- Day 142: Activation Functions (Sigmoid, Tanh, ReLU)
  - MMLL: Chapter 12.1.2 (Activation Functions)
  - YouTube: Activation Functions, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the mathematical forms and properties of each activation function. Plot them.
- Day 143: Feedforward Neural Networks (MLPs) - Architecture
  - MMLL: Chapter 12.1.1 (Layered Architectures).
  - YouTube: Neural Networks Demystified [Part 1: Data & Network Representation] (Welch Labs)
  - Assignment: Draw an MLP architecture with input, hidden, and output layers.
- Day 144: Forward Propagation - Matrix Math
  - MMLL: Chapter 12.1.3 (Feedforward Pass)
  - YouTube: Forward Propagation, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand how each layer's output is calculated using matrix multiplication and activation functions.
- Day 145: Hands-on MLP Forward Pass (NumPy)
  - Assignment: Implement a simple 2-layer MLP (input, 1 hidden, output) forward pass using NumPy. Use random weights and biases.
- Day 146: Review & Practice
  - Assignment: Review the forward pass. Search for "neural network forward propagation example" and trace the calculations.
- Day 147: Rest/Catch-up
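
A possible starting point for the Day 145 forward pass is sketched below. The layer sizes, the choice of ReLU for the hidden layer and sigmoid for the output, and all variable names are assumptions; any other activation from Day 142 would slot in the same way.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Hidden layer: affine map followed by a non-linearity
    A1 = relu(X @ W1 + b1)
    # Output layer: affine map followed by sigmoid, giving values in (0, 1)
    A2 = sigmoid(A1 @ W2 + b2)
    return A1, A2

rng = np.random.default_rng(0)
n_samples, n_in, n_hidden, n_out = 5, 3, 4, 1   # assumed sizes
X = rng.normal(size=(n_samples, n_in))          # one row per sample
W1 = rng.normal(size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_out))
b2 = np.zeros(n_out)

A1, A2 = forward(X, W1, b1, W2, b2)
print(A1.shape, A2.shape)
```

Storing samples as rows means the whole batch goes through each layer with one matrix multiplication, which is the "matrix math" of Day 144.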
Week 22: Backpropagation - The Core of Learning
- Day 148: Backpropagation - The Chain Rule in Action
  - MMLL: Chapter 12.2 (Backpropagation) - Focus on 12.2.1 (Gradient Calculation).
  - 3Blue1Brown: Backpropagation Calculus
  - Medium: Deep Learning (Part 27)-Backpropagation Intuition (Scribd - based on a Medium article)
  - YouTube: Backpropagation Calculus | Chapter 4, Deep learning (3Blue1Brown)
  - Assignment: Understand that backpropagation is just repeated application of the chain rule.
- Day 149: Deriving Gradients for Output Layer
  - MMLL: Chapter 12.2.2 (Output Layer Gradients)
  - YouTube: Backpropagation for a Neural Network (Andrew Ng's ML Course)
  - Assignment: Walk through the derivation of gradients for the output layer's weights and biases (e.g., for MSE loss).
- Day 150: Deriving Gradients for Hidden Layers
  - MMLL: Chapter 12.2.3 (Hidden Layer Gradients)
  - Assignment: Understand how errors are "propagated backward" to update hidden layer weights.
- Day 151: Hands-on Backpropagation (Simple MLP in NumPy)
  - Assignment: Extend your MLP from Day 145 to include a loss function (e.g., MSE) and implement the backpropagation algorithm to update weights and biases. This is a significant challenge, but highly rewarding.
- Day 152: Loss Functions in Deep Learning (Recap)
  - MMLL: Chapter 12.1.4 (Loss Functions)
  - YouTube: Loss Functions for Machine Learning (StatQuest with Josh Starmer)
  - Assignment: Revisit MSE (for regression) and Cross-Entropy (for classification) in the context of NNs.
- Day 153: Hands-on Basic NN with Framework (TensorFlow/PyTorch)
  - Assignment: Build a basic MLP using TensorFlow Keras or PyTorch. Train it on a simple dataset (e.g., MNIST digits). Focus on model.compile/model.fit or the training loop.
- Day 154: Rest/Catch-up
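
Before tackling the full Day 151 exercise, it may help to check the chain rule on a single sigmoid neuron with MSE loss. The sketch below compares the hand-derived gradients with a numerical estimate; the function names and the input values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_loss(w, b, x, y):
    a = sigmoid(np.dot(w, x) + b)
    return 0.5 * (a - y) ** 2

def neuron_grads(w, b, x, y):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    a = sigmoid(np.dot(w, x) + b)
    dL_da = a - y
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz    # dz/dw = x, dz/db = 1

w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])
y = 1.0
dw, db = neuron_grads(w, b, x, y)

# Numerical gradient by central differences, used only as a sanity check
eps = 1e-6
num_dw = np.array([
    (neuron_loss(w + eps * e, b, x, y) - neuron_loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(2)
])
num_db = (neuron_loss(w, b + eps, x, y) - neuron_loss(w, b - eps, x, y)) / (2 * eps)
print(dw, num_dw, db, num_db)
```

The same gradient check works on a full MLP and is a handy way to catch sign or transpose mistakes in a backpropagation implementation.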
Week 23: CNNs - Convolutions Explained
- Day 155: Convolution Operation - Mathematical Definition
  - 3Blue1Brown: What is a convolution?
  - YouTube: What is a convolution? | Chapter 2, Deep learning (3Blue1Brown)
  - Assignment: Understand the concept of a filter sliding over an image, performing element-wise multiplication and summation.
- Day 156: Padding & Stride
  - YouTube: CNNs Part 2: Padding and Strides (DeepLearning.AI)
  - Assignment: Understand how padding (same, valid) and stride affect the output dimensions of a convolution.
- Day 157: Pooling Layers (Max Pooling, Average Pooling)
  - YouTube: CNNs Part 3: Pooling Layers (DeepLearning.AI)
  - Assignment: Understand the purpose of pooling (downsampling, translation invariance).
- Day 158: Hands-on Basic Convolution (NumPy)
  - Assignment: Implement a simple 2D convolution operation (without padding/stride) using NumPy. Test it on a small matrix and a simple filter.
- Day 159: CNN Architecture Overview (Conceptual)
  - YouTube: Convolutional Neural Networks (CNNs), Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the typical sequence of layers in a CNN (Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> Flatten -> Dense).
- Day 160: Hands-on CNN (TensorFlow/PyTorch)
  - Assignment: Build a simple CNN for image classification (e.g., Fashion MNIST or CIFAR-10) using your chosen framework.
- Day 161: Rest/Catch-up
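
For Days 155-158, here is a minimal sketch of a "valid" 2D convolution (no padding, stride 1), plus the usual output-size formula for padding and stride. The function names are assumptions. Like most deep learning code, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; multiply element-wise and sum at each position
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def conv_output_size(n, k, padding=0, stride=1):
    # Output length along one dimension: floor((n + 2p - k) / s) + 1
    return (n + 2 * padding - k) // stride + 1

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
result = conv2d_valid(image, kernel)
print(result)
print(conv_output_size(28, 3), conv_output_size(28, 3, padding=1), conv_output_size(28, 2, stride=2))
```

With a 3x3 filter, padding of 1 keeps a 28-pixel side at 28 ("same"), while no padding shrinks it to 26 ("valid").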
Week 24: RNNs & Advanced Concepts (High-Level)
- Day 162: Recurrent Neural Networks (RNNs) - Math of Recurrence
  - MMLL: Chapter 12.4 (Recurrent Neural Networks) - Focus on the concept of sequence and shared weights.
  - YouTube: Recurrent Neural Networks (RNN) and LSTMs, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand how RNNs process sequences by maintaining a hidden state.
- Day 163: Vanishing/Exploding Gradients in RNNs
  - YouTube: Vanishing and Exploding Gradients in Neural Networks (DeepLearning.AI)
  - Assignment: Understand why standard RNNs struggle with long-term dependencies (conceptual, related to repeated matrix multiplication in backprop).
- Day 164: LSTMs & GRUs (Conceptual)
  - Medium: The Ultimate Showdown: RNN vs LSTM vs GRU – Which is the Best? (Shiksha Online)
  - YouTube: LSTMs, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand that LSTMs and GRUs mitigate vanishing gradient issues using "gates" (input, forget, and output gates in an LSTM; update and reset gates in a GRU). No need for full math, just the idea.
- Day 165: Attention Mechanisms & Transformers (Very High-Level)
  - YouTube: Attention Mechanism (for LSTMs and RNNs) - clearly explained!!! (StatQuest with Josh Starmer)
  - YouTube: The Transformer Neural Network, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand the basic concept of "attention" in neural networks (focusing on relevant parts of input). Know that Transformers use self-attention heavily.
- Day 166: Monthly Review & Capstone Project Prep
  - Review: All deep learning concepts, especially the role of math in NNs.
  - Project Prep: Brainstorm ideas for your final capstone project.
- Day 167: Capstone Project Work Day 1
  - Assignment: Begin working on your chosen project. Focus on data loading, preprocessing, and setting up the basic model.
- Day 168: Rest/Catch-up
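
The recurrence from Day 162 can be sketched as a loop that reuses the same weights at every time step. The shapes, the tanh non-linearity and the names below are assumptions for illustration.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b), with the same weights at every step
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, n_in, n_hidden = 6, 3, 4            # assumed sizes
xs = rng.normal(size=(seq_len, n_in))        # one input vector per time step
W_x = rng.normal(size=(n_hidden, n_in))
W_h = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

hidden_states = rnn_forward(xs, W_x, W_h, b)
print(hidden_states.shape)
```

Backpropagating through this loop multiplies by `W_h` (and the tanh derivative) once per step, which is where the vanishing/exploding gradient problem of Day 163 comes from.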
Week 25 (Optional Extension/Buffer): Capstone Project & Future Learning
- Day 169: Capstone Project Work Day 2
  - Assignment: Continue implementing and training your model.
- Day 170: Capstone Project Work Day 3
  - Assignment: Evaluate your model, try different hyperparameters, and analyze results.
- Day 171: Information Theory for ML (Entropy, KL Divergence)
  - MMLL: Chapter 5.5 (Entropy and Mutual Information) - Focus on 5.5.1, 5.5.2 (Entropy) and 5.5.3 (KL Divergence).
  - Medium: Deep Learning and Information Theory (Deep and Shallow)
  - YouTube: Entropy (Information Theory), Clearly Explained!!! (StatQuest with Josh Starmer)
  - YouTube: Kullback-Leibler Divergence, Clearly Explained!!! (StatQuest with Josh Starmer)
  - Assignment: Understand their mathematical definitions and significance in loss functions (cross-entropy) and comparing distributions.
- Day 172: Causal Inference (Basic Concepts)
  - MMLL: Chapter 15.3 (Causal Inference) - Very high-level introduction.
  - Medium: Causal ML: What is it and what is its importance? (Plain Concepts)
  - YouTube: Causal Inference: The Basics (CrashCourse)
  - Assignment: Understand the difference between correlation and causation.
- Day 173: Comprehensive Math Review
  - Assignment: Go back through your notes and MMLL. Revisit any challenging math concepts from Linear Algebra, Calculus, Probability, and Statistics.
- Day 174: Capstone Project Finalization
  - Assignment: Prepare your project report/notebook. Clearly explain the problem, your approach, the models used, and the mathematical insights gained.
- Day 175: Final Project Presentation & Future Learning Plan
  - Assignment: Present your project to yourself or a peer. Outline your next steps in ML and math.
- Day 176-180: Buffer/Deep Dive/Review
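
The Day 171 quantities are short enough to write out directly. The sketch below uses base-2 logs (bits) and assumes `q` is non-zero wherever `p` is; the function names and the two example distributions are made up for illustration.

```python
import math

def entropy(p):
    # H(p) = -sum p_i log2 p_i
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p_i log2 q_i
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(p || q) = sum p_i log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # "true" distribution: a fair coin
q = [0.9, 0.1]   # model's predicted distribution

print(entropy(p))            # 1 bit for a fair coin
print(cross_entropy(p, q))
print(kl_divergence(p, q))
```

The identity H(p, q) = H(p) + KL(p || q) is why minimizing cross-entropy loss against fixed labels is the same as minimizing the KL divergence.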

Thursday, May 29, 2025

3B1B notes

Few points that I need to note:

3B1B’s ML explainer videos are pretty good

They explain that there are in fact three pieces to the optimization. First, the training data is passed through the layers of the network, giving a value (activation) at each layer. Within these layers, the weights determine the importance of each input (I am using the wrong terms – need to look it up), and the error (cost) function tells us how far the output is from the target, not the variance.

Mean squared error tells us how far off the predictions are from the actual values: it is the average of the squared differences, so better predictions give a smaller value, and training tries to minimize it.

Stochastic gradient descent differs from (full-batch) gradient descent in that each step computes the gradient on a small random batch of data rather than on the entire dataset, since working on the entire data at every step is computationally expensive. Either way, for a neural network we settle for a good local minimum, since finding the global minimum is not really possible.
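
To make the last two points concrete, here is a minimal sketch of mini-batch SGD minimizing MSE on a made-up, noise-free line y = 2x + 1. The data, learning rate, batch size and number of epochs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=64)
y = 2.0 * x + 1.0                      # made-up target: slope 2, intercept 1

def mse(w, b, xb, yb):
    # Mean squared error of the line w*x + b on the given points
    return np.mean((w * xb + b - yb) ** 2)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 8
for epoch in range(200):
    order = rng.permutation(len(x))    # reshuffle so each batch is a random sample
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        err = w * x[idx] + b - y[idx]
        # Gradient of the batch MSE with respect to w and b
        w -= lr * 2.0 * np.mean(err * x[idx])
        b -= lr * 2.0 * np.mean(err)

print(w, b, mse(w, b, x, y))
```

Each update only touches 8 of the 64 points, yet the parameters still drift to the same answer that full-batch gradient descent would find on this convex problem.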

Saturday, May 17, 2025

Random Forest Classifier Code

Source: https://www.kaggle.com/code/prashant111/random-forest-classifier-tutorial

-Based on ensemble learning

-Highlights the importance of feature selection - run once, see which features are important, remove the others, re-run, and test whether accuracy increases

-Remember that random forest can be used for both classification and regression problems.

-In a random forest classifier, more trees generally give higher (or at least more stable) accuracy, but the gains level off while training time keeps growing

Sunday, April 27, 2025

XGBoost Analysis Code

#Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode

from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import shap
import warnings
warnings.filterwarnings('ignore')

#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)

#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()

#missing value check
df.isnull().sum()

#Checking for types of values (class counts of the target)

df['Channel'].value_counts()

#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']

#Var checks
X.head()

y.head()

#Label encoding (OHE)
X_features = X

#X_features is already a DataFrame, so pass it to get_dummies directly
encoded_df = pd.get_dummies(X_features, drop_first=True)

#Checking the columns created after encoding

list(encoded_df.columns)

#Null imputation

#Performing Logistic Regression
#Note: X_train and y_train come from train_test_split, which appears further down; run that split first

import statsmodels.api as sm

logit = sm.Logit(y_train, X_train)

logit_model = logit.fit()

#Model Summary

logit_model.summary2()

def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Keep the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])

significant_vars = get_significant_vars(logit_model)

significant_vars

# import XGBoost
#import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)

# split X and y into training and testing sets
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# import XGBClassifier
#from xgboost import XGBClassifier

# declare parameters
params = {
'objective':'binary:logistic',
'max_depth': 4,
'alpha': 10,
'learning_rate': 1.0,
'n_estimators':100
}



# instantiate the classifier
xgb_clf = XGBClassifier(**params)

# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03

https://medium.com/@rithpansanga/optimizing-xgboost-a-guide-to-hyperparameter-tuning-77b6e48e289d

https://medium.com/@sadafsaleem5815/neural-networks-in-10mins-simply-explained-9ec2ad9ea815

https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

XGBoost Model Documentation: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

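Remember: It is worth setting a subsample value for our exercise (https://xgboosting.com/configure-xgboost-subsample-parameter/). Note that subsample only controls the fraction of rows drawn for each tree; since the dataset is imbalanced, scale_pos_weight is the parameter that reweights the rare class.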

Practical example on tuning: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Sklearn AUC metric (area under a curve; plain accuracy is sklearn.metrics.accuracy_score): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

-----------------

Update on 12.05.25:

#Installs

!pip install xgboost

!pip install shap

!pip install statsmodels

#Imports

import pandas as pd

import os

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import cohen_kappa_score

from scipy.stats import mode

from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split

import statsmodels.api as sm

import xgboost as xgb

from xgboost import XGBClassifier

from xgboost import plot_importance

from matplotlib import pyplot

import shap

import warnings

warnings.filterwarnings('ignore')

#Import dataset

df = pd.read_csv('C.csv')

#Exploring the dataset

df.shape

df.head()

df.info()

df.describe()

#missing value check

df.isnull().sum()

#Checking for types of values

df['status'].value_counts()

#declaring dependent and independent variables

X = df.drop('status', axis=1)

y = df['status']

#Label encoding (OHE)

X_features= X

encoded_df = pd.get_dummies(X_features, drop_first = True)

#note: converting X_features to a DataFrame was not required here since it is already a DataFrame

#Checking the columns created after encoding

list(encoded_df.columns)

#Sample check

#encoded_df.head()

#Var checks

#X.describe()

X.info()

encoded_df.info()

y.info()

#X.head()

#y.describe()

#y.head()

#Null Imputation

#https://www.geeksforgeeks.org/ml-handling-missing-values/

#Strategy 1

# Removing rows with missing values

df_cleaned = df.dropna()

print(df_cleaned)

#Strategy 2

#Mean, Median and Mode Imputation

mean_imputation = df['age'].fillna(df['age'].mean())

median_imputation = df['age'].fillna(df['age'].median())

mode_imputation = df['age'].fillna(df['age'].mode().iloc[0])

print("\nImputation using Mean:")

print(mean_imputation)

print("\nImputation using Median:")

print(median_imputation)

print("\nImputation using Mode:")

print(mode_imputation)

-----------------

# split X and y into training and testing sets

#from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(encoded_df, y, test_size = 0.3, random_state = 0)

#don't forget to use the X df which has the label encoding (encoded_df)

#Performing Logistic Regression

#import statsmodels.api as sm

#df.convert_objects(convert_numeric=True)

#Encoded DF - encoded_df

logit=sm.Logit(y_train,X_train)

logit_model=logit.fit()

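#Functions for converting objects to int or float

#df.convert_objects(convert_numeric=True) - no longer available in recent pandas; df.apply(pd.to_numeric, errors='coerce') does the same job

#sm.Logit(y_train, X_train.astype(float)).fit() - converting type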

#Model Summary

logit_model.summary2()

def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Keep the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])

significant_vars = get_significant_vars(logit_model)

significant_vars

-------------

# import XGBoost

#import xgboost as xgb

# define data_dmatrix

data_dmatrix = xgb.DMatrix(data=encoded_df,label=y)

# declare parameters

params = {

'objective':'binary:logistic',

'max_depth': 4,

'alpha': 10,

'learning_rate': 1.0,

'n_estimators':100

}

# instantiate the classifier

xgb_clf = XGBClassifier(**params)

# fit the classifier to the training data

xgb_clf.fit(X_train, y_train)

#output:

XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,

colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,

max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,

n_estimators=100, n_jobs=1, nthread=None,

objective='binary:logistic', random_state=0, reg_alpha=0,

reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,

subsample=1, verbosity=1)

# make predictions on test data

y_pred = xgb_clf.predict(X_test)

# check accuracy score

from sklearn.metrics import accuracy_score

print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

-------------------------------------------------------------------------------

Few things to remember:

Sampling - We rarely stop to check what the data distribution actually looks like; we just assume normality.

Question I have on this:

Distribution of which variables specifically?

Isn't fraud by definition a rare event which would make the distribution skewed anyways?

Log normal vs Normal distribution: https://towardsdatascience.com/log-link-vs-log-transformation-in-r-the-difference-that-misleads-your-entire-data-analysis/ - very good article
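
On the rare-event question above, a quick calculation suggests the answer is yes for the target itself. The sketch below computes the (population) skewness of a made-up fraud flag where 1% of cases are fraud; the 1% rate, the variable names and the helper are assumptions for illustration, and it says nothing about how the features are distributed.

```python
def skewness(values):
    # Population skewness: third central moment divided by std cubed
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

fraud_flag = [1] * 10 + [0] * 990      # assumed 1% fraud rate
balanced_flag = [1] * 500 + [0] * 500  # 50/50 split for comparison

print(skewness(fraud_flag))
print(skewness(balanced_flag))
```

For a 0/1 variable with rate p this matches the closed form (1 - 2p) / sqrt(p(1 - p)), which grows quickly as p gets small, so a rare event is heavily right-skewed by construction.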

Sunday, April 20, 2025

Interesting Reads

Conjoint Analysis: https://www.qualtrics.com/en-au/experience-management/research/types-of-conjoint/

Great article on Deep Tech (plus awesome visuals): https://www.bcg.com/publications/2021/deep-tech-innovation

Basics on Mathematical Modelling: https://ocw.tudelft.nl/courses/mathematical-modeling-basics/

Books on Operations: https://orc.mit.edu/impact/textbooks/

Solving Cool Math Problems: https://projecteuler.net/archives

XGBoost Resources

Hello,

This page is designed to collate popular resources on xgb for my personal reference. None of it is my work.

Survival Modelling code snippet:
#pip install lifelines
!pip install lifelines
#conda install -c conda-forge lifelines
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
df2 = pd.read_csv(r'FILELINK.csv')
T2 = df2['time']
S2 = df2['status']
print(T2)
kmf2 = KaplanMeierFitter()
kmf2.fit(T2, S2)
print("Survival function:")
print(kmf2.survival_function_)
print("Survival function plot:")
kmf2.plot()
plt.title("Survival Curve: 6-MP")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.savefig("6-MP.pdf")
plt.show()

https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03

I have used the Wholesale customers data set for this project, downloaded from the UCI Machine learning repository. This dataset can be found at the following url:

https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)

#Exploring the dataset

df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#var checks
X.head()
y.head()
#Label encoding

# import XGBoost
import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

General parameters

These parameters select which booster is used for boosting, commonly a tree or a linear model.
Booster parameters

These depend on which booster we have chosen.
Learning task parameters

These parameters define the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
Command line parameters

In addition, there are command line parameters which relate to the behaviour of the CLI version of XGBoost.

The most important parameters that we should know about are as follows:-

learning_rate - It gives us the step size shrinkage which is used to prevent overfitting. Its range is [0,1].

max_depth - It determines how deeply each tree is allowed to grow during any boosting round.

subsample - It determines the percentage of samples used per tree. Low value of subsample can lead to underfitting.

colsample_bytree - It determines the percentage of features used per tree. High value of it can lead to overfitting.

n_estimators - It is the number of trees we want to build.

objective - It determines the loss function to be used in the process. For example, reg:squarederror (called reg:linear in older versions) for regression problems, reg:logistic for logistic regression, and binary:logistic for binary classification problems with probability output.

XGBoost also supports regularization parameters to penalize models as they become more complex and reduce them to simple models. These regularization parameters are as follows:-

gamma - It controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. It is supported only for tree-based learners.

alpha - It gives us the L1 regularization on leaf weights. A large value of it leads to more regularization.

lambda - It gives us the L2 regularization on leaf weights and is smoother than L1 regularization.

Though we are using trees as our base learners, we can also use XGBoost’s relatively less popular linear base learners and one other tree learner known as dart. We have to set the booster parameter to either gbtree (default), gblinear or dart.

# import XGBClassifier
from xgboost import XGBClassifier


# declare parameters
params = {
            'objective':'binary:logistic',
            'max_depth': 4,
            'alpha': 10,
            'learning_rate': 1.0,
            'n_estimators':100
        }
            
            
            
# instantiate the classifier 
xgb_clf = XGBClassifier(**params)



# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)

#output:

XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)

# check accuracy score
from sklearn.metrics import accuracy_score

print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

k-fold Cross Validation using XGBoost
To build more robust models with XGBoost, we should do k-fold cross validation. In this way, we ensure that the original training dataset is used for both training and validation. Also, each entry is used for validation just once. XGBoost supports k-fold cross validation using the cv() method. In this method, we will specify several parameters which are as follows:-
nfold - This parameter specifies the number of cross-validation sets we want to build.
num_boost_round - It denotes the number of trees we build.
metrics - It is the performance evaluation metric (or metrics) to be watched during CV.
as_pandas - It is used to return the results in a pandas DataFrame.
early_stopping_rounds - This parameter stops training of the model early if the hold-out metric does not improve for a given number of rounds.
seed - This parameter is used for reproducibility of results.
We can use these parameters to build a k-fold cross-validation model by calling XGBoost's cv() method.
#K-fold Cross Val
from xgboost import cv

params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

xgb_cv = cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="auc", as_pandas=True, seed=123) 
xgb_cv.head()
#xgb_cv contains train and test auc metrics for each boosting round. Let's preview xgb_cv.
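
To see what "each entry is used for validation just once" means, here is a small pure-Python sketch of how k folds partition the row indices. It is not how xgb.cv is implemented internally (and it skips shuffling and stratification); the helper name `k_fold_indices` is an assumption for illustration.

```python
def k_fold_indices(n_samples, k):
    # Split indices 0..n_samples-1 into k contiguous folds of near-equal size
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]                    # held-out fold
        train = indices[:start] + indices[start + size:]     # everything else
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every index lands in exactly one validation fold and in the training set of the other k-1 folds, which is why the averaged metric is less dependent on one lucky split.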

Feature importance with XGBoost

XGBoost provides a way to examine the importance of each feature in the original dataset within the model. It involves counting the number of times each feature is split on across all boosting trees in the model. Then we visualize the result as a bar graph, with the features ordered according to how many times they appear.

XGBoost has a plot_importance() function that helps us to achieve this task. We can then see which features have been given the highest importance scores among all the features. Thus XGBoost provides us a way to do feature selection.

I will proceed as follows:-

plt.rcParams['figure.figsize'] = [6, 4]  #set the figure size before plotting so it takes effect
xgb.plot_importance(xgb_clf)
plt.show()

Sidenote:
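#Remember: if a categorical (object/string) variable is there, xgb cannot take it as-is; it needs to be encoded (e.g. OHE) or cast to the pandas category dtype with enable_categorical=True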
(ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True.Var1, Var2, Var3, Var4
Link: https://stackoverflow.com/questions/67080149/xgboost-error-when-categorical-type-is-supplied-dmatrix-parameter-enable-cat)
Code block For OHE:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# define the input data

df = pd.DataFrame([ {'Var1': 'JP', 'Var2': 2009, 'Var3': 6581, 'Var4': 'OME', 'Var5': 325.787218, 'Ind_Var': 8.013558}, {'Var1': 'FR', 'Var2': 2018, 'Var3': 5783, 'Var4': 'I_S', 'Var5': 11.956326, 'Ind_Var': 4.105559}, {'Var1': 'BE', 'Var2': 2015, 'Var3': 6719, 'Var4': 'OME', 'Var5': 42.888565, 'Ind_Var': 7.830077}, {'Var1': 'DK', 'Var2': 2011, 'Var3': 3506, 'Var4': 'RPP', 'Var5': 70.094146, 'Ind_Var': 83.000000}, {'Var1': 'AT', 'Var2': 2019, 'Var3': 5474, 'Var4': 'NMM', 'Var5': 270.082738, 'Ind_Var': 51.710526} ])

# extract the features and target

X_train, y_train = df.iloc[:3, :-1], df.iloc[:3, -1]
X_test, y_test = df.iloc[3:, :-1], df.iloc[3:, -1]

# one-hot encode the categorical features
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']

full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')

encoder = full_pipeline.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

# train the model
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)

# extract the training set predictions
model.predict(X_train)

# array([7.0887003, 3.7923286, 7.0887003], dtype=float32)

# extract the test set predictions
model.predict(X_test)

# array([7.0887003, 7.0887003], dtype=float32)

#Null imputation strategy for different variables

#Strategy for handling categorical variables:
Strategy 1:
# column names match the Var1..Var4 columns of the DataFrame defined above
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']
X_train[cat_attribs] = X_train[cat_attribs].astype('category')
X_test[cat_attribs] = X_test[cat_attribs].astype('category')

model = XGBRegressor(n_estimators=10, max_depth=20, enable_categorical=True, verbosity=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Strategy 2:
# Create a mapping of labels to encoded values from X_train
train_cat = X_train['Var1'].astype('category')
training_encoded_mapping = {label: code for code, label in enumerate(train_cat.cat.categories)}
X_train['Var1'] = train_cat.cat.codes
# Apply the mapping to X_test (labels not seen in X_train become NaN)
X_test['Var1'] = X_test['Var1'].map(training_encoded_mapping)
# Do the same for other vars as well
And now don't pass enable_categorical=True in model initialization

#Models to be used:
Naive Bayes, DTs, Logistic Regression (it is a classification technique and not regression), NN, SVM

Unsupervised learning (clustering algorithms): K-means clustering, Mean-shift, DBSCAN - but remember that for K-means we shall have to specify the number of clusters (mean-shift and DBSCAN work it out themselves) - to only allow for default and non-default behaviour - does not have direct application here but need to learn more

Saturday, April 19, 2025

Data Viz Resources

Hello,

Ideas on better plotting of distributions (banding and how the population lies for any variable):

https://rafalab.dfci.harvard.edu/dsbook/dataviz-distributions.html
https://seaborn.pydata.org/tutorial/distributions.html - Using seaborn

https://github.com/cuttlefishh/python-for-data-analysis/blob/master/lessons/lesson10.ipynb

How to connect big query with python: https://codelabs.developers.google.com/codelabs/cloud-bigquery-python#0

The idea here is to assimilate resources from across the internet that will help me level up my visualization game using python. I am pretty much a noob at this and want to get better.

Most helpful visualization libraries in python (https://www.kaggle.com/discussions/getting-started/1087922)

1- matplotlib

matplotlib is the O.G. of Python data visualization libraries. Despite being over a decade old, it’s still the most widely used library for plotting in the Python community. It was designed to closely resemble MATLAB, a proprietary programming language developed in the 1980s.

2- Seaborn

Seaborn harnesses the power of matplotlib to create beautiful charts in a few lines of code. The key difference is Seaborn’s default styles and color palettes, which are designed to be more aesthetically pleasing and modern. Since Seaborn is built on top of matplotlib, you’ll need to know matplotlib to tweak Seaborn’s defaults.

3- ggplot

ggplot is based on ggplot2, an R plotting system, and concepts from The Grammar of Graphics. ggplot operates differently than matplotlib: it lets you layer components to create a complete plot. For instance, you can start with axes, then add points, then a line, a trendline, etc. Although The Grammar of Graphics has been praised as an “intuitive” method for plotting, seasoned matplotlib users might need time to adjust to this new mindset.

4- Bokeh

Like ggplot, Bokeh is based on The Grammar of Graphics, but unlike ggplot, it’s native to Python, not ported over from R. Its strength lies in the ability to create interactive, web-ready plots, which can be easily outputted as JSON objects, HTML documents, or interactive web applications. Bokeh also supports streaming and real-time data.

5- pygal

Like Bokeh and Plotly, pygal offers interactive plots that can be embedded in the web browser. Its prime differentiator is the ability to output charts as SVGs. As long as you’re working with smaller datasets, SVGs will do you just fine. But if you’re making charts with hundreds of thousands of data points, they’ll have trouble rendering and become sluggish.

6- Plotly

You might know Plotly as an online platform for data visualization, but did you also know you can access its capabilities from a Python notebook? Like Bokeh, Plotly’s forte is making interactive plots, but it offers some charts you won’t find in most libraries, like contour plots, dendrograms, and 3D charts.

7- geoplotlib

geoplotlib is a toolbox for creating maps and plotting geographical data. You can use it to create a variety of map-types, like choropleths, heatmaps, and dot density maps. You must have Pyglet (an object-oriented programming interface) installed to use geoplotlib. Nonetheless, since most Python data visualization libraries don’t offer maps, it’s nice to have a library dedicated solely to them.

8- Gleam

Gleam is inspired by R’s Shiny package. It allows you to turn analyses into interactive web apps using only Python scripts, so you don’t have to know any other languages like HTML, CSS, or JavaScript. Gleam works with any Python data visualization library. Once you’ve created a plot, you can build fields on top of it so users can filter and sort data.

9- missingno

Dealing with missing data is a pain. missingno allows you to quickly gauge the completeness of a dataset with a visual summary, instead of trudging through a table. You can filter and sort data based on completion or spot correlations with a heatmap or a dendrogram.

10- Leather

Leather’s creator, Christopher Groskopf, puts it best: “Leather is the Python charting library for those who need charts now and don’t care if they’re perfect.” It’s designed to work with all data types and produces charts as SVGs, so you can scale them without losing image quality.

https://github.com/mathisonian/awesome-visualization-research

https://mode.com/blog/python-data-visualization-libraries

1. Seaborn

Seaborn is built on top of the matplotlib library. It has many built-in functions that let you create beautiful plots in just a few lines of code. It provides a variety of advanced plots with simple syntax, such as box plots, violin plots, dist plots, joint plots, pair plots, heatmaps, and many more.
Key Features:
- It can be used to determine the relationship between two variables.
- It differentiates between analyzing univariate and bivariate distributions.
- It can plot the linear regression model for the dependent variable.
- It provides multi-grid plotting.

Official website: https://seaborn.pydata.org/

2. Plotly

Plotly is an advanced Python analytics library that helps in building interactive dashboards. The graphs built using Plotly are interactive, which means you can easily read off the value at any particular point or section of a graph. Plotly makes it easy to generate dashboards and deploy them on a server. It supports Python, R, and the Julia programming language.
You can create a wide range of graphs using Plotly:
- Basic charts
- Statistical charts
- Scientific charts
- Financial charts
- Maps
- Subplots
- Transforms
- Jupyter Widgets Interaction

Official website: https://plotly.com/

3. Geoplotlib

Geoplotlib is an open-source Python toolbox for visualizing geographical data. It supports the development of hardware-accelerated interactive visualizations in pure Python and provides implementations of dot maps, kernel density estimation, spatial graphs, Voronoi tessellation, shapefiles, and many more common spatial visualizations.

Geoplotlib can be used to make a variety of maps, such as equivalent area maps, heat maps, and point density maps. There are also several extended modules:
- geoplotlib
- geoplotlib.layers
- geoplotlib.utils
- geoplotlib.core
- geoplotlib.colors

Official website: https://andrea-cuttone.github.io/geoplotlib/

4. Gleam

Gleam is inspired by R’s Shiny package. It allows you to turn analyses into interactive web apps using only Python scripts, so you don’t have to know any other languages like HTML, CSS, or JavaScript. Gleam works with any Python data visualization library. Once you’ve created a plot, you can build fields on top of it so users can filter and sort data.

Official website: https://github.com/dgrtwo/gleam

5. ggplot/ggplot2

ggplot works differently from matplotlib. It lets you add multiple components as layers to create a complete graph or plot at the end. For example, at the start you can add an axis, then points, and other components like a trend line.
The usual advice is to store your data in a data frame before using ggplot, which keeps the plotting code simpler and more efficient.

Official website: https://ggplot2.tidyverse.org/reference/ggplot.html

Key code snippets:

I guess I will get complacent and stagnate