Monday, June 2, 2025

Detailed ML Learning Journey

6-Month Maths-Focused Machine Learning Program with Enhanced Resources

This program spreads the material from an intensive plan over 6 months, aiming for roughly 1-2 hours of focused study/coding per day on weekdays, with optional longer sessions on weekends for deeper dives or project work.


Month 1: Linear Algebra & Foundational Math

Goal: Build a solid understanding of vectors, matrices, and basic linear algebra operations crucial for ML.

  • Week 1: Introduction to Vectors

  • Week 2: Matrices - The Basics

  • Week 3: Systems of Linear Equations & Inverses

    • Day 15: Systems of Linear Equations (Matrix Form)
    • Day 16: Determinants (2x2, 3x3)
    • Day 17: Inverse Matrices (Conceptual & Calculation)
    • Day 18: Solving Systems with Inverses
      • MMLL: Chapter 2.3 (Using Matrix Inverse to Solve Systems)
      • SIA: Chapter 2.2 (Solving with A⁻¹)
      • Assignment: Use np.linalg.solve to solve a system of linear equations in Python. Practice calculating inverses with np.linalg.inv.
    • Day 19: Hands-on Determinants & Inverses
      • Assignment: Write Python functions to calculate the determinant of a 2x2 matrix and the inverse of a 2x2 matrix from scratch (without np.linalg). Compare with NumPy.
    • Day 20: Review & Practice
      • Review notes. Ensure you understand when an inverse exists (non-zero determinant).
      • Assignment: MIT OpenCourseware (MIT 18.06SC Linear Algebra) Problem Set 3, Problem 1 (or similar problems related to inverses).
    • Day 21: Rest/Catch-up
  • Week 4: Eigenvalues, Eigenvectors & Review

    • Day 22: Eigenvalues & Eigenvectors - Intuition
    • Day 23: Calculating Eigenvalues & Eigenvectors (2x2)
      • MMLL: Chapter 2.4 (Calculation examples)
      • SIA: Chapter 6.1 (Finding Eigenvalues and Eigenvectors)
      • KA: Eigenvalues and eigenvectors
      • Assignment: KA - Find eigenvalues and eigenvectors of a 2x2 matrix
    • Day 24: Hands-on Eigen Decomposition
      • Assignment: Use np.linalg.eig to find eigenvalues and eigenvectors of a matrix. Verify Av = λv.
    • Day 25: Orthogonality (Conceptual)
      • MMLL: Chapter 2.1.2 (Orthogonal Vectors)
      • SIA: Chapter 4.1 (Orthogonal Vectors and Subspaces)
      • KA: Orthogonal vectors
      • YouTube: Orthogonal vectors (Khan Academy)
      • Assignment: Identify orthogonal vectors.
    • Day 26: Linear Algebra Review & Mini-Project Prep
      • Review: All concepts from Month 1. Revisit problem areas.
      • Project Prep: Understand the high-level goal of Principal Component Analysis (PCA) as a dimensionality reduction technique (you'll implement a basic version in Month 5). Focus on how it uses eigenvectors.
    • Day 27: REST/Catch-up
    • Day 28: Monthly Review & Catch-up
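
A minimal sketch for the Day 19 and Day 24 assignments. The matrix A and right-hand side b are made-up examples, and the helper names det2 and inv2 are my own. It computes the 2x2 determinant and inverse from scratch, compares them with NumPy, and checks Av = λv for each eigenpair.

```python
import numpy as np

def det2(m):
    """Determinant of a 2x2 matrix [[a, b], [c, d]], i.e. ad - bc."""
    (a, b), (c, d) = m
    return a * d - b * c

def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula; fails when det is 0."""
    (a, b), (c, d) = m
    det = det2(m)
    if np.isclose(det, 0.0):
        raise ValueError("matrix is singular (determinant is zero)")
    return np.array([[d, -b], [-c, a]]) / det

# Hypothetical example system A x = b
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x_from_inverse = inv2(A) @ b          # solve using the hand-written inverse
x_from_solve = np.linalg.solve(A, b)  # NumPy's solver for comparison

# Check A v = lambda v for every eigenpair returned by NumPy
eigvals, eigvecs = np.linalg.eig(A)
residuals = [
    np.linalg.norm(A @ eigvecs[:, i] - eigvals[i] * eigvecs[:, i])
    for i in range(len(eigvals))
]

print("det:", det2(A))
print("x (inverse):", x_from_inverse, "x (solve):", x_from_solve)
print("max eigen residual:", max(residuals))
```

In practice np.linalg.solve is usually preferred over forming the inverse explicitly. The from-scratch version is only there to make the formula concrete.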

Month 2: Calculus, Probability & First ML Algorithm

Goal: Grasp essential calculus concepts for optimization, fundamental probability/statistics, and apply them to your first ML model.

  • Week 5: Calculus - Derivatives

    • Day 29: Functions, Limits, Continuity
    • Day 30: Derivatives - Intuition & Power Rule
    • Day 31: Product, Quotient, Chain Rule
    • Day 32: Critical Points & Local Min/Max
    • Day 33: Hands-on Symbolic Differentiation (Optional)
      • Assignment: Experiment with SymPy in Python to symbolically differentiate simple functions. (e.g., import sympy; x = sympy.symbols('x'); f = x**2 + 3*x; sympy.diff(f, x))
    • Day 34: Review & Practice
      • Assignment: Search for "univariate calculus differentiation problems," solve 5-10.
    • Day 35: Rest/Catch-up
  • Week 6: Calculus - Gradients & Optimization Intro

    • Day 36: Functions of Multiple Variables & Partial Derivatives
    • Day 37: Gradient Vector
    • Day 38: Hessian Matrix (Conceptual)
      • MMLL: Chapter 3.1.4 (Hessian) - Understand it represents curvature.
      • KA: The Hessian matrix
      • YouTube: Hessian Matrix (mathematicalmonk)
      • Assignment: No explicit problem, focus on conceptual understanding of second partial derivatives and the Hessian's purpose.
    • Day 39: Introduction to Optimization
    • Day 40: Hands-on Gradient Calculation
      • Assignment: Write a Python function that calculates the gradient of a simple multivariate function (e.g., ).
    • Day 41: Review & Practice
      • Assignment: Review partial derivatives and gradients. MIT OpenCourseware (MIT 18.02 Multivariable Calculus) Problem Set 1, Problem 1 (or similar).
    • Day 42: Rest/Catch-up
  • Week 7: Probability Fundamentals

  • Week 8: Probability Distributions & Linear Regression (Math)

    • Day 50: Common Discrete Distributions (Bernoulli, Binomial)
      • MMLL: Chapter 5.2.5 (Bernoulli, Binomial)
      • KA: Bernoulli
      • YouTube: The Binomial Distribution (StatQuest with Josh Starmer)
      • Assignment: KA - Bernoulli, Binomial distribution problems
    • Day 51: Normal (Gaussian) Distribution
    • Day 52: Expectation & Variance
    • Day 53: Simple Linear Regression - Mathematical Formulation
    • Day 54: Least Squares Method - Conceptual
      • MMLL: Chapter 6.1.3 (Least Squares Estimation) - Focus on the intuition of finding the "best fit" line.
      • YouTube: Least Squares Regression (Khan Academy)
      • Assignment: No coding, just understanding the conceptual goal of Least Squares.
    • Day 55: Monthly Review & Catch-up
      • Assignment: Review all concepts from Month 2. Focus on the intuition of derivatives, gradients, and how probability distributions describe data.
    • Day 56: Rest/Catch-up
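
A small sketch for the Day 40 gradient assignment. The function f(x, y) = x² + 3xy is my own illustrative choice, since the notes do not name one. It compares a hand-derived gradient with a central-difference approximation.

```python
import numpy as np

def f(v):
    """Hypothetical example function f(x, y) = x**2 + 3*x*y."""
    x, y = v
    return x**2 + 3 * x * y

def grad_f(v):
    """Analytic gradient [df/dx, df/dy] = [2x + 3y, 3x]."""
    x, y = v
    return np.array([2 * x + 3 * y, 3 * x])

def numerical_gradient(func, v, h=1e-6):
    """Central-difference estimate of the gradient, one coordinate at a time."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad

point = np.array([1.0, 2.0])
analytic = grad_f(point)                # [2*1 + 3*2, 3*1] = [8, 3]
numeric = numerical_gradient(f, point)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The same numerical check is handy later for sanity-testing hand-derived gradients in regression and backpropagation code.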

Month 3: Regression & Core ML Concepts

Goal: Deepen understanding of regression, explore optimization for ML, and grasp bias-variance.

  • Week 9: Linear Regression Implementation

    • Day 57: Derivation of Coefficients (Calculus)
      • MMLL: Chapter 6.1.3 (Least Squares Estimation) - Understand the partial derivatives of the SSE.
      • YouTube: Deriving the Normal Equation for Linear Regression (StatQuest with Josh Starmer)
      • Assignment: Walk through the derivation of the coefficients for simple linear regression (or watch a video explaining it).
    • Day 58: Multiple Linear Regression & Normal Equation (Matrix Form)
      • MMLL: Chapter 6.1.4 (Multiple Linear Regression), 6.1.5 (Normal Equation).
      • YouTube: The Normal Equation (Andrew Ng's ML Course)
      • Assignment: Understand y = Xβ and the Normal Equation β = (XᵀX)⁻¹Xᵀy.
    • Day 59: Hands-on Simple Linear Regression from Scratch
      • Assignment: Implement simple linear regression from scratch using NumPy. Plot the regression line on a small dataset.
    • Day 60: Hands-on Multiple Linear Regression (Normal Equation)
      • Assignment: Implement multiple linear regression using the Normal Equation with NumPy. Test on a synthetic dataset.
    • Day 61: Assumptions of Linear Regression
      • MMLL: Chapter 6.1.6 (Assumptions)
      • YouTube: Assumptions of Linear Regression (StatQuest with Josh Starmer)
      • Assignment: List and understand the key assumptions (linearity, independence, homoscedasticity, normality of errors).
    • Day 62: Review & Practice
      • Assignment: Review linear regression. Search for "linear regression normal equation problems" and solve one.
    • Day 63: Rest/Catch-up
  • Week 10: Gradient Descent in Depth

    • Day 64: Gradient Descent for Linear Regression
      • MMLL: Chapter 6.1.7 (Gradient Descent)
      • YouTube: Gradient Descent, Step-by-Step (StatQuest with Josh Starmer)
      • Assignment: Understand the update rule β_new = β_old − α∇J(β).
    • Day 65: Learning Rate & Convergence
      • MMLL: Chapter 4.3 (Gradient Descent) - Focus on learning rate.
      • YouTube: How to choose a learning rate for gradient descent (sentdex)
      • Assignment: Implement batch gradient descent for your linear regression from Day 60 and experiment with different learning rates. Observe divergence/slow convergence.
    • Day 66: Stochastic Gradient Descent (SGD) - Intuition
    • Day 67: Mini-batch Gradient Descent
      • MMLL: Chapter 4.3.3 (Mini-batch Gradient Descent)
      • YouTube: Mini-Batch Gradient Descent (DeepLearning.AI)
      • Assignment: Understand the trade-off between GD and SGD.
    • Day 68: Hands-on SGD for Linear Regression
      • Assignment: Implement SGD for linear regression in NumPy. Compare its performance to batch GD on a slightly larger dataset.
    • Day 69: Review & Practice
      • Assignment: Review Gradient Descent variants. MIT 6.036 (Introduction to Machine Learning) problem set on Gradient Descent (search for recent versions).
    • Day 70: Rest/Catch-up
  • Week 11: Regularization

    • Day 71: Overfitting & Underfitting
      • MMLL: Chapter 6.3 (Regularization) - Intro.
      • YouTube: Overfitting vs. Underfitting (StatQuest with Josh Starmer)
      • Assignment: Understand the concepts of overfitting and underfitting. Identify them visually.
    • Day 72: Ridge Regression (L2 Regularization) - Math
    • Day 73: Lasso Regression (L1 Regularization) - Math
    • Day 74: Hands-on Regularization (Scikit-learn)
      • Assignment: Use sklearn.linear_model.Ridge and sklearn.linear_model.Lasso. Experiment with the alpha parameter on a dataset prone to overfitting.
    • Day 75: Choosing Lambda (Regularization Strength)
      • Assignment: Research cross-validation as a method for selecting hyperparameters like lambda. (No implementation yet, just conceptual).
    • Day 76: Review & Practice
      • Assignment: Search for "Ridge vs Lasso explained" or "regularization in linear regression problems."
    • Day 77: Rest/Catch-up
  • Week 12: Polynomial Regression & Bias-Variance

    • Day 78: Polynomial Regression
      • MMLL: Chapter 6.2 (Polynomial Regression)
      • YouTube: Polynomial Regression Explained (StatQuest with Josh Starmer)
      • Assignment: Understand how polynomial features are created. Implement polynomial regression using sklearn.preprocessing.PolynomialFeatures and LinearRegression.
    • Day 79: Bias-Variance Trade-off - Intuition
      • MMLL: Chapter 6.4 (Bias-Variance Decomposition) - Focus on 6.4.1 (Introduction).
      • Medium: Bias Variance Tradeoff (MLU-Explain - interactive)
      • YouTube: Bias and Variance (StatQuest with Josh Starmer)
      • Assignment: Conceptualize bias (model's simplifying assumptions) and variance (model's sensitivity to training data).
    • Day 80: Mathematical Breakdown of Bias-Variance
      • MMLL: Chapter 6.4.2 (Derivation) - Go through the derivation of the MSE decomposition (if comfortable, otherwise understand the terms).
      • Assignment: Understand how the expected test MSE decomposes as Bias² + Variance + irreducible Noise.
    • Day 81: Visualizing Bias-Variance
      • Assignment: Search for online visualizations of the bias-variance trade-off (e.g., using target practice analogy).
    • Day 82: Hands-on Bias-Variance Example
      • Assignment: Create a synthetic dataset. Fit a low-degree polynomial (high bias) and a high-degree polynomial (high variance) to it. Plot and observe the fit and generalization.
    • Day 83: Monthly Review & Project Prep
      • Review: All concepts from Month 3. Focus on regression, optimization, regularization, and bias-variance.
      • Project Prep: Brainstorm simple regression datasets you could use for a mini-project (e.g., house price prediction, car mileage).
    • Day 84: Rest/Catch-up
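
A minimal sketch tying together the Normal Equation (Day 58/60) and batch gradient descent (Day 64/65) on a synthetic dataset. The true coefficients, noise level, learning rate and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: intercept plus two features (all values hypothetical)
n = 200
true_beta = np.array([1.0, 2.0, -3.0])
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
y = X @ true_beta + rng.normal(0, 0.1, size=n)

# Normal Equation: solve (X^T X) beta = X^T y instead of forming the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on the mean squared error."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = (2 / len(y)) * X.T @ (X @ beta - y)  # gradient of MSE w.r.t. beta
        beta = beta - alpha * grad                  # beta_new = beta_old - alpha * grad
    return beta

beta_gd = gradient_descent(X, y)

print("normal equation: ", beta_normal)
print("gradient descent:", beta_gd)
```

Re-running gradient_descent with a much larger or much smaller alpha is a quick way to see divergence or slow convergence for the Day 65 experiment.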

Month 4: Classification Algorithms

Goal: Understand the mathematical underpinnings of key classification models.

  • Week 13: Logistic Regression

    • Day 85: Logistic Regression - Concept & Sigmoid
      • MMLL: Chapter 7.1 (Logistic Regression) - Focus on 7.1.1 (Binary Classification) and 7.1.2 (Sigmoid function).
      • YouTube: Logistic Regression, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand how the sigmoid function maps linear output to a probability between 0 and 1. Plot the sigmoid.
    • Day 86: Cross-Entropy Loss Function
    • Day 87: Gradient Descent for Logistic Regression
      • MMLL: Chapter 7.1.4 (Gradient Descent) - Understand the update rule (similar to linear regression, but with different derivatives).
      • YouTube: Logistic Regression Details: Calculating the Gradient (Andrew Ng's ML Course)
      • Assignment: Walk through the derivation of the gradients (or watch a detailed explanation).
    • Day 88: Hands-on Logistic Regression from Scratch
      • Assignment: Implement binary logistic regression from scratch using NumPy (forward pass, loss, and gradient descent update). Test on a simple synthetic dataset.
    • Day 89: Hands-on Logistic Regression (Scikit-learn)
      • Assignment: Use sklearn.linear_model.LogisticRegression. Compare results with your custom implementation. Understand predict_proba.
    • Day 90: Review & Practice
      • Assignment: Review logistic regression. Search for "logistic regression explained math" and re-read.
    • Day 91: Rest/Catch-up
  • Week 14: Softmax & SVMs

    • Day 92: Softmax Regression (Multinomial Logistic Regression)
      • MMLL: Chapter 7.2 (Softmax Regression)
      • YouTube: Softmax Regression (Andrew Ng's ML Course)
      • Assignment: Understand the softmax function and its use for multi-class classification.
    • Day 93: Categorical Cross-Entropy Loss
      • MMLL: Chapter 7.2 (Loss function).
      • YouTube: Categorical Cross Entropy Explained (StatQuest with Josh Starmer - same as binary, just extended)
      • Assignment: Understand the extension of cross-entropy to multiple classes.
    • Day 94: Support Vector Machines (SVMs) - Hyperplane & Margin
    • Day 95: Kernel Trick (Conceptual)
      • MMLL: Chapter 8.2 (Non-linear SVM) - Focus on the idea of mapping data to higher dimensions implicitly.
      • Medium: Kernel Trick in Support Vector Classification (GeeksforGeeks)
      • YouTube: SVMs and the Kernel Trick (StatQuest with Josh Starmer)
      • Assignment: No explicit math problem, focus on understanding the concept of making linearly inseparable data separable.
    • Day 96: Hands-on SVM (Scikit-learn)
      • Assignment: Use sklearn.svm.SVC. Experiment with different kernels (linear, rbf, poly) on a dataset like Iris or circles/moons.
    • Day 97: Review & Practice
      • Assignment: Review Softmax and SVMs. Search for "SVM kernel trick explained" if needed.
    • Day 98: Rest/Catch-up
  • Week 15: Decision Trees & Ensembles Intro

    • Day 99: Decision Trees - Basics & Splitting
      • MMLL: Chapter 10.1 (Decision Trees)
      • YouTube: Decision Trees, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand how decision trees make predictions by splitting data.
    • Day 100: Gini Impurity & Entropy (Mathematical Definitions)
    • Day 101: Information Gain
      • MMLL: Chapter 10.1.2 (Information Gain)
      • YouTube: Information Gain in Decision Tree (Machine Learning with Phil)
      • Assignment: Understand how information gain is used to choose the best split.
    • Day 102: Introduction to Ensemble Methods (Bagging)
    • Day 103: Boosting (Conceptual)
    • Day 104: Hands-on Decision Tree & Random Forest (Scikit-learn)
      • Assignment: Use sklearn.tree.DecisionTreeClassifier and sklearn.ensemble.RandomForestClassifier. Compare their performance.
    • Day 105: Rest/Catch-up
  • Week 16: KNN & Naive Bayes, Classification Project

    • Day 106: K-Nearest Neighbors (KNN)
    • Day 107: Naive Bayes - Intuition
      • MMLL: Chapter 5.1.3 (Bayes' Theorem application - conceptually)
      • YouTube: Naive Bayes, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand the "naive" assumption of conditional independence.
    • Day 108: Different Naive Bayes Variants (Conceptual)
      • Assignment: Research Gaussian, Multinomial, and Bernoulli Naive Bayes and when to use them.
    • Day 109: Hands-on Naive Bayes (Scikit-learn)
      • Assignment: Use sklearn.naive_bayes.GaussianNB or MultinomialNB.
    • Day 110: Monthly Review & Project Prep
      • Review: All classification algorithms.
      • Project Prep: Prepare for a classification project.
    • Day 111: Classification Project
      • Assignment: Choose a classification dataset (e.g., Pima Indians Diabetes, Titanic). Apply at least 3 different classification algorithms learned (e.g., Logistic Regression, SVM, Random Forest). Evaluate performance using appropriate metrics (accuracy, precision, recall, F1-score).
    • Day 112: Rest/Catch-up
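
A rough sketch of the Day 88 assignment (binary logistic regression from scratch). The two Gaussian blobs, learning rate and iteration count are assumptions made for the example, and fit_logistic is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Average cross-entropy loss; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss; returns weights, bias, loss history."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)             # forward pass
        losses.append(binary_cross_entropy(y, p))
        grad_w = X.T @ (p - y) / len(y)    # dLoss/dw
        grad_b = np.mean(p - y)            # dLoss/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b, losses

# Synthetic two-class data: one blob per class (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(100, 2)),
])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b, losses = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)

print("first loss:", losses[0], "last loss:", losses[-1])
print("training accuracy:", accuracy)
```

The gradient has the same form as in linear regression (prediction minus target, times the inputs), which is the point made on Day 87. Only the prediction function differs.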

Month 5: Unsupervised Learning & Optimization Deep Dive

Goal: Explore methods for finding patterns in unlabeled data and delve deeper into optimization techniques.

  • Week 17: Clustering

    • Day 113: K-Means Clustering - Objective Function
    • Day 114: Lloyd's Algorithm
      • MMLL: Chapter 9.1.1 (Algorithm).
      • Assignment: Walk through the steps of Lloyd's algorithm.
    • Day 115: Hands-on K-Means from Scratch
      • Assignment: Implement K-Means from scratch using NumPy. Test on a simple 2D dataset and visualize clusters.
    • Day 116: Hierarchical Clustering (Conceptual)
    • Day 117: Hands-on Hierarchical Clustering (SciPy)
      • Assignment: Use scipy.cluster.hierarchy to perform hierarchical clustering and plot a dendrogram.
    • Day 118: Review & Practice
      • Assignment: Review clustering algorithms. Search for "K-Means problems."
    • Day 119: Rest/Catch-up
  • Week 18: Dimensionality Reduction (PCA)

    • Day 120: PCA - Recap & Covariance Matrix
    • Day 121: PCA - Eigenvalues & Eigenvectors for Reduction
      • MMLL: Chapter 11.1.1 (PCA Algorithm)
      • YouTube: PCA Algorithm (Mathematicalmonk)
      • Assignment: Understand how eigenvectors correspond to principal components and eigenvalues to explained variance.
    • Day 122: Singular Value Decomposition (SVD) for PCA
      • MMLL: Chapter 11.1.2 (SVD for PCA)
      • SIA: Chapter 7.2 (Singular Value Decomposition)
      • YouTube: Singular Value Decomposition (SVD) and PCA (StatQuest with Josh Starmer)
      • Assignment: Understand that SVD provides a robust way to compute PCA.
    • Day 123: Hands-on PCA from Scratch (using SVD)
      • Assignment: Implement PCA from scratch using NumPy's np.linalg.svd. Apply it to a high-dimensional dataset (e.g., MNIST digits) and visualize the first 2 components.
    • Day 124: Hands-on PCA (Scikit-learn)
      • Assignment: Use sklearn.decomposition.PCA and compare results with your custom implementation. Understand explained_variance_ratio_.
    • Day 125: Review & Practice
      • Assignment: Review PCA. Search for "PCA explained visually."
    • Day 126: Rest/Catch-up
  • Week 19: Advanced Optimization

    • Day 127: Limitations of Basic Gradient Descent
      • Assignment: Understand local minima, saddle points, and issues with learning rate (e.g., slow convergence, oscillations).
    • Day 128: Momentum
    • Day 129: Adagrad & RMSprop (Conceptual)
    • Day 130: Adam Optimizer (Conceptual)
      • MMLL: Chapter 4.3.7 (Adam)
      • YouTube: Adam Optimizer, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand Adam as a combination of Momentum and RMSprop ideas.
    • Day 131: Hands-on Optimizers (TensorFlow/PyTorch)
      • Assignment: Build a simple linear regression model using TensorFlow or PyTorch. Experiment with SGD, Adam, and RMSprop optimizers, observing their convergence behavior.
    • Day 132: Review & Practice
      • Assignment: Search for "deep learning optimizers explained" videos/articles. Focus on their mathematical update rules.
    • Day 133: Rest/Catch-up
  • Week 20: Convex Optimization (Conceptual) & Unsupervised Project

    • Day 134: Convex Sets & Functions (Conceptual)
    • Day 135: Why Convexity Matters in ML
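      • Assignment: Understand that many traditional ML models (linear regression with squared error, logistic regression with cross-entropy) have convex loss functions, so any local minimum is also a global minimum and gradient descent with a suitable learning rate cannot get stuck in a worse local minimum.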
    • Day 136: Anomaly Detection (Brief Introduction)
      • MMLL: Chapter 9.3 (Anomaly Detection) - Basic overview.
      • YouTube: Anomaly Detection, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand the goal of anomaly detection. Explore simple statistical methods (e.g., Z-score).
    • Day 137: Monthly Review & Project Prep
      • Review: All concepts from Month 5.
      • Project Prep: Prepare for an unsupervised learning project.
    • Day 138: Unsupervised Learning Project
      • Assignment: Take a dataset (e.g., customers with features, or gene expression data). Perform K-Means clustering and PCA. Visualize the clusters in reduced dimensions. Interpret the results.
    • Day 139: Rest/Catch-up
    • Day 140: Monthly Review & Catch-up
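
A minimal sketch of PCA via SVD (Day 123), cross-checked against the eigenvalues of the covariance matrix (Day 121). It uses a small synthetic dataset rather than MNIST to stay self-contained. The data and the helper name pca_svd are assumptions for illustration.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the centered data matrix."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                # principal directions (rows)
    explained_variance = S**2 / (X.shape[0] - 1)  # variance along each direction
    ratio = explained_variance / explained_variance.sum()
    projected = X_centered @ components.T         # data in the reduced space
    return projected, components, ratio[:n_components]

# Synthetic 3D data that mostly varies along one direction (hypothetical)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.05 * rng.normal(size=(300, 3))

Z, components, ratio = pca_svd(X, n_components=2)

# Cross-check: eigenvalues of the covariance matrix give the same variances
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

print("explained variance ratio:", ratio)
print("covariance eigenvalue ratio:", cov_eigvals[:2] / cov_eigvals.sum())
```

The ratio returned here should be comparable to explained_variance_ratio_ from sklearn.decomposition.PCA on the same data, which is the comparison suggested for Day 124.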

Month 6: Deep Learning Fundamentals

Goal: Understand the core mathematical principles behind neural networks and popular architectures.

  • Week 21: Neural Network Basics & Forward Pass

    • Day 141: Perceptrons & Biological Analogy
      • MMLL: Chapter 12.1 (Feedforward Neural Networks) - Focus on the basic unit.
      • YouTube: The Perceptron (Andrew Ng's ML Course)
      • Assignment: Understand how a single perceptron works.
    • Day 142: Activation Functions (Sigmoid, Tanh, ReLU)
      • MMLL: Chapter 12.1.2 (Activation Functions)
      • YouTube: Activation Functions, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand the mathematical forms and properties of each activation function. Plot them.
    • Day 143: Feedforward Neural Networks (MLPs) - Architecture
    • Day 144: Forward Propagation - Matrix Math
      • MMLL: Chapter 12.1.3 (Feedforward Pass)
      • YouTube: Forward Propagation, Clearly Explained!!! (StatQuest with Josh Starmer)
      • Assignment: Understand how each layer's output is calculated using matrix multiplication and activation functions.
    • Day 145: Hands-on MLP Forward Pass (NumPy)
      • Assignment: Implement a simple 2-layer MLP (input, 1 hidden, output) forward pass using NumPy. Use random weights and biases.
    • Day 146: Review & Practice
      • Assignment: Review the forward pass. Search for "neural network forward propagation example" and trace the calculations.
    • Day 147: Rest/Catch-up
  • Week 22: Backpropagation - The Core of Learning

    • Day 148: Backpropagation - The Chain Rule in Action
    • Day 149: Deriving Gradients for Output Layer
      • MMLL: Chapter 12.2.2 (Output Layer Gradients)
      • YouTube: Backpropagation for a Neural Network (Andrew Ng's ML Course)
      • Assignment: Walk through the derivation of gradients for the output layer's weights and biases (e.g., for MSE loss).
    • Day 150: Deriving Gradients for Hidden Layers
      • MMLL: Chapter 12.2.3 (Hidden Layer Gradients)
      • Assignment: Understand how errors are "propagated backward" to update hidden layer weights.
    • Day 151: Hands-on Backpropagation (Simple MLP in NumPy)
      • Assignment: Extend your MLP from Day 145 to include a loss function (e.g., MSE) and implement the backpropagation algorithm to update weights and biases. This is a significant challenge, but highly rewarding.
    • Day 152: Loss Functions in Deep Learning (Recap)
      • MMLL: Chapter 12.1.4 (Loss Functions)
      • YouTube: Loss Functions for Machine Learning (StatQuest with Josh Starmer)
      • Assignment: Revisit MSE (for regression) and Cross-Entropy (for classification) in the context of NNs.
    • Day 153: Hands-on Basic NN with Framework (TensorFlow/PyTorch)
      • Assignment: Build a basic MLP using TensorFlow Keras or PyTorch. Train it on a simple dataset (e.g., MNIST digits). Focus on model.compile/model.fit or the training loop.
    • Day 154: Rest/Catch-up
  • Week 23: CNNs - Convolutions Explained

    • Day 155: Convolution Operation - Mathematical Definition
    • Day 156: Padding & Stride
      • YouTube: CNNs Part 2: Padding and Strides (DeepLearning.AI)
      • Assignment: Understand how padding (same, valid) and stride affect the output dimensions of a convolution.
    • Day 157: Pooling Layers (Max Pooling, Average Pooling)
      • YouTube: CNNs Part 3: Pooling Layers (DeepLearning.AI)
      • Assignment: Understand the purpose of pooling (downsampling, translation invariance).
    • Day 158: Hands-on Basic Convolution (NumPy)
      • Assignment: Implement a simple 2D convolution operation (without padding/stride) using NumPy. Test it on a small matrix and a simple filter.
    • Day 159: CNN Architecture Overview (Conceptual)
    • Day 160: Hands-on CNN (TensorFlow/PyTorch)
      • Assignment: Build a simple CNN for image classification (e.g., Fashion MNIST or CIFAR-10) using your chosen framework.
    • Day 161: Rest/Catch-up
  • Week 24: RNNs & Advanced Concepts (High-Level)

    • Day 162: Recurrent Neural Networks (RNNs) - Math of Recurrence
    • Day 163: Vanishing/Exploding Gradients in RNNs
    • Day 164: LSTMs & GRUs (Conceptual)
    • Day 165: Attention Mechanisms & Transformers (Very High-Level)
    • Day 166: Monthly Review & Capstone Project Prep
      • Review: All deep learning concepts, especially the role of math in NNs.
      • Project Prep: Brainstorm ideas for your final capstone project.
    • Day 167: Capstone Project Work Day 1
      • Assignment: Begin working on your chosen project. Focus on data loading, preprocessing, and setting up the basic model.
    • Day 168: Rest/Catch-up
  • Week 25 (Optional Extension/Buffer): Capstone Project & Future Learning

    • Day 169: Capstone Project Work Day 2
      • Assignment: Continue implementing and training your model.
    • Day 170: Capstone Project Work Day 3
      • Assignment: Evaluate your model, try different hyperparameters, and analyze results.
    • Day 171: Information Theory for ML (Entropy, KL Divergence)
    • Day 172: Causal Inference (Basic Concepts)
    • Day 173: Comprehensive Math Review
      • Assignment: Go back through your notes and MMLL. Revisit any challenging math concepts from Linear Algebra, Calculus, Probability, and Statistics.
    • Day 174: Capstone Project Finalization
      • Assignment: Prepare your project report/notebook. Clearly explain the problem, your approach, the models used, and the mathematical insights gained.
    • Day 175: Final Project Presentation & Future Learning Plan
      • Assignment: Present your project to yourself or a peer. Outline your next steps in ML and math.
    • Day 176-180: Buffer/Deep Dive/Review
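
A small sketch for the Day 158 assignment: a 2D convolution with no padding and stride 1. As in most deep learning frameworks, this is technically cross-correlation because the kernel is not flipped. The input matrix and filter are made-up examples.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1  # output size for 'valid' convolution
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 4x4 input with entries 0..15 and a 2x2 diagonal-difference filter
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

result = conv2d_valid(image, kernel)
print(result)
```

Each output entry is image[i, j] - image[i+1, j+1], which for this input is always -5, so the 3x3 result is easy to verify by hand.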

Thursday, May 29, 2025

3B1B notes

A few points that I need to note:
3B1B’s ML explainer videos are pretty good.
 They explain that there are in fact three stages of optimization that occur. First the training data gives us a value for the different stages. Within these stages, weights determine the importance of each variable (I am using the wrong terms; need to look it up) and the error function gives us variance.
 Mean squared error tells us how far off the predictions are from the actual values (it is the average of the squared differences); training tries to minimize this value.
 Stochastic gradient descent differs from (batch) gradient descent in that each update uses only a small random batch of the data rather than the entire dataset, because computing the gradient on the entire dataset is computationally expensive. It settles for a good minimum rather than the global minimum, since finding the global minimum is generally not feasible.
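
To make the last two points concrete for myself, here is a rough sketch of mini-batch SGD minimizing mean squared error for a one-variable linear model. The data, learning rate, batch size and epoch count are all assumed values.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

# Synthetic data from y = 3x + 0.5 plus noise (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32
initial_mse = mse(y, w * X[:, 0] + b)

for epoch in range(20):
    order = rng.permutation(len(y))          # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb              # prediction error on this batch only
        w -= alpha * 2 * np.mean(err * xb)   # gradient of batch MSE w.r.t. w
        b -= alpha * 2 * np.mean(err)        # gradient of batch MSE w.r.t. b

final_mse = mse(y, w * X[:, 0] + b)
print("initial MSE:", initial_mse, "final MSE:", final_mse)
print("w:", w, "b:", b)
```

Each update only looks at 32 random rows, so it is cheap, but the path towards the minimum is noisier than with the full-data gradient.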

Saturday, May 17, 2025

Random Forest Classifier Code

Source: https://www.kaggle.com/code/prashant111/random-forest-classifier-tutorial

-Based on ensemble learning
-Highlights the importance of feature selection: run once, see which features are important, remove the others, re-run, and test whether accuracy increases
-Remember that random forest can be used for both classification and regression problems.

-In a random forest classifier, a higher number of trees generally gives higher accuracy, though the gains level off as more trees are added
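
A rough sketch of the feature-selection loop described above, using scikit-learn on a synthetic dataset instead of the one from the Kaggle tutorial. The dataset parameters, the choice to keep the top 3 features, and the helper name fit_and_score are my own assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, only 3 of which carry signal (hypothetical)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def fit_and_score(cols):
    """Train a forest on the given feature columns and return it with its test accuracy."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, cols], y_train)
    return model, model.score(X_test[:, cols], y_test)

# Run once on all features and rank them by importance
full_model, full_acc = fit_and_score(np.arange(X.shape[1]))
ranking = np.argsort(full_model.feature_importances_)[::-1]

# Keep only the most important features and re-run
top_cols = ranking[:3]
_, reduced_acc = fit_and_score(top_cols)

print("accuracy with all features:", full_acc)
print("top features:", top_cols, "accuracy with top features:", reduced_acc)
```

Whether accuracy actually improves after dropping features depends on the dataset, so the comparison has to be checked each time rather than assumed.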


Sunday, April 27, 2025

XGBoost Analysis Code

 #Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode

from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import shap
import warnings
warnings.filterwarnings('ignore')


#Import dataset 
data = 'C:/datasets/Wholesale customers data.csv' 
df = pd.read_csv(data)

#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()

#missing value check
df.isnull().sum()

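#Checking the class counts of the target
df['Channel'].value_counts()

#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#assuming Channel is coded 1/2 as in the UCI Wholesale customers data:
#map it to 1/0, since Logit and binary:logistic expect 0/1 labels
y = y.replace({2: 0})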

#Var checks
X.head()

y.head()

#One-hot encoding (OHE)
X_features = X
encoded_df = pd.get_dummies(X_features, drop_first=True)

#Checking the columns created after encoding
list(encoded_df.columns)

#Null imputation


#Performing Logistic Regression
import statsmodels.api as sm

#Split first so that X_train and y_train exist before fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logit = sm.Logit(y_train, X_train)
logit_model = logit.fit()

#Model Summary
logit_model.summary2()

def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Filter the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])

significant_vars = get_significant_vars(logit_model)
significant_vars



# import XGBoost
#import xgboost as xgb 
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)

# split X and y into training and testing sets 
#from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# import XGBClassifier
#from xgboost import XGBClassifier

# declare parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 4,
    'alpha': 10,
    'learning_rate': 1.0,
    'n_estimators': 100
}

# instantiate the classifier 
xgb_clf = XGBClassifier(**params)


# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
#XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
#       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
#       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
#       n_estimators=100, n_jobs=1, nthread=None,
#       objective='binary:logistic', random_state=0, reg_alpha=0,
#       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
#       subsample=1, verbosity=1)
# alternatively view the parameters of the xgb trained model
print(xgb_clf)
#(prints the same XGBClassifier(...) parameter listing as above)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))



https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03

https://medium.com/@rithpansanga/optimizing-xgboost-a-guide-to-hyperparameter-tuning-77b6e48e289d
https://medium.com/@sadafsaleem5815/neural-networks-in-10mins-simply-explained-9ec2ad9ea815

https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook

XGBoost Model Documentation: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

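Remember: It is worth setting a subsample value for our exercise (https://xgboosting.com/configure-xgboost-subsample-parameter/). Note that subsample only controls the fraction of rows sampled per tree; since the dataset is imbalanced, scale_pos_weight is the parameter the XGBoost docs describe for balancing positive and negative classes.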
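
Since the note above is about class imbalance, here is a minimal sketch of the ratio that the XGBoost parameter docs suggest as a typical value to consider for scale_pos_weight (negative count divided by positive count). The label array is made up for illustration, and the 0.8 in the final comment is an arbitrary placeholder.

```python
import numpy as np

# Hypothetical 0/1 labels with a rare positive class (illustration only)
y_example = np.array([0] * 95 + [1] * 5)

n_neg = int((y_example == 0).sum())
n_pos = int((y_example == 1).sum())

# Typical value suggested in the XGBoost parameter docs: negatives / positives
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)

# With xgboost installed this could be passed alongside a subsample value, e.g.
# XGBClassifier(scale_pos_weight=scale_pos_weight, subsample=0.8)
```

This is only a starting point; the value would normally be tuned together with the other parameters.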

Practical example on tuning: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Sklearn metrics for AUC: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html





-----------------
Update on 12.05.25:

#Installs
!pip install xgboost
!pip install shap
!pip install statsmodels


#Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import shap
import warnings 
warnings.filterwarnings('ignore')


#Import dataset 

df = pd.read_csv('C.csv')

#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()

#missing value check
df.isnull().sum()

#Checking for types of values
df.value_counts()


#declaring dependent and independent variables
X = df.drop('status', axis=1) 
y = df['status']


#One-hot encoding (OHE) of the categorical features
X_features = X

encoded_df = pd.get_dummies(X_features, drop_first = True)
#note: no DataFrame conversion was needed for X_features since it is already a DataFrame

#Checking the columns created after encoding
list(encoded_df.columns)
#Sample check
#encoded_df.head()


#Var checks
#X.describe()
X.info()
encoded_df.info()
y.info()
#X.head()
#y.describe()
#y.head()


#Null Imputation
#https://www.geeksforgeeks.org/ml-handling-missing-values/

#Strategy 1
# Removing rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)

#Strategy 2
#Mean, Median and Mode Imputation

mean_imputation = df['age'].fillna(df['age'].mean())
median_imputation = df['age'].fillna(df['age'].median())
mode_imputation = df['age'].fillna(df['age'].mode().iloc[0])

print("\nImputation using Mean:")
print(mean_imputation)

print("\nImputation using Median:")
print(median_imputation)

print("\nImputation using Mode:")
print(mode_imputation)
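
One thing the snippets above do not show is where the fill value should come from once the data is split. A minimal sketch, assuming a numeric 'age' column as above and made-up values: compute the median on the training rows only and reuse it for the test rows, so no test information leaks into training.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for a train/test split (illustration only)
train = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0]})
test = pd.DataFrame({"age": [np.nan, 52.0]})

# Learn the fill value from the training rows only
age_median = train["age"].median()

# Apply the same value to both sets
train_filled = train["age"].fillna(age_median)
test_filled = test["age"].fillna(age_median)

print(age_median)
print(test_filled.tolist())
```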


-----------------
# split X and y into training and testing sets 
#from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(encoded_df, y, test_size = 0.3, random_state = 0)

#don't forget to use the X DataFrame that has the one-hot encoding (encoded_df)

#Performing Logistic Regression
#import statsmodels.api as sm

#Encoded DF - encoded_df
#cast to float so statsmodels does not receive bool/object dummy columns
logit = sm.Logit(y_train, X_train.astype(float))
logit_model = logit.fit()


#Functions for converting objects to int or float
#pd.to_numeric(df['col'], errors='coerce') - replaces df.convert_objects(convert_numeric=True), which was removed from pandas ('col' is a placeholder)
#X.astype(float) - converting type

#Model Summary
logit_model.summary2()


def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Filter the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])

significant_vars = get_significant_vars(logit_model)
significant_vars


-------------
# import XGBoost
#import xgboost as xgb 
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=encoded_df,label=y)

# declare parameters
params = {
            'objective':'binary:logistic',
            'max_depth': 4,
            'alpha': 10,
            'learning_rate': 1.0,
            'n_estimators':100
        }


# instantiate the classifier 
xgb_clf = XGBClassifier(**params)

# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)
       
       
# make predictions on test data
y_pred = xgb_clf.predict(X_test)


# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))


-------------------------------------------------------------------------------

Few things to remember:

Sampling - We rarely stop to check what the data distribution actually looks like; we just assume normality.

Questions I have on this:
Distribution of which variables specifically?
Isn't fraud by definition a rare event, which would make the distribution skewed anyway?
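
A minimal sketch of how these questions could be checked rather than assumed, using synthetic data in place of the real dataset (the column names 'amount' and 'status' are placeholders): the class share shows how rare the positive class is, and the sample skewness shows how far a numeric feature is from symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: a log-normal feature and a rare positive class
demo = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=1000),
    "status": rng.choice([0, 1], size=1000, p=[0.97, 0.03]),
})

# Share of each class in the target
class_share = demo["status"].value_counts(normalize=True)
print(class_share)

# Skewness near 0 suggests symmetry; a large positive value means a long right tail
raw_skew = demo["amount"].skew()
log_skew = np.log(demo["amount"]).skew()
print(raw_skew, log_skew)
```

On data like this, the raw feature is strongly right-skewed while its log is roughly symmetric, which is the log-normal versus normal distinction the article linked below discusses.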

Log normal vs Normal distribution: https://towardsdatascience.com/log-link-vs-log-transformation-in-r-the-difference-that-misleads-your-entire-data-analysis/ - very good article 



Sunday, April 20, 2025

Interesting Reads

Conjoint Analysis: https://www.qualtrics.com/en-au/experience-management/research/types-of-conjoint/

Great article on Deep Tech (plus awesome visuals): https://www.bcg.com/publications/2021/deep-tech-innovation

Basics on Mathematical Modelling: https://ocw.tudelft.nl/courses/mathematical-modeling-basics/

Books on Operations: https://orc.mit.edu/impact/textbooks/

Solving Cool Math Problems: https://projecteuler.net/archives

XGBoost Resources



Hello,

This page is designed to collate popular resources on xgb for my personal reference. None of it is my work.

Survival Modelling code snippet:
#pip install lifelines
!pip install lifelines
#conda install -c conda-forge lifelines
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
df2 = pd.read_csv(r'FILELINK.csv')
T2 = df2['time']
S2 = df2['status']
print(T2)
kmf2 = KaplanMeierFitter()
kmf2.fit(T2, S2)
print("Survival function:")
print(kmf2.survival_function_)
print("Survival function plot:")
kmf2.plot()
plt.title("Survival Curve: 6-MP")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.savefig("6-MP.pdf")
plt.show()




https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03




I have used the Wholesale customers data set for this project, downloaded from the UCI Machine learning repository. This dataset can be found at the following url:

https://archive.ics.uci.edu/ml/datasets/Wholesale+customers


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')




#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)

#Exploring the dataset


df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#var checks
X.head()
y.head()
#Label encoding
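
The label-encoding step itself is missing from these notes. Here is a minimal sketch of one way it could be done, assuming the target is coded as 1 and 2 (as in the UCI description of Channel) and using a tiny made-up Series rather than the real file. The binary:logistic objective expects labels in {0, 1}, so the values are remapped first.

```python
import pandas as pd

# Made-up target values standing in for df['Channel'] (assumed to be coded 1 and 2)
y_demo = pd.Series([1, 2, 2, 1, 1], name="Channel")

# Remap to 0/1; which original value becomes the positive class is a free choice
y_encoded = y_demo.map({1: 0, 2: 1})
print(y_encoded.tolist())
```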


# import XGBoost
import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)


# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)



General parameters

These parameters relate to which booster we are using to do boosting. The common ones are tree or linear models.

Booster parameters

These depend on which booster we have chosen.

Learning task parameters

These parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

Command line parameters

In addition, there are command line parameters which relate to the behaviour of the CLI version of XGBoost.

The most important parameters that we should know about are as follows:-

learning_rate - It gives us the step size shrinkage which is used to prevent overfitting. Its range is [0,1].

max_depth - It determines how deeply each tree is allowed to grow during any boosting round.

subsample - It determines the percentage of samples used per tree. Low value of subsample can lead to underfitting.

colsample_bytree - It determines the percentage of features used per tree. High value of it can lead to overfitting.

n_estimators - It is the number of trees we want to build.

objective - It determines the loss function to be used in the process. For example, reg:squarederror (called reg:linear in older XGBoost versions) for regression problems, reg:logistic for logistic regression, and binary:logistic for binary classification problems with probability output.

XGBoost also supports regularization parameters to penalize models as they become more complex and reduce them to simple models. These regularization parameters are as follows:-

gamma - It controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. It is supported only for tree-based learners.

alpha - It gives us the L1 regularization on leaf weights. A large value of it leads to more regularization.

lambda - It gives us the L2 regularization on leaf weights and is smoother than L1 regularization.

Though we are using trees as our base learners, we can also use XGBoost’s relatively less popular linear base learners and one other tree learner known as dart. We have to set the booster parameter to either gbtree (default), gblinear or dart.
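
To tie the descriptions above together, here is a sketch of a parameter dictionary that touches each of them. The values are arbitrary placeholders, not tuned recommendations. The keys reg_alpha and reg_lambda are the names the scikit-learn wrapper uses for the L1 and L2 penalties, as seen in the printed model output on this page.

```python
# Placeholder values for illustration only, not tuned for any dataset
example_params = {
    "booster": "gbtree",             # gbtree (default), gblinear or dart
    "objective": "binary:logistic",  # binary classification, outputs probabilities
    "learning_rate": 0.1,            # step size shrinkage, in [0, 1]
    "max_depth": 4,                  # maximum depth of each tree
    "subsample": 0.8,                # fraction of rows sampled per tree
    "colsample_bytree": 0.8,         # fraction of features sampled per tree
    "n_estimators": 100,             # number of trees (boosting rounds)
    "gamma": 0,                      # minimum loss reduction required to split
    "reg_alpha": 0,                  # L1 penalty on leaf weights
    "reg_lambda": 1,                 # L2 penalty on leaf weights
}

# Sanity check: the sampling fractions and learning rate stay within (0, 1]
for key in ("learning_rate", "subsample", "colsample_bytree"):
    assert 0 < example_params[key] <= 1

# With xgboost installed, this would be used as: XGBClassifier(**example_params)
```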

# import XGBClassifier
from xgboost import XGBClassifier


# declare parameters
params = {
            'objective':'binary:logistic',
            'max_depth': 4,
            'alpha': 10,
            'learning_rate': 1.0,
            'n_estimators':100
        }
            
            
            
# instantiate the classifier 
xgb_clf = XGBClassifier(**params)



# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)

#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
       max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)

# check accuracy score
from sklearn.metrics import accuracy_score

print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

#K-fold Cross Val
from xgboost import cv

params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

xgb_cv = cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="auc", as_pandas=True, seed=123) 
xgb_cv.head()
#xgb_cv contains train and test auc metrics for each boosting round. Let's preview xgb_cv.
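
A sketch of how the returned table could be read, assuming the column naming that xgboost's cv uses for the auc metric (train-auc-mean, test-auc-mean and the matching -std columns). The numbers below are invented so the snippet runs without a dataset; with real output, replace cv_demo with xgb_cv.

```python
import pandas as pd

# Invented values shaped like the DataFrame returned by cv(..., metrics="auc")
cv_demo = pd.DataFrame({
    "train-auc-mean": [0.90, 0.93, 0.95, 0.96],
    "train-auc-std": [0.01, 0.01, 0.01, 0.01],
    "test-auc-mean": [0.88, 0.91, 0.92, 0.915],
    "test-auc-std": [0.02, 0.02, 0.02, 0.02],
})

# Row index of the best mean validation AUC (rows are 0-based boosting rounds)
best_round = int(cv_demo["test-auc-mean"].idxmax())
best_auc = float(cv_demo["test-auc-mean"].max())

# Report the round as a 1-based count of trees
print(best_round + 1, best_auc)
```

A growing gap between the train and test columns is a sign of overfitting, which is why the test column is the one used to pick the number of rounds.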

14. Feature importance with XGBoost

XGBoost provides a way to examine the importance of each feature in the original dataset within the model. It involves counting the number of times each feature is split on across all boosting trees in the model. Then we visualize the result as a bar graph, with the features ordered according to how many times they appear.

XGBoost has a plot_importance() function that helps us achieve this. We can then see which features have been given the highest importance scores among all the features. Thus XGBoost also gives us a way to do feature selection.

I will proceed as follows:-

# set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = [6, 4]
xgb.plot_importance(xgb_clf)
plt.show()














Sidenote:
#Remember: if there are categorical variables, xgb needs them either encoded as numbers or passed as the pandas 'category' dtype with enable_categorical=True
(ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True. Var1, Var2, Var3, Var4
Link: https://stackoverflow.com/questions/67080149/xgboost-error-when-categorical-type-is-supplied-dmatrix-parameter-enable-cat)
Code block For OHE:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# define the input data
df = pd.DataFrame([
    {'Var1': 'JP', 'Var2': 2009, 'Var3': 6581, 'Var4': 'OME', 'Var5': 325.787218, 'Ind_Var': 8.013558},
    {'Var1': 'FR', 'Var2': 2018, 'Var3': 5783, 'Var4': 'I_S', 'Var5': 11.956326, 'Ind_Var': 4.105559},
    {'Var1': 'BE', 'Var2': 2015, 'Var3': 6719, 'Var4': 'OME', 'Var5': 42.888565, 'Ind_Var': 7.830077},
    {'Var1': 'DK', 'Var2': 2011, 'Var3': 3506, 'Var4': 'RPP', 'Var5': 70.094146, 'Ind_Var': 83.000000},
    {'Var1': 'AT', 'Var2': 2019, 'Var3': 5474, 'Var4': 'NMM', 'Var5': 270.082738, 'Ind_Var': 51.710526}
])

# extract the features and target
X_train, y_train = df.iloc[:3, :-1], df.iloc[:3, -1]
X_test, y_test = df.iloc[3:, :-1], df.iloc[3:, -1]

# one-hot encode the categorical features
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

# train the model
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)
# extract the training set predictions
model.predict(X_train)
# array([7.0887003, 3.7923286, 7.0887003], dtype=float32)
# extract the test set predictions
model.predict(X_test)
# array([7.0887003, 7.0887003], dtype=float32)



#Null imputation strategy for different variables


#Strategy for handling categorical variables:
Strategy 1:
cat_attribs = ['var1','var2','var3','var4']
X_train[cat_attribs] = X_train[cat_attribs].astype('category')
X_test[cat_attribs] = X_test[cat_attribs].astype('category')

model = XGBRegressor(n_estimators=10, max_depth=20, enable_categorical=True, verbosity=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Strategy 2:
# Create a mapping of labels to encoded values from X_train
train_categories = X_train['var1'].astype('category').cat.categories
training_encoded_mapping = {label: code for code, label in enumerate(train_categories)}
X_train['var1'] = X_train['var1'].map(training_encoded_mapping)
# Apply the same mapping to X_test (labels not seen in X_train become NaN)
X_test['var1'] = X_test['var1'].map(training_encoded_mapping)
# Do the same for the other vars as well
And now don't pass enable_categorical=True in model initialization


#Models to be used:
Naive Bayes, DTs, Logistic Regression (it is a classification technique and not regression), NN, SVM


Unsupervised learning (clustering algorithms): K-means clustering, Mean-shift, DBSCAN. Remember that K-means needs the number of clusters specified up front (here two, to allow only for default and non-default behaviour), whereas Mean-shift and DBSCAN infer the number of clusters from the data. This does not have a direct application here, but I need to learn more.













Saturday, April 19, 2025

Data Viz Resources



Hello,

Ideas on better plotting of distributions (banding and how the population lies for any variable):

https://rafalab.dfci.harvard.edu/dsbook/dataviz-distributions.html
https://seaborn.pydata.org/tutorial/distributions.html - Using seaborn
https://github.com/cuttlefishh/python-for-data-analysis/blob/master/lessons/lesson10.ipynb
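
For the banding idea, here is a minimal sketch using pandas, with synthetic ages and arbitrary band edges (both are assumptions for illustration): pd.cut assigns each value to a band, and value_counts shows how the population is spread across the bands.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic variable standing in for any numeric column (values 18 to 79)
ages = pd.Series(rng.integers(18, 80, size=500), name="age")

# Arbitrary band edges; right=False makes each band include its left edge
edges = [18, 30, 45, 60, 80]
bands = pd.cut(ages, bins=edges, right=False)

# Share of the population in each band, kept in band order
band_share = bands.value_counts(normalize=True, sort=False)
print(band_share)
```

The resulting shares can be passed straight to a bar chart, for example band_share.plot(kind="bar") when matplotlib is available.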



How to connect big query with python: https://codelabs.developers.google.com/codelabs/cloud-bigquery-python#0




The idea here is to assimilate resources from across the internet that will help me level up my visualization game using python. I am pretty much a noob at this and want to get better.




Most helpful visualization libraries in python (https://www.kaggle.com/discussions/getting-started/1087922)



1- matplotlib


matplotlib is the O.G. of Python data visualization libraries. Despite being over a decade old, it’s still the most widely used library for plotting in the Python community. It was designed to closely resemble MATLAB, a proprietary programming language developed in the 1980s.

2- Seaborn

Seaborn harnesses the power of matplotlib to create beautiful charts in a few lines of code. The key difference is Seaborn’s default styles and color palettes, which are designed to be more aesthetically pleasing and modern. Since Seaborn is built on top of matplotlib, you’ll need to know matplotlib to tweak Seaborn’s defaults.

3- ggplot

ggplot is based on ggplot2, an R plotting system, and concepts from The Grammar of Graphics. ggplot operates differently than matplotlib: it lets you layer components to create a complete plot. For instance, you can start with axes, then add points, then a line, a trendline, etc. Although The Grammar of Graphics has been praised as an “intuitive” method for plotting, seasoned matplotlib users might need time to adjust to this new mindset.

4- Bokeh

Like ggplot, Bokeh is based on The Grammar of Graphics, but unlike ggplot, it’s native to Python, not ported over from R. Its strength lies in the ability to create interactive, web-ready plots, which can be easily outputted as JSON objects, HTML documents, or interactive web applications. Bokeh also supports streaming and real-time data.

5- pygal

Like Bokeh and Plotly, pygal offers interactive plots that can be embedded in the web browser. Its prime differentiator is the ability to output charts as SVGs. As long as you’re working with smaller datasets, SVGs will do you just fine. But if you’re making charts with hundreds of thousands of data points, they’ll have trouble rendering and become sluggish.

6- Plotly

You might know Plotly as an online platform for data visualization, but did you also know you can access its capabilities from a Python notebook? Like Bokeh, Plotly’s forte is making interactive plots, but it offers some charts you won’t find in most libraries, like contour plots, dendrograms, and 3D charts.

7- geoplotlib

geoplotlib is a toolbox for creating maps and plotting geographical data. You can use it to create a variety of map-types, like choropleths, heatmaps, and dot density maps. You must have Pyglet (an object-oriented programming interface) installed to use geoplotlib. Nonetheless, since most Python data visualization libraries don’t offer maps, it’s nice to have a library dedicated solely to them.

8- Gleam

Gleam is inspired by R’s Shiny package. It allows you to turn analyses into interactive web apps using only Python scripts, so you don’t have to know any other languages like HTML, CSS, or JavaScript. Gleam works with any Python data visualization library. Once you’ve created a plot, you can build fields on top of it so users can filter and sort data.

9- missingno

Dealing with missing data is a pain. missingno allows you to quickly gauge the completeness of a dataset with a visual summary, instead of trudging through a table. You can filter and sort data based on completion or spot correlations with a heatmap or a dendrogram.

10- Leather

Leather’s creator, Christopher Groskopf, puts it best: “Leather is the Python charting library for those who need charts now and don’t care if they’re perfect.” It’s designed to work with all data types and produces charts as SVGs, so you can scale them without losing image quality.







https://github.com/mathisonian/awesome-visualization-research

https://mode.com/blog/python-data-visualization-libraries



1. Seaborn

Seaborn is built on top of the matplotlib library. It has many built-in functions with which you can create beautiful plots in just a few lines of code. It provides a variety of advanced visualization plots with simple syntax, such as box plots, violin plots, dist plots, joint plots, pair plots, heatmaps, and many more.

Key Features:
It can be used to determine the relationship between two variables.
It differentiates between analyzing uni-variate and bi-variate distributions.
It can plot the linear regression model for the dependent variable.
It provides multi-grid plotting.

Official website: https://seaborn.pydata.org/

2. Plotly

Plotly is an advanced Python analytics library that helps in building interactive dashboards. The graphs built using Plotly are interactive, which means you can easily read off the value at any particular point of a graph. Plotly makes it easy to generate dashboards and deploy them on a server. It supports Python, R, and the Julia programming language.

You can create a wide range of graphs using Plotly:
Basic charts
Statistical charts
Scientific charts
Financial charts
Maps
Subplots
Transforms
Jupyter Widgets Interaction

Official website: https://plotly.com/

3. Geoplotlib

Geoplotlib is an open-source Python toolbox for visualizing geographical data. It supports the development of hardware-accelerated interactive visualizations in pure Python and provides implementations of dot maps, kernel density estimation, spatial graphs, Voronoi tessellation, shapefiles, and many more common spatial visualizations.

Geoplotlib can be used to make a variety of maps, such as choropleths, heat maps, and dot density maps. There are also several extended modules:
geoplotlib
geoplotlib.layers
geoplotlib.utils
geoplotlib.core
geoplotlib.colors

Official website: https://andrea-cuttone.github.io/geoplotlib/

4. Gleam

Gleam is inspired by R’s Shiny package. It allows you to turn analyses into interactive web apps using only Python scripts, so you don’t have to know any other languages like HTML, CSS, or JavaScript. Gleam works with any Python data visualization library. Once you’ve created a plot, you can build fields on top of it so users can filter and sort data.

Official website: https://github.com/dgrtwo/gleam

5. ggplot/ggplot2

ggplot works differently from matplotlib. It lets you add multiple components as layers to build up a complete plot. For example, you can start with the axes, then add points, and then other components like a trend line.
It is generally recommended to store your data in a data frame before using ggplot, which keeps the plotting code simpler and more efficient.

Official website: https://ggplot2.tidyverse.org/reference/ggplot.html




Key code snippets:




Volunteering exploration

1. Vidyanjali - MHRD
2. Amex itself?
3. Bhumi
4. Smile foundation
5. Lotus foundation - ggn
6.