#Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import shap
import warnings
warnings.filterwarnings('ignore')
#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)
#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#Checking for types of values
df['Channel'].value_counts()
#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#Var checks
X.head()
y.head()
#One-hot encoding (OHE) of the features
X_features=X
encoded_df = pd.get_dummies(X_features, drop_first=True)
#Checking the columns created after encoding
list(encoded_df.columns)
#Null imputation
#Performing Logistic Regression
import statsmodels.api as sm
logit=sm.Logit(y_train,X_train)
logit_model=logit.fit()
#Model Summary
logit_model.summary2()
def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Filter the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])
significant_vars = get_significant_vars(logit_model)
significant_vars
# import XGBoost
#import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)
# split X and y into training and testing sets
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
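Note: the split above is unstratified; since StratifiedKFold is already imported and the Channel classes may not be evenly sized, a minimal sketch of a stratified variant (the stratify argument is the only change and is an assumption, not part of the original run):
# Optional sketch: stratified split to preserve the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)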
# import XGBClassifier
#from xgboost import XGBClassifier
# declare parameters
params = {
'objective':'binary:logistic',
'max_depth': 4,
'alpha': 10,
'learning_rate': 1.0,
'n_estimators':100
}
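data_dmatrix is defined above but never used by the sklearn-style classifier; a minimal sketch of how it could feed xgb.cv for a quick cross-validated sanity check (nfold, num_boost_round and the logloss metric are illustrative assumptions):
# Sketch: cross-validation directly on the DMatrix (values are illustrative, not tuned)
cv_results = xgb.cv(params=params, dtrain=data_dmatrix, nfold=5, num_boost_round=50,
                    metrics='logloss', early_stopping_rounds=10, seed=0)
print(cv_results.tail())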
# instantiate the classifier
xgb_clf = XGBClassifier(**params)
# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
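Accuracy alone can be misleading when the classes are unbalanced; a minimal sketch of two complementary checks (cohen_kappa_score is already imported at the top, confusion_matrix is an addition):
# Sketch: confusion matrix and Cohen's kappa as complementary metrics
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
print('Cohen kappa score: {0:0.4f}'.format(cohen_kappa_score(y_test, y_pred)))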
https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03
https://medium.com/@rithpansanga/optimizing-xgboost-a-guide-to-hyperparameter-tuning-77b6e48e289d
https://medium.com/@sadafsaleem5815/neural-networks-in-10mins-simply-explained-9ec2ad9ea815
https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
XGBoost Model Documentation: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
Remember: It is important to set a subsample value for our exercise since the dataset is imbalanced (https://xgboosting.com/configure-xgboost-subsample-parameter/)
Practical example on tuning: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Sklearn metrics for accuracy: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
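Following the subsample note above, a minimal sketch of how subsample and scale_pos_weight could be added to the params dict; the values are illustrative assumptions and would need tuning:
# Sketch: imbalance-aware parameters (values are assumptions, not tuned)
params_imbalanced = {
    'objective': 'binary:logistic',
    'max_depth': 4,
    'learning_rate': 0.3,
    'subsample': 0.8,            # fraction of rows sampled per tree
    'scale_pos_weight': 2,       # roughly the negative/positive class ratio
    'n_estimators': 100
}
xgb_clf_balanced = XGBClassifier(**params_imbalanced)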
-----------------
Update on 12.05.25:
#Installs
!pip install xgboost
!pip install shap
!pip install statsmodels
#Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mode
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import shap
import warnings
warnings.filterwarnings('ignore')
#Import dataset
df = pd.read_csv('C.csv')
#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#Checking for types of values
df['status'].value_counts()
#declaring dependent and independent variables
X = df.drop('status', axis=1)
y = df['status']
#One-hot encoding (OHE) of the features
X_features= X
encoded_df = pd.get_dummies(X_features, drop_first = True)
#note: here the df conversion for X_features was not required since it is already a DataFrame
#Checking the columns created after encoding
list(encoded_df.columns)
#Sample check
#encoded_df.head()
#Var checks
#X.describe()
X.info()
encoded_df.info()
y.info()
#X.head()
#y.describe()
#y.head()
#Null Imputation
#https://www.geeksforgeeks.org/ml-handling-missing-values/
#Strategy 1
# Removing rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
#Strategy 2
#Mean, Median and Mode Imputation
mean_imputation = df['age'].fillna(df['age'].mean())
median_imputation = df['age'].fillna(df['age'].median())
mode_imputation = df['age'].fillna(df['age'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)
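A third option alongside dropna and the manual fillna calls is scikit-learn's SimpleImputer; a minimal sketch (the 'age' column follows the GeeksforGeeks example above and may not exist in this dataset):
#Strategy 3 (sketch)
#sklearn SimpleImputer for numeric columns
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df[['age']] = imputer.fit_transform(df[['age']])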
-----------------
# split X and y into training and testing sets
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(encoded_df, y, test_size = 0.3, random_state = 0)
#don't forget to use the X DataFrame that has the one-hot encoding (encoded_df)
#Performing Logistic Regression
#import statsmodels.api as sm
#df.convert_objects(convert_numeric=True)
#Encoded DF - encoded_df
logit=sm.Logit(y_train,X_train)
logit_model=logit.fit()
#Helpers for converting object columns to int or float
#df.convert_objects(convert_numeric=True) was deprecated and removed from pandas; use pd.to_numeric or .astype(float) instead
#sm.Logit(y_train, X_train.astype(float)).fit() - converting type inline
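The comments above point at two common pitfalls: statsmodels' Logit needs numeric (float) inputs, and it does not add an intercept automatically. A minimal sketch with hypothetical variable names, assuming an intercept is wanted:
#Sketch: cast to float and add an intercept before fitting
X_train_const = sm.add_constant(X_train.astype(float))
logit_model_const = sm.Logit(y_train.astype(float), X_train_const).fit()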
#Model Summary
logit_model.summary2()
def get_significant_vars(lm):
    #Store the p-values and corresponding column names in a dataframe
    var_p_vals_df = pd.DataFrame(lm.pvalues)
    var_p_vals_df['vars'] = var_p_vals_df.index
    var_p_vals_df.columns = ['pvals', 'vars']
    #Filter the column names where the p-value is less than 0.05
    return list(var_p_vals_df[var_p_vals_df.pvals < 0.05]['vars'])
significant_vars = get_significant_vars(logit_model)
significant_vars
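Once the significant variables are extracted, a minimal sketch of refitting a reduced model on just those columns (this step is an addition, not part of the original notes):
#Sketch: refit the logistic regression on the significant columns only
X_train_sig = X_train[significant_vars]
logit_model_reduced = sm.Logit(y_train, X_train_sig).fit()
logit_model_reduced.summary2()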
-------------
# import XGBoost
#import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=encoded_df,label=y)
# declare parameters
params = {
'objective':'binary:logistic',
'max_depth': 4,
'alpha': 10,
'learning_rate': 1.0,
'n_estimators':100
}
# instantiate the classifier
xgb_clf = XGBClassifier(**params)
# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
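plot_importance and shap are imported at the top but never used; a minimal sketch of how they could be applied to the fitted classifier (plot details are assumptions):
# Sketch: built-in feature importance plot
plot_importance(xgb_clf)
plt.show()
# Sketch: SHAP values for the test set (TreeExplainer supports XGBoost tree models)
explainer = shap.TreeExplainer(xgb_clf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)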
-------------------------------------------------------------------------------
A few things to remember:
Sampling - we rarely stop to check what the data distribution actually looks like; we just assume normality.
Questions I have on this:
Distribution of which variables specifically?
Isn't fraud by definition a rare event, which would make the distribution skewed anyway?
Log normal vs Normal distribution: https://towardsdatascience.com/log-link-vs-log-transformation-in-r-the-difference-that-misleads-your-entire-data-analysis/ - very good article
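Related to the questions above, a minimal sketch of actually checking the target balance and feature skewness instead of assuming normality (uses the variables defined in the update section):
#Sketch: inspect class balance and feature skewness before assuming normality
print(y.value_counts(normalize=True))      # how imbalanced is the target?
print(df.skew(numeric_only=True))          # skewness of each numeric feature
df.hist(figsize=(12, 8))
plt.show()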