Hello,
This page is designed to collate popular resources on xgb for my personal reference. None of it is my work.
Survival Modelling code snippet:
#pip install lifelines
!pip install lifelines
#conda install -c conda-forge lifelines
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
df2 = pd.read_csv(r'FILELINK.csv')
T2 = df2['time']      # durations
S2 = df2['status']    # event observed indicator (lifelines convention: 1 = event, 0 = censored)
print(T2)
kmf2 = KaplanMeierFitter()
kmf2.fit(T2, S2)
print("Survival function:")
print(kmf2.survival_function_)
print("Survival function plot:")
kmf2.plot()
plt.title("Survival Curve: 6-MP")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.savefig("6-MP.pdf")
plt.show()
https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03
I have used the Wholesale customers data set for this project, downloaded from the UCI Machine learning repository. This dataset can be found at the following url:
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)
#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#var checks
X.head()
y.head()
#Label encoding - Channel takes values 1 and 2 in this dataset; map to 0/1 for binary classification
y = y.map({1: 0, 2: 1})
# import XGBoost
import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
General parameters
These parameters relate to which booster we are using to do boosting; the common choices are a tree or a linear model.
Booster parameters
These depend on which booster we have chosen for boosting.
Learning task parameters
These parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
Command line parameters
In addition, there are command line parameters that relate to the behaviour of the CLI version of XGBoost.
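As a minimal sketch of how these categories fit together in one params dict for the native xgb.train API (the values below are illustrative assumptions, not tuned settings):
import xgboost as xgb
# illustrative parameter values, annotated by category
params = {
    'booster': 'gbtree',             # general parameter: which booster to use
    'max_depth': 4,                  # booster parameter: maximum tree depth
    'eta': 0.1,                      # booster parameter: learning rate (step size shrinkage)
    'objective': 'binary:logistic',  # learning task parameter: loss function
    'eval_metric': 'auc'             # learning task parameter: evaluation metric
}
booster = xgb.train(params, data_dmatrix, num_boost_round=50)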
The most important parameters that we should know about are as follows:-
learning_rate - It gives us the step size shrinkage which is used to prevent overfitting. Its range is [0,1].
max_depth - It determines how deeply each tree is allowed to grow during any boosting round.
subsample - It determines the percentage of samples used per tree. A low value of subsample can lead to underfitting.
colsample_bytree - It determines the percentage of features used per tree. A high value can lead to overfitting.
n_estimators - It is the number of trees we want to build.
objective - It determines the loss function to be used. For example, reg:squarederror (formerly reg:linear) for regression problems, binary:logistic for binary classification returning a probability, and binary:hinge for binary classification returning only a 0/1 decision.
XGBoost also supports regularization parameters that penalize models as they become more complex, keeping them simple. These regularization parameters are as follows:-
gamma - It controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. It is supported only for tree-based learners.
alpha - It gives us the L1 regularization on leaf weights. A large value of it leads to more regularization.
lambda - It gives us the L2 regularization on leaf weights and is smoother than L1 regularization.
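As a minimal sketch, these map onto the gamma, reg_alpha and reg_lambda arguments of the scikit-learn wrapper (the values below are illustrative assumptions):
from xgboost import XGBClassifier
# gamma = minimum loss reduction required to split, reg_alpha = L1 penalty, reg_lambda = L2 penalty on leaf weights
regularized_clf = XGBClassifier(objective='binary:logistic', gamma=1.0, reg_alpha=0.5, reg_lambda=1.0)
regularized_clf.fit(X_train, y_train)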
Though we are using trees as our base learners, we can also use XGBoost’s relatively less popular linear base learners and one other tree learner known as dart. We have to set the booster parameter to either gbtree (default), gblinear or dart.
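A minimal sketch of switching boosters (hypothetical settings; gblinear ignores tree-specific parameters such as max_depth):
from xgboost import XGBClassifier
# gbtree is the default; gblinear and dart are the alternatives
linear_clf = XGBClassifier(booster='gblinear', objective='binary:logistic')
dart_clf = XGBClassifier(booster='dart', max_depth=4, objective='binary:logistic')
linear_clf.fit(X_train, y_train)
dart_clf.fit(X_train, y_train)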
# import XGBClassifier
from xgboost import XGBClassifier
# declare parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 4,
    'alpha': 10,
    'learning_rate': 1.0,
    'n_estimators': 100
}
# instantiate the classifier
xgb_clf = XGBClassifier(**params)
# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0, max_delta_step=0, max_depth=4, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0, max_delta_step=0, max_depth=4, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
# K-fold Cross Validation
from xgboost import cv
params = {"objective": "binary:logistic", 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
xgb_cv = cv(dtrain=data_dmatrix, params=params, nfold=3, num_boost_round=50, early_stopping_rounds=10, metrics="auc", as_pandas=True, seed=123)
xgb_cv.head()
xgb_cv contains train and test AUC metrics for each boosting round. Let's preview xgb_cv.
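As a small follow-up sketch (assuming the standard column names that xgb.cv produces with as_pandas=True):
# xgb_cv has columns such as 'train-auc-mean', 'train-auc-std', 'test-auc-mean', 'test-auc-std'
print(xgb_cv['test-auc-mean'].max())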
Feature importance with XGBoost
XGBoost provides a way to examine the importance of each feature in the original dataset within the model. It involves counting the number of times each feature is split on across all boosting trees in the model. Then we visualize the result as a bar graph, with the features ordered according to how many times they appear.
XGBoost has a plot_importance() function that helps us achieve this task, so we can visualize which features have been given the highest importance scores. Thus XGBoost also gives us a way to do feature selection (a sketch of using the scores for selection follows the plot below).
I will proceed as follows:-
xgb.plot_importance(xgb_clf)
plt.rcParams['figure.figsize'] = [6, 4]
plt.show()
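A minimal sketch of turning those importance scores into feature selection with scikit-learn's SelectFromModel (the 'mean' threshold is an illustrative assumption, not part of the original tutorial):
from sklearn.feature_selection import SelectFromModel
# keep only features whose importance is at least the mean importance across all features
selector = SelectFromModel(xgb_clf, threshold='mean', prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)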
Sidenote:
#Remember if categorical variable is there, input for xgb needs to be in dmatrix format
(ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True. Var1, Var2, Var3, Var4)
Link: https://stackoverflow.com/questions/67080149/xgboost-error-when-categorical-type-is-supplied-dmatrix-parameter-enable-cat
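A minimal sketch of the fix, assuming a reasonably recent XGBoost version and a hypothetical frame with string-valued columns like the ones in the error:
import pandas as pd
import xgboost as xgb
# hypothetical example data; the key steps are the 'category' cast and enable_categorical=True
df_cat = pd.DataFrame({'Var1': ['JP', 'FR'], 'Var4': ['OME', 'I_S'], 'target': [1, 0]})
for col in ['Var1', 'Var4']:
    df_cat[col] = df_cat[col].astype('category')
dtrain = xgb.DMatrix(data=df_cat[['Var1', 'Var4']], label=df_cat['target'], enable_categorical=True)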
Code block For OHE:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# define the input data
df = pd.DataFrame([
{'Var1': 'JP', 'Var2': 2009, 'Var3': 6581, 'Var4': 'OME', 'Var5': 325.787218, 'Ind_Var': 8.013558},
{'Var1': 'FR', 'Var2': 2018, 'Var3': 5783, 'Var4': 'I_S', 'Var5': 11.956326, 'Ind_Var': 4.105559},
{'Var1': 'BE', 'Var2': 2015, 'Var3': 6719, 'Var4': 'OME', 'Var5': 42.888565, 'Ind_Var': 7.830077},
{'Var1': 'DK', 'Var2': 2011, 'Var3': 3506, 'Var4': 'RPP', 'Var5': 70.094146, 'Ind_Var': 83.000000},
{'Var1': 'AT', 'Var2': 2019, 'Var3': 5474, 'Var4': 'NMM', 'Var5': 270.082738, 'Ind_Var': 51.710526}
])
# extract the features and target
X_train, y_train = df.iloc[:3, :-1], df.iloc[:3, -1]
X_test, y_test = df.iloc[3:, :-1], df.iloc[3:, -1]
# one-hot encode the categorical features
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
# train the model
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)
# extract the training set predictions
model.predict(X_train)
# array([7.0887003, 3.7923286, 7.0887003], dtype=float32)
# extract the test set predictions
model.predict(X_test)
# array([7.0887003, 7.0887003], dtype=float32)
#Null imputation strategy for different variables
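A minimal sketch of the options here (the imputer strategies below are illustrative assumptions): XGBoost treats np.nan as missing natively, so explicit imputation is optional, but scikit-learn's SimpleImputer is one way to impute per variable type if preferred.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# hypothetical frame with nulls; median for numeric, most frequent value for categorical (illustrative choices)
df_na = pd.DataFrame({'num_var': [1.0, np.nan, 3.0], 'cat_var': ['a', None, 'a']})
df_na[['num_var']] = SimpleImputer(strategy='median').fit_transform(df_na[['num_var']])
df_na[['cat_var']] = SimpleImputer(strategy='most_frequent').fit_transform(df_na[['cat_var']])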
#Strategy for handling categorical variables:
Strategy 1:
cat_attribs = ['var1', 'var2', 'var3', 'var4']
X_train[cat_attribs] = X_train[cat_attribs].astype('category')
X_test[cat_attribs] = X_test[cat_attribs].astype('category')
model = XGBRegressor(n_estimators=10, max_depth=20, enable_categorical=True, verbosity=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Strategy 2:
# Create a mapping of labels to encoded values from X_train
categories = X_train['var1'].astype('category').cat.categories
training_encoded_mapping = {cat: code for code, cat in enumerate(categories)}
X_train['var1'] = X_train['var1'].map(training_encoded_mapping)
# Apply the same mapping to X_test (unseen labels become NaN)
X_test['var1'] = X_test['var1'].map(training_encoded_mapping)
# Do the same for the other categorical vars as well
With this manual encoding, don't pass enable_categorical=True when initializing the model.
#Models to be used:
Naive Bayes, Decision Trees, Logistic Regression (a classification technique, not regression), Neural Networks, SVM
Unsupervised learning (clustering algorithms): K-means clustering, Mean-shift, DBSCAN. Remember that with K-means we have to specify the number of clusters, e.g. two clusters to only allow for default and non-default behaviour (see the sketch below). This does not have a direct application here, but I need to learn more.
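A minimal sketch of that two-cluster idea with scikit-learn's KMeans (the feature matrix is hypothetical):
import numpy as np
from sklearn.cluster import KMeans
# k=2 to mirror the default vs non-default split; random data stands in for real features
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(cluster_labels))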