Hello,
This page is designed to collate popular resources on xgb for my personal reference. None of it is my work.
Survival Modelling code snippet:
#pip install lifelines
!pip install lifelines
#conda install -c conda-forge lifelines
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
df2 = pd.read_csv(r'FILELINK.csv')
T2 = df2['time']      # durations
S2 = df2['status']    # event observed indicator (lifelines convention: 1 = event, 0 = censored)
print(T2)
kmf2 = KaplanMeierFitter()
kmf2.fit(T2, S2)
print("Survival function:")
print(kmf2.survival_function_)
print("Survival function plot:")
kmf2.plot()
plt.title("Survival Curve: 6-MP")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.savefig("6-MP.pdf")
plt.show()
https://gist.github.com/pb111/cc341409081dffa5e9eaf60d79562a03
I have used the Wholesale customers data set for this project, downloaded from the UCI Machine learning repository. This dataset can be found at the following url:
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#Import dataset
data = 'C:/datasets/Wholesale customers data.csv'
df = pd.read_csv(data)
#Exploring the dataset
df.shape
df.head()
df.info()
df.describe()
#missing value check
df.isnull().sum()
#declaring dependent and independent variables
X = df.drop('Channel', axis=1)
y = df['Channel']
#var checks
X.head()
y.head()
#Label encoding - Channel takes values 1 and 2 in this dataset; map to 0/1 for binary classification
y = y.map({1: 0, 2: 1})
# import XGBoost
import xgboost as xgb
# define data_dmatrix
data_dmatrix = xgb.DMatrix(data=X,label=y)
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
General parameters
These parameters relate to which booster we are using to do boosting; the common choices are a tree or a linear model.
Booster parameters
These depend on which booster we have chosen for boosting.
Learning task parameters
These parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
Command line parameters
In addition, there are command line parameters that relate to the behaviour of the CLI version of XGBoost.
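As a minimal sketch of how these categories fit together in one params dict for the native xgb.train API (the values below are illustrative assumptions, not tuned settings):
import xgboost as xgb
# illustrative parameter values, annotated by category
params = {
    'booster': 'gbtree',             # general parameter: which booster to use
    'max_depth': 4,                  # booster parameter: maximum tree depth
    'eta': 0.1,                      # booster parameter: learning rate (step size shrinkage)
    'objective': 'binary:logistic',  # learning task parameter: loss function
    'eval_metric': 'auc'             # learning task parameter: evaluation metric
}
booster = xgb.train(params, data_dmatrix, num_boost_round=50)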
The most important parameters that we should know about are as follows:-
learning_rate - It gives us the step size shrinkage which is used to prevent overfitting. Its range is [0,1].
max_depth - It determines how deeply each tree is allowed to grow during any boosting round.
subsample - It determines the percentage of samples used per tree. A low value of subsample can lead to underfitting.
colsample_bytree - It determines the percentage of features used per tree. A high value can lead to overfitting.
n_estimators - It is the number of trees we want to build.
objective - It determines the loss function to be used. For example, reg:squarederror (formerly reg:linear) for regression problems, binary:logistic for binary classification returning a probability, and binary:hinge for binary classification returning only a 0/1 decision.
XGBoost also supports regularization parameters that penalize models as they become more complex, keeping them simple. These regularization parameters are as follows:-
gamma - It controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. It is supported only for tree-based learners.
alpha - It gives us the L1 regularization on leaf weights. A large value of it leads to more regularization.
lambda - It gives us the L2 regularization on leaf weights and is smoother than L1 regularization.
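As a minimal sketch, these map onto the gamma, reg_alpha and reg_lambda arguments of the scikit-learn wrapper (the values below are illustrative assumptions):
from xgboost import XGBClassifier
# gamma = minimum loss reduction required to split, reg_alpha = L1 penalty, reg_lambda = L2 penalty on leaf weights
regularized_clf = XGBClassifier(objective='binary:logistic', gamma=1.0, reg_alpha=0.5, reg_lambda=1.0)
regularized_clf.fit(X_train, y_train)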
Though we are using trees as our base learners, we can also use XGBoost’s relatively less popular linear base learners and one other tree learner known as dart. We have to set the booster parameter to either gbtree (default), gblinear or dart.
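A minimal sketch of switching boosters (hypothetical settings; gblinear ignores tree-specific parameters such as max_depth):
from xgboost import XGBClassifier
# gbtree is the default; gblinear and dart are the alternatives
linear_clf = XGBClassifier(booster='gblinear', objective='binary:logistic')
dart_clf = XGBClassifier(booster='dart', max_depth=4, objective='binary:logistic')
linear_clf.fit(X_train, y_train)
dart_clf.fit(X_train, y_train)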
# import XGBClassifier
from xgboost import XGBClassifier
# declare parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 4,
    'alpha': 10,
    'learning_rate': 1.0,
    'n_estimators': 100
}
# instantiate the classifier
xgb_clf = XGBClassifier(**params)
# fit the classifier to the training data
xgb_clf.fit(X_train, y_train)
#output:
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0, max_delta_step=0, max_depth=4, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
# alternatively view the parameters of the xgb trained model
print(xgb_clf)
XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0, max_delta_step=0, max_depth=4, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
# make predictions on test data
y_pred = xgb_clf.predict(X_test)
# check accuracy score
from sklearn.metrics import accuracy_score
print('XGBoost model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
# K-fold Cross Validation
from xgboost import cv
params = {"objective": "binary:logistic", 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
xgb_cv = cv(dtrain=data_dmatrix, params=params, nfold=3, num_boost_round=50, early_stopping_rounds=10, metrics="auc", as_pandas=True, seed=123)
xgb_cv.head()
xgb_cv contains train and test AUC metrics for each boosting round. Let's preview xgb_cv.
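As a small follow-up sketch (assuming the standard column names that xgb.cv produces with as_pandas=True):
# xgb_cv has columns such as 'train-auc-mean', 'train-auc-std', 'test-auc-mean', 'test-auc-std'
print(xgb_cv['test-auc-mean'].max())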
Feature importance with XGBoost
XGBoost provides a way to examine the importance of each feature in the original dataset within the model. It involves counting the number of times each feature is split on across all boosting trees in the model. Then we visualize the result as a bar graph, with the features ordered according to how many times they appear.
XGBoost has a plot_importance() function that helps us achieve this task, so we can visualize which features have been given the highest importance scores. Thus XGBoost also gives us a way to do feature selection (a sketch of using the scores for selection follows the plot below).
I will proceed as follows:-
xgb.plot_importance(xgb_clf)
plt.rcParams['figure.figsize'] = [6, 4]
plt.show()
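A minimal sketch of turning those importance scores into feature selection with scikit-learn's SelectFromModel (the 'mean' threshold is an illustrative assumption, not part of the original tutorial):
from sklearn.feature_selection import SelectFromModel
# keep only features whose importance is at least the mean importance across all features
selector = SelectFromModel(xgb_clf, threshold='mean', prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)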
Sidenote:
#Remember if categorical variable is there, input for xgb needs to be in dmatrix format
(ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True. Var1, Var2, Var3, Var4)
Link: https://stackoverflow.com/questions/67080149/xgboost-error-when-categorical-type-is-supplied-dmatrix-parameter-enable-cat
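A minimal sketch of the fix, assuming a reasonably recent XGBoost version and a hypothetical frame with string-valued columns like the ones in the error:
import pandas as pd
import xgboost as xgb
# hypothetical example data; the key steps are the 'category' cast and enable_categorical=True
df_cat = pd.DataFrame({'Var1': ['JP', 'FR'], 'Var4': ['OME', 'I_S'], 'target': [1, 0]})
for col in ['Var1', 'Var4']:
    df_cat[col] = df_cat[col].astype('category')
dtrain = xgb.DMatrix(data=df_cat[['Var1', 'Var4']], label=df_cat['target'], enable_categorical=True)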
Code block For OHE:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# define the input data
df = pd.DataFrame([
{'Var1': 'JP', 'Var2': 2009, 'Var3': 6581, 'Var4': 'OME', 'Var5': 325.787218, 'Ind_Var': 8.013558},
{'Var1': 'FR', 'Var2': 2018, 'Var3': 5783, 'Var4': 'I_S', 'Var5': 11.956326, 'Ind_Var': 4.105559},
{'Var1': 'BE', 'Var2': 2015, 'Var3': 6719, 'Var4': 'OME', 'Var5': 42.888565, 'Ind_Var': 7.830077},
{'Var1': 'DK', 'Var2': 2011, 'Var3': 3506, 'Var4': 'RPP', 'Var5': 70.094146, 'Ind_Var': 83.000000},
{'Var1': 'AT', 'Var2': 2019, 'Var3': 5474, 'Var4': 'NMM', 'Var5': 270.082738, 'Ind_Var': 51.710526}
])
# extract the features and target
X_train, y_train = df.iloc[:3, :-1], df.iloc[:3, -1]
X_test, y_test = df.iloc[3:, :-1], df.iloc[3:, -1]
# one-hot encode the categorical features
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
# train the model
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)
# extract the training set predictions
model.predict(X_train)
# array([7.0887003, 3.7923286, 7.0887003], dtype=float32)
# extract the test set predictions
model.predict(X_test)
# array([7.0887003, 7.0887003], dtype=float32)
#Null imputation strategy for different variables
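A minimal sketch of the options here (the imputer strategies below are illustrative assumptions): XGBoost treats np.nan as missing natively, so explicit imputation is optional, but scikit-learn's SimpleImputer is one way to impute per variable type if preferred.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# hypothetical frame with nulls; median for numeric, most frequent value for categorical (illustrative choices)
df_na = pd.DataFrame({'num_var': [1.0, np.nan, 3.0], 'cat_var': ['a', None, 'a']})
df_na[['num_var']] = SimpleImputer(strategy='median').fit_transform(df_na[['num_var']])
df_na[['cat_var']] = SimpleImputer(strategy='most_frequent').fit_transform(df_na[['cat_var']])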
#Strategy for handling categorical variables:
Strategy 1:
cat_attribs = ['var1', 'var2', 'var3', 'var4']
X_train[cat_attribs] = X_train[cat_attribs].astype('category')
X_test[cat_attribs] = X_test[cat_attribs].astype('category')
model = XGBRegressor(n_estimators=10, max_depth=20, enable_categorical=True, verbosity=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Strategy 2:
# Create a mapping of labels to encoded values from X_train
categories = X_train['var1'].astype('category').cat.categories
training_encoded_mapping = {cat: code for code, cat in enumerate(categories)}
X_train['var1'] = X_train['var1'].map(training_encoded_mapping)
# Apply the same mapping to X_test (unseen labels become NaN)
X_test['var1'] = X_test['var1'].map(training_encoded_mapping)
# Do the same for the other categorical vars as well
With this manual encoding, don't pass enable_categorical=True when initializing the model.
#Models to be used:
Naive Bayes, Decision Trees, Logistic Regression (a classification technique, not regression), Neural Networks, SVM
Unsupervised learning (clustering algorithms): K-means clustering, Mean-shift, DBSCAN. Remember that with K-means we have to specify the number of clusters, e.g. two clusters to only allow for default and non-default behaviour (see the sketch below). This does not have a direct application here, but I need to learn more.
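A minimal sketch of that two-cluster idea with scikit-learn's KMeans (the feature matrix is hypothetical):
import numpy as np
from sklearn.cluster import KMeans
# k=2 to mirror the default vs non-default split; random data stands in for real features
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(cluster_labels))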