Classification Analysis based on User Characteristics

11/26/2023 Portfolio

Source Data: http://youngyoon.me/archives/30

Objectives

Classification analysis of each character's month-(N+1) combat power category, based on the following monthly data:
power_difference: Monthly Change in Combat Power by Character
low_dungeon: Monthly Plays in Low-level Dungeons
high_dungeon: Monthly Plays in High-level Dungeons
quest_dungeon: Monthly Plays in Quest Dungeons
current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
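
For reference, a minimal sketch of how a month-(N+1) label can be attached to month-N features. The frame and column names here (df_month_n, df_month_n1, character_id) are illustrative assumptions, not the actual preprocessing pipeline, which is documented in the source post linked above.

# Illustrative only: join month N features with the month N+1 power level as the label.
# df_month_n, df_month_n1, and character_id are assumed names for this sketch.
def build_training_frame(df_month_n, df_month_n1):
    label = (
        df_month_n1[["character_id", "current_power_level"]]
        .rename(columns={"current_power_level": "next_power_level"})
    )
    return df_month_n.merge(label, on="character_id", how="inner")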

# Import libraries
import os
import itertools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from time import time
from pycaret.classification import *
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Set the plot font
plt.rc('font', family='AppleGothic') # AppleGothic is a macOS font; on Windows use e.g. 'Malgun Gothic'
print(plt.rcParams['font.family'])
# Function for confusion matrix visualization
def plot_confusion_matrix(cm, model=None, target_names=None, cmap=None, normalize=True, labels=True, title='Confusion matrix'):
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    
    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names)
        plt.yticks(tick_marks, target_names)
    
    if labels:
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            if normalize:
                plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                         horizontalalignment="center",
                         color="white" if cm[i, j] > thresh else "black")
            else:
                plt.text(j, i, "{:,}".format(cm[i, j]),
                         horizontalalignment="center",
                         color="white" if cm[i, j] > thresh else "black")

    NAME_FIG = "./" + (model or "model") + "_CM.png"  # fall back to a generic file name if model is None
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.grid(False)
    plt.savefig(NAME_FIG, dpi = 300, bbox_inches = 'tight')
    plt.show()
# Data Load
df2021_character = pd.read_csv("./preprocessed_data/df2021_character.csv")
# The distribution of target column data
df2021_character['next_power_level'].value_counts().sort_index()

1    1799
2    3693
3    3267
4    3528
5    2449
6     531
7     105
Name: next_power_level, dtype: int64
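
The classes are clearly imbalanced: levels 6 and 7 together account for only about 4% of the 15,372 characters. A quick way to quantify this (the imbalance itself is handled later via fix_imbalance=True in the PyCaret setup):

# Share of each next_power_level class, to quantify the imbalance above
class_share = df2021_character['next_power_level'].value_counts(normalize=True).sort_index()
print(class_share.round(3))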

features = df2021_character.drop(["next_power_level"], axis = 1)
label = df2021_character["next_power_level"]

features_train, features_test, label_train, label_test = train_test_split(features, label, test_size = 0.2, random_state = 2023, shuffle = True, stratify = label) # Data Split
features_train.head()
       power_difference  low_dungeon  high_dungeon  quest_dungeon  current_power_level
5674           -60674.4          0.0           0.0            0.0                    3
5717                0.0          0.0           0.0            0.0                    2
13611           15082.8          4.0           3.0            0.0                    5
10581            1061.6          0.0           0.0            0.0                    3
6433          1170249.6         40.0           0.0            0.0                    7
# Verify dimensions of split data
print("Training Features Dimension:", features_train.shape)
print("Training Labels Dimension:", label_train.shape)
print("Test Features Dimension:", features_test.shape)
print("Test Labels Dimension:", label_test.shape)

Training Features Dimension: (12297, 5)
Training Labels Dimension: (12297,)
Test Features Dimension: (3075, 5)
Test Labels Dimension: (3075,)
# PyCaret setup
df_train = pd.concat([features_train, label_train], axis = 1)

cell_start_time = time()
clf = setup(data = df_train, target = "next_power_level", train_size = 0.8, fix_imbalance = True)

cell_end_time = time()

print("cell execution time:", cell_end_time - cell_start_time)

cell execution time: 0.8062605857849121
    Description                  Value
0   Session id                   6097
1   Target                       next_power_level
2   Target type                  Multiclass
3   Target mapping               1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6
4   Original data shape          (12297, 6)
5   Transformed data shape       (19001, 6)
6   Transformed train set shape  (16541, 6)
7   Transformed test set shape   (2460, 6)
8   Numeric features             5
9   Preprocess                   True
10  Imputation type              simple
11  Numeric imputation           mean
12  Categorical imputation       mode
13  Fix imbalance                True
14  Fix imbalance method         SMOTE
15  Fold Generator               StratifiedKFold
16  Fold Number                  10
17  CPU Jobs                     -1
18  Use GPU                      False
19  Log Experiment               False
20  Experiment Name              clf-default-name
21  USI                          234f
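
Note that the transformed train set grows from roughly 9,837 rows (80% of 12,297) to 16,541 rows because fix_imbalance=True oversamples the minority classes, with SMOTE as the default method. A rough standalone equivalent is sketched below, assuming imbalanced-learn is installed; PyCaret applies this internally to the training folds, and its exact resampling settings may differ, so the resulting counts need not match the table above.

# Sketch of the oversampling step behind fix_imbalance=True (not PyCaret's internal object)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=2023)  # random_state chosen here only for reproducibility of the sketch
features_res, label_res = smote.fit_resample(features_train, label_train)
print(label_res.value_counts().sort_index())  # minority classes oversampled toward balance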
models()
ID        Name                             Reference                                            Turbo
lr        Logistic Regression              sklearn.linear_model._logistic.LogisticRegression   True
knn       K Neighbors Classifier           sklearn.neighbors._classification.KNeighborsCl…     True
nb        Naive Bayes                      sklearn.naive_bayes.GaussianNB                       True
dt        Decision Tree Classifier         sklearn.tree._classes.DecisionTreeClassifier         True
svm       SVM – Linear Kernel              sklearn.linear_model._stochastic_gradient.SGDC…      True
rbfsvm    SVM – Radial Kernel              sklearn.svm._classes.SVC                             False
gpc       Gaussian Process Classifier      sklearn.gaussian_process._gpc.GaussianProcessC…      False
mlp       MLP Classifier                   sklearn.neural_network._multilayer_perceptron…       False
ridge     Ridge Classifier                 sklearn.linear_model._ridge.RidgeClassifier          True
rf        Random Forest Classifier         sklearn.ensemble._forest.RandomForestClassifier      True
qda       Quadratic Discriminant Analysis  sklearn.discriminant_analysis.QuadraticDiscrim…      True
ada       Ada Boost Classifier             sklearn.ensemble._weight_boosting.AdaBoostClas…      True
gbc       Gradient Boosting Classifier     sklearn.ensemble._gb.GradientBoostingClassifier      True
lda       Linear Discriminant Analysis     sklearn.discriminant_analysis.LinearDiscrimina…      True
et        Extra Trees Classifier           sklearn.ensemble._forest.ExtraTreesClassifier        True
lightgbm  Light Gradient Boosting Machine  lightgbm.sklearn.LGBMClassifier                      True
dummy     Dummy Classifier                 sklearn.dummy.DummyClassifier                        True
# Top 3 models
# Classification Metrics (Accuracy, Precision, Recall, F1-score)
top3_models = compare_models(fold = 10, round = 3, sort = 'F1', n_select = 3)
          Model                            Accuracy    AUC  Recall  Prec.     F1  Kappa    MCC  TT (Sec)
gbc       Gradient Boosting Classifier        0.914  0.992   0.914  0.915  0.914  0.893  0.893     1.687
lightgbm  Light Gradient Boosting Machine     0.909  0.992   0.909  0.909  0.909  0.887  0.887     0.408
rf        Random Forest Classifier            0.900  0.988   0.900  0.901  0.900  0.876  0.876     0.203
et        Extra Trees Classifier              0.900  0.985   0.900  0.901  0.900  0.876  0.876     0.139
dt        Decision Tree Classifier            0.885  0.929   0.885  0.886  0.885  0.857  0.857     0.021
lda       Linear Discriminant Analysis        0.872  0.986   0.872  0.874  0.872  0.841  0.841     0.019
knn       K Neighbors Classifier              0.496  0.784   0.496  0.526  0.503  0.388  0.390     0.371
ada       Ada Boost Classifier                0.450  0.656   0.450  0.284  0.332  0.344  0.388     0.125
ridge     Ridge Classifier                    0.310  0.000   0.310  0.314  0.263  0.216  0.274     0.016
nb        Naive Bayes                         0.265  0.703   0.265  0.348  0.243  0.132  0.156     0.016
svm       SVM – Linear Kernel                 0.092  0.000   0.092  0.074  0.063 -0.026 -0.033     0.083
qda       Quadratic Discriminant Analysis     0.129  0.000   0.129  0.018  0.031  0.000  0.000     0.017
dummy     Dummy Classifier                    0.117  0.500   0.117  0.014  0.025  0.000  0.000     0.015
lr        Logistic Regression                 0.025  0.571   0.025  0.130  0.019 -0.033 -0.045     0.607

The Gradient Boosting Classifier and Random Forest models were selected.

– Both GBC and Random Forest rank near the top across Accuracy, Precision, Recall, and F1-score.
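
As a side note, compare_models(n_select=3) returns the fitted estimators as a Python list, so the chosen models could also be taken from top3_models directly; here they are instead re-created with create_model below.

# top3_models is a list of the three best fitted estimators
for m in top3_models:
    print(type(m).__name__)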

Gradient Boosting Classifier Model

# Model Definition
gbc_model = create_model('gbc', fold = 10)
gbc_model
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9146  0.9909  0.9146  0.9157  0.9145  0.8938  0.8940
1       0.9207  0.9933  0.9207  0.9210  0.9207  0.9015  0.9016
2       0.8974  0.9924  0.8974  0.8991  0.8980  0.8724  0.8724
3       0.9217  0.9894  0.9217  0.9217  0.9216  0.9027  0.9027
4       0.9024  0.9909  0.9024  0.9040  0.9023  0.8786  0.8791
5       0.9146  0.9915  0.9146  0.9148  0.9145  0.8939  0.8939
6       0.9126  0.9923  0.9126  0.9137  0.9129  0.8914  0.8915
7       0.9135  0.9922  0.9135  0.9164  0.9140  0.8924  0.8928
8       0.9054  0.9898  0.9054  0.9060  0.9054  0.8824  0.8825
9       0.9237  0.9946  0.9237  0.9241  0.9236  0.9050  0.9051
Mean    0.9127  0.9917  0.9127  0.9137  0.9128  0.8914  0.8915
Std     0.0082  0.0015  0.0082  0.0078  0.0081  0.0102  0.0101
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='log_loss', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_samples_leaf=1,
                           min_samples_split=2, min_weight_fraction_leaf=0.0,
                           n_estimators=100, n_iter_no_change=None,
                           random_state=6097, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0, warm_start=False)
# Model Fine Tuning
gbc_model = tune_model(gbc_model, fold = 3, optimize = 'F1', choose_better = True)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9131  0.9911  0.9131  0.9136  0.9131  0.8919  0.8919
1       0.9128  0.9909  0.9128  0.9134  0.9129  0.8915  0.8916
2       0.9116  0.9900  0.9116  0.9116  0.9115  0.8900  0.8901
Mean    0.9125  0.9907  0.9125  0.9128  0.9125  0.8911  0.8912
Std     0.0007  0.0005  0.0007  0.0009  0.0007  0.0008  0.0008
Fitting 3 folds for each of 10 candidates, totalling 30 fits
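
The "10 candidates, 30 fits" message reflects PyCaret's default random search over its internal parameter grid. A rough scikit-learn analogue of the same idea is sketched below; the parameter ranges and the weighted-F1 scoring are illustrative assumptions, not PyCaret's actual search space.

# Sketch of a comparable random search (not PyCaret's internal tuner)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 300],   # assumed ranges for illustration
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=6097),
    param_distributions=param_distributions,
    n_iter=10,                 # 10 candidates
    cv=3,                      # 3 folds, as in tune_model(fold=3)
    scoring="f1_weighted",     # assuming weighted F1 for the multiclass target
    random_state=6097,
)
# search.fit(features_train, label_train)  # would mirror the tuning step above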

Random Forest Model

# Model Definition
rf_model = create_model('rf', fold = 10)
rf_model
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9035  0.9848  0.9035  0.9037  0.9033  0.8798  0.8799
1       0.9055  0.9906  0.9055  0.9055  0.9054  0.8825  0.8826
2       0.8780  0.9865  0.8780  0.8789  0.8782  0.8484  0.8485
3       0.8913  0.9863  0.8913  0.8908  0.8908  0.8646  0.8647
4       0.8963  0.9847  0.8963  0.8968  0.8961  0.8709  0.8712
5       0.9146  0.9915  0.9146  0.9148  0.9147  0.8939  0.8939
6       0.9096  0.9874  0.9096  0.9105  0.9097  0.8877  0.8877
7       0.9166  0.9884  0.9166  0.9170  0.9167  0.8962  0.8963
8       0.8911  0.9897  0.8911  0.8915  0.8912  0.8648  0.8648
9       0.9084  0.9901  0.9084  0.9084  0.9080  0.8859  0.8860
Mean    0.9015  0.9880  0.9015  0.9018  0.9014  0.8775  0.8775
Std     0.0115  0.0023  0.0115  0.0115  0.0115  0.0143  0.0143
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=6097, verbose=0, warm_start=False)
# Model Fine Tuning
rf_model = tune_model(rf_model, fold = 3, optimize = 'F1', choose_better = True)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9109  0.9909  0.9109  0.9111  0.9109  0.8892  0.8892
1       0.9164  0.9923  0.9164  0.9165  0.9163  0.8960  0.8961
2       0.9143  0.9899  0.9143  0.9143  0.9143  0.8934  0.8935
Mean    0.9139  0.9910  0.9139  0.9140  0.9138  0.8929  0.8929
Std     0.0023  0.0010  0.0023  0.0022  0.0022  0.0028  0.0028
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Ensemble

# Ensemble
tuned_models = [gbc_model, rf_model]

# Voting ensemble of the tuned models
blend_model = blend_models(estimator_list = tuned_models)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9187  0.9909  0.9187  0.9197  0.9186  0.8988  0.8990
1       0.9207  0.9942  0.9207  0.9209  0.9207  0.9015  0.9015
2       0.8994  0.9923  0.8994  0.9008  0.8997  0.8748  0.8749
3       0.9228  0.9926  0.9228  0.9225  0.9226  0.9039  0.9039
4       0.9116  0.9927  0.9116  0.9122  0.9114  0.8900  0.8902
5       0.9228  0.9934  0.9228  0.9230  0.9227  0.9039  0.9040
6       0.9157  0.9922  0.9157  0.9168  0.9158  0.8951  0.8952
7       0.9196  0.9931  0.9196  0.9204  0.9197  0.9000  0.9001
8       0.9064  0.9915  0.9064  0.9068  0.9064  0.8837  0.8837
9       0.9247  0.9949  0.9247  0.9249  0.9246  0.9063  0.9063
Mean    0.9162  0.9928  0.9162  0.9168  0.9162  0.8958  0.8959
Std     0.0077  0.0011  0.0077  0.0074  0.0076  0.0096  0.0096
# Finalize Model Selection
final_gbc_model = finalize_model(gbc_model)

# Input test data into the final model to verify model predictions
prediction = predict_model(final_gbc_model, data = features_test)
cf = confusion_matrix(label_test, prediction["prediction_label"])

plot_confusion_matrix(cf, model = "GBC", target_names = ["1", "2", "3", "4", "5", "6", "7"])

Confusion Matrix

The predicted labels largely match the true labels, especially at the lower combat power levels, which is where most of the overall accuracy comes from.
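
To put numbers behind that observation, per-class precision and recall can be computed from the same test-set predictions (a quick check, not part of the original notebook):

from sklearn.metrics import classification_report

print(classification_report(label_test, prediction["prediction_label"], digits=3))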

plot_model(estimator = final_gbc_model, plot = 'feature')

Feature Importance Plot

– The most influential variable for predicting next month's power level is the current month's power level. This suggests that users with low combat power tend to churn quickly, while users with high combat power find it hard to raise their power further, so month-to-month changes in power level are small, and that stability is what makes the prediction look 'good'.
– However, a prediction that is 'good' largely because the target barely moves has limited value on its own. Follow-up work will therefore segment combat power into finer-grained categories and refine the analysis at that level.
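
The same ranking can also be read directly from the tuned estimator, which is handy outside PyCaret's plotting helper (a quick sketch; gbc_model is a scikit-learn GradientBoostingClassifier, so it exposes feature_importances_):

# Feature importances straight from the fitted gradient boosting model
importances = pd.Series(gbc_model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))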
