Classification Analysis based on User Characteristics

11/26/2023 Portfolio

Source Data: http://youngyoon.me/archives/30

Objectives

Classification analysis of each character's month-(N+1) combat power category, based on the following monthly data:
power_difference: Monthly Change in Combat Power by Character
low_dungeon: Monthly Plays in Low-level Dungeons
high_dungeon: Monthly Plays in High-level Dungeons
quest_dungeon: Monthly Plays in Quest Dungeons
current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
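
For reference, a minimal sketch of how a month-(N+1) label can be attached to month-N features. The frame and column names here (df_month_n, df_month_n1, character_id) are illustrative assumptions, not the actual preprocessing pipeline, which is documented in the source post linked above.

# Illustrative only: join month N features with the month N+1 power level as the label.
# df_month_n, df_month_n1, and character_id are assumed names for this sketch.
def build_training_frame(df_month_n, df_month_n1):
    label = (
        df_month_n1[["character_id", "current_power_level"]]
        .rename(columns={"current_power_level": "next_power_level"})
    )
    return df_month_n.merge(label, on="character_id", how="inner")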

# Import libraries
import os
import itertools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from time import time
from pycaret.classification import *
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Set the plot font
plt.rc('font', family='AppleGothic') # AppleGothic is a macOS font; on Windows use e.g. 'Malgun Gothic'
print(plt.rcParams['font.family'])
# Function for confusion matrix visualization
def plot_confusion_matrix(cm, model=None, target_names=None, cmap=None, normalize=True, labels=True, title='Confusion matrix'):
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    
    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names)
        plt.yticks(tick_marks, target_names)
    
    if labels:
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            if normalize:
                plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                         horizontalalignment="center",
                         color="white" if cm[i, j] > thresh else "black")
            else:
                plt.text(j, i, "{:,}".format(cm[i, j]),
                         horizontalalignment="center",
                         color="white" if cm[i, j] > thresh else "black")

    NAME_FIG = "./" + (model or "model") + "_CM.png"  # fall back to a generic file name if model is None
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.grid(False)
    plt.savefig(NAME_FIG, dpi = 300, bbox_inches = 'tight')
    plt.show()
# Data Load
df2021_character = pd.read_csv("./preprocessed_data/df2021_character.csv")
# The distribution of target column data
df2021_character['next_power_level'].value_counts().sort_index()

1    1799
2    3693
3    3267
4    3528
5    2449
6     531
7     105
Name: next_power_level, dtype: int64
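
The classes are clearly imbalanced: levels 6 and 7 together account for only about 4% of the 15,372 characters. A quick way to quantify this (the imbalance itself is handled later via fix_imbalance=True in the PyCaret setup):

# Share of each next_power_level class, to quantify the imbalance above
class_share = df2021_character['next_power_level'].value_counts(normalize=True).sort_index()
print(class_share.round(3))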

features = df2021_character.drop(["next_power_level"], axis = 1)
label = df2021_character["next_power_level"]

features_train, features_test, label_train, label_test = train_test_split(features, label, test_size = 0.2, random_state = 2023, shuffle = True, stratify = label) # Data Split
features_train.head()
       power_difference  low_dungeon  high_dungeon  quest_dungeon  current_power_level
5674           -60674.4          0.0           0.0            0.0                    3
5717                0.0          0.0           0.0            0.0                    2
13611           15082.8          4.0           3.0            0.0                    5
10581            1061.6          0.0           0.0            0.0                    3
6433          1170249.6         40.0           0.0            0.0                    7
# Verify dimensions of split data
print("Training Features Dimension:", features_train.shape)
print("Training Labels Dimension:", label_train.shape)
print("Test Features Dimension:", features_test.shape)
print("Test Labels Dimension:", label_test.shape)

Training Features Dimension: (12297, 5)
Training Labels Dimension: (12297,)
Test Features Dimension: (3075, 5)
Test Labels Dimension: (3075,)
# PyCaret setup
df_train = pd.concat([features_train, label_train], axis = 1)

cell_start_time = time()
clf = setup(data = df_train, target = "next_power_level", train_size = 0.8, fix_imbalance = True)

cell_end_time = time()

print("cell execution time:", cell_end_time - cell_start_time)

cell execution time: 0.8062605857849121
    Description                  Value
0   Session id                   6097
1   Target                       next_power_level
2   Target type                  Multiclass
3   Target mapping               1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6
4   Original data shape          (12297, 6)
5   Transformed data shape       (19001, 6)
6   Transformed train set shape  (16541, 6)
7   Transformed test set shape   (2460, 6)
8   Numeric features             5
9   Preprocess                   True
10  Imputation type              simple
11  Numeric imputation           mean
12  Categorical imputation       mode
13  Fix imbalance                True
14  Fix imbalance method         SMOTE
15  Fold Generator               StratifiedKFold
16  Fold Number                  10
17  CPU Jobs                     -1
18  Use GPU                      False
19  Log Experiment               False
20  Experiment Name              clf-default-name
21  USI                          234f
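
Note that the transformed train set grows from roughly 9,837 rows (80% of 12,297) to 16,541 rows because fix_imbalance=True oversamples the minority classes, with SMOTE as the default method. A rough standalone equivalent is sketched below, assuming imbalanced-learn is installed; PyCaret applies this internally to the training folds, and its exact resampling settings may differ, so the resulting counts need not match the table above.

# Sketch of the oversampling step behind fix_imbalance=True (not PyCaret's internal object)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=2023)  # random_state chosen here only for reproducibility of the sketch
features_res, label_res = smote.fit_resample(features_train, label_train)
print(label_res.value_counts().sort_index())  # minority classes oversampled toward balance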
models()
ID        Name                             Reference                                            Turbo
lr        Logistic Regression              sklearn.linear_model._logistic.LogisticRegression   True
knn       K Neighbors Classifier           sklearn.neighbors._classification.KNeighborsCl…     True
nb        Naive Bayes                      sklearn.naive_bayes.GaussianNB                       True
dt        Decision Tree Classifier         sklearn.tree._classes.DecisionTreeClassifier         True
svm       SVM – Linear Kernel              sklearn.linear_model._stochastic_gradient.SGDC…      True
rbfsvm    SVM – Radial Kernel              sklearn.svm._classes.SVC                             False
gpc       Gaussian Process Classifier      sklearn.gaussian_process._gpc.GaussianProcessC…      False
mlp       MLP Classifier                   sklearn.neural_network._multilayer_perceptron…       False
ridge     Ridge Classifier                 sklearn.linear_model._ridge.RidgeClassifier          True
rf        Random Forest Classifier         sklearn.ensemble._forest.RandomForestClassifier      True
qda       Quadratic Discriminant Analysis  sklearn.discriminant_analysis.QuadraticDiscrim…      True
ada       Ada Boost Classifier             sklearn.ensemble._weight_boosting.AdaBoostClas…      True
gbc       Gradient Boosting Classifier     sklearn.ensemble._gb.GradientBoostingClassifier      True
lda       Linear Discriminant Analysis     sklearn.discriminant_analysis.LinearDiscrimina…      True
et        Extra Trees Classifier           sklearn.ensemble._forest.ExtraTreesClassifier        True
lightgbm  Light Gradient Boosting Machine  lightgbm.sklearn.LGBMClassifier                      True
dummy     Dummy Classifier                 sklearn.dummy.DummyClassifier                        True
# Top 3 models
# Classification Metrics (Accuracy, Precision, Recall, F1-score)
top3_models = compare_models(fold = 10, round = 3, sort = 'F1', n_select = 3)
          Model                            Accuracy    AUC  Recall  Prec.     F1  Kappa    MCC  TT (Sec)
gbc       Gradient Boosting Classifier        0.914  0.992   0.914  0.915  0.914  0.893  0.893     1.687
lightgbm  Light Gradient Boosting Machine     0.909  0.992   0.909  0.909  0.909  0.887  0.887     0.408
rf        Random Forest Classifier            0.900  0.988   0.900  0.901  0.900  0.876  0.876     0.203
et        Extra Trees Classifier              0.900  0.985   0.900  0.901  0.900  0.876  0.876     0.139
dt        Decision Tree Classifier            0.885  0.929   0.885  0.886  0.885  0.857  0.857     0.021
lda       Linear Discriminant Analysis        0.872  0.986   0.872  0.874  0.872  0.841  0.841     0.019
knn       K Neighbors Classifier              0.496  0.784   0.496  0.526  0.503  0.388  0.390     0.371
ada       Ada Boost Classifier                0.450  0.656   0.450  0.284  0.332  0.344  0.388     0.125
ridge     Ridge Classifier                    0.310  0.000   0.310  0.314  0.263  0.216  0.274     0.016
nb        Naive Bayes                         0.265  0.703   0.265  0.348  0.243  0.132  0.156     0.016
svm       SVM – Linear Kernel                 0.092  0.000   0.092  0.074  0.063 -0.026 -0.033     0.083
qda       Quadratic Discriminant Analysis     0.129  0.000   0.129  0.018  0.031  0.000  0.000     0.017
dummy     Dummy Classifier                    0.117  0.500   0.117  0.014  0.025  0.000  0.000     0.015
lr        Logistic Regression                 0.025  0.571   0.025  0.130  0.019 -0.033 -0.045     0.607

The Gradient Boosting Classifier and Random Forest models were selected.

– Both GBC and Random Forest rank near the top across Accuracy, Precision, Recall, and F1-score.
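
As a side note, compare_models(n_select=3) returns the fitted estimators as a Python list, so the chosen models could also be taken from top3_models directly; here they are instead re-created with create_model below.

# top3_models is a list of the three best fitted estimators
for m in top3_models:
    print(type(m).__name__)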

Gradient Boosting Classifier Model

# Model Definition
gbc_model = create_model('gbc', fold = 10)
gbc_model
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9146  0.9909  0.9146  0.9157  0.9145  0.8938  0.8940
1       0.9207  0.9933  0.9207  0.9210  0.9207  0.9015  0.9016
2       0.8974  0.9924  0.8974  0.8991  0.8980  0.8724  0.8724
3       0.9217  0.9894  0.9217  0.9217  0.9216  0.9027  0.9027
4       0.9024  0.9909  0.9024  0.9040  0.9023  0.8786  0.8791
5       0.9146  0.9915  0.9146  0.9148  0.9145  0.8939  0.8939
6       0.9126  0.9923  0.9126  0.9137  0.9129  0.8914  0.8915
7       0.9135  0.9922  0.9135  0.9164  0.9140  0.8924  0.8928
8       0.9054  0.9898  0.9054  0.9060  0.9054  0.8824  0.8825
9       0.9237  0.9946  0.9237  0.9241  0.9236  0.9050  0.9051
Mean    0.9127  0.9917  0.9127  0.9137  0.9128  0.8914  0.8915
Std     0.0082  0.0015  0.0082  0.0078  0.0081  0.0102  0.0101
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='log_loss', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_samples_leaf=1,
                           min_samples_split=2, min_weight_fraction_leaf=0.0,
                           n_estimators=100, n_iter_no_change=None,
                           random_state=6097, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0, warm_start=False)
# Model Fine Tuning
gbc_model = tune_model(gbc_model, fold = 3, optimize = 'F1', choose_better = True)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9131  0.9911  0.9131  0.9136  0.9131  0.8919  0.8919
1       0.9128  0.9909  0.9128  0.9134  0.9129  0.8915  0.8916
2       0.9116  0.9900  0.9116  0.9116  0.9115  0.8900  0.8901
Mean    0.9125  0.9907  0.9125  0.9128  0.9125  0.8911  0.8912
Std     0.0007  0.0005  0.0007  0.0009  0.0007  0.0008  0.0008
Fitting 3 folds for each of 10 candidates, totalling 30 fits
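
The "10 candidates, 30 fits" message reflects PyCaret's default random search over its internal parameter grid. A rough scikit-learn analogue of the same idea is sketched below; the parameter ranges and the weighted-F1 scoring are illustrative assumptions, not PyCaret's actual search space.

# Sketch of a comparable random search (not PyCaret's internal tuner)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 300],   # assumed ranges for illustration
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=6097),
    param_distributions=param_distributions,
    n_iter=10,                 # 10 candidates
    cv=3,                      # 3 folds, as in tune_model(fold=3)
    scoring="f1_weighted",     # assuming weighted F1 for the multiclass target
    random_state=6097,
)
# search.fit(features_train, label_train)  # would mirror the tuning step above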

Random Forest Model

# Model Definition
rf_model = create_model('rf', fold = 10)
rf_model
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9035  0.9848  0.9035  0.9037  0.9033  0.8798  0.8799
1       0.9055  0.9906  0.9055  0.9055  0.9054  0.8825  0.8826
2       0.8780  0.9865  0.8780  0.8789  0.8782  0.8484  0.8485
3       0.8913  0.9863  0.8913  0.8908  0.8908  0.8646  0.8647
4       0.8963  0.9847  0.8963  0.8968  0.8961  0.8709  0.8712
5       0.9146  0.9915  0.9146  0.9148  0.9147  0.8939  0.8939
6       0.9096  0.9874  0.9096  0.9105  0.9097  0.8877  0.8877
7       0.9166  0.9884  0.9166  0.9170  0.9167  0.8962  0.8963
8       0.8911  0.9897  0.8911  0.8915  0.8912  0.8648  0.8648
9       0.9084  0.9901  0.9084  0.9084  0.9080  0.8859  0.8860
Mean    0.9015  0.9880  0.9015  0.9018  0.9014  0.8775  0.8775
Std     0.0115  0.0023  0.0115  0.0115  0.0115  0.0143  0.0143
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=6097, verbose=0, warm_start=False)
# Model Fine Tuning
rf_model = tune_model(rf_model, fold = 3, optimize = 'F1', choose_better = True)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9109  0.9909  0.9109  0.9111  0.9109  0.8892  0.8892
1       0.9164  0.9923  0.9164  0.9165  0.9163  0.8960  0.8961
2       0.9143  0.9899  0.9143  0.9143  0.9143  0.8934  0.8935
Mean    0.9139  0.9910  0.9139  0.9140  0.9138  0.8929  0.8929
Std     0.0023  0.0010  0.0023  0.0022  0.0022  0.0028  0.0028
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Ensemble

# Ensemble
tuned_models = [gbc_model, rf_model]

# Voting ensemble of the tuned models
blend_model = blend_models(estimator_list = tuned_models)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold
0       0.9187  0.9909  0.9187  0.9197  0.9186  0.8988  0.8990
1       0.9207  0.9942  0.9207  0.9209  0.9207  0.9015  0.9015
2       0.8994  0.9923  0.8994  0.9008  0.8997  0.8748  0.8749
3       0.9228  0.9926  0.9228  0.9225  0.9226  0.9039  0.9039
4       0.9116  0.9927  0.9116  0.9122  0.9114  0.8900  0.8902
5       0.9228  0.9934  0.9228  0.9230  0.9227  0.9039  0.9040
6       0.9157  0.9922  0.9157  0.9168  0.9158  0.8951  0.8952
7       0.9196  0.9931  0.9196  0.9204  0.9197  0.9000  0.9001
8       0.9064  0.9915  0.9064  0.9068  0.9064  0.8837  0.8837
9       0.9247  0.9949  0.9247  0.9249  0.9246  0.9063  0.9063
Mean    0.9162  0.9928  0.9162  0.9168  0.9162  0.8958  0.8959
Std     0.0077  0.0011  0.0077  0.0074  0.0076  0.0096  0.0096
# Finalize Model Selection
final_gbc_model = finalize_model(gbc_model)

# Input test data into the final model to verify model predictions
prediction = predict_model(final_gbc_model, data = features_test)
cf = confusion_matrix(label_test, prediction["prediction_label"])

plot_confusion_matrix(cf, model = "GBC", target_names = ["1", "2", "3", "4", "5", "6", "7"])

Confusion Matrix

The predicted labels largely match the true labels, especially at the lower combat power levels, which is where most of the overall accuracy comes from.
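
To put numbers behind that observation, per-class precision and recall can be computed from the same test-set predictions (a quick check, not part of the original notebook):

from sklearn.metrics import classification_report

print(classification_report(label_test, prediction["prediction_label"], digits=3))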

plot_model(estimator = final_gbc_model, plot = 'feature')

Feature Importance Plot

– The most influential variable for predicting next month's power level is the current month's power level. This suggests that users with low combat power tend to churn quickly, while users with high combat power find it hard to raise their power further, so month-to-month changes in power level are small, and that stability is what makes the prediction look 'good'.
– However, a prediction that is 'good' largely because the target barely moves has limited value on its own. Follow-up work will therefore segment combat power into finer-grained categories and refine the analysis at that level.
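
The same ranking can also be read directly from the tuned estimator, which is handy outside PyCaret's plotting helper (a quick sketch; gbc_model is a scikit-learn GradientBoostingClassifier, so it exposes feature_importances_):

# Feature importances straight from the fitted gradient boosting model
importances = pd.Series(gbc_model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))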
