
Regression Analysis based on User Characteristics

11/26/2023 · Portfolio

Source Data: http://youngyoon.me/archives/30

Objectives

Regression analysis of (N+1) Monthly Login Days (next_attendance), using a dataset with the following fields:

current_attendance: (N) Monthly Login Days
current_paid: (N) Monthly Spending Amount
num_characters: Number of Characters Owned by Each User
power_level: Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
next_paid: (N+1) Monthly Spending Amount

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pycaret.regression import *

# Data Load
df_user = pd.read_csv("./preprocessed_data/df2021_user.csv")

# Delete rows where next month's spending amount is zero
# (.copy() prevents SettingWithCopyWarning on the column assignments below)
df_user_nz = df_user[df_user["next_paid"] != 0].copy()

# Check the top 5 rows
df_user_nz.head()
    current_attendance  current_paid  num_characters  power_level  next_attendance  next_paid
0                   17           0.0               1            3               13    14000.0
1                   13       14000.0               1            4               26     5000.0
4                    1           0.0               1            4               16    12950.0
5                   16       12950.0               1            4               22     8300.0
17                   0           0.0               1            1                9     6050.0
# Convert the spending amounts (current and next month) into categories
def convert_paid(x):
    if (x >= 0) and (x <= 1000):
        x = 1
    elif (x > 1000) and (x <= 5000):
        x = 2
    elif (x > 5000) and (x <= 10000):
        x = 3
    elif (x > 10000) and (x <= 20000):
        x = 4
    elif (x > 20000) and (x <= 30000):
        x = 5
    elif (x > 30000) and (x <= 40000):
        x = 6
    elif (x > 40000) and (x <= 50000):
        x = 7
    elif (x > 50000) and (x <= 100000):
        x = 8
    else:
        x = 9
    
    return x

df_user_nz["current_paid"] = df_user_nz["current_paid"].apply(convert_paid)
df_user_nz["next_paid"] = df_user_nz["next_paid"].apply(convert_paid)
# Data Split
features = df_user_nz.drop(["next_attendance", "next_paid"], axis = 1)
label = df_user_nz["next_attendance"]

features_train, features_test, label_train, label_test = train_test_split(
    features, label, test_size=0.2, random_state=2023, shuffle=True, stratify=label
)
df_train = pd.concat([features_train, label_train], axis = 1)

reg = setup(data = df_train, target = "next_attendance", train_size = 0.8)
    Description                  Value
0   Session id                   327
1   Target                       next_attendance
2   Target type                  Regression
3   Original data shape          (10428, 5)
4   Transformed data shape       (10428, 5)
5   Transformed train set shape  (8342, 5)
6   Transformed test set shape   (2086, 5)
7   Numeric features             4
8   Preprocess                   True
9   Imputation type              simple
10  Numeric imputation           mean
11  Categorical imputation       mode
12  Fold Generator               KFold
13  Fold Number                  10
14  CPU Jobs                     -1
15  Use GPU                      False
16  Log Experiment               False
17  Experiment Name              reg-default-name
18  USI                          99c9
# Top 3 models
top3_models = compare_models(round = 3, sort = 'RMSE', n_select = 3)
          Model                             MAE     MSE       RMSE     R2       RMSLE   MAPE    TT (Sec)
gbr       Gradient Boosting Regressor       4.934   42.447    6.511    0.644    0.645   1.050   0.031
lightgbm  Light Gradient Boosting Machine   4.946   43.369    6.581    0.636    0.643   1.037   0.055
ridge     Ridge Regression                  5.075   43.834    6.616    0.632    0.664   1.104   0.007
lar       Least Angle Regression            5.075   43.834    6.616    0.632    0.664   1.104   0.008
br        Bayesian Ridge                    5.076   43.834    6.616    0.632    0.664   1.106   0.006
lr        Linear Regression                 5.075   43.834    6.616    0.632    0.664   1.104   0.010
en        Elastic Net                       5.175   44.747    6.685    0.624    0.684   1.179   0.007
omp       Orthogonal Matching Pursuit       5.194   45.324    6.728    0.620    0.687   1.183   0.007
llar      Lasso Least Angle Regression      5.215   45.320    6.728    0.620    0.689   1.195   0.006
lasso     Lasso Regression                  5.215   45.320    6.728    0.620    0.689   1.195   0.008
huber     Huber Regressor                   4.819   45.429    6.734    0.619    0.635   0.868   0.011
rf        Random Forest Regressor           5.086   47.867    6.914    0.598    0.658   1.044   0.074
ada       AdaBoost Regressor                5.838   48.393    6.954    0.594    0.763   1.563   0.013
knn       K Neighbors Regressor             5.331   50.724    7.115    0.574    0.695   1.165   0.012
et        Extra Trees Regressor             5.184   51.520    7.174    0.568    0.674   1.047   0.064
dt        Decision Tree Regressor           5.345   56.118    7.488    0.529    0.695   1.056   0.009
dummy     Dummy Regressor                   9.889   119.283   10.921   -0.001   1.015   2.610   0.007
par       Passive Aggressive Regressor      9.160   152.171   11.732   -0.279   0.901   1.184   0.007
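
As a side note, compare_models has already returned the three best models as a list, so they could also be combined into a simple voting ensemble with PyCaret's blend_models. A sketch of that option, not pursued in this post:

# Combine the three selected models into a voting regressor,
# cross-validated the same way as the individual models
blended_model = blend_models(estimator_list=top3_models, optimize="RMSE")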
# Create model
gbr_model = create_model('gbr', cross_validation=True)
Fold   MAE      MSE       RMSE     R2       RMSLE    MAPE
0      5.2563   50.7438   7.1235   0.5797   0.6658   1.0680
1      4.9145   42.3978   6.5114   0.6388   0.6402   1.0123
2      4.9402   42.0022   6.4809   0.6518   0.6469   1.0731
3      4.9531   41.4773   6.4403   0.6468   0.6448   1.0755
4      4.9921   42.9804   6.5559   0.6316   0.6596   1.1073
5      4.8987   43.0348   6.5601   0.6506   0.6325   0.9755
6      4.7654   40.6525   6.3759   0.6485   0.6431   1.0488
7      4.8356   40.6243   6.3737   0.6550   0.6238   0.9840
8      4.8799   41.2870   6.4255   0.6582   0.6234   0.9718
9      4.9090   39.2656   6.2662   0.6763   0.6671   1.1831
Mean   4.9345   42.4466   6.5113   0.6437   0.6447   1.0499
Std    0.1229   2.9769    0.2213   0.0241   0.0149   0.0632
# Hyperparameter tuning
tuned_gbr_model = tune_model(gbr_model, optimize = "RMSE")
Fold   MAE      MSE       RMSE     R2       RMSLE    MAPE
0      5.3634   52.1293   7.2201   0.5682   0.6685   1.0608
1      5.0047   44.1171   6.6421   0.6242   0.6429   1.0126
2      4.9447   43.0644   6.5623   0.6430   0.6482   1.0627
3      4.9914   42.4925   6.5186   0.6381   0.6456   1.0707
4      5.0200   44.2898   6.6551   0.6204   0.6624   1.1081
5      4.8884   43.6986   6.6105   0.6452   0.6402   0.9885
6      4.8093   41.7197   6.4591   0.6393   0.6430   1.0386
7      4.8508   41.2047   6.4191   0.6501   0.6158   0.9590
8      4.9279   42.6966   6.5343   0.6465   0.6282   0.9740
9      4.9045   40.0382   6.3276   0.6699   0.6659   1.1729
Mean   4.9705   43.5451   6.5949   0.6345   0.6461   1.0448
Std    0.1457   3.1267    0.2298   0.0256   0.0157   0.0619
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
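
Since the default random search did not improve on the baseline, one possible follow-up would be to hand tune_model a custom search space via its custom_grid argument. The grid below is purely illustrative: the names are standard GradientBoostingRegressor hyperparameters, and the value ranges are my own assumption.

# Hypothetical custom grid; tune_model samples from it instead of
# PyCaret's default search space
custom_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
tuned_gbr_model = tune_model(gbr_model, optimize="RMSE",
                             custom_grid=custom_grid, n_iter=20)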
# Finalize Model Selection
final_gbr_model = finalize_model(tuned_gbr_model)

prediction = predict_model(final_gbr_model, data = features_test)
# Exact-match accuracy (a crude check, since regression predictions rarely match exactly)
pred_acc = (prediction["prediction_label"] == label_test).sum() / len(label_test)
pred_acc * 100

6.559263521288838
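
For a regression model, standard error metrics on the hold-out set are more telling than exact-match accuracy. A quick sketch computing them with scikit-learn directly:

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hold-out error of the finalized model (lower is better)
mae = mean_absolute_error(label_test, prediction["prediction_label"])
rmse = mean_squared_error(label_test, prediction["prediction_label"]) ** 0.5
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")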

Exact-match accuracy alone is clearly insufficient, so I exported the results to a CSV file, calculated the rate of difference (as an absolute value) for each user, and examined its distribution.

prediction["prediction_label"] = np.round(prediction["prediction_label"], 0)

# Combine the rounded predictions with the actual labels
predict_result = pd.concat([prediction["prediction_label"], label_test], axis = 1)

# Extract into a CSV file to verify accuracy
predict_result.to_csv("./regression_result_gbr.csv", index=False, encoding="utf-8-sig")
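
The rate-of-difference calculation itself was done on the exported CSV, but the equivalent computation in pandas would look roughly like this (rows with zero actual login days are excluded here to avoid division by zero):

# Relative difference between rounded prediction and actual value
nonzero = predict_result[predict_result["next_attendance"] != 0]
diff_rate = (nonzero["prediction_label"] - nonzero["next_attendance"]).abs() \
            / nonzero["next_attendance"]

# Share of users within each difference threshold
for threshold in (0.1, 0.2, 0.5):
    share = (diff_rate <= threshold).mean() * 100
    print(f"Within {threshold:.0%}: {share:.1f}% of users")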

The proportion of predictions within a 10% difference is 20%, within 20% is 36%, and within 50% is 64%. This alone makes it difficult to judge accuracy definitively, but it confirms that a significant share of predictions fell within the expected range for a regression analysis. (It was also observed that the number of users decreased as the error rate increased.)

In this research, the accuracy of the regression analysis appears to be relatively satisfactory.
