Regression Analysis based on User Characteristics
Source Data: http://youngyoon.me/archives/30
Objectives
Perform a regression analysis that predicts (N+1) Monthly Login Days from the following fields:

- current_attendance: (N) Monthly Login Days
- current_paid: (N) Monthly Spending Amount
- num_characters: Number of Characters Owned by Each User
- power_level: Combat Power Category (Higher Numbers Indicate Greater Combat Strength)
- next_paid: (N+1) Monthly Spending Amount
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pycaret.regression import *

# Data Load
df_user = pd.read_csv("./preprocessed_data/df2021_user.csv")
# Drop rows where next month's spending amount is zero (.copy() avoids SettingWithCopyWarning later)
df_user_nz = df_user[df_user["next_paid"] != 0].copy()
# Check the top 5 rows
df_user_nz.head()
```
| | current_attendance | current_paid | num_characters | power_level | next_attendance | next_paid |
|---|---|---|---|---|---|---|
| 0 | 17 | 0.0 | 1 | 3 | 13 | 14000.0 |
| 1 | 13 | 14000.0 | 1 | 4 | 26 | 5000.0 |
| 4 | 1 | 0.0 | 1 | 4 | 16 | 12950.0 |
| 5 | 16 | 12950.0 | 1 | 4 | 22 | 8300.0 |
| 17 | 0 | 0.0 | 1 | 1 | 9 | 6050.0 |
```python
# Convert the spending amounts (current and next month) into ordinal categories
def convert_paid(x):
    if (x >= 0) and (x <= 1000):
        return 1
    elif (x > 1000) and (x <= 5000):
        return 2
    elif (x > 5000) and (x <= 10000):
        return 3
    elif (x > 10000) and (x <= 20000):
        return 4
    elif (x > 20000) and (x <= 30000):
        return 5
    elif (x > 30000) and (x <= 40000):
        return 6
    elif (x > 40000) and (x <= 50000):
        return 7
    elif (x > 50000) and (x <= 100000):
        return 8
    else:
        return 9

df_user_nz["current_paid"] = df_user_nz["current_paid"].apply(convert_paid)
df_user_nz["next_paid"] = df_user_nz["next_paid"].apply(convert_paid)
```
```python
# Data split: hold out 20% as a final test set (setup() will split the rest again internally)
features = df_user_nz.drop(["next_attendance", "next_paid"], axis=1)
label = df_user_nz["next_attendance"]
features_train, features_test, label_train, label_test = train_test_split(
    features, label, test_size=0.2, random_state=2023, shuffle=True, stratify=label
)
df_train = pd.concat([features_train, label_train], axis=1)

# Initialize the PyCaret regression experiment
reg = setup(data=df_train, target="next_attendance", train_size=0.8)
```
| | Description | Value |
|---|---|---|
| 0 | Session id | 327 |
| 1 | Target | next_attendance |
| 2 | Target type | Regression |
| 3 | Original data shape | (10428, 5) |
| 4 | Transformed data shape | (10428, 5) |
| 5 | Transformed train set shape | (8342, 5) |
| 6 | Transformed test set shape | (2086, 5) |
| 7 | Numeric features | 4 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
| 10 | Numeric imputation | mean |
| 11 | Categorical imputation | mode |
| 12 | Fold Generator | KFold |
| 13 | Fold Number | 10 |
| 14 | CPU Jobs | -1 |
| 15 | Use GPU | False |
| 16 | Log Experiment | False |
| 17 | Experiment Name | reg-default-name |
| 18 | USI | 99c9 |
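Note that `setup()` performs its own 80/20 split on `df_train`, so the 2,086-row "test set" above is an internal validation split; `features_test` from the earlier `train_test_split` remains an untouched hold-out. If needed, the internal split can be inspected with `get_config` (a sketch, assuming PyCaret 3.x):

```python
# Sketch (PyCaret 3.x): pull out the internal validation split created by setup()
X_val = get_config("X_test")
y_val = get_config("y_test")
print(X_val.shape, y_val.shape)
```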
```python
# Compare candidate models with 10-fold CV and keep the top 3, sorted by RMSE
top3_models = compare_models(round=3, sort='RMSE', n_select=3)
```
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| gbr | Gradient Boosting Regressor | 4.934 | 42.447 | 6.511 | 0.644 | 0.645 | 1.050 | 0.031 |
| lightgbm | Light Gradient Boosting Machine | 4.946 | 43.369 | 6.581 | 0.636 | 0.643 | 1.037 | 0.055 |
| ridge | Ridge Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.007 |
| lar | Least Angle Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.008 |
| br | Bayesian Ridge | 5.076 | 43.834 | 6.616 | 0.632 | 0.664 | 1.106 | 0.006 |
| lr | Linear Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.010 |
| en | Elastic Net | 5.175 | 44.747 | 6.685 | 0.624 | 0.684 | 1.179 | 0.007 |
| omp | Orthogonal Matching Pursuit | 5.194 | 45.324 | 6.728 | 0.620 | 0.687 | 1.183 | 0.007 |
| llar | Lasso Least Angle Regression | 5.215 | 45.320 | 6.728 | 0.620 | 0.689 | 1.195 | 0.006 |
| lasso | Lasso Regression | 5.215 | 45.320 | 6.728 | 0.620 | 0.689 | 1.195 | 0.008 |
| huber | Huber Regressor | 4.819 | 45.429 | 6.734 | 0.619 | 0.635 | 0.868 | 0.011 |
| rf | Random Forest Regressor | 5.086 | 47.867 | 6.914 | 0.598 | 0.658 | 1.044 | 0.074 |
| ada | AdaBoost Regressor | 5.838 | 48.393 | 6.954 | 0.594 | 0.763 | 1.563 | 0.013 |
| knn | K Neighbors Regressor | 5.331 | 50.724 | 7.115 | 0.574 | 0.695 | 1.165 | 0.012 |
| et | Extra Trees Regressor | 5.184 | 51.520 | 7.174 | 0.568 | 0.674 | 1.047 | 0.064 |
| dt | Decision Tree Regressor | 5.345 | 56.118 | 7.488 | 0.529 | 0.695 | 1.056 | 0.009 |
| dummy | Dummy Regressor | 9.889 | 119.283 | 10.921 | -0.001 | 1.015 | 2.610 | 0.007 |
| par | Passive Aggressive Regressor | 9.160 | 152.171 | 11.732 | -0.279 | 0.901 | 1.184 | 0.007 |
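With `n_select=3`, `compare_models` returns a list of fitted estimators ordered by the sort metric rather than a single model; a quick sanity check of what came back:

```python
# top3_models is a plain Python list, best model first (sorted by RMSE)
for model in top3_models:
    print(type(model).__name__)
```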
```python
# Create the Gradient Boosting Regressor and evaluate it with 10-fold cross-validation
gbr_model = create_model('gbr', cross_validation=True)
```
| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 5.2563 | 50.7438 | 7.1235 | 0.5797 | 0.6658 | 1.0680 |
| 1 | 4.9145 | 42.3978 | 6.5114 | 0.6388 | 0.6402 | 1.0123 |
| 2 | 4.9402 | 42.0022 | 6.4809 | 0.6518 | 0.6469 | 1.0731 |
| 3 | 4.9531 | 41.4773 | 6.4403 | 0.6468 | 0.6448 | 1.0755 |
| 4 | 4.9921 | 42.9804 | 6.5559 | 0.6316 | 0.6596 | 1.1073 |
| 5 | 4.8987 | 43.0348 | 6.5601 | 0.6506 | 0.6325 | 0.9755 |
| 6 | 4.7654 | 40.6525 | 6.3759 | 0.6485 | 0.6431 | 1.0488 |
| 7 | 4.8356 | 40.6243 | 6.3737 | 0.6550 | 0.6238 | 0.9840 |
| 8 | 4.8799 | 41.2870 | 6.4255 | 0.6582 | 0.6234 | 0.9718 |
| 9 | 4.9090 | 39.2656 | 6.2662 | 0.6763 | 0.6671 | 1.1831 |
| Mean | 4.9345 | 42.4466 | 6.5113 | 0.6437 | 0.6447 | 1.0499 |
| Std | 0.1229 | 2.9769 | 0.2213 | 0.0241 | 0.0149 | 0.0632 |
```python
# Hyperparameter tuning (random search over PyCaret's default grid, optimizing RMSE)
tuned_gbr_model = tune_model(gbr_model, optimize="RMSE")
```
| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 5.3634 | 52.1293 | 7.2201 | 0.5682 | 0.6685 | 1.0608 |
| 1 | 5.0047 | 44.1171 | 6.6421 | 0.6242 | 0.6429 | 1.0126 |
| 2 | 4.9447 | 43.0644 | 6.5623 | 0.6430 | 0.6482 | 1.0627 |
| 3 | 4.9914 | 42.4925 | 6.5186 | 0.6381 | 0.6456 | 1.0707 |
| 4 | 5.0200 | 44.2898 | 6.6551 | 0.6204 | 0.6624 | 1.1081 |
| 5 | 4.8884 | 43.6986 | 6.6105 | 0.6452 | 0.6402 | 0.9885 |
| 6 | 4.8093 | 41.7197 | 6.4591 | 0.6393 | 0.6430 | 1.0386 |
| 7 | 4.8508 | 41.2047 | 6.4191 | 0.6501 | 0.6158 | 0.9590 |
| 8 | 4.9279 | 42.6966 | 6.5343 | 0.6465 | 0.6282 | 0.9740 |
| 9 | 4.9045 | 40.0382 | 6.3276 | 0.6699 | 0.6659 | 1.1729 |
| Mean | 4.9705 | 43.5451 | 6.5949 | 0.6345 | 0.6461 | 1.0448 |
| Std | 0.1457 | 3.1267 | 0.2298 | 0.0256 | 0.0157 | 0.0619 |
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
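Tuning with the default budget did not beat the baseline, so `tune_model` handed back the original estimator. One possible follow-up, assuming PyCaret's standard `tune_model` parameters, is simply to give the random search more iterations (this is a sketch, not something run here):

```python
# Sketch: widen the random search; choose_better=True still returns the original
# model if tuning fails to improve the cross-validated RMSE
tuned_gbr_model = tune_model(gbr_model, optimize="RMSE", n_iter=50, choose_better=True)
```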
```python
# Finalize: refit the selected model on the full data passed to setup()
final_gbr_model = finalize_model(tuned_gbr_model)

# Predict next-month login days for the external hold-out set
prediction = predict_model(final_gbr_model, data=features_test)

# Exact-match "accuracy" (not a meaningful metric for a regression target)
pred_acc = (prediction["prediction_label"] == label_test).sum() / len(label_test)
pred_acc * 100
```

6.559263521288838
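Exact matches are rare because the model predicts continuous values. Standard regression metrics on the hold-out set give a more informative picture; a minimal sketch using scikit-learn:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hold-out error of the (unrounded) predictions against the actual next-month login days
mae = mean_absolute_error(label_test, prediction["prediction_label"])
rmse = mean_squared_error(label_test, prediction["prediction_label"]) ** 0.5
print(f"hold-out MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```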
Since exact-match accuracy alone is clearly insufficient, I exported the predictions to a CSV file, calculated the absolute rate of difference for each user, and organized its distribution.
prediction["prediction_label"] = np.round(prediction["prediction_label"], 0)
# Based on the prediction["prediction_label"] column & label_test column
predict_result = pd.concat([prediction["prediction_label"], label_test], axis = 1)
# Extract into a CSV file to verify accuracy
predict_result.to_csv("./regression_result_gbr.csv", index=False, encoding="utf-8-sig")
The proportion of differences within 10% is 20%, within 20% is 36%, and within 50% is 64%. It is difficult to judge accuracy from this alone, but it confirms that a significant share of the regression predictions fell within the expected range. (It was also observed that the number of users decreases as the error rate increases.)
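For reference, this rate-of-difference breakdown could also be computed directly in pandas rather than in a spreadsheet. A sketch, assuming the rounded predictions above and skipping users whose actual value is zero to avoid division by zero:

```python
# Sketch: share of users whose rounded prediction falls within 10% / 20% / 50%
# of their actual next-month login days (rows with an actual value of 0 are excluded)
result = predict_result[predict_result["next_attendance"] != 0].copy()
rel_error = (result["prediction_label"] - result["next_attendance"]).abs() / result["next_attendance"]
for threshold in (0.10, 0.20, 0.50):
    print(f"within {threshold:.0%}: {(rel_error <= threshold).mean() * 100:.1f}% of users")
```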
In this research, the accuracy of the regression analysis appears to be relatively satisfactory.