Regression Analysis based on User Characteristics
Source Data: http://youngyoon.me/archives/30
Objectives
Perform a regression analysis that predicts (N+1) Monthly Login Days from the following fields:

- current_attendance: (N) Monthly Login Days
- current_paid: (N) Monthly Spending Amount
- num_characters: Number of Characters Owned by Each User
- power_level: Combat Power Category (Higher Numbers Indicate Greater Combat Strength)
- next_paid: (N+1) Monthly Spending Amount
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pycaret.regression import *

# Data Load
df_user = pd.read_csv("./preprocessed_data/df2021_user.csv")
# Drop rows where next month's spending amount is zero (.copy() avoids SettingWithCopyWarning later)
df_user_nz = df_user[df_user["next_paid"] != 0].copy()
# Check the top 5 rows
df_user_nz.head()
```
| | current_attendance | current_paid | num_characters | power_level | next_attendance | next_paid |
|---|---|---|---|---|---|---|
| 0 | 17 | 0.0 | 1 | 3 | 13 | 14000.0 |
| 1 | 13 | 14000.0 | 1 | 4 | 26 | 5000.0 |
| 4 | 1 | 0.0 | 1 | 4 | 16 | 12950.0 |
| 5 | 16 | 12950.0 | 1 | 4 | 22 | 8300.0 |
| 17 | 0 | 0.0 | 1 | 1 | 9 | 6050.0 |
```python
# Convert the spending amounts (current and next month) into ordinal categories
def convert_paid(x):
    if (x >= 0) and (x <= 1000):
        return 1
    elif (x > 1000) and (x <= 5000):
        return 2
    elif (x > 5000) and (x <= 10000):
        return 3
    elif (x > 10000) and (x <= 20000):
        return 4
    elif (x > 20000) and (x <= 30000):
        return 5
    elif (x > 30000) and (x <= 40000):
        return 6
    elif (x > 40000) and (x <= 50000):
        return 7
    elif (x > 50000) and (x <= 100000):
        return 8
    else:
        return 9

df_user_nz["current_paid"] = df_user_nz["current_paid"].apply(convert_paid)
df_user_nz["next_paid"] = df_user_nz["next_paid"].apply(convert_paid)
```
```python
# Data split: hold out 20% as a final test set (setup() will split the rest again internally)
features = df_user_nz.drop(["next_attendance", "next_paid"], axis=1)
label = df_user_nz["next_attendance"]
features_train, features_test, label_train, label_test = train_test_split(
    features, label, test_size=0.2, random_state=2023, shuffle=True, stratify=label
)
df_train = pd.concat([features_train, label_train], axis=1)

# Initialize the PyCaret regression experiment
reg = setup(data=df_train, target="next_attendance", train_size=0.8)
```
| | Description | Value |
|---|---|---|
| 0 | Session id | 327 |
| 1 | Target | next_attendance |
| 2 | Target type | Regression |
| 3 | Original data shape | (10428, 5) |
| 4 | Transformed data shape | (10428, 5) |
| 5 | Transformed train set shape | (8342, 5) |
| 6 | Transformed test set shape | (2086, 5) |
| 7 | Numeric features | 4 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
| 10 | Numeric imputation | mean |
| 11 | Categorical imputation | mode |
| 12 | Fold Generator | KFold |
| 13 | Fold Number | 10 |
| 14 | CPU Jobs | -1 |
| 15 | Use GPU | False |
| 16 | Log Experiment | False |
| 17 | Experiment Name | reg-default-name |
| 18 | USI | 99c9 |
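Note that `setup()` performs its own 80/20 split on `df_train`, so the 2,086-row "test set" above is an internal validation split; `features_test` from the earlier `train_test_split` remains an untouched hold-out. If needed, the internal split can be inspected with `get_config` (a sketch, assuming PyCaret 3.x):

```python
# Sketch (PyCaret 3.x): pull out the internal validation split created by setup()
X_val = get_config("X_test")
y_val = get_config("y_test")
print(X_val.shape, y_val.shape)
```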
```python
# Compare candidate models with 10-fold CV and keep the top 3, sorted by RMSE
top3_models = compare_models(round=3, sort='RMSE', n_select=3)
```
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| gbr | Gradient Boosting Regressor | 4.934 | 42.447 | 6.511 | 0.644 | 0.645 | 1.050 | 0.031 |
| lightgbm | Light Gradient Boosting Machine | 4.946 | 43.369 | 6.581 | 0.636 | 0.643 | 1.037 | 0.055 |
| ridge | Ridge Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.007 |
| lar | Least Angle Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.008 |
| br | Bayesian Ridge | 5.076 | 43.834 | 6.616 | 0.632 | 0.664 | 1.106 | 0.006 |
| lr | Linear Regression | 5.075 | 43.834 | 6.616 | 0.632 | 0.664 | 1.104 | 0.010 |
| en | Elastic Net | 5.175 | 44.747 | 6.685 | 0.624 | 0.684 | 1.179 | 0.007 |
| omp | Orthogonal Matching Pursuit | 5.194 | 45.324 | 6.728 | 0.620 | 0.687 | 1.183 | 0.007 |
| llar | Lasso Least Angle Regression | 5.215 | 45.320 | 6.728 | 0.620 | 0.689 | 1.195 | 0.006 |
| lasso | Lasso Regression | 5.215 | 45.320 | 6.728 | 0.620 | 0.689 | 1.195 | 0.008 |
| huber | Huber Regressor | 4.819 | 45.429 | 6.734 | 0.619 | 0.635 | 0.868 | 0.011 |
| rf | Random Forest Regressor | 5.086 | 47.867 | 6.914 | 0.598 | 0.658 | 1.044 | 0.074 |
| ada | AdaBoost Regressor | 5.838 | 48.393 | 6.954 | 0.594 | 0.763 | 1.563 | 0.013 |
| knn | K Neighbors Regressor | 5.331 | 50.724 | 7.115 | 0.574 | 0.695 | 1.165 | 0.012 |
| et | Extra Trees Regressor | 5.184 | 51.520 | 7.174 | 0.568 | 0.674 | 1.047 | 0.064 |
| dt | Decision Tree Regressor | 5.345 | 56.118 | 7.488 | 0.529 | 0.695 | 1.056 | 0.009 |
| dummy | Dummy Regressor | 9.889 | 119.283 | 10.921 | -0.001 | 1.015 | 2.610 | 0.007 |
| par | Passive Aggressive Regressor | 9.160 | 152.171 | 11.732 | -0.279 | 0.901 | 1.184 | 0.007 |
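With `n_select=3`, `compare_models` returns a list of fitted estimators ordered by the sort metric rather than a single model; a quick sanity check of what came back:

```python
# top3_models is a plain Python list, best model first (sorted by RMSE)
for model in top3_models:
    print(type(model).__name__)
```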
```python
# Create the Gradient Boosting Regressor and evaluate it with 10-fold cross-validation
gbr_model = create_model('gbr', cross_validation=True)
```
| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 5.2563 | 50.7438 | 7.1235 | 0.5797 | 0.6658 | 1.0680 |
| 1 | 4.9145 | 42.3978 | 6.5114 | 0.6388 | 0.6402 | 1.0123 |
| 2 | 4.9402 | 42.0022 | 6.4809 | 0.6518 | 0.6469 | 1.0731 |
| 3 | 4.9531 | 41.4773 | 6.4403 | 0.6468 | 0.6448 | 1.0755 |
| 4 | 4.9921 | 42.9804 | 6.5559 | 0.6316 | 0.6596 | 1.1073 |
| 5 | 4.8987 | 43.0348 | 6.5601 | 0.6506 | 0.6325 | 0.9755 |
| 6 | 4.7654 | 40.6525 | 6.3759 | 0.6485 | 0.6431 | 1.0488 |
| 7 | 4.8356 | 40.6243 | 6.3737 | 0.6550 | 0.6238 | 0.9840 |
| 8 | 4.8799 | 41.2870 | 6.4255 | 0.6582 | 0.6234 | 0.9718 |
| 9 | 4.9090 | 39.2656 | 6.2662 | 0.6763 | 0.6671 | 1.1831 |
| Mean | 4.9345 | 42.4466 | 6.5113 | 0.6437 | 0.6447 | 1.0499 |
| Std | 0.1229 | 2.9769 | 0.2213 | 0.0241 | 0.0149 | 0.0632 |
```python
# Hyperparameter tuning (random search over PyCaret's default grid, optimizing RMSE)
tuned_gbr_model = tune_model(gbr_model, optimize="RMSE")
```
| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 5.3634 | 52.1293 | 7.2201 | 0.5682 | 0.6685 | 1.0608 |
| 1 | 5.0047 | 44.1171 | 6.6421 | 0.6242 | 0.6429 | 1.0126 |
| 2 | 4.9447 | 43.0644 | 6.5623 | 0.6430 | 0.6482 | 1.0627 |
| 3 | 4.9914 | 42.4925 | 6.5186 | 0.6381 | 0.6456 | 1.0707 |
| 4 | 5.0200 | 44.2898 | 6.6551 | 0.6204 | 0.6624 | 1.1081 |
| 5 | 4.8884 | 43.6986 | 6.6105 | 0.6452 | 0.6402 | 0.9885 |
| 6 | 4.8093 | 41.7197 | 6.4591 | 0.6393 | 0.6430 | 1.0386 |
| 7 | 4.8508 | 41.2047 | 6.4191 | 0.6501 | 0.6158 | 0.9590 |
| 8 | 4.9279 | 42.6966 | 6.5343 | 0.6465 | 0.6282 | 0.9740 |
| 9 | 4.9045 | 40.0382 | 6.3276 | 0.6699 | 0.6659 | 1.1729 |
| Mean | 4.9705 | 43.5451 | 6.5949 | 0.6345 | 0.6461 | 1.0448 |
| Std | 0.1457 | 3.1267 | 0.2298 | 0.0256 | 0.0157 | 0.0619 |
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
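Tuning with the default budget did not beat the baseline, so `tune_model` handed back the original estimator. One possible follow-up, assuming PyCaret's standard `tune_model` parameters, is simply to give the random search more iterations (this is a sketch, not something run here):

```python
# Sketch: widen the random search; choose_better=True still returns the original
# model if tuning fails to improve the cross-validated RMSE
tuned_gbr_model = tune_model(gbr_model, optimize="RMSE", n_iter=50, choose_better=True)
```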
```python
# Finalize: refit the selected model on the full data passed to setup()
final_gbr_model = finalize_model(tuned_gbr_model)

# Predict next-month login days for the external hold-out set
prediction = predict_model(final_gbr_model, data=features_test)

# Exact-match "accuracy" (not a meaningful metric for a regression target)
pred_acc = (prediction["prediction_label"] == label_test).sum() / len(label_test)
pred_acc * 100
```

6.559263521288838
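Exact matches are rare because the model predicts continuous values. Standard regression metrics on the hold-out set give a more informative picture; a minimal sketch using scikit-learn:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hold-out error of the (unrounded) predictions against the actual next-month login days
mae = mean_absolute_error(label_test, prediction["prediction_label"])
rmse = mean_squared_error(label_test, prediction["prediction_label"]) ** 0.5
print(f"hold-out MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```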
Since exact-match accuracy alone is clearly insufficient, I exported the predictions to a CSV file, calculated the absolute rate of difference for each user, and organized its distribution.
prediction["prediction_label"] = np.round(prediction["prediction_label"], 0)
# Based on the prediction["prediction_label"] column & label_test column
predict_result = pd.concat([prediction["prediction_label"], label_test], axis = 1)
# Extract into a CSV file to verify accuracy
predict_result.to_csv("./regression_result_gbr.csv", index=False, encoding="utf-8-sig")
The proportion of differences within 10% is 20%, within 20% is 36%, and within 50% is 64%. It is difficult to judge accuracy from this alone, but it confirms that a significant share of the regression predictions fell within the expected range. (It was also observed that the number of users decreases as the error rate increases.)
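For reference, this rate-of-difference breakdown could also be computed directly in pandas rather than in a spreadsheet. A sketch, assuming the rounded predictions above and skipping users whose actual value is zero to avoid division by zero:

```python
# Sketch: share of users whose rounded prediction falls within 10% / 20% / 50%
# of their actual next-month login days (rows with an actual value of 0 are excluded)
result = predict_result[predict_result["next_attendance"] != 0].copy()
rel_error = (result["prediction_label"] - result["next_attendance"]).abs() / result["next_attendance"]
for threshold in (0.10, 0.20, 0.50):
    print(f"within {threshold:.0%}: {(rel_error <= threshold).mean() * 100:.1f}% of users")
```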
In this research, the accuracy of the regression analysis appears to be relatively satisfactory.