Blog Post

Data Preprocessing and Visualization

11/26/2023 Portfolio by Young Yoon

Data Preprocessing

I have conducted a data preprocessing process to prepare complex real-world data for graphing.

Initial Raw Data

	USER ID	Character ID	cb_a2101	cb_a2102	cb_a2103	cb_a2104	cb_a2105	cb_a2106	cb_a2107	cb_a2108	…	m11_dun1	m11_dun2	m11_dun3	m11_dun4	m11_dunq	m12_dun1	m12_dun2	m12_dun3	m12_dun4	m12_dunq
0	1080	1394	827373.6	1221864.0	1557040.8	1573940.4	1573940.4	1656014.4	1644128.4	1694022.0	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1293	96260438	0.0	0.0	0.0	0.0	0.0	0.0	404038.8	0.0	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1514	3797	0.0	0.0	0.0	0.0	0.0	0.0	401023.2	0.0	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1823	5256	0.0	0.0	NaN	0.0	0.0	0.0	0.0	0.0	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2729	118675385	0.0	2592.0	3637.2	0.0	0.0	0.0	0.0	0.0	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

*5 rows × 98 columns
*cb_a0000: Monthly Max Combat Power / Jan, 2021 to Dec, 2021
*a0000: Monthly Login Days / Jan, 2021 to Dec, 2021
*paid00: Monthly Spending Amount
*m00_dun1 / dun2 / dun3 / dun4 / dun_q: Monthly Plays per Dungeon Difficulty (1~4 is difficulty, q is for quests)
*All data has been sampled and sanitized.

Data Prepocessing (1)

Goal for Column Organization (1)

character_ID
power_difference: Monthly Change in Combat Power by Character
low_dungeon: Monthly Plays in Low-level Dungeons
high_dungeon: Monthly Plays in High-level Dungeons
quest_dungeon: Monthly Plays in Quest Dungeons
current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
next_power_level: (N+1) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)

import pandas as pd

# Data Load
df2021 = pd.read_excel("./data/2021.xlsx")
# Verify Data Dimensions
df2021.shape
# Check the top 5 rows
df2021.head()
# Check for Missing Values
df2021.isna().sum()
# Replace Missing Values
df2021.fillna(0, inplace = True)

# Process Unique CharacterID
df2021["CharacterID"] = range(len(df2021["CharacterID"]))
# Extract Columns from DataFrame
# df2021_power = df2021[["cb_a2101", "cb_a2102", "cb_a2103", "cb_a2104", "cb_a2105", "cb_a2106", "cb_a2107", "cb_a2108", "cb_a2109", "cb_a2110", "cb_a2111", "cb_a2111"]]
df2021_power = df2021.iloc[:, 2:14]

def convert_power(x):
    if x <= 100000:
        x = 1
    elif (x > 100000) and (x <= 500000):
        x = 2
    elif (x > 500000) and (x <= 1000000):
        x = 3
    elif (x > 1000000) and (x <= 2500000):
        x = 4
    elif (x > 2500000) and (x <= 5000000):
        x = 5
    elif (x > 5000000) and (x <= 7500000):
        x = 6
    elif (x > 7500000):
        x = 7
    else:
        print("power error")
    
    return x

df2021["m01_dg_low"] = df2021["m01_dun1"] + df2021["m01_dun2"]
df2021["m01_dg_high"] = df2021["m01_dun3"] + df2021["m01_dun4"]

df2021["m02_dg_low"] = df2021["m02_dun1"] + df2021["m02_dun2"]
df2021["m02_dg_high"] = df2021["m02_dun3"] + df2021["m02_dun4"]

df2021["m03_dg_low"] = df2021["m03_dun1"] + df2021["m03_dun2"]
df2021["m03_dg_high"] = df2021["m03_dun3"] + df2021["m03_dun4"]

df2021["m04_dg_low"] = df2021["m04_dun1"] + df2021["m04_dun2"]
df2021["m04_dg_high"] = df2021["m04_dun3"] + df2021["m04_dun4"]

df2021["m05_dg_low"] = df2021["m05_dun1"] + df2021["m05_dun2"]
df2021["m05_dg_high"] = df2021["m05_dun3"] + df2021["m05_dun4"]

df2021["m06_dg_low"] = df2021["m06_dun1"] + df2021["m06_dun2"]
df2021["m06_dg_high"] = df2021["m06_dun3"] + df2021["m06_dun4"]

df2021["m07_dg_low"] = df2021["m07_dun1"] + df2021["m07_dun2"]
df2021["m07_dg_high"] = df2021["m07_dun3"] + df2021["m07_dun4"]

df2021["m08_dg_low"] = df2021["m08_dun1"] + df2021["m08_dun2"]
df2021["m08_dg_high"] = df2021["m08_dun3"] + df2021["m08_dun4"]

df2021["m09_dg_low"] = df2021["m09_dun1"] + df2021["m09_dun2"]
df2021["m09_dg_high"] = df2021["m09_dun3"] + df2021["m09_dun4"]

df2021["m10_dg_low"] = df2021["m10_dun1"] + df2021["m10_dun2"]
df2021["m10_dg_high"] = df2021["m10_dun3"] + df2021["m10_dun4"]

df2021["m11_dg_low"] = df2021["m11_dun1"] + df2021["m11_dun2"]
df2021["m11_dg_high"] = df2021["m11_dun3"] + df2021["m11_dun4"]

df2021["m12_dg_low"] = df2021["m12_dun1"] + df2021["m12_dun2"]
df2021["m12_dg_high"] = df2021["m12_dun3"] + df2021["m12_dun4"]

# Create DataFrame for Each Character
df_character = pd.DataFrame()

for idx_character in range(df2021.shape[0]):
    for idx_power in range(3, 14):
        if (df2021.iloc[idx_character, idx_power] != 0) and (df2021.iloc[idx_character, idx_power - 1] != 0):
            power_diff = df2021.iloc[idx_character, idx_power] - df2021.iloc[idx_character, idx_power - 1]
            
            # 1,2
            if idx_power == 3:
                dg_low = df2021.iloc[idx_character, 98]
                dg_high = df2021.iloc[idx_character, 99]
                dg_quest = df2021.iloc[idx_character, 42]
                
            # 2,3
            elif idx_power == 4:
                dg_low = df2021.iloc[idx_character, 100]
                dg_high = df2021.iloc[idx_character, 101]
                dg_quest = df2021.iloc[idx_character, 47]
                
            # 3,4
            elif idx_power == 5:
                dg_low = df2021.iloc[idx_character, 102]
                dg_high = df2021.iloc[idx_character, 103]
                dg_quest = df2021.iloc[idx_character, 52]
                
            # 4,5
            elif idx_power == 6:
                dg_low = df2021.iloc[idx_character, 104]
                dg_high = df2021.iloc[idx_character, 105]
                dg_quest = df2021.iloc[idx_character, 57]
                
            # 5,6
            elif idx_power == 7:
                dg_low = df2021.iloc[idx_character, 106]
                dg_high = df2021.iloc[idx_character, 107]
                dg_quest = df2021.iloc[idx_character, 62]
                
            # 6,7
            elif idx_power == 8:
                dg_low = df2021.iloc[idx_character, 108]
                dg_high = df2021.iloc[idx_character, 109]
                dg_quest = df2021.iloc[idx_character, 67]
                
            # 7,8
            elif idx_power == 9:
                dg_low = df2021.iloc[idx_character, 110]
                dg_high = df2021.iloc[idx_character, 111]
                dg_quest = df2021.iloc[idx_character, 72]
                
            # 8,9
            elif idx_power == 10:
                dg_low = df2021.iloc[idx_character, 112]
                dg_high = df2021.iloc[idx_character, 113]
                dg_quest = df2021.iloc[idx_character, 77]
            # 9,10
            elif idx_power == 11:
                dg_low = df2021.iloc[idx_character, 114]
                dg_high = df2021.iloc[idx_character, 115]
                dg_quest = df2021.iloc[idx_character, 82]
                
            # 10,11
            elif idx_power == 12:
                dg_low = df2021.iloc[idx_character, 116]
                dg_high = df2021.iloc[idx_character, 117]
                dg_quest = df2021.iloc[idx_character, 87]
                
            # 11,12
            elif idx_power == 13:
                dg_low = df2021.iloc[idx_character, 118]
                dg_high = df2021.iloc[idx_character, 119]
                dg_quest = df2021.iloc[idx_character, 92]
            
            current_power_level = convert_power(df2021.iloc[idx_character, idx_power - 1])
            next_power_level = convert_power(df2021.iloc[idx_character, idx_power])
                
        else:
            continue
        
        tmp_df = pd.DataFrame({"power_difference": power_diff, "low_dungeon": dg_low, "high_dungeon": dg_high,
                               "quest_dungeon": dg_quest, "current_power_level":current_power_level, "next_power_level":next_power_level}, index = [0])
        df_character = pd.concat([df_character, tmp_df], axis = 0)

# Organize DataFrame Index
df_character = df_character.reset_index(drop=True)
# Check the top 5 rows
df_character.head(5)
# current power level distribution
df_character["current_power_level"].value_counts().sort_index()

next_power_level
1    1799
2    3693
3    3267
4    3528
5    2449
6     531
7     105
Name: count, dtype: int64

Final Data (1)

Character ID_changed	power_difference	low_dungeon	high_dungeon	current_power_level	next_power_level
0	394490.4	10.0	7.0	3	4
1	335176.8	10.0	0.0	4	4
2	16899.6	96.0	12.0	4	4
3	0.0	0.0	0.0	4	4
4	82074.0	0.0	0.0	4	4

*5 rows × 6 columns
*power_difference: Monthly Change in Combat Power by Character
*low_dungeon: Monthly Plays in Low-level Dungeons
*high_dungeon: Monthly Plays in High-level Dungeons
*quest_dungeon: Monthly Plays in Quest Dungeons
*current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
*next_power_level: (N+1) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)]

Data Prepocessing (2)

Goal for Column Organization (2)

User_ID
current_attendance: (N) Monthly Login Days
current_paid: (N) Monthly Spending Amount
num_characters: Number of Characters Owned by Each User
power_level: Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
next_attendance: (N+1) Monthly Login Days
next_paid: (N+1) Monthly Spending Amount

# Remove duplicate rows based on UserID
df2021_unique = df2021.drop_duplicates(subset='UserID', keep='first', ignore_index=True)
# Verify Data Dimensions
df2021_unique.shape

# Create a Column for the Number of Characters per User
df2021_unique = pd.merge(df2021_unique, df2021['UserID'].value_counts().reset_index(), how='left', on='UserID')

# Create a Column for Average Combat Power per User
tmp_df2021 = df2021.iloc[:, :14].drop('CharacterID', axis = 1)

tmp_df2021 = tmp_df2021.groupby("UserID").mean().reset_index()
tmp_df2021.columns = ["UserID", "avg_power_01", "avg_power_02", "avg_power_03", "avg_power_04", "avg_power_05", "avg_power_06",
                      "avg_power_07", "avg_power_08", "avg_power_09", "avg_power_10", "avg_power_11", "avg_power_12"]

df2021_unique = pd.merge(df2021_unique, tmp_df2021, how = 'left', on = 'UserID')

# Create DataFrame for Each Character
df_user = pd.DataFrame()

for idx_user in range(df2021_unique.shape[0]):
    for idx_power in range(123, 135):
        # Jan
        if idx_power == 123:
            attendance = df2021_unique.iloc[idx_user, 14]                            # 이번달 출석일
            paid = df2021_unique.iloc[idx_user, 26]                                  # 이번달 과금액
            nCharacter = df2021_unique.iloc[idx_user, 122]                           # 부캐 개수
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])     # 부캐 평균 전투력 등급
            next_attendance = df2021_unique.iloc[idx_user, 15]                       # 다음날 출석일
            next_paid = df2021_unique.iloc[idx_user, 27]                             # 다음달 과금액
            
        # Feb
        elif idx_power == 124:
            attendance = df2021_unique.iloc[idx_user, 15]
            paid = df2021_unique.iloc[idx_user, 27]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 16]
            next_paid = df2021_unique.iloc[idx_user, 28]
            
        # Mar
        elif idx_power == 125:
            attendance = df2021_unique.iloc[idx_user, 16]
            paid = df2021_unique.iloc[idx_user, 28]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 17]
            next_paid = df2021_unique.iloc[idx_user, 29]
            
        # Apr
        elif idx_power == 126:
            attendance = df2021_unique.iloc[idx_user, 17]
            paid = df2021_unique.iloc[idx_user, 29]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 18]
            next_paid = df2021_unique.iloc[idx_user, 30]
            
        # May
        elif idx_power == 127:
            attendance = df2021_unique.iloc[idx_user, 18]
            paid = df2021_unique.iloc[idx_user, 30]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 19]
            next_paid = df2021_unique.iloc[idx_user, 31]
            
        # Jun
        elif idx_power == 128:
            attendance = df2021_unique.iloc[idx_user, 19]
            paid = df2021_unique.iloc[idx_user, 31]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 20]
            next_paid = df2021_unique.iloc[idx_user, 32]
            
        # Jul
        elif idx_power == 129:
            attendance = df2021_unique.iloc[idx_user, 20]
            paid = df2021_unique.iloc[idx_user, 32]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 21]
            next_paid = df2021_unique.iloc[idx_user, 33]
            
        # Aug
        elif idx_power == 130:
            attendance = df2021_unique.iloc[idx_user, 21]
            paid = df2021_unique.iloc[idx_user, 33]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 22]
            next_paid = df2021_unique.iloc[idx_user, 34]
            
        # Sep
        elif idx_power == 131:
            attendance = df2021_unique.iloc[idx_user, 22]
            paid = df2021_unique.iloc[idx_user, 34]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 23]
            next_paid = df2021_unique.iloc[idx_user, 35]
            
        # Oct
        elif idx_power == 132:
            attendance = df2021_unique.iloc[idx_user, 23]
            paid = df2021_unique.iloc[idx_user, 35]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 24]
            next_paid = df2021_unique.iloc[idx_user, 36]
            
        # Nov
        elif idx_power == 133:
            attendance = df2021_unique.iloc[idx_user, 24]
            paid = df2021_unique.iloc[idx_user, 36]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 25]
            next_paid = df2021_unique.iloc[idx_user, 37]
        
        tmp_df = pd.DataFrame({"current_attendance": attendance, "current_paid": paid, "num_characters": nCharacter,
                               "power_level": power_level, "next_attendance": next_attendance, "next_paid": next_paid}, index = [0])
        df_user = pd.concat([df_user, tmp_df], axis = 0)

# Organize DataFrame Index
df_user = df_user.reset_index(drop=True)
# Check the top 5 rows
df_user.head()

# next attendance distribution
df_user["next_attendance"].value_counts().sort_index()

next_attendance
0     138459
1      13782
2       5796
3       3369
4       2506
5       1783
6       1366
7       1086
8        946
9        805
10       723
11       663
12       580
13       477
14       435
15       439
16       396
17       395
18       387
19       337
20       326
21       339
22       347
23       337
24       304
25       346
26       362
27       418
28       613
29       522
30       959
31      1105
Name: count, dtype: int64

Final Data (2)

UserID_changed	current_attendance	current_paid	num_characters	power_level	next_attendance	next_paid
0	17	0.0	1	3	13	14000.0
1	13	14000.0	1	4	26	5000.0
2	26	5000.0	1	4	8	0.0
3	8	0.0	1	4	1	0.0
4	1	0.0	1	4	16	12950.0

*User_ID
*current_attendance: (N) Monthly Login Days
*current_paid: (N) Monthly Spending Amount
*num_characters: Number of Characters Owned by Each User
*power_level: Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
*next_attendance: (N+1) Monthly Login Days
*next_paid: (N+1) Monthly Spending Amount

Visualization

Visualize Correlation by Feature
Histogram of Feature Distributions
Scatter Plot of Data Features

*Extracted 2022 Data to Compare with the 2021 Data

Final Data (1)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Setting for seaborn figure
rc = {
    "axes.facecolor": "#F8F8F8",
    "figure.facecolor": "#F8F8F8",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7" + "30",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}

sns.set(rc=rc)
palette = ['#302c36', '#037d97', '#E4591E', '#C09741',
           '#EC5B6D', '#90A6B1', '#6ca957', '#D8E3E2']

def plot_correlation_heatmap(df: pd.core.frame.DataFrame, title_name: str='Train correlation') -> None:
    """Draws the correlation heatmap plot.
    
    Args:
        df: train or test dataframes
        title_name: 'Train' or 'Test' (default 'Train correlation')
        
    Returns:
        subplots of size (len(col_list), 2)
    """

    corr = df.corr()
    fig, axes = plt.subplots(figsize=(15, 6))
    mask = np.zeros_like(corr)
    mask[np.triu_indices_from(mask)] = True # Display only the bottom part
    sns.heatmap(corr, mask=mask, linewidths=.5,  cmap=palette[5:][::-2] + palette[1:3], annot=True, fmt=".2f", annot_kws={'size':8})
    plt.title(title_name)
    plt.show()

# Data Load (2021)
df_character = pd.read_csv("./preprocessed_data/df2021_character.csv")
# Check the top 5 rows
df_character.head()
# Visualize Correlation by Feature
plot_correlation_heatmap(df_character)

# Data Load (2022)
df_character2 = pd.read_csv("./preprocessed_data/df2022_character.csv")
# Visualize Correlation by Feature
plot_correlation_heatmap(df_character2)

Interpretation of correlations between various variables

– This month’s Combat Power is highly correlated with next month’s Combat Power. Considering the situation, it appears that users with low Combat Power often leave while remaining low, and those with high Combat Power tend to reach a level where it becomes difficult to further increase their power, resulting in no further changes. Therefore, although there is a high correlation between the two, it is difficult to derive any meaningful interpretation from it.
– In 2021, compared to 2022, there was a high correlation between users playing dungeons of higher levels and quest dungeon play/this month’s Combat Power/next month’s Combat Power. Additionally, users who played higher-level dungeons in 2021 also appear to have played many lower-level dungeons. -> From this, we can infer that in 2021, core users with higher Combat Power than in 2022 were generally more active in gameplay.
– Overall, it seems that in 2022, all gameplay indicators (compared to 2021) did not increase in conjunction with other gameplay metrics. In other words, while in 2021, core gamers engaged in various forms of play in the game, in 2022, it appears they neither increased their Combat Power nor played dungeons as much, staying strong only in Combat Power.
– However, one deviation from this trend is the increased correlation between Combat Power and quest dungeons in 2022. This might be because quest dungeons are essential for character development, suggesting that players, losing interest in core gameplay, may have started focusing on developing their secondary characters.

# Scatter Plot Between Features (2021)
a2021 = sns.pairplot(df_character, diag_kind = 'kde')
a2021.fig.suptitle("2021")
plt.show()

# Scatter Plot Between Features (2022)
a2021 = sns.pairplot(df_character, diag_kind = 'kde')
a2021.fig.suptitle("2021")
plt.show()

Interpretation of Scatter Plot

– As previously analyzed, the 2022 data shows a distribution that is generally lower compared to 2021.
– Compared to 2021, the Power Difference in 2022 tends to be higher, which indicates that the decline in gameplay metrics suggests that users are focusing more on increasing their skills through spending rather than through gameplay.

# Histogram Between Features (2021)
sns.histplot(df_character, x = "low_dungeon", kde = True).set(title='2021')
plt.show()

# Histogram Between Features (2022)
sns.histplot(df_character2, x = "low_dungeon", kde = True).set(title='2022')
plt.show()

Difficult to interpret

It’s difficult to interpret the distribution because there are many users who didn’t play at all in both 2021 and 2022.

Final Data (2)

Used the Same Code and Only Documented the Results.

Interpretation of correlations between various variables

– Combat Power Level showed a strong correlation with this month’s attendance and this month’s attendance also had a high correlation with next month’s attendance. Additionally, there was a significant correlation between users who made payments this month and those who made payments next month.
– Overall, Combat Power level, attendance days, and payment amount showed relatively strong correlations. This is a natural result since users who are more engaged in the game are likely to have high values in all three metrics.
– A notable point is that, contrary to the expectation that raising many secondary characters indicates a high level of affection for the game, this was not the case. The number of secondary characters had almost no correlation with all the variables. In this game, secondary characters are supposed to assist the main character, and it was quite surprising that they were unrelated to other metrics. As a result, we aim to understand that users primarily focus on one character and implement various campaigns to foster attachment to a single character.
– Unlike the gameplay metrics we previously examined, the 2022 metrics did not show a significant overall decrease compared to 2021. This could be seen as a positive trend, but there is a high level of concern that users may be entering the game without much focus, or engaging in habitual spending behaviors due to a lack of meaningful activities within the game.

Interpretation of Scatter Plot

The distribution of the 2021 data and the 2022 data is nearly identical.

Interpretation of Histogram

Most users have low attendance rates.

Removed zeros from the 2021 data as the figures appeared unclear

# Remove Rows with 0 Logins for the Next Month
df_user_nz = df_user[df_user['next_attendance'] != 0]

Interpretation of correlations between various variables

There was not a significant change compared to the data before removing zeros.