Loading
Keunyoung Yoon

Data Engineer

Data Scientist

Data Analyst

Keunyoung Yoon

Data Engineer

Data Scientist

Data Analyst

Blog Post

Data Preprocessing and Visualization

11/26/2023 Portfolio

I have conducted a data preprocessing process to prepare complex real-world data for graphing.

Initial Raw Data

USER IDCharacter IDcb_a2101cb_a2102cb_a2103cb_a2104cb_a2105cb_a2106cb_a2107cb_a2108m11_dun1m11_dun2m11_dun3m11_dun4m11_dunqm12_dun1m12_dun2m12_dun3m12_dun4m12_dunq
010801394827373.61221864.01557040.81573940.41573940.41656014.41644128.41694022.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
11293962604380.00.00.00.00.00.0404038.80.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2151437970.00.00.00.00.00.0401023.20.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3182352560.00.0NaN0.00.00.00.00.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
427291186753850.02592.03637.20.00.00.00.00.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
*5 rows × 98 columns
*cb_a0000: Monthly Max Combat Power / Jan, 2021 to Dec, 2021
*a0000: Monthly Login Days / Jan, 2021 to Dec, 2021
*paid00: Monthly Spending Amount
*m00_dun1 / dun2 / dun3 / dun4 / dun_q: Monthly Plays per Dungeon Difficulty (1~4 is difficulty, q is for quests)
*All data has been sampled and sanitized.

Data Prepocessing (1)
Goal for Column Organization (1)
  • character_ID
  • power_difference: Monthly Change in Combat Power by Character
  • low_dungeon: Monthly Plays in Low-level Dungeons
  • high_dungeon: Monthly Plays in High-level Dungeons
  • quest_dungeon: Monthly Plays in Quest Dungeons
  • current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
  • next_power_level: (N+1) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
import pandas as pd

# Data Load
df2021 = pd.read_excel("./data/2021.xlsx")
# Verify Data Dimensions
df2021.shape
# Check the top 5 rows
df2021.head()
# Check for Missing Values
df2021.isna().sum()
# Replace Missing Values
df2021.fillna(0, inplace = True)

# Process Unique CharacterID
df2021["CharacterID"] = range(len(df2021["CharacterID"]))
# Extract Columns from DataFrame
# df2021_power = df2021[["cb_a2101", "cb_a2102", "cb_a2103", "cb_a2104", "cb_a2105", "cb_a2106", "cb_a2107", "cb_a2108", "cb_a2109", "cb_a2110", "cb_a2111", "cb_a2111"]]
df2021_power = df2021.iloc[:, 2:14]
def convert_power(x):
    if x <= 100000:
        x = 1
    elif (x > 100000) and (x <= 500000):
        x = 2
    elif (x > 500000) and (x <= 1000000):
        x = 3
    elif (x > 1000000) and (x <= 2500000):
        x = 4
    elif (x > 2500000) and (x <= 5000000):
        x = 5
    elif (x > 5000000) and (x <= 7500000):
        x = 6
    elif (x > 7500000):
        x = 7
    else:
        print("power error")
    
    return x
df2021["m01_dg_low"] = df2021["m01_dun1"] + df2021["m01_dun2"]
df2021["m01_dg_high"] = df2021["m01_dun3"] + df2021["m01_dun4"]

df2021["m02_dg_low"] = df2021["m02_dun1"] + df2021["m02_dun2"]
df2021["m02_dg_high"] = df2021["m02_dun3"] + df2021["m02_dun4"]

df2021["m03_dg_low"] = df2021["m03_dun1"] + df2021["m03_dun2"]
df2021["m03_dg_high"] = df2021["m03_dun3"] + df2021["m03_dun4"]

df2021["m04_dg_low"] = df2021["m04_dun1"] + df2021["m04_dun2"]
df2021["m04_dg_high"] = df2021["m04_dun3"] + df2021["m04_dun4"]

df2021["m05_dg_low"] = df2021["m05_dun1"] + df2021["m05_dun2"]
df2021["m05_dg_high"] = df2021["m05_dun3"] + df2021["m05_dun4"]

df2021["m06_dg_low"] = df2021["m06_dun1"] + df2021["m06_dun2"]
df2021["m06_dg_high"] = df2021["m06_dun3"] + df2021["m06_dun4"]

df2021["m07_dg_low"] = df2021["m07_dun1"] + df2021["m07_dun2"]
df2021["m07_dg_high"] = df2021["m07_dun3"] + df2021["m07_dun4"]

df2021["m08_dg_low"] = df2021["m08_dun1"] + df2021["m08_dun2"]
df2021["m08_dg_high"] = df2021["m08_dun3"] + df2021["m08_dun4"]

df2021["m09_dg_low"] = df2021["m09_dun1"] + df2021["m09_dun2"]
df2021["m09_dg_high"] = df2021["m09_dun3"] + df2021["m09_dun4"]

df2021["m10_dg_low"] = df2021["m10_dun1"] + df2021["m10_dun2"]
df2021["m10_dg_high"] = df2021["m10_dun3"] + df2021["m10_dun4"]

df2021["m11_dg_low"] = df2021["m11_dun1"] + df2021["m11_dun2"]
df2021["m11_dg_high"] = df2021["m11_dun3"] + df2021["m11_dun4"]

df2021["m12_dg_low"] = df2021["m12_dun1"] + df2021["m12_dun2"]
df2021["m12_dg_high"] = df2021["m12_dun3"] + df2021["m12_dun4"]
# Create DataFrame for Each Character
df_character = pd.DataFrame()

for idx_character in range(df2021.shape[0]):
    for idx_power in range(3, 14):
        if (df2021.iloc[idx_character, idx_power] != 0) and (df2021.iloc[idx_character, idx_power - 1] != 0):
            power_diff = df2021.iloc[idx_character, idx_power] - df2021.iloc[idx_character, idx_power - 1]
            
            # 1,2
            if idx_power == 3:
                dg_low = df2021.iloc[idx_character, 98]
                dg_high = df2021.iloc[idx_character, 99]
                dg_quest = df2021.iloc[idx_character, 42]
                
            # 2,3
            elif idx_power == 4:
                dg_low = df2021.iloc[idx_character, 100]
                dg_high = df2021.iloc[idx_character, 101]
                dg_quest = df2021.iloc[idx_character, 47]
                
            # 3,4
            elif idx_power == 5:
                dg_low = df2021.iloc[idx_character, 102]
                dg_high = df2021.iloc[idx_character, 103]
                dg_quest = df2021.iloc[idx_character, 52]
                
            # 4,5
            elif idx_power == 6:
                dg_low = df2021.iloc[idx_character, 104]
                dg_high = df2021.iloc[idx_character, 105]
                dg_quest = df2021.iloc[idx_character, 57]
                
            # 5,6
            elif idx_power == 7:
                dg_low = df2021.iloc[idx_character, 106]
                dg_high = df2021.iloc[idx_character, 107]
                dg_quest = df2021.iloc[idx_character, 62]
                
            # 6,7
            elif idx_power == 8:
                dg_low = df2021.iloc[idx_character, 108]
                dg_high = df2021.iloc[idx_character, 109]
                dg_quest = df2021.iloc[idx_character, 67]
                
            # 7,8
            elif idx_power == 9:
                dg_low = df2021.iloc[idx_character, 110]
                dg_high = df2021.iloc[idx_character, 111]
                dg_quest = df2021.iloc[idx_character, 72]
                
            # 8,9
            elif idx_power == 10:
                dg_low = df2021.iloc[idx_character, 112]
                dg_high = df2021.iloc[idx_character, 113]
                dg_quest = df2021.iloc[idx_character, 77]
            # 9,10
            elif idx_power == 11:
                dg_low = df2021.iloc[idx_character, 114]
                dg_high = df2021.iloc[idx_character, 115]
                dg_quest = df2021.iloc[idx_character, 82]
                
            # 10,11
            elif idx_power == 12:
                dg_low = df2021.iloc[idx_character, 116]
                dg_high = df2021.iloc[idx_character, 117]
                dg_quest = df2021.iloc[idx_character, 87]
                
            # 11,12
            elif idx_power == 13:
                dg_low = df2021.iloc[idx_character, 118]
                dg_high = df2021.iloc[idx_character, 119]
                dg_quest = df2021.iloc[idx_character, 92]
            
            current_power_level = convert_power(df2021.iloc[idx_character, idx_power - 1])
            next_power_level = convert_power(df2021.iloc[idx_character, idx_power])
                
        else:
            continue
        
        tmp_df = pd.DataFrame({"power_difference": power_diff, "low_dungeon": dg_low, "high_dungeon": dg_high,
                               "quest_dungeon": dg_quest, "current_power_level":current_power_level, "next_power_level":next_power_level}, index = [0])
        df_character = pd.concat([df_character, tmp_df], axis = 0)
# Organize DataFrame Index
df_character = df_character.reset_index(drop=True)
# Check the top 5 rows
df_character.head(5)
# current power level distribution
df_character["current_power_level"].value_counts().sort_index()

next_power_level
1    1799
2    3693
3    3267
4    3528
5    2449
6     531
7     105
Name: count, dtype: int64

Final Data (1)

Character ID_changedpower_differencelow_dungeonhigh_dungeonquest_dungeoncurrent_power_levelnext_power_level
0394490.410.07.00.034
1335176.810.00.00.044
216899.696.012.00.044
30.00.00.00.044
482074.00.00.00.044
*5 rows × 6 columns
*power_difference: Monthly Change in Combat Power by Character
*low_dungeon: Monthly Plays in Low-level Dungeons
*high_dungeon: Monthly Plays in High-level Dungeons
*quest_dungeon: Monthly Plays in Quest Dungeons
*current_power_level: (N) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
*next_power_level: (N+1) Months’ Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)]

Data Prepocessing (2)
Goal for Column Organization (2)
  • User_ID
  • current_attendance: (N) Monthly Login Days
  • current_paid: (N) Monthly Spending Amount
  • num_characters: Number of Characters Owned by Each User
  • power_level: Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
  • next_attendance: (N+1) Monthly Login Days
  • next_paid: (N+1) Monthly Spending Amount
# Remove duplicate rows based on UserID
df2021_unique = df2021.drop_duplicates(subset='UserID', keep='first', ignore_index=True)
# Verify Data Dimensions
df2021_unique.shape
# Create a Column for the Number of Characters per User
df2021_unique = pd.merge(df2021_unique, df2021['UserID'].value_counts().reset_index(), how='left', on='UserID')

# Create a Column for Average Combat Power per User
tmp_df2021 = df2021.iloc[:, :14].drop('CharacterID', axis = 1)

tmp_df2021 = tmp_df2021.groupby("UserID").mean().reset_index()
tmp_df2021.columns = ["UserID", "avg_power_01", "avg_power_02", "avg_power_03", "avg_power_04", "avg_power_05", "avg_power_06",
                      "avg_power_07", "avg_power_08", "avg_power_09", "avg_power_10", "avg_power_11", "avg_power_12"]

df2021_unique = pd.merge(df2021_unique, tmp_df2021, how = 'left', on = 'UserID')
# Create DataFrame for Each Character
df_user = pd.DataFrame()

for idx_user in range(df2021_unique.shape[0]):
    for idx_power in range(123, 135):
        # Jan
        if idx_power == 123:
            attendance = df2021_unique.iloc[idx_user, 14]                            # 이번달 출석일
            paid = df2021_unique.iloc[idx_user, 26]                                  # 이번달 과금액
            nCharacter = df2021_unique.iloc[idx_user, 122]                           # 부캐 개수
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])     # 부캐 평균 전투력 등급
            next_attendance = df2021_unique.iloc[idx_user, 15]                       # 다음날 출석일
            next_paid = df2021_unique.iloc[idx_user, 27]                             # 다음달 과금액
            
        # Feb
        elif idx_power == 124:
            attendance = df2021_unique.iloc[idx_user, 15]
            paid = df2021_unique.iloc[idx_user, 27]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 16]
            next_paid = df2021_unique.iloc[idx_user, 28]
            
        # Mar
        elif idx_power == 125:
            attendance = df2021_unique.iloc[idx_user, 16]
            paid = df2021_unique.iloc[idx_user, 28]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 17]
            next_paid = df2021_unique.iloc[idx_user, 29]
            
        # Apr
        elif idx_power == 126:
            attendance = df2021_unique.iloc[idx_user, 17]
            paid = df2021_unique.iloc[idx_user, 29]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 18]
            next_paid = df2021_unique.iloc[idx_user, 30]
            
        # May
        elif idx_power == 127:
            attendance = df2021_unique.iloc[idx_user, 18]
            paid = df2021_unique.iloc[idx_user, 30]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 19]
            next_paid = df2021_unique.iloc[idx_user, 31]
            
        # Jun
        elif idx_power == 128:
            attendance = df2021_unique.iloc[idx_user, 19]
            paid = df2021_unique.iloc[idx_user, 31]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 20]
            next_paid = df2021_unique.iloc[idx_user, 32]
            
        # Jul
        elif idx_power == 129:
            attendance = df2021_unique.iloc[idx_user, 20]
            paid = df2021_unique.iloc[idx_user, 32]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 21]
            next_paid = df2021_unique.iloc[idx_user, 33]
            
        # Aug
        elif idx_power == 130:
            attendance = df2021_unique.iloc[idx_user, 21]
            paid = df2021_unique.iloc[idx_user, 33]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 22]
            next_paid = df2021_unique.iloc[idx_user, 34]
            
        # Sep
        elif idx_power == 131:
            attendance = df2021_unique.iloc[idx_user, 22]
            paid = df2021_unique.iloc[idx_user, 34]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 23]
            next_paid = df2021_unique.iloc[idx_user, 35]
            
        # Oct
        elif idx_power == 132:
            attendance = df2021_unique.iloc[idx_user, 23]
            paid = df2021_unique.iloc[idx_user, 35]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 24]
            next_paid = df2021_unique.iloc[idx_user, 36]
            
        # Nov
        elif idx_power == 133:
            attendance = df2021_unique.iloc[idx_user, 24]
            paid = df2021_unique.iloc[idx_user, 36]
            nCharacter = df2021_unique.iloc[idx_user, 122]
            power_level = convert_power(df2021_unique.iloc[idx_user, idx_power])
            next_attendance = df2021_unique.iloc[idx_user, 25]
            next_paid = df2021_unique.iloc[idx_user, 37]
        
        tmp_df = pd.DataFrame({"current_attendance": attendance, "current_paid": paid, "num_characters": nCharacter,
                               "power_level": power_level, "next_attendance": next_attendance, "next_paid": next_paid}, index = [0])
        df_user = pd.concat([df_user, tmp_df], axis = 0)
# Organize DataFrame Index
df_user = df_user.reset_index(drop=True)
# Check the top 5 rows
df_user.head()

# next attendance distribution
df_user["next_attendance"].value_counts().sort_index()

next_attendance
0     138459
1      13782
2       5796
3       3369
4       2506
5       1783
6       1366
7       1086
8        946
9        805
10       723
11       663
12       580
13       477
14       435
15       439
16       396
17       395
18       387
19       337
20       326
21       339
22       347
23       337
24       304
25       346
26       362
27       418
28       613
29       522
30       959
31      1105
Name: count, dtype: int64

Final Data (2)

UserID_changedcurrent_attendancecurrent_paidnum_characterspower_levelnext_attendancenext_paid
0170.0131314000.0
11314000.014265000.0
2265000.01480.0
380.01410.0
410.0141612950.0
*User_ID
*current_attendance: (N) Monthly Login Days
*current_paid: (N) Monthly Spending Amount
*num_characters: Number of Characters Owned by Each User
*power_level: Combat Power Categorization (Higher Numbers Indicate Greater Combat Strength)
*next_attendance: (N+1) Monthly Login Days
*next_paid: (N+1) Monthly Spending Amount

  1. Visualize Correlation by Feature
  2. Histogram of Feature Distributions
  3. Scatter Plot of Data Features

*Extracted 2022 Data to Compare with the 2021 Data

Final Data (1)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Setting for seaborn figure
rc = {
    "axes.facecolor": "#F8F8F8",
    "figure.facecolor": "#F8F8F8",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7" + "30",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}

sns.set(rc=rc)
palette = ['#302c36', '#037d97', '#E4591E', '#C09741',
           '#EC5B6D', '#90A6B1', '#6ca957', '#D8E3E2']

def plot_correlation_heatmap(df: pd.core.frame.DataFrame, title_name: str='Train correlation') -> None:
    """Draws the correlation heatmap plot.
    
    Args:
        df: train or test dataframes
        title_name: 'Train' or 'Test' (default 'Train correlation')
        
    Returns:
        subplots of size (len(col_list), 2)
    """

    corr = df.corr()
    fig, axes = plt.subplots(figsize=(15, 6))
    mask = np.zeros_like(corr)
    mask[np.triu_indices_from(mask)] = True # Display only the bottom part
    sns.heatmap(corr, mask=mask, linewidths=.5,  cmap=palette[5:][::-2] + palette[1:3], annot=True, fmt=".2f", annot_kws={'size':8})
    plt.title(title_name)
    plt.show()
# Data Load (2021)
df_character = pd.read_csv("./preprocessed_data/df2021_character.csv")
# Check the top 5 rows
df_character.head()
# Visualize Correlation by Feature
plot_correlation_heatmap(df_character)

# Data Load (2022)
df_character2 = pd.read_csv("./preprocessed_data/df2022_character.csv")
# Visualize Correlation by Feature
plot_correlation_heatmap(df_character2)

Interpretation of correlations between various variables

– This month’s Combat Power is highly correlated with next month’s Combat Power. Considering the situation, it appears that users with low Combat Power often leave while remaining low, and those with high Combat Power tend to reach a level where it becomes difficult to further increase their power, resulting in no further changes. Therefore, although there is a high correlation between the two, it is difficult to derive any meaningful interpretation from it.
– In 2021, compared to 2022, there was a high correlation between users playing dungeons of higher levels and quest dungeon play/this month’s Combat Power/next month’s Combat Power. Additionally, users who played higher-level dungeons in 2021 also appear to have played many lower-level dungeons. -> From this, we can infer that in 2021, core users with higher Combat Power than in 2022 were generally more active in gameplay.
– Overall, it seems that in 2022, all gameplay indicators (compared to 2021) did not increase in conjunction with other gameplay metrics. In other words, while in 2021, core gamers engaged in various forms of play in the game, in 2022, it appears they neither increased their Combat Power nor played dungeons as much, staying strong only in Combat Power.
– However, one deviation from this trend is the increased correlation between Combat Power and quest dungeons in 2022. This might be because quest dungeons are essential for character development, suggesting that players, losing interest in core gameplay, may have started focusing on developing their secondary characters.

# Scatter Plot Between Features (2021)
a2021 = sns.pairplot(df_character, diag_kind = 'kde')
a2021.fig.suptitle("2021")
plt.show()

# Scatter Plot Between Features (2022)
a2021 = sns.pairplot(df_character, diag_kind = 'kde')
a2021.fig.suptitle("2021")
plt.show()

Interpretation of Scatter Plot

– As previously analyzed, the 2022 data shows a distribution that is generally lower compared to 2021.
– Compared to 2021, the Power Difference in 2022 tends to be higher, which indicates that the decline in gameplay metrics suggests that users are focusing more on increasing their skills through spending rather than through gameplay.

# Histogram Between Features (2021)
sns.histplot(df_character, x = "low_dungeon", kde = True).set(title='2021')
plt.show()

# Histogram Between Features (2022)
sns.histplot(df_character2, x = "low_dungeon", kde = True).set(title='2022')
plt.show()

Difficult to interpret

It’s difficult to interpret the distribution because there are many users who didn’t play at all in both 2021 and 2022.

Final Data (2)

Used the Same Code and Only Documented the Results.

Interpretation of correlations between various variables

– Combat Power Level showed a strong correlation with this month’s attendance and this month’s attendance also had a high correlation with next month’s attendance. Additionally, there was a significant correlation between users who made payments this month and those who made payments next month.
– Overall, Combat Power level, attendance days, and payment amount showed relatively strong correlations. This is a natural result since users who are more engaged in the game are likely to have high values in all three metrics.
– A notable point is that, contrary to the expectation that raising many secondary characters indicates a high level of affection for the game, this was not the case. The number of secondary characters had almost no correlation with all the variables. In this game, secondary characters are supposed to assist the main character, and it was quite surprising that they were unrelated to other metrics. As a result, we aim to understand that users primarily focus on one character and implement various campaigns to foster attachment to a single character.
– Unlike the gameplay metrics we previously examined, the 2022 metrics did not show a significant overall decrease compared to 2021. This could be seen as a positive trend, but there is a high level of concern that users may be entering the game without much focus, or engaging in habitual spending behaviors due to a lack of meaningful activities within the game.

Interpretation of Scatter Plot

The distribution of the 2021 data and the 2022 data is nearly identical.

Interpretation of Histogram

Most users have low attendance rates.

Removed zeros from the 2021 data as the figures appeared unclear

# Remove Rows with 0 Logins for the Next Month
df_user_nz = df_user[df_user['next_attendance'] != 0]

Interpretation of correlations between various variables

There was not a significant change compared to the data before removing zeros.