E-commerce Data Preprocessing
Today we have preprocessed the business data, mainly processing
1.Extract order data for 2019
2.Handle data for business process inconsistencies (payment time is earlier than order time, payment time is longer than 30 minutes, order amount is less than 0, payment amount is less than 0)
3.Data with empty processing channel (supplement majority)
4.Processing Platform Type Field (Remove extra spaces and keep data consistent)
5.Add a discount field to handle fields with discounts greater than 1 (change payment amount to Order amount * Average discount)
Data here, interested in doing with me. (Baidu Disk)
Links:https://pan.baidu.com/s/1yprZr0sxzLLPPBy-ICCocg
Extraction Code: dmmo
Description of each field:
1.'id'index soy sauce
2.'orderID'order id
3. 'userID'User id
4. 'goodsID'commodity id
5. 'orderAmount'order price
6. 'payment'Paid Price
7. 'chanelID'channel
8. 'platfromType'platform type
9. 'orderTime'placing order time
10. 'payTime'Payment Time
11. 'chargeback'refund or not
Say nothing but code directly
# Import Data Analysis of Three Magic Repositories import numpy as np import pandas as pd import matplotlib.pyplot as plt
1. Importing data
# 1. Importing data ds_df = pd.read_excel('data/Order data of an e-commerce website.xlsx') #Please change your path to the path where you put your data
1.1 View the first five rows of data
ds_df.head() # View the first five rows of data
1.2 Attributes and other information
ds_df.info()
2. Data Preprocessing
2.1.Extract order data for 2019
# Make Boolean Index df_19 = ds_df[ds_df.orderTime.dt.year==2019] len(df_19)
Now that there's less data, let's take a closer look at this new df_19
2.2 Handle business process inconsistencies (payment time is earlier than order time, payment time is longer than 30 minutes, order amount is less than 0, payment amount is less than 0)
# Filter orders that are paid before the order is placed df_19 = df_19[~(df_19.payTime<df_19.orderTime)] # Filter orders that take longer than 30 minutes to pay df_19 = df_19[(df_19.payTime-df_19.orderTime)<pd.to_timedelta('0 days 00:30:00')] # Filter out orders with order amount less than 0 and payment amount less than 0 df_19 = df_19[(df_19.orderAmount>0) & (df_19.payment > 0)] len(df_19)
2.3.Data with empty processing channel (supplement majority)
df_19.chanelID.isnull().any() # Determine if there is a null value # Get the chanelID's majority, which is an object and cannot be used to populate directly chanel_mode = df_19.chanelID.mode() # Fill in empty values (convert the above objects to strings and refill) df_19.fillna(value={'chanelID':str(chanel_mode)},inplace=True)
2.4.Processing Platform Type Field (Remove extra spaces and keep data consistent)
# Convert all characters to lowercase df_19.platfromType = df_19.platfromType.str.lower() # Remove extra spaces and keep data consistent df_19.platfromType = df_19.platfromType.str.replace(' ','')
2.5.Add a discount field to handle fields with discounts greater than 1 (change payment amount to Order amount * Average discount)
# Obtain discount normal order temp = df_19[~((df_19.payment / df_19.orderAmount)>1)] # Calculate average discount for normal orders mean_discount=(temp.payment / temp.orderAmount).mean() # Take out an unusual discount order temp1 = df_19[(df_19.payment / df_19.orderAmount)>1] # Correct abnormal payment price temp1['payment'] = np.round(temp1.orderAmount*mean_discount,2) temp1.set_index('id') temp1.index over_discount_index = temp1.index # Replace abnormal rows in df_19 with temp1 rows df_19.loc[over_discount_index] = temp1.loc[over_discount_index]
This is where the preprocessing is done, but you find the index doesn't look good, so I reset it
df_19.reset_index(drop=True,inplace=True)
This works when you're done.
Okay, it's not easy to create, it helps to leave your approval or small amount of appreciation!