python e-commerce data preprocessing

Keywords: Python pandas

E-commerce Data Preprocessing

Today we have preprocessed the business data, mainly processing

1.Extract order data for 2019
2.Handle data for business process inconsistencies (payment time is earlier than order time, payment time is longer than 30 minutes, order amount is less than 0, payment amount is less than 0)
3.Data with empty processing channel (supplement majority)
4.Processing Platform Type Field (Remove extra spaces and keep data consistent)
5.Add a discount field to handle fields with discounts greater than 1 (change payment amount to Order amount * Average discount)

Data here, interested in doing with me. (Baidu Disk)
Links:https://pan.baidu.com/s/1yprZr0sxzLLPPBy-ICCocg
Extraction Code: dmmo

Description of each field:
1.'id'index soy sauce
2.'orderID'order id
3. 'userID'User id
4. 'goodsID'commodity id
5. 'orderAmount'order price
6. 'payment'Paid Price
7. 'chanelID'channel
8. 'platfromType'platform type
9. 'orderTime'placing order time
10. 'payTime'Payment Time
11. 'chargeback'refund or not

Say nothing but code directly

# Import Data Analysis of Three Magic Repositories
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1. Importing data

# 1. Importing data
ds_df = pd.read_excel('data/Order data of an e-commerce website.xlsx') 
 #Please change your path to the path where you put your data

1.1 View the first five rows of data

ds_df.head() # View the first five rows of data

1.2 Attributes and other information

ds_df.info()

2. Data Preprocessing

2.1.Extract order data for 2019

# Make Boolean Index
df_19 = ds_df[ds_df.orderTime.dt.year==2019]
len(df_19)


Now that there's less data, let's take a closer look at this new df_19

2.2 Handle business process inconsistencies (payment time is earlier than order time, payment time is longer than 30 minutes, order amount is less than 0, payment amount is less than 0)

 # Filter orders that are paid before the order is placed
df_19 = df_19[~(df_19.payTime<df_19.orderTime)] 

# Filter orders that take longer than 30 minutes to pay
df_19 = df_19[(df_19.payTime-df_19.orderTime)<pd.to_timedelta('0 days 00:30:00')] 
# Filter out orders with order amount less than 0 and payment amount less than 0
df_19 = df_19[(df_19.orderAmount>0) & (df_19.payment > 0)]
len(df_19)

2.3.Data with empty processing channel (supplement majority)

df_19.chanelID.isnull().any() # Determine if there is a null value

# Get the chanelID's majority, which is an object and cannot be used to populate directly
chanel_mode = df_19.chanelID.mode()  
# Fill in empty values (convert the above objects to strings and refill)
df_19.fillna(value={'chanelID':str(chanel_mode)},inplace=True)

2.4.Processing Platform Type Field (Remove extra spaces and keep data consistent)

# Convert all characters to lowercase
df_19.platfromType = df_19.platfromType.str.lower()
# Remove extra spaces and keep data consistent
df_19.platfromType = df_19.platfromType.str.replace(' ','')

2.5.Add a discount field to handle fields with discounts greater than 1 (change payment amount to Order amount * Average discount)

# Obtain discount normal order
temp = df_19[~((df_19.payment / df_19.orderAmount)>1)]
# Calculate average discount for normal orders
mean_discount=(temp.payment / temp.orderAmount).mean()
# Take out an unusual discount order
temp1 = df_19[(df_19.payment / df_19.orderAmount)>1]
# Correct abnormal payment price
temp1['payment'] = np.round(temp1.orderAmount*mean_discount,2)
temp1.set_index('id')
temp1.index
over_discount_index = temp1.index
# Replace abnormal rows in df_19 with temp1 rows
df_19.loc[over_discount_index] = temp1.loc[over_discount_index]

This is where the preprocessing is done, but you find the index doesn't look good, so I reset it

df_19.reset_index(drop=True,inplace=True)

This works when you're done.

Okay, it's not easy to create, it helps to leave your approval or small amount of appreciation!

Posted by bpgillett on Fri, 24 Sep 2021 09:24:37 -0700