Learning big data analysis and machine learning from zero; here are some brief notes on the competition. Final score: 0.623537, ranking 629 / 5602.
1, Competition background
Merchants sometimes run large promotions (such as discounts or cash coupons) on particular dates (such as Boxing Day, "Black Friday", or "Double 11", November 11) to attract large numbers of new buyers. However, many of these buyers are one-off: they never purchase again after the promotion, so the campaign brings the store no future sales. To alleviate this problem, merchants need to identify which new buyers can be converted into repeat buyers. By targeting these potentially loyal customers with precise marketing, merchants can greatly reduce promotion costs and improve return on investment (ROI). Accurately targeting users is a well-known challenge in online advertising, especially for new buyers; however, the user behavior logs Tmall has accumulated over a long period may help solve it. This challenge provides a set of merchants and their new buyers from the Double 11 promotion. The task is to predict which new buyers of a given merchant will become loyal customers; in other words, to predict the probability that each new buyer will purchase from the same merchant again within the next six months. A data set of about 200,000 users is provided for training, with another of similar scale for testing. As in other competitions, you can extract any features and train with any tools you like; only the prediction results need to be submitted for evaluation.
Link: Tmall Repeat Buyer Prediction Challenge Baseline (Tianchi competition, Alibaba Cloud Tianchi)
2, Data exploration
Import related packages first:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Read data:
train_data = pd.read_csv("data_format1/train_format1.csv")
test_data = pd.read_csv("data_format1/test_format1.csv")
user_info = pd.read_csv("data_format1/user_info_format1.csv")
user_log = pd.read_csv("data_format1/user_log_format1.csv")
Check the user info data for missing age values.
Check the user info data for missing gender values.
Check for users missing either age or gender.
Check the user behavior log data for missing values. (A sketch of these checks follows.)
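The original check code is not reproduced here; a minimal sketch of how these counts might be obtained, assuming this dataset's documented encodings (unknown age as NaN or 0, unknown gender as NaN or 2):

# users whose age is missing (NaN, or 0, which also encodes "unknown")
print(user_info["age_range"].isnull().sum())
print((user_info["age_range"] == 0).sum())

# users whose gender is missing (NaN, or 2, which also encodes "unknown")
print(user_info["gender"].isnull().sum())
print((user_info["gender"] == 2).sum())

# users missing either age or gender
print((user_info["age_range"].isnull() | user_info["gender"].isnull()).sum())

# missing values per column in the behavior log
print(user_log.isnull().sum())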
Analyze the stores (merchants); a sketch of this step follows.
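The store analysis code is also not shown; a minimal sketch of one way to start, assuming seller_id identifies the store in user_log and action_type == 2 marks a purchase (the same encoding the feature engineering below relies on):

# total interactions per store, most active first
store_counts = user_log.groupby("seller_id")["item_id"].count().sort_values(ascending=False)
print(store_counts.head(10))

# purchases per store
purchases = user_log[user_log["action_type"] == 2]
print(purchases.groupby("seller_id")["item_id"].count().sort_values(ascending=False).head(10))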
3, Feature engineering
Import related packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = "SimHei"  # render Chinese characters in plots correctly
import seaborn as sns
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn import model_selection
from sklearn.neighbors import KNeighborsRegressor
Read data:
df_train = pd.read_csv(r'data_format1\train_format1.csv')
df_test = pd.read_csv(r'data_format1\test_format1.csv')
user_info = pd.read_csv(r'data_format1\user_info_format1.csv')
user_log = pd.read_csv(r'data_format1\user_log_format1.csv')
print(df_test.shape, df_train.shape)
print(user_info.shape, user_log.shape)
Visualization of age distribution:
fig = plt.figure(figsize=(10, 6))
x = np.array(["NULL", "<18", "18-24", "25-29", "30-34", "35-39", "40-49", ">=50"])
# age_range codes: 1 is <18; 2 is [18,24]; 3 is [25,29]; 4 is [30,34];
# 5 is [35,39]; 6 is [40,49]; 7 and 8 both mean >=50
y = np.array([user_info['age_range'].isnull().sum()
              + user_info[user_info['age_range'] == 0]['age_range'].count(),  # unknown ages are stored as NaN or 0
              user_info[user_info['age_range'] == 1]['age_range'].count(),
              user_info[user_info['age_range'] == 2]['age_range'].count(),
              user_info[user_info['age_range'] == 3]['age_range'].count(),
              user_info[user_info['age_range'] == 4]['age_range'].count(),
              user_info[user_info['age_range'] == 5]['age_range'].count(),
              user_info[user_info['age_range'] == 6]['age_range'].count(),
              user_info[user_info['age_range'] == 7]['age_range'].count()
              + user_info[user_info['age_range'] == 8]['age_range'].count()])
plt.bar(x, y, label='Number of people')
plt.legend()
plt.title('User age distribution')
The resulting bar chart shows the distribution of users across age groups.
Now construct the per-user-per-merchant features and merge them into the training set:
df_train = pd.merge(df_train, user_info, on="user_id", how="left")

# total number of log records (interactions) per user-merchant pair
total_logs_temp = user_log.groupby([user_log["user_id"], user_log["seller_id"]])["item_id"].count().reset_index()
total_logs_temp.rename(columns={"seller_id": "merchant_id", "item_id": "total_item_id"}, inplace=True)
df_train = pd.merge(df_train, total_logs_temp, on=["user_id", "merchant_id"], how="left")

# number of distinct items per user-merchant pair
unique_item_id = user_log.groupby(["user_id", "seller_id", "item_id"]).count().reset_index()[["user_id", "seller_id", "item_id"]]
unique_item_id_cnt = unique_item_id.groupby(["user_id", "seller_id"]).count().reset_index()
unique_item_id_cnt.rename(columns={"seller_id": "merchant_id", "item_id": "unique_item_id"}, inplace=True)
df_train = pd.merge(df_train, unique_item_id_cnt, on=["user_id", "merchant_id"], how="left")

# number of distinct item categories per user-merchant pair
cat_id_temp = user_log.groupby(["user_id", "seller_id", "cat_id"]).count().reset_index()[["user_id", "seller_id", "cat_id"]]
cat_id_temp_cnt = cat_id_temp.groupby(["user_id", "seller_id"]).count().reset_index()
cat_id_temp_cnt.rename(columns={"seller_id": "merchant_id", "cat_id": "total_cat_id"}, inplace=True)
df_train = pd.merge(df_train, cat_id_temp_cnt, on=["user_id", "merchant_id"], how="left")

# number of distinct active days (time stamps) per user-merchant pair
time_temp = user_log.groupby(["user_id", "seller_id", "time_stamp"]).count().reset_index()[["user_id", "seller_id", "time_stamp"]]
time_temp_cnt = time_temp.groupby(["user_id", "seller_id"]).count().reset_index()
time_temp_cnt.rename(columns={"seller_id": "merchant_id", "time_stamp": "total_time_temp"}, inplace=True)
df_train = pd.merge(df_train, time_temp_cnt, on=["user_id", "merchant_id"], how="left")

# per-action-type counts: 0 = click, 1 = add-to-cart, 2 = purchase, 3 = favourite
click_temp = user_log.groupby(["user_id", "seller_id", "action_type"])["item_id"].count().reset_index()
click_temp.rename(columns={"seller_id": "merchant_id", "item_id": "times"}, inplace=True)
click_temp["clicks"] = (click_temp["action_type"] == 0) * click_temp["times"]
click_temp["shopping_cart"] = (click_temp["action_type"] == 1) * click_temp["times"]
click_temp["purchases"] = (click_temp["action_type"] == 2) * click_temp["times"]
click_temp["favourites"] = (click_temp["action_type"] == 3) * click_temp["times"]
four_features = click_temp.groupby(["user_id", "merchant_id"]).sum().reset_index()
# drop the helper columns
four_features = four_features.drop(["action_type", "times"], axis=1)
# merge the four action-count features
df_train = pd.merge(df_train, four_features, on=["user_id", "merchant_id"], how="left")
# forward-fill missing values (note: this copies a neighbouring row's value
# into the gap; filling with 0 or a sentinel is a reasonable alternative)
df_train = df_train.fillna(method="ffill")
Save the engineered features to a separate file:
# save the constructed features
df_train.to_csv("df_train.csv", index=None)
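The features above exist only on df_train, but predictions must eventually be made on df_test, which needs the same columns. A minimal sketch of reusing the intermediate tables while they are still in scope (the loop and file name here are illustrative, not from the original):

# apply the same feature construction to the test set
df_test = pd.merge(df_test, user_info, on="user_id", how="left")
for feature_table in (total_logs_temp, unique_item_id_cnt, cat_id_temp_cnt,
                      time_temp_cnt, four_features):
    df_test = pd.merge(df_test, feature_table, on=["user_id", "merchant_id"], how="left")
df_test = df_test.fillna(method="ffill")
df_test.to_csv("df_test.csv", index=None)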
4, Model building
y = df_train["label"] X = df_train.drop(["user_id", "merchant_id", "label"], axis=1) X.head(10)
Split data:
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
Logistic regression:
# logistic regression
Logit = LogisticRegression(solver='liblinear')
Logit.fit(X_train, y_train)
Predict = Logit.predict(X_test)
Predict_proba = Logit.predict_proba(X_test)
print(Predict[0:20])
print(Predict_proba[:])
Score = accuracy_score(y_test, Predict)
Score
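Because most new buyers never return, the classes are heavily imbalanced, and raw accuracy is easy to inflate by always predicting the majority class. Since the task asks for probabilities, a ranking metric such as ROC AUC is a more telling check; a quick sketch on the same split:

from sklearn.metrics import roc_auc_score

# probability of the positive class (label == 1) is in column 1
print("Logistic regression AUC: {:.3f}".format(roc_auc_score(y_test, Predict_proba[:, 1])))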
Decision tree:
# decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
Predict_proba = tree.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
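To produce a submission, the positive-class probabilities are written into the prob placeholder column of the test set. A minimal sketch, assuming df_test has been given the same feature columns as df_train (see the sketch at the end of the feature engineering section) and that test_format1.csv carries a prob column as in this competition's format:

# build the test design matrix with the same feature columns as X
X_submit = df_test.drop(["user_id", "merchant_id", "prob"], axis=1)
df_test["prob"] = tree.predict_proba(X_submit)[:, 1]
df_test[["user_id", "merchant_id", "prob"]].to_csv("submission.csv", index=None)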