Hello! As a machine learning beginner, I took part in the 2021 iFLYTEK A.I. Developer Competition some time ago, under the team name "write a story".
It was my first time in a big-data competition, so I mainly picked the relatively easy structured-data tracks. Of the five competitions I entered, I finished in the top five in three (advertising click-through rate prediction, offline store sales prediction, and mobile device user age and gender prediction). Besides an element of luck (the prize money wasn't large, so the experts went to other competitions), I also want to thank the fishman, Ah Shui, and the other top players for providing baselines and sharing a lot of material.
Here I'd like to share my approach to the advertising click-through rate prediction competition.
Competition link
iFLYTEK 2021 Advertising Click-Through Rate Estimation Challenge
Background and tasks
For mobile device manufacturers, it is very difficult to obtain the demographic attributes of their current users. Accurately predicting these attributes from users' phone preferences and daily app usage is the foundation for improving personalized experiences and building accurate user portraits.
Note that the data was collected with the full knowledge and consent of the individual users, and has been appropriately anonymized to protect privacy. For confidentiality reasons, details of how the gender and age data were obtained are not provided.
The competition has two tasks: predicting the gender and the age of each mobile device (device_id), i.e. a binary classification problem and a regression problem. The scores of the two parts are combined for the final ranking.
The code
The code is commented throughout. I'm not very used to organizing code into functions, but I'm sure everyone can follow it.
```python
# =============================================================================
# Import toolkit
# =============================================================================
import os
import itertools
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm  # used by _his_click_rate; missing from the original imports
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score
import lightgbm as lgb

warnings.filterwarnings('ignore')
os.chdir('C:/Users/yyz/Desktop/match/Advertising click through rate/data/Advertising click through rate estimation challenge_data set/')

# =============================================================================
# Read and merge data
# =============================================================================
df_tr = pd.read_csv('train.csv')
df_te = pd.read_csv('test.csv')
df_tr_te = pd.concat([df_tr, df_te], axis=0, ignore_index=True)

# Distinguish train from test: isClick is missing on the test rows, so fill with -1
df_tr_te['isClick'] = df_tr_te['isClick'].fillna(-1)

# Read the submission template
df_sub = pd.read_csv('sample_submit.csv')

# Break the date string into hour and day
df_tr_te['hour'] = df_tr_te['date'].apply(lambda x: int(x.split(' ')[-1].split(':')[0]))
df_tr_te['day'] = df_tr_te['date'].apply(lambda x: int(x.split(' ')[0].split('-')[1]))

# Flag user_ids that appear 3 times or fewer (0) versus the rest (1)
te = df_tr_te['user_id'].value_counts().reset_index()
lis_thr = te[te['user_id'] <= 3]['index'].unique().tolist()
df_tr_te['thr'] = np.where(df_tr_te['user_id'].isin(lis_thr), 0, 1)

# =============================================================================
# Feature engineering
# =============================================================================
# Historical click-through rate over a sliding window of past days
def _his_click_rate(df, f1, window_size=2):
    fea_name = '{}_his_{}_clickrate'.format(f1, window_size)
    df[fea_name] = 0
    for i in tqdm(range(3, 8)):
        df_t = df.loc[(df['day'] >= i - window_size) & (df['day'] < i)]
        inds = df['day'] == i
        df.loc[inds, fea_name] = df.loc[inds, f1].map(df_t.groupby(f1)['isClick'].mean())
    return df

df_tr_te = _his_click_rate(df=df_tr_te, f1='user_id', window_size=5)

# One more historical CTR feature added on top of the baseline: user x webpage
df_tr_te['user_id_webpage_id'] = [str(i) + str(j) for i, j in zip(df_tr_te['user_id'], df_tr_te['webpage_id'])]
df_tr_te = _his_click_rate(df=df_tr_te, f1='user_id_webpage_id', window_size=5)

# Rolling-window feature: mean of the previous 3 clicks per user/product/day
df_tr_te['user_product_day_5mean'] = df_tr_te.groupby(['user_id', 'product', 'day'])['isClick'] \
    .transform(lambda x: x.rolling(3).mean().shift(1))

# Fill missing gender values and encode
df_tr_te['gender'] = df_tr_te['gender'].fillna('NAN').map({'Female': 1, 'Male': 0, 'NAN': -1})

# Weekday recoding: group Friday, Saturday and Sunday into one class
df_tr_te['xingqi'] = df_tr_te['day'].replace([2, 3, 4, 5, 6, 7], [2, 2, 1, 0, 0, 0])

# Univariate count features
for c in ['user_id', 'product', 'hour', 'campaign_id', 'webpage_id', 'user_group_id',
          'age_level', 'gender', 'day', 'product_category_id', 'user_depth']:
    df_tr_te[c + '_cnt'] = df_tr_te.groupby(c)['id'].transform('count')

# Bivariate count features
# Note: the count is symmetric in the two keys, so itertools.combinations would
# avoid building every feature twice; permutations is kept as in the original.
lis_i = ['user_id', 'product', 'hour', 'campaign_id', 'webpage_id', 'user_group_id',
         'age_level', 'gender', 'day', 'product_category_id', 'user_depth']
lis_i_re = list(itertools.permutations(lis_i, 2))
for c in lis_i_re:
    df_tr_te[c[0] + c[1] + '_cnt'] = df_tr_te.groupby(list(c))['id'].transform('count')

# Parse the timestamp (the year is guessed to be 2021 from the number of rows)
df_tr_te['date'] = ['2021-' + i for i in df_tr_te['date']]
df_tr_te['date'] = pd.to_datetime(df_tr_te['date'])

# Time span per user, day and hour
df_tr_te['user_time_hour'] = df_tr_te.groupby(['user_id', 'day', 'hour'])['date'] \
    .transform(lambda x: (x.max() - x.min()).total_seconds())
# Time span per user and day
df_tr_te['user_time_day'] = df_tr_te.groupby(['user_id', 'day'])['date'] \
    .transform(lambda x: (x.max() - x.min()).total_seconds())

# First-order time difference between consecutive records of a user
df_tr_te['user_time_del'] = df_tr_te.groupby(['user_id'])['date'].transform(lambda x: x.diff(periods=-1))
df_tr_te['user_time_del'] = df_tr_te['user_time_del'].apply(lambda x: x.total_seconds())

# Count per user x product x webpage
df_tr_te['user_id_webpage_id_product'] = df_tr_te.groupby(['user_id', 'product', 'webpage_id'])['id'].transform('count')

# Weight features: group size divided by the 1-based position within the group
# Product weight by user and day
df_tr_te['user_id_day_range'] = df_tr_te.groupby(['user_id', 'day'])['product'] \
    .transform(lambda x: len(x) / np.array(range(1, len(x) + 1)))
# Product weight by user
df_tr_te['user_id_range'] = df_tr_te.groupby(['user_id'])['product'] \
    .transform(lambda x: len(x) / np.array(range(1, len(x) + 1)))
# Webpage weight by user and product
df_tr_te['user_id_product_webpage_range'] = df_tr_te.groupby(['user_id', 'product'])['webpage_id'] \
    .transform(lambda x: len(x) / np.array(range(1, len(x) + 1)))
# Webpage weight by user and campaign
df_tr_te['user_id_campaign_id_webpage_range'] = df_tr_te.groupby(['user_id', 'campaign_id'])['webpage_id'] \
    .transform(lambda x: len(x) / np.array(range(1, len(x) + 1)))

# Mean/sum of the time spans over single grouping variables
# (the original used user_time_hour in all four lines; the *_day_* features
# are switched to user_time_day here, which the naming clearly intends)
lis_i_1 = ['user_id', 'product', 'campaign_id', 'webpage_id', 'product_category_id',
           'user_group_id', 'age_level', 'gender', 'user_depth', 'var_1']
for c in lis_i_1:
    df_tr_te[str(c) + '_user_time_hour_mean'] = df_tr_te.groupby(c)['user_time_hour'].transform('mean')
    df_tr_te[str(c) + '_user_time_day_mean'] = df_tr_te.groupby(c)['user_time_day'].transform('mean')
    df_tr_te[str(c) + '_user_time_hour_sum'] = df_tr_te.groupby(c)['user_time_hour'].transform('sum')
    df_tr_te[str(c) + '_user_time_day_sum'] = df_tr_te.groupby(c)['user_time_day'].transform('sum')

# Average time span by gender x age x product category
df_tr_te['yong_time_gender_age_level_product_category_id_ave'] = \
    df_tr_te.groupby(['gender', 'age_level', 'product_category_id'])['user_time_hour'].transform('mean')

# Brute-force mean time span over all pairs of features
lis_i_re_1 = list(itertools.permutations(lis_i_1, 2))
for c in lis_i_re_1:
    df_tr_te[c[0] + c[1] + '_user_time_hour_mean'] = df_tr_te.groupby(list(c))['user_time_hour'].transform('mean')

# nunique features
for i in ['product', 'campaign_id', 'webpage_id', 'product_category_id']:
    df_tr_te['day_' + str(i) + '_nunique'] = df_tr_te.groupby(['user_id', 'day'])[i].transform('nunique')
    df_tr_te['day_' + str(i) + '_nunique_p%'] = df_tr_te['user_idday_cnt'] / df_tr_te['day_' + str(i) + '_nunique']
df_tr_te['day_web_nunique'] = df_tr_te.groupby(['user_id', 'day', 'hour'])['webpage_id'].transform('nunique')

# =============================================================================
# Modeling
# =============================================================================
# cate_features = ['user_id','product','hour','campaign_id','webpage_id','user_group_id','age_level']
features = [i for i in df_tr_te.columns if i not in ['id', 'isClick', 'date', 'user_id_webpage_id']]
test = df_tr_te[df_tr_te['isClick'] == -1]
train = df_tr_te[df_tr_te['isClick'] != -1]

x_train = train[features]
x_test = test[features]
y_train = train['isClick']

def cv_model(clf, train_x, train_y, test_x, clf_name='lgb'):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions
    test = np.zeros(test_x.shape[0])    # test predictions averaged over folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        # .iloc so positional fold indices are used even if the index is not 0..n-1
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 6,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2021,
            'nthread': 28,
            'n_jobs': -1,
            'verbose': -1,
        }
        model = clf.train(params, train_matrix, 50000,
                          valid_sets=[train_matrix, valid_matrix],
                          # categorical_feature=categorical_feature,
                          verbose_eval=500, early_stopping_rounds=200)
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test)

# Write out the predictions
df_sub['isClick'] = lgb_test
df_sub.to_csv('C:/Users/yyz/Desktop/match/Advertising click through rate/baseline55_5zhe_re.csv', index=False)
```
Problem-solving ideas
- The model is built on the baseline provided by the top players and uses a single LightGBM model (no ensembling)
- Feature construction mainly covers the following aspects:
a. The common count and nunique features; as for which combinations of categorical variables to group by, you simply have to experiment a lot;
b. Time features: since the data involves timestamps, I constructed many time-difference, mean-time, and total-time features, as well as mean and total times over different categorical combinations;
c. Weight features: this competition is about advertising, and the more often an ad has already been shown, the less likely any single impression is to be clicked, so I constructed many weight features (see the sketch after this list);
d. Historical click-through rate features;
e. Other features: weekday recoding, grouping low-frequency samples into one class, etc.
- Parameter tuning: mainly learning_rate, num_leaves, and min_child_weight (a small grid-search example follows below)
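To make the weight features in point c concrete, here is a minimal, self-contained sketch on toy data (the column values are made up). It uses the same `len(x) / arange(1, len(x) + 1)` transform as the main code: the first record in a group gets the largest weight and later records get progressively smaller ones.

```python
import numpy as np
import pandas as pd

# Toy data: three impressions for user 1, two for user 2 (hypothetical values)
df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],
                   'product': ['A', 'A', 'B', 'A', 'C']})

# Group size divided by the 1-based position: the first impression in a group
# gets weight len(x), the last gets 1, mirroring the main code's features
df['user_id_range'] = df.groupby('user_id')['product'] \
    .transform(lambda x: len(x) / np.arange(1, len(x) + 1))

print(df)
#    user_id product  user_id_range
# 0        1       A            3.0
# 1        1       A            1.5
# 2        1       B            1.0
# 3        2       A            2.0
# 4        2       C            1.0
```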
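And a hedged sketch of the parameter tuning mentioned above: a small grid search over the three parameters, scored by 5-fold AUC. The grid values are illustrative (the ranges I actually searched were not recorded), and `x_train`/`y_train` are the matrices built in the main code.

```python
import itertools
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Illustrative search grid, not the values actually used
grid = {'learning_rate': [0.01, 0.05],
        'num_leaves': [2 ** 5, 2 ** 6, 2 ** 7],
        'min_child_weight': [1, 5, 10]}

best_score, best_params = -np.inf, None
for lr, leaves, mcw in itertools.product(*grid.values()):
    model = lgb.LGBMClassifier(learning_rate=lr, num_leaves=leaves,
                               min_child_weight=mcw, n_estimators=500)
    # 5-fold cross-validated AUC for this parameter combination
    score = cross_val_score(model, x_train, y_train, cv=5, scoring='roc_auc').mean()
    if score > best_score:
        best_score, best_params = score, (lr, leaves, mcw)

print(best_params, best_score)
```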
Post-competition optimization
Two months have now passed, and looking back at the code I racked my brains over, there are still many places to optimize:
- Use different models, such as CatBoost and XGBoost, for model fusion (a blending sketch follows below)
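A minimal blending sketch for this point, assuming `xgb_test` and `cat_test` are hypothetical test predictions from XGBoost and CatBoost pipelines analogous to `cv_model` above:

```python
import numpy as np

# Hypothetical prediction vectors: lgb_test comes from cv_model above;
# xgb_test and cat_test would come from analogous XGBoost / CatBoost runs
preds = {'lgb': lgb_test, 'xgb': xgb_test, 'cat': cat_test}

# Simple weighted average; the weights are illustrative and could be tuned
# by maximizing AUC on the out-of-fold predictions
weights = {'lgb': 0.5, 'xgb': 0.3, 'cat': 0.2}
blend_test = sum(weights[k] * preds[k] for k in preds)

df_sub['isClick'] = blend_test
```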
- There are many features and training takes a long time; the features should be filtered (see the importance-based sketch below)
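For feature filtering, one cheap option is LightGBM's own gain importance: train once, rank the features, keep the top k, and retrain. A minimal sketch, assuming `model` is a trained booster like the one inside `cv_model` (which the original code does not return) and where the cut-off of 200 features is arbitrary:

```python
import pandas as pd

# Rank features by the gain importance of a trained booster
imp = pd.DataFrame({
    'feature': model.feature_name(),
    'gain': model.feature_importance(importance_type='gain'),
}).sort_values('gain', ascending=False)

# Keep only the top 200 features (arbitrary cut-off) and retrain on them
features_kept = imp['feature'].head(200).tolist()
```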
- The code is cumbersome and needs to be modularized
That's my summary of this competition. I hope it helps friends who are new to data competitions. See you on the field!
If you want to learn more, check out the fishman's new book, Machine Learning Algorithm Competition Practice, which came out on October 7. I've been through it several times and gained a lot!