First-place solution for the iFLYTEK 2021 Advertising Click-Through Rate Estimation Challenge

Keywords: Python data visualization

Hello! As a machine learning beginner, I took part in the 2021 iFLYTEK AI Developer Competition a while ago under the team name "write a story".
It was my first time entering big data competitions, so I mainly chose the relatively easy structured-data tracks. Of the five competitions I entered, three finished in the top five (advertising click-through rate prediction, offline store sales prediction, and mobile device user age and gender prediction). Besides some luck (the prize money was modest, so the experts went to other competitions), I also want to thank fishman, Ah Shui and the other leaders who provided baselines and shared a lot of material.

Here I would like to share my approach to the advertising click-through rate prediction competition.

Competition link

iFLYTEK 2021 Advertising Click-Through Rate Estimation Challenge

Background and tasks

For mobile device manufacturers, it is very difficult to obtain the demographic attributes of current phone users. Accurately predicting demographic attributes from users' phone and daily app preferences is the basis for improving the personalized experience and building accurate user profiles.

Note that the event data was collected with the full knowledge and consent of the individual users and has been appropriately anonymized to protect privacy. For confidentiality reasons, details of how the gender and age data were obtained are not provided.

The competition has two tasks, predicting the gender and the age of mobile devices (device_id): one binary classification problem and one regression problem. The scores of the two parts are combined for the final ranking.

Code

The code is annotated throughout. I'm not very good at structuring code into functions, but I'm sure everyone can follow it.

# =============================================================================
# # Import Toolkit
# =============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
from sklearn.model_selection import KFold
import lightgbm as lgb
from sklearn.metrics import f1_score, roc_auc_score
from tqdm import tqdm  # used by _his_click_rate below; missing from the original imports
import warnings
warnings.filterwarnings('ignore')

os.chdir('C:/Users/yyz/Desktop/match/Advertising click through rate/data/Advertising click through rate estimation challenge_data set/')

# =============================================================================
# # Read data and merge
# =============================================================================
df_tr    = pd.read_csv('train.csv')
df_te    = pd.read_csv('test.csv')
df_tr_te = pd.concat([df_tr, df_te], axis=0, ignore_index=True)
# Distinguish training rows from test rows (isClick is NaN in the test set)
df_tr_te['isClick'] = df_tr_te['isClick'].fillna(-1)
# Read the sample submission file
df_sub   = pd.read_csv('sample_submit.csv') 

# Break the date string into hour and day
df_tr_te['hour'] = df_tr_te['date'].apply(lambda x: int(x.split(' ')[-1].split(':')[0]))
df_tr_te['day']  = df_tr_te['date'].apply(lambda x: int(x.split(' ')[0].split('-')[1]))

# Flag user_ids that appear 3 times or fewer (thr = 0) vs. more frequent ones (thr = 1)
# (column names below assume the pandas of 2021; pandas >= 2.0 renames the count column)
te = df_tr_te['user_id'].value_counts().reset_index()
lis_thr = te[te['user_id'] <= 3]['index'].unique().tolist()
df_tr_te['thr'] = np.where(df_tr_te['user_id'].isin(lis_thr), 0, 1)

# =============================================================================
# Feature Engineering
# =============================================================================

# Historical click-through rate: for each day i, map in the mean isClick of the
# previous `window_size` days, grouped by feature f1
def _his_click_rate(df, f1, window_size = 2): 
    fea_name = '{}_his_{}_clickrate'.format(f1, window_size)
    df[fea_name] = 0
    for i in tqdm(range(3, 8)):  # days 3 to 7
        df_t = df.loc[((df['day'] >= i - window_size) & (df['day'] < i))]
        inds = df['day'] == i
        df.loc[inds, fea_name] = df.loc[inds, f1].map(df_t.groupby(f1)['isClick'].mean()) 
    return df

df_tr_te = _his_click_rate(df = df_tr_te, f1 = 'user_id', window_size = 5)
# On top of the baseline, one more cross feature: user_id concatenated with webpage_id
df_tr_te['user_id_webpage_id'] = [str(i)+ str(j) for i,j in zip(df_tr_te['user_id'],df_tr_te['webpage_id'])]
df_tr_te = _his_click_rate(df = df_tr_te, f1 = 'user_id_webpage_id', window_size = 5)

# Window feature: 3-row rolling mean of isClick per (user_id, product, day), shifted to exclude the current row
df_tr_te['user_product_day_5mean'] = df_tr_te.groupby(['user_id','product','day'])['isClick'].transform(lambda x: x.rolling(3).mean().shift(1))
                  
# Fill missing values and encode gender numerically
df_tr_te['gender'] = df_tr_te['gender'].fillna('NAN').map({'Female':1,'Male':0,'NAN':-1})
# Day-of-week recoding ('xingqi' = weekday): Friday, Saturday and Sunday grouped into one category
df_tr_te['xingqi'] = df_tr_te['day'].replace([2,3,4,5,6,7],[2,2,1,0,0,0])

# Univariate count feature
for c in ['user_id','product','hour','campaign_id','webpage_id','user_group_id','age_level',
          'gender','day','product_category_id','user_depth']: 
    df_tr_te[c + '_cnt'] = df_tr_te.groupby(c)['id'].transform('count')
    
# Pairwise count features (itertools.permutations yields both (a, b) and (b, a),
# which produce identical counts; itertools.combinations would avoid the duplicates)
import itertools
lis_i =  ['user_id','product','hour','campaign_id','webpage_id','user_group_id','age_level',
          'gender','day','product_category_id','user_depth']  
lis_i_re = list(itertools.permutations(lis_i, 2))
for c in lis_i_re:
    df_tr_te[c[0] + c[1] + '_cnt'] = df_tr_te.groupby(list(c))['id'].transform('count')
    
# Parse timestamps (judging by the number of records, the data is presumed to be from 2021)
df_tr_te['date'] =  ['2021-' + i for i in df_tr_te['date']]
df_tr_te['date'] = pd.to_datetime(df_tr_te['date'])
# Calculate the time difference by user, day and hour
df_tr_te['user_time_hour'] = df_tr_te.groupby(['user_id','day','hour'])['date'].transform(lambda x: (x.max()-x.min()).total_seconds())
# Calculate the time difference by user and day
df_tr_te['user_time_day'] = df_tr_te.groupby(['user_id','day'])['date'].transform(lambda x: (x.max()-x.min()).total_seconds())
# First-order time difference between a user's consecutive records
df_tr_te['user_time_del'] = df_tr_te.groupby(['user_id'])['date'].transform(lambda x: (x.diff(periods=-1)))
df_tr_te['user_time_del'] = df_tr_te['user_time_del'].apply(lambda x: x.total_seconds())

# Count by user, product and webpage
df_tr_te['user_id_webpage_id_product'] = df_tr_te.groupby(['user_id','product','webpage_id'])['id'].transform('count')
# Weight features: group size divided by each row's rank within the group, by user and day
# (the more often an ad has already appeared, the smaller the weight of the later rows)
df_tr_te['user_id_day_range'] = df_tr_te.groupby(['user_id','day'])['product'].transform(lambda x : len(x) / np.array(range(1,len(x)+1)))
# Product weight by user
df_tr_te['user_id_range'] = df_tr_te.groupby(['user_id'])['product'].transform(lambda x : len(x) / np.array(range(1,len(x)+1)))  
# Web page by user, product weight 
df_tr_te['user_id_product_webpage_range'] = df_tr_te.groupby(['user_id','product'])['webpage_id'].transform(lambda x : len(x) / np.array(range(1,len(x)+1)))   
# Web pages by user, activity weight 
df_tr_te['user_id_campaign_id_webpage_range'] = df_tr_te.groupby(['user_id','campaign_id'])['webpage_id'].transform(lambda x : len(x) / np.array(range(1,len(x)+1)))  

# Mean and total time for different groupings
lis_i_1 =  ['user_id','product','campaign_id','webpage_id','product_category_id',
            'user_group_id','age_level','gender','user_depth','var_1']
for c in lis_i_1:
    df_tr_te[str(c) + '_user_time_hour_mean'] = df_tr_te.groupby(c)['user_time_hour'].transform('mean')
    df_tr_te[str(c) + '_user_time_day_mean'] = df_tr_te.groupby(c)['user_time_day'].transform('mean')  # was user_time_hour in the original, a copy-paste slip
    df_tr_te[str(c) + '_user_time_hour_sum'] = df_tr_te.groupby(c)['user_time_hour'].transform('sum')
    df_tr_te[str(c) + '_user_time_day_sum'] = df_tr_te.groupby(c)['user_time_day'].transform('sum')  # was user_time_hour in the original, a copy-paste slip
    
# Average time by gender, age level and product category
df_tr_te['yong_time_gender_age_level_product_category_id_ave'] = df_tr_te.groupby(['gender','age_level','product_category_id'])['user_time_hour'].transform('mean')
    
# Brute force: mean time for every ordered pair of features
lis_i_1 =  ['user_id','product','campaign_id','webpage_id','product_category_id','user_group_id','age_level','gender','user_depth','var_1']
lis_i_re_1 = list(itertools.permutations(lis_i_1, 2))
for c in lis_i_re_1:
    df_tr_te[c[0] + c[1] + '_user_time_hour_mean'] = df_tr_te.groupby(list(c))['user_time_hour'].transform('mean') 

# nunique features
for i in ['product','campaign_id','webpage_id','product_category_id']:
        df_tr_te['day_'+str(i)+'_nunique'] = df_tr_te.groupby(['user_id','day'])[i].transform('nunique')
        df_tr_te['day_'+str(i)+'_nunique_p%'] = df_tr_te['user_idday_cnt'] / df_tr_te['day_'+str(i)+'_nunique']
    
df_tr_te['day_web_nunique'] = df_tr_te.groupby(['user_id','day','hour'])['webpage_id'].transform('nunique')

# =============================================================================
# Modeling
# =============================================================================
 
# cate_features  = ['user_id','product','hour','campaign_id','webpage_id','user_group_id','age_level']

features = [i for i in df_tr_te.columns if i not in ['id','isClick','date','user_id_webpage_id']]

test  = df_tr_te[df_tr_te['isClick'] == -1]
train = df_tr_te[df_tr_te['isClick'] != -1]

x_train = train[features]
x_test = test[features]
y_train = train['isClick']

def cv_model(clf, train_x, train_y, test_x, clf_name='lgb'):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions
    test = np.zeros(test_x.shape[0])    # test predictions averaged over folds

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2**6,  
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01, 
            'seed': 2021,
            'nthread': 28,
            'n_jobs': -1,
            'verbose': -1,
        }

        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], 
                          #categorical_feature = categorical_feature,
                          # note: lightgbm >= 4.0 moves these two arguments into
                          # callbacks=[lgb.log_evaluation(500), lgb.early_stopping(200)]
                          verbose_eval=500, early_stopping_rounds=200)
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        
        print(cv_scores)
       
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test)

## Prediction results
df_sub['isClick'] = lgb_test
df_sub.to_csv('C:/Users/yyz/Desktop/match/Advertising click through rate/baseline55_5zhe_re.csv', index=False)

Problem solving ideas

  1. The model builds on the baseline provided by the experts and uses a single LightGBM model
  2. Feature construction mainly covers the following aspects:
    a. The usual count and nunique features; which categorical variables to group by is something you have to try out extensively;
    b. Time features: since the data involves time, I constructed many time-difference, average-time and total-time features, including average and total time for different categorical combinations;
    c. Weight features: the business here is advertising, and the number of times an ad has already appeared should be inversely proportional to its probability of being clicked this time, so I constructed many weight features;
    d. Historical click-through rate features;
    e. Other features: day-of-week grouping, binning low-frequency samples into one category, and so on.
  3. Parameter tuning: mainly learning_rate, num_leaves and min_child_weight (a sketch of a simple grid search follows this list)
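As an illustration of point 3, here is a minimal grid-search sketch over those three parameters using lgb.cv. The grids and round counts are assumptions for illustration rather than the values used in the competition, and the call assumes the same pre-4.0 LightGBM API as the training code above.

# =============================================================================
# # Parameter tuning sketch (illustrative grids, not the competition values)
# =============================================================================
import itertools
import lightgbm as lgb

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [2**5, 2**6, 2**7],
    'min_child_weight': [3, 5, 10],
}
base_params = {'boosting_type': 'gbdt', 'objective': 'binary',
               'metric': 'auc', 'verbose': -1, 'seed': 2021}
dtrain = lgb.Dataset(x_train, label=y_train)

best_auc, best_params = 0.0, None
for lr, leaves, mcw in itertools.product(*param_grid.values()):
    params = dict(base_params, learning_rate=lr,
                  num_leaves=leaves, min_child_weight=mcw)
    cv = lgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
                early_stopping_rounds=100, seed=2021)
    auc = max(cv['auc-mean'])  # best mean validation AUC across boosting rounds
    if auc > best_auc:
        best_auc, best_params = auc, params
print(best_auc, best_params)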

Post-competition optimization

Two months have now passed, and looking back at the code I racked my brains to write, there is still plenty of room for optimization:

  • Fuse different models, such as CatBoost and XGBoost (a blending sketch follows this list)
  • There are many features and computation takes a long time; the features should be filtered
  • The code is cumbersome and needs to be modularized
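As an illustration of the fusion idea, here is a minimal weighted-blending sketch. The arrays xgb_train/xgb_test and cat_train/cat_test are hypothetical out-of-fold and test predictions from XGBoost and CatBoost versions of the cv_model loop above; only lgb_train/lgb_test exist in the code as written.

# =============================================================================
# # Model fusion sketch (the xgb_* and cat_* arrays are hypothetical)
# =============================================================================
import numpy as np
from sklearn.metrics import roc_auc_score

# Search blend weights on the out-of-fold predictions, where the labels are known
best_auc, best_w = 0.0, (1.0, 0.0, 0.0)
for w1 in np.arange(0.0, 1.01, 0.05):
    for w2 in np.arange(0.0, 1.01 - w1, 0.05):
        w3 = 1.0 - w1 - w2
        oof = w1 * lgb_train + w2 * xgb_train + w3 * cat_train
        auc = roc_auc_score(y_train, oof)
        if auc > best_auc:
            best_auc, best_w = auc, (w1, w2, w3)

# Apply the chosen weights to the test predictions
w1, w2, w3 = best_w
df_sub['isClick'] = w1 * lgb_test + w2 * xgb_test + w3 * cat_test
print('blended OOF AUC:', best_auc, 'weights:', best_w)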

That's my summary of this competition. I hope it is helpful to friends who are new to data competitions. See you on the field!

If you want to learn more, take a look at fishman's new book, Machine Learning Algorithm Competition Practice. I got it on October 7, have been through it several times, and gained a lot from it!
