Skills and Experience with Kaggle Feature Construction

Keywords: Python, network, encoding

For a long time we have been accused of lacking creativity: much of our work has been seen as imitation and reference. That is hard to deny, but through steady accumulation we have moved from referencing to creating. Creativity is a basic element of good work in every field, and machine learning is no exception.
Creating features also takes creativity, so this article lists some ideas from my daily practice, hoping to inspire others to build on them and achieve good results in the Kaggle rankings.
The inspiration for this article comes from Beluga's article on Kaggle, and part of the content is taken directly from it, so readers may also want to read the original. The shared text follows:

1. Don't try to predict the future when you don't need it:

If the training and test sets come from the same time period, features can be used very cleverly. Although this only happens in Kaggle competitions, the advantage is worth exploiting. For example, in the Taxi Trip Duration challenge, the test data are sampled randomly from the same period as the training data. In this case, the mean of the target variable for the different levels of a categorical variable can be used as a feature. Beluga actually used the mean target value for each day of the week, and then mapped those same means onto a variable in the test data.
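A minimal sketch of this idea (the column names pickup_weekday and trip_duration are illustrative, not necessarily Beluga's exact code):

weekday_avg = train.groupby('pickup_weekday')['trip_duration'].mean()  # mean target per weekday

# map the same per-weekday averages onto both train and test as a new feature
train.loc[:, 'weekday_avg_duration'] = train['pickup_weekday'].map(weekday_avg)
test.loc[:, 'weekday_avg_duration'] = test['pickup_weekday'].map(weekday_avg)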

2. The log-loss clipping technique:

This part is based on a very simple idea from Jeremy Howard's neural network course: log loss penalizes predictions that are very confident and wrong. So when you have to predict class probabilities, it helps to clip them to the range 0.05 to 0.95, so that the model is never entirely sure of itself.
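For example, assuming preds is an array of predicted probabilities, a single NumPy call does the clipping:

import numpy as np

preds = np.clip(preds, 0.05, 0.95)  # never fully confident in either direction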

3. Submit to Kaggle in gzip format:

The following small piece of code can save a lot of upload time:

df.to_csv('submission.csv.gz', index=False, compression='gzip')

4. How to make the best use of latitude and longitude features - Part 1:

One of my favorite parts of Beluga's article is how he uses the lat/lon data to create the following features:

A. Haversine distance between two lat/lon points:

import numpy as np

def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

B. Manhattan distance between two lat/lon points:

def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    # approximate "Manhattan" distance: haversine distance along the longitude
    # direction plus haversine distance along the latitude direction
    a = haversine_array(lat1, lng1, lat1, lng2)
    b = haversine_array(lat1, lng1, lat2, lng1)
    return a + b

C. The bearing between two lat/lon points:

def bearing_array(lat1, lng1, lat2, lng2):
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))

D. The center latitude and longitude between the pickup and dropoff points:

train.loc[:, 'center_latitude'] = (train['pickup_latitude'].values + train['dropoff_latitude'].values) / 2
train.loc[:, 'center_longitude'] = (train['pickup_longitude'].values + train['dropoff_longitude'].values) / 2

5. How to make the best use of latitude and longitude features - Part 2:

The second way Beluga uses the latitude and longitude data is to create clusters of the pickup and dropoff locations. In effect, this designs artificial neighborhoods into the data.

from sklearn.cluster import MiniBatchKMeans
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values,
                    test[['pickup_latitude', 'pickup_longitude']].values,
                    test[['dropoff_latitude', 'dropoff_longitude']].values))

sample_ind = np.random.permutation(len(coords))[:500000]
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(coords[sample_ind])

train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_latitude', 'pickup_longitude']])
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_latitude', 'dropoff_longitude']])
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']])
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']])

He then used these clusters to create features such as the number of pickups and dropoffs in each cluster on a given day.
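A rough sketch of such a counting feature, assuming a pickup_date column exists (the names are illustrative, not Beluga's exact code):

# daily count of dropoffs per cluster
dropoff_counts = (train
                  .groupby(['pickup_date', 'dropoff_cluster'])
                  .size()
                  .rename('dropoffs_in_cluster_that_day')
                  .reset_index())

# join the per-day, per-cluster counts back onto the trips
train = train.merge(dropoff_counts, on=['pickup_date', 'dropoff_cluster'], how='left')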

6. How to make the best use of latitude and longitude features - Part 3:

Beluga's article also uses PCA to transform the longitude and latitude coordinates. Here it is not used for dimensionality reduction but as a coordinate rotation: a 2D -> 2D transform. Concretely, it does the following:

from sklearn.decomposition import PCA

pca = PCA().fit(coords)
train['pickup_pca0'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 0]
train['pickup_pca1'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 1]
train['dropoff_pca0'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
train['dropoff_pca1'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 1]
test['pickup_pca0'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 0]
test['pickup_pca1'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 1]
test['dropoff_pca0'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
test['dropoff_pca1'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 1]

7. Don't forget the usual things you can do with features:

  • Min-max scaling;
  • Standardization using the standard deviation;
  • Log transform of features or the target;
  • One-hot encoding.
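
A minimal sketch of these usual transforms (df, num_cols and cat_col are placeholder names):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaled = MinMaxScaler().fit_transform(df[num_cols])          # min-max scaling
standardized = StandardScaler().fit_transform(df[num_cols])  # standardization
logged = np.log1p(df['some_feature'])                        # log of a feature or target
encoded = pd.get_dummies(df['cat_col'])                      # one-hot encoding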

8. Create intuitive additional features:

  • A) Date-time features: time-based features such as "evening", "noon", "night", "purchases in the last month", "purchases in the last week", etc.
  • B) Thought features: features born from thinking about the problem domain. Suppose you have shopping-cart data and want to classify trip types (see the Walmart Recruiting: Trip Type Classification competition on Kaggle).

For instance, you could create a feature like "fashion" by adding up the items that belong to the men's fashion, women's fashion and teen fashion categories.
You could also create a feature like "rare" by flagging items that are rare in the data and then counting the number of rare items in each cart. These features may or may not work; in my observation they usually add a lot of value.
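A rough sketch of both features, assuming a cart DataFrame with trip_id, item_id and category columns (all names are made up for illustration):

fashion_categories = {'MENS_FASHION', 'WOMENS_FASHION', 'TEEN_FASHION'}
item_counts = cart['item_id'].value_counts()
rare_items = set(item_counts[item_counts < 10].index)  # items seen fewer than 10 times

cart['is_fashion'] = cart['category'].isin(fashion_categories).astype(int)
cart['is_rare'] = cart['item_id'].isin(rare_items).astype(int)

# number of fashion items and rare items in each trip's cart
trip_features = cart.groupby('trip_id')[['is_fashion', 'is_rare']].sum()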

9. Do some not-so-normal things:

These features are not very intuitive and should not be created when the machine learning model needs to be interpretable.

  • A) Interaction features: if you have features A and B, create A * B, A + B, A / B and A - B. This makes the feature space explode: with 10 features, creating all two-variable interaction features adds 180 features to the model, and most of the time you have far more than 10 features (a sketch follows this list).
  • B) Hashed bucket features: suppose you have thousands of features, but because of the algorithm's training time you don't want to use all of them. A hashing algorithm is generally used to achieve this, most often for text classification tasks.
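For the interaction features in A), a minimal sketch (df and numeric_cols are placeholder names); with 10 columns, the 45 pairs times 4 operations give the 180 extra features mentioned above:

from itertools import combinations

for a, b in combinations(numeric_cols, 2):
    df[f'{a}_x_{b}'] = df[a] * df[b]
    df[f'{a}_plus_{b}'] = df[a] + df[b]
    df[f'{a}_div_{b}'] = df[a] / df[b]
    df[f'{a}_minus_{b}'] = df[a] - df[b]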

For example, for the hashed bucket features in B), suppose there are six features A, B, C, D, E, F, and the data row is:

A: 1, B: 1, C: 1, D: 0, E: 1, F: 0

You might decide to use a hash function that maps the six features into three buckets and build a hashed feature vector from them.
After processing, the data might look like this:

Bucket 1: 2, Bucket 2: 2, Bucket 3: 0

This happens because A and B fall into bucket 1, C and E fall into bucket 2, and D and F fall into bucket 3; each bucket value is just the sum of the features that landed in it, and you could replace the addition with any aggregation function you like.
Bucket 1, Bucket 2 and Bucket 3 then become the variables fed to the machine learning model.
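In practice you rarely write the hash function yourself; scikit-learn's FeatureHasher does this. A sketch only, with 3 buckets to mirror the toy example above (which feature lands in which bucket depends on the hash, and real settings use far more buckets):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=3, input_type='dict', alternate_sign=False)
rows = [{'A': 1, 'B': 1, 'C': 1, 'D': 0, 'E': 1, 'F': 0}]
hashed = hasher.transform(rows).toarray()  # one row of 3 bucket values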
That is all for this article; it will be updated continuously in the future. If readers have better methods, please leave a message below.

Author information

Rahul Agarwal, Statistical Analysis
This article was translated by the Ali Yunqi Community Organization.
The original title is Good Feature Building Techniques and Tricks for Kaggle. Translator: Begonia; Reviewer: Uncle_LLD.
This is a brief translation; for more detailed content, please check the original text.

Posted by rks on Wed, 08 May 2019 20:51:38 -0700