[Alibaba Cloud Tianchi algorithm challenge] Zero-basics introduction to NLP - News Text Classification - Day 2 - Data reading and data analysis

Keywords: Algorithm Machine Learning Deep Learning NLP Alibaba Cloud

1, Competition task analysis

[Alibaba Cloud Tianchi algorithm challenge] Zero-basics introduction to NLP - News Text Classification - Day 1 - Understanding the competition task (202xxx's blog - CSDN blog)

2, Data reading

After downloading the data, Anaconda and Python 3.8 are recommended for data reading and model training.

First, install the required packages:

pip version:

Adding domestic mirrors to pip to speed up downloads (202xxx's blog - CSDN blog)

pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

conda version:

Create a new conda virtual environment:

conda create --name py38 python=3.8
conda activate py38

Adding domestic mirrors to conda (202xxx's blog - CSDN blog)

conda install pandas
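
The analysis below also uses matplotlib for plotting, so install it the same way (pip or conda, matching your setup):

pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple

or, inside the conda environment:

conda install matplotlib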

Read the data with pandas: train_dir is the path to the training set, and nrows is the number of rows to read (None reads the whole file).

import pandas as pd
train_dir = '../data/train_set.csv'
nrows=None
train_df = pd.read_csv(train_dir, sep='\t', nrows=nrows)

View the first 5 news records:

train_df.head()
   label                                               text
0      2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2      3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3      2  7159 948 4866 2109 5520 2490 211 3956 5520 549...
4      3  3646 3055 3055 2490 4659 6065 3370 5814 2465 5...
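
As a quick sanity check, the shape and label count can be confirmed before the analysis (a small sketch; the expected values are taken from the statistics reported below):

print(train_df.shape)               # expected: (200000, 2)
print(train_df['label'].nunique())  # expected: 14 categories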

3, Data analysis

Sentence length distribution

Count the length of each news item (number of words):

import matplotlib.pyplot as plt
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

Use the describe() function to get summary statistics of the news lengths:

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64

The training set contains 200,000 news items; the shortest has 2 words, the longest has 57,921 words, and the average length is 907.2 words.

Draw a histogram of sentence length:

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")

It can be seen that most news items are shorter than 6,000 words; the sketch below quantifies this.
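
A one-line check (a small sketch, not part of the original notebook) computes the exact share of news items below that length:

# fraction of news items shorter than 6000 words
print((train_df['text_len'] < 6000).mean())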

News category distribution

Draw a bar chart of the news category counts:

train_df['label'].value_counts().astype("object").plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")

The label-to-category mapping in the dataset is as follows: {'technology': 0, 'stock': 1, 'sports': 2, 'entertainment': 3, 'current affairs': 4, 'society': 5, 'education': 6, 'finance': 7, 'home': 8, 'game': 9, 'real estate': 10, 'fashion': 11, 'lottery': 12, 'constellation': 13}

It can be seen that the technology category has the most news items and the constellation category has the fewest; the sketch below prints the counts with their category names.
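
A short sketch that joins the counts with the mapping above makes the ranking easier to read (the label_names dict is transcribed from that mapping, not read from the dataset):

label_names = {0: 'technology', 1: 'stock', 2: 'sports', 3: 'entertainment',
               4: 'current affairs', 5: 'society', 6: 'education', 7: 'finance',
               8: 'home', 9: 'game', 10: 'real estate', 11: 'fashion',
               12: 'lottery', 13: 'constellation'}
# value_counts() is sorted from most to least frequent
for label_id, count in train_df['label'].value_counts().items():
    print(f"{label_names[label_id]:>15}: {count}")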

Character distribution statistics

Concatenate the text of all training articles and count the frequency of each character:

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)

print(len(word_count))
print(word_count[0])
print(word_count[-1])

It can be seen that the training data uses 6,869 distinct characters; the most frequent character is "3750", used about 7.48 million times, and the least frequent is "3133", used only once:

6869
('3750', 7482224)
('3133', 1)

Now deduplicate the characters within each article and count again; this gives each character's document frequency, i.e. the number of articles it appears in:

from collections import Counter
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)

print(word_count[0])
print(word_count[1])
print(word_count[2])

Character "3750" appears in 197,997 of the 200,000 articles, so it can be guessed to be a punctuation mark; the same reasoning applies to "900" and "648".

('3750', 197997)
('900', 197653)
('648', 191975)
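
Expressed as coverage ratios, a short sketch shows how close these characters come to appearing in every article (word_count here is the document-frequency list computed just above):

n_docs = train_df.shape[0]
for char, doc_count in word_count[:3]:
    # share of the 200,000 articles containing this character at least once
    print(f"character {char}: {doc_count / n_docs:.2%} of articles")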

4, Assignment ideas

1. Assuming that characters 3750, 900 and 648 are sentence punctuation marks, what is the average number of sentences per news item?

A: Count the occurrences of characters 3750, 900 and 648 in each news item; the number of sentences roughly equals the number of punctuation marks, so averaging gives the answer. Across the 200,000 news items, character 3750 appears 7,482,224 times, character 900 appears 3,262,544 times, and character 648 appears 4,924,890 times, so the average number of sentences per news item is about 78.

# Assignment 1
from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))

print(word_count["3750"])
print(word_count["900"])
print(word_count["648"])
print((word_count["3750"] + word_count["900"] + word_count["648"])/train_df.shape[0])
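
The same answer can be computed per article, which also shows the spread around the mean (a sketch assuming these three characters are the only sentence delimiters):

# count assumed sentence delimiters in each article, then summarize
delims = {'3750', '900', '648'}
train_df['sent_count'] = train_df['text'].apply(
    lambda x: sum(1 for w in x.split(' ') if w in delims))
print(train_df['sent_count'].describe())  # mean should be about 78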

2. Which character appears most frequently in each news category?

Answer: group the news text by label, concatenate the text within each group, and take the most frequent character (with its count) for each label:

from collections import Counter

# concatenate all news text within each label
train_df_label = train_df[["label", "text"]].groupby("label").apply(
    lambda x: " ".join(x["text"])).reset_index()
train_df_label.columns = ["label", "text"]
# most frequent character (and its count) per label
train_df_label["text_max"] = train_df_label["text"].apply(
    lambda x: sorted(Counter(x.split(" ")).items(), key=lambda d: d[1], reverse=True)[0])
train_df_label
    label                                               text         text_max
0       0  3659 3659 1903 1866 4326 4744 7239 3479 4261 4...  (3750, 1267331)
1       1  4412 5988 5036 4216 7539 5644 1906 2380 2252 6...  (3750, 1200686)
2       2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...  (3750, 1458331)
3       3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...   (3750, 774668)
4       4  3772 4269 3433 6122 2035 4531 465 6565 498 358...   (3750, 360839)
5       5  2827 2444 7399 3528 2260 6127 1871 119 3615 57...   (3750, 715740)
6       6  5284 1779 2109 6248 7039 5677 1816 5430 3154 1...   (3750, 469540)
7       7  6469 1066 1623 1018 3694 4089 3809 4516 6656 3...   (3750, 428638)
8       8  2087 730 5166 3300 7539 1722 5305 913 4326 669...   (3750, 242367)
9       9  3819 4525 1129 6725 6485 2109 3800 5264 1006 4...   (3750, 178783)
10     10  26 4270 1866 5977 3523 3764 4464 3659 4853 517...   (3750, 180259)
11     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...    (3750, 83834)
12     12  2708 2218 5915 4559 886 1241 4819 314 4261 166...    (3750, 87412)
13     13  1903 2112 3019 3607 7539 3864 4939 4768 3420 2...    (3750, 33796)
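
Since "3750" tops every category, it carries no class information. A follow-up sketch (an extension of the write-up, under the assumption that "3750", "900" and "648" are punctuation) excludes those characters and reports the most frequent remaining character per category:

from collections import Counter

punct = {'3750', '900', '648'}

def top_non_punct(text):
    # most frequent character after dropping the assumed punctuation marks
    counter = Counter(w for w in text.split(' ') if w not in punct)
    return counter.most_common(1)[0]

train_df_label['text_max_np'] = train_df_label['text'].apply(top_non_punct)
print(train_df_label[['label', 'text_max_np']])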

5, Summary

The news data was analyzed along several dimensions (text length, category distribution, character frequency), laying the groundwork for model training.

6, Reference

tianchi_competition / Zero-basics introduction to NLP - News Text Classification · RXR / tianchi_competition · GitHub

Datawhale zero-basics introduction to NLP event - Task2 Data reading and data analysis - Tianchi Lab
