Tasks:
The part of speech tagging training and testing were carried out using the 1998 People's Daily corpus.
Job input:
In 1998, the People's Daily Corpus (1998-01-105-tape.txt) used 80% data as training set and 20% data as verification set.
Operating environment:
Jupyter Notebook, Python3
Method of operation:
A simple statistical method is used to predict the part of speech of a word. N-gram language rules are not yet used.
Operational steps:
1. Processing corpus: Delete the pre-paragraph label.
# Read the original corpus file in_path = '1998-01-105-Tape.txt' file = open(in_path, encoding='gbk') in_data = file.readlines()
# Pre-processed corpus curpus_path = 'curpus.txt' curpusfile = open(curpus_path, 'w', encoding='utf-8')
#Delete pre-paragraph labels, [], {} for sentence in in_data: words = sentence.strip().split(' ') words.pop(0) for word in words: if word.strip() != '': if word.startswith('['): word = word[1:] elif ']' in word: word = word[0:word.index(']')] w_c = word.split('/') # Generating Corpus if(len(w_c) > 1): curpusfile.write(w_c[0] + ' ' + w_c[1] + '\n')
2. Random partition of training set 80% and verification set 20%.
from sklearn.model_selection import train_test_split # Random partition curpus = open(curpus_path, encoding='utf-8').readlines() train_data, test_data = train_test_split( curpus, test_size=0.2, random_state=10)
# View the partitioned data size print(len(curpus)) print(len(train_data) / len(curpus)) print(len(test_data) / len(curpus))
1114419 0.7999998205342874 0.20000017946571264
3. Statistical training set word frequency.
# Generating Word Frequency Recording Files from tqdm import tqdm_notebook doc = [] for sentence in tqdm_notebook(train_data): words = sentence.strip().split(' ') if len(words) > 1: temp = [] temp.append(words[0]) temp.append(words[1]) flag = False for line in doc: if line[0] == temp[0] and line[1] == temp[1]: line[2] += 1 flag = True break if not flag: temp.append(1) doc.append(temp)
4. Choose the most probable part of speech.
# Save the verification set test_path = 'test.txt' testfile = open(test_path, 'w', encoding='utf-8') for sentence in test_data: words = sentence.strip().split(' ') if len(words) > 1: testfile.write(sentence)
# Save label results result_path = 'result.txt' resultfile = open(result_path, 'w', encoding='utf-8')
# Select the most probable part of speech for tagging for sentence in tqdm_notebook(test_data): words = sentence.strip().split(' ') if len(words) > 1: words[1] = 'n' max = 0 for line in doc: if line[0] == words[0] and line[2] > max: max = line[2] words[1] = line[1] resultfile.write(words[0] + ' ' + word[1] + '\n')
Performance evaluation: accuracy
def get_word(path): f = open(path, 'r', encoding='utf-8') lines = f.readlines() return lines result_lines = get_word(result_path) test_lines = get_word(test_path) list_num = len(test_lines) right_num = 0 for i in range(0, list_num): if result_lines[i][1] == test_lines[i][1]: right_num += 1 print("Accuracy:", right_num / list_num)
Accuracy: 0.23189316857201872