Python 3 natural language processing (5) -- preprocessing

1. Tokenization
When we need to process a document or a long string, the first step is to split it into words and punctuation marks. This process is called tokenization. Next, we'll look at the tokenizers available in NLTK and how to use them.
Create a file named tokenizer.py and add the following code:

from nltk.tokenize import LineTokenizer,SpaceTokenizer,TweetTokenizer
from nltk import word_tokenize

We'll start with LineTokenizer. Add the following three lines of code:

str1='My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. \nFather to a murdered son, husband to a murdered wife. \nAnd I will have my vengeance, in this life or the next.'
lTokenizer=LineTokenizer()
print('Line tokenizer output:',lTokenizer.tokenize(str1))

As the name implies, this tokenizer splits the input string into lines (not sentences). Let's look at its output:

Line tokenizer output: ['My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. ', 'Father to a murdered son, husband to a murdered wife. ', 'And I will have my vengeance, in this life or the next.']

As shown above, it returns a list of three strings, which means the input has been split into three lines at the newline characters. LineTokenizer simply splits the input string into lines.
Now let's look at SpaceTokenizer. As the name implies, it splits text on the space character. Add the following lines:

rawText='By 11 o\'clock on sunday, the doctor shall open the dispensary.'
sTokenizer=SpaceTokenizer()
print('Space Tokenizer output:',sTokenizer.tokenize(rawText))

sTokenizer is an instance of the SpaceTokenizer class. Calling its tokenize() method produces the following output:

Space Tokenizer output: ['By', '11', "o'clock", 'on', 'sunday,', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary.']

As expected, the input rawText is split on the space character ' '.
Next, call the word_tokenize() method, for example:

print('Word Tokenizer output:',word_tokenize(rawText))

The results are as follows:

Word Tokenizer output: ['By', '11', "o'clock", 'on', 'sunday', ',', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary', '.']

As shown above, the difference between SpaceTokenizer and word_tokenize() is obvious: word_tokenize() also splits off punctuation marks such as the comma and the full stop as separate tokens.
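To make the difference concrete, here is a small optional sketch (assuming the punkt tokenizer data has been downloaded with nltk.download('punkt')) that prints the tokens on which the two tokenizers disagree:

from nltk import word_tokenize
from nltk.tokenize import SpaceTokenizer

rawText = "By 11 o'clock on sunday, the doctor shall open the dispensary."

space_tokens = SpaceTokenizer().tokenize(rawText)
word_tokens = word_tokenize(rawText)

# word_tokenize() separates the comma and the full stop into their own tokens,
# so 'sunday,' and 'dispensary.' appear only in the SpaceTokenizer output.
print('Only from word_tokenize :', [t for t in word_tokens if t not in space_tokens])
print('Only from SpaceTokenizer:', [t for t in space_tokens if t not in word_tokens])
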
Finally, we will introduce TweetTokenizer, which can be used when processing special strings such as tweets:

tTokenizer=TweetTokenizer()
print('Tweet Tokenizer output:',tTokenizer.tokenize("This is a coool #dummysmiley: :-) :-P <3"))

Tweets contain special words, special characters, hashtags, smileys, and so on that we want to keep intact. The result of the above code is as follows:

Tweet Tokenizer output: ['This', 'is', 'a', 'coool', '#dummysmiley', ':', ':-)', ':-P', '<3']

As we can see, TweetTokenizer keeps the hashtag and the smileys intact instead of splitting them into pieces. This is a specialized class that is worth reaching for when you work with this kind of text.
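To see why TweetTokenizer is worth using here, a minimal comparison sketch like the following can help; word_tokenize is expected to break the hashtag and emoticons into separate pieces, though the exact split may vary between NLTK versions:

from nltk import word_tokenize
from nltk.tokenize import TweetTokenizer

tweet = "This is a coool #dummysmiley: :-) :-P <3"

# TweetTokenizer keeps '#dummysmiley', ':-)', ':-P' and '<3' as single tokens;
# word_tokenize typically splits them into separate punctuation pieces.
print('Tweet tokenizer:', TweetTokenizer().tokenize(tweet))
print('word_tokenize  :', word_tokenize(tweet))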

2. Stemming
A stem is the base part of a word with its suffixes removed. A stemmer strips suffixes and outputs the stem of each word.
Create a file called stairs.py and add the following import line:

from nltk import PorterStemmer,LancasterStemmer,word_tokenize

Before stemming, we first need to tokenize the input text, which the following code does:

raw='My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next.'
tokens=word_tokenize(raw)

The tokens list now contains all the tokens produced from the input string raw.
First, use PorterStemmer and add the following three lines of code:

porter=PorterStemmer()
pStems=[porter.stem(t) for t in tokens]
print(pStems)

First the stemmer is initialized, then it is applied to every token, and finally the result is printed. The output tells us more:

['My', 'name', 'is', 'maximu', 'decimu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']

As you can see in the output, suffixes such as "s", "es", "e", "ed" and "al" have been removed from the words.
Next, use LancasterStemmer, which is more aggressive (and more error prone) than Porter because it tries to remove a larger set of suffixes:

lancaster=LancasterStemmer()
lStems=[lancaster.stem(t) for t in tokens]
print(lStems)

We run the same experiment again, with LancasterStemmer in place of PorterStemmer. The output is as follows:

['my', 'nam', 'is', 'maxim', 'decim', ',', 'command', 'of', 'the', 'army', 'of', 'the', 'nor', ',', 'gen', 'of', 'the', 'felix', 'leg', 'and', 'loy', 'serv', 'to', 'the', 'tru', 'emp', ',', 'marc', 'aureli', '.', 'fath', 'to', 'a', 'murd', 'son', ',', 'husband', 'to', 'a', 'murd', 'wif', '.', 'and', 'i', 'wil', 'hav', 'my', 'veng', ',', 'in', 'thi', 'lif', 'or', 'the', 'next', '.']

Comparing the two outputs, it is easy to see that Lancaster trims more than Porter, removing endings such as "us", "e", "th", "eral" and "ered". In other words, Lancaster removes as many trailing characters as possible, while Porter removes as few as possible, which makes Lancaster the more thorough but also the more error-prone of the two.
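As an illustration (not part of the original script), the following sketch prints only the tokens on which the two stemmers disagree, which makes the difference in aggressiveness easy to inspect:

from nltk import PorterStemmer, LancasterStemmer, word_tokenize

raw = ('My name is Maximus Decimus, commander of the Armies of the North, '
       'General of the Felix Legions and loyal servant to the true emperor, '
       'Marcus Aurelius.')
tokens = word_tokenize(raw)

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Show only the tokens where the two stemmers produce different stems.
for t in tokens:
    p, l = porter.stem(t), lancaster.stem(t)
    if p != l:
        print(f'{t:12} porter: {p:10} lancaster: {l}')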

3. Lemmatization
A lemma is the dictionary (head-word) form of a word, or simply its base form. We already know what a stem is; unlike stemming, which derives the stem by removing or replacing suffixes, lemmatization obtains the lemma by looking the word up in a dictionary. Because it is a dictionary-mapping process, lemmatization is more complex than stemming.
Create a file called lemmatizer.py and add the following code:

from nltk import word_tokenize,WordNetLemmatizer

Before lemmatization, we again need to tokenize the input text:

raw='My name is Maximus Decimus, commander of the armies of the north, General of the Felix legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next.'
tokens=word_tokenize(raw)

Now let's use the lemmatizer. Add the following three lines of code:

lemmatizer=WordNetLemmatizer()
lemmas=[lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)

Run the program; the output of the above three lines is as follows:

['My', 'name', 'is', 'Maximus', 'Decimus', ',', 'commander', 'of', 'the', 'army', 'of', 'the', 'north', ',', 'General', 'of', 'the', 'Felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']

As shown above, the lemmatizer recognizes proper nouns and leaves them unchanged (it does not strip the "s" suffix from them), while for common nouns such as legions and armies it removes or replaces the suffix. Lemmatization is a dictionary-matching process. Comparing the output of the stemmers with that of the lemmatizer, we find that the stemmers make many mistakes while the lemmatizer makes only a few; it does, however, leave the word murdered untouched, which is an error. Judging from the final results, the lemmatizer recovers the base form of words better than the stemmers do.
It is worth mentioning that the WordNet lemmatizer only removes an affix when it can find the resulting word in its dictionary, which makes lemmatization slower than stemming. It also recognizes capitalized words, treats them as special words, and returns them unchanged. To avoid this, you may want to convert your input string to lowercase before lemmatizing. Even then, lemmatization is not perfect: checking the input and output of this example, we find that it fails to convert murdered to murder. Similarly, it handles the word "women" correctly but not the word "men".
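A small sketch, assuming the WordNet data has been downloaded (nltk.download('wordnet')), shows both workarounds: passing a part-of-speech tag so that murdered is treated as a verb, and lowercasing capitalized words before lemmatizing:

from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# By default lemmatize() treats every word as a noun, which is why 'murdered'
# came back unchanged above. Passing a part-of-speech tag ('v' for verb)
# lets WordNet find the base form.
print(lemmatizer.lemmatize('murdered'))           # 'murdered' (treated as a noun)
print(lemmatizer.lemmatize('murdered', pos='v'))  # 'murder'

# Capitalized words are returned as-is, so lowercase the input first:
print(lemmatizer.lemmatize('Legions'))            # 'Legions'
print(lemmatizer.lemmatize('legions'))            # 'legion'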

4. Stop words
In this section, we will take the Gutenberg corpus as an example. The Gutenberg corpus is part of the NLTK data package; it contains 18 texts selected from the roughly 25,000 e-books in the Project Gutenberg archive. It is a plain-text corpus with no categorization, which makes it well suited to simple word processing that does not depend on any particular topic. One of the goals of this section is to introduce one of the most important preprocessing steps in text analysis: stop word removal. To that end, we will use this corpus to illustrate frequency distributions and the stopwords corpus in Python's NLTK module. In short, stop words are words with little semantic value but high syntactic value. When you use a bag-of-words approach (such as TF-IDF) instead of parsing, you usually need to remove the stop words.
Create a file named Gutenberg.py and add the following three lines of code:

import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())

The first two lines import NLTK and the Gutenberg corpus, and the third line checks whether the corpus has loaded successfully. Run this file in a Python environment, and the output is as follows:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

As shown above, the names of 18 Gutenberg documents are printed on the screen.
Add the following two lines of code to do some simple preprocessing on the corpus word list:

gb_words=gutenberg.words('bible-kjv.txt')
words_filtered=[e for e in gb_words if len(e)>=3]

The first line loads the list of all the words in the corpus sample bible-kjv.txt and stores it in the gb_words variable. The second line iterates over that list and keeps only the words that are at least 3 characters long.

Now, we use nltk.corpus.stopwords to remove the stop words from the previously filtered word list. Add the following lines of code:

stopwords=nltk.corpus.stopwords.words('english')
words=[w for w in words_filtered if w.lower() not in stopwords]

The first line simply loads the English stop words from the stopwords corpus into the stopwords variable. In the second line we filter the previously filtered word list further, removing all the stop words.

Now we apply nltk.FreqDist to the preprocessed words and the unprocessed words respectively, and add the following lines of code:

fdistPlain=nltk.FreqDist(gb_words)
fdist=nltk.FreqDist(words)

If we want to see the frequency distribution characteristics after the above operations, add the following two lines of code:

print('The most common 10 words in the bag:\n',fdistPlain.most_common(10))
print('The most common 10 words in the bag minus the stopwords:\n',fdist.most_common(10))

The most_common(10) function returns the 10 most frequent words in the bag of words built by the frequency distribution. After running the above program, you will get output similar to the following:

The most common 10 words in the bag:
 [(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]
The most common 10 words in the bag minus the stopwords:
 [('shall', 9760), ('unto', 8940), ('LORD', 6651), ('thou', 4890), ('thy', 4450), ('God', 4115), ('said', 3995), ('thee', 3827), ('upon', 2730), ('man', 2721)]

If you look closely at the results, you will find that the 10 most common words in the unprocessed text are not very meaningful. On the other hand, the 10 most common words in the preprocessed text, such as God, LORD and man, remind us that we are dealing with a text about faith or religion. Stop word removal is a preprocessing technique that needs to be mastered before any sophisticated analysis of text data. NLTK's stopwords corpus contains stop word lists for 11 languages. In any text analysis application where you need to work with keywords, removing stop words pays off handsomely, and a word frequency distribution then helps you pick out the important words. From a statistical point of view, if you plot word frequency against word importance in two dimensions, the ideal distribution curve looks like a bell curve.
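As a rough, optional check of how dominant stop words are in raw text, the following sketch (assuming the gutenberg and stopwords data are available locally) estimates the fraction of alphabetic tokens in bible-kjv.txt that are stop words:

from nltk.corpus import gutenberg, stopwords

gb_words = gutenberg.words('bible-kjv.txt')
stops = set(stopwords.words('english'))

# Keep only alphabetic tokens, then count how many of them are stop words.
content = [w for w in gb_words if w.isalpha()]
stop_hits = [w for w in content if w.lower() in stops]

print('Fraction of word tokens that are stop words:',
      round(len(stop_hits) / len(content), 3))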

5. Extracting the common words of two texts
Create a file called lemmatizer.py and define two long strings, a short story and a short news-style essay:

story1='''There was an old man who lived in a little hut in the middle of a forest. His wife was dead, and he had only one son, whom he loved dearly. Near their hut was a group of birch trees, in which some black-game had made their nests, and the youth had often begged his father's permission to shoot the birds, but the old man always strictly forbade him to do anything of the kind.
One day, however, when the father had gone to a little distance to collect some sticks for the fire, the boy fetched his bow, and shot at a bird that was just flying towards its nest. But he had not taken proper aim, and the bird was only wounded, and fluttered along the ground. The boy ran to catch it, but though he ran very fast, and the bird seemed to flutter along very slowly, he never could quite come up with it; it was always just a little in advance. But so absorbed was he in the chase that he did not notice for some time that he was now deep in the forest, in a place where he had never been before. Then he felt it would be foolish to go any further, and he turned to find his way home.
He thought it would be easy enough to follow the path along which he had come, but somehow it was always branching off in unexpected directions. He looked about for a house where he might stop and ask his way, but there was not a sign of one anywhere, and he was afraid to stand still, for it was cold, and there were many stories of wolves being seen in that part of the forest. Night fell, and he was beginning to start at every sound, when suddenly a magician came running towards him, with a pack of wolves snapping at his heels. Then all the boy's courage returned to him. He took his bow, and aiming an arrow at the largest wolf, shot him through the heart, and a few more arrows soon put the rest to flight. The magician was full of gratitude to his deliverer, and promised him a reward for his help if the youth would go back with him to his house.'''

story2='''The newly-coined word "online education" may by no means sound strange to most people. During the past several years, hundreds of online education colleges have sprung up around China.
Why could online education be so popular in such a short period of time? For one thing, If we want to catch up with the development and the great pace of modern society, we all should possess an urgent and strong desire to study, while most people nowadays are under so enormous pressures that they can hardly have time and energy to study full time at school. Furthermore, online education enables them to save a great deal of time on the way spent on the way between home and school. Last but not least, the quick development of internet,which makes possible all our dreams of attending class on the net,should also be another critical reason.
Personally, I appreciate this new form of education. It's indeed a helpful complement to the traditional educational means. It can provide different learners with more flexible and various ways to learn. Most of all, through online education, we can stick to our jobs and at the same time study and absorb the latest knowledge.'''

First, remove some special characters from the texts: all line breaks "\n", commas, full stops, double quotes, exclamation marks, and question marks. Finally, use the casefold() function to convert both strings to lowercase:

story1=story1.replace(',','').replace('\n','').replace('.','').replace('"','').replace('!','').replace('?','').casefold()
story2=story2.replace(',','').replace('\n','').replace('.','').replace('"','').replace('!','').replace('?','').casefold()

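The chain of replace() calls works, but it gets long. As an optional alternative sketch, the same clean-up can be expressed with a single regular-expression substitution; applying it to the already cleaned strings changes nothing, so it is shown here only for reference:

import re

# The same clean-up as one regular-expression substitution: delete the listed
# punctuation characters and newlines, then lowercase the result.
def clean(text):
    return re.sub(r'[,.\n"!?]', '', text).casefold()

# clean(story1) and clean(story2) give the same strings as the chained
# replace() calls above (and are harmless to re-apply).
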
Next, the text is segmented:

story1_words=story1.split(' ')
print("Story1 words:",story1_words)
story2_words=story2.split(' ')
print("Story2 words:",story2_words)

We call split(' ') on story1 and story2 to split them into words on the space character ' ' and obtain their word lists. Now let's look at the output of this step:

Story1 words: ['there', 'was', 'an', 'old', 'man', 'who', 'lived', 'in', 'a', 'little', 'hut', 'in', 'the', 'middle', 'of', 'a', 'forest', 'his', 'wife', 'was', 'dead', 'and', 'he', 'had', 'only', 'one', 'son', 'whom', 'he', 'loved', 'dearly', 'near', 'their', 'hut', 'was', 'a', 'group', 'of', 'birch', 'trees', 'in', 'which', 'some', 'black-game', 'had', 'made', 'their', 'nests', 'and', 'the', 'youth', 'had', 'often', 'begged', 'his', "father's", 'permission', 'to', 'shoot', 'the', 'birds', 'but', 'the', 'old', 'man', 'always', 'strictly', 'forbade', 'him', 'to', 'do', 'anything', 'of', 'the', 'kindone', 'day', 'however', 'when', 'the', 'father', 'had', 'gone', 'to', 'a', 'little', 'distance', 'to', 'collect', 'some', 'sticks', 'for', 'the', 'fire', 'the', 'boy', 'fetched', 'his', 'bow', 'and', 'shot', 'at', 'a', 'bird', 'that', 'was', 'just', 'flying', 'towards', 'its', 'nest', 'but', 'he', 'had', 'not', 'taken', 'proper', 'aim', 'and', 'the', 'bird', 'was', 'only', 'wounded', 'and', 'fluttered', 'along', 'the', 'ground', 'the', 'boy', 'ran', 'to', 'catch', 'it', 'but', 'though', 'he', 'ran', 'very', 'fast', 'and', 'the', 'bird', 'seemed', 'to', 'flutter', 'along', 'very', 'slowly', 'he', 'never', 'could', 'quite', 'come', 'up', 'with', 'it;', 'it', 'was', 'always', 'just', 'a', 'little', 'in', 'advance', 'but', 'so', 'absorbed', 'was', 'he', 'in', 'the', 'chase', 'that', 'he', 'did', 'not', 'notice', 'for', 'some', 'time', 'that', 'he', 'was', 'now', 'deep', 'in', 'the', 'forest', 'in', 'a', 'place', 'where', 'he', 'had', 'never', 'been', 'before', 'then', 'he', 'felt', 'it', 'would', 'be', 'foolish', 'to', 'go', 'any', 'further', 'and', 'he', 'turned', 'to', 'find', 'his', 'way', 'homehe', 'thought', 'it', 'would', 'be', 'easy', 'enough', 'to', 'follow', 'the', 'path', 'along', 'which', 'he', 'had', 'come', 'but', 'somehow', 'it', 'was', 'always', 'branching', 'off', 'in', 'unexpected', 'directions', 'he', 'looked', 'about', 'for', 'a', 'house', 'where', 'he', 'might', 'stop', 'and', 'ask', 'his', 'way', 'but', 'there', 'was', 'not', 'a', 'sign', 'of', 'one', 'anywhere', 'and', 'he', 'was', 'afraid', 'to', 'stand', 'still', 'for', 'it', 'was', 'cold', 'and', 'there', 'were', 'many', 'stories', 'of', 'wolves', 'being', 'seen', 'in', 'that', 'part', 'of', 'the', 'forest', 'night', 'fell', 'and', 'he', 'was', 'beginning', 'to', 'start', 'at', 'every', 'sound', 'when', 'suddenly', 'a', 'magician', 'came', 'running', 'towards', 'him', 'with', 'a', 'pack', 'of', 'wolves', 'snapping', 'at', 'his', 'heels', 'then', 'all', 'the', "boy's", 'courage', 'returned', 'to', 'him', 'he', 'took', 'his', 'bow', 'and', 'aiming', 'an', 'arrow', 'at', 'the', 'largest', 'wolf', 'shot', 'him', 'through', 'the', 'heart', 'and', 'a', 'few', 'more', 'arrows', 'soon', 'put', 'the', 'rest', 'to', 'flight', 'the', 'magician', 'was', 'full', 'of', 'gratitude', 'to', 'his', 'deliverer', 'and', 'promised', 'him', 'a', 'reward', 'for', 'his', 'help', 'if', 'the', 'youth', 'would', 'go', 'back', 'with', 'him', 'to', 'his', 'house']
Story2 words: ['the', 'newly-coined', 'word', 'online', 'education', 'may', 'by', 'no', 'means', 'sound', 'strange', 'to', 'most', 'people', 'during', 'the', 'past', 'several', 'years', 'hundreds', 'of', 'online', 'education', 'colleges', 'have', 'sprung', 'up', 'around', 'chinawhy', 'could', 'online', 'education', 'be', 'so', 'popular', 'in', 'such', 'a', 'short', 'period', 'of', 'time', 'for', 'one', 'thing', 'if', 'we', 'want', 'to', 'catch', 'up', 'with', 'the', 'development', 'and', 'the', 'great', 'pace', 'of', 'modern', 'society', 'we', 'all', 'should', 'possess', 'an', 'urgent', 'and', 'strong', 'desire', 'to', 'study', 'while', 'most', 'people', 'nowadays', 'are', 'under', 'so', 'enormous', 'pressures', 'that', 'they', 'can', 'hardly', 'have', 'time', 'and', 'energy', 'to', 'study', 'full', 'time', 'at', 'school', 'furthermore', 'online', 'education', 'enables', 'them', 'to', 'save', 'a', 'great', 'deal', 'of', 'time', 'on', 'the', 'way', 'spent', 'on', 'the', 'way', 'between', 'home', 'and', 'school', 'last', 'but', 'not', 'least', 'the', 'quick', 'development', 'of', 'internet,which', 'makes', 'possible', 'all', 'our', 'dreams', 'of', 'attending', 'class', 'on', 'the', 'net,should', 'also', 'be', 'another', 'critical', 'reasonpersonally', 'i', 'appreciate', 'this', 'new', 'form', 'of', 'education', "it's", 'indeed', 'a', 'helpful', 'complement', 'to', 'the', 'traditional', 'educational', 'means', 'it', 'can', 'provide', 'different', 'learners', 'with', 'more', 'flexible', 'and', 'various', 'ways', 'to', 'learn', 'most', 'of', 'all', 'through', 'online', 'education', 'we', 'can', 'stick', 'to', 'our', 'jobs', 'and', 'at', 'the', 'same', 'time', 'study', 'and', 'absorb', 'the', 'latest', 'knowledge']

As you can see, all the special characters are removed and a list of words is created.

Now, we create a vocabulary from each word list. A vocabulary is the set of distinct (non-repeating) words, so we use Python's built-in set() function to convert each list into a set:

story1_vocab=set(story1_words)
print('Story1 vocabulary:',story1_vocab)
story2_vocab=set(story2_words)
print('Story2 vocabulary:',story2_vocab)

The results are as follows:

Story1 vocabulary: {'still', 'being', "boy's", 'strictly', 'ground', 'largest', 'further', 'forbade', 'forest', 'always', 'of', 'put', 'find', 'slowly', 'were', 'now', 'gone', 'branching', 'sticks', 'magician', 'permission', 'afraid', 'only', 'proper', 'come', 'before', 'heels', 'help', 'more', 'ask', 'back', 'trees', 'some', 'which', 'there', 'about', 'seen', 'anywhere', 'off', 'wolf', 'path', 'birch', 'group', 'deep', 'with', 'birds', 'night', 'the', 'shot', 'snapping', 'time', 'go', 'chase', 'loved', 'when', 'catch', 'fire', 'at', 'begged', 'stop', 'old', 'fast', 'fell', 'been', 'arrow', 'distance', 'dead', 'came', 'then', 'one', 'day', 'where', 'for', 'aim', 'fetched', 'quite', 'easy', 'their', 'often', 'just', 'towards', 'but', 'seemed', 'had', 'sign', 'many', 'beginning', 'gratitude', 'along', 'pack', 'flying', 'promised', 'house', 'flight', 'dearly', 'very', 'fluttered', 'might', 'start', 'through', 'suddenly', 'his', 'bow', 'follow', 'do', 'notice', 'never', 'could', 'be', 'courage', 'son', 'and', 'stories', 'would', 'deliverer', 'that', 'soon', 'foolish', 'however', 'returned', 'took', 'advance', 'all', 'near', 'ran', 'absorbed', 'felt', 'he', 'father', 'way', 'every', 'rest', 'anything', 'homehe', 'heart', 'who', 'was', 'part', 'collect', 'so', 'him', 'whom', 'not', 'it', 'running', 'lived', 'unexpected', 'somehow', 'arrows', 'few', 'full', 'stand', 'any', 'aiming', "father's", 'cold', 'a', 'to', 'wolves', 'bird', 'little', 'sound', 'place', 'it;', 'wounded', 'hut', 'man', 'in', 'made', 'nests', 'though', 'looked', 'if', 'up', 'flutter', 'did', 'turned', 'wife', 'directions', 'thought', 'reward', 'black-game', 'taken', 'middle', 'enough', 'kindone', 'nest', 'an', 'shoot', 'its', 'youth', 'boy'}
Story2 vocabulary: {'newly-coined', 'of', 'no', 'great', 'latest', 'dreams', 'attending', 'during', 'helpful', 'nowadays', 'net,should', 'study', 'save', 'more', 'are', 'period', 'also', 'new', 'they', 'spent', 'least', "it's", 'deal', 'desire', 'various', 'may', 'most', 'last', 'thing', 'chinawhy', 'with', 'can', 'the', 'flexible', 'time', 'strange', 'catch', 'i', 'provide', 'at', 'reasonpersonally', 'while', 'home', 'appreciate', 'online', 'hundreds', 'colleges', 'critical', 'strong', 'one', 'urgent', 'possible', 'for', 'another', 'sprung', 'pace', 'our', 'same', 'popular', 'but', 'internet,which', 'stick', 'means', 'educational', 'pressures', 'through', 'modern', 'around', 'could', 'be', 'indeed', 'makes', 'and', 'energy', 'by', 'school', 'education', 'that', 'possess', 'have', 'should', 'all', 'different', 'furthermore', 'way', 'so', 'not', 'jobs', 'enables', 'knowledge', 'it', 'complement', 'this', 'short', 'years', 'full', 'people', 'quick', 'we', 'hardly', 'past', 'on', 'several', 'traditional', 'a', 'to', 'under', 'class', 'sound', 'ways', 'learners', 'between', 'want', 'in', 'if', 'word', 'them', 'up', 'absorb', 'learn', 'such', 'development', 'enormous', 'form', 'an', 'society'}

The above are the sets of distinct words from the two texts.

Now, the last step is to find the words that the two texts have in common. Python provides the set intersection operator &, which we use to find the common words in the two vocabularies:

common_vocab=story1_vocab&story2_vocab
print('Common Vocabulary:',common_vocab)

The final output is as follows:

Common Vocabulary: {'with', 'full', 'the', 'through', 'of', 'time', 'could', 'be', 'catch', 'a', 'at', 'to', 'and', 'sound', 'that', 'one', 'more', 'in', 'for', 'if', 'all', 'way', 'up', 'but', 'so', 'not', 'it', 'an'}
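
Beyond the intersection, Python's other set operators can be used in the same way, for example to find the words that appear in only one of the two texts. A small sketch:

only_in_story1 = story1_vocab - story2_vocab   # words that appear only in story1
only_in_story2 = story2_vocab - story1_vocab   # words that appear only in story2
all_vocab = story1_vocab | story2_vocab        # combined vocabulary of both texts

print('Unique to story1:', len(only_in_story1))
print('Unique to story2:', len(only_in_story2))
print('Combined vocabulary size:', len(all_vocab))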