Readme
I'm a programming novice. I registered here long ago, but I don't write code for a living; I started teaching myself Python in order to learn AI.
I usually type in code straight from books without really understanding it. Now that I want to study chatbots, I find that my coding needs work, so I'm starting this series to record the process of working through code line by line. It doesn't start from zero, of course: I only write up the parts I didn't understand. It can also serve as notes for future reference.
Finally, I'd like to repeat that I have never studied programming systematically. I'm writing this series to push myself, so any advice is welcome!
Source of code
PyTorch Chatbot Tutorial
https://pytorch.org/tutorials/beginner/chatbot_tutorial.html?highlight=gpu%20training
Contents
Step by step to understand the python Chatbot tutorial code (I)
Step by step to understand the python Chatbot tutorial code (II)
Code: Create formatted data file (to make it easier to follow, the order of the code has been changed slightly, and this chapter is a little long.)
For convenience, the following code creates a well-formatted data file in which each line contains a query sentence and a response sentence, separated by a TAB.
1. loadLines splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
import os

# Splits each line of the file into a dictionary of fields
lines = {}
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]

def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            # Store the record under its own lineID
            lines[lineObj['lineID']] = lineObj
    return lines

# corpus is the path to the corpus directory, defined earlier in the tutorial
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
View the contents of the lines dictionary
list(lines.items())[0]
('L1045', {'lineID': 'L1045', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'They do not!\n'})
encoding='iso-8859-1'
ISO-8859-1 (Latin-1) is a single-byte encoding: each character is stored in one byte, so it can represent at most 256 characters (byte values 0-255). It covers English and other Western European languages and cannot represent Chinese characters.
For more on encodings, see https://www.cnblogs.com/huangchenggener/p/10983866.html
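As a small illustration of the single-byte property, the sketch below decodes every possible byte value with ISO-8859-1 and then tries to encode a Chinese character. The variable names are made up for this example and are not part of the tutorial.

```python
# Every byte value 0-255 maps to exactly one character in ISO-8859-1,
# so decoding arbitrary bytes never raises an error.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
round_trip = decoded.encode("iso-8859-1")

# A Chinese character has no ISO-8859-1 code point, so encoding fails.
try:
    "中".encode("iso-8859-1")
    chinese_ok = True
except UnicodeEncodeError:
    chinese_ok = False

print(len(decoded), round_trip == all_bytes, chinese_ok)
```

This is probably why the tutorial can read the corpus with this encoding without hitting decode errors, whatever bytes the file contains.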
line.split(" +++$+++ ")
split() method: slices the string at the specified separator. print(values) gives:
['L1045 ', ' u0 ', ' m0 ', ' BIANCA ', ' They do not!\n']
['L1044 ', ' u2 ', ' m0 ', ' CAMERON ', ' They do to!\n']
['L985 ', ' u0 ', ' m0 ', ' BIANCA ', ' I hope so.\n']
['L984 ', ' u2 ', ' m0 ', ' CAMERON ', ' She okay?\n']
['L925 ', ' u0 ', ' m0 ', ' BIANCA ', " Let's go.\n"]
['L924 ', ' u2 ', ' m0 ', ' CAMERON ', ' Wow\n']
['L872 ', ' u0 ', ' m0 ', ' BIANCA ', " Okay -- you're gonna need to learn how to lie.\n"]
['L871 ', ' u2 ', ' m0 ', ' CAMERON ', ' No\n']
['L870 ', ' u0 ', ' m0 ', ' BIANCA ', ' I\'m kidding. You know how sometimes you just become this "persona"? And you don\'t know how to quit?\n']
...
Note: the values above carry extra leading and trailing spaces, which suggests they were printed with the separator '+++$+++' (without the surrounding spaces). With " +++$+++ ", as in the loadLines code, the fields come out without those spaces, as in the lines dictionary shown earlier.
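To see how the separator affects the result, here is a minimal sketch. The sample string is an assumption: it is reconstructed from the field values shown in this post rather than copied from movie_lines.txt.

```python
# Assumed layout of one row of movie_lines.txt (reconstructed, not copied).
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"

# Separator with surrounding spaces, as in the tutorial code.
with_spaces = sample.split(" +++$+++ ")

# Separator without the spaces: the padding stays attached to each field.
without_spaces = sample.split("+++$+++")

print(with_spaces)
print(without_spaces)
```

Either way, the trailing '\n' stays in the last field; it is only removed later by strip().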
enumerate
Reference: https://blog.csdn.net/landian0531/article/details/120081598
for i, field in enumerate(fields): enumerate() yields (index, item) pairs, so on each pass i is the position of a field name in fields and field is the name itself.
print(i, field) gives the following, repeated once for every line of the file:
0 lineID
1 characterID
2 movieID
3 character
4 text
0 lineID
1 characterID
2 movieID
3 character
4 text
......
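The sketch below shows what enumerate() produces for the field list and how the index is used to pick the matching value. The values list is a hand-written sample, and the dict(zip(...)) line is only an equivalent alternative for comparison, not what the tutorial uses.

```python
fields = ["lineID", "characterID", "movieID", "character", "text"]
values = ["L1045", "u0", "m0", "BIANCA", "They do not!\n"]  # sample row

# enumerate() pairs each field name with its position in the list.
pairs = list(enumerate(fields))

# Same logic as the tutorial loop: use the index to fetch the value.
lineObj = {}
for i, field in enumerate(fields):
    lineObj[field] = values[i]

# An equivalent way to build the same dictionary.
same = dict(zip(fields, values))

print(pairs)
print(lineObj == same)
```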
Adding key value pairs to a dictionary
Here is an example. As it shows, adding entries to a dictionary is very convenient!
a = {'mathematics': 95}
print(a)
# Add a new key-value pair
a['language'] = 89
print(a)
# Add another key-value pair
a['English'] = 90
print(a)
Output:
{'mathematics': 95}
{'mathematics': 95, 'language': 89}
{'mathematics': 95, 'language': 89, 'English': 90}
Back in the code: the dictionary lineObj = {} starts out empty.
So each lineObj[field] = values[i] adds a new key/value pair to lineObj. After the inner loop has finished for each line, print(lineObj) gives:
{'lineID': 'L1045 ', 'characterID': ' u0 ', 'movieID': ' m0 ', 'character': ' BIANCA ', 'text': ' They do not!\n'}
{'lineID': 'L1044 ', 'characterID': ' u2 ', 'movieID': ' m0 ', 'character': ' CAMERON ', 'text': ' They do to!\n'}
{'lineID': 'L985 ', 'characterID': ' u0 ', 'movieID': ' m0 ', 'character': ' BIANCA ', 'text': ' I hope so.\n'}
......
lines[lineObj['lineID']] = lineObj
As above, this statement uses the lineID of the current lineObj as the key and the current lineObj itself as the value. The resulting dictionary lines is:
{'L1045': {'lineID': 'L1045', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'They do not!\n'},
 'L1044': {'lineID': 'L1044', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': 'They do to!\n'},
 'L985': {'lineID': 'L985', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'I hope so.\n'},
 ......
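Putting the pieces together, the following sketch runs the same logic on two in-memory rows instead of the real file. The sample rows and the io.StringIO stand-in are assumptions made so the example runs without the corpus.

```python
import io

FIELDS = ["lineID", "characterID", "movieID", "character", "text"]

# Two sample rows in the assumed movie_lines.txt layout.
sample_file = io.StringIO(
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n"
)

lines = {}
for line in sample_file:
    values = line.split(" +++$+++ ")
    lineObj = {field: values[i] for i, field in enumerate(FIELDS)}
    lines[lineObj["lineID"]] = lineObj  # key each record by its own ID

# Because the dictionary is keyed by lineID, lookup by ID is direct.
speaker = lines["L1044"]["character"]
print(speaker)
```

Keying by lineID matters later: loadConversations finds each utterance with a plain lines[lineId] lookup.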
2. loadConversations groups the fields of the lines from loadLines into conversations, based on movie_conversations.txt
Since this code is very similar to the function above, only the value of each variable at each stage is listed below, for ease of understanding.
import os
import re

# Groups fields of lines from 'loadLines' into conversations based on *movie_conversations.txt*
MOVIE_CONVERSATIONS_FIELDS = ['character1ID', 'character2ID', 'movieID', 'utteranceIDs']

def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            utterance_id_pattern = re.compile('L[0-9]+')
            lineIds = utterance_id_pattern.findall(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations

conversations = loadConversations(os.path.join(corpus, 'movie_conversations.txt'), lines, MOVIE_CONVERSATIONS_FIELDS)
View the content and format of movie_conversations.txt
file = os.path.join(corpus, 'movie_conversations.txt')
# Use a separate name here so the lines dictionary returned by loadLines is not overwritten
with open(file, 'r', encoding='iso-8859-1') as datafile:
    raw_lines = datafile.readlines()
for line in raw_lines[:10]:
    print(line)
Output:
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']
line.split(" +++$+++ ")
print(values) gives:
['u0', 'u2', 'm0', "['L194', 'L195', 'L196', 'L197']\n"]
['u0', 'u2', 'm0', "['L198', 'L199']\n"]
['u0', 'u2', 'm0', "['L200', 'L201', 'L202', 'L203']\n"]
['u0', 'u2', 'm0', "['L204', 'L205', 'L206']\n"]
['u0', 'u2', 'm0', "['L207', 'L208']\n"]
['u0', 'u2', 'm0', "['L271', 'L272', 'L273', 'L274', 'L275']\n"]
['u0', 'u2', 'm0', "['L276', 'L277']\n"]
['u0', 'u2', 'm0', "['L280', 'L281']\n"]
enumerate
Result of print(convObj):
{'character1ID': 'u0'}
{'character1ID': 'u0', 'character2ID': 'u2'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n"}
{'character1ID': 'u0'}
{'character1ID': 'u0', 'character2ID': 'u2'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L198', 'L199']\n"}
{'character1ID': 'u0'}
{'character1ID': 'u0', 'character2ID': 'u2'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L200', 'L201', 'L202', 'L203']\n"}
{'character1ID': 'u0'}
{'character1ID': 'u0', 'character2ID': 'u2'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0'}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L204', 'L205', 'L206']\n"}
re.compile
The general steps for using the re module are as follows:
- Use the compile function to compile the regular expression string into a Pattern object
- Match or search the text with the methods the Pattern object provides to obtain the result (a Match object for search() and match(); a list of strings for findall())
- Finally, use the attributes and methods of the Match object to extract information and do whatever else is needed
Here compile() is used together with findall(), which returns a list of all matching substrings.
'L[0-9]+' is a regular expression that matches the letter L followed by one or more digits ([0-9]+).
The result of print(lineIds) is:
['L194', 'L195', 'L196', 'L197']
['L198', 'L199']
['L200', 'L201', 'L202', 'L203']
['L204', 'L205', 'L206']
['L207', 'L208']
['L271', 'L272', 'L273', 'L274', 'L275']
['L276', 'L277']
['L280', 'L281']
['L363', 'L364']
['L365', 'L366']
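The sketch below applies the same pattern to one utteranceIDs string. The ast.literal_eval line is not from the tutorial; it is included only as an alternative way to turn the string into a list, for comparison.

```python
import ast
import re

# The utteranceIDs field is a string that merely looks like a list.
raw = "['L194', 'L195', 'L196', 'L197']\n"

pattern = re.compile('L[0-9]+')

# findall() returns every non-overlapping match as a list of strings.
ids = pattern.findall(raw)

# search() returns a Match object for the first match only.
first = pattern.search(raw).group()

# Alternative (not in the tutorial): parse the string as a Python literal.
ids_literal = ast.literal_eval(raw.strip())

print(ids, first, ids == ids_literal)
```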
Adding key value pairs to a dictionary
convObj["lines"] = []: adds a new key 'lines' to the dictionary convObj, with an empty list as its value.
Result of print(convObj):
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n", 'lines': []}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L198', 'L199']\n", 'lines': []}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L200', 'L201', 'L202', 'L203']\n", 'lines': []}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L204', 'L205', 'L206']\n", 'lines': []}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L207', 'L208']\n", 'lines': []}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L271', 'L272', 'L273', 'L274', 'L275']\n", 'lines': []}
......
append
The append() method adds a new object to the end of a list.
for lineId in lineIds:
    convObj["lines"].append(lines[lineId])
Here the loop walks through the lineIds found by the regular expression above (the first lineId is L194). For each one, it looks up that ID in the dictionary lines returned by loadLines and appends the line dictionary it finds to the list stored under the key 'lines' in convObj.
So print(convObj['lines']) after each append gives the following (the entries are the same as in the lines dictionary returned earlier, but the order starts with L194):
[{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}]
[{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}]
[{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part. Please.\n'}]
[{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part. Please.\n'}, {'lineID': 'L197', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Okay... then how 'bout we try out some French cuisine. Saturday? Night?\n"}]
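A reduced sketch of this reassembly step, using a hand-built lines dictionary with only two entries (trimmed to two fields each for brevity), might look like this:

```python
# Hand-built stand-in for the dictionary returned by loadLines.
lines = {
    'L198': {'lineID': 'L198', 'text': "You're asking me out.\n"},
    'L199': {'lineID': 'L199', 'text': 'Forget it.\n'},
}

convObj = {'utteranceIDs': "['L198', 'L199']\n"}
convObj["lines"] = []  # start with an empty list

for lineId in ['L198', 'L199']:
    # Look the ID up in lines and append the record it maps to.
    convObj["lines"].append(lines[lineId])

# append() stores a reference, not a copy: it is the very same dict object.
same_object = convObj["lines"][0] is lines['L198']
print(len(convObj["lines"]), same_object)
```

Since no copy is made, the conversation list does not duplicate the text in memory; it only points at the records that loadLines already built.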
conversations
The final conversations list contains one dictionary per conversation. Its first entries are:
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n", 'lines': [{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part. Please.\n'}, {'lineID': 'L197', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Okay... then how 'bout we try out some French cuisine. Saturday? Night?\n"}]}
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L198', 'L199']\n", 'lines': [{'lineID': 'L198', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': "You're asking me out. That's so cute. What's your name again?\n"}, {'lineID': 'L199', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': 'Forget it.\n'}]}
......
3. Extract sentence pairs from the conversation
# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
The magic of loops
The first statement, for conversation in conversations, walks through conversations one element at a time, so on the first pass conversation is conversations[0]. The result is as follows:
{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n", 'lines': [{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part. Please.\n'}, {'lineID': 'L197', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Okay... then how 'bout we try out some French cuisine. Saturday? Night?\n"}]}
range() & len()
- start: counting starts from start. The default is 0, so range(5) is equivalent to range(0, 5)
- stop: counting ends before stop; stop itself is not included. For example, list(range(0, 5)) is [0, 1, 2, 3, 4], without 5
- step: the step size. The default is 1, so range(0, 5) is equivalent to range(0, 5, 1)
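The three parameters can be checked quickly as follows. In Python 3, range() returns a lazy range object, so the sketch wraps it in list() to show the values.

```python
# start defaults to 0, step defaults to 1.
default_start = list(range(5))
explicit = list(range(0, 5, 1))

# stop is excluded; a step of 2 skips every other number.
every_other = list(range(0, 5, 2))

print(default_start, explicit, every_other)
```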
The results of conversations[0]['lines'] are as follows:
[{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part. Please.\n'}, {'lineID': 'L197', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Okay... then how 'bout we try out some French cuisine. Saturday? Night?\n"}]
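len(conversations[0]['lines']) gives 4, so this expression returns the number of lines in the conversation (not the number of conversations). The loop runs over range(len(...) - 1), i.e. i = 0, 1, 2, so that line i + 1 always exists; the last line has no reply and is only ever used as a target.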
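To make the index arithmetic concrete, the sketch below pairs up adjacent items of a four-element list. The letters stand in for the four lines of the conversation, and the zip() version is only an equivalent alternative, not the tutorial's code.

```python
texts = ["A", "B", "C", "D"]  # stand-ins for the four lines

# len(texts) - 1 == 3, so i takes the values 0, 1, 2.
indices = list(range(len(texts) - 1))

# Each line is paired with the line that follows it.
pairs = [[texts[i], texts[i + 1]] for i in range(len(texts) - 1)]

# Equivalent pairing with zip (not in the tutorial).
pairs_zip = [list(p) for p in zip(texts, texts[1:])]

print(indices, pairs)
```

Four lines therefore yield three query/response pairs, and every line except the first and last appears once as a response and once as a query.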
conversation["lines"][i]["text"].strip()
This expression extracts the text of each line step by step. The process is as follows:
print(conversations[0]["lines"][0])
{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\n'}
conversations[0]["lines"][0]["text"].strip() gives the following (it extracts the first sentence, and strip() removes whitespace, including the trailing \n, from both ends):
'Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.'
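A short check of what strip() does and does not remove, using made-up sample strings:

```python
raw = "  Can we make this quick?\n"

# With no argument, strip() removes whitespace (spaces, tabs, newlines)
# from both ends only.
cleaned = raw.strip()

# Whitespace inside the string is left alone.
inner = "break- up on the quad.\n".strip()

# A line that contains only whitespace becomes an empty string.
blank = " \n".strip()

print(repr(cleaned), repr(inner), repr(blank))
```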
and
The expression is evaluated from left to right. If every value is truthy, the last value is returned; otherwise the first falsy value is returned and evaluation stops there.
An empty string is falsy, so if inputLine and targetLine: is true only when both sentences are non-empty after strip(). If either one is empty, the pair is skipped.
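The behaviour of and with strings, and its effect as a filter, can be sketched like this (the sample pairs are invented for the example):

```python
both = "Hi" and "Hello"       # both truthy: the last value is returned
first_empty = "" and "Hello"  # first value is falsy: returned immediately
second_empty = "Hi" and ""    # second value is falsy: it is returned

# Used in an if statement, only pairs with two non-empty strings pass.
kept = []
for inputLine, targetLine in [("Hi", "Hello"), ("", "Hello"), ("Hi", "")]:
    if inputLine and targetLine:
        kept.append([inputLine, targetLine])

print(repr(both), repr(first_empty), repr(second_empty), kept)
```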
qa_pairs
print(qa_pairs[0]) result:
['Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.', "Well, I thought we'd start with pronunciation, if that's okay with you."]
4. Call these functions and create files
The lines commented out here have already been moved into the code above.
import codecs
import csv
import os

# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
#lines = {}
#conversations = []
#MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
#MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
#print("\nProcessing corpus...")
#lines = loadLines(os.path.join(corpus, "movie_lines.txt"),
#MOVIE_LINES_FIELDS)
#print("\nLoading conversations...")
#conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),lines, MOVIE_CONVERSATIONS_FIELDS)

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines (printLines was defined earlier in the tutorial)
print("\nSample lines from file:")
printLines(datafile)
codecs.decode
For encoding issues, please read the following article in detail.
https://www.jb51.net/article/92006.htm
Encoding also shows up in the output of this code: in Python 3, a bytes value is displayed with the prefix b, i.e. in the form b'xxxx', as in the sample lines printed at the end. Since the file is written in text mode and codecs.decode returns a str here, that prefix must come from printLines reading the file back in binary mode, not from codecs.decode.
Also, I really don't understand why '\t' needs to go through this conversion. I'll come back and fill in this gap once I understand it.
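One possible explanation, offered as a guess rather than the tutorial's stated intent: "unicode_escape" decoding turns the two-character text backslash + t into a real tab, which would matter if the delimiter came from somewhere that keeps the backslash literally (a command-line argument, for instance). When the delimiter is already a real tab, as here, the call changes nothing. A minimal check:

```python
import codecs

real_tab = '\t'   # one character: an actual tab
escaped = '\\t'   # two characters: a backslash followed by the letter t

decoded_real = codecs.decode(real_tab, "unicode_escape")
decoded_escaped = codecs.decode(escaped, "unicode_escape")

print(len(real_tab), len(escaped))
print(decoded_real == '\t', decoded_escaped == '\t')
```

So for this particular code the unescape step appears to be a no-op that simply guards against an escaped delimiter.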
lineterminator
delimiter='\t' separates the fields with tabs instead of commas. lineterminator='\n' sets the string written at the end of each row to a single newline (the default is '\r\n'; '\n\n' would leave a blank line between rows).
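The effect of both parameters can be seen without touching the disk by writing into an in-memory buffer. The io.StringIO buffers and the sample rows below are assumptions made for the demonstration.

```python
import csv
import io

rows = [["Hi, there.", "Hello!"], ["How are you?", "Fine."]]

# Tab-separated fields, one '\n' after each row (as in the tutorial).
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
for row in rows:
    writer.writerow(row)
text = buf.getvalue()

# With the defaults: comma delimiter and '\r\n' line terminator.
default_buf = io.StringIO()
csv.writer(default_buf).writerow(["Hi", "Hello"])
default_text = default_buf.getvalue()

print(repr(text))
print(repr(default_text))
```

With a tab delimiter, commas inside a sentence need no quoting, which is convenient for dialogue text that is full of commas.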
writer.writerow and writer.writerows
writerow writes a single row
writerows writes multiple rows at once
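The two methods produce the same output; writerows() simply takes the whole list of rows at once. A small comparison, again using in-memory buffers and invented sample pairs:

```python
import csv
import io

pairs = [["Hi", "Hello"], ["Bye", "See you"]]

# One row per call, as in the tutorial's loop.
one_by_one = io.StringIO()
w1 = csv.writer(one_by_one, delimiter='\t', lineterminator='\n')
for pair in pairs:
    w1.writerow(pair)

# All rows in a single call.
all_at_once = io.StringIO()
w2 = csv.writer(all_at_once, delimiter='\t', lineterminator='\n')
w2.writerows(pairs)

print(one_by_one.getvalue() == all_at_once.getvalue())
```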
printLines(datafile)
Writing newly formatted file...

Sample lines from file:
b"Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part. Please.\n"
b"Not the hacking and gagging and spitting part. Please.\tOkay... then how 'bout we try out some French cuisine. Saturday? Night?\n"
b"You're asking me out. That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser. My sister. I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser. My sister. I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery. She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
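The b prefix and the visible \t and \n in these lines are what Python shows when a file is read in binary mode. The sketch below assumes that printLines opens the file with mode 'rb', as the original tutorial's helper does; the temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # newline='' keeps '\n' as written on every platform.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write("Why?\tUnsolved mystery.\n")

    # Binary mode returns bytes, which print with a b prefix.
    with open(path, 'rb') as f:
        as_bytes = f.readline()

    # Text mode returns str.
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.readline()

print(as_bytes)       # shown as a bytes literal with escapes
print(repr(as_text))  # the same content as a str
```

Calling as_bytes.decode('utf-8') turns the bytes back into the ordinary string.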