Preface
Recently my Python crawler kept running into errors from json.loads(). Looking at the raw text it had retrieved, the cause was double quotes (") inside string values, which are only valid in JSON when written in escaped form, that is, as \".
When I first hit this problem I searched the Internet extensively but could not find an answer that fit my case, so I considered writing my own JSON parser. That looked like far too much trouble, so I took the lazy route and wrote a replacement algorithm instead: locate the offending " characters in the JSON and replace them with \", so that the subsequent JSON parsing no longer fails.
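To make the failure concrete, here is a minimal reproduction. The payload is a made-up example, not the data my crawler actually received; it only shows that an unescaped inner quote makes json.loads() raise, while the escaped form parses.

```python
import json

# Hypothetical payload: the quotes around hi are not escaped.
broken = '{"title": "He said "hi" to me"}'

try:
    json.loads(broken)
except json.JSONDecodeError as err:
    # err.pos is the character offset at which the decoder gave up
    print("parse failed at offset", err.pos)

# The same payload with the inner quotes escaped as \" parses fine.
escaped = '{"title": "He said \\"hi\\" to me"}'
print(json.loads(escaped)["title"])  # He said "hi" to me
```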
Code
Wrap json.loads in a helper: call it inside try, and put the corresponding repair logic in the except branches:
import json

# Placeholder I defined myself to stand in for \' (the idea is similar to URL encoding); I use %10AA
REPLACE_POINT = "%10AA"


def json_loads(string):
    """
    Try to convert string to a JSON object, applying a series of repairs
    whenever parsing fails. If every attempt fails, the bad JSON is printed
    (the original intent was to record it in json.log) and None is returned.
    :param string: JSON text to parse
    :return: the parsed object, or None
    """
    try:
        temp_json = json.loads(string)
    except json.JSONDecodeError:
        temp_json = None
    if temp_json is None:
        try:
            # \' is not a valid JSON escape, so swap it for the placeholder
            string = str(string).replace("\\'", REPLACE_POINT)
            temp_json = json.loads(string)
        except json.JSONDecodeError:
            # Literal tab characters often appear in the JSON; strip them
            string = string.replace("\t", "")
            try:
                temp_json = json.loads(string)
            except json.JSONDecodeError:
                # Last resort. This replacement algorithm is risky.
                # Fall back to the unrepaired text if nothing was replaced.
                string = solve_wrong(string) or string
                try:
                    temp_json = json.loads(string)
                except json.JSONDecodeError:
                    print(string)
    return temp_json

def solve_wrong(string, right=0):
    """
    Called when JSON parsing fails: escape the double quotes that break the
    text, i.e. replace each inner double quote with a backslash-escaped one.
    :param string: the JSON text that failed to parse
    :param right: initial scan offset (JSON from web pages is longer than 10 characters)
    :return: the escaped text, or the original text if nothing was replaced
    """
    if len(string) < 10:
        # JSON shorter than 10 characters? Not worth the effort.
        print("Get out,this is bullshit.")
    result = None
    while right < len(string):
        c = string[right]
        if c == '"':
            # Record the position of the opening double quote
            left = right
            flag = False
            # Scan forward from the current position
            right = left + 1
            while (right + 2) < len(string):
                if string[right] == '"':
                    # Found a candidate match; check whether it is a valid JSON ending
                    nextc = string[right + 1]
                    if nextc in '],:[}':
                        if nextc == ',':
                            # A comma should be followed by the start of the next object, array, or string
                            next_nextc = string[right + 2]
                            if next_nextc not in '{"[':
                                # Not the start of new data, so this " is inside the string
                                flag = True
                                right += 1
                                continue
                        # This is the real ending
                        if flag:
                            # Cut out the text between the quote pair
                            temp = string[left + 1: right]
                            # Keep the replaced text in result; modifying string
                            # directly would shift every index used by the scan
                            if result is None:
                                result = string.replace(temp, temp.replace('"', '\\"'))
                            else:
                                result = result.replace(temp, temp.replace('"', '\\"'))
                        break
                    else:
                        # This " is what broke the parser; keep moving toward the real ending
                        flag = True
                right += 1
        right += 1
    return string if result is None else result
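The nested try/except in json_loads is really a chain of fallbacks, each applied on top of the previous one. As a rough sketch of the same idea in a flatter form, the version below takes the repairs as a list of functions. The names loads_with_repairs, strip_tabs and the logger name are made up for this illustration, and logging is used here only as one possible way to record failures.

```python
import json
import logging

logger = logging.getLogger("json_repair")  # hypothetical logger name


def strip_tabs(text):
    """Remove literal tab characters, which strict JSON forbids inside strings."""
    return text.replace("\t", "")


def loads_with_repairs(text, repairs):
    """Try json.loads, applying each repair in turn until parsing succeeds."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    for repair in repairs:
        text = repair(text)  # repairs accumulate, as in json_loads
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    logger.warning("unparseable JSON: %s", text)
    return None


# To send the warnings to a file, one option is:
# logging.basicConfig(filename="json.log", level=logging.WARNING)
print(loads_with_repairs('{"a": "b\tc"}', [strip_tabs]))  # {'a': 'bc'}
```

For the tab case alone, json.loads(text, strict=False) is another option, since non-strict mode allows control characters inside strings.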
Algorithmic thought
The algorithm relies on two pointers, left and right, to delimit each pair of double quotes ("). Whenever the right pointer lands on a ", it checks whether that quote is a valid end of a string. If it is not, flag is set to True and the scan continues. Once the right pointer reaches the real end, the text between the quote pair is cut out and the quotes inside it are replaced with \".
The list of valid endings is simply what I summarized by staring at the form of my target JSON, so it is not a general rule. Incidentally, the JSON I see in the browser's developer mode does contain the escape symbols, but after my program fetches it, Python seems to have interpreted them as escape characters. It's hard...
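The idea can also be sketched in a more compact form. This is a simplified variant, not the code above: it builds a new string instead of calling str.replace, skips whitespace before looking ahead, and does not treat [ as a valid ending. The names skip_spaces, looks_like_closing_quote and escape_inner_quotes are hypothetical, and the heuristic still guesses wrong on text such as a quoted word followed by a comma and another quote.

```python
import json

CLOSERS = ",:]}"       # characters that may follow a real closing quote
VALUE_STARTS = '"{['   # what is expected after a comma (same check as above)


def skip_spaces(text, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(text) and text[i].isspace():
        i += 1
    return i


def looks_like_closing_quote(text, i):
    """Guess whether the quote at text[i] really ends a JSON string."""
    j = skip_spaces(text, i + 1)
    if j >= len(text):
        return True
    if text[j] not in CLOSERS:
        return False
    if text[j] == ",":
        k = skip_spaces(text, j + 1)
        return k < len(text) and text[k] in VALUE_STARTS
    return True


def escape_inner_quotes(text):
    """Escape double quotes that appear to sit inside a string value."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            out.append(text[i:i + 2])  # keep existing escapes untouched
            i += 1
        elif ch == '"':
            if looks_like_closing_quote(text, i):
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"title": "He said "hi" to me", "id": 1}'
print(json.loads(escape_inner_quotes(broken))["title"])  # He said "hi" to me
```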
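I have not confirmed where the backslashes get lost in my crawler. One way the same symptom can appear, shown here purely as an assumption, is when the JSON text passes through an ordinary Python string literal (for example when pasted into source code for testing): Python consumes each \" and leaves a bare ", while a raw literal keeps the backslash.

```python
import json

normal = '{"msg": "say \"hi\""}'   # Python turns each \" into a bare "
raw = r'{"msg": "say \"hi\""}'     # a raw literal keeps the backslashes

print(len(raw) - len(normal))      # 2 (the two backslashes that were consumed)
print(json.loads(raw)["msg"])      # say "hi"

try:
    json.loads(normal)
except json.JSONDecodeError:
    print("the normal literal is no longer valid JSON")
```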