Python crawler: JSON parsing errors

Keywords: JSON Python

Preface

Recently my Python crawler keeps hitting errors in json.loads(). Looking at the raw text it retrieved, the cause is the quotation marks that should be escaped, that is, \" and \'.

When I first ran into this, I searched the Internet frantically but could not find an answer that fit my case, so I considered writing my own JSON parser. That turned out to be a lot of trouble, so, being lazy, I wrote a replacement algorithm instead: it locates each stray " inside the JSON and replaces it with \", so that the subsequent JSON parsing no longer fails.
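
To make the failure concrete, here is a minimal sketch of the kind of input that trips json.loads. The sample strings are made up for illustration and are not taken from the crawler's real data:

```python
import json

# Hypothetical sample: the quotes around "big" are not escaped,
# so the parser thinks the string value ends right after "A ".
broken = '{"title": "A "big" sale", "id": 1}'

try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    print("parse failed:", exc.msg)

# With the inner quotes escaped by a backslash, the text is valid JSON.
# (The doubled backslash is Python literal syntax for one backslash.)
fixed = '{"title": "A \\"big\\" sale", "id": 1}'
data = json.loads(fixed)
print(data["title"])
```

The replacement algorithm described in this post tries to turn text like `broken` into text like `fixed` automatically.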

Code

Wrap json.loads in a try block and add the corresponding handling logic in the except clauses:

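import json

# Placeholder that stands in for \' (an escape JSON does not allow), styled after URL encoding
REPLACE_POINT = "%10AA"


def json_loads(string):
    """
    Try to convert string to a JSON object. If every attempt fails, the bad text is printed and None is returned.
    :param string: raw JSON text
    :return: the parsed object, or None if parsing failed
    """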

    try:
        temp_json = json.loads(string)
    except json.JSONDecodeError:
        temp_json = None

    if temp_json is None:
        try:
            # REPLACE_POINT is a constant I defined to stand in for \' (compare URL encoding); I set it to %10AA
            string = str(string).replace("\\'", REPLACE_POINT)
            temp_json = json.loads(string)
        except json.JSONDecodeError:
            # Literal tab characters often appear in the JSON and break parsing, so remove them
            string = string.replace("\t", "")
            try:
                temp_json = json.loads(string)
            except json.JSONDecodeError:
                # Last resort: escape stray quotes. This replacement algorithm is risky
                string = solve_wrong(string)
                try:
                    temp_json = json.loads(string)
                except json.JSONDecodeError:
                    # Every attempt failed: print the offending text; None is returned
                    print(string)
    return temp_json


def solve_wrong(string, right=0):
    """
    When JSON parsing fails, call this method to escape the offending quotes, that is, replace each stray double quote inside a string value with a backslash-escaped one.
    :param string: the JSON string that failed to parse
    :param right: initial offset for scanning; JSON from web pages is normally longer than 10 characters
    :return: the escaped string
    """

    if len(string) < 10:
        # JSON shorter than 10 characters? That is unlikely to be real data
        print("Input is too short to be real JSON data.")

    result = None
    while right < len(string):
        c = string[right]
        if c == '"':
            # Record the current double quotation mark position
            left = right
            flag = False
            # Scan forward from the current position
            right = left + 1
            while (right + 2) < len(string):
                if string[right] == '"':
                    # Found another ": check whether it is a legitimate end of a JSON string
                    nextc = string[right + 1]
                    if nextc == ']' or nextc == ',' or nextc == ':' or nextc == '[' or nextc == '}':

                        if nextc == ',':
                            # The comma could be followed by the start of the next JSON object, array, or string
                            next_nextc = string[right + 2]
                            if next_nextc != '{' and next_nextc != '"' and next_nextc != '[':
                                # Not the start of a new value, so this " is inside the string
                                flag = True
                                right += 1
                                continue

                        # This is the real end of the string
                        if flag:
                            # Take the text between the pair of quotation marks
                            temp = string[left + 1: right]
                            if result is None:
                                # Keep the replaced text in result; changing string itself
                                # would shift the positions still being scanned
                                result = string.replace(temp, temp.replace('"', '\\"'))
                            else:
                                result = result.replace(temp, temp.replace('"', '\\"'))
                        break
                    else:
                        # This " would break JSON parsing; keep scanning toward the real end of the string
                        flag = True
                right += 1
        right += 1
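    # If nothing needed escaping, return the input unchanged rather than None,
    # so the caller's json.loads does not fail with a TypeError
    return result if result is not None else string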
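
The tab-removal step in json_loads is needed because the standard json module rejects literal control characters inside strings by default. As a possible alternative (not what the code above does), json.loads accepts strict=False, which lets such characters through instead of deleting them. A minimal sketch with a made-up sample string:

```python
import json

# Hypothetical sample: the \t in this Python literal becomes a real tab
# character inside the JSON string value, which strict parsing rejects.
text = '{"name": "a\tb"}'

try:
    json.loads(text)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# strict=False allows control characters (such as tabs) inside strings.
relaxed = json.loads(text, strict=False)
print(strict_ok, repr(relaxed["name"]))
```

Whether keeping the tab is preferable to stripping it depends on what the crawler does with the value afterwards.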

How the algorithm works

It relies mainly on two pointers to identify each pair of double quotation marks ("). When the right pointer finds a ", it checks whether that quote is a legitimate end of the string. If not, flag is set to True. When the right pointer reaches the real end, the text between the pair of quotation marks is taken and the quotes inside it are escaped.
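
The "is this a legitimate end" check can be isolated into a small helper. This is a simplified sketch of the same heuristic, not a replacement for solve_wrong; the name looks_like_string_end and the sample text are made up here. Like the code above, it assumes compact JSON with no whitespace after quotes or commas:

```python
def looks_like_string_end(text, i):
    """Guess whether the double quote at text[i] really closes a JSON string."""
    nxt = text[i + 1:i + 2]
    # A closing quote should be followed by JSON structure.
    if nxt not in ("]", ",", ":", "[", "}"):
        return False
    if nxt == ",":
        # After a comma, the next value should begin with {, " or [.
        return text[i + 2:i + 3] in ("{", '"', "[")
    return True


# Hypothetical sample: the quote between x and y is a stray one.
text = '{"a":"x"y","b":"z"}'
stray = text.index('"y')      # position of the stray quote
real_end = text.index('",')   # position of the real closing quote

print(looks_like_string_end(text, stray))
print(looks_like_string_end(text, real_end))
```

Because the check only looks one or two characters ahead, a stray quote that happens to be followed by a comma and another quote inside the text would be misjudged, which is one reason the approach is risky.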

I worked out the end-of-string rules simply by looking at the shape of the target JSON. By the way, the JSON I see in the browser's developer tools does contain the escape characters, but after I fetch it with my program, Python seems to treat them as escape sequences. It's hard...
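
One way backslashes can disappear, shown here only as a possible explanation and not a diagnosis of the crawler, is when JSON text passes through an ordinary Python string literal: Python consumes \" as an escape sequence before the JSON parser ever sees it. A raw string literal keeps the backslashes. The sample text is made up:

```python
import json

# Ordinary literal: Python turns \" into a bare ", so the backslashes are gone.
plain = '{"msg": "say \"hi\""}'
# Raw literal: the backslashes survive and reach the JSON parser.
raw = r'{"msg": "say \"hi\""}'

# The raw version is two characters longer: the two backslashes.
print(len(raw) - len(plain))

try:
    json.loads(plain)
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

value = json.loads(raw)["msg"]
print(plain_ok, value)
```

Printing repr() of the fetched text is a quick way to see whether the backslashes are really present in the data.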


Posted by fredriksk on Sun, 15 Mar 2020 05:11:14 -0700