Using builtwith in Python 3 (very detailed)

Keywords: Python encoding pip Pycharm

1. First, install builtwith with pip install builtwith:

C:\Users\Administrator>pip install builtwith
Collecting builtwith
  Downloading builtwith-1.3.2.tar.gz
Installing collected packages: builtwith
  Running setup.py install for builtwith ... done
Successfully installed builtwith-1.3.2

2. Create a new project in PyCharm and enter the following test code:
import builtwith
tech_used = builtwith.parse('http://www.baidu.com')
print(tech_used)

Running it produces the following error:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 1, in <module>
    import builtwith
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
    except Exception, e:
                    ^
SyntaxError: invalid syntax


Process finished with exit code 1

The cause is that builtwith was written for Python 2.x, so several places need to be changed. In PyCharm's error output, open the file named in the traceback and edit it. There are three main changes:
1. The Python 2 syntax "except Exception, e" is no longer supported and must be changed to "except Exception as e".
2. print is a function in Python 3, so the expression after print must be enclosed in parentheses.
3. builtwith uses the Python 2 urllib2 module, which does not exist in Python 3, so the urllib2-related code has to be rewritten.
Points 1 and 2 are easy to fix. The following covers point 3.
First, replace import urllib2 with the following code:
import urllib.request
import urllib.error
Then replace the urllib2 calls as follows:
request = urllib.request.Request(url, None, {'User-Agent': user_agent})
response = urllib.request.urlopen(request)
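
The three changes can be seen together in a small standalone sketch. It is not builtwith's actual source: the helper names build_request and fetch, the default user agent string, and the timeout are assumptions chosen for illustration.

```python
import urllib.request
import urllib.error


def build_request(url, user_agent="builtwith"):
    """Build the request object the way the patched code does.

    The function name and the default user agent are hypothetical.
    """
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, user_agent="builtwith"):
    """Return the raw response body as bytes, or None if the request fails."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()
    except urllib.error.URLError as e:  # Python 3 syntax: "as e", not ", e"
        print("Error:", e)              # print is a function in Python 3
        return None
```

Calling fetch('http://www.baidu.com') needs network access and returns bytes, which matters for the next error.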

Running the project again produces the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 3, in <module>
    builtwith.parse('http://www.baidu.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object


Process finished with exit code 1

  
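This is because urllib in Python 3 returns the response body as bytes rather than as a string, so it has to be decoded before the string regular expressions can search it. The original code: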
if html is None:
    html = response.read()
Change it to:
if html is None:
    html = response.read()
    html = html.decode('utf-8')
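
The TypeError can be reproduced without builtwith or a network connection. The following is a minimal sketch; the pattern, the sample HTML, and the helper name search_html are made up for illustration.

```python
import re

# A str pattern, like the ones builtwith compiles from its rule set.
PATTERN = re.compile("jquery", flags=re.IGNORECASE)


def search_html(raw, encoding="utf-8"):
    """Decode the raw bytes first, then search the resulting str."""
    text = raw.decode(encoding)
    return PATTERN.search(text) is not None


raw = b"<script src='jQuery.min.js'></script>"

try:
    PATTERN.search(raw)  # str pattern against bytes
except TypeError as e:
    print("TypeError:", e)

print(search_html(raw))  # decoding first makes the search work
```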

The final results are as follows:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'javascript-frameworks': ['jQuery']}


Process finished with exit code 0

But if the website is changed to 'www.163.com', an error is reported again:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 2, in <module>
    tech_used = builtwith.parse('http://www.163.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object



Process finished with exit code 1

This still looks like an encoding problem: the first line of the output shows that the UTF-8 decode failed. After changing the encoding to 'gbk', the run succeeds:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'web-servers': ['Nginx']}


Process finished with exit code 0
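
The same decoding failure can be reproduced offline. This is a small sketch, assuming a two-character GBK-encoded sample rather than the real page; the helper name try_decode is hypothetical.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid for that codec."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        print("Error:", e)
        return None


# Sample bytes encoded as GBK; the first byte is 0xcd.
gbk_bytes = "网易".encode("gbk")

print(try_decode(gbk_bytes, "utf-8"))          # decode fails, so None is returned
print(try_decode(gbk_bytes, "gbk") == "网易")  # the matching codec round-trips
```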

So do different websites need different decodings? Here is a way to detect which encoding a website uses.
First, install a package called chardet:
C:\Users\Administrator>pip install chardet
Collecting chardet
  Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
    100% |████████████████████████████████| 184kB 616kB/s
Installing collected packages: chardet
Successfully installed chardet-2.3.0


C:\Users\Administrator>

Pass the byte data to chardet's detect method and you get a dict with two values: the detected encoding and a confidence score.
{'encoding': 'utf-8', 'confidence': 0.99}

Modify the builtwith code as follows:
encode_type = chardet.detect(html)
if encode_type['encoding'] == 'utf-8':
    html = html.decode('utf-8')
else:
    html = html.decode('gbk')
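
The code above only handles UTF-8 and GBK. A slightly more general variant could decode with whatever encoding chardet reports and fall back when detection fails. The following is only a sketch of that idea, not part of builtwith: the helper name decode_html, the detect parameter (which allows swapping in a stand-in detector), and the UTF-8 fallback are assumptions.

```python
def decode_html(raw, detect=None, fallback="utf-8"):
    """Decode raw bytes using the detected encoding, with a fallback.

    `detect` is any callable returning a dict with an 'encoding' key;
    when omitted, chardet.detect is used (requires pip install chardet).
    """
    if detect is None:
        import chardet  # third-party package installed earlier
        detect = chardet.detect
    guess = detect(raw)
    # chardet may report None when it cannot decide.
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: replace bad bytes rather than crash.
        return raw.decode(fallback, errors="replace")
```

Inside builtwith this would replace the if/else with a single call such as html = decode_html(html).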

Remember to import chardet!
With chardet detecting the character encoding, builtwith can adapt to whichever website you point it at.

Posted by turing_machine on Wed, 17 Apr 2019 00:03:33 -0700