1. First, install builtwith with pip install builtwith:
- C:\Users\Administrator>pip install builtwith
- Collecting builtwith
- Downloading builtwith-1.3.2.tar.gz
- Installing collected packages: builtwith
- Running setup.py install for builtwith ... done
- Successfully installed builtwith-1.3.2
2. Create a new project in PyCharm and enter the following test code:
- import builtwith
- tech_used = builtwith.parse('http://www.baidu.com')
- print(tech_used)
Running it produces the following error:
- C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
- Traceback (most recent call last):
- File "F:/python/first/FirstPy", line 1, in <module>
- import builtwith
- File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
- except Exception, e:
- ^
- SyntaxError: invalid syntax
- Process finished with exit code 1
The reason is that builtwith was written for Python 2.x and has to be modified in several places to run on Python 3. Open the file named in the traceback from PyCharm's error output and edit it. There are three main changes:
1. The Python 2 syntax "except Exception, e" is no longer supported and must be changed to "except Exception as e".
2. print is a statement in Python 2 but a function in Python 3, so its arguments must be enclosed in parentheses.
3. builtwith uses the Python 2 urllib2 module, which does not exist in Python 3, so the urllib2-related code must be rewritten.
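As a rough illustration of changes 1 and 2, the sketch below shows the Python 3 forms side by side with the Python 2 forms in comments. The helper safe_divide is hypothetical and exists only to trigger an exception; it is not part of builtwith.

```python
# Python 2 form (a syntax error on Python 3):
#     except Exception, e:
#         print 'Error:', e

def safe_divide(a, b):
    """Hypothetical helper, used only to trigger and handle an exception."""
    try:
        return a / b
    except Exception as e:      # Python 3: "as e" replaces ", e"
        print('Error:', e)      # Python 3: print is a function call
        return None

result = safe_divide(1, 0)      # prints the error message and returns None
print(result)
```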
Changes 1 and 2 are easy to make. The following covers change 3.
First, replace import urllib2 with the following code:
- import urllib.request
- import urllib.error
Then replace the urllib2 calls as follows:
- request = urllib.request.Request(url, None, {'User-Agent': user_agent})
- response = urllib.request.urlopen(request)
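For reference, the same two calls can be tried outside builtwith. This is a minimal sketch, not builtwith's own code: the function name fetch_html, its default user_agent value, and the timeout parameter are assumptions made for illustration.

```python
import urllib.request
import urllib.error

def fetch_html(url, user_agent='builtwith', timeout=10):
    """Return the raw response body as bytes, or None if the request fails.

    fetch_html, user_agent and timeout are hypothetical names for this sketch.
    """
    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()      # bytes in Python 3, not str
    except urllib.error.URLError as e:
        print('Error:', e)
        return None
```

Calling fetch_html('http://www.baidu.com') would return the page as a bytes object, which matters for the next error.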
Running the project again produces the following error:
- C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
- Traceback (most recent call last):
- File "F:/python/first/FirstPy", line 3, in <module>
- builtwith.parse('http://www.baidu.com')
- File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
- if contains(html, snippet):
- File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
- return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
- TypeError: cannot use a string pattern on a bytes-like object
- Process finished with exit code 1
This is because in Python 3 response.read() returns bytes rather than str, so the data must be decoded before the str regular expressions can be applied to it. The original code is:
- if html is None:
-     html = response.read()
Modify it to:
- if html is None:
-     html = response.read()
-     html = html.decode('utf-8')
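The error and the fix can be reproduced without any network access. In the sketch below, the pattern and the bytes value are made up for illustration; only the use of re.compile(...).search(...) with re.IGNORECASE mirrors the traceback above.

```python
import re

# A str pattern, like the ones builtwith compiles.
pattern = re.compile('jquery', flags=re.IGNORECASE)

# Made-up bytes standing in for the result of response.read().
raw = b'<script src="jQuery.min.js"></script>'

try:
    pattern.search(raw)            # str pattern against bytes
except TypeError as e:
    print('Error:', e)

html = raw.decode('utf-8')         # bytes -> str
match = pattern.search(html)       # now both sides are str
print(match.group(0))
```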
The final results are as follows:
- C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
- {'javascript-frameworks': ['jQuery']}
- Process finished with exit code 0
But if the site is changed to 'www.163.com', an error is reported again:
- C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
- Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
- Traceback (most recent call last):
- File "F:/python/first/FirstPy", line 2, in <module>
- tech_used = builtwith.parse('http://www.163.com')
- File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
- if contains(html, snippet):
- File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
- return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
- TypeError: cannot use a string pattern on a bytes-like object
- Process finished with exit code 1
This is still an encoding problem: the first line of the output shows that decoding as UTF-8 failed, so html was left as bytes and the same TypeError followed. After changing the encoding to 'gbk', the run succeeds:
- C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
- {'web-servers': ['Nginx']}
- Process finished with exit code 0
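The mismatch between the two encodings can be seen on a short string. This is only a sketch: the sample text is chosen for illustration and is not taken from the actual page.

```python
text = '网易'                   # sample text, chosen only for illustration
raw = text.encode('gbk')       # bytes as a GBK-encoded page would send them

utf8_failed = False
try:
    raw.decode('utf-8')        # wrong codec for these bytes
except UnicodeDecodeError as e:
    utf8_failed = True
    print('Error:', e)

decoded = raw.decode('gbk')    # the matching codec restores the text
print(decoded == text)
```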
So do different websites need different encodings? They can, so here is a way to detect the encoding a page uses.
We need to install a package called chardet:
- C:\Users\Administrator>pip install chardet
- Collecting chardet
- Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
- 100% |████████████████████████████████| 184kB 616kB/s
- Installing collected packages: chardet
- Successfully installed chardet-2.3.0
- C:\Users\Administrator>
Passing bytes to chardet's detect function returns a dict with two values: the detected encoding and a confidence score.
{'encoding': 'utf-8', 'confidence': 0.99}
Modify the builtwith code as follows:
- encode_type = chardet.detect(html)
- if encode_type['encoding'] == 'utf-8':
-     html = html.decode('utf-8')
- else:
-     html = html.decode('gbk')
Remember to import chardet!
With chardet detecting the character encoding, the code can adapt to different websites.
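A slightly more general variant of the same idea might look like the sketch below. It is not the change made to builtwith above: the name decode_html and the fallback parameter are hypothetical. It uses whatever encoding chardet reports instead of choosing only between UTF-8 and GBK, and if chardet is not installed it falls back to trying UTF-8 and then GBK.

```python
try:
    import chardet             # third-party: pip install chardet
except ImportError:            # fall back to trial decoding if it is missing
    chardet = None

def decode_html(raw, fallback='gbk'):
    """Decode page bytes to str.

    decode_html and fallback are hypothetical names for this sketch.
    """
    encoding = None
    if chardet is not None:
        # detect() returns a dict such as
        # {'encoding': 'utf-8', 'confidence': 0.99};
        # 'encoding' may be None when detection fails.
        encoding = chardet.detect(raw)['encoding']
    for candidate in (encoding, 'utf-8', fallback):
        if candidate is None:
            continue
        try:
            return raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            continue           # wrong guess or unknown codec: try the next one
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode('utf-8', errors='replace')
```

Detection is a statistical guess, so on very short inputs the reported encoding may be wrong; a full page usually gives chardet enough data to work with.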