Python crawler for a weather forecast website -- check the next 15 days' weather (regular expressions)

Keywords: Python Windows


Approach:

1. First find a website that shows the weather forecast you want, and pick the location to view (for example: http://www.tianqi.com/xixian1/15/)

2. Open the page source ("view source") and analyze the structure of the data you want to extract

3. Use regular expressions to extract the data. The site may block crawlers, so this needs to be bypassed. Here we use a browser User-Agent: by default the requests library identifies itself as python-requests, so switching to a browser's User-Agent gets past simple anti-crawler checks

4. After extracting the data, format it nicely for display

5. Write the data to a file (using the pickle module)

 

2. Open the page source and analyze the structure of the data you want to extract (every site is different, so this must be analyzed case by case)

3.1 With Python's default User-Agent the site rejects the request (the original post includes a screenshot of the blocked response)
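The reason the User-Agent swap works can be sketched without sending any request: requests announces itself as "python-requests/&lt;version&gt;" unless told otherwise, and simple anti-crawler rules block exactly that string. A minimal sketch (no network access needed):

```python
import requests

# requests' built-in default User-Agent string
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/2.31.0

# A browser-style User-Agent makes the request look like a normal page view
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
print(browser_headers["User-Agent"].startswith("Mozilla/5.0"))  # True
```

Passing `headers=browser_headers` to `requests.get` is all the "bypass" amounts to.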

3.2 Use regular expressions to process the page and extract the data you want
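Before the full script, here is how `findall` pulls capture groups out of the kind of markup the site uses. The HTML fragment below is a made-up example that mirrors the structure, not the site's actual source:

```python
import re

# Hypothetical fragment shaped like one entry of the 15-day list
html = '<b>12月09日</b><li class="temp">晴 -2~<b>8</b>℃</li>'

date_pat = re.compile(r'<b>(\d\d月\d\d日)</b>')
temp_pat = re.compile(r'<li class="temp">(.+) (-?\d+)(\W+)<b>(-?\d+)</b>℃</li>')

# One group -> findall returns strings; several groups -> tuples of groups
print(date_pat.findall(html))   # ['12月09日']
print(temp_pat.findall(html))   # [('晴', '-2', '~', '8')]
```

With multiple capture groups, each match comes back as a tuple, which is why the full script indexes results like `s2.findall(txt)[i][0]`.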

The code is as follows:

Check the weather forecast:

import re
import requests
from prettytable import PrettyTable

url = "http://www.tianqi.com/xixian1/15/"
# Bypass the site's anti-crawler check by sending a browser User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Host": "www.tianqi.com",
}
txt = requests.get(url, headers=headers).text
#print(txt)

# The page is in Chinese, so the date pattern matches 月 (month) and 日 (day)
s1 = re.compile(r'<b>(\d\d月\d\d日)</b>')                                    # date
s2 = re.compile(r'<li class="temp">(.+) (-?\d+)(\W+)<b>(-?\d+)</b>℃</li>')  # weather and low~high temperature
s3 = re.compile(r'>(.{1,4})</b></li>')                                       # air quality
s4 = re.compile(r'<li>([\u4e00-\u9fa5].+)</li>')                             # grade

# Run each pattern once and reuse the result lists
dates = s1.findall(txt)
temps = s2.findall(txt)
air_quality = s3.findall(txt)
grade = s4.findall(txt)
print(dates, temps, air_quality, grade, sep="\n")

tianqi = []
for i in range(len(dates)):
    tianqi.append([dates[i], temps[i][0], temps[i][1] + temps[i][2] + temps[i][3], air_quality[i], grade[i]])

print(tianqi)
ptable = PrettyTable(["Date", "Weather", "Temperature (℃)", "Air quality", "Grade"])
for row in tianqi:
    ptable.add_row(row)
print(ptable)

Running the script prints the parsed lists followed by the formatted table (the original post shows a screenshot of the output).

5. Write the data to a file (pickle)
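The caching idea is a plain pickle round-trip: dump the fetched text to disk once, then load it back on later runs instead of hitting the site again. A minimal sketch, using a throwaway file name chosen for this demo:

```python
import os
import pickle
import tempfile

page_text = "<html>...fetched page...</html>"   # stands in for the downloaded HTML

# Hypothetical cache file in the temp directory (the script uses "tianqi.txt")
path = os.path.join(tempfile.gettempdir(), "tianqi_demo.pkl")

with open(path, "wb") as f:   # first run: serialize
    pickle.dump(page_text, f)

with open(path, "rb") as f:   # later runs: deserialize instead of re-fetching
    restored = pickle.load(f)

print(restored == page_text)  # True
os.remove(path)               # clean up the demo file
```

Note that pickle files are binary, so they must be opened with "wb"/"rb", not text mode.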

The code is as follows:

import re
import requests
import pickle
from prettytable import PrettyTable

url = "http://www.tianqi.com/xixian1/15/"
# Load the cached page if it exists; otherwise fetch it (bypassing the
# site's anti-crawler check with a browser User-Agent) and cache it
try:
    with open("tianqi.txt", "rb") as f:
        txt = pickle.load(f)
        print("Results loaded")
except FileNotFoundError:
    txt = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36", "Host": "www.tianqi.com"}).text
    # Write the page to the file (serialize)
    with open("tianqi.txt", "wb") as f:
        pickle.dump(txt, f)
        print("File written!")
#print(txt)

# The page is in Chinese, so the date pattern matches 月 (month) and 日 (day)
s1 = re.compile(r'<b>(\d\d月\d\d日)</b>')                                    # date
s2 = re.compile(r'<li class="temp">(.+) (-?\d+)(\W+)<b>(-?\d+)</b>℃</li>')  # weather and low~high temperature
s3 = re.compile(r'>(.{1,4})</b></li>')                                       # air quality
s4 = re.compile(r'<li>([\u4e00-\u9fa5].+)</li>')                             # grade

# Run each pattern once and reuse the result lists
dates = s1.findall(txt)
temps = s2.findall(txt)
air_quality = s3.findall(txt)
grade = s4.findall(txt)
print(dates, temps, air_quality, grade, sep="\n")

tianqi = []
for i in range(len(dates)):
    tianqi.append([dates[i], temps[i][0], temps[i][1] + temps[i][2] + temps[i][3], air_quality[i], grade[i]])

print(tianqi)
ptable = PrettyTable(["Date", "Weather", "Temperature (℃)", "Air quality", "Grade"])
for row in tianqi:
    ptable.add_row(row)
print(ptable)

Posted by sambo80 on Mon, 09 Dec 2019 10:57:09 -0800