requests
The urllib module in Python's standard library already covers most of the functionality we normally need, but its API is awkward to use. Requests offers the same capabilities behind a far more convenient API, which simplifies our code.
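To see the difference, here is the same GET request written both ways (a minimal comparison sketch against httpbin.org):

```python
from urllib.request import Request, urlopen
import requests

url = 'https://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}

# urllib: build a Request object, open it, then read and decode the bytes yourself
req = Request(url, headers=headers)
body = urlopen(req).read().decode('utf-8')

# requests: one call; decoding is handled for you
body = requests.get(url, headers=headers).text
```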
First, install it:

```
pip3 install requests
```
The two request methods we use most often are get and post.
get sample code
url = "https://xueqiu.com/v4/statuses/public_timeline_by_category.json?" #Request header headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36', } response = requests.get(url=url,headers=headers)
post sample code
```python
import requests

form_data = {
    'username': '17611317980',
    'password': '123456',
}
url = 'http://127.0.0.1:8000/api/login/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
response = requests.post(url=url, data=form_data, headers=headers)
```
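If the endpoint expects a JSON body instead of form data, requests can serialize it for you via the `json` argument (a sketch against the same hypothetical login endpoint):

```python
# The body is serialized to JSON and the Content-Type header
# is set to application/json automatically
response = requests.post(url=url, json=form_data, headers=headers)
```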
Response result processing
```python
# Working with the response
response.text                # Page source code (decoded text)
response.status_code         # Status code
response.headers             # Response headers
response.request.headers     # The request headers that were sent
response.content             # Raw binary data of the page
response.encoding = 'utf-8'  # Set the encoding used to decode .text
response.encoding            # Get the current encoding
# If the response body is a JSON string, call json() to convert it
# into a Python data type
response.json()
```
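Putting these together, a typical pattern for consuming a JSON API might look like this (a small sketch using httpbin.org, which echoes the request back as JSON):

```python
import requests

response = requests.get('https://httpbin.org/get')
if response.status_code == 200:
    data = response.json()  # dict parsed from the JSON body
    print(data['url'])      # httpbin echoes the request URL back
```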
File upload
```python
import requests

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# Read the local file; open it in binary mode for uploads
files = {'file': open('page.html', 'rb')}
response = requests.post(url=url, files=files, headers=headers)
if response.status_code == 200:
    print('File uploaded successfully')
    print(response.text)
```
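Note that the snippet above never closes the file handle; wrapping the upload in a `with` block is a safer pattern (a sketch):

```python
import requests

url = 'https://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0'}
# The with block closes the file even if the request raises an exception
with open('page.html', 'rb') as f:
    response = requests.post(url=url, files={'file': f}, headers=headers)
```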
Setting up proxies
```python
import requests

# url = 'https://www.baidu.com/'
url = 'https://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# proxies takes a dictionary mapping a scheme to a proxy address
proxy = {
    'https': '60.190.250.120:8080',
    'http': '121.61.3.209:9999',
}
response = requests.get(url=url, headers=headers, proxies=proxy)
if response.status_code == 200:
    print('Request succeeded')
    print(response.text)
```
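If a proxy requires authentication, the credentials can be embedded in the proxy URL; the user and password below are hypothetical:

```python
# Hypothetical credentials, for illustration only
proxy = {
    'http': 'http://user:password@121.61.3.209:9999',
}
response = requests.get(url=url, headers=headers, proxies=proxy)
```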
Session persistence
```python
# session: we often need to keep state (such as cookies) across requests.
# For that we use a Session object.
import requests

# Instantiate a session() object
session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
response = session.get('http://www.baidu.com/', headers=headers)
print(response.headers)
print(session.cookies)  # session.cookies stores the cookies returned by the server

# Once session.cookies holds the user's information, every request made
# with session.get() or session.post() automatically sends those cookies
# and related information along.
response = session.get('http://www.baidu.com/', headers=headers)

# Typically we use requests.session() to simulate a login first,
# then use the same session to make the subsequent requests.
```
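A sketch of that login-then-request flow, reusing the hypothetical login endpoint from the POST example (the `/api/profile/` endpoint is also invented for illustration):

```python
import requests

session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0'}

# Step 1: log in; the server's session cookies are stored on the session object
session.post('http://127.0.0.1:8000/api/login/',
             data={'username': '17611317980', 'password': '123456'},
             headers=headers)

# Step 2: later requests automatically carry those cookies
response = session.get('http://127.0.0.1:8000/api/profile/', headers=headers)
```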
Handling HTTPS requests (ignoring SSL certificate verification)
```python
# If an SSL CA certificate verification error occurs:
# verify defaults to True, which means the certificate is verified.
# Set verify to False to skip certificate verification.
response = requests.get(url=url, headers=headers, verify=False)
```
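With `verify=False`, urllib3 emits an `InsecureRequestWarning` on every request; if that noise bothers you, it can be silenced (a sketch, with httpbin.org standing in for whatever URL triggered the error):

```python
import requests
import urllib3

urllib3.disable_warnings()  # silence urllib3's InsecureRequestWarning

response = requests.get('https://httpbin.org/get',
                        headers={'User-Agent': 'Mozilla/5.0'},
                        verify=False)
```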
Request-related parameters
```
# Overview of the request parameters
:param method:  the request method: get, post, delete, ...
:param url:     the target URL
:param params:  (optional) dictionary; parameters appended after the "?" in a GET request URL
:param data:    (optional) dictionary; body parameters of a POST request
:param headers: (optional) dictionary; sets the request headers
:param cookies: (optional) dict or CookieJar object; sets the user's cookies
:param files:   (optional) dictionary; file upload (POST)
:param auth:    (optional) auth credentials for authentication
:param timeout: (optional) the timeout for the request
:param allow_redirects: (optional) whether redirects are followed; defaults to True
:param proxies: (optional) dictionary; sets the proxies
:param verify:  (optional) boolean; defaults to True (certificates are verified);
                set to False to ignore certificate verification
```
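A sketch that combines several of these parameters in a single call:

```python
import requests

response = requests.get(
    'https://httpbin.org/get',
    params={'q': 'python'},               # appended as ?q=python
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=5,                            # seconds before giving up
    allow_redirects=True,                 # follow redirects (the default)
)
print(response.status_code)
```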
xpath parsing
First, what is XPath?
XPath (XML Path Language) is a language for finding information in XML documents: it can be used to select elements and attributes. It works on HTML as well.
Which raises the question: what exactly is XML?
In short:

* XML is a markup language, very similar to HTML.
* XML is designed to transport data, not to display it.
* XML tags are not predefined; we define them ourselves.
Now that we understand XML, let's return to XPath and look at its common path expressions.
| Expression | Description |
|---|---|
| nodename | Selects all child nodes of the named node |
| / | Selects from the root node |
| // | Selects matching nodes anywhere in the document from the current node, regardless of their location |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
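To make the table concrete, here is a minimal sketch that exercises these expressions on a small hand-written fragment (the markup is invented for illustration):

```python
from lxml import etree

doc = etree.HTML('''
<div class="scores_List">
  <dl><dt><a href="/u1">School One</a></dt></dl>
  <dl><dt><a href="/u2">School Two</a></dt></dl>
</div>
''')

print(doc.xpath('//a/@href'))           # //, @ : all href values -> ['/u1', '/u2']
print(doc.xpath('//a/text()'))          # text() -> ['School One', 'School Two']
first_dl = doc.xpath('//dl')[0]
print(first_dl.xpath('./dt/a/text()'))  # .  : relative to the current node
print(first_dl.xpath('..')[0].tag)      # .. : parent of the current node -> 'div'
```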
Now let's see how this looks in a real scraping example.
```python
from lxml import etree

# Construct an XPath parsing object; lxml automatically corrects malformed HTML
etree_xpath = etree.HTML(html)

# text()                  gets the text of a tag
# @attribute-name         gets the value of a tag attribute
# contains(@class, 'li')  matches when an attribute with multiple values contains 'li'

ranks = etree_xpath.xpath('//div[@class="scores_List"]/dl')  # grouped by dl under the div
for dl in ranks:  # traverse the groups
    school_info = {}
    school_info['url'] = dl.xpath('./dt/a[1]/@href')
    school_info['icon'] = dl.xpath('./dt/a[1]/img/@src')
    school_info['name'] = dl.xpath('./dt/strong/a/text()')
```
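The `html` variable above is assumed to be a page you have already fetched, so the snippet is not runnable on its own. Here is a self-contained sketch with a hand-written fragment that mimics the assumed `scores_List` markup:

```python
from lxml import etree

# Hand-written fragment mimicking the scores_List markup assumed above
html = '''
<div class="scores_List">
  <dl>
    <dt>
      <a href="/school/1"><img src="/icons/1.png"/></a>
      <strong><a>Example University</a></strong>
    </dt>
  </dl>
</div>
'''
etree_xpath = etree.HTML(html)
for dl in etree_xpath.xpath('//div[@class="scores_List"]/dl'):
    print(dl.xpath('./dt/a[1]/@href'))       # ['/school/1']
    print(dl.xpath('./dt/a[1]/img/@src'))    # ['/icons/1.png']
    print(dl.xpath('./dt/strong/a/text()'))  # ['Example University']
```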