1. Using Beautiful Soup
First, import the requests module and the Beautiful Soup module. If they are not installed yet, click the interpreter indicator in the lower-left corner of the IDE, open the interpreter settings, and search for the package you want to add (installing `requests` and `beautifulsoup4` with pip works just as well).
Then, as an example, let's crawl the Kugou Top 500 hot list. First open the Kugou official website and find the Top 500 chart, then right-click and choose Inspect (or press F12) to view the source code. Comparing the rendered page against the source, the information we want sits in the element with class='pc_temp_songlist': under each li tag there is an a tag, which locates the song and the artist.
find() and find_all()
find() returns the first matching tag as a single target object; find_all() returns every matching tag as a list of target objects.
From a target object you can extract what is inside the tag. For example, once we have located the a tag, we can pull out the song and the artist:
```python
target.get('title')   # the value of the tag's title attribute
target.text           # all text inside the tag, i.e. the song title
target.span.text      # the nested span's text, i.e. the singer (Wen Yixin in the example)
```
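To see these calls in one runnable piece, here is a minimal sketch against a made-up HTML fragment shaped like the Kugou song list (the fragment and names are illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# made-up markup mimicking the Kugou list structure (illustrative only)
html = '''
<ul class="pc_temp_songlist">
  <li><a class="pc_temp_songname" title="Wen Yixin - Song A">Song A<span>Wen Yixin</span></a></li>
  <li><a class="pc_temp_songname" title="Singer B - Song B">Song B<span>Singer B</span></a></li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')

target = soup.find('a')        # first match only -> one Tag object
targets = soup.find_all('a')   # every match -> a list of Tag objects
print(target.get('title'))     # Wen Yixin - Song A
print(target.text)             # Song AWen Yixin (all nested text, concatenated)
print(target.span.text)        # Wen Yixin
print(len(targets))            # 2
```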
Finally, the overall code is as follows
```python
import requests
from bs4 import BeautifulSoup
import time

level = 1
with open('kugou.txt', 'w+', encoding='utf-8') as f:
    # pages 1 to 23 of the chart
    for i in range(1, 24):
        url = 'https://www.kugou.com/yy/rank/home/' + str(i) + '-8888.html?from=rank'
        # set a header, a simple fake browser
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64'}
        # fetch the page
        res = requests.get(url=url, headers=headers)
        # parse it with BeautifulSoup
        soup = BeautifulSoup(res.text, 'lxml')
        songlist = soup.find('div', class_='pc_temp_songlist').find_all('a', class_='pc_temp_songname')
        for each in songlist:
            # split() cuts the string apart; strip() trims leading/trailing spaces
            name = each.text.split()[0].strip()
            author = each.span.text.split()[1].strip()
            # plain concatenation here; 'No.{}'.format(level) would work too
            f.write('No.' + str(level) + ' Song: ' + name + ' Singer: ' + author + '\n')
            level = level + 1
        print('Crawling page ' + str(i) + '...')
        time.sleep(1)
print('Crawling completed')
```
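One optional hardening step, my addition rather than part of the original tutorial: make requests give up on a stalled connection and fail loudly on a bad HTTP status instead of handing BeautifulSoup an error page. Both calls are standard requests API:

```python
res = requests.get(url=url, headers=headers, timeout=10)  # give up after 10 s
res.raise_for_status()  # raises requests.HTTPError on any 4xx/5xx response
```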
2. Regular expressions
The first things to know (a short demo follows the tables):
character | meaning |
---|---|
. | Matches any single character (excluding the newline \n) |
\ | Escape character (turns a character with a special meaning into its literal meaning) |
[...] | Character set; matches any one character from the set |

Predefined character set | meaning |
---|---|
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches a non-digit character. Equivalent to [^0-9]. |
\s | Matches any whitespace character, including spaces, tabs and page breaks. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
\w | Matches any word character, including the underscore. Equivalent to [A-Za-z0-9_]. |
\W | Matches any non-word character. Equivalent to [^A-Za-z0-9_]. |

Quantifier | meaning |
---|---|
* | Matches the previous character 0 or more times |
+ | Matches the previous character 1 or more times |
? | Matches the previous character 0 or 1 times |
{m} | Matches the previous character exactly m times |
{m,n} | Matches the previous character m to n times |

Boundary matcher | meaning |
---|---|
^ | Matches the beginning of a string (or of a line in re.M mode) |
$ | Matches the end of a string (or of a line in re.M mode) |
\A | Matches only at the beginning of the whole string |
\Z | Matches only at the end of the whole string |
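A quick self-contained demonstration of a few rows from these tables (the sample strings are arbitrary):

```python
import re

print(re.findall(r'\d+', 'abc123def45'))        # \d plus +  -> ['123', '45']
print(re.findall(r'\w+', 'hello_world, hi!'))   # \w keeps the underscore -> ['hello_world', 'hi']
print(re.findall(r'[aeiou]{2,3}', 'queueing'))  # set + {m,n} -> ['ueu', 'ei']
print(bool(re.match(r'^\d+$', '2021')))         # anchored: string is all digits -> True
print(bool(re.match(r'^\d+$', '20a21')))        # -> False
```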
But the most important two are `.*` and `.*?`:
```python
import re

# .*  matches as many characters as possible (greedy)
# .*? matches as few characters as possible (lazy)
s = 'xxIxxhelloxxLovexx1345hxxPythonxxch'

infos = re.findall('xx(.*?)xx', s)
print('.*?:', end='')
print(infos)

infos1 = re.findall('xx(.*)xx', s)
print('.*:', end='')
print(infos1)
```
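Running it prints:

```
.*?:['I', 'Love', 'Python']
.*:['IxxhelloxxLovexx1345hxxPython']
```

The lazy pattern stops at the nearest closing `xx`, so it collects three short captures; the greedy one swallows everything up to the last `xx`.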
The same Top 500 crawl, this time with regular expressions; the complete code:
```python
import re
import requests
import time

level = 1
# precompile the regular expression; (?P<name>...) names the captured group "name",
# and re.S lets . match newlines as well
obj = re.compile(r'<li class=" " title="(?P<name>.*?)" data-index="', re.S)
with open('kugou.txt', 'w+', encoding='utf-8') as f:
    for i in range(1, 24):
        url = 'https://www.kugou.com/yy/rank/home/' + str(i) + '-8888.html?from=rank'
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64'}
        res = requests.get(url=url, headers=headers)
        page_content = res.text
        # run the pattern over the page; finditer() returns an iterator of matches
        result = obj.finditer(page_content)
        for it in result:
            # fetch the named group with group(); it holds 'singer - song'
            author = it.group("name").split('-', 1)[0].strip()
            title = it.group("name").split('-', 1)[1].strip()
            f.write('No.' + str(level) + ' Song: ' + title + ' Singer: ' + author + '\n')
            level = level + 1
        print('Crawling page ' + str(i) + '...')
        time.sleep(1)
print('Crawling completed')
```
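To see what the named group `(?P<name>...)` captures in isolation, here is a sketch on a single hypothetical li tag (the title value is made up):

```python
import re

li = '<li class=" " title="Wen Yixin - Song A" data-index="0">'
m = re.search(r'title="(?P<name>.*?)" data-index="', li)
print(m.group("name"))     # Wen Yixin - Song A
author, title = [part.strip() for part in m.group("name").split('-', 1)]
print(author, '|', title)  # Wen Yixin | Song A
```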
3. XPath crawl
Here we simply crawl the first page only:
```python
import requests
from lxml import etree

url = 'https://www.kugou.com/yy/rank/home/1-8888.html?from=rank'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64'}
res = requests.get(url=url, headers=headers)

# build an element tree from the response text
tree = etree.fromstring(res.text, parser=etree.HTMLParser())
# tree = etree.parse(res.text)  # would fail here: parse() expects a file, not a string

content = tree.xpath('//*[@id="rankWrap"]/div[2]/ul/li/@title')
for each in content:
    print(each)
```
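To check what that XPath selects without hitting the network, here is a toy fragment shaped like the tutorial's description of the page (the structure is assumed, not verified against the live site):

```python
from lxml import etree

# made-up markup: a rankWrap div whose second child div holds the song list
html = '''
<div id="rankWrap">
  <div>toolbar</div>
  <div>
    <ul>
      <li title="Wen Yixin - Song A"></li>
      <li title="Singer B - Song B"></li>
    </ul>
  </div>
</div>
'''
tree = etree.fromstring(html, parser=etree.HTMLParser())
# //*[@id="rankWrap"] finds the div by id; div[2]/ul/li walks to each list
# item; @title extracts the attribute value as a plain string
for title in tree.xpath('//*[@id="rankWrap"]/div[2]/ul/li/@title'):
    print(title)  # Wen Yixin - Song A, then Singer B - Song B
```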