Crawling the Kugou Top 500 chart with Python

Keywords: Python crawler

1. Using BeautifulSoup

First, import the requests module and the BeautifulSoup module. In PyCharm, click the icon in the lower left corner, open the interpreter settings, and search for the package you want to install.
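If you prefer the command line, the same packages can be installed with pip (BeautifulSoup is published on PyPI as beautifulsoup4; lxml is the parser used later):

pip install requests beautifulsoup4 lxml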


Then, take crawling the Kugou Top 500 chart as an example. First, open the Kugou official website and find the Top 500 chart. Right-click and choose Inspect, or press F12, to view the page source. Comparing the page with the source, we can see that the information we want is in the element with class='pc_temp_songlist': under each li tag there is an a tag, which locates the song and artist information.


find() and find_all()

find() locates the first matching tag and returns it as a single target object; find_all() locates every matching tag and returns them as a list of target objects.

A target object lets us get at what is inside the tag. For example, once we have located the a tag, we can extract the song and the artist from it:

target.get('title') #Get the value of the title attribute
target.text #Get the text of the tag, here the song name (一路生花)
target.span.text #Get the text of the nested span, here the artist (温奕心)
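To make these calls concrete, here is a minimal, self-contained sketch. The HTML fragment is an assumption modeled on the songlist markup described above; the real page differs in detail:

from bs4 import BeautifulSoup

# A simplified stand-in for one row of the chart page (structure assumed)
html = '<ul><li><a class="pc_temp_songname" title="温奕心 - 一路生花">一路生花 <span>- 温奕心</span></a></li></ul>'
soup = BeautifulSoup(html, 'lxml')

target = soup.find('a', class_='pc_temp_songname')       # first matching tag (a single object)
targets = soup.find_all('a', class_='pc_temp_songname')  # every matching tag (a list)

print(target.get('title'))   # the title attribute: 温奕心 - 一路生花
print(target.text)           # all text inside the tag: 一路生花 - 温奕心
print(target.span.text)      # text of the nested span: - 温奕心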


Finally, the overall code is as follows

import requests
from bs4 import BeautifulSoup
import time
level=1
with open('kugou.txt', 'w+', encoding='utf-8') as f:
    #The chart spans pages 1 to 23
    for i in range(1, 24):
        url = 'https://www.kugou.com/yy/rank/home/' + str(i) + '-8888.html?from=rank'
        #Set headers to simply fake a browser
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        #Get the web page response
        res = requests.get(url=url, headers=headers)
        #Parsing using BeautifulSoup
        soup = BeautifulSoup(res.text, 'lxml')
        songlist = soup.find('div', class_='pc_temp_songlist').find_all('a', class_='pc_temp_songname')
        for each in songlist:
            #split() is used to split strings, and strip() is used to remove spaces at the beginning and end
            name = each.text.split()[0].strip()
            author = each.span.text.split()[1].strip()
            f.write('No.' + str(level) + '  Song: ' + name + '   Singer: ' + author + '\n')  #could also format the rank as 'No.{}'.format(level)
            level=level+1
        print('Crawling page'+str(i)+'...')
        time.sleep(1)
    print('Crawling completed')
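The split()/strip() logic in the loop is easiest to see on sample strings; the exact values below are assumptions matching the markup sketched earlier:

text = '一路生花 - 温奕心'               # assumed shape of each.text
print(text.split())                  # ['一路生花', '-', '温奕心']: split() breaks on whitespace
print(text.split()[0].strip())       # '一路生花': the song name
span_text = '- 温奕心'                # assumed shape of each.span.text
print(span_text.split()[1].strip())  # '温奕心': the singer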

2. Regular expressions

First, the basic syntax to know:

Character    Meaning
.            Matches any single character (excluding the newline \n)
\            Escape character (turns a character with a special meaning into a literal one)
[...]        Character set; matches any single character in the set

Predefined character set    Meaning
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\s    Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w    Matches any word character, including the underscore. Equivalent to [A-Za-z0-9_].
\W    Matches any non-word character. Equivalent to [^A-Za-z0-9_].

Quantifier    Meaning
*        Matches the preceding character 0 or more times
+        Matches the preceding character 1 or more times
?        Matches the preceding character 0 or 1 times
{m}      Matches the preceding character exactly m times
{m,n}    Matches the preceding character m to n times

Boundary matcher    Meaning
^     Matches the beginning of a string
$     Matches the end of a string
\A    Matches only at the beginning of the string
\Z    Matches only at the end of the string
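A quick sanity check of a few of these, as a minimal sketch:

import re

print(re.findall(r'\d+', 'room 42, floor 7'))    # ['42', '7']: \d matches digits
print(re.findall(r'\w+', 'hi_there, world!'))    # ['hi_there', 'world']: \w includes the underscore
print(re.findall(r'ab?c', 'ac abc abbc'))        # ['ac', 'abc']: ? makes the b optional
print(re.findall(r'^room', 'room 42, floor 7'))  # ['room']: ^ anchors at the start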

But the two most important ones to understand are .* and .*?

#.* matches as many characters as possible (greedy)
#.*? matches as few characters as possible (non-greedy)
import re

s='xxIxxhelloxxLovexx1345hxxPythonxxch'
infos=re.findall('xx(.*?)xx',s)
print('.*?:',end='')
print(infos)
infos1=re.findall('xx(.*)xx',s)
print('.*:',end='')
print(infos1)

Running this shows the difference:

.*?:['I', 'Love', 'Python']
.*:['IxxhelloxxLovexx1345hxxPython']

Crawling the Top 500 again, here is the overall code:

import re
import requests
import time

level=1
with open('kugou.txt', 'w+', encoding='utf-8') as f:
    for i in range(1, 24):
        url = 'https://www.kugou.com/yy/rank/home/' + str(i) + '-8888.html?from=rank'
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        res = requests.get(url=url, headers=headers)
        page_content=res.text
        #Preload the regular expression and name the matching content name
        obj=re.compile(r'<li class=" " title="(?P<name>.*?)" data-index="',re.S)
        #Match the regular expression and put the result in the iterator
        result=obj.finditer(page_content)
        for it in result:
            #Get it through the group() function
            author=it.group("name").split('-',1)[0].strip()
            title=it.group("name").split('-',1)[1].strip()
            f.write('No.' + str(level) + '  Song: ' + title + '   Singer: ' + author + '\n')
            level = level + 1
        print('Crawling page' + str(i) + '...')
        time.sleep(1)
    print('Crawling completed')
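The re.compile/finditer pattern with a named group is worth isolating. Here is a minimal sketch on a made-up string; the HTML fragment is an assumption mirroring the pattern above:

import re

page = '<li class=" " title="温奕心 - 一路生花" data-index="0">'  # assumed fragment
obj = re.compile(r'<li class=" " title="(?P<name>.*?)" data-index="', re.S)

for it in obj.finditer(page):
    full = it.group('name')   # the text captured by the named group
    author, title = [p.strip() for p in full.split('-', 1)]
    print(author, '|', title)  # 温奕心 | 一路生花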

3. Crawling with XPath

Here we simply crawl the first page:

import requests
from lxml import etree

url = 'https://www.kugou.com/yy/rank/home/1-8888.html?from=rank'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get(url=url, headers=headers)
tree = etree.fromstring(res.text, parser=etree.HTMLParser())
# etree.parse() expects a file or filename, so we parse the response text with fromstring() instead
content=tree.xpath('//*[@id="rankWrap"]/div[2]/ul/li/@title')
for each in content:
    print(each)
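To see what the XPath expression is doing without hitting the network, here is a minimal sketch on an inline fragment; the id and nesting mirror the expression above but are assumptions about the real page:

from lxml import etree

html = '''
<div id="rankWrap">
  <div></div>
  <div>
    <ul>
      <li title="温奕心 - 一路生花"></li>
      <li title="周杰伦 - 晴天"></li>
    </ul>
  </div>
</div>
'''

tree = etree.fromstring(html, parser=etree.HTMLParser())
# //*[@id="rankWrap"] finds the element with that id anywhere in the tree;
# /div[2]/ul/li/@title then collects the title attribute of every li under the second div
for title in tree.xpath('//*[@id="rankWrap"]/div[2]/ul/li/@title'):
    print(title)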

Posted by TheChief on Sun, 28 Nov 2021 08:36:38 -0800