Crawling the Migu music chart with Python 3 (with source code)

Keywords: Python Windows

This post follows the approach of my previous article on crawling Xiaozhu short-term rentals (https://www.cnblogs.com/aby321/p/9946831.html) and continues practicing basic crawler techniques. This time I crawl the Migu music chart.

Migu music chart home page: http://music.migu.cn/v2/music/billboard/?_from=migu&page=1

Note: this program sometimes raises an error when it is run again. I have not identified the cause.
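
The cause is not diagnosed here, but two expressions in the script can raise exceptions when a page comes back different from what the selectors expect: `soup.select(...)[0]` raises `IndexError` on an empty result, and `.string` is `None` when a tag has more than one child, so `.strip()` then raises `AttributeError`. Below is a minimal sketch of a more forgiving lookup, assuming that this is the failure mode (an assumption, not a confirmed diagnosis). `first_text` is a hypothetical helper name and is not part of the original script.

```python
def first_text(matches, default=''):
    """Return the stripped text of the first match, or `default` if there is none.

    `matches` is expected to be the list returned by soup.select(...);
    any object with a get_text() method works.
    """
    if not matches:
        # An empty result means the selector matched nothing on this page.
        return default
    # get_text() joins all nested strings, so it does not return None
    # the way .string does for a tag with several children.
    return matches[0].get_text().strip()
```

With such a helper, a call like `first_text(soup.select('span.song-name-text'), default='unknown')` would yield a placeholder instead of stopping the whole crawl, which also makes it easier to see which page caused the problem.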

 

Unlike the Xiaozhu short-term rental example, the ranking is not available on each song's detail page. It has to be read from the paginated list page (lines 19-25 of the code). zip() pairs each rank with its matching link, and each pair is passed to the function get_info().
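
The pairing step can be sketched without any network access. The ranks and hrefs below are made-up sample values standing in for what the two `select()` calls return, and `pair_ranks_with_links` is a hypothetical helper name. One thing worth knowing is that `zip()` stops at the shorter input, so if the two selectors match different numbers of elements, the extra items are dropped silently; this sketch checks the lengths first.

```python
from urllib.parse import urljoin

BASE_URL = 'http://music.migu.cn'


def pair_ranks_with_links(ranks, hrefs, base_url=BASE_URL):
    """Pair each rank with the absolute URL of its song page."""
    if len(ranks) != len(hrefs):
        # zip() would silently truncate here, so fail loudly instead.
        raise ValueError(f'{len(ranks)} ranks but {len(hrefs)} links')
    return [(rank.strip(), urljoin(base_url, href))
            for rank, href in zip(ranks, hrefs)]


# Made-up sample data, for illustration only (not real Migu paths)
sample_ranks = [' 1 ', ' 2 ']
sample_hrefs = ['/v2/music/song/aaa', '/v2/music/song/bbb']

pairs = pair_ranks_with_links(sample_ranks, sample_hrefs)
for rank, link in pairs:
    print(rank, link)
```

`urljoin` is used here in place of plain string concatenation; for site-relative hrefs that start with `/` the result is the same.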

 

 

 1 """
 2 A typical paginated website: the Migu music chart
 3 The script sometimes raises an error and sometimes runs normally; the cause is unknown
 4 """
 5 import requests
 6 from bs4 import BeautifulSoup as bs
 7 import time
 8 
 9 headers = {
10     'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
11 }
12 
13 # Collect the URL of each song; the parameter is the URL of one paginated list page
14 def get_link(url):
15     html_data = requests.get(url, headers = headers)
16     soup = bs(html_data.text, 'lxml')  # lxml is the parser recommended by the bs4 documentation
17     #print(soup.prettify())   # Pretty-print the HTML actually received (it may differ from what the browser shows, and the page's markup may be irregular). Selectors must match this output; if a selector fails, uncomment this line to inspect it
18     links = soup.select('#js_songlist > div > div.song-name > span > a')  # Select at the level that repeats once per song
19     ranks = soup.select('#js_songlist > div > div.song-number ')  # The song detail page has no ranking, so the rank is read from the list page
20     #print(ranks)
21     for rank, link in zip(ranks,links):  # zip pairs each rank with its matching link
22         rank = rank.get_text()
23         link = 'http://music.migu.cn' + link.get('href')  # The hrefs are relative, so the http://music.migu.cn prefix has to be added manually
24         #print(rank,link)
25         get_info(rank,link)
26 
27 # Fetch the details of one song (rank, title, singer, album); the url parameter is the song's detail-page URL
28 def get_info(rank,url):
29     html_data = requests.get(url, headers = headers)
30     soup = bs(html_data.text, 'lxml')  # lxml is the parser recommended by the bs4 documentation
31     # print(soup.prettify())   # Pretty-print the HTML actually received (it may differ from what the browser shows). If a selector fails, uncomment this line to inspect it
32     title = soup.select('div.container.pt50 > div.song-data > div.data-cont > div.song-name > span.song-name-text')[0].string.strip()
33 
34     # The selector copied from the browser looks like "body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em", but selecting with it returns no data (I don't know why). Drop the leading body, or use the shortest form below (only the nearest uniquely identifying div)
35     # title = soup.select('div.pho_info > h4 > em ')
36     # select() returns a list, so take an element from it (usually [0]). That element is still a tag, so extract its text with get_text() or .string
37     singer = soup.select('div.container.pt50 > div.song-data > div.data-cont > div.song-statistic > span > a')[0].string.strip()
38     cd = soup.select('div.container.pt50 > div.song-data > div.data-cont > div.style-like > div > span > a')[0].string.strip()  # Album name: the text content of the link
39 
40     # Collect the details into a dictionary
41     data = {
42         'ranking':rank,
43         'Song name':title,
44         'singer':singer,
45         'Album':cd
46     }
47     print(data)
48 
49 
50 # Program entry point
51 if __name__=='__main__':
52     for number in range(1,3):
53         url = 'http://music.migu.cn/v2/music/billboard/?_from=migu&page={}'.format(number)   # Build the paginated list URL (not a song detail URL)
54         get_link(url)
55         time.sleep(1)

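Output: each record is printed as a dictionary. On Python 3.7 and later, dictionaries preserve insertion order, so the fields print in the order they are defined; on older versions the field order can vary between runs. If you need a guaranteed order on any version, use a list.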
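
If a fixed field order matters, for example when saving the results, one option is to name the fields explicitly. Below is a minimal sketch using `csv.DictWriter` from the standard library. The row values are placeholders rather than real chart data, and the in-memory buffer stands in for a real output file.

```python
import csv
import io

# The column order is stated once, independent of how each dict was built.
FIELDS = ['ranking', 'Song name', 'singer', 'Album']

# Placeholder record with its keys deliberately out of order
rows = [
    {'singer': 'Singer A', 'Album': 'Album A', 'ranking': '1', 'Song name': 'Song A'},
]

buffer = io.StringIO()  # stands in for open('chart.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The columns come out in the order given by `FIELDS`, whatever the key order of the individual dictionaries.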

Applying this elsewhere: the same crawler template can be reused for other paginated websites of the same type, such as the Douban movie top 100 or the Mtime charts.
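
The reusable part of the template is mostly the page loop. Below is a minimal sketch, assuming the target site exposes the page number as a single `{}` placeholder in the URL; `page_urls` is a hypothetical helper name. Note that `range(1, 3)` in the script covers pages 1 and 2 only, so the helper takes an inclusive last page to make the bound explicit.

```python
def page_urls(template, first_page, last_page):
    """Build the list-page URLs for first_page..last_page (inclusive)."""
    return [template.format(n) for n in range(first_page, last_page + 1)]


MIGU_TEMPLATE = 'http://music.migu.cn/v2/music/billboard/?_from=migu&page={}'

urls = page_urls(MIGU_TEMPLATE, 1, 2)
for url in urls:
    print(url)
```

For another site, only the template string and the selectors inside the two scraping functions would need to change; how each site actually encodes its page number has to be checked in the browser first.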

 

PS: I don't know whether this chart is accurate. In any case, I had not heard of it before.

Posted by toyfruit on Sun, 08 Dec 2019 09:42:50 -0800