Netease Music Spider
This is just an introductory article; seasoned crawler writers can safely skip it.
Web crawlers are something I had wanted to study for a long time, but laziness always won.
Recently some beginners writing crawlers have been using my website for practice.
The logs show they are having a hard time, so I decided to study how to write a crawler myself (I won't admit it was homework).
This article crawls and analyses the data of 3,024,511 songs from nearly 20,000 popular playlists.
Why Crawl NetEase Cloud Music
Because it is really simple, and especially suitable for beginners:
- hardly any anti-crawler defence
- wide-open interfaces
Of course, after digging further I found that NetEase has done some work on privacy control.
What to Crawl
First, let's define our goal: what data do we want to crawl?
- The most-played songs across all the playlists?
  - a possible reference for what to listen to next
- Which kinds of singles are the most popular?
  - a way to spot trending tastes
- Maybe there are other angles worth digging into
In short, it pays to tell yourself why you are doing something before you do it.
How to Get the Data
Think about how we usually come to listen to a song:
- a friend shares it with us
- we know the song's name and search for it
- we hear it on a playlist
Each scenario corresponds to a set of APIs.
Sorting out the business scenarios is critical when writing a crawler.
Let's first analyse the playlist-to-song scenario.
Sharing a playlist amounts to sharing its playlist id, the playListId.
NetEase Cloud does a decent job here: instead of providing an API interface directly, it wraps the information in an HTML page, [https://music.163.com/discover/playlist?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=1](https://music.163.com/discover/playlist?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=1), which raises the bar for writing crawlers.
The `data-res-id` attribute found on the `a` tag inside the `div` is the playListId we need.
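That extraction can be sketched as follows. The full script below uses BeautifulSoup; here a stdlib regex over a hand-written HTML fragment (the fragment is an assumption, loosely modelled on the real markup) is enough to show the idea:

```python
import re

# Hypothetical fragment of the playlist page; on the real site the
# playlist link sits inside a <div class="bottom"> element.
html = '''
<ul id="m-pl-container">
  <li><div class="bottom"><a data-res-id="2201795627" href="/playlist?id=2201795627">list A</a></div></li>
  <li><div class="bottom"><a data-res-id="988690134" href="/playlist?id=988690134">list B</a></div></li>
</ul>
'''

# Pull out every data-res-id value: these are the playListIds.
play_list_ids = re.findall(r'data-res-id="(\d+)"', html)
print(play_list_ids)  # ['2201795627', '988690134']
```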
Once the playListId is known, call [http://music.163.com/api/playlist/detail?id=1](http://music.163.com/api/playlist/detail?id=1) to get the required song information.
The response really is JSON, so what follows is just JSON parsing.
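A minimal sketch of that parsing step, run against a hand-written response in the shape the crawler below relies on (the field names `code`, `result`, `tracks`, `name` come from the script; the values here are made up):

```python
import json

# A made-up response imitating the shape of the detail interface.
raw = '''{
  "code": 200,
  "result": {
    "name": "some playlist",
    "tracks": [
      {"name": "Time"},
      {"name": "Faded"}
    ]
  }
}'''

data = json.loads(raw)
if data["code"] == 200:  # 200 means the playlist exists
    names = [t["name"] for t in data["result"]["tracks"]]
    print(names)  # ['Time', 'Faded']
```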
Multithreading
Python is really slow.
In early tests a single request took about 2 s; at that rate, crawling 20,000 playlists would have taken forever.
So I reluctantly wrote a multithreaded version:
```python
threadings = []
for id in self.urlslist:
    work = threading.Thread(target=self.get_detail, args=(id,))
    threadings.append(work)
for work in threadings:
    work.start()
for work in threadings:
    work.join()
```
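The same fan-out/join pattern can be written more compactly with `concurrent.futures` from the standard library; a sketch with a dummy task standing in for `self.get_detail`:

```python
from concurrent.futures import ThreadPoolExecutor

def get_detail(category):
    # placeholder for the real per-category crawl
    return category.upper()

categories = ["Chinese", "Rock", "ACG"]

# map() starts the workers and joins them for us, preserving input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(get_detail, categories))

print(results)  # ['CHINESE', 'ROCK', 'ACG']
```

Bounding `max_workers` also avoids the thread explosion described in the FAQ below.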
Code
```python
# -*- coding: utf-8 -*-
# @Author: gunjianpan
# @Date:   2018-10-12 20:00:17
# @Last Modified by:   gunjianpan
# @Last Modified time: 2018-10-14 21:53:46
import re
import threading
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup


class Get_list():
    def __init__(self):
        # playlist categories (machine-translated from the Chinese originals;
        # the live site expects the Chinese strings in the cat parameter)
        self.urlslist = [
            "whole", "Chinese", "Europe and America", "Japanese", "Korean",
            "Cantonese", "Small language", "Popular", "Rock", "Ballad",
            "Electronics", "Dance music", "Rap", "Light music", "Sir",
            "Rural", "R&B/Soul", "classical", "Nation", "England", "Metal",
            "Punk", "blues", "Reggae", "World Music", "Latin",
            "Alternative/Independent", "New Age", "Antiquity", "Rear swing",
            "Bossa Nova", "Early morning", "night", "Study", "work",
            "Noon break", "Afternoon tea", "metro", "drive", "motion",
            "travel", "Take a walk", "Bar", "Nostalgia", "fresh", "romantic",
            "sexy", "Sentimental", "Cure", "Relax", "lonely", "Be moved",
            "Excitement", "happy", "Be quiet", "miss",
            "Film and Television Original Sound", "ACG", "children",
            "Campus", "Game", "70 after", "80 after", "90 after",
            "Internet songs", "KTV", "Classic", "Cover up", "Guitar",
            "Piano", "instrumental music", "List", "00 after"]
        self.headers = {
            'Host': "music.163.com",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            'Referer': "http://music.163.com/",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.0 Safari/537.36"}
        self.time = 0

    def file_name(self, id):
        # "/" and "&" are awkward in file names; keep only the prefix
        return re.split(r'[/&]', id)[0]

    def run_list(self):
        """Crawl every category page and collect playlist ids."""
        start = time.time()
        threadings = []
        for id in self.urlslist:
            work = threading.Thread(target=self.get_lists, args=(id,))
            threadings.append(work)
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        end = time.time()
        print(end - start)

    def get_lists(self, id):
        """Page through one category and append its playlist ids to <id>.txt."""
        f = open(self.file_name(id) + '.txt', 'a')
        count = 0
        while True:
            url = "http://music.163.com/discover/playlist/?order=hot&cat=" + \
                urllib.parse.quote_plus(id) + "&limit=35&offset=" + str(count)
            html = requests.get(url, headers=self.headers, verify=False).text
            try:
                table = BeautifulSoup(html, 'html.parser').find(
                    'ul', id='m-pl-container').find_all('li')
            except AttributeError:
                break  # no playlist container: we are past the last page
            ids = []
            for item in table:
                ids.append(item.find('div', attrs={'class': 'bottom'}).find(
                    'a').get('data-res-id'))
            count += 35
            f.write(str(ids) + '\n')
        f.close()

    def get_id(self, list_id, file_d):
        """Fetch one playlist's detail and append its track names."""
        url = 'http://music.163.com/api/playlist/detail?id=' + str(list_id)
        data = requests.get(url, headers=self.headers, verify=False).json()
        if data['code'] != 200:
            return []
        musiclist = ""
        for track in data['result']['tracks']:
            musiclist += (track['name'] + '\n')
        file_d.write(musiclist)
        self.time = self.time + 1

    def get_detail(self, id):
        """Read one category's playlist ids and fetch every detail."""
        threadings = []
        f = open(self.file_name(id) + ".txt", 'r')
        file_d = open(self.file_name(id) + "data.txt", 'a')
        for line in f.readlines():
            for list_id in eval(line.replace('\n', '')):
                work = threading.Thread(
                    target=self.get_id, args=(list_id, file_d))
                threadings.append(work)
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        file_d.close()
        print(self.time)

    def run_detail(self):
        self.time = 0
        start = time.time()
        threadings = []
        for id in self.urlslist:
            work = threading.Thread(target=self.get_detail, args=(id,))
            threadings.append(work)
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        end = time.time()
        print(end - start)
        print(self.time)
```
Running It
It is recommended to run this inside Docker; never try it on a physical machine.
```shell
$ docker pull ipython/notebook
$ docker run -it -d --name gunjianpan-ipython10.13-1 -p 40968:80 ipython/notebook
$ docker ps
CONTAINER ID  IMAGE             COMMAND         CREATED        STATUS        PORTS                            NAMES
72fea94b0149  ipython/notebook  "/notebook.sh"  3 seconds ago  Up 2 seconds  8888/tcp, 0.0.0.0:40968->80/tcp  gunjianpan-ipython10.14
$ docker exec -it 72f /bin/bash
```
```python
> import netease_music
> a = netease_music.Get_list()
# obtain the playListId lists
> a.run_list()
# obtain the detail of every playlist
> a.run_detail()
```
Then aggregate the data:
```shell
$ awk '{print $0}' *data.txt | sort | uniq -c | sort -nr >> total.txt
```
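The `sort | uniq -c | sort -nr` step is just a frequency count; the same aggregation can be done in Python with `collections.Counter`. A sketch over a few made-up lines standing in for the contents of the `*data.txt` files:

```python
from collections import Counter

# stand-ins for the collected lines (one song name per line)
lines = ["Time", "Faded", "Time", "Alone", "Time", "Faded"]

counts = Counter(lines)
for name, n in counts.most_common():
    print(n, name)
# 3 Time
# 2 Faded
# 1 Alone
```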
FAQ
- `tracks` in the second interface's JSON only contains the first entries
This is probably anti-crawler handling tied to the request headers, though I don't know how the crawler is being recognised and given a special response.
- Killed during the run
When the kill first appeared, data.txt had grown too big to open with f.open().
So the output had to be split across multiple files.
- Thread errors during the run
```
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/music.py", line 69, in get_id
    data = requests.get(url, headers=self.headers, verify=False).json()
  File "/usr/local/lib/python3.4/dist-packages/requests/models.py", line 800, in json
    self.content.decode(encoding), **kwargs
  File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.4/json/decoder.py", line 361, in raw_decode
    raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)
```
It looks like the detail for that playListId is empty; the playlist has probably been deleted.
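One way to survive such empty responses is to guard the JSON decoding instead of letting the thread die. A sketch, where `safe_json` is a hypothetical helper, not part of the script above:

```python
import json

def safe_json(text):
    """Return the decoded body, or None when it is not valid JSON
    (e.g. the empty page served for a deleted playlist)."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None

print(safe_json('{"code": 200}'))  # {'code': 200}
print(safe_json(''))               # None
```

The crawler would then skip the playlist whenever `safe_json` returns None.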
- Lagging on a physical machine
Never run this on a physical machine.
When I ran it in Docker, the thread count reached about 18 and the fan was already whining.
When I started it directly on the physical machine, the thread count exploded and the machine froze completely; even after the run finished, the threads were not fully released and everything stayed sluggish.
Multithreaded jobs like this belong on a server.
netease_music really is the most basic of crawlers. There is still a long way to go.
Analysing the Data
A total of 3,024,511 songs were collected; 556,432 remain after deduplication.
As you can see, the repetition rate is quite high, and a lot of the most-collected songs are ones we all know.
Finally, I'll attach the list of the top 50+ songs.
```
1224 Time
1186 Something Just Like This
1152 Alone
1129 Intro
1072 Shape of You
1062 You
1061 Hello
1026 Closer
 965 Stay
 913 Home
 802 Faded
 777 Counting Stars
 765 Animals
 757 Without You
 757 Nevada
 752 Scattered and scattered
 702 Forever
 690 Higher
 685 Summer
 683 The rest of life
 673 Victory
 667 Rain
 662 Sugar
 647 Fade
 645 Life
 636 いつも何度でも
 625 Fire
 621 Unity
 611 Hope
 607 It's windy (Cover 高橋優)
 606 Try
 604 アイロニ
 601 Havana
 596 HandClap
 594 As boundless as the sea and sky
 593 The truth that you leave
 583 See You Again
 578 Please Don't Go
 577 Dreams
 574 Hero
 566 Despacito (Remix)
 565 Seve
 563 Lullaby
 560 That Girl
 553 Beautiful Now
 553 Angel
 552 Viva La Vida
 552 Let It Go
 550 Light trap
 550 Let Me Love You
 550 Alive
 549 meet
 548 Breathe
 544 Luv Letter
 534 I Love You
 532 We Don't Talk Anymore
 532 A Little Story
 528 Superstar
 528 Journey
 523 Maps
 521 Trip
 520 Memories
 518 Goodbye
 516 Horizon
 516 Flower Dance
 514 Summertime
 512 Dragline play
 508 #Lov3 #Ngẫu Hứng
 506 Like you
 503 Can it be
 503 Uptown Funk
```