Netease Music Spider

Keywords: Big Data JSON Docker IPython xml


Blog plug

This is just an introductory article. For the follow-up, please head over to Netease Music Spider for DB.

Crawlers are something I had wanted to study for a long time.

But laziness won, every time.

Recently, some novices learning to write crawlers have often come to my website to practice.

Watching them struggle in the logs was painful, so I decided to study how to write a crawler myself (I won't admit it's homework).

This post crawls and analyzes 3,024,511 songs from nearly 20,000 popular playlists.

Why crawl Netease Cloud Music

Because it's really simple.

Especially suitable for beginners:

almost no anti-crawler defenses,

and interfaces that are open to everyone.

Of course, further digging shows they have done some work on privacy control.

What to crawl

First, let's define our goals.

What data do we want to crawl?

  • Crawl the playlists that get listened to the most?
    • This could be a reference for what to listen to.
  • What kinds of songs are the most popular?
    • Find out what's trendy.
  • Maybe there are other angles to dig into.

In short, before doing something, it pays to be clear about what we're doing and why.

How to get data

Think about the process we use to listen to songs.

  1. A song shared by a friend
  2. You know the name of the song and you want to search for it.
  3. You hear a song on a song list

Each scenario corresponds to a set of APIs.

Clarifying the business scenarios is critical for writing crawlers.

Let's first analyze the playlist → song scenario.

Sharing a playlist is equivalent to sharing a playlist id: playListId.

Netease Cloud actually does a decent job here: rather than exposing an API directly, it embeds the information in HTML, e.g. [https://music.163.com/discover/playlist?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=1](https://music.163.com/discover/playlist?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=1), which raises the bar for writing crawlers.

The 'data-res-id' attribute found inside the 'div' is the required playListId.
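For instance, a minimal extraction sketch (the selectors mirror the full script below, so this will break whenever the page structure changes):

import requests
from bs4 import BeautifulSoup

# fetch one page of the hot-playlist HTML and print each data-res-id
url = ('https://music.163.com/discover/playlist'
       '?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset=1')
html = requests.get(url, headers={'Referer': 'http://music.163.com/'}).text
soup = BeautifulSoup(html, 'html.parser')
for li in soup.find('ul', id='m-pl-container').find_all('li'):
    print(li.find('div', attrs={'class': 'bottom'}).find('a').get('data-res-id'))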

Once the playListId is in hand, call [http://music.163.com/api/playlist/detail?id=1](http://music.163.com/api/playlist/detail?id=1) to get the song information you need.

Since what comes back is JSON, the real work is just JSON parsing.
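A minimal sketch of that call (the code/result/tracks fields are the same ones the full script below relies on):

import requests

# fetch one playlist's detail and print its track names
data = requests.get('http://music.163.com/api/playlist/detail?id=1',
                    headers={'Referer': 'http://music.163.com/'}).json()
if data.get('code') == 200:
    for track in data['result']['tracks']:
        print(track['name'])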

Multithreading

Python is really slow.

In early testing a single request took 2 s; at that rate, I figured crawling 20,000 playlists would take until the year of the monkey, month of the horse (i.e. forever).

So I had no choice but to write a multi-threaded version.

    # one thread per category; threadings collects them so they can be joined
    threadings = []
    for id in self.urlslist:
        work = threading.Thread(target=self.get_detail, args=(id,))
        threadings.append(work)
    for work in threadings:
        work.start()
    for work in threadings:
        work.join()
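One thread per task gets out of hand quickly (see the FAQ below). A gentler alternative is a bounded pool; here is a sketch with the standard library's ThreadPoolExecutor, where crawl_all, spider, and workers are illustrative names rather than part of the original script:

from concurrent.futures import ThreadPoolExecutor

def crawl_all(spider, ids, workers=16):
    # run at most `workers` category crawls at a time,
    # instead of spawning len(ids) threads at once
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(spider.get_detail, ids)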

Code

# -*- coding: utf-8 -*-
# @Author: gunjianpan
# @Date:   2018-10-12 20:00:17
# @Last Modified by:   gunjianpan
# @Last Modified time: 2018-10-14 21:53:46

import requests
from bs4 import BeautifulSoup
import re
import sqlite3
import threading
import json
import urllib.parse
import time


class Get_list():
    def __init__(self):
        # Netease playlist categories. These are English renderings of the
        # original Chinese names; the live site expects the Chinese names
        # (URL-encoded), so treat this list as illustrative.
        self.urlslist = ["All", "Chinese", "Western", "Japanese", "Korean", "Cantonese", "Minor languages", "Pop", "Rock", "Folk", "Electronic", "Dance", "Rap", "Light music", "Jazz", "Country", "R&B/Soul", "Classical", "Ethnic", "Britpop", "Metal", "Punk", "Blues", "Reggae", "World music", "Latin", "Alternative/Indie", "New Age", "Chinese classical style", "Post-rock", "Bossa Nova", "Early morning", "Night", "Study",
                         "Work", "Lunch break", "Afternoon tea", "Metro", "Driving", "Sports", "Travel", "Walking", "Bar", "Nostalgia", "Fresh", "Romantic", "Sexy", "Sentimental", "Healing", "Relaxing", "Lonely", "Moved", "Excited", "Happy", "Quiet", "Missing someone", "Film & TV soundtracks", "ACG", "Children", "Campus", "Games", "Post-70s", "Post-80s", "Post-90s", "Internet songs", "KTV", "Classics", "Covers", "Guitar", "Piano", "Instrumental", "Charts", "Post-00s"]
        self.headers = {
            'Host': "music.163.com",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            'Referer': "http://music.163.com/",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.0 Safari/537.36"}
        self.time = 0  # count of playlists successfully processed

    def run_list(self):
        start = time.time()
        threadings = []
        for id in self.urlslist:
            work = threading.Thread(target=self.get_lists, args=(id,))
            threadings.append(work)
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        end = time.time()
        print(end - start)

    def get_lists(self, id):
        # '/' and '&' are not safe in file names, so keep the part before them
        f = open(re.split('[/&]', id)[0] + '.txt', 'a')

        count = 0
        while True:
            url = "http://music.163.com/discover/playlist/?order=hot&cat=" + \
                urllib.parse.quote_plus(id) + "&limit=35&offset=" + str(count)
            html = requests.get(url, headers=self.headers, verify=False).text
            try:
                table = BeautifulSoup(html, 'html.parser').find(
                    'ul', id='m-pl-container').find_all('li')
            except AttributeError:
                break  # no more pages: find() returned None
            ids = []
            for item in table:
                ids.append(item.find('div', attrs={'class': 'bottom'}).find(
                    'a').get('data-res-id'))
            count += 35
            f.write(str(ids) + '\n')
        f.close()

    def get_id(self, list_id, file_d):
        url = 'http://music.163.com/api/playlist/detail?id=' + str(list_id)
        data = requests.get(url, headers=self.headers, verify=False).json()
        if data['code'] != 200:
            return []
        # collect every track name in this playlist, one per line
        musiclist = ""
        for track in data['result']['tracks']:
            musiclist += (track['name'] + '\n')
        file_d.write(musiclist)
        self.time = self.time + 1  # rough progress counter (not thread-safe)

    def get_detail(self, id):
        threadings = []
        # same file-name sanitizing as in get_lists
        filename = re.split('[/&]', id)[0]
        f = open(filename + ".txt", 'r')
        file_d = open(filename + "data.txt", 'a')
        for line in f.readlines():
            # each line is the str() of a list of playListIds written by get_lists
            for list_id in eval(line.replace('\n', '')):
                work = threading.Thread(
                    target=self.get_id, args=(list_id, file_d))
                threadings.append(work)
        f.close()
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        print(self.time)

    def run_detail(self):
        self.time = 0
        start = time.time()
        threadings = []
        for id in self.urlslist:
            work = threading.Thread(target=self.get_detail, args=(id,))
            threadings.append(work)
        for work in threadings:
            work.start()
        for work in threadings:
            work.join()
        end = time.time()
        print(end - start)
        print(self.time)

Running it

It is recommended to run this in Docker; never try it on a physical machine.

$ docker pull ipython/notebook
$ docker run -it -d --name gunjianpan-ipython10.13-1 -p 40968:80 ipython/notebook
$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS                             NAMES
72fea94b0149        ipython/notebook    "/notebook.sh"      3 seconds ago       Up 2 seconds        8888/tcp, 0.0.0.0:40968->80/tcp   gunjianpan-ipython10.14
$ docker exec -it 72f /bin/bash
# start a Python shell inside the container
$ python3
> import netease_music
> a=netease_music.Get_list()

# obtain playListId list
> a.run_list()

# obtain song detail lists
> a.run_detail()

# data handling
$ awk '{print $0}' *data.txt | sort | uniq -c | sort -nr >> total.txt
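For reference, a pure-Python equivalent of that pipeline (same *data.txt inputs and total.txt output as above):

import glob
from collections import Counter

# count how often each song name appears across all category files
counter = Counter()
for path in glob.glob('*data.txt'):
    with open(path) as f:
        counter.update(line.strip() for line in f if line.strip())

# write "count name" lines, most frequent first, like sort | uniq -c | sort -nr
with open('total.txt', 'a') as out:
    for name, cnt in counter.most_common():
        out.write('%7d %s\n' % (cnt, name))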

FAQ

  1. The tracks field in the second API's JSON only returned the first song

This turned out to be a problem with the request headers. It's presumably anti-crawler handling, though I don't know how they recognize the crawler and craft a special response.
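My hedged guess at the fix, based on the headers the class already sets: send a Referer and a browser-like User-Agent with the detail request.

import requests

# the detail request with browser-like headers; without a Referer the API
# appeared to return only the first track (see above)
headers = {
    'Referer': 'http://music.163.com/',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.0 Safari/537.36',
}
data = requests.get('http://music.163.com/api/playlist/detail?id=1',
                    headers=headers).json()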

  2. The process gets killed while running

The first time the kill happened, data.txt had grown too big to open with f.open(),

so the output had to be split across multiple files.

  3. A thread error occurs while running
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/music.py", line 69, in get_id
    data = requests.get(url, headers=self.headers, verify=False).json()
  File "/usr/local/lib/python3.4/dist-packages/requests/models.py", line 800, in json
    self.content.decode(encoding), **kwargs
  File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.4/json/decoder.py", line 361, in raw_decode
    raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)

It looks like the detail for that playListId is empty; perhaps the playlist has been deleted.
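A simple guard is to wrap the failing call from the traceback in a try/except (fetch_detail is a hypothetical helper, not part of the script):

import requests

def fetch_detail(url, headers):
    # skip playlists whose detail response is empty or not valid JSON,
    # e.g. playlists that have been deleted
    try:
        return requests.get(url, headers=headers, verify=False).json()
    except ValueError:  # the "Expecting value" error above
        return None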

  4. Lag when running on a physical machine

Never run this directly on a physical machine.

When I ran it in Docker, 18 threads were enough to make the fan whine.
When it was started on the physical machine, the thread count blew up and the machine froze completely; even after the run, the threads were not fully released and it stayed laggy.

So multi-threaded jobs like this should be done on a server.

netease_music is really the most basic of crawlers. There's still a long way to go.

Analyzing the data

A total of 3,024,511 songs were collected,

556,432 after deduplication.

As you can see, the duplication rate is quite high; in other words, a lot of these songs are familiar to all of us.

Finally, here is the list of the top 50+ songs.

1224 Time
1186 Something Just Like This
1152 Alone
1129 Intro
1072 Shape of You
1062 You
1061 Hello
1026 Closer
 965 Stay
 913 Home
 802 Faded
 777 Counting Stars
 765 Animals
 757 Without You
 757 Nevada
 752 Scattered and scattered
 702 Forever
 690 Higher
 685 Summer
 683 The rest of life
 673 Victory
 667 Rain
 662 Sugar
 647 Fade
 645 Life
 636 いつも何度でも
 625 Fire
 621 Unity
 611 Hope
 607 起风了 (Cover 高橋優)
 606 Try
 604 アイロニ
 601 Havana
 596 HandClap
 594 As boundless as the sea and sky
 593 The truth that you leave
 583 See You Again
 578 Please Don't Go
 577 Dreams
 574 Hero
 566 Despacito (Remix)
 565 Seve
 563 Lullaby
 560 That Girl
 553 Beautiful Now
 553 Angel
 552 Viva La Vida
 552 Let It Go
 550 Light trap
 550 Let Me Love You
 550 Alive
 549 meet
 548 Breathe
 544 Luv Letter
 534 I Love You
 532 We Don't Talk Anymore
 532 A Little Story
 528 Superstar
 528 Journey
 523 Maps
 521 Trip
 520 Memories
 518 Goodbye
 516 Horizon
 516 Flower Dance
 514 Summertime
 512 Dragline play
 508 #Lov3 #Ngẫu Hứng
 506 Like you
 503 Can it be
 503 Uptown Funk
