Scraping the Douban Music Top 250 with a Python crawler

Keywords: Python Ubuntu Windows network

I've been home for a long while and got restless, so I wanted to scrape some data to play with. My laptop used to dual-boot Win7 and Ubuntu 16.04. I planned to write code on Ubuntu, but since I came home it always boots to a purple screen. I searched Baidu for a fix but couldn't solve it. If some kind soul can walk me through it, there's a New Year red envelope in it for you!! Writing code on Win7 is painfully slow (the machine lags, and I'd rather not install Python on it). Today's target is the Douban Music Top 250, which is relatively simple and mainly for practice.

Code

import requests
import re
from bs4 import BeautifulSoup
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
douban = client['douban']
musictop = douban['musictop']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]

def get_url_music(url):
    wb_data = requests.get(url,headers=headers)
    soup = BeautifulSoup(wb_data.text,'lxml')
    music_hrefs = soup.select('a.nbg')
    for music_href in music_hrefs:
        get_music_info(music_href['href'])
        time.sleep(2)

def get_music_info(url):
    wb_data = requests.get(url,headers=headers)
    soup = BeautifulSoup(wb_data.text,'lxml')
    names = soup.select('h1 > span')
    authors = soup.select('span.pl > a')
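    # The live Douban page labels these fields in Chinese. 'Schools:' and 'Issue time:'
    # look like machine-translated forms of '流派:' and '发行时间:' (an assumption), so the
    # Chinese labels are used here; the separator may be a space or &nbsp;.
    styles = re.findall(r'<span class="pl">流派:</span>(?:&nbsp;|\s)*(.*?)<br />',wb_data.text,re.S)
    times = re.findall(r'<span class="pl">发行时间:</span>(?:&nbsp;|\s)*(.*?)<br />',wb_data.text,re.S)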
    contents = soup.select('span.short > span')
    if len(names) == 0:
        name = 'Defect'
    else:
        name = names[0].get_text()
    if len(authors) == 0:
        author = 'nameless'
    else:
        author = authors[0].get_text()
    if len(styles) == 0:
        style = 'Unknown'
    else:
        style = styles[0].split('\n')[0]
    if len(times) == 0:
        year = 'Unknown'
    else:
        # Keep only the year; named `year` so the imported `time` module is not shadowed.
        year = times[0].split('-')[0]
    if len(contents) == 0:
        content = 'nothing'
    else:
        content = contents[0].get_text()
    info = {
        'name':name,
        'author':author,
        'style':style,
        'time':year,
        'content':content
    }
    musictop.insert_one(info)

for url in urls:
    get_url_music(url)

1. A request header was added. At first I left it out and got no data after several debugging runs. It failed, then worked again later, so the network may have been the real cause.
2. This time the data comes from each item's detail page. My previous movie crawl did not do this, and part of the data was missing.
3. A lot of if statements are used to preprocess the data. Is there a cleaner way to do this? Suggestions from the experts are welcome.
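
On point 3, one possible way to shrink the repeated if/else blocks is a small helper that returns the first match or a fallback value. This is only a sketch: `first_or` is a hypothetical name that is not part of the code above, and the sample lists stand in for what `soup.select()` and `re.findall()` would return.

```python
def first_or(items, default, transform=None):
    """Return the first element of items (optionally transformed), or default if empty."""
    if not items:
        return default
    first = items[0]
    return transform(first) if transform else first

# Stand-in data (assumed): results as re.findall() might return them.
times = ['2008-06-17']
styles = []

year = first_or(times, 'Unknown', lambda s: s.split('-')[0])
style = first_or(styles, 'Unknown', lambda s: s.split('\n')[0])
print(year, style)
```

For the `soup.select()` results, the same helper could be called with `lambda tag: tag.get_text()` as the transform, e.g. `first_or(names, 'Defect', lambda tag: tag.get_text())`, so each field becomes a single line instead of a four-line if/else.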

Posted by maddogandnoriko on Sun, 15 Dec 2019 06:48:17 -0800
