I want to crawl to my heart's content

Keywords: Attribute

  • Hello again! Exams and all kinds of things delayed me a little, but I've finally finished another project
  • It follows the post by the Zhihu expert Wakingup
  • Without further ado, here's the code
'''
    //Following the Zhihu expert's tutorial

    //Code author: Gao Jiale

'''
##Import the libraries we need first
import re                                      ##re: regular expressions
import requests                                ##requests: fetch pages by URL
from bs4 import BeautifulSoup                  ##bs4: HTML parsing
import time                                    ##crawlers should be polite, so we sleep between requests
import random                                  ##originally I planned to rotate request headers at random; tried it, it works
import os                                      ##one folder per page, one file per picture
from lxml import etree                         ##lxml: xpath support

##Define a crawler class
class spider():
    ##Initialization
    def __init__(self):                        ##constructor
        self.url = 'http://tieba.baidu.com'     ##initialise with the Tieba home page as the base URL
        return
    ##Fetch the source of a page
    def getUrl(self,url):                       ##fetch a page (the parameter is the url to fetch)
        html = requests.get(url)                ##send the GET request with requests.get
        html_text = html.text                   ##html.text is the page source as a string
        return html_text                        ##return the source so callers can use it

    ##Get the title of the post
    def title(self,url):
        html = self.getUrl(url)                 ##fetch the page source with getUrl
        soup = BeautifulSoup(html,'lxml')       ##parse it with BeautifulSoup, using the lxml parser
        title = soup.find(class_="core_title_txt")##find the node with class="core_title_txt"
        #print('The title of this post is:',title.get_text())    ##commented out; kept around for debugging
        return title.get_text()                     ##return the text content of the title node


    ##Get the total number of pages of the post
    def pages(self,url):                        ##this function returns the total page count
        html = self.getUrl(url)                 ##fetch the page source with getUrl
        soup = etree.HTML(html)                 ##parse it with etree.HTML so that we can use xpath
        pages = soup.xpath('.//div[@id="thread_theme_5"]//li[@class="l_reply_num"]/span[last()]')##regular expressions and bs4 both felt clumsy here, so I used xpath; I only just started with it, bear with me
        ##the xpath above selects the last span inside the li with class="l_reply_num", inside the div with id="thread_theme_5"
        page = pages[0]                         ##xpath returns a list, and we only need the first (and only) match
        return page.text                        ##return the text of that span, i.e. the total number of pages



    ##Get the author and content of each post
    def content(self,url):                      ##get the posts on a page; the parameter is the page url
        html = self.getUrl(url)                 ##fetch the page source with getUrl
        ##I used a regular expression here. It is a bit clumsy: it works on the post I tested, but it may break on others. xpath or bs4 would be more robust
        zhengze = re.compile('<li.*?_author_name.*?_blank.*?(.*?)</a>.*?d_post_content.*?save_face_bg_hidden.*?d_post_content j_d_post_content.*?>(.*?)</div>',re.S)
        contents = re.findall(zhengze,html)     ##apply the pattern to the whole page; findall returns a list of (author, content) tuples
        number = 1 ##counter to keep track of the post number on the page
        for i in contents:
            print('Post #%d\nAuthor: %s\nContent: %s\n'%(number,i[0],i[1]))
            print()
            number+=1

    ##Get pictures
    def img(self,url):                          ##collect the picture URLs on a page
        html = self.getUrl(url)                 ##fetch the page source with getUrl
        soup = BeautifulSoup(html,'lxml')       ##parse it with BeautifulSoup and lxml
        images = soup.select('img[class="BDE_Image"]')##select every img tag with class="BDE_Image"; select returns a list of nodes
        number = len(images)                    ##how many pictures this page has
        if number > 0:
            print('Nice, there are %s pictures here; grab them and run!'%number)
        else:
            print('No pictures here, moving on')
        number = 1                              ##reset the counter
        imgs = []                               ##an empty list to collect the picture addresses; the nodes above carry other attributes too, and we only want src
        for image in images:                    ##iterate over the matched img nodes
            img = image.get('src')              ##get('src') extracts just the src attribute, i.e. the bare URL
            print('Crawling picture #%d:'%number,img)##report the URL of the current picture
            imgs.append(img)                    ##append the bare URL to the list created above
            number+=1                           ##bump the counter
        return imgs                             ##return the list of bare picture URLs


    ##Create a folder per page of pictures
    def make(self,name):                        ##create a folder
        dir = os.path.join(os.getcwd(),'folder_'+str(name))  ##build the path: os.getcwd() gives the current working directory, and we append a per-page folder name
        # strip leading spaces
        dir = dir.strip()
        # strip any trailing backslash
        dir = dir.rstrip('\\')
        ##check whether the directory already exists
        panduan = os.path.exists(dir)               ##os.path.exists returns True if the directory exists, False otherwise
        if panduan:                                 ##it already exists
            print(dir,'already exists')
        else:                                       ##otherwise create it
            os.makedirs(dir)                        ##create the folder; note the path is absolute
            print(dir,'created successfully')

        return dir                                  ##return the folder path so the caller can chdir into it


    ##Save pictures
    def saveimage(self,url):                        ##download and save the pictures on a page
        images = self.img(url)                      ##images is the list of bare picture URLs returned by img()
        num = 1                                     ##counter for naming the files
        for img in images:                          ##loop over the bare URLs
            fil = str(num) + '.jpg'                 ##name each picture <counter>.jpg
            html = requests.get(img).content        ##a URL alone cannot be written to disk; .content gives the response body as bytes, which is what an image file needs
            f = open(fil,'wb')                      ##open the file in binary write mode; it is created if it does not exist
            f.write(html)                           ##write the picture bytes
            f.close()                               ##and close the file; always close your files
            num+=1                                  ##bump the counter
            print('%s saved successfully'%fil)                ##report the save

    ##Controller: drive the whole crawl for one post
    def all(self,number):                           ##this method ties all the others together
        ##the address of a post is self.url + '/p/' + post number
        url = self.url+'/p/'+str(number)            ##base URL plus post number; this is also page 1
        ##get the title
        title = self.title(url)                     ##fetch the title with title()
        print('The title of this post is: %s'%title)

        ##get the total number of pages
        page = int(self.pages(url))                 ##pages() returns a string, so convert it to an integer for the range below

        ##walk through every page, grabbing content and pictures
        num = 1                                     ##page counter
        print('Page 1')
        ##get the content
        self.content(url)                           ##page 1 has no pn parameter, so it is handled separately; print its posts and authors
        self.img(url)                               ##report the picture URLs on page 1
        dir = self.make(num)                        ##create a folder for this page (one folder per page)
        os.chdir(dir)                               ##switch into the folder we just created
        self.saveimage(url)                         ##save the pictures of page 1
        num+=1                                      ##page 1 is done; continue from page 2
        for i in range(num,page+1):                 ##from page 2 up to and including the last page; range stops before its end value, hence page+1
            url_2 = url + '?pn=' + str(i)           ##every page after the first is base URL + post number + '?pn=' + page number
            print('Page %d'%i)                      ##report the current page
            time.sleep(1)                           ##crawlers should be polite, so sleep for 1 second; I considered 0.1, but for the sake of world peace let's keep it gentle
            self.content(url_2)                     ##print the posts on page i
            self.img(url_2)                         ##report the picture URLs on page i
            dir = self.make(i)                      ##create the folder for page i
            os.chdir(dir)                           ##switch into it
            time.sleep(1)                           ##politeness again
            self.saveimage(url_2)                   ##save the pictures of page i
            time.sleep(1)                           ##and once more, politeness



img = spider()
img.all(int(input('Enter the post number: ')))
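The paging rule that the controller relies on (page 1 is the bare post URL, every later page appends `?pn=<page number>`) can be sketched as a tiny standalone helper. The function name `page_url` is my own for illustration, not part of the original script:

```python
# A minimal sketch of Tieba's paging scheme as used in all() above:
# page 1 is the bare post URL, later pages append '?pn=<page number>'.
# The helper name page_url is illustrative, not from the original script.
def page_url(base, post_id, page):
    """Return the URL for the given page of a Tieba post."""
    url = base + '/p/' + str(post_id)
    if page > 1:                       # page 1 carries no pn parameter
        url += '?pn=' + str(page)
    return url

print(page_url('http://tieba.baidu.com', 123456, 1))
print(page_url('http://tieba.baidu.com', 123456, 3))
```

Keeping the page arithmetic in one place like this also makes the `range(num, page+1)` loop easier to test in isolation.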
  • (A small aside from the heart; feel free to skip.)
    My school switched teachers and started on front-end examples. Between you and me, my major is UI design, and a UI designer who can't code isn't a good artist, so I've decided to study the front end properly. They say it helps avoid friction with programmers at work later, haha. Of course I also want to keep working on crawlers; opportunities like this don't come easily.

  • The target here is a post from the travel forum on Tieba: we crawl its pictures and text, so you can admire the scenery without leaving home. Honestly, tapping out code in the dorm beats going on a trip anyway. One caveat: the regular expression I used to grab the author and content works for the post I tested, but it may fail on others. I haven't had time to rework it in xpath (and I'm not sure how well that would go). Sorry about that.

  • Well, that's it for this one. See you next time. Keep at it, come on, work hard!
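Since the regex in `content()` is admittedly fragile, here is a rough sketch of what an xpath version could look like. The class names `d_name` and `d_post_content` are assumptions about Tieba's markup and may need checking against the live page source:

```python
# Hedged sketch of an xpath alternative to the regex in content().
# The class names 'd_name' and 'd_post_content' are assumptions about
# Tieba's markup and may need adjusting against the live page.
from lxml import etree

def content_xpath(html):
    """Return a list of (author, text) pairs for the posts on a page."""
    tree = etree.HTML(html)
    authors = tree.xpath('//li[contains(@class, "d_name")]//a/text()')
    posts = tree.xpath('//div[contains(@class, "d_post_content")]')
    results = []
    for author, post in zip(authors, posts):
        text = post.xpath('string(.)').strip()  # string(.) flattens nested tags
        results.append((author, text))
    return results
```

`zip` pairs each author with the post body in document order, so the two queries must each match exactly one node per post; that is the main assumption to verify before swapping this in.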

Posted by sualcavab on Sun, 09 Feb 2020 10:02:52 -0800