Crawling graduate tutor information from China Jiliang University

Keywords: Python Windows encoding

I learned a lot while crawling this site and brushed up on some Python basics along the way.
First, str.join(): it makes cleaning up crawled text much easier.
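For example, join() glues the list of text fragments that an XPath text() query returns into one string (a minimal sketch with made-up data, not the real tutor pages):

```python
# str.join concatenates an iterable of strings with the separator,
# which is exactly what you want for the fragment lists xpath returns.
fragments = ['Name: Alice', 'Title: Professor', 'Lab: Metrology']
blob = '\n'.join(fragments)
print(blob)  # three fragments merged into one newline-separated string
```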
Second, the following:: axis in XPath. It selects every node that comes after the current node in document order, which is very handy.
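A small demonstration of the following:: axis on a made-up table fragment (the element names and attributes here are invented for illustration):

```python
from lxml import etree

# following:: selects every node after the context node in document
# order, so from the first <td> it reaches all later cells.
html = etree.HTML(
    '<table>'
    '<tr><td align="center">Name</td><td>Alice</td></tr>'
    '<tr><td>Title</td><td>Professor</td></tr>'
    '</table>'
)
anchor = html.xpath('//td[@align="center"]')[0]
texts = anchor.xpath('following::td/text()')
print(texts)  # every <td> text after the anchor cell
```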
Then there is a puzzle I haven't solved: when I use string() to crawl all of a tutor's information, I only get one line, so I have to use text() instead. I hope it's just a problem with my VS Code setup.
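A toy comparison of the two (made-up HTML, not the tutor pages) may be related to this puzzle, though I can't be sure: XPath's string() returns a single string value, taken from the first matched node only, while text() returns every text node as a separate list item, which preserves the line structure:

```python
from lxml import etree

html = etree.HTML(
    '<div class="resume"><p>First line</p><p>Second line</p></div>'
)
# string() yields one concatenated string from the first matched node...
whole = html.xpath('string(//div[@class="resume"])')
# ...while text() returns each text node as its own list element.
parts = html.xpath('//div[@class="resume"]/p/text()')
print(whole)  # the lines run together into one string
print(parts)  # the lines stay separate
```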
In the future, unless something special comes up, I won't post code for crawling school websites; it isn't very interesting.
Here is the code:

import requests
from lxml import etree
import time
from pymongo import MongoClient

def download(url):
    """Fetch a page and return its parsed lxml tree."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    time.sleep(1)  # be polite: pause between requests
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return etree.HTML(r.text)

def w_d(lines):
    """Join a list of text fragments into one newline-separated string."""
    return '\n'.join(lines)

def write_down(teacher_info):
    """Save one tutor's record into MongoDB."""
    client = MongoClient()
    db = client.jiliang_teacher_data
    collection = db.teacher_info
    collection.insert_one(teacher_info)
    # 'Full name:' is the field label as scraped from the page
    print('Downloading: ' + teacher_info['Full name:'])


def deep_spider(url):
    """Scrape a single tutor's detail page."""
    selector = download(url)
    a = selector.xpath('//*[@align="center"]')
    tag = a[1].xpath('following::td/span/text()')   # field labels
    cont = a[2].xpath('following::td/text()')       # field values
    info = {}
    # Pad the value list so every label gets an entry even when
    # some cells on the page are empty.
    for _ in range(1, 10):
        cont.append(' ')
    for i in range(len(tag)):
        info[tag[i]] = cont[i]
    # The long free-text sections sit in fixed rows of the detail table.
    keti = selector.xpath('/html/body/div[3]/div/table/tr[9]/td/p/text()')       # research topics
    huojiang = selector.xpath('/html/body/div[3]/div/table/tr[11]/td/p/text()')  # awards
    jinqi = selector.xpath('/html/body/div[3]/div/table/tr[13]/td/p/text()')     # recent publications
    zhuchi = selector.xpath('/html/body/div[3]/div/table/tr[15]/td/p/text()')    # projects led
    jianli = selector.xpath('/html/body/div[3]/div/table/tr[17]/td/p/text()')    # curriculum vitae
    qita = selector.xpath('/html/body/div[3]/div/table/tr[19]/td/p/text()')      # other
    info['topic'] = w_d(keti)
    info['Prize winning'] = w_d(huojiang)
    info['Main achievements published recently'] = w_d(jinqi)
    info['Scientific research projects presided over and completed'] = w_d(zhuchi)
    info['Curriculum vitae'] = w_d(jianli)
    info['Other'] = w_d(qita)
    write_down(info)
 
def spider_zong(url):
    """Scrape the index page and visit every tutor's detail link."""
    selector = download(url)
    xueyuan = selector.xpath('//*[@align="center"]/b/text()')[1:]  # college names (not used further)
    detail_url = selector.xpath('//*[@target="_blank"]/@href')[8:]
    # Drop external links and bare anchors; a comprehension avoids the
    # pitfall of removing items from a list while iterating over it.
    detail_url = [i for i in detail_url if 'http' not in i and '#' not in i]
    for i in detail_url:
        order_url = 'https://yjsb.cjlu.edu.cn/yjsy/daoshi/{}'.format(i)
        deep_spider(order_url)

spider_zong('https://yjsb.cjlu.edu.cn/yjsy/daoshi/index.jspx')
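A side note on filtering the link list: calling remove() on a list while you iterate over it makes the iterator skip neighbours of removed items, so a single naive loop leaves stragglers behind. A quick sketch with made-up links (not the real hrefs) shows the pitfall and the one-pass comprehension that avoids it:

```python
# Buggy pattern: mutating the list being iterated skips elements.
links = ['a.jspx', 'http://x.com', '#', 'b.jspx', '#']
for i in links:
    if 'http' in i or '#' in i:
        links.remove(i)
buggy = links
print(buggy)  # one '#' slips through because iteration skipped it

# Safe, idiomatic version: build a new filtered list instead.
links = ['a.jspx', 'http://x.com', '#', 'b.jspx', '#']
clean = [i for i in links if 'http' not in i and '#' not in i]
print(clean)  # only the relative detail links remain
```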

That's about it overall; nothing new here.

Posted by Obsession on Sat, 02 Nov 2019 19:00:52 -0700