I learned a lot while crawling this website, and it gave me a better grasp of some Python basics.
First, str.join(): it makes the scraped text fragments much easier to handle.
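For example, join() takes a list of scraped strings (the sample fragments below are made up) and glues them into one block of text:

```python
# Scraped pages usually yield a list of text fragments (sample data here);
# '\n'.join() turns them into a single newline-separated string.
fragments = ['Name: Alice', 'Title: Professor', 'Email: a@example.com']
profile = '\n'.join(fragments)
print(profile)
# Name: Alice
# Title: Professor
# Email: a@example.com
```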
Second, the following:: axis in XPath. It selects every node that comes after the current node in document order, which is very handy.
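A minimal sketch of the following:: axis, using a made-up table similar in shape to the teacher pages: starting from the centered header cell, it reaches the <td>/<span> cells that appear later in the document.

```python
from lxml import etree

# Toy HTML standing in for a teacher-info table (hypothetical markup)
html = etree.HTML(
    '<table>'
    '<tr><td align="center">Name</td></tr>'
    '<tr><td><span>Alice</span></td></tr>'
    '</table>'
)
anchor = html.xpath('//*[@align="center"]')[0]
# following:: walks every node after the anchor in document order
print(anchor.xpath('following::td/span/text()'))  # ['Alice']
```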
Then there's a puzzle I haven't figured out: when I used string() to grab all of a tutor's information, I only got a single line back, so I had to use text() instead. I hope it's just a problem with my VS Code setup.
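One likely explanation (an assumption on my part, not confirmed from the page itself): XPath's string() converts only the first node of a node-set to its string value, while text() returns every matching text node. A small sketch with made-up HTML:

```python
from lxml import etree

# Hypothetical snippet with several paragraphs, like a tutor's bio
html = etree.HTML('<div><p>line 1</p><p>line 2</p></div>')

# text() returns a list with every matching text node
print(html.xpath('//div/p/text()'))   # ['line 1', 'line 2']

# string() collapses the node-set to the string value of the FIRST node only
print(html.xpath('string(//div/p)'))  # line 1
```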
Going forward, unless something special comes up, I won't post the code for crawling school websites, since it isn't very interesting.
Here is the code:
import time

import requests
from lxml import etree
from pymongo import MongoClient


def download(url):
    """Fetch a page and return a parsed lxml tree."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    time.sleep(1)  # be polite: at most one request per second
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return etree.HTML(r.text)


def w_d(lines):
    """Join a list of scraped text fragments into one string."""
    return '\n'.join(lines)


def write_down(teacher_imfo):
    """Store one teacher's record in MongoDB."""
    client = MongoClient()
    db = client.jiliang_teacher_data
    collection = db.teacher_imfo
    collection.insert_one(teacher_imfo)
    print('Downloading: ' + teacher_imfo['Full name:'])


def deep_spider(url):
    """Scrape a single teacher's detail page."""
    selector = download(url)
    a = selector.xpath('//*[@align="center"]')
    # field labels follow the second centered block, values follow the third
    tag = a[1].xpath('following::td/span/text()')
    cont = a[2].xpath('following::td/text()')
    # pad the value list so every label gets an entry even when cells are empty
    cont.extend([' '] * 10)
    info = {}
    for i in range(len(tag)):
        info[tag[i]] = cont[i]
    keti = selector.xpath('/html/body/div[3]/div/table/tr[9]/td/p/text()')
    huojiang = selector.xpath('/html/body/div[3]/div/table/tr[11]/td/p/text()')
    jinqi = selector.xpath('/html/body/div[3]/div/table/tr[13]/td/p/text()')
    zhuchi = selector.xpath('/html/body/div[3]/div/table/tr[15]/td/p/text()')
    jianli = selector.xpath('/html/body/div[3]/div/table/tr[17]/td/p/text()')
    qita = selector.xpath('/html/body/div[3]/div/table/tr[19]/td/p/text()')
    info['topic'] = w_d(keti)
    info['Prize winning'] = w_d(huojiang)
    info['Main achievements published recently'] = w_d(jinqi)
    info['Scientific research projects presided over and completed'] = w_d(zhuchi)
    info['Curriculum vitae'] = w_d(jianli)
    info['Other'] = w_d(qita)
    write_down(info)


def spider_zong(url):
    """Scrape the index page and follow every teacher's detail link."""
    selector = download(url)
    xueyuan = selector.xpath('//*[@align="center"]/b/text()')[1:]  # college names (unused for now)
    detail_url = selector.xpath('//*[@target="_blank"]/@href')[8:]
    # drop absolute links and anchors; calling remove() while iterating
    # skips elements, so filter into a new list instead
    detail_url = [i for i in detail_url if 'http' not in i and '#' not in i]
    for i in detail_url:
        order_url = 'https://yjsb.cjlu.edu.cn/yjsy/daoshi/{}'.format(i)
        deep_spider(order_url)


spider_zong('https://yjsb.cjlu.edu.cn/yjsy/daoshi/index.jspx')
That's about it. Nothing new here, generally speaking.