1. Basic steps of crawling web pages
1.1 Determine the data to crawl
A crawler's job is to extract the data you need from a large number of web pages, so the first step is to decide exactly what that data is. Take second-hand house prices as an example: if we want to run a cross-sectional regression analysis of second-hand house prices in Changsha, we need to identify the factors related to the price: the layout of the flat (how many rooms and halls? south-facing? is there a balcony? newly renovated?), the location of the flat, the unit price (yuan/m²), the length of the property rights, and so on.
1.2 Determine the web pages to crawl
Crawlers are fast, but you cannot search the whole Internet. To collect as much of the data you need as possible, you have to determine the sites where that data appears. Continuing with the second-hand house example: once we know what data we are looking for, we can search for second-hand house prices on Baidu and Google and find a few sites that carry the data, such as Anjuke and Lianjia. These commercial sites then become the crawling targets.
1.3 Analyze the pages
First, analyze whether the site can be crawled at all. Write a simple script that fetches pages automatically and set a random user agent for a preliminary crawl; the code does not need to be detailed at this stage, for example it can extract only the page title. If the site's anti-crawling measures are strict, it may block your ip after a few visits, force you to log in again, or show a captcha. The login problem can be handled by saving the account and password in cookies, and captchas can be cracked by simulating input or by paid automatic solving services, but none of this stops the site from banning your ip; the best remedy for that is to use proxies. If you can crawl around 12,000 pages without your ip being banned, the site's data can be crawled. A minimal sketch of such a preliminary probe is given below.

Then locate the useful data in the pages. The data visible in the browser is not necessarily written in the html; it may arrive as json or come from a database, so you have to check the page source. The browser's "Inspect" panel alone is not enough, because it shows the rendered result; you must use "View page source" to see the raw html. Work out which part of the page holds the useful data. Pages of the same kind generally share the same structure, so once the root url is determined you can traverse these pages starting from the root.
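A hedged sketch of such a preliminary probe with requests and a random user agent; the url and the user-agent pool here are placeholders, not part of the original project:

```python
import random
import requests

# Placeholder user-agent pool; any set of real browser strings will do
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
]

def probe(url):
    """Fetch one page with a random user agent and report whether it looks blocked."""
    headers = {'user-agent': random.choice(USER_AGENTS)}
    response = requests.get(url, timeout=3, headers=headers)
    # A 403/503 status or a redirect to a login/captcha page hints that anti-crawling kicked in
    print(url, response.status_code, len(response.text))
    return response

# probe('https://example.com/some-listing-page')
```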
1.4 Crawl the web pages
A crawler can traverse pages in two ways, depth first or breadth first. Starting from the root, the crawler walks the web graph page by page, and the traversal order affects the quality of what you collect. Suppose the crawl is limited to 1000 pages: with depth-first search, one topic may be over-represented while other topics are barely covered or missing entirely, whereas breadth-first search spreads the pages evenly across topics. Writing the crawler breadth-first is therefore recommended.
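A generic traversal sketch illustrating the difference; `get_links` is a hypothetical helper that returns a page's outgoing urls:

```python
from collections import deque

def traverse(root, get_links, limit=1000, breadth_first=True):
    """Visit up to `limit` pages starting from `root`; get_links(url) is assumed to return outgoing urls."""
    frontier = deque([root])
    seen = {root}
    visited = []
    while frontier and len(visited) < limit:
        # popleft() -> FIFO frontier -> breadth-first; pop() -> LIFO frontier -> depth-first
        url = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```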
During the crawl you also need to discard a lot of junk pages, or restrict the crawl to one particular kind of page. The usual approach is to restrict the domain name and filter urls with a regular expression, as in the sketch below.
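A minimal sketch of such a filter; the allowed domain and the url pattern are illustrative only:

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAIN = 'cs.lianjia.com'                       # illustrative domain restriction
DETAIL_PATTERN = re.compile(r'/ershoufang/\d+\.html$')  # illustrative detail-page pattern

def is_wanted(url):
    """Keep a url only if it stays on the allowed domain and matches the detail-page pattern."""
    parsed = urlparse(url)
    return parsed.netloc.endswith(ALLOWED_DOMAIN) and bool(DETAIL_PATTERN.search(parsed.path))

# is_wanted('https://cs.lianjia.com/ershoufang/104112345.html')  # -> True
# is_wanted('https://cs.lianjia.com/zufang/104112345.html')      # -> False
```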
Each fetched page is then parsed. There are many parsing approaches, such as regular expressions, BeautifulSoup, and XPath; the goal is the same. Regular expressions are the most basic tool and must be mastered; the latter two both parse the html/xml tree structure, so learning one of them is enough. When parsing with BeautifulSoup or XPath, pay attention to how general the selectors are and avoid hard-coded positional indexing, which amounts to hard-wiring the page layout: an advertisement inserted in the middle of the page can shift the indices and make the data unreachable, as the example below illustrates.
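A small BeautifulSoup sketch of this point; the html snippet and class names are invented for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div class="ad">Sponsored</div>
<div class="houseInfo"><span class="price">12000</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Fragile: relies on the element sitting at a fixed position,
# so an inserted advertisement shifts the index and breaks it.
# price = soup.find_all('div')[0].span.text

# Robust: select by class/attribute, which survives layout changes.
price = soup.find('div', class_='houseInfo').find('span', class_='price').text
print(price)
```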
1.5 Save the data
Decide how to save the crawled data: in a database? As json? xlsx? csv? Each type of data can go into its own folder. Should the file be opened in binary mode? With which encoding? All of this needs to be thought through in advance; a small csv example is sketched below.
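A minimal sketch of the csv option, assuming utf-8-sig encoding so the file opens cleanly in Excel; the file name, header, and rows are placeholders:

```python
import csv

rows = [
    ('Community A', 'Yuhua', '3 rooms 2 halls', '12000'),   # placeholder records
    ('Community B', 'Tianxin', '2 rooms 1 hall', '9800'),
]

# newline="" avoids blank lines on Windows; utf-8-sig keeps Chinese text readable in Excel
with open('second_hand_house.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'position', 'type', 'price'))  # header row
    writer.writerows(rows)
```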
2. A basic crawler with requests
- requests
```python
import requests

response = requests.get(url, timeout=1, headers=headers1)
```
- header
```python
headers1 = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
```
- regular expression
```python
import re

response.encoding = 'utf-8'
html = response.text
# Parse the page: collect every href and src link
urls = re.findall('href="(.*?)"', html)
urls += re.findall('src="(.*?)"', html)
```
- Saving the results
dir_name="blog_url/asdf"+str(cnt_page)+".txt" with open(dir_name, 'w',encoding='utf-8') as f: f.write(url+" ") for url in urls: str1 = url.split('.')[-1] if "blog.sina" in url and (('/' in str1) or 'html'in str1 or "cn" in str1): f.write(url+" ")
- Extensive search & de duplication
```python
from queue import Queue

cnt_page = 0
is_history = set()   # urls that have already been visited
url_queue = Queue()  # FIFO queue -> breadth-first traversal


def spide_xx(url):  # crawl from a root page and save each page under dir_name
    global cnt_page
    url_queue.put(url)
    while url_queue.empty() == False and cnt_page < 10000:
        try:
            cnt_page = cnt_page + 1
            cur_url = url_queue.get()
            is_history.add(cur_url)
            get_url_txt(cur_url)         # save the page's text (helper defined elsewhere)
            urls = get_url_src(cur_url)  # extract the page's links (helper defined elsewhere)
            for url in urls:
                suffix = url.split('.')[-1]
                if "blog.sina" in url and (('/' in suffix) or 'html' in suffix or "cn" in suffix):
                    if (url in is_history) == False:
                        print(url)
                        url_queue.put(url)
        except:
            print("Access timeout!")
    return
```
3. The Scrapy framework
Installation is not covered here; assuming Scrapy is already installed, let's use the Lianjia website as an example to crawl the basic information of second-hand houses in Changsha and walk through the basic usage of Scrapy.
Run `scrapy startproject xxxx` to create a Scrapy crawler project.
`cd xxxx`
Run `scrapy genspider -t crawl xxx <domain>` to create a CrawlSpider-based spider.
Write the driver script main.py:
```python
# main.py
from scrapy import cmdline

cmdline.execute('scrapy crawl housing_price_crawl'.split())
```
This script starts the crawler: in `scrapy crawl xxx`, "xxx" is the name of the spider to launch, here `housing_price_crawl`, which is defined in housing_price_crawl.py.
Briefly, what each generated .py file does:
- housing_price_crawl.py parses the pages and defines the crawling rules;
- items.py wraps the data to be crawled in a class so it can be passed around as objects, so it has to be defined in advance;
- middlewares.py holds the project's own middleware, mainly used against anti-crawling measures: random user agents and random proxy ips are set here;
- pipelines.py stores the crawled data in batches; the storage rules for item objects are defined here;
- settings.py configures the Scrapy framework: whether to obey robots.txt, which middlewares and pipelines to enable, the user agent, how many pages to crawl before stopping, depth-first versus breadth-first, and so on.
Write housing_price_crawl.py. Because the page-flipping url is wrapped in a div and the exact url cannot be extracted, LinkExtractor cannot pick up all the urls of the current page, so the url of the next page has to be parsed by hand. From each listing page we extract the community name, the location, the rooms and halls, and the unit price. It is recommended to check whether an xpath or regular expression is correct in the scrapy shell.
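For reference, a scrapy shell session for checking selectors might look roughly like this; the url and xpaths are the ones used below and may no longer match the live page:

```python
# Started from the command line with:  scrapy shell "https://cs.lianjia.com/ershoufang/"
# Inside the shell, `response` is already bound to the fetched page,
# so candidate selectors can be tried interactively:
response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").get()
response.xpath("//span[@class='unitPriceValue']/text()").get()
```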
```python
# items.py
import scrapy


class ChangshahousingpriceItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    type = scrapy.Field()
    price = scrapy.Field()
```
```python
# -*- coding: utf-8 -*-
# housing_price_crawl.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ChangShaHousingPrice.items import ChangshahousingpriceItem
import json
from scrapy.loader import ItemLoader


class HousingPriceCrawlSpider(CrawlSpider):
    name = 'housing_price_crawl'
    allowed_domains = ['cs.lianjia.com']
    start_urls = ['https://cs.lianjia.com/ershoufang/yuhua/rs%E9%95%BF%E6%B2%99/']

    rules = (
        # 长沙 is the search keyword "Changsha" in the url
        Rule(LinkExtractor(allow=r'.*/ershoufang/.*/rs长沙/'), callback='page_request', follow=True),
        # Rule(LinkExtractor(allow=r'https://cs.lianjia.com/ershoufang/\d+.html'), callback='parse_item', follow=False),
    )

    def page_request(self, response):
        # link = LinkExtractor(allow=r'https://cs.lianjia.com/ershoufang/\d+.html')
        # print(link.extract_links(response))
        root_path = response.xpath("//div[@class='page-box house-lst-page-box']/@page-url").get()
        max_page = response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").get()
        if max_page is not None:
            max_page = json.loads(max_page)
            max_page = max_page["totalPage"]
            root_path += '/'
            for i in range(1, max_page + 1):
                path = root_path.replace('{page}', str(i))
                path = 'https://cs.lianjia.com' + path
                print(path)
                yield scrapy.Request(path, callback=self.page_info)

    def page_info(self, response):
        link = LinkExtractor(allow=r'https://cs.lianjia.com/ershoufang/\d+.html')
        urls = link.extract_links(response)
        for url in urls:
            url = url.url
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=ChangshahousingpriceItem(), response=response)
        l.add_xpath('name', "//div[@class='communityName']/a[@class='info ']/text()")
        l.add_value('position', " ".join(response.xpath("//div[@class='areaName']/span[@class='info']/a[@target='_blank']/text()").getall()))
        l.add_value('type', response.xpath("//div[@class='mainInfo']/text()").get())
        l.add_value('price', response.xpath("//span[@class='unitPriceValue']/text()").get() + response.xpath("//span[@class='unitPriceValue']/i/text()").get())
        # item = ChangshahousingpriceItem()
        # item['name'] = response.xpath("//div[@class='communityName']/a[@class='info ']/text()").get()
        # item['position'] = " ".join(response.xpath("//div[@class='areaName']/span[@class='info']/a[@target='_blank']/text()").getall())
        # item['type'] = response.xpath("//div[@class='mainInfo']/text()").get()
        # item['price'] = response.xpath("//span[@class='unitPriceValue']/text()").get() + response.xpath("//span[@class='unitPriceValue']/i/text()").get()
        return l.load_item()
```
Set a random user agent in the downloader middleware (middlewares.py):
```python
# middlewares.py
import random

from scrapy import signals


class ChangshahousingpriceDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2816.400',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
        'Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.8.1.11) Gecko/20080118 Firefox/2.0.0.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Deepnet Explorer 1.5.3; Smart 2x2; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.1)',
        'ELinks/0.9.3 (textmode; Linux 2.6.9-kanotix-8 i686; 127x41)',
        'Mozilla/5.0 (X11; U; Linux x86_64; it-it) AppleWebKit/534.26+ (KHTML, like Gecko) Ubuntu/11.04 Epiphany/2.30.6',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13',
        'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_5_8) AppleWebKit/537.3+ (KHTML, like Gecko) iCab/5.0 Safari/533.16',
        'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.13) Gecko/20100916 Iceape/2.0.8',
        'Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20121201 icecat/17.0.1',
        'Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20121202 Firefox/17.0 Iceweasel/17.0.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.21pre) Gecko K-Meleon/1.7.0',
    ]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
Pipeline that saves the data to a csv file (pipelines.py):
```python
# pipelines.py
import csv
import os


class ChangshahousingpricePipeline(object):
    def __init__(self):
        # Location of the csv file; it does not need to be created in advance
        store_file = os.path.dirname(__file__) + '/spiders/Second hand house price in Changsha.csv'
        # Open (create) the file
        self.file = open(store_file, 'w+', newline="", encoding='utf-8')
        # csv writer
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write the item's fields as one csv row
        self.writer.writerow((item['name'], item['position'], item['type'], item['price']))
        return item

    def close_spider(self, spider):
        self.file.close()
```
settings.py configuration:
```python
# settings.py
BOT_NAME = 'ChangShaHousingPrice'

SPIDER_MODULES = ['ChangShaHousingPrice.spiders']
NEWSPIDER_MODULE = 'ChangShaHousingPrice.spiders'

ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'ChangShaHousingPrice.middlewares.ChangshahousingpriceDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'ChangShaHousingPrice.pipelines.ChangshahousingpricePipeline': 300,
}
```
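The page-count and breadth-first knobs mentioned earlier also live in settings.py. A hedged example; the values are illustrative, and switching the scheduler queues to FIFO is one standard way to get breadth-first order:

```python
# Stop the spider after roughly this many downloaded pages (handled by the CloseSpider extension)
CLOSESPIDER_PAGECOUNT = 1000

# Crawl breadth-first instead of the default depth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```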
Result: