Chapter 1 Introduction to Web Crawlers
- Requests and Beautiful Soup crawl python.org
- urllib3 and Beautiful Soup crawl python.org
- Scrapy crawls python.org
- Selenium and PhantomJS crawl python.org
First, confirm that you can open https://www.python.org/events/python-events/ in your browser.
Install requests and bs4, and then we can start Example 1: Requests and Beautiful Soup crawl python.org. If you run into problems during installation, search for a solution online first.
#!python
pip3 install requests bs4
Requests and Beautiful Soup crawl python.org
- Goal: scrape the name, location, and time of each event from https://www.python.org/events/python-events/.
01_events_with_requests.py
#!python
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
Execution results:
#!python
$ python3 01_events_with_requests.py
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May 2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May 2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June 2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June 2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June 2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June 2018'}
Note: since the event listings on the page change over time, your results may differ from run to run.
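If you want to experiment with the parsing logic offline, you can feed Beautiful Soup a static HTML string instead of a live page. A minimal sketch (the markup below is invented, mirroring the structure of the events page):

```python
from bs4 import BeautifulSoup

# Invented HTML snippet shaped like the python.org events list
html = '''
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="#">PyCon Example 2018</a></h3>
    <p><span class="event-location">Example City</span> <time>01 June</time></p>
  </li>
</ul>
'''

# html.parser is the stdlib parser, so lxml is not required here
soup = BeautifulSoup(html, 'html.parser')
for event in soup.find('ul', {'class': 'list-recent-events'}).findAll('li'):
    details = {
        'name': event.find('h3').find('a').text,
        'location': event.find('span', {'class': 'event-location'}).text,
        'time': event.find('time').text,
    }
    print(details)
```

The same find/findAll calls as in 01_events_with_requests.py apply; only the input source differs.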
Exercise: use requests to scrape the 10 post titles on the front page of the blog at https://china-testing.github.io/.
Reference Answer:
01_blog_title.py
#!python
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.findAll('article')
    for event in events:
        event_details = {}
        event_details['name'] = event.find('h1').find("a").text
        print(event_details)

get_upcoming_events('https://china-testing.github.io/')
Execution results:
#!python
$ python3 01_blog_title.py
{'name': '10 Minute Society API test'}
{'name': 'python Quick Start Course on Data Analysis 4-Data aggregation'}
{'name': 'python Quick Start Course on Data Analysis 6-reforming'}
{'name': 'python Quick Start Course on Data Analysis 5-Processing missing data'}
{'name': 'python Introduction to Library-pytesseract: OCR Optical Character Recognition'}
{'name': "Beginner's Advice on Software Automated Testing"}
{'name': 'Use opencv Conversion 3 d picture'}
{'name': 'python opencv3 Example(Object Recognition and Augmented Reality)2-Edge Detection and Application of Image Filter'}
{'name': 'numpy Learning Guide 3 rd3:Common functions'}
{'name': 'numpy Learning Guide 3 rd2:NumPy Basics'}
urllib3 and Beautiful Soup crawl python.org
Code: 02_events_with_urlib3.py
#!python
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    soup = BeautifulSoup(res.data, 'html.parser')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
Requests is built on top of urllib3, so in practice you usually just use requests directly.
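You can check this relationship in the interpreter: requests re-exposes the urllib3 it uses under requests.packages, so both names refer to the same module. A small check, not needed for scraping itself:

```python
import requests
import urllib3

# requests aliases the urllib3 module it depends on under requests.packages,
# so the two names point at the same module object
print(requests.packages.urllib3 is urllib3)
```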
Scrapy crawls python.org
Scrapy is a very popular open source Python scraping framework for extracting data. It provides these core capabilities along with many other built-in modules and extensions, and it is also our preferred tool for scraping with Python.
Scrapy offers many powerful features worth mentioning:
- Built-in extensions to generate HTTP requests and handle compression, authentication, caching, manipulation of user agents and HTTP headers
- Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for using regular expressions to select content and links.
- Encoding support for handling international languages and non-standard encoding declarations
- Flexible APIs for reusing and writing custom middleware and pipelines, which provide a clean and simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in file systems, S3, databases, and so on
There are several ways to use Scrapy. One is the programmatic pattern, where we create the crawler and spider in our own code. Scrapy projects can also be generated, configured, and run from the command line. This book follows the programmatic pattern, since it keeps each example in a single file.
Code: 03_events_with_scrapy.py
#!python
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
Exercise: use scrapy to scrape the 10 post titles on the front page of the blog at https://china-testing.github.io/.
Reference Answer:
03_blog_with_scrapy.py
#!python
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
Selenium and PhantomJS crawl python.org
04_events_with_selenium.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
By replacing driver = webdriver.Chrome() with driver = webdriver.PhantomJS('phantomjs'), the script runs without opening a browser window:
05_events_with_phantomjs.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
However, Chrome's headless mode in Selenium is a better replacement for PhantomJS.
04_events_with_selenium_headless.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')