Chapter 1 Introduction to Web Crawlers
- Requests and Beautiful Soup crawl python.org
- urllib3 and Beautiful Soup crawl python.org
- Scrapy crawls python.org
- Selenium and PhantomJS crawl python.org
First, confirm that you can open https://www.python.org/events/python-events/ in your browser.
Install requests and bs4, and then we can start Example 1: Requests and Beautiful Soup crawl python.org. If you run into problems during installation, search for a solution online first.
#!python
pip3 install requests bs4
Requests and Beautiful Soup crawl python.org
- Goal: scrape the name, location, and time of each event from https://www.python.org/events/python-events/.
01_events_with_requests.py
#!python
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
Execution results:
#!python
$ python3 01_events_with_requests.py
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May 2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May 2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June 2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June 2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June 2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June 2018'}
Note: since the event listings on the page change over time, your results may differ from run to run.
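If you want to experiment with the parsing logic offline, you can feed Beautiful Soup a static HTML string instead of a live page. A minimal sketch (the markup below is invented, mirroring the structure of the events page):

```python
from bs4 import BeautifulSoup

# Invented HTML snippet shaped like the python.org events list
html = '''
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="#">PyCon Example 2018</a></h3>
    <p><span class="event-location">Example City</span> <time>01 June</time></p>
  </li>
</ul>
'''

# html.parser is the stdlib parser, so lxml is not required here
soup = BeautifulSoup(html, 'html.parser')
for event in soup.find('ul', {'class': 'list-recent-events'}).findAll('li'):
    details = {
        'name': event.find('h3').find('a').text,
        'location': event.find('span', {'class': 'event-location'}).text,
        'time': event.find('time').text,
    }
    print(details)
```

The same find/findAll calls as in 01_events_with_requests.py apply; only the input source differs.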
Exercise: use requests to scrape the 10 post titles on the front page of the blog at https://china-testing.github.io/.
Reference Answer:
01_blog_title.py
#!python
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.findAll('article')
    for event in events:
        event_details = {}
        event_details['name'] = event.find('h1').find("a").text
        print(event_details)

get_upcoming_events('https://china-testing.github.io/')
Execution results:
#!python
$ python3 01_blog_title.py
{'name': '10 Minute Society API test'}
{'name': 'python Quick Start Course on Data Analysis 4-Data aggregation'}
{'name': 'python Quick Start Course on Data Analysis 6-reforming'}
{'name': 'python Quick Start Course on Data Analysis 5-Processing missing data'}
{'name': 'python Introduction to Library-pytesseract: OCR Optical Character Recognition'}
{'name': "Beginner's Advice on Software Automated Testing"}
{'name': 'Use opencv Conversion 3 d picture'}
{'name': 'python opencv3 Example(Object Recognition and Augmented Reality)2-Edge Detection and Application of Image Filter'}
{'name': 'numpy Learning Guide 3 rd3:Common functions'}
{'name': 'numpy Learning Guide 3 rd2:NumPy Basics'}
urllib3 and Beautiful Soup crawl python.org
Code: 02_events_with_urlib3.py
#!python
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    soup = BeautifulSoup(res.data, 'html.parser')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
Requests is built on top of urllib3, so in practice you usually just use requests directly.
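You can check this relationship in the interpreter: requests re-exposes the urllib3 it uses under requests.packages, so both names refer to the same module. A small check, not needed for scraping itself:

```python
import requests
import urllib3

# requests aliases the urllib3 module it depends on under requests.packages,
# so the two names point at the same module object
print(requests.packages.urllib3 is urllib3)
```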
Scrapy crawls python.org
Scrapy is a very popular open source Python scraping framework for extracting data. It provides these core capabilities along with many other built-in modules and extensions, and it is also our preferred tool for scraping with Python.
Scrapy offers many powerful features worth mentioning:
- Built-in extensions to generate HTTP requests and handle compression, authentication, caching, manipulation of user agents and HTTP headers
- Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for using regular expressions to select content and links.
- Encoding support for handling international languages and non-standard encoding declarations
- Flexible APIs for reusing and writing custom middleware and pipelines, which provide a clean and simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in file systems, S3, databases, and so on
There are several ways to use Scrapy. One is the programmatic pattern, where we create the crawler and spider in our own code. Scrapy projects can also be generated, configured, and run from the command line. This book follows the programmatic pattern, since it keeps each example in a single file.
Code: 03_events_with_scrapy.py
#!python
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
Exercise: use scrapy to scrape the 10 post titles on the front page of the blog at https://china-testing.github.io/.
Reference Answer:
03_blog_with_scrapy.py
#!python
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
Selenium and PhantomJS crawl python.org
04_events_with_selenium.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
By replacing driver = webdriver.Chrome() with driver = webdriver.PhantomJS('phantomjs'), the script runs without opening a browser window:
05_events_with_phantomjs.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
However, Chrome's headless mode in Selenium is a better replacement for PhantomJS.
04_events_with_selenium_headless.py
#!python
from selenium import webdriver

def get_upcoming_events(url):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')