Summary of Previous Articles:
- [Python Crawler] 1. Requests and Responses of HTTP and HTTPS of Crawler Principles
- [Python Crawler] 2. Definition, classification, flow and encoding format of crawler principles
- [Python Crawler] 3. Requests HTTP Library for Data Grabbing
- [Python Crawler] 4. Fiddler, an HTTP/HTTPS package tool for data capture
- [Python Crawler] 5. Regular expression re module for data extraction
- [Python Crawler] 6. XPath and lxml Class Library for Data Extraction
- [Python Crawler] 7. JSON and JsonPATH for Structured Data Extraction
XPath Helper and Chrome's "Copy XPath" both extract data from the rendered Elements panel, but a crawler receives the raw response for a URL, which often differs from Elements. Pages use JavaScript, jQuery, Ajax, or DHTML (Dynamic HTML) to change or load content, so the data is not rendered directly in the HTML but fetched asynchronously by the front end. We could read the JavaScript code and run it with a third-party library in Python (time-consuming); worse, some pages generate dynamic tokens with obfuscated JavaScript cryptographic code, and we can only debug slowly to work out the encryption scheme, which also takes time and effort.
To solve this problem, Python offers another approach: drive a browser engine directly (PhantomJS via Selenium), run the page inside that engine, and collect the page exactly as a browser renders it, so the crawler gets the correct result. Today we will learn Selenium and PhantomJS for dynamic HTML processing.
1. Selenium and PhantomJS
(1)Selenium
Selenium is an automated testing tool for the Web; it was originally developed for automated testing of websites. Like the "Key Wizard" macro tools we use for games, it can operate automatically according to specified commands; the difference is that Selenium runs directly in the browser and supports all major browsers (including headless browsers such as PhantomJS).
Selenium can let the browser automatically load the page, get the data it needs, even take a screenshot of the page, or determine if certain actions on the site occur based on our instructions.
Selenium does not ship with a browser and provides no browser functionality of its own; it must be combined with a third-party browser. But sometimes we need it to run embedded in code, and for that we can use a tool called PhantomJS in place of a real browser.
The Selenium library can be downloaded from PyPI at https://pypi.python.org/simple/selenium, or installed with the pip package manager: sudo pip install selenium
Selenium Official Reference Document: http://selenium-python.readthedocs.io/index.html
(2)PhantomJS
PhantomJS is a WebKit-based headless browser that loads websites into memory and executes the JavaScript on the page. Since it does not display a graphical interface, it runs more efficiently than a full browser.
If we combine Selenium with PhantomJS, we get a very powerful web crawler that can handle JavaScript, cookies, headers, and anything else a real user would do.
- PhantomJS is a fully functional (though interface-less) browser, not a Python library, so it is not installed the way other Python libraries are; instead, we call PhantomJS directly from Selenium.
- You can use the command to install in Ubuntu 16.04: sudo apt-get install phantomjs
- If other systems cannot be installed, they can be downloaded from its official website, http://phantomjs.org/download.html.
- PhantomJS Official Reference Document: http://phantomjs.org/documentation
2. Quick Start
There is an API in the Selenium library called WebDriver. WebDriver is a bit like a browser: it can load a website, find page elements, interact with them (send text, click, and so on), and perform the other actions a web crawler needs, much like BeautifulSoup or other selector objects.
```python
# IPython2 test code

# Import webdriver
from selenium import webdriver
# The Keys package is needed for keyboard operations
from selenium.webdriver.common.keys import Keys

# Create a browser object using the PhantomJS found via the environment variable
driver = webdriver.PhantomJS()
# If the PhantomJS location is not in the environment variable:
# driver = webdriver.PhantomJS(executable_path="./phantomjs")

# get() waits until the page is fully loaded before the program continues;
# in tests, time.sleep(2) is often used as well
driver.get("http://www.baidu.com/")

# Get the text content of the element with id="wrapper"
data = driver.find_element_by_id("wrapper").text

# Print the page title ("Baidu it, you'll know")
print driver.title

# Take a snapshot of the current page and save it
driver.save_screenshot("baidu.png")

# id="kw" is Baidu's search input box; type the string "The Great Wall"
driver.find_element_by_id("kw").send_keys(u"The Great Wall")

# id="su" is Baidu's search button; click() simulates a click
driver.find_element_by_id("su").click()

# Ctrl+A selects all content in the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')

# Ctrl+X cuts the content of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'x')

# Get an href value
driver.find_element_by_xpath("//div[@id='u1']/a[2]").get_attribute('href')

# Simulate pressing Enter instead of clicking
driver.find_element_by_id("su").send_keys(Keys.RETURN)

# Clear the input box
driver.find_element_by_id("kw").clear()

# Close the current page; if it is the only page, the browser closes
# driver.close()

# Close the browser
driver.quit()
```
3. Page Operation
1. Load web pages:
- from selenium import webdriver
- driver = webdriver.PhantomJS("c:.../phantomjs.exe")
- driver.get("http://www.baidu.com/")
- driver.save_screenshot("Great Wall.png")
2. Positioning and operation:
- driver.find_element_by_id("kw").send_keys("Great Wall")
- driver.find_element_by_id("su").click()
3. View request information:
- driver.page_source returns the page source
- driver.title returns the page title
- driver.current_url returns the URL of the current page
- driver.get_cookies() returns the page's cookies

Element methods and attributes (first find the element, then call these on it):
- size gets the dimensions of an element
- text gets the text of an element
- get_attribute(name) gets an attribute value of an element
- tag_name gets the tag name of an element
- location gets the coordinates of an element
- is_displayed() determines whether an element is visible
- is_enabled() determines whether an element is enabled
- is_selected() determines whether an element is selected
4. Mouse operations:
- click(elem) clicks the element elem
- click_and_hold(elem) presses and holds the left mouse button on an element
- context_click(elem) right-clicks elem (for "Save as..." and similar menus)
- double_click(elem) double-clicks the element elem (e.g. to zoom in on a map page)
- drag_and_drop(source, target) drags with the mouse: press the left button on the source element, move to the target element, and release
- move_to_element(elem) hovers the mouse over an element
- perform() executes the actions stored in the ActionChains object
5. Keyboard operations:
- send_keys(Keys.ENTER) presses Enter (no practical difference from Keys.RETURN)
- send_keys(Keys.TAB) presses the Tab key
- send_keys(Keys.SPACE) presses the space bar
- send_keys(Keys.ESCAPE) presses the Esc key
- send_keys(Keys.BACK_SPACE) presses the Backspace key
- send_keys(Keys.SHIFT) presses the Shift key
- send_keys(Keys.CONTROL) presses the Ctrl key
- send_keys(Keys.ARROW_DOWN) presses the down arrow key
- send_keys(Keys.CONTROL, 'a') key combination Ctrl+A (select all)
- send_keys(Keys.CONTROL, 'c') key combination Ctrl+C (copy)
- send_keys(Keys.CONTROL, 'x') key combination Ctrl+X (cut)
- send_keys(Keys.CONTROL, 'v') key combination Ctrl+V (paste)
6. JavaScript operations
- driver.execute_script("some javascript code here");
7. Exit
- driver.close() # Close the current page
- driver.quit() # Quit the browser
(1) Locate elements (WebElements)
Selenium's WebDriver provides various ways to find elements. The single-element selection APIs are:
find_element_by_id, find_element_by_name, find_element_by_xpath, find_element_by_link_text, find_element_by_partial_link_text, find_element_by_tag_name, find_element_by_class_name, find_element_by_css_selector
The difference between find_element and find_elements is that the former returns a single element while the latter returns a list of all matches.
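Selenium aside, the single-result versus list-of-results distinction can be sketched with nothing but Python's standard-library HTML parser. Everything below (the ClassCollector class, the sample HTML) is invented for illustration and is not part of the Selenium API:

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect the text of every tag carrying a given class attribute."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.hits = []      # all matches, like find_elements_*
        self._grab = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if self.cls in dict(attrs).get("class", "").split():
            self._grab = True

    def handle_data(self, data):
        if self._grab:
            self.hits.append(data)
            self._grab = False

html = '<span class="cheese">Cheddar</span><span class="cheese">Gouda</span>'
p = ClassCollector("cheese")
p.feed(html)

all_matches = p.hits                          # find_elements-style: always a list
first_match = p.hits[0] if p.hits else None   # find_element-style: one element

print(all_matches)   # ['Cheddar', 'Gouda']
print(first_match)   # Cheddar
```

The real Selenium methods behave analogously: find_elements returns an empty list when nothing matches, while find_element raises an exception.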
By ID

```html
<div id="coolestWidgetEvah">...</div>
```

Implementation:

```python
element = driver.find_element_by_id("coolestWidgetEvah")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
element = driver.find_element(by=By.ID, value="coolestWidgetEvah")
```

By Class Name

```html
<div class="cheese"><span>Cheddar</span></div><div class="cheese"><span>Gouda</span></div>
```

Implementation:

```python
cheeses = driver.find_elements_by_class_name("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheeses = driver.find_elements(By.CLASS_NAME, "cheese")
```

By Tag Name

```html
<iframe src="..."></iframe>
```

Implementation:

```python
frame = driver.find_element_by_tag_name("iframe")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
frame = driver.find_element(By.TAG_NAME, "iframe")
```

By Name

```html
<input name="cheese" type="text"/>
```

Implementation:

```python
cheese = driver.find_element_by_name("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.NAME, "cheese")
```

By Link Text

```html
<a href="http://www.google.com/search?q=cheese">cheese</a>
```

Implementation:

```python
cheese = driver.find_element_by_link_text("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.LINK_TEXT, "cheese")
```

By Partial Link Text

```html
<a href="http://www.google.com/search?q=cheese">search for cheese</a>
```

Implementation:

```python
cheese = driver.find_element_by_partial_link_text("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.PARTIAL_LINK_TEXT, "cheese")
```

By CSS

```html
<div id="food"><span class="dairy">milk</span><span class="dairy aged">cheese</span></div>
```

Implementation:

```python
cheese = driver.find_element_by_css_selector("#food span.dairy.aged")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.CSS_SELECTOR, "#food span.dairy.aged")
```

By XPath

```html
<input type="text" name="example" />
<INPUT type="text" name="other" />
```

Implementation:

```python
inputs = driver.find_elements_by_xpath("//input")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
inputs = driver.find_elements(By.XPATH, "//input")
```
(2) Mouse action
Sometimes we need to simulate mouse actions on a page, such as a double-click, right-click, drag, or press-and-hold. We can do this by importing the ActionChains class. The common element operations are:
- clear clears the content of an element
- send_keys simulates keyboard input (use send_keys(u"Chinese text") for Chinese input to prevent encoding errors)
- click clicks an element
- submit submits a form
```python
# Import the ActionChains class
from selenium.webdriver import ActionChains

# Move the mouse to the position of ac
ac = driver.find_element_by_xpath('element')
ActionChains(driver).move_to_element(ac).perform()

# Click at the position of ac
ac = driver.find_element_by_xpath("elementA")
ActionChains(driver).move_to_element(ac).click(ac).perform()

# Double-click at the position of ac
ac = driver.find_element_by_xpath("elementB")
ActionChains(driver).move_to_element(ac).double_click(ac).perform()

# Right-click at the position of ac
ac = driver.find_element_by_xpath("elementC")
ActionChains(driver).move_to_element(ac).context_click(ac).perform()

# Press and hold the left button at the position of ac
ac = driver.find_element_by_xpath('elementF')
ActionChains(driver).move_to_element(ac).click_and_hold(ac).perform()

# Drag ac1 to the position of ac2
ac1 = driver.find_element_by_xpath('elementD')
ac2 = driver.find_element_by_xpath('elementE')
ActionChains(driver).drag_and_drop(ac1, ac2).perform()
```
(3) Fill in the form
We already know how to enter text into a text box, but sometimes we encounter a drop-down box wrapped in <select></select> tags. Directly clicking an option in the drop-down does not always work.
```html
<select id="status" class="form-control valid" onchange="" name="status">
    <option value=""></option>
    <option value="0">Not audited</option>
    <option value="1">First inspection passed</option>
    <option value="2">Review passed</option>
    <option value="3">Audit failed</option>
</select>
```
Selenium specifically provides a Select class to handle drop-down boxes, which can do all of this for us:
```python
# Import the Select class
from selenium.webdriver.support.ui import Select

# Find the <select> tag by name
select = Select(driver.find_element_by_name('status'))

# Three equivalent ways to select the "Not audited" option:
select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"Not audited")
```
These are the three ways to select from a drop-down box: by index, by value, or by visible text. Note:
- index starts at 0
- value is the value attribute of the option tag, not the text shown in the drop-down box
- visible_text is the text of the option tag, i.e. what is displayed in the drop-down box
What about deselecting everything? It's simple:
select.deselect_all()
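To make the three selection modes concrete, here is a stdlib-only model of the <option> list from the HTML snippet above. No Selenium is involved; the three helper functions are hypothetical stand-ins for the Select methods:

```python
# Each option is (value attribute, visible text), copied from the <select> above
options = [
    ("", ""),
    ("0", "Not audited"),
    ("1", "First inspection passed"),
    ("2", "Review passed"),
    ("3", "Audit failed"),
]

def select_by_index(opts, i):
    return opts[i]                                   # index counts from 0

def select_by_value(opts, value):
    return next(o for o in opts if o[0] == value)    # matches the value attribute

def select_by_visible_text(opts, text):
    return next(o for o in opts if o[1] == text)     # matches the displayed text

# All three calls pick the same option:
assert select_by_index(options, 1) == ("0", "Not audited")
assert select_by_value(options, "0") == ("0", "Not audited")
assert select_by_visible_text(options, "Not audited") == ("0", "Not audited")
```

Note how index 1 and value "0" name the same option: the empty placeholder occupies index 0, which is why index-based selection is fragile when the option list changes.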
(4) Bounce window handling
When an event triggers a pop-up prompt on the page, handle the prompt or read its text as follows:
alert = driver.switch_to_alert()
(5) Page switching
A browser can have many windows, so we need a way to switch between them. Switch windows as follows:
driver.switch_to.window("this is window name")
You can also use the window_handles property to get a handle for each window. For example:

```python
for handle in driver.window_handles:
    driver.switch_to_window(handle)
```
(6) Page forward and backward
To move forward and backward through the page history:

```python
driver.forward()  # Forward
driver.back()     # Back
```
(7) Cookies
To get each cookie of the page:

```python
for cookie in driver.get_cookies():
    print "%s -> %s" % (cookie['name'], cookie['value'])
```
Delete cookies as follows:

```python
# By name
driver.delete_cookie("CookieName")
# All cookies
driver.delete_all_cookies()
```
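Since get_cookies() simply returns a list of dicts, the printing loop above is ordinary dictionary access. A stdlib-only (Python 3) sketch with made-up cookie data — the names and values here are invented for illustration:

```python
# Made-up sample of what driver.get_cookies() returns: a list of dicts,
# each with at least 'name' and 'value' keys (real cookies carry more fields)
cookies = [
    {"name": "BAIDUID", "value": "ABC123", "domain": ".baidu.com"},
    {"name": "BDSVRTM", "value": "0", "domain": "www.baidu.com"},
]

# Format each cookie as "name -> value", mirroring the loop above
lines = ["%s -> %s" % (c["name"], c["value"]) for c in cookies]
for line in lines:
    print(line)
# BAIDUID -> ABC123
# BDSVRTM -> 0
```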
4. JavaScript Executor
In this section, we discuss how to use JavaScript to click or manipulate Web elements in Python Selenium WebDriver.
Potential operations using JavaScript:
- Get an element's text or attributes
- Find an element
- Perform an action on an element, such as click()
- Change an element's properties
- Scroll to an element or a position on the page
- Wait until the page has loaded
(1) How to use JavaScript in WebDriver
Python Selenium WebDriver provides a built-in method:
driver.execute_script("some javascript code here");
There are two ways we can execute JavaScript in a browser.
Method 1: Execute JavaScript at the document root level
In this case, we use JavaScript itself to inspect and capture the element we want, declare an operation on it, and then have WebDriver execute the JavaScript. When executed, WebDriver injects the JavaScript statement into the browser and the script performs the task. For example:

```python
javaScript = "document.getElementsByName('username')[0].click();"
driver.execute_script(javaScript)
```

Step 1: We use JavaScript to inspect and get the element through its name attribute. (The 'id' and 'class' attributes work too.)
Step 2: We declare and click the element in JavaScript.
Step 3: We call the execute_script() method and pass the JavaScript we created as a string value.
Method 2: Execute JavaScript at the element level
In this case, we use WebDriver to capture the element we want, then declare an operation on it in JavaScript, and have WebDriver execute that JavaScript, passing the web element as an argument:

```python
userName = driver.find_element_by_xpath("//button[@name='username']")
driver.execute_script("arguments[0].click();", userName)
```

Step 1: Inspect and capture the element using a method provided by WebDriver: find_element_by_xpath.
Step 2: Declare and click the element in JavaScript: arguments[0].click()
Step 3: Call execute_script() with the JavaScript statement as a string value and the web element captured by WebDriver as an argument: driver.execute_script("arguments[0].click();", userName)

The two lines of code above can be shortened to the following form, where we use WebDriver to find the element, declare a JavaScript action, and execute the JavaScript with WebDriver:

```python
driver.execute_script("arguments[0].click();",
                      driver.find_element_by_xpath("//button[@name='username']"))
```
In addition, you can have multiple JavaScript operations in your statement:
```python
userName = driver.find_element_by_xpath("//button[@name='username']")
password = driver.find_element_by_xpath("//button[@name='password']")
driver.execute_script("arguments[0].click();arguments[1].click();", userName, password)
# driver.execute_script("arguments[1].click();arguments[0].click();", userName, password)
```
In this case, the order in which the web elements are passed matters: they become arguments[0], arguments[1], and so on, in order.
Hands-on example:

```python
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.baidu.com/")

# Give the search input box a red border
js = "var q=document.getElementById(\"kw\");q.style.border=\"2px solid red\";"
driver.execute_script(js)

# Hide the Baidu logo image (relies on jQuery being present on the page)
img = driver.find_element_by_xpath("//*[@id='lg']/img")
driver.execute_script('$(arguments[0]).fadeOut()', img)

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Scroll down 10000 pixels
js = "document.body.scrollTop=10000"
# js = "var q=document.documentElement.scrollTop=10000"
driver.execute_script(js)

# Get a value from a web element
print driver.execute_script('return document.getElementById("fsr").innerText')

driver.quit()
```
Getting a value from a web element with driver.execute_script may raise a WebDriver exception:
selenium.common.exceptions.WebDriverException: Message: unknown error: Cannot read property 'innerText' of null
Solution: the JavaScript cannot find the element it should operate on; check whether the element actually exists on the page.
5. Page Waiting
More and more web pages now use Ajax, so a program cannot simply determine when an element has fully loaded. If the page takes longer than expected and a DOM element has not yet appeared when your code uses the WebElement directly, you get a NullPointer exception.
To make elements easier to locate and to reduce the chance of an ElementNotVisibleException, Selenium offers two ways to wait: implicit and explicit.
An implicit wait waits a fixed amount of time; an explicit wait specifies a condition and waits until it holds.
A. Implicit Waiting
An implicit wait is simpler: you just set a wait time, in seconds.

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # seconds
driver.get("http://www.xxxxx.com/loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")
```
Of course, if not set, the default wait time is 0.
B. Explicit wait
An explicit wait specifies a condition and a maximum wait time. If the element is not found within that time, an exception is thrown.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
# WebDriverWait, responsible for the polling loop
from selenium.webdriver.support.ui import WebDriverWait
# expected_conditions, which supplies the wait conditions
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.xxxxx.com/loading")
try:
    # Poll the page until an element with id="myDynamicElement" appears
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
```
If you don't pass the poll-frequency parameter, the program checks every 0.5 seconds by default to see whether the element has appeared, and returns immediately if it already exists.
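The polling behaviour described above is easy to model without a browser. The sketch below is a hypothetical, stdlib-only re-implementation of the idea behind WebDriverWait(driver, timeout).until(condition); the real class additionally lets you ignore chosen exception types while polling:

```python
import time

class TimeoutException(Exception):
    pass

def wait_until(condition, timeout, poll_frequency=0.5):
    """Call condition() every poll_frequency seconds; return its value as
    soon as it is truthy, or raise after timeout seconds have elapsed."""
    end = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() > end:
            raise TimeoutException("condition not met within %s s" % timeout)
        time.sleep(poll_frequency)

# Demo with a condition that becomes truthy on the third poll
state = {"calls": 0}
def appeared():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

result = wait_until(appeared, timeout=5, poll_frequency=0.01)
print(result)          # element
print(state["calls"])  # 3
```

This also shows why explicit waits beat time.sleep(3): the wait returns as soon as the condition holds instead of always burning the full delay.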
Here are some built-in wait conditions that you can call directly instead of writing your own.
- title_is
- title_contains
- presence_of_element_located
- visibility_of_element_located
- visibility_of
- presence_of_all_elements_located
- text_to_be_present_in_element
- text_to_be_present_in_element_value
- frame_to_be_available_and_switch_to_it
- invisibility_of_element_located
- element_to_be_clickable (it is displayed and enabled)
- staleness_of
- element_to_be_selected
- element_located_to_be_selected
- element_selection_state_to_be
- element_located_selection_state_to_be
- alert_is_present
6. Hands-on Demonstrations
Log in to Douyu (demonstrates a simulated login):

```python
# coding=utf-8
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class Douyu():
    def __init__(self):
        self.url = "https://www.douyu.com/"
        self.driver = webdriver.PhantomJS()

    def log_in(self):
        self.driver.get(self.url)
        time.sleep(3)  # Sleep for 3 seconds to wait for the page to load
        self.driver.save_screenshot("0.jpg")
        # Enter the account
        self.driver.find_element_by_xpath('//*[@id="form_email"]').send_keys("xxxxx@qq.com")
        # Enter the password
        self.driver.find_element_by_xpath('//*[@id="form_password"]').send_keys("xxxx")
        # Click the login button
        self.driver.find_element_by_class_name("bn-submit").click()
        time.sleep(2)
        self.driver.save_screenshot("douyu.jpg")
        # Print the cookies obtained after logging in
        print(self.driver.get_cookies())

    def __del__(self):
        '''The destructor is called automatically when the program exits.
        Similarly, this is where you would close an open file or
        disconnect a database connection.'''
        self.driver.quit()


if __name__ == "__main__":
    douyu = Douyu()  # Instantiate
    douyu.log_in()   # Then call the login method
```
Crawl all room information from the Douyu live-streaming platform (demonstrates clicking through dynamic pages):

```python
# coding=utf-8
import json
import time

from selenium import webdriver


class Douyu:
    # 1. Send a request for the first page
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.get("https://www.douyu.com/directory/all")  # Request the home page

    # Get the content of one page
    def get_content(self):
        time.sleep(3)  # After each request, wait three seconds for the page to load
        li_list = self.driver.find_elements_by_xpath('//ul[@id="live-list-contentbox"]/li')
        contents = []
        for i in li_list:  # Walk through the room list
            item = {}
            item["img"] = i.find_element_by_xpath("./a//img").get_attribute("src")  # Room image
            item["title"] = i.find_element_by_xpath("./a").get_attribute("title")  # Room name
            item["category"] = i.find_element_by_xpath("./a/div[@class='mes']/div/span").text  # Room category
            item["name"] = i.find_element_by_xpath("./a/div[@class='mes']/p/span[1]").text  # Host name
            item["watch_num"] = i.find_element_by_xpath("./a/div[@class='mes']/p/span[2]").text  # Number of viewers
            print(item)
            contents.append(item)
        return contents

    # Save locally
    def save_content(self, contents):
        f = open("douyu.txt", "a")
        for content in contents:
            json.dump(content, f, ensure_ascii=False, indent=2)
            f.write("\n")
        f.close()

    def run(self):
        # 2. Get the information on the first page and save it
        contents = self.get_content()
        self.save_content(contents)
        # 3. Keep clicking the next-page button until there is no longer an
        #    element with the class name "shark-pager-next"
        while self.driver.find_element_by_class_name("shark-pager-next"):  # Is there a next page?
            # Click the next-page button
            self.driver.find_element_by_class_name("shark-pager-next").click()
            # 4. Get the content of the next page and save it
            contents = self.get_content()
            self.save_content(contents)


if __name__ == "__main__":
    douyu = Douyu()
    douyu.run()
```
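One detail worth noting in save_content(): json.dump(..., indent=2) spreads each record over several lines, so the output file is not trivially machine-readable. Writing one compact JSON object per line ("JSON Lines") makes reloading easy. A stdlib-only (Python 3) sketch with made-up room records:

```python
import json
import os
import tempfile

# Made-up room records shaped like the ones get_content() builds
contents = [
    {"title": "Room A", "name": "host1", "watch_num": "1.2k"},
    {"title": "Room B", "name": "host2", "watch_num": "800"},
]

path = os.path.join(tempfile.mkdtemp(), "douyu.txt")

# Append each record as one compact JSON object followed by a newline
with open(path, "a", encoding="utf-8") as f:
    for content in contents:
        f.write(json.dumps(content, ensure_ascii=False))
        f.write("\n")

# One json.loads() per line recovers the records exactly
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == contents)  # True
```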
Summary of Upcoming Articles:
- [Python Crawler] 9. Tesseract for Machine Vision and Machine Image Recognition
- [Python Crawler] Ten, Scrapy Framework
If you have any questions or good suggestions, I look forward to your comments!