[Python Crawler] 8. Selenium and PhantomJS for Dynamic HTML Processing

Keywords: Selenium Javascript Python JSON

Recap:

XPath Helper and Chrome's "Copy XPath" both extract data from the rendered Elements panel, but a crawler receives the response for the corresponding URL, which often differs from what Elements shows. Pages use JavaScript, jQuery, Ajax, or DHTML (Dynamic HTML) to change or load content, so the data is not rendered directly in the page source but is fetched asynchronously by the front end. We can try to extract the relevant JavaScript code and run it in Python with a third-party library, but this is time-consuming. In addition, some pages generate dynamic tokens with JavaScript cryptographic libraries and then obfuscate the code. We can only debug slowly to work out the encryption scheme, which also takes time and effort.
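The gap between the raw response and the rendered page can be illustrated with a small Python 3 sketch. The HTML string below is a made-up stand-in for a server response (it does not come from any real site), and ListItemCounter and count_list_items are hypothetical names. The list is only filled in by a script, so a parser that does not execute JavaScript finds no list items in it.

```python
from html.parser import HTMLParser

# Hypothetical raw response: the <ul> stays empty until the script runs in a browser.
RAW_RESPONSE = """
<html><body>
  <ul id="rooms"></ul>
  <script>
    document.getElementById("rooms").innerHTML = "<li>room 1</li><li>room 2</li>";
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Count <li> start tags in the markup (script contents are not executed)."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


def count_list_items(html):
    """Return the number of <li> elements a static parser sees in html."""
    parser = ListItemCounter()
    parser.feed(html)
    return parser.li_count


print(count_list_items(RAW_RESPONSE))
```

The parser treats the script body as plain text, so the two rooms that a browser would display never show up in the static parse.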

Python offers another approach to this problem: drive a browser engine from the crawler (Selenium with PhantomJS), run the page in that engine, and collect the data directly from the page as you would see it in a browser. Today we're going to learn how to use Selenium and PhantomJS for dynamic HTML processing.

1. Selenium and PhantomJS

(1)Selenium

Selenium is an automated testing tool for the Web, originally developed for automated testing of websites. Like the key-macro tools ("key wizards") used for playing games, it operates automatically according to specified commands. The difference is that Selenium runs directly in the browser, and it supports all major browsers (including headless browsers such as PhantomJS).

Selenium can let the browser automatically load the page, get the data it needs, even take a screenshot of the page, or determine if certain actions on the site occur based on our instructions.

Selenium does not ship with a browser and provides no browser functionality of its own; it must be used together with a third-party browser. But sometimes we need it to run embedded in code, so we can use a tool called PhantomJS in place of a real browser.

Selenium can be downloaded from PyPI at https://pypi.python.org/simple/selenium, or installed with the pip package manager: sudo pip install selenium

Selenium Official Reference Document: http://selenium-python.readthedocs.io/index.html

(2)PhantomJS

PhantomJS is a WebKit-based headless browser that loads websites into memory and executes the JavaScript on the page. Because it does not display a graphical interface, it runs more efficiently than a full browser.

If we combine Selenium with PhantomJS, we can run a very powerful web crawler that handles JavaScript, cookies, headers, and anything else a real user would need to do.

  • PhantomJS is a fully functional (though no interface) browser, not a Python library, so it doesn't need to be installed like other Python libraries, but we can call PhantomJS directly from Selenium.
  • You can use the command to install in Ubuntu 16.04: sudo apt-get install phantomjs
  • On other systems where it cannot be installed this way, it can be downloaded from its official website, http://phantomjs.org/download.html.
  • PhantomJS Official Reference Document: http://phantomjs.org/documentation

2. Quick Start

The Selenium library has an API called WebDriver. WebDriver is a bit like a browser that can load a website, but it can also be used, like BeautifulSoup or other Selector objects, to find page elements, interact with elements on the page (send text, click, and so on), and perform other actions needed to run a web crawler.

# IPython2 test code

# Import webdriver
from selenium import webdriver

# Keys package to be introduced when keyboard key operations are invoked
from selenium.webdriver.common.keys import Keys

# Call PhantomJS browser specified by environment variable to create browser object
driver = webdriver.PhantomJS()

# If the PhantomJS location is not specified in the environment variable
# driver = webdriver.PhantomJS(executable_path="./phantomjs")

# The get method waits until the page is fully loaded before the program continues; for testing, time.sleep(2) is often added
driver.get("http://www.baidu.com/")

# Get the text content of the element whose id is "wrapper"
data = driver.find_element_by_id("wrapper").text

# Print page title "Baidu, you know"
print(driver.title)

# Generate a snapshot of the current page and save it
driver.save_screenshot("baidu.png")

# id="kw" is Baidu search input box, input string "Great Wall"
driver.find_element_by_id("kw").send_keys(u"The Great Wall")

# id="su" is the Baidu search button; click() simulates a click
driver.find_element_by_id("su").click()

# ctrl+a Select All Input Box Contents
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'a')

# ctrl+x cuts the contents of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'x')

# Get href value
driver.find_element_by_xpath("//div[@id='u1']/a[2]").get_attribute('href')

# Simulate pressing the Enter key instead of clicking
driver.find_element_by_id("su").send_keys(Keys.RETURN)

# Clear Input Box Contents
driver.find_element_by_id("kw").clear()

# Close the current page, if there is only one page, the browser will be closed
# driver.close()

# Close Browser
driver.quit()

3. Page Operation

1. Load web pages:

  • from selenium import webdriver
  • driver = webdriver.PhantomJS("c:.../phantomjs.exe")
  • driver.get("http://www.baidu.com/")
  • driver.save_screenshot("Great Wall.png")

2. Positioning and operation:

  • driver.find_element_by_id("kw").send_keys("Great Wall")
  • driver.find_element_by_id("su").click()

3. View the request information:

  • Page-level information:
    driver.page_source returns the page source
    driver.title returns the page title
    driver.current_url returns the URL of the current page
    driver.get_cookies() returns the page cookies
  • Element-level information (locate the element first, then use these on it):
    size gets the dimensions of the element
    text gets the text of the element
    get_attribute(name) gets the value of an attribute of the element
    tag_name gets the tag name of the element
    location gets the coordinates of the element
    is_displayed() returns whether the element is visible
    is_enabled() returns whether the element is enabled
    is_selected() returns whether the element is selected

4. Mouse operation:

  • click(elem) Click the element elem
  • click_and_hold(elem) Press and hold the left mouse button on elem
  • context_click(elem) Right-click elem (for example, to reach "Save as")
  • double_click(elem) Double-click elem (for example, to zoom in on a map page)
  • drag_and_drop(source,target) Press the left button on source, drag to target, and release
  • move_to_element(elem) Move the mouse over elem
  • perform() Execute the actions queued in the ActionChains

5. Keyboard Operations

  • send_keys(Keys.ENTER) Press Enter (in practice the same as Keys.RETURN; both produce key code 13)
  • send_keys(Keys.TAB) Press the Tab key
  • send_keys(Keys.SPACE) Press the Space bar
  • send_keys(Keys.ESCAPE) Press the Esc key
  • send_keys(Keys.BACK_SPACE) Press the Backspace key
  • send_keys(Keys.SHIFT) Press the Shift key
  • send_keys(Keys.CONTROL) Press the Ctrl key
  • send_keys(Keys.ARROW_DOWN) Press the Down arrow key
  • send_keys(Keys.CONTROL,'a') Key combination Ctrl+A (select all)
  • send_keys(Keys.CONTROL,'c') Key combination Ctrl+C (copy)
  • send_keys(Keys.CONTROL,'x') Key combination Ctrl+X (cut)
  • send_keys(Keys.CONTROL,'v') Key combination Ctrl+V (paste)

6. JavaScript operations

  • driver.execute_script("some javascript code here");

7. Exit

  • driver.close() #Exit the current page
  • driver.quit() #Exit browser

(1) Locate elements (WebElements)

Selenium's WebDriver provides various ways to find elements. The methods for selecting a single element are as follows:

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

The difference between find_element and find_elements is that the former returns a single element, while the latter returns a list of all matching elements.
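One practical consequence of this difference: find_element raises NoSuchElementException when nothing matches, while find_elements simply returns an empty list. The Python 3 sketch below uses that to build a "first match or None" helper. The helper name find_first_or_none and the FakeDriver class are hypothetical; FakeDriver only stands in for a real WebDriver so the snippet runs without a browser. The strings "class name" and "id" are the values of By.CLASS_NAME and By.ID.

```python
def find_first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches.

    Relies on find_elements returning an empty list (instead of raising,
    as find_element does) when there is no match.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, used only to exercise the helper."""

    def __init__(self, elements_by_locator):
        self._elements = elements_by_locator

    def find_elements(self, by, value):
        return list(self._elements.get((by, value), []))


# "class name" and "id" are the string values of By.CLASS_NAME and By.ID.
driver = FakeDriver({("class name", "cheese"): ["Cheddar", "Gouda"]})
print(find_first_or_none(driver, "class name", "cheese"))
print(find_first_or_none(driver, "id", "missing"))
```

With a real driver, the same helper would be called as find_first_or_none(driver, By.ID, "kw").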

  1. By ID

    <div id="coolestWidgetEvah">...</div>
    
    • Implementation

      element = driver.find_element_by_id("coolestWidgetEvah")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      element = driver.find_element(by=By.ID, value="coolestWidgetEvah")
      
  2. By Class Name

    <div class="cheese"><span>Cheddar</span></div><div class="cheese"><span>Gouda</span></div>
    
    • Implementation

      cheeses = driver.find_elements_by_class_name("cheese")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      cheeses = driver.find_elements(By.CLASS_NAME, "cheese")
      
  3. By Tag Name

    <iframe src="..."></iframe>
    
    • Implementation

      frame = driver.find_element_by_tag_name("iframe")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      frame = driver.find_element(By.TAG_NAME, "iframe")
      
  4. By Name

    <input name="cheese" type="text"/>
    
    • Implementation

      cheese = driver.find_element_by_name("cheese")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      cheese = driver.find_element(By.NAME, "cheese")
      
  5. By Link Text

    <a href="http://www.google.com/search?q=cheese">cheese</a>
    
    • Implementation

      cheese = driver.find_element_by_link_text("cheese")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      cheese = driver.find_element(By.LINK_TEXT, "cheese")
      
  6. By Partial Link Text

    <a href="http://www.google.com/search?q=cheese">search for cheese</a>
    
    • Implementation

      cheese = driver.find_element_by_partial_link_text("cheese")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      cheese = driver.find_element(By.PARTIAL_LINK_TEXT, "cheese")
      
  7. By CSS

    <div id="food"><span class="dairy">milk</span><span class="dairy aged">cheese</span></div>
    
    • Implementation

      cheese = driver.find_element_by_css_selector("#food span.dairy.aged")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      cheese = driver.find_element(By.CSS_SELECTOR, "#food span.dairy.aged")
      
  8. By XPath

    <input type="text" name="example" />
    <INPUT type="text" name="other" />
    
    • Implementation

      inputs = driver.find_elements_by_xpath("//input")
      ------------------------ or -------------------------
      from selenium.webdriver.common.by import By
      inputs = driver.find_elements(By.XPATH, "//input")
      

(2) Mouse action

Sometimes we need to simulate mouse actions on the page, such as double-click, right-click, drag, or even click-and-hold. We can do this by importing the ActionChains class. First, the common element operations are as follows:

  • clear Clears the contents of an element
  • send_keys Simulates key input [to prevent encoding errors, use send_keys(u"Chinese user name") when Chinese input is required]
  • click Clicks the element
  • submit Submits the form

# Import the ActionChains class
from selenium.webdriver import ActionChains

# Mouse moves to ac position
ac = driver.find_element_by_xpath('element')
ActionChains(driver).move_to_element(ac).perform()

# Click at the ac location
ac = driver.find_element_by_xpath("elementA")
ActionChains(driver).move_to_element(ac).click(ac).perform()

# Double-click in the ac position
ac = driver.find_element_by_xpath("elementB")
ActionChains(driver).move_to_element(ac).double_click(ac).perform()

# Right-click in ac position
ac = driver.find_element_by_xpath("elementC")
ActionChains(driver).move_to_element(ac).context_click(ac).perform()

# Left click hold at ac position
ac = driver.find_element_by_xpath('elementF')
ActionChains(driver).move_to_element(ac).click_and_hold(ac).perform()

# Drag ac1 to ac2 position
ac1 = driver.find_element_by_xpath('elementD')
ac2 = driver.find_element_by_xpath('elementE')
ActionChains(driver).drag_and_drop(ac1, ac2).perform()

(3) Fill in the form

We already know how to enter text into text boxes, but sometimes we encounter drop-down boxes built with <select> </select> tags. Clicking an option in the drop-down box directly does not always work.

<select id="status" class="form-control valid" onchange="" name="status">
    <option value=""></option>
    <option value="0">Not audited</option>
    <option value="1">First Inspection Passed</option>
    <option value="2">Review Passed</option>
    <option value="3">Audit Failed</option>
</select>

Selenium provides the Select class specifically for handling drop-down boxes. It wraps the <select> element found by WebDriver and can do these things for us:

# Import Select Class
from selenium.webdriver.support.ui import Select

# Locate the <select> tag by its name attribute
select = Select(driver.find_element_by_name('status'))

# Three ways to pick an option: by index, by value, and by visible text
select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"Not audited")

These are three ways to choose an option in a drop-down box: by index, by value, or by visible text. Note:

  • index starts at 0
  • Value is an attribute value of the option tag, not a value that appears in the drop-down box
  • visible_text is the value of the option label text, which is displayed in the drop-down box
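To make the index / value / visible-text distinction concrete, here is a small Python 3 sketch that parses a trimmed copy of the <select> markup with the standard library and prints the three identifiers for each option. It does not use Selenium, and OptionCollector is a hypothetical helper name introduced only for this illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the <select> markup from the example.
SELECT_HTML = """
<select id="status" name="status">
    <option value=""></option>
    <option value="0">Not audited</option>
    <option value="1">First Inspection Passed</option>
    <option value="2">Review Passed</option>
    <option value="3">Audit Failed</option>
</select>
"""


class OptionCollector(HTMLParser):
    """Collect [value, visible_text] for each <option>, in document order."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self.options.append([dict(attrs).get("value"), ""])

    def handle_data(self, data):
        if self._in_option:
            self.options[-1][1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False


collector = OptionCollector()
collector.feed(SELECT_HTML)
for index, (value, text) in enumerate(collector.options):
    print(index, repr(value), repr(text))
```

The option at index 1 has value "0" and visible text "Not audited", so select_by_index(1), select_by_value("0"), and select_by_visible_text(u"Not audited") all pick the same option.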

What about canceling all selections? It's simple (note that deselect_all() only applies to a <select> that allows multiple selections):

select.deselect_all()

(4) Pop-up handling

When you trigger an event, a pop-up prompt appears on the page. Handle the prompt or get the prompt information as follows:

alert = driver.switch_to.alert
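Once you hold the alert object, its text attribute gives the prompt message, accept() confirms it, and dismiss() cancels it. The Python 3 sketch below wraps this in a helper; it uses the switch_to.alert property. The helper name read_and_accept_alert and the Fake* classes are hypothetical; the fakes only imitate the attributes used here so the snippet runs without a browser.

```python
def read_and_accept_alert(driver):
    """Return the text of the currently open alert, then confirm it."""
    alert = driver.switch_to.alert  # a property, not a method call
    message = alert.text
    alert.accept()  # alert.dismiss() would cancel instead
    return message


class FakeAlert:
    """Hypothetical stand-in for Selenium's Alert object."""

    def __init__(self, text):
        self.text = text
        self.accepted = False

    def accept(self):
        self.accepted = True


class FakeSwitchTo:
    def __init__(self, alert):
        self.alert = alert


class FakeDriver:
    """Hypothetical stand-in for a WebDriver with one open alert."""

    def __init__(self, alert):
        self.switch_to = FakeSwitchTo(alert)


fake_alert = FakeAlert("Saved successfully")
print(read_and_accept_alert(FakeDriver(fake_alert)))
```

With a real driver, Selenium raises NoAlertPresentException if no alert is open, so a crawler would typically call such a helper only right after the action that triggers the pop-up.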

(5) Page switching

A browser may have several windows open, so we need a way to switch between them. Switch windows as follows:

driver.switch_to.window("this is window name")

You can also use the window_handles property to get the handle of each window. For example:

for handle in driver.window_handles:
    driver.switch_to.window(handle)
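A common use of window handles is to jump to a newly opened window after clicking a link. A minimal Python 3 sketch, assuming the goal is "switch to any window other than the current one": switch_to_other_window is a hypothetical helper name, and the Fake* classes only mimic window_handles, current_window_handle, and switch_to.window so the snippet runs without a browser.

```python
def switch_to_other_window(driver):
    """Switch to the first window that is not the current one.

    Returns the handle switched to, or None if only one window is open.
    """
    current = driver.current_window_handle
    for handle in driver.window_handles:
        if handle != current:
            driver.switch_to.window(handle)
            return handle
    return None


class FakeSwitchTo:
    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Hypothetical stand-in for a WebDriver with several open windows."""

    def __init__(self, handles):
        self.window_handles = handles
        self.current_window_handle = handles[0]
        self.switch_to = FakeSwitchTo(self)


driver = FakeDriver(["window-1", "window-2"])
print(switch_to_other_window(driver))
print(driver.current_window_handle)
```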

(6) Page forward and backward

To move forward and backward through the page history:

driver.forward()     # Forward
driver.back()        # Back

(7) Cookies

To get the value of each cookie for the page, use the following:

for cookie in driver.get_cookies():
    print("%s -> %s" % (cookie['name'], cookie['value']))

Delete Cookies as follows

# By name
driver.delete_cookie("CookieName")

# all
driver.delete_all_cookies()
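Cookies collected by the browser are often handed over to a plain HTTP client afterwards. The Python 3 sketch below converts the list of dicts returned by get_cookies() into a name-to-value dict and into a Cookie header string. The function names and the sample data are made up for illustration; only the 'name' and 'value' keys used above are relied on.

```python
def cookies_to_dict(cookies):
    """Map cookie name -> value from the list returned by driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookies_to_header(cookies):
    """Build a Cookie request-header value such as 'a=1; b=2'."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Made-up sample in the shape get_cookies() returns (a list of dicts).
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com"},
    {"name": "lang", "value": "en", "domain": ".example.com"},
]
print(cookies_to_dict(sample))
print(cookies_to_header(sample))
```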

4. JavaScript Executor

In this section, we discuss how to use JavaScript to click or manipulate Web elements in Python Selenium WebDriver.

Potential operations using JavaScript:

  • Get element text or attributes
  • Find an element
  • Do something about the element, such as click()
  • Change the properties of an element
  • Scroll to an element or location on a Web page
  • Wait until the page is loaded

(1) How to use JavaScript in WebDriver

Python Selenium WebDriver provides a built-in method:

driver.execute_script("some javascript code here");

There are two ways we can execute JavaScript in a browser.

Method 1: Execute JavaScript at the document root level

In this case, we use the methods provided by JavaScript to capture the elements we want, declare some operations on them, and execute the JavaScript through WebDriver. When executed, WebDriver injects the JavaScript statements into the browser, and the script performs the task. For example:

javaScript = "document.getElementsByName('username')[0].click();"
driver.execute_script(javaScript)

Step 1: Use JavaScript to locate and get the element through its name attribute. (The id and class attributes can be used as well.)

Step 2: Declare and click elements using JavaScript.

Step 3: Call the execute_script() method and pass the JavaScript we created as a string value.

Method 2: Execute JavaScript at the element level

In this case, we use WebDriver to capture the element we want to use, then use JavaScript to declare some operations on it, and use WebDriver to execute this JavaScript by passing the web element as a parameter to JavaScript.

userName = driver.find_element_by_xpath("//button[@name='username']")
driver.execute_script("arguments[0].click();", userName)

Step 1: Locate and capture the element using a method provided by WebDriver: find_element_by_xpath
Step 2: Declare the click on the element in JavaScript: arguments[0].click()
Step 3: Call execute_script() with the JavaScript statement as a string value and the web element captured by WebDriver as an argument: driver.execute_script("arguments[0].click();", userName)

The two lines of code above can be shortened to the following form, in which we find the element with WebDriver, declare the JavaScript operation, and execute it through WebDriver in a single call.

driver.execute_script("arguments[0].click();",driver.find_element_by_xpath("//button[@name='username']"))

In addition, you can have multiple JavaScript operations in your statement:

userName = driver.find_element_by_xpath("//button[@name='username']")
password = driver.find_element_by_xpath("//button[@name='password']")
driver.execute_script("arguments[0].click();arguments[1].click();", userName, password)
#driver.execute_script("arguments[1].click();arguments[0].click();", userName, password)

In this case, the order of the web elements matters: arguments[0] refers to the first element passed in, and arguments[1] to the second.

Hands-on example:
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.baidu.com/")

# Give the search input box a red border
js = "var q=document.getElementById(\"kw\");q.style.border=\"2px solid red\";"
driver.execute_script(js)

# Hide the Baidu logo image (this relies on jQuery's $ being loaded on the page)
img = driver.find_element_by_xpath("//*[@id='lg']/img")
driver.execute_script('$(arguments[0]).fadeOut()',img)

# Scroll down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Scroll down 10000 pixels
js = "document.body.scrollTop=10000"
#js="var q=document.documentElement.scrollTop=10000"
driver.execute_script(js)

#Get values from Web elements
print(driver.execute_script('return document.getElementById("fsr").innerText'))

driver.quit()

Getting a value from a web element with driver.execute_script may raise a WebDriverException:
selenium.common.exceptions.WebDriverException: Message: unknown error: Cannot read property 'innerText' of null
Solution: JavaScript could not find the element to operate on; check that it exists on the page.
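One way to make that check part of the script itself is to guard against a missing element in JavaScript and return null, which execute_script hands back to Python as None. A minimal Python 3 sketch follows; GET_TEXT_JS, get_inner_text, and FakeDriver are hypothetical names, and FakeDriver merely imitates what the script would return for a known set of ids so the snippet runs without a browser.

```python
# The element id is passed as arguments[0]; a missing element yields null.
GET_TEXT_JS = (
    "var el = document.getElementById(arguments[0]);"
    "return el ? el.innerText : null;"
)


def get_inner_text(driver, element_id):
    """Return the element's innerText, or None when the id is not on the page."""
    return driver.execute_script(GET_TEXT_JS, element_id)


class FakeDriver:
    """Hypothetical stand-in that mimics the script's result for a known page."""

    def __init__(self, texts_by_id):
        self._texts = texts_by_id

    def execute_script(self, script, *args):
        # A real driver runs the script in the browser; here we just look up the id.
        return self._texts.get(args[0])


driver = FakeDriver({"fsr": "related searches"})
print(get_inner_text(driver, "fsr"))
print(get_inner_text(driver, "does-not-exist"))
```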

5. Page Waiting

More and more web pages now use Ajax, so a program cannot tell when an element has finished loading. If the page takes too long and a DOM element has not yet appeared when your code uses the WebElement, a NoSuchElementException is thrown.

To avoid such failures in locating elements, which also make an ElementNotVisibleException more likely, Selenium offers two ways to wait: implicit and explicit.

An implicit wait waits up to a fixed amount of time; an explicit wait specifies a condition and waits until that condition holds.

A. Implicit Waiting

An implicit wait is the simpler of the two: just set a wait time in seconds.

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10) # seconds
driver.get("http://www.xxxxx.com/loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")

Of course, if not set, the default wait time is 0.

B. Explicit wait

An explicit wait specifies a condition and a maximum wait time. If the element is not found within that time, an exception is thrown.

from selenium import webdriver
from selenium.webdriver.common.by import By
# WebDriverWait class, responsible for polling until the condition holds
from selenium.webdriver.support.ui import WebDriverWait
# expected_conditions module, which provides the wait conditions
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.xxxxx.com/loading")
try:
    # Poll until the element with id="myDynamicElement" appears (at most 10 seconds)
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

If no polling interval is given, the condition is checked every 0.5 seconds by default, and the wait returns immediately once the element exists.
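The polling behavior can be pictured with a plain-Python loop. This is only a simplified Python 3 sketch of the idea, not Selenium's actual implementation: wait_until and WaitTimeout are hypothetical names, while poll_frequency mirrors the name of the real WebDriverWait parameter (as in WebDriverWait(driver, 10, poll_frequency=0.5)).

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy within the timeout."""


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_frequency)


# Demo: a condition that only succeeds on the third check.
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2.0, poll_frequency=0.01))
print(attempts["count"])
```

The short poll_frequency in the demo just keeps it fast; a larger interval means fewer checks but a longer delay before a newly loaded element is noticed.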

Here are some built-in wait conditions that you can call directly instead of writing your own.

title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable – it is Displayed and Enabled.
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present

6. Hands-on Demonstration

Log in to Douyu (demo of a simulated site login):

#coding=utf-8
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class Douyu():
    def __init__(self):
        self.url = "https://www.douyu.com/"
        self.driver = webdriver.PhantomJS()

    def log_in(self):
        self.driver.get(self.url)
        time.sleep(3) # Sleep for 3 seconds and wait for the page to load
        self.driver.save_screenshot("0.jpg")
        #Enter account
        self.driver.find_element_by_xpath('//*[@id="form_email"]').send_keys("xxxxx@qq.com")
        #Input password
        self.driver.find_element_by_xpath('//*[@id="form_password"]').send_keys("xxxx")
        #Click Login
        self.driver.find_element_by_class_name("bn-submit").click()
        time.sleep(2)
        self.driver.save_screenshot("douyu.jpg")
        #Output cookies after login
        print(self.driver.get_cookies())

    def __del__(self):
        '''Destructor, called automatically when the object is destroyed (for example, at program exit).
        Similarly, it can be used to close an open file or a database connection.
        '''
        self.driver.quit()

if __name__ == "__main__":
    douyu = Douyu() #instantiation
    douyu.log_in()  #The login method is then called

Crawl all room information from the Douyu live-streaming platform (demo of simulated clicks on a dynamic page):

#coding=utf-8
from selenium import webdriver
import json
import time

class Douyu:
    # 1. Send a request for the first page
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.get("https://www.douyu.com/directory/all") #Request Home Page

    # Get the content of the current page
    def get_content(self):
        time.sleep(3) #Wait three seconds for each request to be sent and for the page to load
        li_list = self.driver.find_elements_by_xpath('//ul[@id="live-list-contentbox"]/li')
        contents = []
        for i in li_list: #Walk through the room list
            item = {}
            item["img"] = i.find_element_by_xpath("./a//img").get_attribute("src") #Get picture of room
            item["title"] = i.find_element_by_xpath("./a").get_attribute("title") #Get room name
            item["category"] = i.find_element_by_xpath("./a/div[@class='mes']/div/span").text #Get Room Classification
            item["name"] = i.find_element_by_xpath("./a/div[@class='mes']/p/span[1]").text #Get the host name
            item["watch_num"] = i.find_element_by_xpath("./a/div[@class='mes']/p/span[2]").text #Get Number of Viewers
            print(item)
            contents.append(item)
        return contents
    #Save Local
    def save_content(self,contents):
        f = open("douyu.txt","a")
        for content in contents:
            json.dump(content,f,ensure_ascii=False,indent=2)
            f.write("\n")
        f.close()

    def run(self):
        #1. Send a request for the first page
        #2. Get the information on the first page
        contents = self.get_content()
        # Save the content
        self.save_content(contents)
        #3. Keep clicking the next-page button until no element with class "shark-pager-next" is found
        while True:
            # find_elements returns an empty list instead of raising when nothing matches
            next_buttons = self.driver.find_elements_by_class_name("shark-pager-next")
            if not next_buttons:
                break
            next_buttons[0].click() # Click the next-page button
            # 4. Continue to get the contents of the next page
            contents = self.get_content()
            #4.1. Save Content
            self.save_content(contents)

if __name__ == "__main__":
    douyu = Douyu()
    douyu.run()
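The crawler above appends pretty-printed JSON objects to douyu.txt one after another. If the items need to be read back later, one possible variant (an alternative, not the approach used above) is to write one JSON object per line. A minimal Python 3 sketch; save_items, load_items, the file name douyu.jsonl, and the sample items are all made up for illustration.

```python
import json
import os
import tempfile


def save_items(path, items):
    """Append each item as one JSON object per line (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path):
    """Read the items back, one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample items with some of the keys the crawler collects.
# "\u4e07" is a Chinese character, written as an escape to keep this source ASCII.
path = os.path.join(tempfile.mkdtemp(), "douyu.jsonl")
save_items(path, [{"title": "room A", "watch_num": "1.2\u4e07"}])
save_items(path, [{"title": "room B", "watch_num": "800"}])
print(load_items(path))
```

Because each page's items are appended separately, the file stays readable line by line even if the crawl is interrupted partway through.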

Coming up next:

  • [Python Crawler] 9. Tesseract for Machine Vision and Machine Image Recognition
  • [Python Crawler] 10. Scrapy Framework

If you have any questions or suggestions, please leave a comment!


Posted by digitalecartoons on Sat, 15 Feb 2020 18:03:25 -0800