web crawler explanation - PhantomJS virtual browser + selenium module operation PhantomJS

Keywords: Python Selenium network Programming

PhantomJS Virtual Browser

phantomjs is a headless browser based on JS webkit kernel, that is, a browser without a display interface. With this software, you can get any information loaded by the web site js, that is, information loaded asynchronously by the browser

Unzip the Phantom JS file after downloading, unzip the folder, and cut it into the python installation folder

Then add the system environment variable to the bin folder in the PhantomJS folder

cdm input command: PhantomJS appears with the following information indicating successful installation

selenium module is a module for python to operate PhantomJS software

selenium module PhantomJS software

webdriver.PhantomJS() instantiates PhantomJS browser object
get('url') to visit the website
find_element_by_xpath('xpath expression') Finds the corresponding element through an XPath expression
clear() Clears the contents of the input box
send_keys('Content') writes content to the input box
click() click event
get_screenshot_as_file('Screenshot save path name') Saves a screenshot of a Web page to this directory
page_source Get htnl Source for Web Page
quit() Close PhantomJS browser

I don't know what to add to my learning
python Learning Communication Buttons qun,784758214
//There are good learning video tutorials, development tools and e-books in the group.
//Share with you the current talent needs of the python enterprise and how to learn Python from a zero-based perspective, and what to learn
#!/usr/bin/env python
# -*- coding:utf8 -*-
from selenium import webdriver  #Import selenium module to operate PhantomJS
import os
import time
import re

llqdx = webdriver.PhantomJS()  #Instantiate PhantomJS Browser Object
llqdx.get("https://www.baidu.com/") #Visit the web address

# time.sleep(3)   #Wait 3 seconds
# llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

#Simulate user action
llqdx.find_element_by_xpath('//*[@id="kw"]').clear() #Find the input box through an xpath expression, and clear() empties the contents of the input box
llqdx.find_element_by_xpath('//*[@id="kw"]').send_keys('hawking network') #Find the input box through the xpath expression, send_keys() writes the content to the input box
llqdx.find_element_by_xpath('//*[@id="su"]').click()#Find the search button through the xpath expression, click() click the event

time.sleep(3)   #Wait 3 seconds
llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

neir = llqdx.page_source   #Get Web Content
print(neir)
llqdx.quit()    #Close Browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  #Regularly Match Page Title
print(title)

If you are still confused in the world of programming, you can join us in Python learning to deduct qun:784758214 and see how our forefathers learned.Exchange experience.From basic Python scripts to web development, crawlers, django, data mining, and so on, zero-based to project actual data are organized.For every Python buddy!Share some learning methods and small details that need attention, Click to join us python learner cluster

PhantomJS browser masquerade, and scrollbar loading data

Some websites load data dynamically and require scrollbars to scroll through the data

Implementation Code

DesiredCapabilities to disguise browser objects
execute_script() executes js code

current_url Gets the current URL

#!/usr/bin/env python
# -*- coding:utf8 -*-
from selenium import webdriver  #Import selenium module to operate PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities   #Import Browser Masquerade Module
import os
import time
import re

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0')
print(dcap)
llqdx = webdriver.PhantomJS(desired_capabilities=dcap)  #Instantiate PhantomJS Browser Object

llqdx.get("https://www.jd.com/") #Visit the web address

#Simulate user action
for j in range(20):
    js3 = 'window.scrollTo('+str(j*1280)+','+str((j+1)*1280)+')'
    llqdx.execute_script(js3)  #Execute js language scrollbar
    time.sleep(1)

llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

url = llqdx.current_url
print(url)

neir = llqdx.page_source   #Get Web Content
print(neir)
llqdx.quit()    #Close Browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  #Regularly Match Page Title
print(title)

Posted by sabbagh on Tue, 03 Sep 2019 15:13:16 -0700