Selenium basic usage
Running environment: from selenium.webdriver import Chrome
1. Create browser object
b = Chrome('files/chromedriver')
2. Open the page
b.get('https://www.qidian.com/rank/yuepiao/month10/')
3. Get web data
print(b.page_source)
4. Close the web page
b.close()
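The four steps above leave one detail implicit: cleanup when something fails between opening and closing the page. A minimal sketch, assuming a hypothetical helper named `fetch_page_source` that accepts an already-created browser object. It calls `quit()`, which ends the whole WebDriver session, rather than `close()`, which only closes the current window:

```python
def fetch_page_source(browser, url):
    """Open url in the given browser and return the rendered HTML.

    `browser` is any Selenium WebDriver instance, for example the Chrome
    object created in step 1. The helper name is hypothetical.
    """
    try:
        browser.get(url)
        return browser.page_source
    finally:
        # quit() shuts down the browser and the driver process,
        # even if get() raised an exception.
        browser.quit()
```

With the objects from this section it could be called as `html = fetch_page_source(Chrome('files/chromedriver'), 'https://www.qidian.com/rank/yuepiao/month10/')`.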
Selenium common configurations
Running environment: from selenium.webdriver import Chrome, ChromeOptions
import time
1. Create the options object for the Chrome browser
options = ChromeOptions()
1) Hide the "controlled by automated test software" notice
options.add_experimental_option('excludeSwitches', ['enable-automation'])
2) Disable image loading to speed up page loads
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
2. Create a browser and open a web page
b = Chrome('files/chromedriver', options=options)
b.get('https://www.jd.com')
print(b.page_source)
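The two settings above can be bundled so that every browser created by a script is configured the same way. A minimal sketch, assuming a hypothetical helper named `apply_common_options` and a hypothetical `block_images` flag; it only calls `add_experimental_option`, exactly as the lines above do:

```python
def apply_common_options(options, block_images=True):
    """Apply the two settings from this section to a ChromeOptions object."""
    # Drop the 'enable-automation' switch so Chrome hides its automation notice.
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    if block_images:
        # The value 2 blocks this content type, so images are not downloaded.
        options.add_experimental_option(
            'prefs', {'profile.managed_default_content_settings.images': 2}
        )
    return options
```

It would be used as `b = Chrome('files/chromedriver', options=apply_common_options(ChromeOptions()))`.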
Get and manipulate web page tags
Running environment: from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
# goods = input('Please enter the product type you want to obtain: ')
b = Chrome('files/chromedriver')
b.get('https://www.jd.com')
1. Get tags
Browser object.find_element_by... - returns a single tag (the first match)
Browser object.find_elements_by... - returns a list whose elements are the matching tags
search = b.find_element_by_id('key') # b.find_element_by_css_selector('#key')
2. Operate on tags
1) Input box (input tag): enter text
search.send_keys('computer')
# Press enter
# search.send_keys(Keys.ENTER)
2) Click a tag (a button or hyperlink)
Get the tag to click
search_btn = b.find_element_by_xpath('//div[@role="serachbox"]/button')
Click it
search_btn.click()
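The `find_element_by...` helpers used here were deprecated in Selenium 4.0 and removed in 4.3; newer releases take a `By` strategy plus a value instead. A minimal sketch of the same search flow in that style, assuming a hypothetical helper named `search_jd`. The fallback `By` class exists only so the snippet can be read and exercised without Selenium installed:

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:
    # Stand-in with the same string values Selenium uses for these strategies.
    class By:
        ID = 'id'
        XPATH = 'xpath'


def search_jd(browser, keyword):
    """Type keyword into the search box and click the search button."""
    search = browser.find_element(By.ID, 'key')
    search.send_keys(keyword)
    # The role value is spelled exactly as in the XPath used above.
    search_btn = browser.find_element(By.XPATH, '//div[@role="serachbox"]/button')
    search_btn.click()
```

The plural form follows the same pattern: `browser.find_elements(By.CSS_SELECTOR, ...)` returns a list.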
Exercise: crawl 5 pages of 'data analysis' job listings from 51job and extract the position name, salary, company name and company type
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time
from lxml import etree

b = Chrome('files/chromedriver')


def get_html_by_chrome():
    url = 'https://www.51job.com'
    b.get(url)
    search_input = b.find_element_by_id('kwdselectid')
    search_input.send_keys('Data analysis')
    search_input.send_keys(Keys.ENTER)
    # Parse the current page, then click "next page"; repeat 5 times
    for _ in range(5):
        print('=======================================================================================\n')
        # print(b.page_source)
        analysis_data(b.page_source)
        time.sleep(1)
        # Named next_btn so the built-in next() is not shadowed
        next_btn = b.find_element_by_class_name('next')
        next_btn.click()


def analysis_data(html: str):
    html_node = etree.HTML(html)
    all_job_div = html_node.xpath('//div[@class="j_joblist"]/div[@class="e"]')
    for job_div in all_job_div:
        # Job name
        job_name = job_div.xpath('./a/p[@class="t"]/span[1]/text()')[0]
        # Salary (missing when the listing does not state one)
        try:
            salary = job_div.xpath('./a/p[@class="info"]/span[1]/text()')[0]
        except IndexError:
            salary = 'Negotiable'
        # Company name
        company_name = job_div.xpath('./div[@class="er"]/a/text()')[0]
        # Company type
        try:
            company_type = job_div.xpath('./div[@class="er"]/p[@class="int at"]/text()')[0]
        except IndexError:
            company_type = 'N/A'
        print(job_name, salary, company_name, company_type)


if __name__ == '__main__':
    get_html_by_chrome()
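The exercise handles missing fields by catching `IndexError` around each `[0]`. Because these `xpath()` queries return a plain list, the same fallback can be factored into a small helper. A sketch, assuming a hypothetical function named `first_or_default`; the sample values are made up for illustration:

```python
def first_or_default(results, default):
    """Return the first item of an XPath result list, or default if it is empty."""
    return results[0] if results else default


# Plain lists stand in for what job_div.xpath(...) would return.
found = first_or_default(['sample salary text'], 'unknown')
missing = first_or_default([], 'unknown')
print(found, missing)
```

Inside `analysis_data` this would replace each `try`/`except IndexError` pair with a single call that wraps the `xpath()` result.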
Page scrolling
Running environment: from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
1. Open JD, search for 'computer' and press enter
b = Chrome('files/chromedriver')
b.get('https://www.jd.com')
search_input = b.find_element_by_id('key')
search_input.send_keys('computer')
search_input.send_keys(Keys.ENTER)
# print(b.page_source)
time.sleep(1)
2. Scroll slowly to the specified position
height = 0
while True:
    height += 500
    if height > 9000:
        break
    # Execute JS scrolling code: window.scrollTo(x, y)
    b.execute_script(f'window.scrollTo(0, {height})')
    time.sleep(1)

# soup = BeautifulSoup(b.page_source, 'lxml')
# all_goods_li = soup.select('#J_goodsList li')
# print(len(all_goods_li))
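The scrolling loop above can be wrapped in a reusable function so the step size, limit and pause become parameters. A minimal sketch, assuming a hypothetical helper named `scroll_in_steps`; with its defaults it visits the same offsets as the loop above (500, 1000, and so on up to 9000):

```python
import time


def scroll_in_steps(browser, step=500, max_height=9000, pause=1.0):
    """Scroll down step pixels at a time until max_height is reached.

    Returns the list of vertical offsets that were scrolled to.
    """
    offsets = []
    for height in range(step, max_height + 1, step):
        browser.execute_script(f'window.scrollTo(0, {height})')
        offsets.append(height)
        # Pause so lazily loaded content has time to arrive.
        time.sleep(pause)
    return offsets
```

After `scroll_in_steps(b)` returns, `b.page_source` can be parsed as in the commented-out BeautifulSoup lines.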
Waiting
Running environment: from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
b = Chrome('files/chromedriver')
b.get('https://www.jd.com')
1. Implicit wait
Normally, if a tag cannot be found in the page when the program tries to get it, an error is raised immediately.
An implicit wait sets a waiting time for that case: as long as the tag appears within the waiting time, no error is raised.
b.implicitly_wait(10) # Set the waiting time to 10 seconds, which is globally valid
2. Explicit wait
1) First create a wait object: WebDriverWait(browser object, timeout)
wait = WebDriverWait(b, 5)
wait2 = WebDriverWait(b, 10)
2) Add a condition
Wait object.until(condition) - wait until the condition holds, then stop waiting
Wait object.until_not(condition) - wait until the condition no longer holds, then stop waiting
Common conditions:
EC.presence_of_element_located((By.X, value)) - checks whether an element has been added to the DOM tree (the tag is loaded into the page, but not necessarily visible). Returns the tag when the condition is true
EC.visibility_of_element_located((By.X, value)) - checks whether a tag is visible (not hidden, and its width and height are not 0). Returns the tag when the condition is true
EC.text_to_be_present_in_element((By.X, value), data) - checks whether the text content of a tag contains the expected string. Returns True when the condition is true
EC.text_to_be_present_in_element_value((By.X, value), data) - checks whether the value attribute of a tag contains the expected string. Returns True when the condition is true
EC.element_to_be_clickable((By.X, value)) - checks whether a tag can be clicked. Returns the tag when the condition is true
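`until` and `until_not` accept any callable that takes the browser object, so a condition not covered by the list above can be written by hand. A sketch, assuming a hypothetical condition factory named `page_source_contains`:

```python
def page_source_contains(text):
    """Build a wait condition that is true once text appears in the page HTML."""
    def condition(browser):
        # A truthy return value ends the wait; a falsy one keeps it polling.
        return text in browser.page_source
    return condition
```

With the wait objects defined above, it could be used as `wait.until(page_source_contains('J_goodsList'))`.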
# EC.presence_of_element_located((how to locate the tag, value))
wait.until(EC.presence_of_element_located((By.ID, 'key')))
search_input = b.find_element_by_id('key')
# The content of an input tag (input box) is the value of its value attribute
wait2.until(EC.text_to_be_present_in_element_value((By.ID, 'key'), 'computer'))
search_input.send_keys(Keys.ENTER)
print('===============end==============')
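Conceptually, `until` polls the condition until it returns a truthy value or the timeout runs out. The sketch below is a simplified stand-in that illustrates this behaviour, not Selenium's actual implementation: the name `wait_until` and its parameters are hypothetical, and it raises the built-in `TimeoutError` where Selenium raises its own `TimeoutException`.

```python
import time


def wait_until(condition, subject, timeout=5.0, poll_interval=0.5):
    """Call condition(subject) repeatedly until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(subject)
        if result:
            # Like until(), hand back whatever truthy value the condition produced.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)
```

This also shows why conditions such as `EC.presence_of_element_located` can return the tag itself: the truthy value is passed straight back to the caller.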