Today's practice is to scrape news content from a local site. Instead of printing the full text, we only need the first two or three paragraphs, so we can locate the p tags of the first three paragraphs directly:
content1 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[1]").text
content2 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[2]").text
content3 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[3]").text
However, one article raised an error when actually fetched, because it was short and had no third paragraph:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element
So I considered using try/except to handle the exception and thereby test whether a third paragraph exists. The exception raised is NoSuchElementException; to use it, add "from selenium.common.exceptions import NoSuchElementException" at the top of the script.
If there is no third paragraph (p[3] does not exist), only the first two paragraphs are printed; if no exception occurs, all three are printed:
try:
    content3 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[3]").text
except NoSuchElementException:
    # No third paragraph: a NoSuchElementException was raised, so print only the first two paragraphs
    print(content1, content2)
else:
    print(content1, content2, content3)
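The same try/except/else flow can be exercised without a browser. Below is a minimal sketch where IndexError stands in for NoSuchElementException, and the helper name and paragraph list are hypothetical:

```python
def join_first_three(paragraphs):
    # Mirror of the scraper's logic: try to read a third paragraph,
    # fall back to two if it is missing. IndexError plays the role of
    # Selenium's NoSuchElementException here.
    try:
        content3 = paragraphs[2]
    except IndexError:
        return " ".join(paragraphs[:2])
    else:
        return " ".join(paragraphs[:2]) + " " + content3

print(join_first_three(["First.", "Second."]))            # short article
print(join_first_three(["First.", "Second.", "Third.", "Fourth."]))
```

The else branch only runs when the try block raised nothing, which is exactly why the three-paragraph print belongs there rather than inside try.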
The complete code is as follows:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
driver.get("http://news.cnpc.com.cn/hynews/")
time.sleep(1)
for i in range(0, 20):  # Grabbing too many old articles makes no sense; 20 is enough
    # Re-fetch the link list each time, since navigating away stales the references
    links = driver.find_elements_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[2]/div[2]/div/ul/*/a")
    link = links[i]
    link.click()
    time.sleep(1)
    # The article opens in a new tab: switch to the handle that is not the index page
    handles = driver.window_handles
    index_handle = driver.current_window_handle
    for handle in handles:
        if handle != index_handle:
            driver.switch_to.window(handle)
    title = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[2]/h2/a").text
    print(i + 1, title)
    content1 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[1]").text
    content2 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[2]").text
    try:
        content3 = driver.find_element_by_xpath("//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p[3]").text
    except NoSuchElementException:
        # Short article with no third paragraph: print only two
        print(content1, content2)
    else:
        print(content1, content2, content3)
    print("\n")
    driver.close()  # close the article tab
    time.sleep(1)
    driver.switch_to.window(index_handle)  # back to the index page
print("——CNPC: grabbed 20 news items——")
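As an aside, there is an alternative to try/except worth knowing: Selenium's find_elements (plural) returns an empty list instead of raising NoSuchElementException when nothing matches, so you can fetch all paragraphs once and slice. A minimal sketch of the slicing part follows; the helper name is mine, and the Selenium call shown in the comment is how the texts would be obtained:

```python
def first_paragraphs(texts, n=3):
    """Join up to the first n paragraph texts; shorter articles simply yield fewer."""
    return " ".join(texts[:n])

# With Selenium this would be fed by find_elements (plural), e.g.:
#   texts = [p.text for p in driver.find_elements_by_xpath(
#       "//*[@id='newsmain-ej']/div/div[1]/div[1]/div[4]/div/p")]
# find_elements returns [] when nothing matches, so no exception handling is needed.
print(first_paragraphs(["one", "two"]))                    # short article
print(first_paragraphs(["one", "two", "three", "four"]))   # long article
```

Slicing past the end of a list is safe in Python, which is what makes this pattern shorter than the try/except version.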