[Python crawler] Using Selenium to get Baidu search results and the related keywords marked in red

Keywords: Selenium brew vim Google

I. Environment setup

1. Install chrome driver

brew cask install chromedriver

2. Install selenium

pip3 install selenium

3. Install Beautiful Soup 4

pip3 install beautifulsoup4

4. Test with the following code

from selenium import webdriver

driver = webdriver.Chrome()  # launches the Chrome browser
driver.get('https://www.baidu.com')
print(driver.title)
driver.quit()

5. If you get an error like this:

raise WebDriverException("Can not connect to the Service %s" % self.path) selenium.common.exceptions.WebDriverException: Message: Can not connect to the Service /usr/local/bin/chromedriver

There are two solutions:

a) Make sure chromedriver is in a directory on your PATH.

My storage directory: /usr/local/bin/chromedriver

Check method: enter which chromedriver in the terminal.
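The same check can be run from Python; `shutil.which` from the standard library searches PATH the same way the shell's `which` does:

```python
import shutil

# shutil.which walks PATH exactly like the shell's `which` command;
# it returns the full path to the executable, or None if it is not found.
driver_path = shutil.which("chromedriver")
print(driver_path)  # e.g. /usr/local/bin/chromedriver, or None if missing
```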

b) If the 127.0.0.1 localhost entry is missing from the hosts file, the "Cannot connect to the service..." error will also appear.

Check method: ping localhost

hosts file location: /private/etc/hosts

Use vim to edit it.
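For reference, the standard macOS defaults in /private/etc/hosts look like this; restore the 127.0.0.1 line if it is missing:

```
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost
```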

6. If the error says the Chrome version does not match the ChromeDriver version

Enter chrome://version/ in the browser address bar to check your Chrome version.

Then look up the matching version on the official ChromeDriver site and download it: https://sites.google.com/a/chromium.org/chromedriver/ . Extract it to the /usr/local/bin/ folder.

II. Obtaining search results and the related keywords marked in red

Use the following code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup


browser_path = "/usr/local/bin/chromedriver"
browser = webdriver.Chrome(browser_path)
browser.get('https://www.baidu.com')
browser_input = browser.find_element_by_id('kw')  # Baidu's search box has id="kw"
browser_input.clear()
query = "Yang Guofu, spicy hot "
browser_input.send_keys(query)
browser_input.send_keys(Keys.RETURN)

# Wait until the result page loads (its title contains the query)
ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
try:
    WebDriverWait(browser, 10, ignored_exceptions=ignored_exceptions) \
        .until(EC.title_contains(query))
except TimeoutException:
    browser.quit()
    raise

# Use BeautifulSoup to parse the search results
bsobj = BeautifulSoup(browser.page_source, features="html.parser")

# Each search result lives in a <div class="result c-container">
search_results = bsobj.find_all('div', {'class': 'result c-container'})

for search_item in search_results:
    # Get all text from the title of each search result
    text = search_item.h3.a.get_text(strip=True)
    # Get the red keywords in the title; Baidu wraps them in <em> tags
    keywords = search_item.h3.a.find_all('em')
    # To use the summary content of each search result instead:
    # text = search_item.div.get_text(strip=True)
    # keywords = search_item.div.find_all('em')
    print(text)
    print(keywords)

browser.close()
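Since Baidu's live markup can change and a full Selenium run is hard to reproduce, the extraction logic can be exercised on a small hand-written HTML snippet that mimics the result markup (the class names match what the crawler looks for; the content itself is made up):

```python
from bs4 import BeautifulSoup

# Hand-written HTML mimicking one Baidu search result (made-up content).
html = """
<div class="result c-container">
  <h3><a href="#"><em>Yang Guofu</em> spicy hot - official site</a></h3>
  <div>Find a <em>Yang Guofu</em> <em>spicy hot</em> shop near you.</div>
</div>
"""

bsobj = BeautifulSoup(html, features="html.parser")
for search_item in bsobj.find_all('div', {'class': 'result c-container'}):
    # Full title text, with whitespace stripped from each segment
    title = search_item.h3.a.get_text(strip=True)
    # <em> tags mark the keywords Baidu renders in red
    highlighted = [em.get_text(strip=True) for em in search_item.h3.a.find_all('em')]
    print(title)
    print(highlighted)
```

The same pattern then applies unchanged to `browser.page_source` from the Selenium session above.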

 

Reference websites:

https://blog.csdn.net/Excaliburrr/article/details/79164163
https://blog.csdn.net/piglite/article/details/86352734
https://www.jianshu.com/p/cc45e1e15586

Posted by ari_aaron on Wed, 20 Nov 2019 10:14:02 -0800