From login to crawling: using Python to get around anti-crawling and fetch tens of thousands of public product listings from Taobao

Keywords: Selenium Python Google pip

At some point I started to genuinely enjoy crawling data by the thousands!

This article walks through how to use Python, working around the site's anti-crawling measures, to obtain tens of thousands of public product listings from Taobao.

Contents

1 Preliminary preparation

2 Case details

2.1 Importing modules

2.2 Core code

2.3 Complete code

3 Closing remarks

1 Preliminary preparation

Python environment: Python 3.8.2

Python IDE: JetBrains PyCharm 2018.1.2 x64

Libraries and modules: selenium (third-party), plus the standard-library modules time, csv, re

In addition, you need a browser driver: WebDriver

selenium is a third-party library and has to be installed separately. Enter the following command in a terminal:

pip install selenium

Then run import selenium in Python; if no error is reported, the third-party library was installed successfully.
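A quick way to double-check the installation (a minimal sketch, run in any Python shell):

import selenium
print(selenium.__version__)  # Prints the installed selenium version if the install succeeded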

Now let's look at how to install the browser driver (taking Google Chrome as an example):

First, download the browser driver (WebDriver).

Chrome driver download address: http://npm.taobao.org/mirrors/chromedriver/

Firefox driver download address: https://github.com/mozilla/geckodriver/releases

Edge driver download address: https://developer.microsoft.com/en-us/micrsosft-edage/tools/webdriver

Safari driver download address: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Taking Chrome as an example, you first need to know the browser's version number.

The driver only has to match the browser's major version (the leading part of the version number); an exact match of every digit is not required. Find the matching driver version and download it.

After downloading, test it with the snippet below:

# Import webdriver from selenium
from selenium import webdriver

# Point to the local chromedriver executable (the driver downloaded above)
driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')

# get method to open the specified URL
driver.get('http://www.baidu.com')
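If you want to confirm that the driver and the browser versions really do match, one hedged check (just a sketch; the capability keys below are what recent ChromeDriver builds typically report) is to print the capabilities of the running session:

# Print the browser and chromedriver versions reported by the running session
caps = driver.capabilities
print(caps.get('browserVersion') or caps.get('version'))   # Chrome version
print(caps.get('chrome', {}).get('chromedriverVersion'))   # chromedriver version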

At this point the preparation is done. Next we move on to the crawler case itself.

2 Case details

2.1 Importing modules

Import the third-party library and the modules mentioned above:

from selenium.webdriver import ActionChains # Import action chain
from selenium import webdriver
import time
import csv
import re

2.2 Core code

Target page: Taobao.com (the official site)

Write code to open the target web page automatically:

# Pass in the local path of the browser driver
driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')
# Pass in the address of the target page
driver.get('https://www.taobao.com/')

Maximize the browser window:

driver.maximize_window()  # Maximize the browser window

Pass in a keyword and search for products automatically:

keyword = input('Please enter the product name you want to search:')
driver.find_element_by_id('q').send_keys(keyword)  # Locate Taobao's search box precisely by its id value 'q' and type in the keyword
driver.find_element_by_class_name('btn-search').click()  # Locate the search button by its class name 'btn-search' and click it

At this point we find that we have to log in before the search results can be viewed, so the next step is to handle the login.

Pass in the account and password (use F12 / the browser developer tools to find the XPath of each input field):

driver.find_element_by_xpath('//*[@id="fm-login-id"]').send_keys('account')   # Replace with your account name
driver.find_element_by_xpath('//*[@id="fm-login-password"]').send_keys('password')   # Replace with your password
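If these two lines run before the login form has finished rendering, find_element will raise a NoSuchElementException. One hedged alternative (a sketch only, assuming the same element ids as above) is to wait explicitly for the field to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the login id field to appear before typing into it
wait = WebDriverWait(driver, 10)
login_id = wait.until(EC.presence_of_element_located((By.ID, 'fm-login-id')))
login_id.send_keys('account')  # Replace with your account name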

Deal with the human-machine verification (an anti-crawling measure: the slider has to be dragged to the right):

login = driver.find_element_by_xpath('//*[@id="nc_1_n1z"]')  # Find the slider via its XPath
action = ActionChains(driver)  # Create an action chain
action.click_and_hold(on_element=login)  # Click and hold without releasing
action.move_by_offset(xoffset=300-42, yoffset=0)  # Drag along the x axis
action.pause(0.5).release().perform()  # Pause briefly, release the mouse; perform() executes the whole chain
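A single jump of roughly 258 pixels is often flagged by the slider's risk control. One hedged variant (a sketch, not the author's code) is to break the drag into small, slightly randomized steps so the movement looks more human:

import random

action = ActionChains(driver)
action.click_and_hold(on_element=login)
moved = 0
while moved < 258:
    # Move in small random steps with short pauses instead of one big jump
    step = random.randint(20, 40)
    action.move_by_offset(xoffset=step, yoffset=0).pause(random.uniform(0.05, 0.2))
    moved += step
action.release().perform()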

Get the target data from the whole page (with a for loop):

divs = driver.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')  # All product cards on the page
for div in divs:
    info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text  # Product title
    price = div.find_element_by_xpath('.//strong').text  # Price
    deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text  # Number of deals
    shop = div.find_element_by_xpath('.//div[@class="shop"]/a').text  # Shop name

Save the data to a file (in csv format):

with open('data.csv', mode='a', newline='', encoding='utf-8') as csvfile:  # utf-8 avoids encoding errors with Chinese text on Windows
    csvWriter = csv.writer(csvfile, delimiter=',')
    csvWriter.writerow([info, price, deal, shop])
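One optional refinement (not in the original code) is to write a header row the first time the file is created; info, price, deal and shop are the variables filled in by the loop above, and the column names are only illustrative:

import os

write_header = not os.path.exists('data.csv')
with open('data.csv', mode='a', newline='', encoding='utf-8') as csvfile:
    csvWriter = csv.writer(csvfile, delimiter=',')
    if write_header:
        csvWriter.writerow(['title', 'price', 'deals', 'shop'])  # Header row, written only once
    csvWriter.writerow([info, price, deal, shop])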

The above crawls a single page of data; so how do we crawl multiple pages?

Get the total number of pages:

page = driver.find_element_by_xpath('//*[@id="mainsrp-pager"]/div/div/div/div[1]').text  # Grab the label that shows the total page count
page_list = re.findall(r'(\d+)', page)  # A regular expression extracts the numbers (returned as a list of strings)
page_num = int(page_list[0])  # Convert the string to an integer

A loop then walks through the remaining pages to collect all of the product data; inside the loop, page_num is the current page index (starting at 1) and the loop body fetches the next page:

driver.get('https://s.taobao.com/search?q={}&s={}'.format(keyword, page_num * 44))
page_num += 1

It is worth noting that the page URL used above was worked out by inspecting the addresses of successive result pages.

Starting from the second page of results, the s parameter begins at 44 and increases by 44 for each further page (44 items per page), so page n corresponds to s = (n - 1) * 44.
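A minimal sketch of that rule (total_pages is a hypothetical name for the page count obtained above, and keyword is the search term):

# Build the URL of every remaining result page from the 44-items-per-page offset rule
for n in range(2, total_pages + 1):
    url = 'https://s.taobao.com/search?q={}&s={}'.format(keyword, (n - 1) * 44)
    print(url)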

2.3 Complete code

from selenium.webdriver import ActionChains # Import action chain
from selenium import webdriver
import time
import csv
import re

# Search for keywords and log in to Taobao
def search_product(key):
    driver.get('https://www.taobao.com/')
    driver.find_element_by_id('q').send_keys(key)  # Locate Taobao's search box precisely by its id value 'q' and type in the keyword
    driver.find_element_by_class_name('btn-search').click()  # Locate the search button by its class name 'btn-search' and click it

    driver.implicitly_wait(10) # Implicitly wait (in seconds) until the page is rendered, and then no longer wait
    driver.maximize_window() # Maximize browser

    # Handle login (login anti-crawling, e.g. the slider)
    driver.find_element_by_xpath('//*[@id="fm-login-id"]').send_keys('fill in your account name / mobile number here')
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="fm-login-password"]').send_keys('fill in your account password here')
    time.sleep(2)

    # Solve slider
    login = driver.find_element_by_xpath('//*[@id="nc_1_n1z"]')  # Find the slider via its XPath
    action = ActionChains(driver)  # Create an action chain
    action.click_and_hold(on_element=login) # Click not to release
    action.move_by_offset(xoffset=300-42, yoffset=0) # Slide by axis
    action.pause(0.5).release().perform() # Set the chain call time (slider time) and release the mouse perform() to execute the action chain
    driver.find_element_by_xpath('//*[@id="login-form"]/div[4]/button').click()  # Click the login button; the page redirects back to the keyword search results
    driver.implicitly_wait(10) # Implicit wait

    page = driver.find_element_by_xpath('//*[@id="mainsrp-pager"]/div/div/div/div[1]').text  # Grab the label that shows the total page count
    page_list = re.findall(r'(\d+)', page)  # A regular expression extracts the numbers (returned as a list of strings)
    page_num = page_list[0]   # Still a string here; converted to int below

    return int(page_num)

# Crawling data and saving
def get_data():
    divs = driver.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')
    for div in divs:
        info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
        price = div.find_element_by_xpath('.//strong').text
        deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text
        shop = div.find_element_by_xpath('.//div[@class="shop"]/a').text
        print(info, price, deal, shop, sep='|')

        # preservation
        with open('data.csv', mode='a', newline='', encoding='utf-8') as csvfile:  # utf-8 avoids encoding errors with Chinese text on Windows
            csvWriter = csv.writer(csvfile, delimiter=',')
            csvWriter.writerow([info, price, deal, shop])

def main():
    print('Crawling page 1 data...')
    page = search_product(keyword)
    get_data()

    # Data acquisition after page 2
    page_num = 1   # Current page index; the URL offset for the next page is page_num * 44
    while page_num != page:
        print('*' * 100)
        print('Crawling page {} data'.format(page_num + 1))
        print('*' * 100)
        driver.get('https://s.taobao.com/search?q={}&s={}'.format(keyword, page_num*44))
        driver.implicitly_wait(10)  # Implicit wait
        get_data()
        page_num += 1

    driver.quit()

if __name__ == '__main__':
    driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')
    # keyword = 'computer'
    keyword = input('Please enter the product name you want to search:')
    main()

Screenshots of the overall result: the PyCharm console output during the run, and the data.csv file opened afterwards (images not reproduced here).

3 Closing remarks

I have recently been preparing for final exams. After July I will start a crawler column, "Python Network Data Crawling and Analysis: From Beginner to Proficient".
Interested readers are welcome to follow along!


★ Copyright notice: This is an original article by the CSDN blogger "Rongzai! The most beautiful boy!", released under the CC 4.0 BY-SA copyright agreement.
For reprints, please include the original source link and this statement.

You're welcome to leave a comment so we can learn and exchange ideas together~~~

Thank you for reading

END
