Data acquisition experiment 5

Assignment ①

1.1 Assignment content

Requirements: become familiar with how Selenium locates HTML elements, crawls Ajax-loaded page data, and waits for HTML elements.

Use the Selenium framework to crawl the information and pictures of a chosen kind of product on Jingdong Mall (JD).

Candidate site: http://www.jd.com/

Keywords: chosen freely by the student

Output: the data is stored in MySQL in the following format

1.2 problem solving process

1.2.1 access to web pages

Open the JD homepage, type the search keyword into the search box, and then click the search button.

Locating the search input box:

Find the input element with id="key".

Locating the search button:

It is the button inside the second div under the first div of the element with id="search".

Code part:

driver.get(url)  # Visit the home page
inputit = driver.find_element_by_xpath('//*[@id="key"]')
inputit.send_keys(keyWord)
findbutton = driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
findbutton.click()
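
The `find_element_by_xpath` helpers used in this write-up were removed in Selenium 4.3, so on a current install the same steps need the `find_element(By.XPATH, ...)` form. A minimal sketch, assuming the XPath expressions above still match the JD page; the function name `search_jd` is made up for illustration, and the locator strategy is passed as the plain string `"xpath"`, which is the value of `By.XPATH`.

```python
XPATH = "xpath"  # same value as selenium.webdriver.common.by.By.XPATH


def search_jd(driver, url, keyword):
    """Open the home page, type the keyword and click the search button."""
    driver.get(url)
    search_box = driver.find_element(XPATH, '//*[@id="key"]')
    search_box.send_keys(keyword)
    button = driver.find_element(XPATH, '//*[@id="search"]/div/div[2]/button')
    button.click()
```

With a real browser this would be called as `search_jd(driver, "http://www.jd.com/", keyWord)`.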

1.2.2 acquisition of target data

The product data sits in a list under the tag with class="gl-warp clearfix".

So get the product list first:

shoplist = driver.find_element_by_xpath('//*[@class="gl-warp clearfix"]')

Then locate each field inside it. Take the price as an example:

The price is the text at the strong/i path under the div with class="p-price" inside the product list, so search for:

price = shoplist.find_elements_by_xpath('.//div[@class="p-price"]/strong/i')
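
`find_elements_by_xpath` returns a list of elements rather than the prices themselves; the value is in each element's `.text`. A small sketch of turning that list into numbers, assuming the text is a plain decimal such as `59.90` and that an element whose content has not loaded yet has empty text. The helper name `to_prices` is hypothetical.

```python
def to_prices(elements):
    """Convert located price elements to floats; empty or malformed text becomes None."""
    prices = []
    for element in elements:
        text = element.text.strip()
        try:
            prices.append(float(text))
        except ValueError:
            prices.append(None)
    return prices
```

It would be called as `to_prices(price)` on the list located above.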

1.2.3 page turning and pull-down

To turn the page, find the next-page button:

It is the a tag with class="pn-next". Click it.

import time

def loadNext():                 # Go to the next page
    time.sleep(1)
    try:
        NextPage = driver.find_element_by_xpath('//a[@class="pn-next"]')
        NextPage.click()
    except Exception:           # No clickable "next" link: treat this as the last page
        print("last page")
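
To crawl several pages in a row, the next-page click can report whether it succeeded so that the caller knows when to stop. A sketch under the assumption that a missing `pn-next` link means the last page has been reached; `load_next`, `crawl_pages`, `handle_page` and `max_pages` are illustrative names, not part of the original code.

```python
XPATH = "xpath"  # same value as selenium.webdriver.common.by.By.XPATH


def load_next(driver):
    """Click the next-page link; return False when it cannot be found or clicked."""
    try:
        driver.find_element(XPATH, '//a[@class="pn-next"]').click()
        return True
    except Exception:  # Selenium raises NoSuchElementException when nothing matches
        return False


def crawl_pages(driver, handle_page, max_pages=3):
    """Call handle_page(driver) on each page; stop at the last page or at max_pages."""
    for visited in range(1, max_pages + 1):
        handle_page(driver)
        if visited == max_pages or not load_next(driver):
            return visited
    return 0
```

The return value is the number of pages actually processed, which makes it easy to log how far a run got.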

JD loads the rest of a results page only as you scroll down. A page initially contains 30 products, and all 60 are loaded only after scrolling. Without scrolling, some data (such as pictures) may not be loaded and null values may appear.

def loadAll():
    # Scroll down in 100 px steps; this stops short of the very bottom, but nearly everything has loaded by then
    for i in range(100):
        js = 'window.scrollTo(0,%s)' % (i * 100)                # JS snippet that scrolls to the given offset
        driver.execute_script(js)                       # Run the script in the page
        time.sleep(0.07)

Pause briefly at each scroll position so that all data has time to load.
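
The fixed loop above scrolls to at most 9,900 px, whatever the real page height is. A variant that asks the browser for the current height and keeps scrolling until the bottom is reached might look like this; the step size, pause and `max_steps` cap are arbitrary assumptions to tune, and `scroll_to_bottom` is an illustrative name.

```python
import time


def scroll_to_bottom(driver, step=300, pause=0.07, max_steps=500):
    """Scroll down in small steps until the bottom of the page; return the final offset."""
    position = 0
    for _ in range(max_steps):
        # Re-read the height on every step because lazy loading can make the page grow.
        height = driver.execute_script("return document.body.scrollHeight")
        if position >= height:
            break
        position = min(position + step, height)
        driver.execute_script("window.scrollTo(0, arguments[0])", position)
        time.sleep(pause)  # give lazily loaded items time to appear
    return position
```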

1.3 output

Saved pictures:

Database:

1.4 experience

Basic use of Selenium: locating page content, typing input, clicking, and other events.

Using Selenium to run JavaScript in the page.

Assignment ②

2.1 Assignment content

Requirements: become familiar with how Selenium locates HTML elements, simulates user login, crawls Ajax-loaded page data, and waits for HTML elements. Use the Selenium framework and MySQL to crawl the course information of the China MOOC website (course number, course name, teaching progress, course status, course picture address). Also save each picture in the imgs folder under the root directory of the local project, using the course name as the file name.

Candidate site: China MOOC website, https://www.icourse163.org

Output: MySQL database storage, in the following output format

2.2 problem solving process

2.2.1 implementing the simulated login

First, find the window for entering the account and password.

Click "login | registration" in the upper right to pop up the login window. The address obtained directly with Copy XPath is

//*[@id="auto-id-1637751328911"]

That XPath searches by id, but the id changes every time the page is loaded, so the element cannot be located by id.

Locate it through the tag with class="unlogin" instead.

The same problem affects lookups of other content on this site.
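
The effect can be reproduced offline. The sketch below uses Python's built-in `xml.etree.ElementTree` as a stand-in for the browser (it is not Selenium) and two made-up page snapshots whose generated ids differ. A lookup by the id copied from the first load misses on the second load, while the class-based path matches both. The markup itself is an assumption for illustration only.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page: the generated id changes, the class does not.
FIRST_LOAD = '<body><div id="auto-id-1637751328911" class="unlogin"><a>login</a></div></body>'
SECOND_LOAD = '<body><div id="auto-id-1637751399999" class="unlogin"><a>login</a></div></body>'

BY_ID = ".//*[@id='auto-id-1637751328911']"
BY_CLASS = ".//div[@class='unlogin']"


def matches(page, path):
    """Return True when the path finds an element in the page snapshot."""
    return ET.fromstring(page).find(path) is not None


print(matches(FIRST_LOAD, BY_ID), matches(SECOND_LOAD, BY_ID))        # id works once only
print(matches(FIRST_LOAD, BY_CLASS), matches(SECOND_LOAD, BY_CLASS))  # class works both times
```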

Once the login form with the phone number and password boxes is displayed, its input elements still cannot be found.

This is because the form is loaded inside a new frame (an iframe).

You need to switch into this frame first:

frame = driver.find_element_by_xpath('//div[@class="ux-login-set-container"]/iframe')
driver.switch_to.frame(frame)

After switching, locate the account and password boxes, fill them in, and click the login button:

The account and password input boxes sit under elements with class="u-input box", which can be used to locate them.

The login button has a fixed id, id="submitBtn", so it can be located directly.

inputUserName = driver.find_element_by_xpath('//div[@class="u-input box"][1]/input')
inputUserName.send_keys("")  # account (left blank here)
inputPasswd = driver.find_element_by_xpath('//div[@class="inputbox"]/div[2]/input[2]')
inputPasswd.send_keys("")  # password (left blank here)
LoginButton = driver.find_element_by_xpath('//*[@id="submitBtn"]')
LoginButton.click()
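
Once the driver has switched into the iframe, later lookups are resolved inside that frame until it switches back, so it is usually necessary to call `driver.switch_to.default_content()` before searching the main page again. A minimal sketch that pairs the two calls with a context manager; the name `inside_frame` is an assumption for illustration.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into the given iframe and always switch back to the main document."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()
```

The login steps would then go inside `with inside_frame(driver, frame): ...`, and the lookups that follow the block run against the main document again.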

2.2.2 data acquisition

After login, the page changes to the logged-in view:

When the page is driven through webdriver, the following appears:

A window in the lower right corner covers the "my course" button so that the button cannot be clicked:

This window needs to be closed:

agree = driver.find_element_by_xpath('//button[@class="btn ok"]')
agree.click()
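
The popup may not appear on every run, in which case a single-element lookup raises an exception. A tolerant variant can use `find_elements`, which returns an empty list when nothing matches. This is only a sketch; `dismiss_if_present` is a hypothetical helper name, and the default XPath is the one used above.

```python
XPATH = "xpath"  # same value as selenium.webdriver.common.by.By.XPATH


def dismiss_if_present(driver, xpath='//button[@class="btn ok"]'):
    """Click the first matching element if there is one; return True when it was clicked."""
    found = driver.find_elements(XPATH, xpath)  # empty list when nothing matches
    if not found:
        return False
    found[0].click()
    return True
```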

Then you can open the "my course" page:

All the required information sits under the div with class="course-panel-body-wrapper", so locate that first and then each field inside it:

course = driver.find_element_by_xpath('.//div[@class="course-panel-body-wrapper"]')
names = course.find_elements_by_xpath('.//span[@class="text"]')  # course names
schools = course.find_elements_by_xpath('.//div[@class="school"]')  # schools
courset = course.find_elements_by_xpath('.//span[@class="course-progress-text-span"]')  # teaching progress
status = course.find_elements_by_xpath('.//div[@class="course-status"]')  # course status; not named `time`, which would shadow the time module
hrefs = course.find_elements_by_xpath('.//div[@class="img"]/img')  # course picture elements
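
The assignment asks for each picture to be saved in the imgs folder under the course name. A sketch of that step using only the standard library, assuming the picture address is read from each element with `get_attribute("src")`; the helper names, the character replacement rule and the `.jpg` fallback extension are assumptions.

```python
import os
import re
import urllib.request


def image_path(course_name, image_url, folder="imgs"):
    """Build <folder>/<course name><ext>, replacing characters that are unsafe in file names."""
    safe_name = re.sub(r'[\\/:*?"<>|\s]+', "_", course_name).strip("_")
    extension = os.path.splitext(image_url.split("?")[0])[1] or ".jpg"
    return os.path.join(folder, safe_name + extension)


def save_image(image_url, path):
    """Download one image to the given path, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    urllib.request.urlretrieve(image_url, path)
```

For each name and picture element this would be used as `save_image(url, image_path(name.text, url))` with `url = img.get_attribute("src")`.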

2.3 output

Data in MySQL:
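
Writing the rows could be sketched as follows. The table name `courses` and its column names are hypothetical, chosen to mirror the fields listed in the assignment, and `cursor` is assumed to be a DB-API cursor that accepts `%s` placeholders, as PyMySQL's does. Passing the values as parameters instead of formatting them into the SQL string avoids quoting problems in course names.

```python
# Hypothetical table and column names; adjust to the real schema.
INSERT_SQL = (
    "INSERT INTO courses (id, name, progress, status, image_url) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def save_courses(cursor, rows):
    """Insert (name, progress, status, image_url) tuples, numbering them from 1."""
    count = 0
    for number, row in enumerate(rows, start=1):
        cursor.execute(INSERT_SQL, (number, *row))
        count += 1
    return count
```

The caller is still responsible for committing the transaction on the connection afterwards.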

2.4 experience

Using Selenium to simulate login and to switch between windows (frames).

Selenium's wait mechanism.
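
An explicit wait polls for a condition until a timeout expires instead of sleeping for a fixed time. Selenium ships this as `WebDriverWait` together with `expected_conditions`; the sketch below is a hand-written stand-in that shows only the polling idea, with `wait_until` and its parameters being illustrative names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            result = None  # treat "element not there yet" errors as "keep waiting"
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

A lookup such as `wait_until(lambda: driver.find_elements("xpath", '//a[@class="pn-next"]'))` would return as soon as the list is non-empty rather than after a fixed delay.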

The fifth experiment source code

Posted by tcl4p on Wed, 24 Nov 2021 08:27:57 -0800