Assignment ①
1.1 assignment content
Requirements: become familiar with how Selenium locates HTML elements, crawls Ajax-loaded page data, and waits for HTML elements.
Use the Selenium framework to crawl the information and pictures of a chosen kind of product on JD Mall (Jingdong).
Candidate sites: http://www.jd.com/
Keyword: chosen freely by the student
Output information: the results are stored in a MySQL database
1.2 problem solving process
1.2.1 access to web pages
Open the JD home page, type the keyword into the search box, and click the search button:
Locating the search input box:
It is the input element with id="key".
Locating the search button:
It is the button inside the second div under the div of the element with id="search".
Code part:
driver.get(url)  # visit the home page
inputit = driver.find_element_by_xpath('//*[@id="key"]')
inputit.send_keys(keyWord)
findbutton = driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
findbutton.click()
1.2.2 acquisition of target data
The product data sits in a list under the element with class="gl-warp clearfix".
So get the product list first:
shoplist = driver.find_element_by_xpath('//*[@class="gl-warp clearfix"]')
Then find each piece of content inside the list. Take the price as an example:
The price is the text of the strong/i element under the div with class="p-price" inside the product list, so search relative to the list:
price = shoplist.find_elements_by_xpath('.//div[@class="p-price"]/strong/i')
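The lookup above returns a list of elements, not numbers. As a minimal sketch of the next step, the helper below turns the text of each element into a float and keeps None where the text is empty or not numeric (for example when the item has not finished loading). The function name and the sample values are made up for illustration, and SimpleNamespace objects stand in for Selenium elements because only their .text attribute is used.

```python
from types import SimpleNamespace


def prices_from_elements(elements):
    """Convert the .text of each element to a float, or None if that fails."""
    prices = []
    for element in elements:
        text = element.text.strip()
        try:
            prices.append(float(text))
        except ValueError:
            # Empty or non-numeric text: keep a placeholder so positions still line up
            prices.append(None)
    return prices


# Stand-ins for the elements returned by find_elements_by_xpath (hypothetical values)
fake_elements = [
    SimpleNamespace(text="59.90"),
    SimpleNamespace(text=""),
    SimpleNamespace(text=" 1299.00 "),
]
print(prices_from_elements(fake_elements))  # [59.9, None, 1299.0]
```

Keeping a None placeholder instead of skipping the item means the price list stays aligned with the lists of names and pictures taken from the same page.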
1.2.3 page turning and pull-down
To turn the page, find the next-page button:
it is the a tag with class="pn-next". Click it.
def loadNext():  # go to the next page
    time.sleep(1)
    try:
        NextPage = driver.find_element_by_xpath('//a[@class="pn-next"]')
        NextPage.click()
    except Exception:  # no clickable next-page link was found
        print("last page")
JD result pages load lazily: a page initially contains 30 products, and the remaining ones (60 in total) are loaded only after scrolling down. Without scrolling, some data (such as pictures) may not be loaded and come back empty.
def loadAll():
    for i in range(100):  # does not scroll all the way to the bottom, but almost everything is loaded by then
        js = 'window.scrollTo(0,%s)' % (i * 100)  # js script
        driver.execute_script(js)  # run the js script
        time.sleep(0.07)
Pause briefly at each position so that all data has time to load.
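The loop above scrolls to fixed offsets of at most 9,900 pixels, which may fall short on a taller page. One possible variation, sketched below under the assumption that driver is a Selenium WebDriver, keeps scrolling in small steps until the current offset reaches the page height reported by the browser. The function name and the default step, pause, and max_steps values are hypothetical.

```python
import time


def scroll_until_loaded(driver, step=300, pause=0.1, max_steps=200):
    """Scroll down step by step until the bottom of the page is reached.

    Returns the number of scroll steps performed.
    """
    position = 0
    for steps in range(1, max_steps + 1):
        position += step
        # arguments[0] is the first extra argument passed to execute_script
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)  # give lazily loaded content time to arrive
        # Re-read the height on every step because it grows as content loads
        height = driver.execute_script("return document.body.scrollHeight;")
        if position >= height:
            return steps
    return max_steps
```

The max_steps limit keeps the loop from running forever on a page that grows without end.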
1.3 output
Saved pictures:
Database:
1.4 experience
Basic use of Selenium: locating page content, typing input, clicking, and other events.
Using Selenium to execute JavaScript.
Assignment ②
2.1 assignment content
Requirements: become familiar with how Selenium locates HTML elements, simulates user login, crawls Ajax-loaded page data, and waits for HTML elements.
Use the Selenium framework + MySQL to crawl the course resource information of the China MOOC website (course number, course name, teaching progress, course status, course picture address). At the same time, store the pictures in the imgs folder under the root directory of the local project, naming each picture after its course.
Candidate website: China MOOC website, https://www.icourse163.org
Output information: stored in a MySQL database
2.2 problem solving process
2.2.1 implementation of simulated Login
First, find the window for entering the account and password.
Clicking "login | registration" in the upper right opens the login window. The XPath obtained directly with Copy XPath is
//*[@id="auto-id-1637751328911"]
This XPath locates the element by id, but the id changes every time the page is loaded, so the element cannot be found by id.
Find it through class="unlogin" instead.
The same problem applies when locating other content on this website.
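The effect can be reproduced without a browser. The sketch below parses two simplified, made-up snapshots of the same element with Python's built-in ElementTree. The markup and the second id value are assumptions for illustration, not the real page source. The id-based path matches only the first snapshot, while the class-based path matches both.

```python
import xml.etree.ElementTree as ET

# Two hypothetical page loads: same element, different auto-generated id
first_visit = '<body><div id="auto-id-1637751328911" class="unlogin"><a>login</a></div></body>'
second_visit = '<body><div id="auto-id-1637759999999" class="unlogin"><a>login</a></div></body>'

ID_PATH = ".//*[@id='auto-id-1637751328911']"  # copied from the first visit
CLASS_PATH = ".//*[@class='unlogin']"          # based on the stable class


def found(page_source, path):
    """Return True if the path matches an element in the page source."""
    return ET.fromstring(page_source).find(path) is not None


for page in (first_visit, second_visit):
    print(found(page, ID_PATH), found(page, CLASS_PATH))
# True True
# False True
```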
After the input boxes for the mobile phone number and password are displayed, the input elements still cannot be found.
This is because the login form is loaded in a new frame (an iframe).
You need to switch to this new frame:
frame = driver.find_element_by_xpath('//div[@class="ux-login-set-container"]/iframe')
driver.switch_to.frame(frame)
Do this before proceeding.
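If later lookups need the main document again, the switch has to be undone with driver.switch_to.default_content(). A minimal sketch of one way to pair the two calls is a context manager, shown below. The name inside_frame is hypothetical, and driver is assumed to be a Selenium WebDriver.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Run a block of code inside a frame, then return to the main document."""
    driver.switch_to.frame(frame)
    try:
        yield
    finally:
        # Runs even if a lookup inside the block raises an exception
        driver.switch_to.default_content()


# Usage sketch (needs a live driver, so it is left as a comment):
# with inside_frame(driver, frame):
#     driver.find_element_by_xpath('//*[@id="submitBtn"]').click()
```

Because the switch back happens in the finally clause, a failed lookup inside the frame does not leave the driver stuck in the wrong context.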
Then locate the account and password inputs, fill them in, and click login:
The account and password input boxes sit under div tags with class="u-input box", which can be used to locate them.
The login button has the fixed id="submitBtn", so it can be found directly by id.
inputUserName = driver.find_element_by_xpath('//div[@class="u-input box"][1]/input')
inputUserName.send_keys("")  # account (left blank here)
inputPasswd = driver.find_element_by_xpath('//div[@class="inputbox"]/div[2]/input[2]')
inputPasswd.send_keys("")  # password (left blank here)
LoginButton = driver.find_element_by_xpath('//*[@id="submitBtn"]')
LoginButton.click()
2.2.2 data acquisition
After login, the page changes to the logged-in view:
When the page is driven through webdriver, the following window appears:
This window in the lower right corner covers the "my course" button, so the button cannot be clicked.
The window has to be closed first:
agree = driver.find_element_by_xpath('//button[@class="btn ok"]')
agree.click()
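The window may not be present the instant the page changes, so looking it up immediately can fail. Selenium provides WebDriverWait for this. The plain-Python sketch below shows the same polling idea without depending on a browser: it calls a condition repeatedly until the result is truthy or a timeout expires. The name wait_until and the default timeout and poll values are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value, then return that value.

    Raises TimeoutError if no truthy value is seen within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Usage sketch (needs a live driver, so it is left as a comment).
# find_elements returns an empty list while the button is absent, which is falsy:
# buttons = wait_until(lambda: driver.find_elements_by_xpath('//button[@class="btn ok"]'))
# buttons[0].click()
```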
Then you can open the "my course" page:
All the required information sits in the div with class="course-panel-body-wrapper", so get that element first and then search inside it:
course = driver.find_element_by_xpath('.//div[@class="course-panel-body-wrapper"]')
names = course.find_elements_by_xpath('.//span[@class="text"]')  # course names
schools = course.find_elements_by_xpath('.//div[@class="school"]')  # schools
courset = course.find_elements_by_xpath('.//span[@class="course-progress-text-span"]')  # teaching progress
status = course.find_elements_by_xpath('.//div[@class="course-status"]')  # course status; not named time, which would shadow the time module
hrefs = course.find_elements_by_xpath('.//div[@class="img"]/img')  # course pictures
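The assignment asks for each picture to be stored in the imgs folder under the course name. A minimal sketch of that step follows, assuming the picture address has already been read from each img element (for example with get_attribute("src")). The helper names, the .jpg fallback, and the example URL are assumptions for illustration.

```python
import os
import re
import urllib.request
from urllib.parse import urlsplit


def image_path(course_name, url, folder="imgs"):
    """Build the local path <folder>/<course name><extension> for one picture."""
    # Characters such as / or : are not valid in file names on every system
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip()
    # Take the extension from the URL path; fall back to .jpg (an assumption)
    extension = os.path.splitext(urlsplit(url).path)[1] or ".jpg"
    return os.path.join(folder, safe_name + extension)


def save_image(course_name, url, folder="imgs"):
    """Download the picture at url into folder and return the local path."""
    os.makedirs(folder, exist_ok=True)
    path = image_path(course_name, url, folder)
    urllib.request.urlretrieve(url, path)
    return path


print(image_path("C/C++ programming", "https://example.com/pics/cover.png?size=small"))
```

Replacing characters such as / matters here because a course name is free text and would otherwise be read as a subdirectory.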
2.3 output
Data in MySQL
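The storage code itself is not shown in this write-up. As a rough sketch of what the insert step could look like, the example below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server. The table name, column names, and sample row are assumptions based on the fields listed in the assignment. With a MySQL driver such as PyMySQL the placeholders would be %s instead of ?.

```python
import sqlite3


def save_courses(conn, courses):
    """Create the table if needed and insert one row per course tuple."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS courses ("
        "id INTEGER PRIMARY KEY, name TEXT, progress TEXT, "
        "status TEXT, img_url TEXT)"
    )
    # Parameterized insert: values are passed separately from the SQL text
    conn.executemany(
        "INSERT INTO courses (id, name, progress, status, img_url) "
        "VALUES (?, ?, ?, ?, ?)",
        courses,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
sample = [(1, "Example course", "2/10", "in progress", "https://example.com/1.png")]
save_courses(conn, sample)
print(conn.execute("SELECT id, name FROM courses").fetchall())  # [(1, 'Example course')]
```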
2.4 experience
Simulating login with Selenium and switching between windows (frames).
Selenium's wait mechanism.