[Data Acquisition and Fusion] The Fifth Experiment

The fifth practice of data mining

Assignment 1

Jingdong information crawling experiment

Assignment content

  1. Requirements: be familiar with Selenium's search for HTML elements, crawling Ajax web page data, waiting for HTML elements, etc. Use the Selenium framework to crawl the information and pictures of certain commodities in Jingdong Mall.
  2. Candidate site: http://www.jd.com/
  3. Key words: students' free choice

Practice process

Reproducing the Selenium crawler for Jingdong Mall. The core crawling code:

    def processSpider(self):
        try:
            time.sleep(1)
            print(self.driver.current_url)
            lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
            for li in lis:
                # The image URL is found either in the src or in the data-lazy-img attribute
                self.count += 1  # One more commodity counted
                try:
                    src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
                except:
                    src1 = ""
                try:
                    src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
                except:
                    src2 = ""

                try:
                    price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
                except:
                    price = "0"

                try:
                    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
                    mark = note.split(" ")[0]
                    # Strip JD's "爱东东" promo label and commas from the title text
                    mark = mark.replace("爱东东\n", "")
                    mark = mark.replace(",", "")
                    note = note.replace("爱东东\n", "")
                    note = note.replace(",", "")
                except:
                    note = ""
                    mark = ""
                self.No = self.No + 1
                no = str(self.No)
                while len(no) < 6:  # Zero-pad the serial number to six digits
                    no = "0" + no
                print(no, mark, price)
                if src1:
                    src1 = urllib.request.urljoin(self.driver.current_url, src1)
                    p = src1.rfind(".")
                    mFile = no + src1[p:]
                elif src2:
                    src2 = urllib.request.urljoin(self.driver.current_url, src2)
                    p = src2.rfind(".")
                    mFile = no + src2[p:]
                if src1 or src2:
                    T = threading.Thread(target=self.download, args=(src1, src2, mFile))
                    T.setDaemon(False)
                    T.start()
                    self.threads.append(T)
                else:
                    mFile = ""
                self.insertDB(no, mark, price, note, mFile)  # Insert into the database
                if self.count >= 100:  # Stop once 100 items have been collected
                    print("Crawled 100 items, ending the crawl")
                    return
            # Turn to the next page until the last page is reached
            try:
                # On the last page the "next" button carries the disabled class
                self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next disabled']")
            except:
                nextPage = self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next']")
                time.sleep(1)
                nextPage.click()
                self.processSpider()
        except Exception as err:
            print(err)
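The download() method launched in the threads above is not shown in the excerpt. A minimal sketch of what it might look like, assuming the class also defines a headers dict (hypothetical here) alongside the imagePath attribute used in startUp() below:

    def download(self, src1, src2, mFile):
        # Try the src URL first, then fall back to the data-lazy-img URL
        data = None
        for src in (src1, src2):
            if not src:
                continue
            try:
                # MySpider.headers is assumed to be a dict of HTTP headers (hypothetical)
                req = urllib.request.Request(src, headers=MySpider.headers)
                data = urllib.request.urlopen(req, timeout=10).read()
                break
            except:
                data = None
        if data:
            self.imgNo += 1
            with open(os.path.join(MySpider.imagePath, mFile), "wb") as fobj:
                fobj.write(data)
            print("downloaded", mFile)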

Code for initializing the spider and saving to the database:

    def startUp(self, url, key):
        # Initializing Chrome browser
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        self.driver = webdriver.Chrome(chrome_options=chrome_options)

        # Initializing variables
        self.threads = []
        self.No = 0
        self.imgNo = 0
        self.count = 0  # Record commodity quantity
        # Initializing database
        try:
            self.con = sqlite3.connect("phones.db")
            self.cursor = self.con.cursor()
            try:
                # If there are tables, delete them
                self.cursor.execute("drop table phones")
            except:
                pass
            try:
                sql = "create  table  phones  (mNo  varchar(32) primary key, mMark varchar(256),mPrice varchar(32),mNote varchar(1024),mFile varchar(256))"
                self.cursor.execute(sql)
            except:
                pass
        except Exception as err:
            print(err)
        # Initializing images folder
        try:
            if not os.path.exists(MySpider.imagePath):
                os.mkdir(MySpider.imagePath)
            images = os.listdir(MySpider.imagePath)
            for img in images:
                s = os.path.join(MySpider.imagePath, img)
                os.remove(s)
        except Exception as err:
            print(err)
        self.driver.get(url)   # Get web page
        keyInput = self.driver.find_element_by_id("key")  # Find input box
        keyInput.send_keys(key)   # Enter keywords
        keyInput.send_keys(Keys.ENTER)

    def closeUp(self):
        try:

            self.con.commit()
            self.con.close()
            self.driver.close()

        except Exception as err:
            print(err)


    def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
        try:
            sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (?,?,?,?,?)"
            self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
        except Exception as err:
            print(err)
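These three methods are presumably tied together by a small driver; a minimal sketch of such an entry point (the executeSpider name and the search keyword are illustrative, not from the original):

    def executeSpider(self, url, key):
        print("Spider starting......")
        self.startUp(url, key)
        print("Spider processing......")
        self.processSpider()
        print("Spider closing......")
        self.closeUp()
        for t in self.threads:
            t.join()  # Wait for every download thread to finish

    # Hypothetical usage:
    # spider = MySpider()
    # spider.executeSpider("http://www.jd.com/", "mobile phone")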

Experimental results

As required, the crawler stops after collecting 100 items:

(Screenshot: console output)

Database save results:

(Screenshot: database contents)

Picture saving results:

(Screenshot: downloaded images)

Experimental experience

In this experiment, I consolidated my knowledge of crawling websites with Selenium, studied the related Selenium operations in depth, and learned to store the crawled data in two kinds of databases, MySQL and SQLite.

Assignment 2

MOOC crawling experiment

Assignment content

  1. Requirements: be familiar with Selenium's search for HTML elements, simulating user login, crawling Ajax web page data, waiting for HTML elements, etc. Use the Selenium framework + MySQL to crawl the course resource information of the China MOOC network (course number, course name, teaching progress, course status and course picture address), and store the pictures in the imgs folder under the root directory of the local project, with each picture named after its course.

  2. Candidate website: China MOOC: https://www.icourse163.org

Practice process

1. Simulated login: find the login button and click it, locate the account and password input boxes, enter the login information with send_keys(), and click the login button. The only place I got stuck was the slider verification that pops up. Searching on Baidu, I found a blog post describing the same problem: switching into the login iframe with browser.switch_to.frame() makes the slider verification not appear. I tried it and it works.

    def search(url):
        try:
            browser.get(url)
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div').click()
            # Click the login button
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath('/html/body/div[13]/div[2]/div/div/div/div/div[2]/span').click()
            # Click "other login methods"
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath('/html/body/div[13]/div[2]/div/div/div/div/div/div[1]/div/div[1]/div[1]/ul/li[2]').click()
            # Click "log in with mobile number"
            time.sleep(1)  # Pause for a second
            browser.switch_to.frame(browser.find_elements_by_tag_name("iframe")[1])
            # Switch into the login iframe; this is the step that avoids the slider verification
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath("//input[@id='phoneipt']").send_keys("1XXXXXXXXXX")  # your phone number
            browser.find_element_by_xpath('//input[@placeholder="请输入密码"]').send_keys("********")  # your password
            # Enter the login information (the placeholder attribute on the live site is the Chinese prompt)
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath('//*[@id="submitBtn"]').click()
            # Click login
            time.sleep(3)  # Pause for three seconds while logging in
            browser.find_element_by_xpath('//*[@id="privacy-ok"]').click()
            # Dismiss the privacy dialog, then click the "my courses" button
            time.sleep(1)  # Pause for a second
            browser.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[4]/div').click()
            time.sleep(1)
            print("Course information is as follows:")
            get_data()  # Get the first page of course data
        except Exception as err:
            print(err)

2. Crawl relevant information and save data

    def get_data():
        global total_num
        name_and_img = browser.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="img"]//img')
        school = browser.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="body"]//div[@class="school"]//a')
        schedule = browser.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="body"]//div[@class="text"]//a')
        course_status = browser.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="body"]//div[@class="course-status"]')
        for i in range(len(name_and_img)):
            name = name_and_img[i].get_attribute('alt')  # Course name
            image_url = name_and_img[i].get_attribute('src')  # Picture URL
            school_name = school[i].text
            C_schedule = schedule[i].text
            C_course_status = course_status[i].text
            print(name + "\t\t" + school_name + "\t\t" + C_schedule + "\t\t" + C_course_status + "\t\t" + image_url)
            image_name = "./img/page" + str(page_num) + "_" + str(i + 1) + ".jpg"  # Path where the picture is saved
            urllib.request.urlretrieve(image_url, filename=image_name)  # Save the picture
            mooc_DB.insertDB(total_num, name, school_name, C_schedule, C_course_status, image_url)  # Insert a row
            total_num += 1  # Total quantity + 1

3. Page turning. I use the .click() method: find the next-page button and click it directly to turn the page.

    def next_page():
        search(url)  # Log in and open "my courses" first
        global page_num
        while page_num < 2:
            browser.find_element_by_xpath('//*[@id="j-coursewrap"]/div/div[2]/ul/li[4]/a').click()
            # Click the next-page button
            time.sleep(1)
            page_num += 1
            get_data()  # Get the course data on this page
        mooc_DB.closeDB()  # Close the database

4. Finally, the database code, which is very similar to the previous assignment: connecting, closing and inserting.

    class mooc:
        # Open the database
        def openDB(self):
            print("open")
            try:
                self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                           passwd="********", db="scraw", charset="utf8")  # your MySQL password
                # Step 1: connect to the MySQL database
                self.cursor = self.con.cursor()  # Get a cursor
                # The "scraw" database itself must be created in advance
                self.cursor.execute("delete from mooc")  # Empty the data table
                self.opened = True
            except Exception as err:
                print(err)
                self.opened = False

        # Close the database
        def closeDB(self):
            try:
                if self.opened:
                    self.con.commit()  # Commit
                    self.con.close()  # Close the connection
                    self.opened = False
                print("closed")
                print("Crawl finished successfully!")
            except Exception as err:
                print(err)

        # Insert into the database
        def insertDB(self, id, name, school, schedule, course_status, image_url):
            try:
                self.cursor.execute(
                    "insert into mooc(id,name,school,schedule,course_status,image_url) values (%s,%s,%s,%s,%s,%s)",
                    (id, name, school, schedule, course_status, image_url))
            except Exception as err:
                print(err)
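The functions above refer to several module-level names that the excerpt does not define (browser, url, page_num, total_num, mooc_DB). A minimal sketch of that scaffolding, assuming a plain Chrome driver and the same names:

    import os
    import time
    import urllib.request

    import pymysql
    from selenium import webdriver

    url = "https://www.icourse163.org"
    page_num = 1   # current page of "my courses"
    total_num = 1  # running id for the database rows

    browser = webdriver.Chrome()  # a visible window makes the simulated login easier to debug
    os.makedirs("./img", exist_ok=True)  # urlretrieve needs the img folder to exist

    mooc_DB = mooc()
    mooc_DB.openDB()
    next_page()  # next_page() first calls search(url), then turns the pages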

Experimental results

Console results:

(Screenshot: console output)

For the database results, I insert into MySQL and then view them with the MySQL Workbench visual interface:

(Screenshot: MySQL Workbench table view)

Picture saving results:

(Screenshot: downloaded images)

Experimental experience

In this experiment, I further consolidated my knowledge of crawling websites with Selenium, studied related operations such as simulated login in depth, and practiced storing the crawled data in MySQL, in addition to the SQLite used in Assignment 1.

 

Assignment 3

Flume experiment

Assignment content

  1. Requirements: understand the Flume architecture and key features, and master using Flume to complete log collection tasks. Complete the Flume log collection experiment, which includes the following steps:

    Task 1: open MapReduce service

    Task 2: generate test data from Python script

    Task 3: configure Kafka

    Task 4: install Flume client

    Task 5: configure Flume to collect data

Practice process

Task 1: open MapReduce service

Follow the PDF released by the teacher step by step. Finally, open MRS and enter the Manager interface, as follows:

(Screenshot: MRS Manager interface)

Task 2: generate test data with a Python script

1. Open Xshell and establish a connection.

(Screenshot: Xshell connection)

2. Transfer the autodatapython.py file to the /opt/client/ folder.

(Screenshot: file transfer)

3. Run the script.
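I did not keep a copy of autodatapython.py in this write-up; a minimal sketch of a generator in the same spirit (all names and the output path are illustrative, not the lab's actual script):

    import random
    import time

    # Append one random, timestamped record per second to a log file
    def write_data(path, count):
        with open(path, "a") as f:
            for i in range(count):
                line = "user%d,%d,%s\n" % (
                    random.randint(1, 100),    # random user id
                    random.randint(0, 1000),   # random value
                    time.strftime("%Y-%m-%d %H:%M:%S"))
                f.write(line)
                f.flush()  # so tail -F style readers see each line immediately
                time.sleep(1)

    if __name__ == "__main__":
        write_data("/tmp/test.log", 60)  # hypothetical output path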

(Screenshot: script execution output)

Task 3: configure Kafka

  • First, set the environment variables and execute the source command to make them take effect
  • Create a topic in Kafka
  • View the topic information (example commands below)
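The exact commands used are in the screenshots; for reference, the standard Kafka CLI calls for these steps look roughly like this (the ZooKeeper address, znode path and topic name are placeholders; MRS typically keeps Kafka metadata under a /kafka znode):

    # Make the client environment variables take effect (standard MRS client path)
    source /opt/client/bigdata_env

    # Create a topic (addresses and names are placeholders)
    kafka-topics.sh --create --zookeeper <zk_host>:2181/kafka \
        --partitions 1 --replication-factor 1 --topic testtopic

    # View the topic information
    kafka-topics.sh --describe --zookeeper <zk_host>:2181/kafka --topic testtopic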

(Screenshot: topic creation and description output)

Task 4: install the Flume client

Enter the MRS Manager cluster management interface, open service management, click Flume to enter the Flume service, and click "Download Client".

(Screenshot: client download)

Unzip the Flume client file, verify the package, and unzip the "MRS_Flume_ClientConfig.tar" file.

Install the Flume environment variables.

Unzip the Flume client.

Install the Flume client.

Restart the Flume service.

(Screenshot: Flume service restarted)

Task 5: configure Flume to collect data

Modify the configuration file and import it directly from outside; an illustrative sketch of such a configuration follows.
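The actual properties file came from the lab materials; purely as an illustration, a Flume agent that tails the generated log file and ships each line to the Kafka topic might be configured roughly like this (the agent name, paths, topic and broker address are all placeholders, and the agent name must match what the Flume client expects):

    # Illustrative Flume properties sketch: exec source -> memory channel -> Kafka sink
    a1.sources = s1
    a1.channels = c1
    a1.sinks = k1

    # Tail the file that the Python generator writes (hypothetical path)
    a1.sources.s1.type = exec
    a1.sources.s1.command = tail -F /tmp/test.log
    a1.sources.s1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 100

    # Standard Flume Kafka sink; broker address and topic are placeholders
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.kafka.topic = testtopic
    a1.sinks.k1.kafka.bootstrap.servers = <broker_host>:9092
    a1.sinks.k1.channel = c1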


Log in to the master node, source the configuration file, and execute the relevant commands:

(Screenshot: Flume agent commands and output)

Open another window and execute the Python test script from section 2.2.1; the original window then captures the data.
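The capture window is presumably running a Kafka console consumer; the standard form of that command is (broker address and topic are placeholders):

    kafka-console-consumer.sh --topic testtopic \
        --bootstrap-server <broker_host>:9092 --from-beginning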

(Screenshot: captured data)

Experimental experience

This experiment mainly covered acquiring real-time stream data through Flume. I gained a preliminary understanding of Flume log collection and benefited a lot.

Finally, a link to the code of my previous assignments: Data acquisition and fusion: data acquisition and fusion practice assignment - Gitee.com

 
