Python crawler introductory tutorial [4]: Crawling moko.cc pictures without logging in

Keywords: Programming Python network Windows

Moko.cc pictures without login--Brief introduction

It has been a while since the last article. In this one I will continue and finish the moko.cc crawler. The crawler in this series may not add much of real value to your day-to-day work, because it is only an introductory tutorial; experienced readers can skim past it, while beginners can follow along with me.

Moko.cc pictures without login--Crawler analysis

First of all, from the previous article we already have a great many users' personal homepages, and each one's list page can be reached by splicing a link like the one below.

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html

On this page we need to find a few key spots: clicking one of the photography categories takes you into that category's picture list page. Next, let's start coding.

Get all list pages

From my last blog post I have roughly 70,000 user records (50,000+ of them unique), which we now read into Python.

Here I use the excellent pandas library. If you are not familiar with it, just copy my code for now; the comments are complete.

import pandas as pd

# Template for a user's picture list page
user_list_url = "http://www.moko.cc/post/{}/list.html"
# Holds the list-page URLs of all users
user_profiles = []

def read_data():
    # Read the CSV with pandas
    df = pd.read_csv("./moko70000.csv")   # the file can be downloaded at the end of this article
    # Drop rows with duplicate nicknames (the CSV column is spelled "nikename")
    df = df.drop_duplicates(["nikename"])
    # Sort by follower count in descending order and keep the profile column
    profiles = df.sort_values("follows", ascending=False)["profile"]

    for i in profiles:
        # Splice the profile id into the URL template
        user_profiles.append(user_list_url.format(i))

if __name__ == '__main__':
    read_data()
    print(user_profiles)

The data is ready. Next we need to fetch each picture list page, work out the pattern in its HTML, and locate the key information. Once we have found the right spot, extracting it is just a matter of regular expressions.

Quickly write a regular expression:

<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>
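
To sanity-check the pattern before pointing it at the live page, here is a tiny self-contained test against a hand-written HTML fragment (the fragment and its /post/xxx/... link are made up for illustration):

import re

# A made-up fragment mimicking one entry of the list page
sample_html = '<p class="title"><a hidefocus="ture" href="/post/xxx/category/304475/1.html" class="mwC u">Portraits(85)</a></p>'

pattern = re.compile(r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
print(pattern.findall(sample_html))   # -> [('/post/xxx/category/304475/1.html', '85')]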

Now import the re and requests modules and apply the pattern to the real page.

import requests
import re

# Shared request headers; the User-Agent keeps the site from rejecting the request outright
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

# Get the picture list page
def get_img_list_page():
    # A fixed address for easy testing
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url, headers=headers, timeout=3)
    page_text = response.text
    pattern = re.compile(r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Get page_list
    page_list = pattern.findall(page_text)
    print(page_list)

The result of a run:

[('/post/da39db43246047c79dcaef44c201492d/category/304475/1.html', '85'), ('/post/da39db43246047c79dcaef44c201492d/category/304476/1.html', '2'), ('/post/da39db43246047c79dcaef44c201492d/category/304473/1.html', '0')]

Let's continue improving the code. In the data obtained above, some categories contain 0 pictures and need to be filtered out.

# Get the picture list page
def get_img_list_page():
    # A fixed address for easy testing
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url, headers=headers, timeout=3)
    page_text = response.text
    pattern = re.compile(r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Get page_list
    page_list = pattern.findall(page_text)
    # Filter out categories that contain 0 pictures
    # (build a new list instead of removing items while iterating, which would skip elements)
    page_list = [page for page in page_list if page[1] != '0']
    print(page_list)

That gives us the entry link of each category; now we need to generate all of its list pages. Click the link below to see how the pagination works.

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html

This category spans four pages, and each page shows 4*7=28 items, so the page count comes from math.ceil(85/28). Next the links are generated; the link above needs to be converted into

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/2.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/3.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/4.html

    page_count = math.ceil(int(totle)/28) + 1    # +1 because range() excludes the end value
    for i in range(1, page_count):
        # Swap the trailing page number via a regex substitution
        pages = re.sub(r'\d+?\.html', str(i) + ".html", start_page)
        all_pages.append(base_url.format(pages))
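
As a standalone check of the formula and the substitution, the sketch below (with base_url assumed to be the "http://www.moko.cc{}" template, which matches the full links listed above) prints the four page links for the 85-picture category:

import math
import re

base_url = "http://www.moko.cc{}"    # assumed template for turning relative paths into full URLs
start_page = "/post/da39db43246047c79dcaef44c201492d/category/304475/1.html"
totle = 85                           # picture count reported for this category

page_count = math.ceil(int(totle) / 28) + 1    # 28 items per page; +1 because range() excludes the end
for i in range(1, page_count):
    # Swap the trailing "<n>.html" for the current page number
    page = re.sub(r'\d+?\.html', str(i) + ".html", start_page)
    print(base_url.format(page))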

Once enough links have been collected, beginners can stop at this step for now and store the links in a CSV file for later development.

# Generate all list pages of one category
def get_all_list_page(start_page, totle):
    # start_page is the category's first-page link, totle its picture count;
    # base_url (presumably the "http://www.moko.cc{}" template) and all_pages are module-level

    page_count = math.ceil(int(totle)/28) + 1
    for i in range(1, page_count):
        pages = re.sub(r'\d+?\.html', str(i) + ".html", start_page)
        all_pages.append(base_url.format(pages))

    print("Fetched {} records".format(len(all_pages)))
    if len(all_pages) > 1000:
        # Flush to ./pages.csv in append mode every 1000+ links, then clear the buffer
        pd.DataFrame(all_pages).to_csv("./pages.csv", mode="a+")
        all_pages.clear()

Let the crawler fly for a while. I ended up with 80,000+ records here.

Okay, we now have the list data. Next, let's keep working on it. Does it feel a bit slow, and does the code look a bit crude? I admit it; I kept it simple on purpose for beginners. In a later article I will rewrite it as object-oriented, multi-threaded code.

Next, we will analyze it again based on the crawled data.

For example, on http://www.moko.cc/post/nimusi/category/31793/1.html we need to grab the address marked with the red box in the original screenshot. Why that one? Because clicking that image takes you into the complete picture list.

We still fetch it with the crawler, in a few steps:

  1. Loop over the list data we just saved
  2. Fetch the page source
  3. Match all the links with a regular expression

def read_list_data():
    # Read the saved list pages back from the CSV
    img_list = pd.read_csv("./pages.csv", names=["no", "url"])["url"]

    # Loop over every list page
    for img_list_page in img_list:
        try:
            response = requests.get(img_list_page, headers=headers, timeout=3)
        except Exception as e:
            print(e)
            continue
        # Regex that captures each album's title and its detail-page link
        pattern = re.compile('<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # Albums to be crawled (title, relative link)
        for img in img_box:
            need_links.append(img)

            # Create directories
            file_path = "./downs/{}".format(str(img[0]).replace('/', ''))

            if not os.path.exists(file_path):
                os.mkdir(file_path)  # Create directories

        for need in need_links:
            # Fetch the detail page and collect its picture links
            get_my_imgs(base_url.format(need[1]), need[0])

A few key points in the code above:

        pattern = re.compile('<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # Albums to be crawled (title, relative link)
        for img in img_box:
            need_links.append(img)

This grabs the albums to crawl. I matched two groups, the title and the link; the title is mainly used for creating the folders. Creating folders requires the os module, so remember to import it.

            # Create directories
            file_path = "./downs/{}".format(str(img[0]).replace('/', ''))

            if not os.path.exists(file_path):
                os.mkdir(file_path)  # Create directories
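
As a side note, and only as a suggested tweak rather than part of the original code: os.makedirs with exist_ok=True does the same job, but also creates the ./downs parent folder if it is missing and never raises when the folder already exists.

import os

# Suggested variant of the exists()/mkdir() pair above;
# "example album" stands in for str(img[0]).replace('/', '')
title = "example album"
os.makedirs("./downs/{}".format(title), exist_ok=True)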

After getting each detail-page link, we fetch it once and grab all the picture links on it.

# Get the detail-page data
def get_my_imgs(img, title):
    print(img)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img, headers=headers, timeout=3)
    # The real picture URL sits in the src2 attribute
    pattern = re.compile('<img src2="(.*?)".*?>')
    all_imgs = pattern.findall(response.text)
    for download_img in all_imgs:
        downs_imgs(download_img, title)

Finally, write a method that downloads the pictures, and the code is complete. The pictures are saved locally, with a timestamp as the file name.


# Download a single picture; the file name is the current timestamp
def downs_imgs(img, title):

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img, headers=headers, timeout=3)
    content = response.content
    file_name = str(int(time.time())) + ".jpg"    # requires the time module to be imported
    file = "./downs/{}/{}".format(str(title).replace('/', '').strip(), file_name)
    with open(file, "wb+") as f:
        f.write(content)

    print("Complete")

Run the code and wait for the pictures to come in.

The code runs, and an error appears.

The cause is a path problem: the album title contains a special character that is not allowed in a path. We need to handle it the same way as the '/' replacement above; I will leave the details to you.
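
A minimal sketch of one way to handle it, assuming all we need is to strip the characters Windows forbids in folder names (clean_title is a helper I am adding for illustration; it is not part of the original code):

import re

def clean_title(title):
    # Drop characters that are not allowed in Windows file or folder names
    return re.sub(r'[\\/:*?"<>|\r\n]', '', str(title)).strip()

# It could then replace the str(...).replace('/', '') calls when building file_path and file
print(clean_title('my/album: "test"?'))   # -> myalbum test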

And this is what the captured data looks like.

What still needs improvement in the code

  1. The code is split into two procedural parts; it is quite messy and needs refactoring.
  2. The network-request code is heavily duplicated and needs to be abstracted, with proper error handling; right now it can easily throw errors (see the sketch after this list).
  3. The code is single-threaded and therefore slow; it can be improved by borrowing from the previous two articles.
  4. Without a simulated login, only 6 pictures can be crawled, which is why we save the link data first, so it can be reused directly once the code is upgraded later.
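
For point 2, here is a rough sketch of what that abstraction might look like; fetch is a hypothetical helper, and the retry count is an arbitrary choice rather than something from the original article:

import time

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

def fetch(url, retries=3, timeout=3):
    # Wrap requests.get with shared headers, a timeout and a few retries
    for attempt in range(retries):
        try:
            return requests.get(url, headers=HEADERS, timeout=timeout)
        except requests.RequestException as e:
            print("Request failed ({}), attempt {}/{}".format(e, attempt + 1, retries))
            time.sleep(1)
    return None

Every place that currently calls requests.get directly could then go through fetch() and simply skip None results.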
