Data Acquisition: Third Major Assignment

Job 1

Content

  • Requirements: Specify a website and crawl all the pictures on it, for example China Weather Network (http://www.weather.com.cn). Crawl using single-threaded and multithreaded methods, respectively. (Limit the number of crawled pictures to the last 3 digits of your student ID.)
  • Output information: Print the URL of each downloaded picture in the console, store the downloaded pictures in the images subfolder, and provide screenshots.

Code & Result

Analyzing the HTML of China Weather Network, we find that the picture links are in the 'src' attribute of img tags and the other page links are in the 'href' attribute of a tags.

  • Single-threaded:
    Use requests+re to match the picture links and the other page links on a page, then traverse the other links and crawl the pictures in them.
    Note that some of the picture links obtained are not standard links of the form "https://..."; such links are skipped.
def spider(url, headers):
    global count
    global urls
    if count > 132:  # Return once 132 pictures have been downloaded
        return
    if url not in urls:  # If the link has not been crawled yet, add it to the urls list and crawl it
        urls.append(url)
        try:
            data = getHTMLText(url, headers)  # Fetch the page
            links = re.findall(r'a href="(.*?)"', data, re.S)  # Regex match the other page links on the page
            photos = re.findall(r'img src="(.*?)"', data, re.S)  # Regex match the picture links
            for photo in photos:
                if photo.startswith("http"):  # Check whether the picture link is a standard http(s) link
                    download(photo, count)  # Pass the picture link and the current count to the download function
                    count += 1  # One more picture crawled
                if count > 132:  # Stop crawling once 132 pictures have been downloaded
                    break
            if count <= 132:  # If fewer than 132 pictures have been downloaded, visit the new links and crawl their pictures
                for link in links:
                    spider(link, headers)
        except Exception as err:
            print(err)

Write a download function that downloads the picture at the given link and names it in crawl order.

def download(url, count):  # Download a picture
    if count > 132:  # Return once 132 pictures have been downloaded
        return
    try:
        data = getPhoto(url, headers)  # Fetch the picture data (headers is the global request header)
        with open('./images/%d%s' % (count, url[-4:]), 'wb') as f:  # Name the picture by count plus its original extension
            f.write(data)  # Write to a local file
            print("{}   {}".format(count, url))   # Print the picture name and link in the console
    except Exception as err:
        print(err)
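
The helper functions getHTMLText and getPhoto are not shown above; a minimal sketch of what they might look like, assuming requests (the actual implementations are in the repository linked below):

import requests

def getHTMLText(url, headers):  # Fetch a page and return its text
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding  # Guess the encoding so Chinese pages decode correctly
    return r.text

def getPhoto(url, headers):  # Fetch a picture and return its raw bytes
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    return r.content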

As you can see, the pictures are downloaded in the order they were crawled.

  • Multithreaded:
    requests+re is still used, but thread creation is added so that pictures are crawled and downloaded concurrently; the rest of the code is the same as the single-threaded version.
try:
    data = getHTMLText(url, headers)  # Fetch the page
    links = re.findall(r'a href="(.*?)"', data, re.S)  # Regex match the other page links on the page
    photos = re.findall(r'img src="(.*?)"', data, re.S)  # Regex match the picture links
    for i in range(len(photos)):
        if photos[i].startswith("http"):  # Check whether the picture link is a standard http(s) link
            # Create a thread T that executes the download function, with the picture link and the count value (used to name the picture) as arguments
            T = threading.Thread(target=download, args=(photos[i], count))
            T.setDaemon(False)  # Non-daemon thread
            T.start()  # Start the thread
            threads.append(T)  # Add thread T to the threads list
            count += 1  # One more picture crawled
            if count > 132:  # Exit the loop once 132 pictures have been downloaded
                break
    if count <= 132:  # If fewer than 132 pictures have been downloaded, crawl the other links on the page
        for link in links:
            spider(link, headers)  # Crawl the link
except Exception as err:
    print(err)
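
Because the worker threads are collected in the threads list, the main program can wait for every download thread to finish before exiting; a minimal sketch of that final step, assuming the threads list above:

for T in threads:
    T.join()  # Block until this download thread has finished
print("All download threads finished")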

Observing the download information printed in the console, the downloads at first completed strictly in crawl order. Troubleshooting revealed that the thread had not been created with args as the parameter (statement below), so download was called immediately in the main thread and the threads never ran concurrently.
T = threading.Thread(target=download(photos[i], count))
After adjusting the code, the pictures are downloaded concurrently by multiple threads, and the order in which downloads complete differs from the crawl order.
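
The difference is easy to reproduce: writing target=download(...) calls the function immediately in the current thread and hands Thread its return value (None), while target=download, args=(...) defers the call to the new thread. A small illustrative comparison (download_demo is a stand-in for the real download function):

import threading

def download_demo(url, count):  # Stand-in for the real download function
    print(count, url)

# Wrong: download_demo(...) runs right here in the main thread,
# and the Thread is created with target=None, so nothing runs concurrently.
T = threading.Thread(target=download_demo("http://example.com/a.jpg", 1))
T.start()

# Right: the function object and its arguments are handed to the thread,
# so the call happens inside T, concurrently with the main thread.
T = threading.Thread(target=download_demo, args=("http://example.com/a.jpg", 1))
T.start()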

Experience

  • Through this exercise, I consolidated traversing a website and searching for links, and also experienced multithreaded concurrent downloading of pictures, which greatly speeds up downloads.

Full code link

https://gitee.com/cxqi/crawl_project/tree/master/experiment 3/job 1

Job 2

Content

  • Requirements: Reproduce Job 1 using the scrapy framework.
  • Output information: Same as Job 1.

Code & Result

The analysis of the page HTML is the same as in Job 1, so the code can be written directly.

  • items.py
    Write a data item class PhotoItem to save picture information
class PhotoItem(scrapy.Item):
    data = scrapy.Field()  # Binary picture data
    name = scrapy.Field()  # Picture name
    link = scrapy.Field()  # Picture link
  • pipelines.py
    Write a data processing class PhotoPipeline to write the picture data to local files.
class PhotoPipeline(object):
    count = 0
    imdir = "E:/example/exe3/images/"  # Folder where pictures are saved
    print("Download:")
    def process_item(self, item, spider):
        try:
            impath = self.imdir + item["name"]  # Picture save path; the picture is named by the item's name field
            with open(impath, 'wb') as f:
                f.write(item["data"])  # Write the picture to a local file
                print("{}   {}".format(item["name"], item["link"]))  # Print the picture name and its link in the console
        except Exception as err:
            print(err)
        return item
  • settings.py
    Configure the crawler not to obey the robots protocol, to avoid being blocked by the site's anti-crawling measures:
    ROBOTSTXT_OBEY = False  # Do not obey the robots protocol
    Then enable the pipeline so that items are pushed to PhotoPipeline:
ITEM_PIPELINES = {  # Enable the pipeline
   'exe3.pipelines.PhotoPipeline': 300,
}
  • myspider.py
    The crawl entry function:
def start_requests(self):  # Crawl entry function: request the start page and handle the response in the parse callback
  yield scrapy.Request(url=self.url, callback=self.parse, headers=self.headers)
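
The excerpts below use attributes such as self.url, self.headers, self.count and self.urls; a minimal sketch of how the spider class might be declared (the class name, headers and initial values here are assumptions, not the repository's exact code):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    url = "http://www.weather.com.cn"        # Start page, from the assignment
    headers = {"User-Agent": "Mozilla/5.0"}  # Request headers so the site treats us as a browser
    count = 1   # Counter used to name the downloaded pictures
    urls = []   # Pages that have already been visited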

In parse, the picture links and the other page links are extracted, and two kinds of requests are generated: requests for picture links, whose callback is imParse, and requests for other pages, whose callback is parse, which continues crawling pictures and further page links.

def parse(self, response):
  try:
      data = response.body.decode()
      selector = scrapy.Selector(text=data)
      photos = selector.xpath("//img/@src").extract()  # Extract picture links
      links = selector.xpath("//a/@href").extract()  # Extract the other links on the page, for traversal
      for photo in photos:  # Processing pictures
          if photo.startswith("http") and self.count <= 132:
              # If the picture link is standard and fewer than 132 pictures have been downloaded, request the picture and handle the response in the imParse callback
              yield scrapy.Request(url=photo, callback=self.imParse, headers=self.headers)
      if self.count <= 132:  # If the number of downloaded pictures is less than 132, then crawl the links in the web page
          for link in links:
              if link not in self.urls:  # Determine if the link has been crawled
                  # If it is not crawled, add it to the visited list urls and then crawl the page
                  self.urls.append(link)
                  yield scrapy.Request(url=link, callback=self.parse, headers=self.headers)
  except Exception as err:
      print(err)

Write another callback to handle the response from a picture link, build a PhotoItem data item, and push it to PhotoPipeline.

def imParse(self, response):
  if self.count > 132:  # Return once 132 pictures have been downloaded
      return
  try:
      item = PhotoItem()  # Create a PhotoItem object
      item["data"] = response.body  # The picture data is the response body of the picture link
      item["name"] = str(self.count) + response.url[-4:]  # Name the picture by the count plus its original extension
      item["link"] = response.url  # Picture link
      yield item
      self.count += 1
  except Exception as err:
      print(err)
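
To start the spider from a script instead of typing the command each time, a small run.py such as the following can be used (the spider name "myspider" is an assumption):

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider -s LOG_ENABLED=False".split())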

Run the program and the console output is as follows:

(screenshot of the console output: downloaded picture names and links)

Experience

Through this exercise, I consolidated crawling web page information with the scrapy framework, and learned that handling information for different needs can be achieved by writing multiple callback functions.

Full code link

https://gitee.com/cxqi/crawl_project/tree/master/experiment 3/job 2

Job 3

Content

  • Requirements: Crawl Douban movie data using scrapy and XPath, store the content in a database, and store the pictures in the imgs path.
  • Candidate sites: https://movie.douban.com/top250
  • Output information:

Code & Result

  • items.py
    From the required output we determine the information to extract for each movie:
class MovieItem(scrapy.Item):
    rank = scrapy.Field()  # Movie ranking
    name = scrapy.Field()  # Movie name
    director = scrapy.Field()  # Director
    actor = scrapy.Field()  # Lead actors
    summary = scrapy.Field()  # Summary
    score = scrapy.Field()  # Movie score
    photo = scrapy.Field()  # Local path of the movie cover picture
    link = scrapy.Field()  # Movie cover picture link
  • myspider.py
    Find the location of the required information in the page HTML and extract it.
    Note that within the resulting list of movies, the XPath for each movie must be written in the './/' form; the '.' cannot be omitted, otherwise extract_first() only ever returns the first movie on each page.
movies = selector.xpath("//li/div[@class='item']")  # All movies on the page
for movie in movies:
    rank = movie.xpath(".//div[@class='pic']/em/text()").extract_first()  # Movie ranking
    name = movie.xpath(".//div[@class='pic']//img/@alt").extract_first()  # Movie name
    p = movie.xpath(".//div[@class='bd']/p/text()").extract_first()  # Text node with director and actor information
    summary = movie.xpath(".//p[@class='quote']/span/text()").extract_first()  # Movie summary
    score = movie.xpath(".//span[@class='rating_num']/text()").extract_first()  # Movie score
    link = movie.xpath(".//div[@class='pic']//img/@src").extract_first()  # Movie cover picture link

Some information is not directly available: the director and lead actors appear together in a single text node, so they are extracted by applying regular expressions to the extracted string.

director = re.findall(r'导演: (.*?)\xa0', p)  # re matches the director of the movie ("导演" = director)
actor = re.findall(r'主演: (.*)', p)  # re matches the lead actors of the movie ("主演" = starring)
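
For clarity, this is roughly what the extracted text node and the two matches look like; the concrete movie text below is only illustrative (it corresponds to the first entry of the Top 250):

p = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
re.findall(r'导演: (.*?)\xa0', p)  # -> ['弗兰克·德拉邦特 Frank Darabont']
re.findall(r'主演: (.*)', p)       # -> ['蒂姆·罗宾斯 Tim Robbins /...']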

Generate a MovieItem data item; be careful to handle information that cannot be crawled, for example by setting it to "".

item = MovieItem()  # Create a MovieItem object
# Processing the data and converting None to "" if the extracted information does not exist
item["rank"] = rank if rank else ""
item["name"] = name if name else ""
item["director"] = director[0] if director else ""
item["actor"] = actor[0] if actor else ""
item["summary"] = summary if summary else ""
item["score"] = score if score else ""
# Set item's photo to the local file directory where the movie cover picture is saved
item["photo"] = "./imgs/" + name + ".jpg" if name else ""
item["link"] = link if link else ""
yield item  # Push item to data pipeline under pipelines.py

For multi-page crawling, we also need to work out the pattern of the page-turning URLs, which turns out to be simple: the start parameter increases by 25 per page.


self.page += 25
if self.page <= 225:  # Page turning: together with the start page, 10 pages are crawled
  url = self.start_url + "?start=" + str(self.page) + "&filter="
  yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)
  • pipelines.py
    MoviePipeline creates database tables and inserts information.
def process_item(self, item, spider):
    try:
        if self.opened:
            # Store the crawled movie information in the database
            self.cursor.execute("insert into movies (Sequence Number, Movie Name, director, performer, brief introduction, Movie Scoring, Movie Cover)"
                                "values (?,?,?,?,?,?,?)", (item["rank"], item["name"], item["director"],
                                                          item["actor"], item["summary"], item["score"], item["photo"]))
            self.count += 1
    except Exception as err:
        print(err)
    return item
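
The creation of the database table and the connection handling are not shown above; a minimal sketch of the open_spider/close_spider hooks, assuming sqlite3 and the English column names used in the insert statement above (the repository may use different names):

import sqlite3  # needed at the top of pipelines.py

def open_spider(self, spider):
    try:
        self.con = sqlite3.connect("movies.db")  # The database file name here is an assumption
        self.cursor = self.con.cursor()
        self.cursor.execute("create table if not exists movies (rank varchar(8), name varchar(64),"
                            " director varchar(64), actor varchar(256), summary varchar(256),"
                            " score varchar(8), photo varchar(256))")
        self.opened = True
        self.count = 0
    except Exception as err:
        print(err)
        self.opened = False

def close_spider(self, spider):
    if self.opened:
        self.con.commit()  # Persist all inserted rows
        self.con.close()
        self.opened = False
    print("Total movies stored:", self.count)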

PhotoPipeline inherits ImagesPipeline to download pictures.

class PhotoPipeline(ImagesPipeline):  # Picture pipeline based on scrapy's ImagesPipeline

    def get_media_requests(self, item, info):  # Override get_media_requests, which generates a Request from the picture's url
        yield scrapy.Request(url=item['link'], meta={'item': item})  # Pass the item via meta so the movie name is available when renaming

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']  # Retrieve the item passed via meta above

        # Rename the picture: request.url.split('/')[-1].split('.')[-1] gets the picture extension (jpg, png, ...)
        image_name = item['name'] + '.' + request.url.split('/')[-1].split('.')[-1]
        # Picture download path
        filename = u'imgs/{0}'.format(image_name)
        return filename
    def item_completed(self, results, item, info):
        # Collect the paths of successfully downloaded pictures
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:  # Drop the item if no picture was downloaded
            raise DropItem("Item contains no images")
        return item
  • settings.py
IMAGES_STORE = r'.'  # Root directory for saved pictures (file_path above puts them under ./imgs/)
ROBOTSTXT_OBEY = False  # Do not obey the robots protocol, to avoid being blocked by anti-crawling measures
ITEM_PIPELINES = {  # Enable the pipelines; the integer assigned to each one determines the order in which they run: the lower the number, the higher the priority
   'exe3_2.pipelines.MoviePipeline': 100,
   'exe3_2.pipelines.PhotoPipeline': 200,
}
  • Result

Experience

Through this exercise, I learned another way to download pictures in the scrapy framework besides the Job 2 approach. I also mastered passing parameters via meta: because the pictures need to be renamed, the item has to be passed through scrapy.Request so that the movie name corresponding to each picture is available when renaming it.

Full code link

https://gitee.com/cxqi/crawl_project/tree/master/experiment 3/job 3

Posted by twopeak on Sat, 30 Oct 2021 10:01:18 -0700