Colly, the Go crawler: a guide from getting started to not giving up

Keywords: Go, GitHub, JSON, encoding, Redis

Recently, I found that there were fewer and fewer questions on Zhihu that interested me, so I decided to aggregate technical questions and answers from other platforms, such as segmentfault, stackoverflow, and so on.

This work requires crawlers, so I took some time to study colly, a crawler framework written in Go.

Overview

Colly is a well-known crawler framework implemented in Go, and Go's strengths in high-concurrency and distributed scenarios are exactly what crawler technology needs. Its main features are that it is lightweight, fast, elegantly designed, easy to extend, and supports distributed crawling.

How to Learn

The most famous crawler framework is probably Python's scrapy. It is the first crawler framework many people encounter, and I am no exception. It has extensive documentation and a rich set of components, and when we design a crawler framework, we often refer to its design. I have also previously seen articles describing scrapy-like implementations in Go.

By contrast, colly's learning materials are scarce. When I first encountered it, I couldn't help drawing on my scrapy experience, but I quickly found that this experience could not simply be transplanted wholesale.

At this point, I naturally wanted to find some articles to read, but it turns out there are really very few colly-related articles; basically everything you can find is officially provided, and it doesn't look all that complete. No way around it, chew through it slowly! There are generally three kinds of official learning materials: the documentation, the examples, and the source code.

Today, let's start with the official documentation.

Official Documents

The official documentation focuses on how to use the framework. If you already have crawling experience, skimming through the docs once is quick. I spent some time reorganizing the official documentation in my own way.

There is not much content, covering installation, quick start, configuration, debugging, distributed crawling, storage, multiple collectors, configuration optimization, and extensions.

Each of these documents is so short that a few of them don't even require scrolling.

How to Install

Installing colly is as simple as installing any other Go library. As follows:

go get -u github.com/gocolly/colly

A single command. So easy!

Quick Start

Let's take a quick look at colly with a hello world example. The steps are as follows:

The first step is to import colly.

import "github.com/gocolly/colly"

The second step is to create the collector.

c := colly.NewCollector()

The third step is event listening: event handling is done through callbacks.

// Find and visit all links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    // Print link
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    // Visit link found on page
    // Only those links are visited which are in AllowedDomains
    c.Visit(e.Request.AbsoluteURL(link))
})

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
})

By the way, here is the list of callback events that colly supports (a small registration example follows the list):

  • OnRequest, called before a request is executed
  • OnResponse, called after a response is received
  • OnHTML, called when an HTML element matching the selector is found
  • OnXML, called when an XML element matching the selector is found
  • OnHTMLDetach, cancels an OnHTML listener; the parameter is the selector string
  • OnXMLDetach, cancels an OnXML listener; the parameter is the selector string
  • OnScraped, called after scraping finishes, once all other work is done
  • OnError, called when an error occurs
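For completeness, here is roughly how two more of these callbacks are registered; the signatures below match the colly API as I understand it:

c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished", r.Request.URL)
})

c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request failed:", err)
})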

Finally, c.Visit() actually starts the page visit.

c.Visit("http://go-colly.org/")

The complete code for this case is provided in the basic directory under _examples in the colly source.
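For convenience, here is a minimal self-contained version assembled from the snippets above; it should behave essentially like the step-by-step snippets.

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Step 2: create the collector
    c := colly.NewCollector()

    // Step 3: register the callbacks
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print link
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
        // Visit link found on page
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    // Step 4: start crawling
    c.Visit("http://go-colly.org/")
}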

How to configure

colly is a flexible framework that gives developers plenty of configuration options. By default, each option is set to a sensible default value.

The following is a collector created with the default configuration.

c := colly.NewCollector()

You can configure the collector as you create it, for example setting the user agent and allowing repeated visits to the same URL. The code is as follows:

c2 := colly.NewCollector(
    colly.UserAgent("xy"),
    colly.AllowURLRevisit(),
)

We can also create the collector first and then change its configuration.

c2 := colly.NewCollector()
c2.UserAgent = "xy"
c2.AllowURLRevisit = true

The collector's configuration can be changed at any stage of the crawl. A classic example of defeating simple anti-crawling measures is to randomly change the User-Agent on each request.

// requires the standard "math/rand" package to be imported
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

func RandomString() string {
    b := make([]byte, rand.Intn(10)+10)
    for i := range b {
        b[i] = letterBytes[rand.Intn(len(letterBytes))]
    }
    return string(b)
}

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", RandomString())
})

As mentioned earlier, collectors come with sensible configuration defaults, but these can also be changed through environment variables. This way, we don't have to recompile every time the configuration changes. Environment variable configuration takes effect at collector initialization and can still be overridden in code after startup.

Supported configuration items are as follows:

ALLOWED_DOMAINS (string slice), allowed domain names, e.g. []string{"segmentfault.com", "zhihu.com"}
CACHE_DIR (string), cache directory
DETECT_CHARSET (y/n), whether to detect the response encoding
DISABLE_COOKIES (y/n), disable cookies
DISALLOWED_DOMAINS (string slice), forbidden domain names, same type as ALLOWED_DOMAINS
IGNORE_ROBOTSTXT (y/n), whether to ignore the robots.txt protocol
MAX_BODY_SIZE (int), maximum response body size
MAX_DEPTH (int, 0 means unlimited), maximum crawl depth
PARSE_HTTP_ERROR_RESPONSE (y/n), whether to parse HTTP error responses
USER_AGENT (string), the User-Agent header

All of these options are easy to understand.
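As a small sketch of this mechanism: from my reading of the source, colly picks up environment variables prefixed with COLLY_ when the collector is created, so treat the prefix and variable names below as assumptions to verify against your colly version.

// Assumption: colly reads COLLY_-prefixed environment variables in NewCollector().
// os.Setenv here stands in for variables set in the shell before the program runs.
os.Setenv("COLLY_USER_AGENT", "env-configured-agent")
os.Setenv("COLLY_MAX_DEPTH", "2")

c := colly.NewCollector()          // picks up the environment configuration
c.UserAgent = "overridden-in-code" // and it can still be overridden after startup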

Let's also look at the HTTP configuration, which is commonly used: proxies, various timeouts, and so on.

c := colly.NewCollector()
c.WithTransport(&http.Transport{
    Proxy: http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,          // timeout
        KeepAlive: 30 * time.Second,          // keepAlive timeout
        DualStack: true,
    }).DialContext,
    MaxIdleConns:          100,               // Maximum number of idle connections
    IdleConnTimeout:       90 * time.Second,  // Idle connection timeout
    TLSHandshakeTimeout:   10 * time.Second,  // TLS handshake timeout
    ExpectContinueTimeout: 1 * time.Second,  
})

debugging

scrapy provides a very handy shell that makes debugging easy. Unfortunately, colly has no such feature; the debugger here mainly refers to collecting runtime information.

Debugger is an interface; by implementing just its two methods you can collect runtime information.

type Debugger interface {
    // Init initializes the backend
    Init() error
    // Event receives a new collector event.
    Event(e *Event)
}

There is a ready-made implementation in the source code, LogDebugger. We only need to supply a variable of the appropriate io.Writer type. How is it used?

An example is as follows:

package main

import (
    "log"
    "os"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/debug"
)

func main() {
    writer, err := os.OpenFile("collector.log", os.O_RDWR|os.O_CREATE, 0666)
    if err != nil {
        panic(err)
    }

    c := colly.NewCollector(colly.Debugger(&debug.LogDebugger{Output: writer}), colly.MaxDepth(2))
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        if err := e.Request.Visit(e.Attr("href")); err != nil {
            log.Printf("visit err: %v", err)
        }
    })

    if err := c.Visit("http://go-colly.org/"); err != nil {
        panic(err)
    }
}

When the run is complete, open collector.log to view the output.

Distributed

Distributed crawling can be considered at several levels: the proxy level, the execution level, and the storage level.

Proxy level

By setting up a proxy pool, we can assign download tasks to different nodes, which helps improve the crawler's download speed. At the same time, it effectively reduces the chance of an IP being blocked for crawling too fast.

The code for using proxy IPs in colly is as follows:

package main

import (
    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func main() {
    c := colly.NewCollector()

    if p, err := proxy.RoundRobinProxySwitcher(
        "socks5://127.0.0.1:1337",
        "socks5://127.0.0.1:1338",
        "http://127.0.0.1:8080",
    ); err == nil {
        c.SetProxyFunc(p)
    }
    // ...
}

proxy.RoundRobinProxySwitcher is colly's built-in function for switching proxies in round-robin order. Of course, we can also fully customize the switching logic.

For example, here is a case where the proxy is switched at random:

var proxies []*url.URL = []*url.URL{
    &url.URL{Host: "127.0.0.1:8080"},
    &url.URL{Host: "127.0.0.1:8081"},
}

func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
    // rand comes from the standard "math/rand" package
    return proxies[rand.Intn(len(proxies))], nil
}

// ...
c.SetProxyFunc(randomProxySwitcher)

It is important to note, however, that at this point the crawler is still centralized: the task runs on a single node.

Execution level

This approach achieves true distribution by assigning tasks to different nodes for execution.

To implement distributed execution, one problem has to be faced first: how do we assign tasks to different nodes and get the task nodes to work together?

First, we need to choose an appropriate communication scheme. Common protocols are HTTP, a stateless text protocol, and TCP, a connection-oriented protocol. Beyond those, there is a rich variety of RPC protocols to choose from, such as JSON-RPC, Facebook's Thrift, Google's gRPC, and so on.

The documentation provides sample code for an HTTP service that receives requests and performs the crawling task. As follows:

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/gocolly/colly"
)

type pageInfo struct {
    StatusCode int
    Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
    URL := r.URL.Query().Get("url")
    if URL == "" {
        log.Println("missing URL argument")
        return
    }
    log.Println("visiting", URL)

    c := colly.NewCollector()

    p := &pageInfo{Links: make(map[string]int)}

    // count links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            p.Links[link]++
        }
    })

    // extract status code
    c.OnResponse(func(r *colly.Response) {
        log.Println("response received", r.StatusCode)
        p.StatusCode = r.StatusCode
    })
    c.OnError(func(r *colly.Response, err error) {
        log.Println("error:", r.StatusCode, err)
        p.StatusCode = r.StatusCode
    })

    c.Visit(URL)

    // dump results
    b, err := json.Marshal(p)
    if err != nil {
        log.Println("failed to serialize response:", err)
        return
    }
    w.Header().Add("Content-Type", "application/json")
    w.Write(b)
}

func main() {
    // example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
    addr := ":7171"

    http.HandleFunc("/", handler)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}

The documentation provides no scheduler code, but implementing one is not complicated. When a task completes, the service returns the links it found to the scheduler, and the scheduler is responsible for sending new tasks to the worker nodes for execution.
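To make the idea concrete, here is a minimal scheduler sketch. It is hypothetical, not from the colly docs: the worker addresses are made up, and it simply feeds URLs to the worker service above and queues the links that come back.

package main

import (
    "encoding/json"
    "log"
    "net/http"
    "net/url"
)

// pageInfo mirrors the structure returned by the worker service above.
type pageInfo struct {
    StatusCode int
    Links      map[string]int
}

func main() {
    workers := []string{"http://127.0.0.1:7171", "http://127.0.0.1:7172"} // assumed worker nodes
    queue := []string{"http://go-colly.org/"}
    seen := map[string]bool{}

    for i := 0; len(queue) > 0; i++ {
        target := queue[0]
        queue = queue[1:]
        if seen[target] {
            continue
        }
        seen[target] = true

        // naive round-robin over the worker nodes
        worker := workers[i%len(workers)]
        resp, err := http.Get(worker + "/?url=" + url.QueryEscape(target))
        if err != nil {
            log.Println("worker request failed:", err)
            continue
        }

        var p pageInfo
        err = json.NewDecoder(resp.Body).Decode(&p)
        resp.Body.Close()
        if err != nil {
            log.Println("decode failed:", err)
            continue
        }

        // enqueue newly discovered links as new tasks
        for link := range p.Links {
            queue = append(queue, link)
        }
        log.Printf("visited %s via %s: status %d, %d links", target, worker, p.StatusCode, len(p.Links))
    }
}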

If the executing node needs to be chosen based on node load, the service also needs to expose monitoring APIs that report node performance data, so the scheduler can make its decisions.
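As an illustration only, a worker could expose a load endpoint like the following for the scheduler to poll; the path and fields are invented for this sketch.

// Hypothetical /load endpoint for the worker service; register it with
// http.HandleFunc("/load", loadHandler) next to the crawl handler above.
type nodeStatus struct {
    Goroutines int    `json:"goroutines"`
    AllocBytes uint64 `json:"alloc_bytes"`
}

func loadHandler(w http.ResponseWriter, r *http.Request) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // requires the standard "runtime" package
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(nodeStatus{
        Goroutines: runtime.NumGoroutine(),
        AllocBytes: m.Alloc,
    })
}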

Storage Level

With the above, we have distributed the tasks to different nodes for execution. However, some data, such as cookies and records of visited URLs, needs to be shared between nodes. By default, this data lives in memory and is only available to the collector that owns it.

We can share data among nodes by saving it to redis, mongo, and so on. colly supports swapping in any storage backend, as long as it implements the methods of the colly/storage.Storage interface.
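For reference, the colly/storage.Storage interface in the colly v1 source looks roughly like this; double-check it against the version you use.

type Storage interface {
    // Init initializes the storage backend
    Init() error
    // Visited records that a request ID has been visited
    Visited(requestID uint64) error
    // IsVisited reports whether a request ID was visited before
    IsVisited(requestID uint64) (bool, error)
    // Cookies returns the stored cookies for a host
    Cookies(u *url.URL) string
    // SetCookies stores cookies for a host
    SetCookies(u *url.URL, cookies string)
}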

In fact, colly has several built-in storage implementations; see storage. This topic is also covered in the next section.

storage

We just touched on this topic; now let's take a closer look at which storage backends colly already supports.

InMemoryStorage, i.e. memory, is colly's default storage. We can replace it with collector.SetStorage().

RedisStorage: perhaps because redis is used more often in distributed scenarios, an official usage example is provided.

There are also Sqlite3Storage and MongoStorage.
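A rough usage sketch with the Redis backend, based on the gocolly/redisstorage package; the field names and address below are assumptions to check against that package's README.

package main

import (
    "github.com/gocolly/colly"
    "github.com/gocolly/redisstorage"
)

func main() {
    c := colly.NewCollector()

    storage := &redisstorage.Storage{
        Address:  "127.0.0.1:6379", // assumed local redis instance
        Password: "",
        DB:       0,
        Prefix:   "colly_demo",
    }

    // Replace the default in-memory storage with redis.
    if err := c.SetStorage(storage); err != nil {
        panic(err)
    }

    c.Visit("http://go-colly.org/")
}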

Multiple Collectors

The crawls we've demonstrated so far are fairly simple, with very similar processing logic. For a complex crawl, we can create different collectors to handle different tasks.

How should we understand this? Let's take an example.

If you have written crawlers for a while, you have probably run into the problem of crawling parent and child pages. Usually, the processing logic for a parent page differs from that of its child pages, and there is often a need to share data between them. If you have used scrapy, you know that scrapy handles different pages by binding a callback function to each request, and shares data by attaching it to the request so it is passed from parent page to child page.

After some research, I found that colly does not support this scrapy-style approach. So what should we do? That is exactly the problem we are about to solve.

For the processing logic of different pages, we can create multiple collectors, each handling the logic of one kind of page.

c := colly.NewCollector(
    colly.UserAgent("myUserAgent"),
    colly.AllowedDomains("foo.com", "bar.com"),
)
// Custom User-Agent and allowed domains are cloned to c2
c2 := c.Clone()

Typically, the parent-page and child-page collectors share the same configuration. In the example above, the child-page collector c2 copies the parent collector's configuration via Clone().

Data transfer between parent and child pages is done between the different collectors through a Context. Note that this Context is just a data-sharing structure implemented by colly, not the Context from the Go standard library.

c.OnResponse(func(r *colly.Response) {
    r.Ctx.Put("Custom-header", r.Headers.Get("Custom-Header"))
    c2.Request("GET", "https://foo.com/", nil, r.Ctx, nil)
})

In this way, we can retrieve the parent page's data through r.Ctx in the child page's callbacks. For this scenario, we can look at the officially provided coursera_courses example.
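Continuing the snippet above, a small sketch of how the child collector reads the shared value:

c2.OnResponse(func(r *colly.Response) {
    // retrieve the value the parent collector put into the shared context
    fmt.Println("Custom-header from parent:", r.Ctx.Get("Custom-header"))
})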

Configuration optimization

colly's default configuration is optimized for crawling a small number of sites. If you are crawling many sites, some improvements are needed.

Persistent Storage

By default, cookies and visited URLs in colly are stored in memory; we want to replace this with persistent storage. As mentioned earlier, colly already implements some common persistent storage components.

Enable async mode to speed up task execution

By default, colly blocks while waiting for a request to complete, which leads to an ever-growing number of tasks waiting to execute. We can avoid this by setting the collector's Async option to true for asynchronous processing. If you do this, remember to call c.Wait(), otherwise the program will exit immediately.
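A minimal sketch of enabling async mode:

c := colly.NewCollector(
    colly.Async(true), // requests no longer block
)

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})

c.Visit("http://go-colly.org/")
c.Wait() // without Wait() the program would exit before the requests finish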

Disable or limit KeepAlive connections

colly enables KeepAlive by default to increase crawling speed. However, this keeps file descriptors open, and long-running processes can easily hit the maximum file descriptor limit.

Sample code for disabling HTTP KeepAlive is shown below.

c := colly.NewCollector()
c.WithTransport(&http.Transport{
    DisableKeepAlives: true,
})

extend

colly provides extensions for common crawler-related features, such as referer, random_user_agent, url_length_filter, and so on. The source lives under colly/extensions/.

Let's learn how to use them through an example:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
)

func main() {
    c := colly.NewCollector()
    visited := false

    extensions.RandomUserAgent(c)
    extensions.Referrer(c)

    c.OnResponse(func(r *colly.Response) {
        log.Println(string(r.Body))
        if !visited {
            visited = true
            r.Request.Visit("/get?q=2")
        }
    })

    c.Visit("http://httpbin.org/get")
}

Simply pass the collector to the extension function. That's all there is to it.

So, can we write an extension of our own?

With scrapy, if we want to implement an extension we have to understand a number of concepts in advance and read its documentation carefully. colly's documentation, however, offers no explanation at all. What to do, then? It seems the only option is to read the source code.

Let's open the source for the referer plug-in as follows:

package extensions

import (
    "github.com/gocolly/colly"
)

// Referer sets valid Referer HTTP header to requests.
// Warning: this extension works only if you use Request.Visit
// from callbacks instead of Collector.Visit.
func Referer(c *colly.Collector) {
    c.OnResponse(func(r *colly.Response) {
        r.Ctx.Put("_referer", r.Request.URL.String())
    })
    c.OnRequest(func(r *colly.Request) {
        if ref := r.Ctx.Get("_referer"); ref != "" {
            r.Headers.Set("Referer", ref)
        }
    })
}

As you can see, an extension is simply implemented by adding some event callbacks to the collector. With source code this simple, you can write your own extension without any documentation at all. Of course, if you look closely, the idea is similar to scrapy: both extend behavior through callbacks on requests and responses. colly's simplicity owes a great deal to its elegant design and to Go's simple syntax.
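Following the same pattern, a custom extension of our own could look like this; the extension name and header are made up for illustration.

package extensions

import (
    "log"

    "github.com/gocolly/colly"
)

// TagRequests stamps every outgoing request with a custom header
// and logs the size of every response.
func TagRequests(c *colly.Collector) {
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("X-Crawler", "my-colly-bot")
    })
    c.OnResponse(func(r *colly.Response) {
        log.Printf("%s returned %d bytes", r.Request.URL, len(r.Body))
    })
}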

summary

After reading colly's official documentation, you will find that although it is rudimentary, it covers most of what you need. Where something is not covered, I have added the relevant material in this article. The Go elastic package I used before is likewise, regrettably, short on documentation, but by simply reading its source code you can immediately see how to use it.

Perhaps this is the simplicity of the Go way.

Finally, if you run into any problems with colly, the official examples are absolutely the best reference, and I recommend taking the time to read through them.
