Go Crawler HTTP Request QuickStart

Keywords: Go JSON encoding network github

A few days ago, I wrote about learning by imitation and gave an example of imitating Python's requests with the net/http client. But I never actually put it into practice. Is an idea really just an idea? Of course not, so I decided to set my Go notes aside for a week and try the idea out.

Imitation is also a good way to pick up new knowledge.

This article will use Go to implement all the examples in the requests quickstart document and, along the way, learn how to use the http Client systematically. Although the title says quick start, there is a lot of content.

Quick Experience

First, let's issue a GET request. The code is very simple, as follows:

func get() {
    r, err := http.Get("https://api.github.com/events")
    if err != nil {
        panic(err)
    }
    defer func() { _ = r.Body.Close() }()

    body, _ := ioutil.ReadAll(r.Body)
    fmt.Printf("%s", body)
}

http.Get returns a Response and an error, namely r and err. Through r we can get all the response information; err is there for error checking.

r.Body must be closed after it is read, which defer handles nicely. The content can be read with ioutil.ReadAll.

Request method

Besides GET, HTTP has a series of other methods, including POST, PUT, DELETE, HEAD, and OPTIONS. The GET in the quick experience was implemented with a convenience function that hides many details, so we won't use it for now.

Let's start with a general approach that can issue requests with any HTTP method. It mainly involves two important types: Client and Request.

Client is the client that sends HTTP requests; every request is executed by a Client. It provides some convenience methods, e.g. a GET request can be issued via client.Get(url). The more general way is client.Do(req), where req is a value of the Request type.

Request is a struct that describes the request information: method, URL, header, and so on, all of which we can set. A Request is created with http.NewRequest.

Next, the implementation code of all HTTP methods is listed.

GET

req, _ := http.NewRequest(http.MethodGet, "https://api.github.com/events", nil)
r, err := http.DefaultClient.Do(req)

POST

req, _ := http.NewRequest(http.MethodPost, "http://httpbin.org/post", nil)
r, err := http.DefaultClient.Do(req)

PUT

req, _ := http.NewRequest(http.MethodPut, "http://httpbin.org/put", nil)
r, err := http.DefaultClient.Do(req)

DELETE

req, _ := http.NewRequest(http.MethodDelete, "http://httpbin.org/delete", nil)
r, err := http.DefaultClient.Do(req)

HEAD

req, _ := http.NewRequest(http.MethodHead, "http://httpbin.org/get", nil)
r, err := http.DefaultClient.Do(req)

OPTIONS

req, _ := http.NewRequest(http.MethodOptions, "http://httpbin.org/get", nil)
r, err := http.DefaultClient.Do(req)

The above shows how to issue every HTTP method. A few more points need explaining.

DefaultClient is the default client provided by the net/http package. For ordinary requests there is no need to create a new Client; the default one is fine.

For GET, POST, and HEAD requests, Go provides more convenient implementations that don't require building the Request manually.

In the sample code below, each of these request methods has two equivalent implementations.

GET

r, err := http.DefaultClient.Get("http://httpbin.org/get")
r, err := http.Get("http://httpbin.org/get")

POST

bodyJson, _ := json.Marshal(map[string]interface{}{
    "key": "value",
})
r, err := http.DefaultClient.Post(
    "http://httpbin.org/post",
    "application/json",
    strings.NewReader(string(bodyJson)),
)
r, err := http.Post(
    "http://httpbin.org/post",
    "application/json",
    strings.NewReader(string(bodyJson)),
)

This also demonstrates how to POST JSON data: the main thing is to set the content type, which for a JSON API is application/json.

HEAD

r, err := http.DefaultClient.Head("http://httpbin.org/get")
r, err := http.Head("http://httpbin.org/get")

If you look at the source code, you'll find that http.Get simply calls http.DefaultClient.Get; they are the same thing, the former just being more convenient. Head and Post work the same way.

URL parameter

By placing key/value pairs in the URL, we can deliver data to a specific address. The key/value pairs follow a question mark, e.g. http://httpbin.org/get?key=val. Building URLs by hand is cumbersome, but the methods provided by net/url can do it for us.

For example, suppose we want to pass key1=value1 and key2=value2 to http://httpbin.org/get. The code is as follows:

req, err := http.NewRequest(http.MethodGet, "http://httpbin.org/get", nil)
if err != nil {
    panic(err)
}

params := make(url.Values)
params.Add("key1", "value1")
params.Add("key2", "value2")

req.URL.RawQuery = params.Encode()

// The resulting URL: http://httpbin.org/get?key1=value1&key2=value2
// fmt.Println(req.URL.String())

r, err := http.DefaultClient.Do(req)

url.Values helps organize the query string; a look at the source shows that url.Values is really a map[string][]string. Calling its Encode method produces the encoded string, which is assigned to the request's URL.RawQuery. Array parameters can also be set through url.Values, giving a URL of the following form:

http://httpbin.org/get?key1=v...

How to do it?

params := make(url.Values)
params.Add("key1", "value1")
params.Add("key2", "value2")
params.Add("key2", "value3")

Look at the last line of code: we simply add another value under key2.

Response information

How do we inspect the response information once a request succeeds? A response usually contains the Body, the Status, the Header, and the Encoding.

Body

The Body-reading process was in fact already demonstrated at the beginning: the response content can be read with ioutil.ReadAll.

body, err := ioutil.ReadAll(r.Body)

The response content varies. If it is JSON, it can be decoded directly with json.Unmarshal; JSON handling won't be covered here.

r.Body implements the io.ReadCloser interface. To avoid wasting resources, release it promptly, which defer makes convenient.

defer func() { _ = r.Body.Close() }()

StatusCode

Besides the Body content, the response carries other information, such as the status code and charset.

r.StatusCode
r.Status

r.StatusCode is the HTTP status code, and r.Status is the status description.

Header

Response header information can be read from r.Header. One point to note: the keys of the response header are case-insensitive.

r.Header.Get("content-type")
r.Header.Get("Content-Type")

You'll find that content-type and Content-Type return exactly the same content.

Encoding

How do we identify the encoding of the response content? We need help from the golang.org/x/net/html/... package (its charset subpackage, used below). First, define a function. The code is as follows:

func determineEncoding(r *bufio.Reader) encoding.Encoding {
    bytes, err := r.Peek(1024)
    if err != nil {
        fmt.Printf("err %v", err)
        return unicode.UTF8
    }

    e, _, _ := charset.DetermineEncoding(bytes, "")

    return e
}

How to call it?

bodyReader := bufio.NewReader(r.Body)
e := determineEncoding(bodyReader)
fmt.Printf("Encoding %v\n", e)

decodeReader := transform.NewReader(bodyReader, e.NewDecoder())

A buffered reader is created with bufio, then determineEncoding detects the content's encoding, and transform.NewReader converts the content with the corresponding decoder.

Picture Download

If the response content is a picture, how do we download it? For example, the picture at the address below.

https://pic2.zhimg.com/v2-5e8...

It's actually very simple: just create a file and save the response content into it.

f, err := os.Create("as.jpg")
if err != nil {
    panic(err)
}
defer func() { _ = f.Close() }()

_, err = io.Copy(f, r.Body)
if err != nil {
    panic(err)
}

Here r is the Response. A new file is created with os, then the response content is saved into it via io.Copy.

Customized request header

How do we customize request headers? Request already provides a way: req.Header.Add.

For example, suppose we are going to visit http://httpbin.org/get, but this address has an anti-crawler rule keyed on User-Agent, so we need to change the default User-Agent.

Sample code:

req, err := http.NewRequest(http.MethodGet, "http://httpbin.org/get", nil)
if err != nil {
    panic(err)
}

req.Header.Add("user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0)")

With that, the task is accomplished.

Complex POST requests

The way to submit JSON data to a POST endpoint was shown earlier. Next, a few other ways to submit data to a POST endpoint: form submission and file submission.

Form submission

Form submission is a very common need, so besides the standard usage, net/http also provides us with a simplified method.

Let's first look at the standard implementation.

For example, suppose you want to submit a form to http://httpbin.org/post with name poloxue and password 123456.

payload := make(url.Values)
payload.Add("name", "poloxue")
payload.Add("password", "123456")
req, err := http.NewRequest(
    http.MethodPost,
    "http://httpbin.org/post",
    strings.NewReader(payload.Encode()),
)
if err != nil {
    panic(err)
}
req.Header.Add("Content-Type", "application/x-www-form-urlencoded")

r, err := http.DefaultClient.Do(req)

The POST payload is a string like name=poloxue&password=123456, which we can assemble with url.Values.

The body passed to NewRequest must be a type implementing the Reader interface, hence the strings.NewReader conversion.

The content type of a form submission is application/x-www-form-urlencoded, and it must be set as well.

That was the long-hand way. Now the simplified method: a form submission can be completed with a single call to http.PostForm. The sample code is as follows:

payload := make(url.Values)
payload.Add("name", "poloxue")
payload.Add("password", "123456")
r, err := http.PostForm("http://httpbin.org/post", payload)

It's so simple.

File submission

File upload should be the most complex of HTTP requests. It isn't actually difficult; unlike other requests, we just have to spend some effort reading the file and organizing the POST data.

For example, suppose I have a picture file called as.jpg under the /Users/polo directory, and want to submit this image to http://httpbin.org/post.

We need to organize the POST submission first. The code is as follows:

filename := "/Users/polo/as.jpg"

f, err := os.Open(filename)
if err != nil {
    panic(err)
}
defer func() { _ = f.Close() }()

uploadBody := &bytes.Buffer{}
writer := multipart.NewWriter(uploadBody)

fWriter, err := writer.CreateFormFile("uploadFile", filename)
if err != nil {
    panic(err)
}

_, err = io.Copy(fWriter, f)
if err != nil {
    panic(err)
}

fieldMap := map[string]string{
    "filename": filename,
}
for k, v := range fieldMap {
    _ = writer.WriteField(k, v)
}

err = writer.Close()
if err != nil {
    panic(err)
}

In my opinion, data organization can be divided into several steps, as follows:

  • The first step is to open the file to upload, with defer f.Close() ready so the resource is released.
  • The second step is to create a bytes.Buffer, named uploadBody, that will hold the upload content.
  • The third step is to create a writer via multipart.NewWriter that writes the multipart content into the buffer.
  • The fourth step is to create the upload-file field via writer.CreateFormFile and copy the file's content into it via io.Copy.
  • Finally, add any extra fields via writer.WriteField, and take care to close the writer at the end.

At this point, the data uploaded by the file is organized. Next, just call the http.Post method to complete the file upload.

r, err := http.Post("http://httpbin.org/post", writer.FormDataContentType(), uploadBody)

It's important to note that the request's content type must be set, and the multipart content type can be obtained via writer.FormDataContentType().

With that, file submission is done too. Hopefully it felt reasonably simple.

Cookie

Cookies involve two parts: reading the response's cookies and setting the request's cookies. Reading the response's cookies is very simple: just call r.Cookies().

The focus is on how to set request cookies, and there are two ways: on the Client, or on the Request.

Setting Cookie on Client

Look directly at the sample code:

cookies := make([]*http.Cookie, 0)

cookies = append(cookies, &http.Cookie{
    Name:   "name",
    Value:  "poloxue",
    Domain: "httpbin.org",
    Path:   "/cookies",
})
cookies = append(cookies, &http.Cookie{
    Name:   "id",
    Value:  "10000",
    Domain: "httpbin.org",
    Path:   "/elsewhere",
})

url, err := url.Parse("http://httpbin.org/cookies")
if err != nil {
    panic(err)
}

jar, err := cookiejar.New(nil)
if err != nil {
    panic(err)
}
jar.SetCookies(url, cookies)

client := http.Client{Jar: jar}

r, err := client.Get("http://httpbin.org/cookies")

In the code, we first create a slice of *http.Cookie and append two cookies to it. Then the two cookies are stored via a cookiejar.

This time, instead of using the default DefaultClient, we will create a new Client and bind the cookiejar that holds cookie information to the client. Next, you just need to use the newly created Client to initiate the request.

Setting Cookie on Request

The cookie settings on the request can be implemented by req.AddCookie. Sample code:

req, err := http.NewRequest(http.MethodGet, "http://httpbin.org/cookies", nil)
if err != nil {
    panic(err)
}

req.AddCookie(&http.Cookie{
    Name:   "name",
    Value:  "poloxue",
    Domain: "httpbin.org",
    Path:   "/cookies",
})

r, err := http.DefaultClient.Do(req)

It's quite simple. There's nothing to introduce.

What's the difference between setting cookies on the Client and on the Request? The most obvious: cookies set on a Request are only valid for that single request, while cookies on a Client stay in effect for as long as that Client is used.

Redirection and request history

By default, all types of requests automatically handle redirection.

Python's requests package does not follow redirects for HEAD requests, but testing shows that net/http redirects HEAD automatically.

Redirect behavior in net/http is controlled through a Client member named CheckRedirect, which is a function type, defined as follows:

type Client struct {
    ...
    CheckRedirect func(req *Request, via []*Request) error
    ...
}

Next, let's see how to use it.

Suppose we want the following behavior: to prevent redirect loops, allow at most 10 redirects, and record the historical Responses.

Sample code:

var r *http.Response
history := make([]*http.Response, 0)

client := http.Client{
    CheckRedirect: func(req *http.Request, hrs []*http.Request) error {
        if len(hrs) >= 10 {
            return errors.New("redirected too many times")
        }

        history = append(history, req.Response)
        return nil
    },
}

r, err := client.Get("http://github.com")

First, a slice of *http.Response named history is created to record the responses. Then an anonymous function is assigned to http.Client's CheckRedirect to control the redirect behavior. Its first parameter is the Request that is about to be issued, and the second is the list of Requests that have already been issued.

When a redirect occurs, the Request about to be issued holds the Response of the previous request, so req.Response can be appended to the history variable here.

Timeout

Wouldn't it be embarrassing if, after a Request went out, the server just never responded? Naturally we wonder whether a timeout can be set on requests. Of course it can.

Timeouts can be divided into connection timeouts and response read timeouts, both of which can be set individually. Normally you don't need such a sharp distinction, though, and can simply set a total timeout.

total timeout

The total timeout is set via the Client member named Timeout, whose type is time.Duration.

Suppose the timeout is 10 seconds; sample code:

client := http.Client{
    Timeout: 10 * time.Second,
}

Connection timeout

The connection timeout is handled through the Client's Transport, which is the underlying data carrier for HTTP. Transport has a function-typed field named Dial that can be used to set the connection timeout.

Assuming that the connection timeout time is set to 2 seconds, the sample code:

t := &http.Transport{
    Dial: func(network, addr string) (net.Conn, error) {
        timeout := time.Duration(2 * time.Second)
        return net.DialTimeout(network, addr, timeout)
    },
}

Inside the Dial function, net.DialTimeout is used to make the network connection, which gives us the connection-timeout behavior.

Read timeout

The read timeout is also set through the Client's Transport, for example making the wait for response headers time out after 8 seconds.

Sample code:

t := &http.Transport{
    ResponseHeaderTimeout: time.Second * 8,
}

Putting it all together, the Client creation code is as follows:

t := &http.Transport{
    Dial: func(network, addr string) (net.Conn, error) {
        timeout := time.Duration(2 * time.Second)
        return net.DialTimeout(network, addr, timeout)
    },
    ResponseHeaderTimeout: time.Second * 8,
}
client := http.Client{
    Transport: t,
    Timeout:   time.Duration(10 * time.Second),
}

Besides the timeouts above, Transport has other timeout settings; looking at Transport's definition, three timeout-related fields can be found:

// IdleConnTimeout is the maximum amount of time an idle
// (keep-alive) connection will remain idle before closing
// itself.
// Zero means no limit.
IdleConnTimeout time.Duration

// ResponseHeaderTimeout, if non-zero, specifies the amount of
// time to wait for a server's response headers after fully
// writing the request (including its body, if any). This
// time does not include the time to read the response body.
ResponseHeaderTimeout time.Duration

// ExpectContinueTimeout, if non-zero, specifies the amount of
// time to wait for a server's first response headers after fully
// writing the request headers if the request has an
// "Expect: 100-continue" header. Zero means no timeout and
// causes the body to be sent immediately, without
// waiting for the server to approve.
// This time does not include the time to send the request header.
ExpectContinueTimeout time.Duration

They are IdleConnTimeout (how long an idle keep-alive connection may stay open), ResponseHeaderTimeout (how long to wait for the server's response headers after the request is written), and ExpectContinueTimeout (how long to wait for the first response headers when the request carries an Expect: 100-continue header; see the comments above).

With that, the timeout settings are complete. It really is quite a bit more involved than Python's requests.

Request proxy

Proxies matter, especially to crawler developers. So how does net/http set up a proxy? This again depends on the Client's Transport member, that important Transport.

Transport has a member named Proxy. Let's see how to use it: suppose we want to request Google's home page through a proxy at http://127.0.0.1:8087.

Sample code:

proxyUrl, err := url.Parse("http://127.0.0.1:8087")
if err != nil {
    panic(err)
}
t := &http.Transport{
    Proxy:           http.ProxyURL(proxyUrl),
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
client := http.Client{
    Transport: t,
    Timeout:   time.Duration(10 * time.Second),
}

r, err := client.Get("https://google.com")

Focus on the creation of the http.Transport: its two fields, Proxy and TLSClientConfig, set the proxy and disable HTTPS certificate verification, respectively. Actually, I found the request can succeed without setting TLSClientConfig; I haven't dug into why.

error handling

Error handling hardly needs a dedicated introduction. As usual in Go, you check the returned error; HTTP requests are no different, returning the appropriate error for situations like timeouts or failed connections.

The errors in the sample code are all thrown with panic. That's certainly not right for real projects, where we need to log the relevant information and do error recovery as appropriate.

summary

This article took Python's requests quickstart document as a guide and worked out how each of its examples is implemented in Go. Incidentally, Go actually has a clone of requests, [github address](https://github.com/levigross/...). I haven't read it yet; interested readers can study it.

Posted by Orpheus13 on Sun, 28 Jul 2019 23:07:23 -0700