Crawling zhenai.com (珍爱网) with Go for the first time

Keywords: Go encoding github git Java

Let's use Go to crawl user information from zhenai.com.

First, the URL to request is:

http://www.zhenai.com/zhenghun

Next, request the URL with Go. The code is as follows:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Send the GET request.
	resp, err := http.Get("http://www.zhenai.com/zhenghun")
	if err != nil {
		panic(fmt.Errorf("http get failed: %v", err))
	}

	// Close the response body when main returns.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Println("Error: status code is", resp.StatusCode)
		return
	}

	// Read the whole response body.
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		// Stop here: body is incomplete if the read failed.
		fmt.Println("Error: reading body failed:", err)
		return
	}

	// Print the response body.
	fmt.Println("body is", string(body))
}
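For reuse in later steps, the request logic above can be pulled into a function that returns an error instead of printing or panicking. The following is only a sketch: fetch, readOK and fakeResponse are hypothetical names that do not appear in the original program, and io.ReadAll (Go 1.16 and later) stands in for ioutil.ReadAll. The example feeds readOK hand-built responses so that it runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readOK returns the body of resp, or an error if the status is not 200 OK.
// It always closes resp.Body. (Hypothetical helper, not from the original post.)
func readOK(resp *http.Response) ([]byte, error) {
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// fetch requests url and returns the raw, still undecoded body bytes.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, fmt.Errorf("http get %s: %w", url, err)
	}
	return readOK(resp)
}

// fakeResponse builds an in-memory response so readOK can be tried offline.
func fakeResponse(status int, body string) *http.Response {
	return &http.Response{
		StatusCode: status,
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	body, err := readOK(fakeResponse(http.StatusOK, "<html>ok</html>"))
	fmt.Println(string(body), err)

	_, err = readOK(fakeResponse(http.StatusNotFound, ""))
	fmt.Println(err)

	// Against the real site one would call fetch("http://www.zhenai.com/zhenghun").
}
```

The rest of this post keeps everything in a single main function for brevity, and the results described below refer to that original program.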

After running it, the response body turns out to contain a lot of garbled text.

The response body declares its encoding as GBK, while Go treats text as UTF-8 by default, hence the garbled text. Next, we use a third-party library to convert the body to UTF-8.
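To see why GBK bytes come out garbled, it helps to check them against UTF-8 directly. The sketch below uses only the standard library; isValidUTF8 is a hypothetical helper name, and the four bytes are the GBK encoding of the sample text "中文", chosen purely for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 reports whether b can be interpreted as UTF-8 text.
func isValidUTF8(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	// "中文" encoded as GBK: two bytes per character.
	gbkBytes := []byte{0xD6, 0xD0, 0xCE, 0xC4}
	// The same text as Go stores it in a string literal (UTF-8): three bytes per character.
	utf8Bytes := []byte("中文")

	fmt.Println(isValidUTF8(gbkBytes))         // false
	fmt.Println(isValidUTF8(utf8Bytes))        // true
	fmt.Println(len(gbkBytes), len(utf8Bytes)) // 4 6
}
```

Because string(body) does no conversion, printing GBK bytes as a Go string hands the terminal byte sequences that are not valid UTF-8, which is what shows up as garbled text.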

On some networks golang.org/x/text cannot be reached without a proxy, and fetching it reports an error. In that case, clone the mirror from GitHub instead:

mkdir -p $GOPATH/src/golang.org/x
cd $GOPATH/src/golang.org/x
git clone https://github.com/golang/text.git

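To convert the GBK body to UTF-8, wrap resp.Body in a decoding reader and read from that instead. This needs the golang.org/x/text/transform and golang.org/x/text/encoding/simplifiedchinese imports: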

utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder())
body, err := ioutil.ReadAll(utf8Reader)

To keep the crawler general, we should not assume the response is always GBK. Instead, detect the actual encoding first and then convert the result to UTF-8. This needs another third-party package, golang.org/x/net/html/charset, which can also be cloned from GitHub:

mkdir -p $GOPATH/src/golang.org/x
cd $GOPATH/src/golang.org/x
git clone https://github.com/golang/net

So the code becomes this:

package main

import (
	"bufio"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/transform"
)

func main() {
	// Send the GET request.
	resp, err := http.Get("http://www.zhenai.com/zhenghun")
	if err != nil {
		panic(fmt.Errorf("http get failed: %v", err))
	}

	// Close the response body when main returns.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Println("Error: status code is", resp.StatusCode)
		return
	}

	// Wrap the body once. Peek fills this reader's buffer from resp.Body, so
	// decoding must continue from the same reader or the buffered bytes are lost.
	bodyReader := bufio.NewReader(resp.Body)
	utf8Reader := transform.NewReader(bodyReader, determineEncoding(bodyReader).NewDecoder())
	body, err := ioutil.ReadAll(utf8Reader)
	if err != nil {
		fmt.Println("Error: reading body failed:", err)
		return
	}

	// Print the response body.
	fmt.Println("body is", string(body))
}

// determineEncoding guesses the encoding from the first 1024 bytes of r.
func determineEncoding(r *bufio.Reader) encoding.Encoding {
	// Peek returns the bytes without advancing the reader.
	// io.EOF only means the body is shorter than 1024 bytes.
	body, err := r.Peek(1024)
	if err != nil && err != io.EOF {
		fmt.Println("Error: peeking at body failed:", err)
	}

	// Simplified: the detected name and the certainty flag are ignored.
	e, _, _ := charset.DetermineEncoding(body, "")
	return e
}
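One detail of the encoding-detection helper deserves care: Peek does not advance the bufio.Reader, but the bufio.Reader has already pulled those bytes (up to its 4096-byte default buffer) out of the underlying reader. Any later read therefore has to go through the same bufio.Reader, not the original source. The standard-library sketch below illustrates this with a strings.Reader standing in for resp.Body; peekDemo is a hypothetical name used only here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// peekDemo peeks at the first n bytes of data through a bufio.Reader, then
// reads what is left from the original source and from the buffered reader.
func peekDemo(data string, n int) (peeked, fromBuffered, fromSource string) {
	src := strings.NewReader(data) // stands in for resp.Body
	br := bufio.NewReader(src)

	head, err := br.Peek(n)
	if err != nil && err != io.EOF {
		return "", "", ""
	}
	// Copy immediately: the peeked slice is only valid until the next read.
	peeked = string(head)

	// The source has already been drained into br's buffer.
	rest, _ := io.ReadAll(src)
	fromSource = string(rest)

	// The buffered reader still starts at byte 0.
	all, _ := io.ReadAll(br)
	fromBuffered = string(all)
	return peeked, fromBuffered, fromSource
}

func main() {
	peeked, fromBuffered, fromSource := peekDemo("hello, gbk page", 5)
	fmt.Printf("peeked:        %q\n", peeked)       // "hello"
	fmt.Printf("from buffered: %q\n", fromBuffered) // "hello, gbk page"
	fmt.Printf("from source:   %q\n", fromSource)   // ""
}
```

For a page larger than the buffer, the source would still hold the tail of the body, but the buffered head would be missing, so decoding from the source directly would silently drop the start of the page.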

After running it, the garbled text is gone.
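The program passes an empty string as the second argument of charset.DetermineEncoding. That parameter is a Content-Type value, so passing resp.Header.Get("Content-Type") would let a charset declared by the server be taken into account as well. Whether zhenai.com actually sends one is not checked here. As a small standard-library sketch of what such a header carries, the hypothetical helper charsetFromContentType below extracts the charset label:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType returns the lower-cased charset parameter of a
// Content-Type header value, or "" if the header has none or is malformed.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=GBK"))  // gbk
	fmt.Println(charsetFromContentType("text/html;charset=utf-8")) // utf-8
	fmt.Printf("%q\n", charsetFromContentType("text/html"))        // ""
}
```

Logging this label next to the detected encoding is a cheap way to notice pages whose header and content disagree.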

That is as far as we crawl today. Tomorrow we will extract the URLs and cities from the returned body. See the next section.


Posted by godwheel on Thu, 17 Oct 2019 10:04:30 -0700