Crawling and downloading the Kugou TOP500 songs with Java

Keywords: Java Apache JSON SSL

The methods and code below are for learning purposes only and must not be used for anything else. The example uses several libraries, including jsoup, HttpClient, and net.sf.json, which you can download yourself.

1. Check whether the full TOP500 song list can be retrieved

First, open the Kugou homepage and look at the Kugou TOP500 chart. It is supposed to list 500 songs, so why are only 22 shown?

Are these really all there is, or can the rest be found somewhere? To find out, I looked at the link to the TOP500 page:

https://www.kugou.com/yy/rank/home/1-8888.html?from=rank

Notice the 1 right after home. Does it mean the first page? I changed the 1 to a 2, loaded the URL, and landed on the second page. So the full list of 500 songs can be fetched from the web, one page at a time.
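The page arithmetic can be sketched as follows. With 22 songs per page, 500 songs need 23 pages, the last one only partly filled. This is a minimal illustration: the class name RankPages and its method names are made up here, and only the URL pattern comes from the link above.

```java
import java.util.ArrayList;
import java.util.List;

public class RankPages {
    // URL pattern from the link above; PAGE is a placeholder for the page number.
    static final String LINK = "https://www.kugou.com/yy/rank/home/PAGE-8888.html?from=rank";

    // Number of pages needed to cover `total` songs at `perPage` songs per page (rounded up).
    static int pageCount(int total, int perPage) {
        return (total + perPage - 1) / perPage;
    }

    // Builds the URL of every chart page, from page 1 to the last page.
    static List<String> pageUrls(int total, int perPage) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(total, perPage);
        for (int page = 1; page <= pages; page++) {
            urls.add(LINK.replace("PAGE", String.valueOf(page)));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(500, 22)) {
            System.out.println(url);
        }
    }
}
```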

2. Find the real MP3 download address (this takes a bit of a detour)

Click a song to open its playback page, open the Elements panel in Chrome's developer tools, search for mp3, and the MP3 element is easy to locate.

However, the HTML fetched from Java contains no MP3 file address, so the page must load the MP3 into place with JavaScript. Refresh the page and watch what it loads, focusing on the JS and PHP requests. The only goal is to see whether one of them contains the MP3 address, so there is no need to analyze the details.

Eventually I found this request in the list:

https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&hash=667939C6E784265D541DEEE65AE4F2F8&album_id=0&_=1546235744251

The response to this request contains the full MP3 address:

"play_url": "http://fs.w.kugou.com/201812311325/dcf5b6449160903c6ee48035e11434bb/G128/M08/02/09/IIcBAFrZqf2ANOadADn94ubOmaU995.mp3",
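Because the request carries a callback parameter, the response is JSONP: the JSON body comes wrapped in a call to the named function. As a minimal sketch using only the standard library, the class below strips that wrapper and reads play_url with a regular expression. The class name JsonpPlayUrl, its method names, and the sample response are made up for illustration, and a real JSON parser would be the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonpPlayUrl {
    // Strips the "callbackName( ... )" wrapper and returns the JSON text inside it.
    static String unwrapJsonp(String jsonp) {
        int start = jsonp.indexOf('(');
        int end = jsonp.lastIndexOf(')');
        if (start < 0 || end <= start) {
            throw new IllegalArgumentException("not a JSONP response");
        }
        return jsonp.substring(start + 1, end);
    }

    // Pulls the play_url value out of the JSON text; returns null if the field is absent.
    static String playUrl(String json) {
        Matcher m = Pattern.compile("\"play_url\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        // JSON may escape "/" as "\/", so undo that before returning.
        return m.find() ? m.group(1).replace("\\/", "/") : null;
    }

    public static void main(String[] args) {
        // Made-up sample response, shaped like the one described above.
        String sample = "jQuery123_456({\"status\":1,\"data\":{\"play_url\":\"http:\\/\\/example.com\\/a.mp3\"}});";
        System.out.println(playUrl(unwrapJsonp(sample)));
    }
}
```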

So how does this request know which song to return? The hash parameter is the only thing that can identify the song. Going back to the playback page to look for this hash, I found it in the following JS:

var dataFromSmarty = [{"hash":"667939C6E784265D541DEEE65AE4F2F8","timelength":"237051","audio_name":"White Small White - Best Wedding","author_name":"White Small White","song_name":"Best Wedding","album_id":0}],//Song information on current page
            playType = "search_single";//Current Play
    </script>

Next, fetch the page from Java and check whether the hash is in the crawled HTML. It is: the crawled HTML contains this JS. At this point both the MP3 address and the song list have been located, so the next step is to implement it in a program.

3. Implementing the Kugou MP3 crawler in Java
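Pulling the hash out of the crawled HTML can be sketched with a single regular expression that captures only the value. The class name HashExtractor and the shortened sample string are illustrative; the character class [0-9A-Z] simply follows the look of the hash shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashExtractor {
    // Matches "hash":"<value>" and captures only the value between the quotes.
    private static final Pattern HASH = Pattern.compile("\"hash\":\"([0-9A-Z]+)\"");

    // Returns the first hash found in the page source, or null if there is none.
    static String extractHash(String html) {
        Matcher m = HASH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the dataFromSmarty line from the playback page.
        String html = "var dataFromSmarty = [{\"hash\":\"667939C6E784265D541DEEE65AE4F2F8\",\"album_id\":0}],";
        System.out.println(extractHash(html));
    }
}
```

Using a capture group avoids the follow-up string replacements that would otherwise be needed to trim the key and the quotes from the match.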

With the resources located, the implementation itself is straightforward. It uses several utility classes I wrote myself. It pays to keep your own utility classes organized, so that later you can reuse them directly instead of rewriting them. Without further ado, the source code follows.

SpiderKugou.java

package com.bing.spider;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.bing.download.FileDownload;
import com.bing.html.HtmlManage;
import com.bing.http.HttpGetConnect;

import net.sf.json.JSONObject;

public class SpiderKugou {

    public static String filePath = "F:/music/";
    public static String mp3 = "https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&"
            + "hash=HASH&album_id=0&_=TIME";

    public static String LINK = "https://www.kugou.com/yy/rank/home/PAGE-8888.html?from=rank";
    //"https://www.kugou.com/yy/rank/home/PAGE-23784.html?from=rank";


    public static void main(String[] args) throws IOException {

        for (int i = 1; i <= 23; i++) { // 500 songs at 22 per page span 23 pages
            String url = LINK.replace("PAGE", i + "");
            getTitle(url);
            //download("https://www.kugou.com/song/mfy6je5.html");
        }
    }

    public static String getTitle(String url) throws IOException{
        HttpGetConnect connect = new HttpGetConnect();
        String content = connect.connect(url, "utf-8");
        HtmlManage html = new HtmlManage();
        Document doc = html.manage(content);
        Element ele = doc.getElementsByClass("pc_temp_songlist").get(0);
        Elements eles = ele.getElementsByTag("li");
        for(int i = 0 ; i < eles.size() ; i++){
            Element item = eles.get(i);
            String title = item.attr("title").trim();
            String link = item.getElementsByTag("a").first().attr("href");

            download(link,title);
        }
        return null;
    }

    public static String download(String url,String name) throws IOException{
        String hash = "";
        HttpGetConnect connect = new HttpGetConnect();
        String content = connect.connect(url, "utf-8");
        HtmlManage html = new HtmlManage();

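        // Match "hash":"..." in the page source; the inner quotes must be escaped in Java
        String regEx = "\"hash\":\"[0-9A-Z]+\"";
        // Compile the regular expression
        Pattern pattern = Pattern.compile(regEx);
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            // Strip the key and the quotes so that only the hash value remains
            hash = matcher.group();
            hash = hash.replace("\"hash\":\"", "");
            hash = hash.replace("\"", "");
        }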

        String item = mp3.replace("HASH", hash);
        item = item.replace("TIME", System.currentTimeMillis() + "");

        System.out.println(item);
        String mp = connect.connect(item, "utf-8");

        // Strip the JSONP wrapper "callback(...)" so that only the JSON body remains
        mp = mp.substring(mp.indexOf("(") + 1, mp.lastIndexOf(")"));

        JSONObject json = JSONObject.fromObject(mp);
        String playUrl = json.getJSONObject("data").getString("play_url");


        System.out.print(playUrl + "  ==  ");
        FileDownload down = new FileDownload();
        down.download(playUrl, filePath + name + ".mp3");

        System.out.println(name + "Download complete");
        return playUrl;
    }

}
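One thing to watch in download: the song title is used directly as the file name, and a title containing characters such as / or ? cannot be saved under that name on Windows. A possible helper is sketched below. The class name FileNames and the fallback name "untitled" are assumptions for illustration, not part of the original code.

```java
public class FileNames {
    // Replaces characters that Windows does not allow in file names with an underscore.
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        // Fall back to a placeholder if nothing usable is left.
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("AC/DC - Who?"));
    }
}
```

It could be applied where the path is built, for example filePath + FileNames.sanitize(name) + ".mp3".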

HttpGetConnect.java

package com.bing.http;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;

/**
 * @Explain:
 * @author: gaoll
 * @CreateTime:2014-11-13
 * @ModifyTime:2014-11-13
 */
public class HttpGetConnect {

    /**
     *  Get html content
     * @param url
     * @param charsetName  UTF-8,GB2312
     * @return
     * @throws IOException
     */
    public static String connect(String url,String charsetName) throws IOException{
        BasicHttpClientConnectionManager connManager = new BasicHttpClientConnectionManager();

        CloseableHttpClient httpclient = HttpClients.custom()
                .setConnectionManager(connManager)
                .build();
        String content = "";

        try{
            HttpGet httpget = new HttpGet(url);

            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(5000)
                    .setConnectTimeout(50000)
                    .setConnectionRequestTimeout(50000)
                    .build();
            httpget.setConfig(requestConfig);
            httpget.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            httpget.setHeader("Accept-Encoding", "gzip,deflate,sdch");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("Upgrade-Insecure-Requests", "1");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
            //httpget.setHeader("Hosts", "www.oschina.net");
            httpget.setHeader("cache-control", "max-age=0");  

            CloseableHttpResponse response = httpclient.execute(httpget);

            int status = response.getStatusLine().getStatusCode();
            if (status >= 200 && status < 300) {

                HttpEntity entity = response.getEntity();
                InputStream instream = entity.getContent();
                BufferedReader br = new BufferedReader(new InputStreamReader(instream,charsetName));
                StringBuffer sbf = new StringBuffer();
                String line = null;
                while ((line = br.readLine()) != null){
                    sbf.append(line + "\n");
                }

                br.close();
                content = sbf.toString();
            } else {
                content = "";
            }

        }catch(Exception e){
            e.printStackTrace();
        }finally{
            httpclient.close();
        }
        //log.info("content is " + content);
        return content;
    }
    private static Log log = LogFactory.getLog(HttpGetConnect.class);
}

HtmlManage.java

package com.bing.html;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.bing.http.HttpGetConnect;

/**
 * @Explain:
 * @author: gaoll
 * @CreateTime:2014-11-13
 * @ModifyTime:2014-11-13
 */
public class HtmlManage {

    public Document manage(String html){
        Document doc = Jsoup.parse(html);
        return doc;
    }

    public Document manageDirect(String url) throws IOException{
        Document doc = Jsoup.connect( url ).get();
        return doc;
    }

    public List<String> manageHtmlTag(Document doc,String tag ){
        List<String> list = new ArrayList<String>();

        Elements elements = doc.getElementsByTag(tag);
        for(int i = 0; i < elements.size() ; i++){
            String str = elements.get(i).html();
            list.add(str);
        }
        return list;
    }

    public List<String> manageHtmlClass(Document doc,String clas ){
        List<String> list = new ArrayList<String>();

        Elements elements = doc.getElementsByClass(clas);
        for(int i = 0; i < elements.size() ; i++){
            String str = elements.get(i).html();
            list.add(str);
        }
        return list;
    }

    public List<String> manageHtmlKey(Document doc,String key,String value ){
        List<String> list = new ArrayList<String>();

        Elements elements = doc.getElementsByAttributeValue(key, value);
        for(int i = 0; i < elements.size() ; i++){
            String str = elements.get(i).html();
            list.add(str);
        }
        return list;
    }

    private static Log log = LogFactory.getLog(HtmlManage.class);
}

FileDownload.java

package com.bing.download;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

/**
 * @Explain:
 * @author: gaoll
 * @CreateTime:2014-11-20
 * @ModifyTime:2014-11-20
 */
public class FileDownload {

    /**
     * File Download
     * @param url Link Address
     * @param path Path and file name to save
     * @return
     */
    public static boolean download(String url,String path){

        boolean flag = false;

        CloseableHttpClient httpclient = HttpClients.createDefault();
        RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(2000)
                .setConnectTimeout(2000).build();

        HttpGet get = new HttpGet(url);
        get.setConfig(requestConfig);

        BufferedInputStream in = null;
        BufferedOutputStream out = null;
        try{
            for(int i=0;i<3;i++){
                CloseableHttpResponse result = httpclient.execute(get);
                System.out.println(result.getStatusLine());
                if(result.getStatusLine().getStatusCode() == 200){
                    in = new BufferedInputStream(result.getEntity().getContent());
                    File file = new File(path);
                    out = new BufferedOutputStream(new FileOutputStream(file));
                    byte[] buffer = new byte[1024];
                    int len = -1;
                    while((len = in.read(buffer,0,1024)) > -1){
                        out.write(buffer,0,len);
                    }
                    flag = true;
                    break;
                }else if(result.getStatusLine().getStatusCode() == 500){
                    continue ;
                }
            }

        }catch(Exception e){
            e.printStackTrace();
            flag = false;
        }finally{
            get.releaseConnection();
            try{
                if(in != null){
                    in.close();
                }
                if(out != null){
                    out.close();
                }
            }catch(Exception e){
                e.printStackTrace();
                flag = false;
            }
        }
        return flag;
    }

    private static Log log = LogFactory.getLog(FileDownload.class);
}

That's all. Some of the code may be incomplete, but the main parts are all here and it should run. Comments and corrections are welcome.


Posted by daveyc on Mon, 29 Apr 2019 16:40:36 -0700