Grabbing website data is no longer difficult: with Fizzler, it's So Easy.

Keywords: C# encoding Firefox


Starting with the title: why is grabbing website data no longer difficult (in fact, it is difficult)? SO EASY, with Fizzler! I believe most people or companies have had the experience of scraping other people's websites, or of being scraped. For example, every time our blog on cnblogs publishes an article, it gets grabbed by other sites; if you don't believe it, go and look. Others scrape useful information such as email addresses, phone numbers, and QQ numbers from websites, which can certainly be sold or put to other uses. The spam messages and e-mails we receive every day probably come from exactly this. Haha~

A while ago I wrote two programs: one grabbed data from a lottery website (the double-color ball), and the other grabbed data from job-hunting websites (Liepin, 51job, Zhaopin, and so on). Writing them was particularly painful; staring at a pile of HTML tags really made me want to give up. First, let's review how I used to parse HTML. The conventional approach is to fetch the HTML content with WebRequest and then cut out what you want, step by step, by matching HTML tags. The code below extracts the red balls and the blue ball of the double-color ball. As soon as the site's markup changes even slightly, the program may have to be rewritten, which makes this approach very inconvenient.

Below is my code for parsing the red balls and the blue ball of the double-color ball. Most of what it does is cut out the content of the matching tag (via regular expressions). The code is not very complicated, because the data to extract is limited and very regular, so it stays relatively simple.

        #region Parse the TDs within one TR to get one draw's numbers
        /// <summary>
        /// Parse the TDs within one TR to get one draw's numbers
        /// </summary>
        /// <param name="wn"></param>
        /// <param name="trContent"></param>
        private void ResolveTd(ref WinNo wn, string trContent)
        {
            List<int> redBoxList = null;
            // Pattern matching the draw number
            string patternQiHao = "<td align=\"center\" title=\"Date";
            Regex regex = new Regex(patternQiHao);
            Match qhMatch = regex.Match(trContent);
            wn.QiHao = trContent.Substring(qhMatch.Index + 17 + patternQiHao.Length, 7);
            // Pattern matching the blue ball
            string patternChartBall02 = "<td class=\"chartBall02\">";
            regex = new Regex(patternChartBall02);
            Match bMatch = regex.Match(trContent);
            wn.B = Convert.ToInt32(trContent.Substring(bMatch.Index + patternChartBall02.Length, 2));
            // List for the matched red-ball numbers
            redBoxList = new List<int>();
            // Pattern matching the red balls
            string patternChartBall01 = "<td class=\"chartBall01\">";
            regex = new Regex(patternChartBall01);
            MatchCollection rMatches = regex.Matches(trContent);
            foreach (Match r in rMatches)
            {
                redBoxList.Add(Convert.ToInt32(trContent.Substring(r.Index + patternChartBall01.Length, 2)));
            }
            // Pattern matching red balls with the alternate CSS class
            string patternChartBall07 = "<td class=\"chartBall07\">";
            regex = new Regex(patternChartBall07);
            rMatches = regex.Matches(trContent);
            foreach (Match r in rMatches)
            {
                redBoxList.Add(Convert.ToInt32(trContent.Substring(r.Index + patternChartBall07.Length, 2)));
            }
            // Sort the red-ball numbers
            redBoxList.Sort();
            // First to sixth red-ball numbers
            wn.R1 = redBoxList[0];
            wn.R2 = redBoxList[1];
            wn.R3 = redBoxList[2];
            wn.R4 = redBoxList[3];
            wn.R5 = redBoxList[4];
            wn.R6 = redBoxList[5];
        }
        #endregion
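To see concretely why this fixed-offset style is brittle, here is a minimal, self-contained sketch (the `GetBlueBall` helper and its sample markup are illustrative, not part of the original program) that extracts a number the same way `ResolveTd` does, and breaks after a one-character markup change:

```csharp
using System;
using System.Text.RegularExpressions;

class FragileParseDemo
{
    // Same technique as ResolveTd: find the literal tag text,
    // then take two characters at a fixed offset past it.
    public static int GetBlueBall(string trContent)
    {
        string pattern = "<td class=\"chartBall02\">";
        Match m = new Regex(pattern).Match(trContent);
        return Convert.ToInt32(trContent.Substring(m.Index + pattern.Length, 2));
    }

    static void Main()
    {
        // Works while the markup matches the hard-coded literal exactly.
        Console.WriteLine(GetBlueBall("<tr><td class=\"chartBall02\">07</td></tr>")); // 7

        // One extra space in the tag and the literal no longer matches:
        // Match.Index falls back to 0 and the fixed-offset Substring
        // lands on garbage, so the conversion throws.
        try
        {
            GetBlueBall("<tr><td  class=\"chartBall02\">07</td></tr>");
        }
        catch (FormatException)
        {
            Console.WriteLine("parse broke after a trivial markup change");
        }
    }
}
```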

The following code extracts data from a recruitment website, again by chopping up a series of HTML tags. It was quite a headache to write at the time; I wonder whether everyone who did it this way felt the same. When you have to parse more websites, it only gets worse (I wrote scrapers for 51job, Liepin, and Lagou). Haha.

// Regular expression filtering: Regular expression, text to be replaced
        private static readonly string[][] Filters =
        {
            new[] { @"(?is)<script.*?>.*?</script>", "" },
            new[] { @"(?is)<style.*?>.*?</style>", "" },
            new[] { @"(?is)<!--.*?-->", "" },    // filter Html Comments in code
            new[] { @"(?is)<footer.*?>.*?</footer>",""},
            //new[] { "(?is)<div class=\"job-require bottom-job-require\">.*?</div></div>",""}
            new[] { @"(?is)<h3>Common links:.*?</ul>",""}
        };

        private void GetJobInfoFromUrl(string url)
        {
            try
            {
                JobInfo info = new JobInfo();
                //--
                string pageStr = GetHtmlCode.GetByget(url, "utf-8");
                if (string.IsNullOrEmpty(pageStr))
                {
                    return;
                }
                //--
                pageStr = pageStr.Replace("\r\n", ""); // strip line breaks
                // Get the content of the body tag
                string body = string.Empty;
                string bodyFilter = @"(?is)<body.*?</body>";
                Match m = Regex.Match(pageStr, bodyFilter);
                if (m.Success)
                {
                    body = m.ToString().Replace("<tr >", "<tr>").Replace("\r\n", "");
                }
                // Filter out unrelated tags such as styles and scripts
                foreach (var filter in Filters)
                {
                    body = Regex.Replace(body, filter[0], filter[1]);
                }
                //--
                if (!string.IsNullOrEmpty(mustKey) && !body.Contains(mustKey))
                {
                    return;
                }
                body = Regex.Replace(body, "\\s", "");

                info.Url = url;

                string basicInfoRegexStr0 = "<h1title=([\\s\\S]+?)>(.*?)</h1>"; //Job title
                string position = Regex.Match(body, basicInfoRegexStr0).Value;
                info.Position = string.IsNullOrEmpty(position) ? "" : position.Substring(position.IndexOf(">") + 1, position.IndexOf("</") - position.IndexOf(">") - 1);//Job title

                string basicInfoRegexStr1 = "</h1><h3>(.*?)</h3>";//Corporate name
                string company = Regex.Match(body, basicInfoRegexStr1).Value;
                info.Company = string.IsNullOrEmpty(company) ? "" : company.Substring(company.IndexOf("<h3>") + 4, company.IndexOf("</h3>") - company.IndexOf("<h3>") - 4);//Corporate name

                string basicInfoRegexStr2 = "<divclass=\"resumeclearfix\"><span>(.*?)</span>";//Working place
                string address = Regex.Match(body, basicInfoRegexStr2).Value;
                info.Address = string.IsNullOrEmpty(address) ? "" : address.Substring(address.IndexOf("<span>") + 6, address.IndexOf("</") - address.IndexOf("<span>") - 6);//Working place

                string basicInfoRegexStr3 = "<li><span>Enterprise nature:</span>(.*?)</li>";//Company nature
                string nature = Regex.Match(body, basicInfoRegexStr3).Value;
                info.Nature = string.IsNullOrEmpty(nature) ? "" : nature.Substring(nature.IndexOf("</span>") + 7, nature.IndexOf("</li>") - nature.IndexOf("</span>") - 7);//Company nature

                if (string.IsNullOrEmpty(info.Nature))
                {
                    string basicInfoRegexStr3_1 = "<br><span>Nature:</span>(.*?)<br>";
                    string nature_1 = Regex.Match(body, basicInfoRegexStr3_1).Value;
                    info.Nature = string.IsNullOrEmpty(nature_1) ? "" : nature_1.Substring(nature_1.IndexOf("</span>") + 7, nature_1.LastIndexOf("<br>") - nature_1.IndexOf("</span>") - 7);//Company nature
                }

                string basicInfoRegexStr4 = "<li><span>Enterprise size:</span>(.*?)</li>";//company size
                string scale = Regex.Match(body, basicInfoRegexStr4).Value;
                info.Scale = string.IsNullOrEmpty(scale) ? "" : scale.Substring(scale.IndexOf("</span>") + 7, scale.IndexOf("</li>") - scale.IndexOf("</span>") - 7);//company size

                if (string.IsNullOrEmpty(info.Scale))
                {
                    string basicInfoRegexStr4_1 = "<br><span>Scale:</span>(.*?)<br>";
                    string scale_1 = Regex.Match(body, basicInfoRegexStr4_1).Value;
                    info.Scale = string.IsNullOrEmpty(scale_1) ? "" : scale_1.Substring(scale_1.IndexOf("</span>") + 7, scale_1.LastIndexOf("<br>") - scale_1.IndexOf("</span>") - 7);//company size
                }

                string basicInfoRegexStr5 = "<spanclass=\"noborder\">(.*?)</span>";//Work experience
                string experience = Regex.Match(body, basicInfoRegexStr5).Value;
                info.Experience = string.IsNullOrEmpty(experience) ? "" : experience.Substring(experience.IndexOf(">") + 1, experience.IndexOf("</") - experience.IndexOf(">") - 1);//Work experience

                string basicInfoRegexStr6 = "</span><span>(.*?)</span><spanclass=\"noborder\">";//Minimum academic qualifications
                string education = Regex.Match(body, basicInfoRegexStr6).Value;
                info.Education = string.IsNullOrEmpty(education) ? "" : education.Substring(education.IndexOf("<span>") + 6, education.IndexOf("</span><spanclass=") - education.IndexOf("<span>") - 6);//Minimum academic qualifications

                string basicInfoRegexStr7 = "<pclass=\"job-main-title\">(.*?)<";//Monthly salary
                string salary = Regex.Match(body, basicInfoRegexStr7).Value;
                info.Salary = string.IsNullOrEmpty(salary) ? "" : salary.Substring(salary.IndexOf(">") + 1, salary.LastIndexOf("<") - salary.IndexOf(">") - 1);//Monthly salary

                string timeInfoRegexStr = "<pclass=\"release-time\">Release time:<em>(.*?)</em></p>";//Release time
                string time = Regex.Match(body, timeInfoRegexStr).Value;
                info.Time = string.IsNullOrEmpty(time) ? "" : time.Substring(time.IndexOf("<em>") + 4, time.IndexOf("</em>") - time.IndexOf("<em>") - 4);//Release time

                if (GetJobEnd != null)
                {
                    GetJobEnd(pageStr, info);
                }
            }
            catch (Exception exMsg)
            {
                throw new Exception(exMsg.Message, exMsg);
            }
        }

As the code above shows, everything is done by cutting out the corresponding content with regular expressions, which is very tedious, and the slightest inattention makes the extraction go wrong. Eventually, on the recommendation of friends in a QQ group (186841119), I switched to Fizzler to extract website data, and it immediately felt much easier. Below is an introduction to the Fizzler tool (it appears to be open source); related information can be found on its website.

First, provide the download address of the tool: Fizzler

The download includes three assemblies: Fizzler.dll, Fizzler.Systems.HtmlAgilityPack.dll, and HtmlAgilityPack.dll, which can be referenced directly in VS2010.

Adding those references completes the Fizzler setup; next, import the namespaces:

using HtmlAgilityPack;
using Fizzler;
using Fizzler.Systems.HtmlAgilityPack;

With these using directives in place, the classes can be used in your .cs files.

The following is the implementation of the code.

        private static WebDownloader m_wd = new WebDownloader();
        /// <summary>
        /// Obtain HTML content
        /// </summary>
        /// <param name="Url">URL</param>
        /// <param name="Code">character encoding</param>
        /// <returns></returns>
        public static string GetHtml(string Url, Encoding Code)
        {
            return m_wd.GetPageByHttpWebRequest(Url, Code);
        }

        public string GetPageByHttpWebRequest(string url, Encoding encoding)
        {
            Stream sr = null;
            StreamReader sReader = null;
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                request.Method = "GET";
                request.Timeout = 30000;

                HttpWebResponse response = request.GetResponse() as HttpWebResponse;
                if (response.ContentEncoding.ToLower() == "gzip") // if the response is GZip-compressed, decompress it first
                {
                    sr = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
                }
                else
                {
                    sr = response.GetResponseStream();
                }
                sReader = new StreamReader(sr, encoding);
                return sReader.ReadToEnd();
            }
            catch
            {
                return null;
            }
            finally
            {
                if (sReader != null)
                    sReader.Close();
                if (sr != null)
                    sr.Close();
            }
        }

The above is the HTML-fetching code. There is nothing special about it; it is the usual way to download a page.
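The one non-obvious step in `GetPageByHttpWebRequest` is the GZip branch. As a sanity check of that logic, here is a stand-alone sketch (the `ReadPossiblyGzipped` helper is illustrative, not part of the original code, but it mirrors the decompress-then-read sequence above) that round-trips a page through an in-memory gzip stream:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipBranchDemo
{
    // Mirrors the branch above: wrap the raw stream in a GZipStream
    // when the content is gzip-compressed, then read it with the
    // requested encoding.
    public static string ReadPossiblyGzipped(Stream raw, bool isGzip, Encoding encoding)
    {
        Stream sr = isGzip ? new GZipStream(raw, CompressionMode.Decompress) : raw;
        using (var reader = new StreamReader(sr, encoding))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        string page = "<html><body>hello</body></html>";

        // Build a gzip-compressed copy of the page in memory
        // (BOM-less UTF-8 so the round trip is byte-exact).
        var buffer = new MemoryStream();
        using (var gz = new GZipStream(buffer, CompressionMode.Compress, true))
        using (var writer = new StreamWriter(gz, new UTF8Encoding(false)))
            writer.Write(page);
        buffer.Position = 0;

        Console.WriteLine(ReadPossiblyGzipped(buffer, true, Encoding.UTF8)); // prints the original page
    }
}
```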

        /// <summary>
        /// Get the corresponding label content
        /// </summary>
        /// <param name="Url">URL</param>
        /// <param name="CSSLoad">CSS selector path</param>
        /// <param name="Code">character encoding</param>
        /// <returns></returns>
        public static IEnumerable<HtmlNode> GetUrlInfo(string Url, string CSSLoad, Encoding Code)
        {
            HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument
            {
                OptionAddDebuggingAttributes = false,
                OptionAutoCloseOnEnd = true,
                OptionFixNestedTags = true,
                OptionReadEncoding = true
            };

            htmlDoc.LoadHtml(GetHtml(Url, Code));
            IEnumerable<HtmlNode> NodesMainContent = htmlDoc.DocumentNode.QuerySelectorAll(CSSLoad); // query by CSS selector path
            return NodesMainContent;
        }

        /// <summary>
        /// Get the corresponding label content
        /// </summary>
        /// <param name="html">html content</param>
        /// <param name="CSSLoad">CSS selector path</param>
        /// <returns></returns>
        public static IEnumerable<HtmlNode> GetHtmlInfo(string html, string CSSLoad)
        {
            HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument
            {
                OptionAddDebuggingAttributes = false,
                OptionAutoCloseOnEnd = true,
                OptionFixNestedTags = true,
                OptionReadEncoding = true
            };

            htmlDoc.LoadHtml(html);
            IEnumerable<HtmlNode> NodesMainContent = htmlDoc.DocumentNode.QuerySelectorAll(CSSLoad); // query by CSS selector path
            return NodesMainContent;
        }
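Fizzler's `QuerySelectorAll` runs a structured query over the parsed document instead of slicing strings. Since Fizzler and HtmlAgilityPack may not be at hand, the following dependency-free sketch illustrates the same idea with `System.Xml.Linq` on well-formed markup (the `SelectText` helper and sample HTML are illustrative only):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class SelectorIdeaDemo
{
    // Roughly what a selector like "div.title-info h1" means:
    // find descendants whose class matches, then take a child tag's text.
    public static string SelectText(string xhtml, string className, string childTag)
    {
        XElement root = XElement.Parse(xhtml);
        return root.Descendants()
                   .Where(e => (string)e.Attribute("class") == className)
                   .Descendants(childTag)
                   .Select(e => e.Value)
                   .FirstOrDefault();
    }

    static void Main()
    {
        string html = "<div><div class=\"title-info\"><h1>Senior C# Developer</h1><h3>Acme Ltd</h3></div></div>";
        Console.WriteLine(SelectText(html, "title-info", "h1")); // Senior C# Developer
        Console.WriteLine(SelectText(html, "title-info", "h3")); // Acme Ltd
    }
}
```

On real pages the markup is rarely well-formed XML, which is exactly why the article pairs Fizzler with HtmlAgilityPack's forgiving parser rather than an XML one.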

The two methods above both fetch the tag content at a given selector path: one takes a URL, the other takes HTML content directly. The next question is how to obtain the CSSLoad value, which requires the Firefox browser with the FireBug plug-in installed, as shown in the following figure (the site's toolbar):

Then click on the spider-like icon to see the following:

Now you can see all the HTML tags, and from here getting the CSS path is relatively simple.

Click the blue arrow, then select the relevant content on the page.

The corresponding HTML is then highlighted as well, which brings us one step closer to the CSS path; right-click and choose to copy the CSS path, as follows:

Just click "Copy CSS path". The copied CSS path looks like this:

html body#Posts form#frmMain table#BodyTable tbody tr td#Body div#Main div#Editor_Edit div#Editor_Edit_Contents div#edit_container div#Editor_Edit_APOptions div#Editor_Edit_APOptions_Contents.Edit div.edit_option div#Editor_Edit_APOptions_Advancedpanel1 div#Editor_Edit_APOptions_Advancedpanel1_Header.subCollapsibleTitle

We don't need to copy this whole path into our program; that would be far too unwieldy. We only need to pass the last segment or two of the path to the methods above, and we can then read the content of the corresponding HTML tag. A simple example follows.

        /// <summary>
        /// Parse a single job posting
        /// </summary>
        /// <param name="Url"></param>
        private void GetJobInfoFromUrl(object Url)
        {
            try
            {
                JobInfo info = new JobInfo();
                info.Url = Url.ToString();
                //-- Fetch the HTML content
                string html = AnalyzeHTML.GetHtml(Url.ToString(), Encoding.UTF8);
                if (string.IsNullOrEmpty(html)) { return; }
                //-- Job title
                IEnumerable<HtmlNode> NodesMainContent1 = AnalyzeHTML.GetHtmlInfo(html, "div.title-info h1");
                if (NodesMainContent1.Count() > 0)
                {
                    info.Position = NodesMainContent1.ToArray()[0].InnerText;
                }
                //-- Company name
                IEnumerable<HtmlNode> NodesMainContent2 = AnalyzeHTML.GetHtmlInfo(html, "div.title-info h3");
                if (NodesMainContent2.Count() > 0)
                {
                    info.Company = NodesMainContent2.ToArray()[0].InnerText;
                }
                //-- Company nature / company size
                IEnumerable<HtmlNode> NodesMainContent4 = AnalyzeHTML.GetHtmlInfo(html, "div.content.content-word ul li");
                if (NodesMainContent4.Count() > 0)
                {
                    foreach (var item in NodesMainContent4)
                    {
                        if (item.InnerHtml.Contains("Nature of enterprise"))
                        {
                            string nature = item.InnerText;
                            nature = nature.Replace("Enterprise nature:", "");
                            info.Nature = nature;
                        }
                        if (item.InnerHtml.Contains("Enterprise scale"))
                        {
                            string scale = item.InnerText;
                            scale = scale.Replace("Enterprise size:", "");
                            info.Scale = scale;
                        }
                    }
                }
                else // fallback parsing of company nature and size
                {
                    IEnumerable<HtmlNode> NodesMainContent4_1 = AnalyzeHTML.GetHtmlInfo(html, "div.right-post-top div.content.content-word");
                    if (NodesMainContent4_1.Count() > 0)
                    {
                        foreach (var item_1 in NodesMainContent4_1)
                        {
                            string[] arr = item_1.InnerText.Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                            if (arr != null && arr.Length > 0)
                            {
                                foreach (string str in arr)
                                {
                                    if (str.Trim().Contains("Nature"))
                                    {
                                        info.Nature = str.Replace("Nature:", "").Trim();
                                    }
                                    if (str.Trim().Contains("scale"))
                                    {
                                        info.Scale = str.Replace("Scale:", "").Trim();
                                    }
                                }
                            }
                        }
                    }
                }
                //-- Work experience
                IEnumerable<HtmlNode> NodesMainContent5 = AnalyzeHTML.GetHtmlInfo(html, "div.resume.clearfix span.noborder");
                if (NodesMainContent5.Count() > 0)
                {
                    info.Experience = NodesMainContent5.ToArray()[0].InnerText;
                }
                //-- Company address / minimum education
                IEnumerable<HtmlNode> NodesMainContent6 = AnalyzeHTML.GetHtmlInfo(html, "div.resume.clearfix");
                if (NodesMainContent6.Count() > 0)
                {
                    foreach (var item in NodesMainContent6)
                    {
                        string label = Regex.Replace(item.InnerHtml, "\\s", "");
                        label = label.Replace("<span>", "");
                        string[] arr = label.Split("</span>".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                        if (arr != null && arr.Length > 2)
                        {
                            info.Address = arr[0]; // company address
                            info.Education = arr[1]; // minimum education
                        }
                    }
                }
                //-- Monthly salary
                IEnumerable<HtmlNode> NodesMainContent7 = AnalyzeHTML.GetHtmlInfo(html, "div.job-title-left p.job-main-title");
                if (NodesMainContent7.Count() > 0)
                {
                    info.Salary = NodesMainContent7.ToArray()[0].InnerText;
                }
                //-- Release time
                IEnumerable<HtmlNode> NodesMainContent8 = AnalyzeHTML.GetHtmlInfo(html, "div.job-title-left p.release-time em");
                if (NodesMainContent8.Count() > 0)
                {
                    info.Time = NodesMainContent8.ToArray()[0].InnerText;
                }
                //--
                if (GetJobEnd != null)
                {
                    GetJobEnd("", info);
                }
            }
            catch (Exception exMsg)
            {
                throw new Exception(exMsg.Message, exMsg);
            }
        }
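A note on the selectors used above: as mentioned earlier, you don't pass the full path that Firebug copies, only its tail. A trimming helper like the one below (hypothetical, not part of the article's code) keeps the last n space-separated segments of a copied path, ready to hand to `GetUrlInfo`/`GetHtmlInfo`:

```csharp
using System;

class TrimCssPathDemo
{
    // Keep only the last n segments of a path copied from Firebug,
    // e.g. "html body#Posts ... div.foo h1" -> "div.foo h1" for n = 2.
    public static string TailSelector(string copiedPath, int segments)
    {
        string[] parts = copiedPath.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts, parts.Length - segments, segments);
    }

    static void Main()
    {
        string copied = "html body#Posts form#frmMain table#BodyTable tbody tr td#Body div.title-info h1";
        // Pass just the tail to the selector-based methods
        // instead of the whole copied path.
        Console.WriteLine(TailSelector(copied, 2)); // div.title-info h1
    }
}
```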

The method above also parses the tags of a recruitment website, but without the complicated regular expressions slicing up HTML, so the code looks much more concise and simple. With a single configuration page you can even cope with websites whose markup keeps changing, and suddenly grabbing other people's website data looks like a very simple thing. Ha ha!

The above only represents my personal views! If you are interested, join the QQ group (186841119) to discuss, learn, and exchange ideas.

Posted by jeffkee on Thu, 21 Mar 2019 18:03:52 -0700