Extracting Web Page Table Data with an Interface-Based Crawler

Keywords: Programming SQL Java Linux Oracle

I recently received a task to scrape some data: a few hundred records held in a table on a web page. Opening the browser's debugging tools showed that the interface returns an HTML page, so the response can simply be treated as a string (parsing the HTML file with an XPath-based crawler seemed like more trouble than it was worth). The approach is to use a regular expression to match every table row and then extract the cell contents. Along the way I ran into a few problems:

  1. At first I extracted the content directly, but it turned out to contain languages and characters from many countries, which was a bit of a pitfall.

  2. After capturing a table row, the contents of the two fields are separated by spaces, and the number of spaces varies. The split method is called with a limit to cap the size of the resulting array.

  3. An incorrect encoding setting produces garbled text.
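
Problem 3 in the list above typically appears when the response bytes are decoded with a charset other than the one the server used. The following is a minimal sketch of that effect, assuming the page is served as UTF-8; the class and method names are illustrative and do not come from the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows how decoding with the wrong charset garbles non-ASCII text. */
class CharsetDemo {

    /** Decodes raw response bytes using an explicitly chosen charset. */
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Bytes as a server would send them for a UTF-8 page ("\u00f1" is the letter n with a tilde).
        byte[] body = "Espa\u00f1ol".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the text round-trips unchanged.
        System.out.println(decode(body, StandardCharsets.UTF_8));

        // Mismatched charset: the two-byte character is decoded as two unrelated characters.
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly, rather than relying on a platform default, is one way to keep multilingual content intact.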

The code is shared below for reference:

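public static void main(String[] args) {
	// getHttpGet, getHttpResponse, output, testOver and the constants LINE, EMPTY and SPACE_1
	// are helpers defined elsewhere and are not shown here.
	String url = "https://docs.oracle.com/cd/E13214_01/wli/docs92/xref/xqisocodes.html";
	HttpGet httpGet = getHttpGet(url);
	JSONObject httpResponse = getHttpResponse(httpGet);
	// The interface returns an HTML page; treat the body as a plain string
	String content = httpResponse.getString("content");
	// Match each table row: from "<tr" through the following five lines to the closing "</div>"
	List<String> strings = regexAll(content, "<tr.+</a>" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>");
	for (String row : strings) {
		// Strip the HTML tags and the line breaks, leaving only the cell text
		String s = row.replaceAll("<.+>", EMPTY).replaceAll(LINE, EMPTY);
		// A limit of 2 splits only at the first space, so the array has at most two elements
		String[] split = s.split(" ", 2);
		String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
		// Remove the leftover spaces from both fields before building the statement
		output(String.format(sql, split[0].replace(SPACE_1, EMPTY), split[1].replace(SPACE_1, EMPTY)));
	}
	testOver();
}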
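
The `split(" ", 2)` call above leaves the extra spaces in the second element, which is why both fields are cleaned afterwards. As a hedged alternative sketch (the class and method names are hypothetical, not part of the original code), splitting on a whitespace run with the same limit handles an uncertain number of spaces in one step:

```java
import java.util.Arrays;

/** Illustrative only: splits a row of text into at most two fields. */
class SplitDemo {

    /** Treats any run of whitespace as a single separator and caps the result at two elements. */
    static String[] splitRow(String row) {
        // trim() drops leading and trailing whitespace so the first element is never empty;
        // the limit of 2 means only the first whitespace run is used as a split point.
        return row.trim().split("\\s+", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRow("German     de")));
        System.out.println(Arrays.toString(splitRow("  Greek \t el ")));
    }
}
```

Because of the limit, everything after the first whitespace run stays in the second element, so a name made of several words would need a different split point (for example, the last whitespace run).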

One of the helper methods used above is shown below:

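import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns all matches of a regular expression in the given text.
 *
 * @param text  the text to search
 * @param regex the regular expression
 * @return every matched substring, in order of appearance; an empty list if nothing matches
 */
public static List<String> regexAll(String text, String regex) {
	List<String> result = new ArrayList<>();
	Pattern pattern = Pattern.compile(regex);
	Matcher matcher = pattern.matcher(text);
	// find() advances to the next match on each call
	while (matcher.find()) {
		result.add(matcher.group());
	}
	return result;
}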
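
To show how `regexAll` behaves, here is a small self-contained sketch that runs it against a made-up HTML fragment. The fragment and the class name are assumptions for illustration only; the real page has a different, multi-line row layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: exercises regexAll on a tiny, invented HTML table. */
class RegexAllDemo {

    /** Same logic as the helper above: collects every match of regex in text. */
    static List<String> regexAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample input, one table row per line.
        String html = "<table>\n"
                + "<tr><td>German</td><td>de</td></tr>\n"
                + "<tr><td>Greek</td><td>el</td></tr>\n"
                + "</table>";

        // The reluctant quantifier ".*?" stops at the first "</tr>";
        // "." does not match line breaks by default, so each match stays on one line.
        for (String row : regexAll(html, "<tr>.*?</tr>")) {
            System.out.println(row);
        }
    }
}
```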

Part of the generated SQL is shown below:

INSERT country_code (country,code) VALUES ("German","de");
INSERT country_code (country,code) VALUES ("Greek","el");
INSERT country_code (country,code) VALUES ("Greenlandic","kl");
INSERT country_code (country,code) VALUES ("Guarani","gn");
INSERT country_code (country,code) VALUES ("Gujarati","gu");
INSERT country_code (country,code) VALUES ("Hausa","ha");
INSERT country_code (country,code) VALUES ("Hebrew","he");
INSERT country_code (country,code) VALUES ("Hindi","hi");
INSERT country_code (country,code) VALUES ("Hungarian","hu");
INSERT country_code (country,code) VALUES ("Icelandic","is");
INSERT country_code (country,code) VALUES ("Indonesian","id");
INSERT country_code (country,code) VALUES ("Interlingua","ia");
INSERT country_code (country,code) VALUES ("Interlingue","ie");
INSERT country_code (country,code) VALUES ("Inuktitut","iu");
INSERT country_code (country,code) VALUES ("Inupiak","ik");
INSERT country_code (country,code) VALUES ("Irish","ga");
INSERT country_code (country,code) VALUES ("Italian","it");
INSERT country_code (country,code) VALUES ("Japanese","ja");
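
The statements above embed the values directly in the SQL text, so a value containing a quote character would break the statement. The following is a hedged sketch of one way to guard against that, not the method used in the original code: it writes `INSERT INTO` with single-quoted literals and doubles any embedded single quote, which is the standard SQL escaping rule. The class and method names are hypothetical.

```java
/** Illustrative only: builds an INSERT statement with escaped string literals. */
class InsertSqlDemo {

    /** Wraps a value in single quotes, doubling any single quote inside it. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds one INSERT statement for the country_code table used in this post. */
    static String insertSql(String country, String code) {
        return String.format(
                "INSERT INTO country_code (country, code) VALUES (%s, %s);",
                quote(country), quote(code));
    }

    public static void main(String[] args) {
        System.out.println(insertSql("German", "de"));
        // A made-up value with an apostrophe, to show the escaping.
        System.out.println(insertSql("O'Example", "xx"));
    }
}
```

If the rows were inserted from Java directly instead of being written out as a script, a JDBC `PreparedStatement` with bound parameters would avoid manual escaping altogether.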

Posted by GameMusic on Wed, 11 Sep 2019 20:00:11 -0700