Continuing our series on crawler-based data analysis, today we look at how to collect and analyze product reviews from NetEase Strict Selection (Yanxuan).
NetEase Comments Crawling
Analyzing Web Pages
Comment Analysis
Go to the official NetEase Strict Selection website, search for "bra", and open any product.
On the product page, open Chrome DevTools and switch to the Network tab. Then switch the product page to the reviews tab, pick a piece of review text, such as "Thin, Comfortable, Satisfied to Wear", and search for it in the Network tab.
You can see that the review text is returned by listByItemByTag.json. Click the request and copy its URL:
https://you.163.com/xhr/comment/listByItemByTag.json?csrf_token=060f4782bf9fda38128cfaeafb661f8c&__timestamp=1571106038283&itemId=1616018&tag=%E5%85%A8%E9%83%A8&size=20&page=1&orderBy=0&oldItemTag=%E5%85%A8%E9%83%A8&oldItemOrderBy=0&tagChanged=0
Paste the URL into Postman and try removing the query parameters one by one. You will find that only the itemId and page parameters are needed.
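The trimming step can also be done in code. The following is a minimal sketch using only the standard library; the helper name keep_params is an assumption for illustration, and the captured URL is a shortened copy of the one above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def keep_params(url, wanted):
    """Return url with only the query parameters named in wanted."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in wanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Shortened copy of the captured request URL (some parameters omitted)
captured = (
    "https://you.163.com/xhr/comment/listByItemByTag.json"
    "?csrf_token=060f4782bf9fda38128cfaeafb661f8c&__timestamp=1571106038283"
    "&itemId=1616018&size=20&page=1&orderBy=0"
)
trimmed = keep_params(captured, {"itemId", "page"})
print(trimmed)
```

Whether the server keeps accepting requests without csrf_token is something to verify by hand, as described above; the sketch only rebuilds the URL.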
The request returns JSON, which we analyze next.
It is easy to see that all the review data is stored in commentList, so that is all we need to save.
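As a rough illustration of pulling commentList out of the response, here is a small sketch. The sample body is hypothetical and heavily trimmed, and extract_comments is a name introduced only for this example.

```python
import json

# Hypothetical, heavily trimmed response body; only the keys used here are shown
sample_body = '{"data": {"commentList": [{"content": "Good quality", "star": 5}]}}'

def extract_comments(body):
    """Return the list under data -> commentList, or [] if it is missing."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    return data.get("commentList") or []

comments = extract_comments(sample_body)
print(len(comments), comments[0]["content"])
```

Returning an empty list when a key is missing keeps the caller's loop simple, since an empty page and a malformed page are then handled the same way.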
Next, we need to find out how to get the itemId, which is the product ID. Let's go back to the NetEase Strict Selection home page and continue the analysis.
Product ID Acquisition
When we type a keyword into the search box and search, many requests again appear in the Network tab. By looking at the request file names (this takes some experience, but disciplined programmers do not name things at random), we can locate the request that returns the search results.
Search requests are usually named search, so we settle on the search.json request. As before, copy the request URL into Postman and test the parameters one by one; in the end only the page and keyword parameters need to be kept.
This request returns more data, so analyzing it takes some patience. The id value of each entry under data->directly->searcherResult->result is the product ID we want.
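Walking a deeply nested response like this can be sketched with a small helper that tolerates missing keys. The helper name dig and the sample payload are assumptions for illustration; the key names follow the code in this article, and the two IDs are reused from the itemId values shown elsewhere in the article.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts, returning default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

# Hypothetical, trimmed search response
sample = {
    "data": {
        "directly": {
            "searcherResult": {
                "result": [{"id": 1616018}, {"id": 1680205}]
            }
        }
    }
}

items = dig(sample, "data", "directly", "searcherResult", "result", default=[])
product_ids = [item["id"] for item in items]
print(product_ids)
```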
With that, the preliminary analysis is basically complete, and we can start coding.
Write code
Get product ID
import requests

def search_keyword(keyword):
    """Return the product IDs on page 1 of the search results for keyword."""
    uri = 'https://you.163.com/xhr/search/search.json'
    query = {
        "keyword": keyword,
        "page": 1
    }
    res = requests.get(uri, params=query).json()
    result = res['data']['directly']['searcherResult']['result']
    # Collect the id of every product in the result list
    return [r['id'] for r in result]
This gets the product IDs on page 1 of the search results. Next, we use each product ID to fetch the reviews for that product.
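One way to tie the two steps together is a small driver loop. This is only a sketch: crawl_keyword is a hypothetical name, and the two stand-in functions exist so the example runs offline. In a real crawl they would be replaced by search_keyword and the details function from this article.

```python
def crawl_keyword(keyword, search_fn, details_fn):
    """Look up product IDs for keyword, then fetch the comments for each one."""
    comments_by_product = {}
    for product_id in search_fn(keyword):
        comments_by_product[product_id] = details_fn(product_id)
    return comments_by_product

# Stand-ins so the sketch runs without network access
def fake_search(keyword):
    return [1616018, 1680205]

def fake_details(product_id):
    # Same shape as the article's details(): a list of pages of comments
    return [[{"itemId": product_id, "star": 5}]]

result = crawl_keyword("bra", fake_search, fake_details)
print(sorted(result))
```

Passing the two functions in as arguments keeps the loop testable without hitting the site.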
From the earlier analysis, we know that each review has the following form, so we can easily store the records in MongoDB and analyze their content at leisure.
{
    "skuInfo": [
        "colour:Skin colour",
        "Cup code:75B"
    ],
    "frontUserName": "1****8",
    "frontUserAvatar": "https://yanxuan.nosdn.127.net/f8f20a77db47b8c66c531c14c8b38ee7.jpg",
    "content": "Good quality, comfortable to wear",
    "createTime": 1555546727635,
    "picList": [
        "https://yanxuan.nosdn.127.net/742f28186d805571e4b3f28faa412941.jpg"
    ],
    "commentReplyVO": null,
    "memberLevel": 4,
    "appendCommentVO": null,
    "star": 5,
    "itemId": 1680205
}
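Before analysis, a record like this usually needs a little flattening. The sketch below assumes that skuInfo entries have the form "label:value" as in the sample, and that createTime is milliseconds since the Unix epoch (which the 13-digit value suggests). The function name parse_comment is introduced only for this example.

```python
from datetime import datetime, timezone

def parse_comment(comment):
    """Flatten one comment into the fields used for analysis."""
    # skuInfo entries look like "label:value"; split on the first colon only
    sku = dict(
        entry.split(":", 1)
        for entry in comment.get("skuInfo", [])
        if ":" in entry
    )
    # Assumption: createTime is a Unix timestamp in milliseconds
    created = datetime.fromtimestamp(comment["createTime"] / 1000, tz=timezone.utc)
    return {
        "sku": sku,
        "created": created,
        "star": comment["star"],
        "content": comment["content"],
    }

# Subset of the sample record shown above
sample = {
    "skuInfo": ["colour:Skin colour", "Cup code:75B"],
    "content": "Good quality, comfortable to wear",
    "createTime": 1555546727635,
    "star": 5,
}
parsed = parse_comment(sample)
print(parsed["sku"], parsed["created"].date())
```

If the raw data uses a different separator in skuInfo, the split character would need to change accordingly.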
For MongoDB, we can either set up our own instance or use a free online service. Here I recommend a free hosted MongoDB service, mLab, which is easy to use, so I won't go into the setup process.
Now that you have a database, here's how to save the data.
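import time

import requests
from pymongo.errors import PyMongoError

def details(product_id):
    """Crawl up to 99 pages of comments for one product and save them to MongoDB."""
    url = 'https://you.163.com/xhr/comment/listByItemByTag.json'
    C_list = []
    for i in range(1, 100):
        query = {
            "itemId": product_id,
            "page": i,
        }
        res = requests.get(url, params=query).json()
        # An empty commentList means there are no more pages
        if not res['data']['commentList']:
            break
        print("Crawling page %s of comments" % i)
        commentList = res['data']['commentList']
        C_list.append(commentList)  # one entry per page
        time.sleep(1)  # pause between requests
        # Save to MongoDB (mongo_collection is created in the connection
        # code in this article); skip the page if the insert fails
        try:
            mongo_collection.insert_many(commentList)
        except PyMongoError:
            continue
    return C_list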
When the crawl finished, there were more than 7,000 records in total. You can analyze them however you like.
MongoDB Connection for the Crawled Data
from pymongo import MongoClient

conn = MongoClient("mongodb://%s:%s@ds149974.mlab.com:49974/you163" % ('you163', 'you163'))
db = conn.you163
mongo_collection = db.you163
Analyzing the Product Review Data
Now for the exciting moment: the ladies' preferences!
Preferred colors
First, let's look at which colors the ladies prefer.
You can see that black leads by a wide margin. That is worth taking note of!
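A count like this can be sketched with collections.Counter. The records below are made up and only mimic the shape of the crawled comments; count_sku_field is a hypothetical helper, and the label names follow the sample record in this article.

```python
from collections import Counter

def count_sku_field(comments, label):
    """Count the values of one skuInfo label (e.g. "colour") across comments."""
    counts = Counter()
    for comment in comments:
        for entry in comment.get("skuInfo", []):
            key, sep, value = entry.partition(":")
            if sep and key == label:
                counts[value] += 1
    return counts

# Made-up records in the shape of the crawled comments
comments = [
    {"skuInfo": ["colour:Black", "Cup code:75B"]},
    {"skuInfo": ["colour:Black", "Cup code:80B"]},
    {"skuInfo": ["colour:Skin colour", "Cup code:75B"]},
]
colour_counts = count_sku_field(comments, "colour")
print(colour_counts.most_common())
```

Calling the same helper with the "Cup code" label would give the size distribution.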
Next, a pie chart shows the share of each color.
So, is the color she likes among these?
Size Distribution
Sure enough, 75B is the most common size.
Don't worry if you have never studied cup sizes. I've prepared a size chart for you, so feel free to take it.
Product Ratings and Reviews
Finally, let's take a look at how the ladies rated the products.
In terms of star ratings, most reviews are five-star. After all, quality has to be guaranteed under the "Strict Selection" name.
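The share of each star rating can be computed along these lines. This is a sketch on made-up data, not the figures from the crawl, and star_shares is a name introduced for the example.

```python
from collections import Counter

def star_shares(comments):
    """Return {star: fraction of comments} for the given comments."""
    counts = Counter(c["star"] for c in comments)
    total = sum(counts.values())
    if not total:
        return {}
    return {star: n / total for star, n in sorted(counts.items())}

# Made-up ratings for illustration only
comments = [{"star": 5}, {"star": 5}, {"star": 5}, {"star": 4}]
shares = star_shares(comments)
print(shares)
```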
Now let's see which words the ladies use most often in their reviews.
Comfortable, comfortable, very comfortable; satisfied, very satisfied.
It is as if we had walked into a praise group. It seems that comfort is what matters most to them. After all, this is worn next to the skin, so quality comes first!
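Word counts of this kind can be sketched as follows. The tokenizer here is a simple regular expression that only suits English text, so it is an assumption for illustration; the original Chinese reviews would need a proper word segmenter instead. The first review is the sample from this article and the other two are made up.

```python
import re
from collections import Counter

def top_words(texts, n=5, stopwords=frozenset()):
    """Return the n most frequent words across texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

reviews = [
    "Good quality, comfortable to wear",  # sample from this article
    "Very comfortable",                   # made up
    "Comfortable and satisfied",          # made up
]
print(top_words(reviews, n=1))
```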
Okay, after reading the analysis above, do you feel more motivated to stop being single? And if you already have a sweet girlfriend by your side, shouldn't you treat her to something nice?