Python data analysis reveals the little secrets of the big Vs

Keywords: Python JSON Database MongoDB

Preface

The text and pictures in this article come from the Internet and are shared for learning and exchange only; they are not for any commercial use. Copyright belongs to the original author. If you have any concerns, please contact us promptly.

Author: qingfengxiaozhu

PS: If you need Python learning materials, you can get them from the link below

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

Zhihu is the largest Q&A community in China. Unlike Weibo, Baidu Tieba, and similar products, Zhihu's content centers on users sharing knowledge, experience, and opinions on specific questions.

The connections of the big Vs

First, here are the relationship graphs of the top 50 Zhihu users by follower count:

The lines in the graphs are the follow relationships between users. The three pictures differ only in what the size of a circle represents: follower count, in-degree (the number of users in the graph who follow that user), and out-degree (the number of users in the graph that user follows).

Notably, "institutional accounts" such as Zhihu Daily, Liu Kanshan, and Dingxiang Doctor are followed by many ordinary users, but receive little attention from the big Vs themselves.

This graph was drawn with a tool called Gephi, from the data collected with Python. The colors of the circles are the result of Gephi's automatic clustering based on the follow relationships.
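To reproduce such a graph, Gephi can import two CSV files, one listing nodes and one listing edges. Below is a minimal sketch of exporting them from the MongoDB collection filled by the crawler at the end of this article; the field names (user_id, name, follower_count, from) are the ones that crawler stores, and the output file names are arbitrary:

# Export the follow graph into Gephi-importable CSV files.
import csv
import pymongo

col = pymongo.MongoClient("mongodb://localhost:27017/")['test']['user_pd']

# Node table: Id / Label are the column names Gephi expects.
with open('nodes.csv', 'w', newline='', encoding='utf8') as f:
    w = csv.writer(f)
    w.writerow(['Id', 'Label', 'followers'])
    for u in col.find({}):
        w.writerow([u['user_id'], u['name'], u['follower_count']])

# Edge table: each stored record means 'from' follows 'user_id'.
with open('edges.csv', 'w', newline='', encoding='utf8') as f:
    w = csv.writer(f)
    w.writerow(['Source', 'Target'])
    for u in col.find({}):
        w.writerow([u['from'], u['user_id']])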

Who are the big Vs?

Zhihu has four core indicators: followers, upvotes, thanks, and favorites. Let's look at the "head users" as ranked by each indicator:
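(The crawler at the end of this article stores only follower, answer, and article counts, so as a rough sketch, a ranking like this can be pulled straight from the collection; the other indicators would work the same way if they were stored.)

# Print the top 10 users for each stored indicator.
import pymongo

col = pymongo.MongoClient("mongodb://localhost:27017/")['test']['user_pd']
for field in ('follower_count', 'answer_count', 'articles_count'):
    print('--- top 10 by', field, '---')
    for u in col.find({}).sort(field, pymongo.DESCENDING).limit(10):
        print(u['name'], u[field])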

Among them, Zhang Jiawei stands out. (The name looks familiar... remember the earlier analysis of Hupu?)

Looking at the number of answers alone, Zhang Jiawei is still far ahead, and he makes the list for number of articles as well.

If we only look at quantity regardless of quality:

Combining these metrics gives two three-dimensional scatter plots.

The plots include only users with more than 10,000 followers. The project also provides interactive web versions, which give a more intuitive view of the distribution.
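Here is a minimal sketch of one such 3D scatter plot with matplotlib, assuming the same collection and using answers, articles, and followers as the three axes:

# 3D scatter of answers vs articles vs followers for users with 10,000+ followers.
import pymongo
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the '3d' projection

col = pymongo.MongoClient("mongodb://localhost:27017/")['test']['user_pd']
users = list(col.find({'follower_count': {'$gte': 10000}}))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter([u['answer_count'] for u in users],
           [u['articles_count'] for u in users],
           [u['follower_count'] for u in users], s=5)
ax.set_xlabel('answers')
ax.set_ylabel('articles')
ax.set_zlabel('followers')
plt.show()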

Is the big Vs' enthusiasm fading?

The figures above are statistics over the historical posts of the 100,000+ users collected. As the charts show, in 2015 the big Vs were more enthusiastic about answering questions; later most of them switched to writing column articles. Judging from the trend, the big Vs' posting frequency is no longer increasing. However, this alone says nothing about Zhihu's overall popularity: content output may simply have become more spread out across different users. Only authoritative data could answer that.
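The crawler below does not store posting timestamps, so this is only a hypothetical sketch: assuming answer records with a created_time Unix timestamp had been collected into a separate answers collection, the per-year counts behind such charts could be tallied with pandas:

# Tally answers per year from a hypothetical 'answers' collection
# whose records carry a 'created_time' Unix timestamp.
import pandas as pd
import pymongo

col = pymongo.MongoClient("mongodb://localhost:27017/")['test']['answers']
df = pd.DataFrame(list(col.find({}, {'created_time': 1})))
df['year'] = pd.to_datetime(df['created_time'], unit='s').dt.year
print(df.groupby('year').size())  # answers posted per year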

What do users like to see?

These are the columns and favorites with the most followers; do you follow any of them?

Finally, here is a word cloud built from the personal profiles of more than 10,000 of the collected users:
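A word cloud like this can be built from the headline field the crawler stores, using the jieba tokenizer and the wordcloud package; a minimal sketch (the font path is a placeholder, and rendering Chinese text requires pointing it at a CJK font):

# Build a word cloud from the stored one-line user profiles (headlines).
import pymongo
import jieba
from wordcloud import WordCloud

col = pymongo.MongoClient("mongodb://localhost:27017/")['test']['user_pd']
text = ' '.join(u.get('headline') or '' for u in col.find({}))

words = ' '.join(jieba.cut(text))  # segment the Chinese text into words
wc = WordCloud(font_path='your_chinese_font.ttf',  # placeholder font path
               width=800, height=600, background_color='white')
wc.generate(words).to_file('headline_cloud.png')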

Code

# coding:utf8
# Crawl Zhihu users with more than 10,000 followers
import requests
import pymongo
import time
import pickle

def get_ready(ch='user_pd', dbname='test'):
    '''Connect to MongoDB and expose the target collection.'''
    global mycol, myclient
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    mydb = myclient[dbname]
    mycol = mydb[ch]

get_ready()
ss = mycol.find({})

se2 = ['GOUKI9999', 'zhang-jia-wei']  # crawl queue, seeded with two users
se = set(se2)                         # dedupe set mirroring the queue
# with open(r'C:\Users\yc\Desktop\used.txt', 'rb') as f:  # resume a previous run
#     used = pickle.load(f)
used = set()                          # ids whose followee lists were already fetched
for s in ss:  # re-queue users already stored in the database
    if s['follower_count'] >= 10000 and s['user_id'] not in se:  # more than 10,000 followers
        se.add(s['user_id'])
        se2.append(s['user_id'])
print(len(se2))

proxies = {
    "http": "http://spiderbeg:pythonbe@106.52.85.210:8000",
    "https": "http://spiderbeg:pythonbe@106.52.85.210:8000",
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'cookie': 'your_cookie(User main page)',
}

for i, url_id in enumerate(se2):  # crawl; se2 keeps growing while we iterate
    print(i, ' ', end='')  # progress: index of the current user
    if url_id in used:  # skip users already crawled
        continue
    used.add(url_id)
    nums = 500  # page size (limit)
    off = 0     # paging offset

    while True:
        url2 = ('https://www.zhihu.com/api/v4/members/' + url_id +
                '/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics' +
                '&offset=' + str(off) + '&limit=' + str(nums))
        r2 = requests.get(url2, headers=headers, proxies=proxies)
        time.sleep(0.5)  # throttle requests
        page = r2.json()
        if 'error' in page:
            if page['error']['code'] in {310000, 310001}:  # account not accessible: stop paging
                break
            else:
                raise RuntimeError('Page error: %s' % page['error'])
        for d in page['data']:
            z = {
                'user_id': d['url_token'],
                'name': d['name'],
                'headline': d['headline'],
                'follower_count': d['follower_count'],
                'answer_count': d['answer_count'],
                'articles_count': d['articles_count'],
                'from': url_id,  # whose followee list this record came from
            }
            # followees with 10,000+ followers are stored and queued in turn
            if d['follower_count'] >= 10000 and d['url_token'] not in se and d['url_token'] not in used:
                se.add(d['url_token'])
                se2.append(d['url_token'])
                mycol.insert_one(z)  # insert data
        if page['paging']['is_end']:
            break
        off += nums  # advance to the next page
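The commented-out pickle lines near the top suggest the used set was saved between runs so an interrupted crawl can resume; the matching save step would be (same path as in the comment):

with open(r'C:\Users\yc\Desktop\used.txt', 'wb') as f:
    pickle.dump(used, f)  # persist crawled ids so a restart can skip them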
