Building a microblog (Weibo) user portrait generator with Flask

Flask is another excellent web framework implemented in Python, alongside Django. Compared with the full-featured Django, Flask is known for its freedom and flexibility, which makes it a good fit for small applications. This article uses Flask to develop a microblog (Weibo) user portrait generator.

The development steps are as follows:

  • Capture microblog user data;
  • Analyze data and generate user portrait;
  • Implement the website and polish the interface.

1, Microblog capture

This example uses the mobile version of the microblog site (m.weibo.cn) and the Chrome browser for debugging.

Search for "gulinaza" under "Discovery" and click through to her home page;

Right-click to open the developer tools and select the "Network" tab to start analyzing the requests;

Tick "Preserve log" and refresh the page;

Going through the requests one by one, we find that the post data comes from addresses of the form https://m.weibo.cn/api/container/getIndex?XXX. The main parameters are type (a fixed value), value (the blogger's ID), containerid (an identifier returned by an earlier request) and page (the page number).
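
As a small illustration of how these parameters fit together, the sketch below assembles such a request URL with the standard library. The helper name build_index_url is hypothetical, and the sample values are the ones used later in this article.

```python
from urllib.parse import urlencode

BASE_URL = 'https://m.weibo.cn/api/container/getIndex'

def build_index_url(uid, containerid=None, page=None):
    """Build a getIndex URL (hypothetical helper, not part of any library)."""
    params = {'type': 'uid', 'value': uid}   # type is a fixed value, value is the blogger ID
    if containerid is not None:
        params['containerid'] = containerid  # identifier returned by the user info request
    if page is not None:
        params['page'] = page                # page number of the post list
    return BASE_URL + '?' + urlencode(params)

print(build_index_url('1350995007'))
print(build_index_url('1350995007', '1076031350995007', 2))
```

Leaving containerid and page out gives the address used to fetch the blogger's profile; adding them gives the address used to page through the posts.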

Now let's write the code that crawls the posts.

Import the required libraries:

import requests
from time import sleep

# Define the function to get blogger information
# The parameter uid is the id of the blogger

def get_user_info(uid):
    # Send request
    result = requests.get('https://m.weibo.cn/api/container/getIndex?type=uid&value={}'
                          .format(uid))
    json_data = result.json()  # Get the json content in the information
    userinfo = {
        'name': json_data['userInfo']['screen_name'],                    # Nickname
        'description': json_data['userInfo']['description'],             # Profile description
        'follow_count': json_data['userInfo']['follow_count'],           # Number of accounts followed
        'followers_count': json_data['userInfo']['followers_count'],     # Number of followers
        'profile_image_url': json_data['userInfo']['profile_image_url'], # Avatar URL
        'verified_reason': json_data['userInfo']['verified_reason'],     # Verification information
        'containerid': json_data['tabsInfo']['tabs'][1]['containerid']   # Needed later to fetch the posts
    }

    # Get gender. In the microblog, m represents male and f represents female
    if json_data['userInfo']['gender'] == 'm':
        gender = 'male'
    elif json_data['userInfo']['gender'] == 'f':
        gender = 'female'
    else:
        gender = 'unknown'
    userinfo['gender'] = gender
    return userinfo
# Get gulinaza information
userinfo = get_user_info('1350995007')
# The information is as follows
userinfo
{'containerid': '1076031350995007',
 'description': 'Please contact: nazhagongzuo@163.com',
 'follow_count': 529,
 'followers_count': 12042995,
 'name': "I'm Naza",
 'profile_image_url': 'https://tvax2.sinaimg.cn/crop.0.0.1242.1242.180/50868c3fly8fevjzsp2j4j20yi0yi419.jpg',
 'verified_reason': 'Actor, representative work "choosing the day"'}




# Loop to fetch all posts

def get_all_post(uid, containerid):
    # Start on the first page
    page = 0
    # This is used to store the blog list
    posts = []
    while True:
        # Request blog list
        result = requests.get('https://m.weibo.cn/api/container/getIndex?type=uid&value={}&containerid={}&page={}'
                              .format(uid, containerid, page))
        json_data = result.json()

        # When no more posts are returned, exit the loop
        if not json_data['cards']:
            break

        # Add the new posts to the list, skipping cards that carry no post
        for i in json_data['cards']:
            if 'mblog' in i:
                posts.append(i['mblog']['text'])

        # Pause for half a second to avoid triggering anti-crawling measures
        sleep(0.5)

        # Jump to next page
        page += 1

    # Return all posts
    return posts
posts = get_all_post('1350995007', '1076031350995007')
# Number of posts fetched
len(posts)
1279
# Display the first 3
posts[:3]
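
The paging loop can also be written so that the HTTP call is passed in as a function, which makes it possible to try the logic without touching the network. This is only a sketch: collect_posts, fetch_page, max_pages and the fake pages are all hypothetical, and the card layout is assumed to match what get_all_post reads.

```python
def collect_posts(fetch_page, max_pages=1000):
    """Collect post texts page by page until an empty page is returned.

    fetch_page(page) is assumed to return a dict with a 'cards' list,
    mirroring the structure that get_all_post reads.
    """
    collected = []
    for page in range(max_pages):          # upper bound guards against an endless loop
        cards = fetch_page(page).get('cards')
        if not cards:                      # empty or missing page: nothing left to fetch
            break
        for card in cards:
            mblog = card.get('mblog')      # some cards may not carry a post
            if mblog and 'text' in mblog:
                collected.append(mblog['text'])
    return collected

# Fake pages standing in for real responses
fake_pages = [
    {'cards': [{'mblog': {'text': 'first'}}, {'card_type': 11}]},
    {'cards': [{'mblog': {'text': 'second'}}]},
    {'cards': []},
]
sample_posts = collect_posts(lambda page: fake_pages[page])
print(sample_posts)
```

In real use, fetch_page would wrap the requests.get call and the half-second pause from get_all_post.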

At this point the user's data is ready, and we can start generating the user portrait.

2, Generate user portrait

1. Extract keywords

Here we extract keywords from the list of posts to find the words that feature most prominently in what the blogger publishes.

import jieba.analyse
from html2text import html2text

content = '\n'.join([html2text(i) for i in posts])

# Use jieba's TextRank to extract up to 1000 keywords together with their weights
result = jieba.analyse.textrank(content, topK=1000, withWeight=True)

# Generate keyword dictionary
keywords = dict()
for i in result:
    keywords[i[0]] = i[1]
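
With withWeight=True the result is a list of (word, weight) pairs, so the loop above is equivalent to calling dict() on it. The short sketch below shows this on made-up sample pairs (not real output) and also picks out the highest-weighted words; the variable names are illustrative only.

```python
# Made-up (word, weight) pairs in the shape produced with withWeight=True
sample_result = [('movie', 1.0), ('happy', 0.35), ('birthday', 0.62)]

# Equivalent to the explicit loop: build a word -> weight mapping
sample_keywords = dict(sample_result)

# Keep only the 2 highest-weighted words, e.g. to preview the top of the list
top_two = sorted(sample_keywords.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_two)
```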

2. Generate word cloud

from PIL import Image, ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

# Initialize picture
image = Image.open('./static/images/personas.png')
graph = np.array(image)

# Note that WordCloud's default font does not cover Chinese characters, so a Chinese font (SimHei here) has to be loaded
wc = WordCloud(font_path='./static/fonts/simhei.ttf',
    background_color='white', max_words=300, mask=graph)
wc.generate_from_frequencies(keywords)
image_color = ImageColorGenerator(graph)
# Display the word cloud, recolored to match the mask image
plt.imshow(wc.recolor(color_func=image_color))
plt.axis("off") # Turn off the image coordinate system
plt.show()

3, Implement the Flask application

Developing with Flask is less involved than with Django; a small application can be completed with only a few files. The steps are as follows:

1. Install

Use pip to install Flask. The command is as follows:

pip install flask

2. Implement application logic

Simply put, a Flask application is an instance of the Flask class, and its URL requests are dispatched to the functions registered with the route decorator. The code is as follows:

# app.py


from flask import Flask
import requests
from PIL import Image, ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import jieba.analyse
from html2text import html2text
from time import sleep
from collections import OrderedDict
from flask import render_template, request

# Create a Flask application
app = Flask(__name__)



# Microblog-related functions #

# Define the function to get blogger information
# The parameter uid is the id of the blogger

def get_user_info(uid):
    # Send request
    result = requests.get('https://m.weibo.cn/api/container/getIndex?type=uid&value={}'
                          .format(uid))
    json_data = result.json()  # Get the json content in the information
    # Get gender. In the microblog, m represents male and f represents female
    if json_data['userInfo']['gender'] == 'm':
        gender = 'male'
    elif json_data['userInfo']['gender'] == 'f':
        gender = 'female'
    else:
        gender = 'unknown'

    userinfo = OrderedDict()
    userinfo['Nickname'] = json_data['userInfo']['screen_name']          # Nickname
    userinfo['Gender'] = gender                                          # Gender
    userinfo['Following'] = json_data['userInfo']['follow_count']        # Number of accounts followed
    userinfo['Followers'] = json_data['userInfo']['followers_count']     # Number of followers
    userinfo['Verification'] = json_data['userInfo']['verified_reason']  # Verification information
    userinfo['Description'] = json_data['userInfo']['description']       # Profile description
    data = {
        'profile_image_url': json_data['userInfo']['profile_image_url'], # Avatar URL
        'containerid': json_data['tabsInfo']['tabs'][1]['containerid'],  # Needed later to fetch the posts
        'userinfo': '\n'.join(['{}:{}'.format(k, v) for (k,v) in userinfo.items()])
    }

    return data


# Loop to fetch all posts

def get_all_post(uid, containerid):
    # Start on the first page
    page = 0
    # This is used to store the blog list
    posts = []
    while True:
        # Request blog list
        result = requests.get('https://m.weibo.cn/api/container/getIndex?type=uid&value={}&containerid={}&page={}'
                              .format(uid, containerid, page))
        json_data = result.json()

        # When no more posts are returned, exit the loop
        if not json_data['cards']:
            break

        # Add the new posts to the list, skipping cards that carry no post
        for i in json_data['cards']:
            if 'mblog' in i:
                posts.append(i['mblog']['text'])

        # Pause for half a second to avoid triggering anti-crawling measures
        sleep(0.5)

        # Jump to next page
        page += 1

    # Return all posts
    return posts


##############################
## Word cloud functions

# Generate the word cloud
def generate_personas(uid, data_list):
    content = '\n'.join([html2text(i) for i in data_list])

    # Use jieba's TextRank to extract up to 1000 keywords together with their weights
    result = jieba.analyse.textrank(content, topK=1000, withWeight=True)

    # Generate keyword dictionary
    keywords = dict()
    for i in result:
        keywords[i[0]] = i[1]

    # Initialize picture
    image = Image.open('./static/images/personas.png')
    graph = np.array(image)

    # Note that WordCloud's default font does not cover Chinese characters, so a Chinese font (SimHei here) has to be loaded
    wc = WordCloud(font_path='./static/fonts/simhei.ttf',
        background_color='white', max_words=300, mask=graph)
    wc.generate_from_frequencies(keywords)
    image_color = ImageColorGenerator(graph)
    plt.imshow(wc.recolor(color_func=image_color))
    plt.axis("off") # Turn off the image coordinate system
    dest_img = './static/personas/{}.png'.format(uid)
    plt.savefig(dest_img)
    return dest_img


#######################################
# Define route
# Specifies the response function for the root path request
@app.route('/', methods=['GET', 'POST'])
def index():
    # Start with empty template data
    userinfo = {}
    # If this is a POST request that carries a microblog user id, fetch the user's data and generate the word cloud
    # request.method holds the request method
    # request.form holds the submitted form
    if request.method == 'POST' and request.form.get('uid'):
        uid = request.form.get('uid')
        userinfo = get_user_info(uid)
        posts = get_all_post(uid, userinfo['containerid'])
        dest_img = generate_personas(uid, posts)
        userinfo['personas'] = dest_img
    return render_template('index.html', **userinfo)


if __name__ == '__main__':
    app.run()

That is all the code. Simple, isn't it? Of course, a single-file structure only suits small applications; as features and code grow, the code should be split across separate files to keep development and maintenance manageable. One piece is still missing: the template file for the page.
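
Two details in generate_personas may be worth guarding against: the uid comes straight from the submitted form and ends up in a file name, and plt.savefig fails if the static/personas folder does not exist. The sketch below is one possible way to handle both. The names persona_path and base_dir are hypothetical, and it assumes that user IDs consist of digits only, like the example ID used in this article.

```python
import os
import re
import tempfile

def persona_path(uid, base_dir='./static/personas'):
    """Return the image path for uid, creating base_dir if needed.

    Raises ValueError if uid is not purely numeric (assumption: user IDs
    are digits only, like '1350995007').
    """
    if not re.fullmatch(r'[0-9]+', str(uid)):
        raise ValueError('invalid uid: {!r}'.format(uid))
    os.makedirs(base_dir, exist_ok=True)   # no error if the folder already exists
    return os.path.join(base_dir, '{}.png'.format(uid))

# Try it in a temporary folder so nothing is written into the project
demo_dir = os.path.join(tempfile.mkdtemp(), 'personas')
print(persona_path('1350995007', demo_dir))
```

The returned path could then be handed to plt.savefig in place of the hard-coded format string.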

3. Template development

The template needs an input form and an area that displays the user information. Flask templates are based on the Jinja2 template engine, so anyone familiar with Django templates should be able to get started quickly, and the process is similar to Django's. Create a folder named templates under the project root directory and, inside it, a new file named index.html. The code is as follows:

    Flask Microblog single user portrait generator

With that, the application is complete, and the project structure is as follows:

$ tree .
weibo_personas
├── app.py
├── static
│   ├── css
│   │   └── style.css
│   ├── fonts
│   │   └── simhei.ttf
│   └── images
│       └── personas.png
└── templates
    └── index.html

Enter the project folder and start the project:

 python app.py
 Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Then open http://127.0.0.1:5000 in the browser to see the result of this tutorial.

This is only a preliminary implementation, and many areas can still be improved. For example, when a user has published many posts, fetching them takes a long time, so you could add a cache that stores users already fetched and avoids repeated requests. The front end could also show a loading indicator. This tutorial only handles a single user; later, you could fetch user information in batches and generate a portrait of a whole group.
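
The caching idea could look roughly like the sketch below, which keeps results in memory for a limited time. Everything here is hypothetical: the TTLCache class, its method names and the one-hour lifetime are illustrative choices, and a real deployment might prefer a persistent store.

```python
import time

class TTLCache:
    """A tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so expiry can be tested without waiting
        self._store = {}        # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute(key) on a miss or after expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute(key)
        self._store[key] = (now + self.ttl, value)
        return value

# Demonstration with a stand-in for the slow crawl
calls = []

def slow_lookup(uid):
    calls.append(uid)
    return 'portrait for {}'.format(uid)

cache = TTLCache(ttl=3600)
cache.get_or_compute('1350995007', slow_lookup)
cache.get_or_compute('1350995007', slow_lookup)   # second call is served from the cache
print(len(calls))
```

In index(), the call to get_all_post could be wrapped this way with the uid as the key, so that a repeated submission for the same user skips the crawl.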

Posted by RDx321 on Mon, 22 Nov 2021 13:16:36 -0800