Grabbing the hot review of Changjin lake on Douban, I found these

Keywords: Python crawler Data Analysis

preface

Everybody, I'm classmate K!

Recently, a film "Changjin Lake" has become a hot circle of friends. It has been by various Amway. I don't want to go to the cinema to dominate a couple seat alone. I'll honestly analyze the film review and see everyone's "impressions", right~

First locate the target page

https://movie.douban.com/subject/25845392/comments

On the crawler, grab the following four fields

Then use pandas to import data and do simple processing

import pandas as pd
import os

file_path = os.path.join("douban.csv")

#Read columns A and B in the test.csv file. If the usecols parameter is not set, all data will be read by default.
df = pd.read_csv(open(file_path,'r',encoding='utf-8'), names=["user name","Star review","Comment time","comment"])
df.head()
user nameStar reviewComment timecomment
0Still van teseyit 's not bad2021-09-30 10:23:06A little disappointed, the plot can be said to be nothing, or the characterization as usual, as always, so sensational, the first battle is better than the second
1Oreo cookies 🍪Poor2021-09-30 15:13:40After three hours, I just want to say that I thought the climax was coming, and the result suddenly stopped. I nodded heavily and my feet were light. The part of Watergate bridge should be taken out alone
2High quality connoisseurPoor2021-09-26 21:17:48I went to see some movies. It's worth the ticket price. It's OK to watch it for three hours. The war drama is easy to paralyze my eyes, but it's also exciting. I don't like red hydantoin
3Wu: 30it 's not bad2021-09-27 18:16:24Just tell the truth: \ n1. The film is too long and very unfriendly to the audience. War scenes can be reduced, and soldiers can play with each other
4xi-xiait 's not bad2021-09-30 11:11:45The length of the battle scene and the weak logical connection of the plot are really numb at the end. It's better to limit the film length to two hours.
star_num = df.Star review.value_counts()
star_num = star_num.sort_index()
star_num
recommend strongly        one hundred and twelve
 recommend         35
 The user did not comment      2
 Poor         fourteen
 it 's not bad         thirty-seven
Name: Star review, dtype: int64

Proportion of Douban short comment score

from pyecharts.charts import Pie, Bar, Line, Page
from pyecharts import options as opts 
from pyecharts.globals import SymbolType

# Data pair
data_pair = [list(z) for z in zip([i for i in star_num.index], star_num.values.tolist())]

# Pie chart
pie1 = Pie(init_opts=opts.InitOpts(width='800px', height='400px'))
pie1.add('', data_pair, radius=['35%', '60%'])
pie1.set_global_opts(title_opts=opts.TitleOpts(title='Proportion of Douban short comment score'), 
                     legend_opts=opts.LegendOpts(orient='vertical', pos_top='15%', pos_left='2%')
                    ) 
pie1.set_series_opts(label_opts=opts.LabelOpts(formatter='{b}:{d}%'))
pie1.render_notebook()

Trend chart of comments

# Line chart
line1 = Line(init_opts=opts.InitOpts(width='800px', height='400px'))
line1.add_xaxis(comment_date.index.tolist())
line1.add_yaxis('', comment_date.values.tolist(),
                #areastyle_opts=opts.AreaStyleOpts(opacity=0.5),
                label_opts=opts.LabelOpts(is_show=False))
line1.set_global_opts(title_opts=opts.TitleOpts(title='Trend chart of comments'), 
#                       toolbox_opts=opts.ToolboxOpts(),
                      visualmap_opts=opts.VisualMapOpts(max_=140))
line1.set_series_opts(linestyle_opts=opts.LineStyleOpts(width=4))
line1.render_notebook()

It was released on September 30 and began to build momentum on September 29. It peaked on September 30, but it seems that the momentum of No. 1 has greatly decreased.

Word cloud picture

positive

import jieba

def get_cut_words(content_series):
    # Read in stoplist
    stop_words = [] 
    
    with open(r"hit_stopwords.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())

    # Add keyword
    my_words = ['Changjin Lake', 'a volunteer']  
    for i in my_words:
        jieba.add_word(i) 

#     Custom stop words
    my_stop_words = ['film',"Changjin Lake","Warfare"] 
    stop_words.extend(my_stop_words)               

    # participle
    word_num = jieba.lcut(content_series.str.cat(sep='. '), cut_all=False)

    # Conditional filtering
    word_num_selected = [i for i in word_num if i not in stop_words and len(i)>=2]
    
    return word_num_selected
text1 = get_cut_words(content_series=df[(df.Star review=='recommend strongly')|(df.Star review=='recommend')]['comment'])
text1[:5]
['sacrifice', 'Ice and snow', 'warrior', 'should', 'forget']
import stylecloud
from IPython.display import Image # Used to display local pictures in jupyter lab



# Draw word cloud
stylecloud.gen_stylecloud(text=' '.join(text1), 
                          max_words=1000,
                          collocations=False,
                          font_path=r'Classic variety Brief.ttf',
                          icon_name='fas fa-thumbs-up',
                          size=360,
                          output_name='Cloud chart of Douban positive scoring words.png')

Image(filename='Cloud chart of Douban positive scoring words.png') 

negative

text2 = get_cut_words(content_series=df[(df.Star review=='it 's not bad')|(df.Star review=='Poor')]['comment'])
text2[:5]
['somewhat', 'disappointment', 'plot', 'as always', 'character']
# Draw word cloud
stylecloud.gen_stylecloud(text=' '.join(text2), 
                          max_words=1000,
                          collocations=False,
                          font_path=r'Classic variety Brief.ttf',
                          icon_name='fas fa-thumbs-down',
                          size=350,
                          output_name='Cloud chart of Douban negative scoring words.png')
Image(filename='Cloud chart of Douban negative scoring words.png') 

🔥 Pay attention to the official account below (K classmate) reply: get your source code.

Posted by spitfire_esquive on Wed, 06 Oct 2021 21:21:59 -0700