Python crawler for Jinjiang Literature City novel titles + simple data analysis + word cloud visualization

Keywords: Python crawler

1. Introduction

Target site: the Jinjiang Literature City bookbase
Crawler tools: requests, BeautifulSoup
Data analysis: pandas, matplotlib
Word cloud: jieba, wordcloud, re

PS. Given that JJWXC's owner is famously frugal and the site reportedly runs on only three servers, it is recommended to slow the crawl down with time.sleep() to reduce the pressure on the server.

2. Crawler

2.1 URL Analysis

Tick a few options on the first page of the bookbase and observe how the URL changes (note the parts marked with brackets below):

  • Orientation: het romance, sorted by publication time, showing only completed works; the resulting URL:
    https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx0=0&[xx1=1]&mainview0=0&sd0=0&lx0=0&fg0=0&bq=&removebq=&[sortType=3]&collectiontypes=ors&searchkeywords=&[page=0]&[isfinish=2]

  • Orientation: pure love, sorted by collections, progress unrestricted, jumped to page 4; the resulting URL:
    https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx0=0&[xx2=2]&mainview0=0&sd0=0&lx0=0&fg0=0&bq=&removebq=&[sortType=4]&[page=4]&[isfinish=0]&collectiontypes=ors&searchkeywords=

The parameters can be summarized as follows:

  • Page: page, starting from 1 (page 0 also returns the first page); the upper limit is 1000.
  • Orientation: het romance is xx1=1, pure love is xx2=2, GL (baihe) is xx3=3; to select several orientations at once, include the parameters together (e.g. xx1=1&xx2=2).
  • Sort order: sortType; update time = 1, work points = 2, publication time = 3, collections = 4.
  • Progress: isfinish; unrestricted = 0, ongoing = 1, completed = 2.
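
As a quick sanity check, this kind of query can also be assembled from a plain dict with requests (a minimal sketch; the parameter names follow the summary above, and whether the remaining defaults such as fw0 and fbsj0 can be omitted is untested, so the complete code in 2.4 keeps them all):

import requests

# Pure love (xx2=2), sorted by collections (sortType=4), page 4, progress unrestricted
params = {"xx2": 2, "sortType": 4, "page": 4, "isfinish": 0}
response = requests.get("https://www.jjwxc.net/bookbase.php", params=params)
print(response.url)  # the URL-encoded query that was actually sent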

2.2 Page Element Parsing

The fields to crawl are the columns of the results table (author, title, type, style, word count, points, publication time, and so on).

Press F12 to open the developer tools and inspect the page elements: all of the information sits in a single table (class="cytable"), with one tr per row and one td per cell.

2.3 Login

When you try to jump past page 10, a page appears that requires you to log in.

After logging in to a Jinjiang account, press F12 to open the developer tools, switch to the Network tab, refresh the page, find the corresponding request, copy the cookie from its request headers, and add it to the crawler's request headers.

2.4 Complete Code

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def main(save_path, sexual_orientation):
    """
    save_path: file save path
    sexual_orientation: 1: het romance, 2: pure love, 3: GL, 4: nüzun, 5: no CP
    """
    for page in range(1, 1001):
        url = get_url(page, sexual_orientation)
        headers = {
            'cookie': '<your cookie>',  # paste the cookie copied in section 2.3
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = html.apparent_encoding
        try:
            data = parse(html.content)
        except Exception:
            print("Crawl failed:", page)
            continue
        if len(data) == 0:  # an empty table means we have run past the last page
            break
        df = pd.DataFrame(data)
        df.to_csv(save_path, mode='a', header=False, index=False)
        print(page)
        time.sleep(3)  # be kind to the server (see the PS in the introduction)


def get_url(page, sexual_orientation):
    url = f"https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx1=1&xx{sexual_orientation}={sexual_orientation}&mainview0=0&sd0=0&lx0=0&fg0=0&bq=-1&" \
          f"sortType=3&isfinish=2&collectiontypes=ors&page={page}"
    return url


def parse(document):
    soup = BeautifulSoup(document, "html.parser")
    table = soup.find("table", attrs={'class': 'cytable'})
    rows = table.find_all("tr")
    data_all = []
    for row in rows[1:]:  # skip the header row
        items = row.find_all("td")
        data = [item.get_text(strip=True) for item in items]
        data_all.append(data)
    return data_all


if __name__ == "__main__":
    main("Emotions.txt", 1)

3. Data Analysis and Visualization

Use pandas to read the crawled data.
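
Since the crawler wrote the CSV with header=False, the file has no header row and column names must be supplied when reading. A minimal sketch (the column order below is an assumption based on the fields used in the rest of this section; adjust it to match the actual cytable layout):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed column order; adjust to match the actual cytable layout
columns = ["author", "name", "type", "manner", "progress",
           "word", "points", "publish_time"]
df = pd.read_csv("Emotions.txt", names=columns)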

Simple preprocessing

# Remove duplicates (same author + title)
df = df.drop_duplicates(subset=["author", "name"])
print("Number of novels:", df.shape[0])

# Convert publish time to datetime
df["publish_time"] = pd.to_datetime(df["publish_time"])

# Convert word count to units of 10,000 characters
df["word"] /= 10000

# Convert points to units of 10,000
df["points"] /= 10000

3.1 Bar Chart

View the minimum and maximum word counts:

df["word"].min(), df["word"].max()

Result: (0.0001, 616.9603) (unit: 10,000 characters), so when grouping, the minimum bin edge is set to 0 and the maximum to 700.

# Bin edges for grouping
bins_words = [0, 0.5, 1, 10, 20, 40, 60, 80, 100, 700]

# Word-count distribution of novels published before 2018
words_distribution1 = pd.value_counts(pd.cut(df.query("publish_time<'2018-01-01'")["word"], bins=bins_words), sort=False)
words_distribution1 /= np.sum(words_distribution1)  # normalize to proportions

# Word-count distribution of novels published in or after 2018
words_distribution2 = pd.value_counts(pd.cut(df.query("publish_time>='2018-01-01'")["word"], bins=bins_words), sort=False)
words_distribution2 /= np.sum(words_distribution2)  # normalize to proportions

# Plot the two distributions side by side
plt.figure(dpi=100)
plt.title("Distribution of Novel Word Counts", fontsize=15)
loc = np.arange(len(words_distribution1))
plt.bar(loc - 0.15, words_distribution1.values, width=0.3, label="Before 2018")
plt.bar(loc + 0.15, words_distribution2.values, width=0.3, label="2018 and later")
plt.xticks(loc, words_distribution1.index, rotation=45)
plt.xlabel("Word count / 10,000 characters")
plt.ylabel("Proportion")
plt.legend()

3.2 Pie Chart

  • Type statistics.
    Most works have a type string like "Original - Pure Love - Alternate History - Xianxia", which can be split on "-" to take the third field (index 2).
    However, a small number of works have a type such as "Informal essay"/"comment"/"Unknown" instead; indexing field 2 directly would raise an error, so an if expression handles them.
    # Type Statistics
    tags = df["type"].apply(lambda x: x.split("-")[2] if len(x.split("-"))==4 else x)
    tag_count = pd.value_counts(tags)
    
    # Merge categories that have too few works into "Other"
    tag_count["Other"] = tag_count["Unknown"] + tag_count["Informal essay"] + tag_count["comment"] + tag_count["Poetry"] + tag_count[""]
    tag_count = tag_count.drop(["Unknown","Informal essay","comment","Poetry", ""])
    
  • Style Statistics
    # Style Statistics
    manner_count = pd.value_counts(df["manner"])
    
    # Merge categories that have too few works into "Other"
    manner_count["Other"] = manner_count["Dark"] + manner_count["Laughing"] + manner_count["Unknown"]
    manner_count = manner_count.drop(["Dark", "Laughing", "Unknown"])
    
  • Drawing
    fig, axes = plt.subplots(1, 2, figsize=(10,5), dpi=100)
    fig.subplots_adjust(wspace=0.05)
    axes[0].pie(tag_count,
            labels=tag_count.index,
            autopct='%1.2f%%',
            pctdistance=0.7,
            colors=[plt.cm.Set3(i) for i in range(len(tag_count))],
            textprops={'fontsize':10},
            wedgeprops={'linewidth': 1, 'edgecolor': "black"}
           )
    axes[0].set_title("type", fontsize=15)
    
    axes[1].pie(manner_count,
            labels=manner_count.index,
            autopct='%1.2f%%',
            pctdistance=0.7,
            colors=[plt.cm.Accent(i) for i in range(len(manner_count))],
            textprops={'fontsize':10},
            wedgeprops={'linewidth': 1, 'edgecolor': "black"}
           )
    axes[1].set_title("style", fontsize=15)
    

4. Novel Title Word Cloud

from wordcloud import WordCloud
import jieba
import pandas as pd
import matplotlib.pyplot as plt
import re

4.1 Word Segmentation

  • Main points:
    • Segment each title with the jieba library (vectorizing with DataFrame.apply speeds up processing)
    • The original titles contain many symbols (e.g. "（上）", "[ABO]", and the * characters left behind by censorship) as well as English characters, which can be removed with a regular expression; a small before/after demonstration follows the code block below.
  • Code:
    # Add custom words so jieba keeps them as single tokens
    jieba.add_word("快穿")  # "quick transmigration", a common genre word
    # Segment each title, joining the tokens with commas
    words_arr = df["name"].apply(lambda x: ",".join(jieba.cut(x))).values
    # Concatenate the per-title results into one string
    text = ",".join(words_arr)
    # Keep only Chinese characters and commas; brackets, asterisks,
    # English characters, etc. are removed
    reg = "[^\u4e00-\u9fa5,]"
    text = re.sub(reg, '', text)
    
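A small demonstration of the two steps on a single made-up title (the title is invented and jieba's exact segmentation may vary):

    sample = "穿成反派的白月光[ABO]（上）"    # an invented title with typical noise
    segmented = ",".join(jieba.cut(sample))    # brackets and "ABO" survive as tokens
    cleaned = re.sub("[^\u4e00-\u9fa5,]", "", segmented)
    print(segmented)  # exact tokens depend on jieba's dictionary
    print(cleaned)    # only Chinese characters and commas remain; the leftover
                      # empty strings and single characters are filtered in 4.2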

4.2 Word Frequency Statistics

  • Main points

    • Count word frequencies with a DataFrame
    • Empty strings (by-products of the regex removal) and a large number of single characters appear among the high-frequency items; they are meaningless for this analysis and need to be removed.
  • Code

    # Word frequency count
    words_list = text.split(",")
    df_freq = pd.Series(words_list).value_counts().to_frame("frequency")
    df_freq.index.name = "vocabulary"
    # Remove stop words: single characters and the empty strings left by the regex
    stop_words = df_freq[df_freq.index.str.len() <= 1].index.tolist()
    df_freq = df_freq[df_freq.index.str.len() > 1]
    # Visualize the top-20 words (a CJK-capable matplotlib font is needed
    # for the Chinese tick labels to display correctly)
    plt.figure(figsize=(15, 5))
    x = list(range(20))
    y = df_freq.iloc[:20].values.flatten()
    labels = df_freq.index[:20]
    plt.bar(x, y, color='steelblue', width=0.5)
    plt.xticks(ticks=x, labels=labels, fontsize=15, rotation=45)
    plt.title("Top 20 Words in Jinjiang Literature City Novel Titles", fontsize=15)
    

4.3 Word Cloud Visualization

  • Main points
    • Mask: an 8-bit grayscale logo of Jinjiang Literature City is used as the mask. It is essentially a two-dimensional matrix whose entries are either 0 or 255; WordCloud leaves the white (255) regions blank and draws words everywhere else. Images in other formats, such as RGB, can be converted to this form (see the sketch after the code).
    • Stop words: reuse the words removed during the word-frequency statistics (stop_words from 4.2).
  • Code
    # Read the mask image
    mask = plt.imread(r"D:\2021 Grind first\Geographic information visualization\data\logo.bmp")
    wordcloud = WordCloud(font_path=r"C:\Windows\Fonts\simhei.ttf",
                          stopwords=stop_words, 
                          width=800, height=600,
                          mask=mask,
                          max_font_size=150,
                          mode='RGBA', background_color=None).generate(text)
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111)
    ax.axis("off")
    ax.imshow(wordcloud, interpolation='bilinear')
    plt.tight_layout(pad=4.5)
    fig.savefig("wordcloud.png")
    
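If the logo is only available in RGB or RGBA form, it can be thresholded into such a 0/255 mask first. A minimal sketch using Pillow and numpy (the file names and the threshold of 128 are assumptions):

    import numpy as np
    from PIL import Image

    img = Image.open("logo_rgb.png").convert("L")        # to 8-bit grayscale
    arr = np.array(img)
    mask = np.where(arr > 128, 255, 0).astype(np.uint8)  # light -> 255 (left blank), dark -> 0 (filled)
    Image.fromarray(mask).save("logo.bmp")               # ready to use as the mask above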
