Data Analysis in Practice: 'WeRateDogs' Twitter Data Analysis

Keywords: less, JSON, IPython, network

Through this project, learners experience the whole process of data analysis: gathering the data, cleaning it, analyzing it, and finally visualizing the results.

The data comes from the Twitter account 'WeRateDogs', which rates people's pet dogs in a humorous way. The ratings almost always have a denominator of 10, but the numerator is usually greater than 10: 11/10, 12/10, 13/10, so most dogs score higher than 10/10.

1. Gathering Data

We have three dataset files:

  • twitter_archive_enhanced: the main WeRateDogs Twitter archive, covering 2015 to 2017
  • image_archive_master: the results of an image-classification machine-learning algorithm run on the WeRateDogs tweets
  • tweet_data: additional Twitter data for each tweet, including its retweet count, favorite count, etc.
# Import the required toolkits
import numpy as np
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei font (needed for CJK plot labels)
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly with this font
twitter_df = pd.read_csv('../data/twitter-archive-enhanced.csv')
twitter_df.head()


twi_pred_df = pd.read_csv('../data/tweet-predictions.tsv', sep='\t')
twi_pred_df.head()

with open('../data/tweet_data.txt','r') as f:
    data = json.load(f)
scrapped_df = pd.DataFrame(data)
scrapped_df.head()

2. Assessing The Data

Our three data frames are twitter_df, twi_pred_df, and scrapped_df.

1.

twitter_df['in_reply_to_status_id'].count()
# Output 78
twitter_df.info()

plt.figure(figsize=(20,8))
sns.heatmap(twitter_df.isnull(), cbar=False)
plt.show()

twi_pred_df.info()

scrapped_df.info()

  • These are large data sets, and we can already see some problems: the retweets in twitter_df need to be deleted.
  • Two columns in twitter_df have fewer than 100 non-null rows; columns with that many missing values need to be deleted.
  • The row counts of the three data frames differ: 2342 in scrapped_df, 2075 in twi_pred_df, and 2356 in twitter_df, so an inner join is needed.
    The missing-value heatmap shows that source and expanded_urls should be deleted, because they contain data we do not need and all the URLs are unique. doggo, floofer, pupper, and puppo represent four stages of a dog's life and can be combined into one column.
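The missing-value picture can also be checked numerically rather than visually; a minimal sketch on the same frame:

# Fraction of missing values per column, highest first
twitter_df.isnull().mean().sort_values(ascending=False).head(10)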

2.

twitter_df[twitter_df['expanded_urls'].duplicated()].count()

twitter_df['expanded_urls'].value_counts()


The expanded_urls column shows that some tweets are duplicated: 137 rows share an expanded URL with another row. To verify this, I opened the dataset and found that some expanded_urls values appear twice.
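To inspect these duplicates directly, we can pull every row whose expanded_urls value appears more than once; a small sketch:

# keep=False marks every member of a duplicate group, not just the later occurrences
dupes = twitter_df[twitter_df['expanded_urls'].duplicated(keep=False)]
dupes[['tweet_id', 'expanded_urls']].sort_values('expanded_urls').head()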

twi_pred_df[twi_pred_df['jpg_url'].duplicated()].count()


The same is true for the predictions: there are 66 duplicate jpg_url rows. scrapped_df is not checked because it will be merged with twitter_df, so its duplicates do not need separate handling.

3.

len(twi_pred_df) - len(twi_pred_df[(twi_pred_df['p1_dog']== True) & (twi_pred_df['p2_dog'] == True) & (twi_pred_df['p3_dog'] == True)])
# Output 832
len(twi_pred_df) - len(twi_pred_df[(twi_pred_df['p1_dog']== True) | (twi_pred_df['p2_dog'] == True) | (twi_pred_df['p3_dog'] == True)])
# Output 324

Of the 2075 records, 832 do not have all three predictions agreeing that the image is a dog, and for 324 of them none of the three predictions is a dog. Those 324 rows deserve a closer look, but we can be fairly confident those 324 images are not dogs.
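The same counts can be computed directly with all/any over the three boolean columns; a sketch of equivalent logic:

dog_flags = twi_pred_df[['p1_dog', 'p2_dog', 'p3_dog']]
print((~dog_flags.all(axis=1)).sum())  # not all three predictions are dogs: 832
print((~dog_flags.any(axis=1)).sum())  # none of the three predictions is a dog: 324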

# Rows where none of the three predictions is a dog
temp_df = twi_pred_df[(twi_pred_df['p1_dog'] == False) & (twi_pred_df['p2_dog'] == False) & (twi_pred_df['p3_dog'] == False)]
temp_df.head()


Open the first image URL to check:

from IPython.display import IFrame
IFrame('https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg',width='100%', height=500)


It's obviously a tortoise.


4.

twi_pred_df['p1'].value_counts()

5.

sns.distplot(twitter_df['rating_numerator'], kde=False, rug=True) # rating numerator
plt.show()

twitter_df['rating_numerator'].describe()

x = twitter_df[twitter_df['rating_numerator']<= 20.0].rating_numerator
sns.distplot(x, kde=False)
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.title('Distribution of Ratings')
plt.show()

twitter_df['rating_numerator'].value_counts()


The numerator data shows some obvious outliers. To check, I used describe(): the mean is about 13 but the maximum is 1776. To get a fuller picture I ran value_counts() and found many abnormal values, including some below 10, which does not fit the account's unique rating system.

6.

twitter_df['rating_denominator'].describe() # rating denominator

sns.distplot(twitter_df['rating_denominator'], kde=False, rug=True)
plt.show()

twitter_df['rating_denominator'].value_counts()


The same problem applies to the denominator: it should always equal 10, but other values appear as well.
These are the problems found by the analysis above:

Quality issues
  • Five columns in twitter_df consist almost entirely of missing values
  • Most dogs have no doggo, floofer, pupper, or puppo stage recorded
  • Some dog names are mislabeled or missing
  • All three data frames contain duplicate rows
  • Rating problems: some denominators are not 10, some numerators are less than 10, and some tweets are not rated at all
  • All tweet_ids should be strings
  • Data types are inconsistent across the frames
Tidiness issues
  • The dog stage is spread across four separate columns
  • For ease of analysis, the timestamp can be split into separate date and time columns.

3. Data Cleaning

# Backup raw data
c_twitter_df = twitter_df.copy()
c_twitter_df.head()

c_pred = twi_pred_df.copy()
c_pred.head()

c_scrapped = scrapped_df.copy()
c_scrapped.head()

  • Delete retweets (rows where retweeted_status_id is not null)
c_twitter_df = c_twitter_df[c_twitter_df['retweeted_status_id'].isnull()]
  • Delete duplicate values of expanded_urls
# keep='first' retains the first occurrence
c_twitter_df = c_twitter_df.drop_duplicates(subset = 'expanded_urls', keep='first')
  • Delete the mostly-missing columns in c_twitter_df:
    • in_reply_to_status_id
    • in_reply_to_user_id
    • retweeted_status_id
    • retweeted_status_user_id
    • retweeted_status_timestamp
c_twitter_df = c_twitter_df.drop(columns=['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp'])
c_twitter_df.info()

c_twitter_df.head()


Delete the unneeded columns source and expanded_urls

c_twitter_df = c_twitter_df.drop(columns=['expanded_urls','source'])
c_twitter_df.info()


Modify data types
Use astype / pd.to_datetime to change each column to a more appropriate type.
c_twitter_df

  • tweet_id => string
  • timestamp => datetime

c_scrapped

  • created_at => datetime
  • tweet_id => string

c_pred

  • tweet_id => string
c_twitter_df['tweet_id'] = c_twitter_df['tweet_id'].astype('str')
c_twitter_df['timestamp'] = pd.to_datetime(c_twitter_df['timestamp'])
c_scrapped['created_at'] = pd.to_datetime(c_scrapped['created_at'])
c_scrapped['tweet_id'] = c_scrapped['tweet_id'].astype('str')
c_pred['tweet_id'] = c_pred['tweet_id'].astype('str')
c_twitter_df.info()

c_scrapped.info()

c_pred.info()

  • The doggo, floofer, pupper, puppo columns can be merged into one column
  • But some rows have more than one stage, which interferes with later analysis, so delete those rows
# Encode each stage column as 0/1
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    c_twitter_df[col] = c_twitter_df[col].replace({'None': 0, col: 1})
c_twitter_df['None'] = 0

Assign a value of 1 to the None column for rows without any dog stage

c_twitter_df.loc[(c_twitter_df['puppo']+c_twitter_df['floofer']+c_twitter_df['pupper']+c_twitter_df['doggo'] == 0),'None'] = 1
c_twitter_df['None'].value_counts()


Check whether any rows have multiple dog stages

c_twitter_df[(c_twitter_df['puppo']+c_twitter_df['floofer']+c_twitter_df['pupper']+c_twitter_df['doggo']+c_twitter_df['None']> 1)]


Delete these 12 rows because they interfere with later analysis

# ~ negates the boolean mask
c_twitter_df = c_twitter_df[~(c_twitter_df['puppo']+c_twitter_df['floofer']+c_twitter_df['pupper']+c_twitter_df['doggo']> 1)]
len(c_twitter_df)
# Output 2105
c_twitter_df['None'].value_counts()


Fusion of doggo, floofer, pupper, puppo into one column

# pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
# frame: the DataFrame to reshape
# id_vars: columns to keep as identifiers (not unpivoted)
# value_vars: columns to unpivot; if omitted, all remaining columns are used
# var_name / value_name: custom names for the resulting variable and value columns
# col_level: which level to use if the columns are a MultiIndex
values = ['doggo', 'floofer', 'pupper', 'puppo', 'None']
ids = [x for x in list(c_twitter_df.columns) if x not in values]
c_twitter_df = pd.melt(c_twitter_df, id_vars = ids, value_vars = values, var_name='stage')
c_twitter_df = c_twitter_df[c_twitter_df.value == 1]
c_twitter_df.drop('value', axis=1, inplace=True)
c_twitter_df.head()
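To make the reshaping concrete, here is a minimal toy example of the same melt pattern (illustrative data only, not the project's):

toy = pd.DataFrame({'tweet_id': ['1', '2'], 'doggo': [1, 0], 'pupper': [0, 1]})
toy_long = pd.melt(toy, id_vars=['tweet_id'], value_vars=['doggo', 'pupper'], var_name='stage')
toy_long[toy_long.value == 1]  # one row per tweet, with its stage name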

# reset_index restores the default integer index
# DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
# level controls which index level(s) to reset
# If drop is False, the old index becomes a normal column; otherwise it is discarded.
c_twitter_df.reset_index(drop=True, inplace=True)
# Convert stage to category type
c_twitter_df.stage = c_twitter_df.stage.astype('category')

The 'None' count should be the same as before

c_twitter_df['stage'].value_counts()

len(c_twitter_df)
# Output 2105

Clean Dog Predictions

c_pred['p1'].value_counts()

c_pred['p2'].value_counts()

c_pred['p3'].value_counts()

# Lower-case the breed names and replace spaces/hyphens with underscores.
# Note: Series.replace matches whole cell values, so str.replace is needed here.
for col in ['p1', 'p2', 'p3']:
    c_pred[col] = c_pred[col].str.lower().str.replace(' ', '_').str.replace('-', '_')
c_pred['p1'].value_counts()

c_pred['p2'].value_counts()

c_pred['p3'].value_counts()


The predicted object may not be a dog. Create a separate Prediction column: if all three predictions are dogs, label the row 'dog'; if none of the three is a dog, label it 'not dog'; otherwise 'mixed'.

pred = ['p1_dog', 'p2_dog', 'p3_dog']
for p in pred:
    c_pred[p] = c_pred[p].astype(int)
c_pred.loc[(c_pred['p1_dog']+c_pred['p2_dog']+c_pred['p3_dog'] == 0),'Prediction'] = 'not dog'
c_pred.loc[(c_pred['p1_dog']+c_pred['p2_dog']+c_pred['p3_dog'] == 1),'Prediction'] = 'mixed'
c_pred.loc[(c_pred['p1_dog']+c_pred['p2_dog']+c_pred['p3_dog'] == 2),'Prediction'] = 'mixed'
c_pred.loc[(c_pred['p1_dog']+c_pred['p2_dog']+c_pred['p3_dog'] == 3),'Prediction'] = 'dog'
c_pred['Prediction'].value_counts()
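For reference, the same three-way labeling can be written more compactly with numpy.select; a sketch of equivalent logic, not the code used above:

dog_total = c_pred[['p1_dog', 'p2_dog', 'p3_dog']].sum(axis=1)  # 0..3 dog votes per row
c_pred['Prediction'] = np.select([dog_total == 3, dog_total == 0], ['dog', 'not dog'], default='mixed')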

c_twitter_df['name'].value_counts()

  • The name column has obvious errors: words like 'a', 'an', and 'the' were captured as names, so remove these wrong names first.
  • Real names start with a capital letter, so find the names that do not.
pd.set_option('display.max_colwidth', -1)
vals = c_twitter_df[~c_twitter_df['name'].str[0].str.isupper()]['name'].value_counts()
vals.keys()

# Change the incorrect (lowercase) names to None; this already covers 'a', 'an', and 'the'
for val in vals.keys():
    c_twitter_df['name'] = c_twitter_df['name'].replace(val, 'None')
c_twitter_df['name'].value_counts()
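A quick check to confirm that no lowercase-initial names remain; a small sketch:

leftover = c_twitter_df[c_twitter_df['name'].str[0].str.islower()]
len(leftover)  # expect 0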

  • Some rating_numerator and rating_denominator values do not match the original rating in the text column
  • Delete or fix the wrong scores

rating_denominator

print("Count of tweets without '/10' in the text : {}".format(c_twitter_df[~c_twitter_df['text'].str.contains('/10')]['tweet_id'].count()))
print("Count of tweets without 10 rating_denominator : {}".format(c_twitter_df[c_twitter_df['rating_denominator'] != 10]['tweet_id'].count()))

c_twitter_df = c_twitter_df[c_twitter_df['text'].str.contains('/10')]
print("Count of tweets without 10 rating_denominator {}:".format(c_twitter_df[c_twitter_df['rating_denominator'] != 10].tweet_id.count()))


Check these five rows to see whether they can be repaired manually.

c_twitter_df[c_twitter_df['rating_denominator'] != 10].text


Looking at the text, each of these tweets does contain a /10 rating, so we can repair the denominators manually.

idx = c_twitter_df[c_twitter_df['rating_denominator'] != 10].index
for i in idx:
    c_twitter_df.at[i, 'rating_denominator'] = 10
c_twitter_df['rating_denominator'].value_counts()


rating_numerator

c_twitter_df['rating_numerator'].value_counts()

# strip removes leading and trailing whitespace by default
cannot_parse = set()
incorrect = set()

for i in c_twitter_df.index:
    index = int(c_twitter_df.loc[i].text.find('/10'))
    try:
        numerator = int(c_twitter_df.loc[i].text[index-2:index].strip())
    except:
        cannot_parse.add(i)
        continue
    if numerator != c_twitter_df.loc[i].rating_numerator:
        incorrect.add(i)

print('Indexes this code cannot parse: {}'.format(len(cannot_parse)))
print('Incorrect rating_numerator indexes:{}'.format(len(incorrect)))

for i in incorrect:
    print('{} - {}'.format(i, c_twitter_df.loc[i]['text']))


You can see that these ratings follow the account's unique rating style (decimals like 13.5, or ratings positioned unusually in the text), so we can fix the numerators manually.

# Convert the column to float first so the decimal rating 13.5 is stored correctly
c_twitter_df['rating_numerator'] = c_twitter_df['rating_numerator'].astype(float)
c_twitter_df.loc[995, 'rating_numerator'] = 9
c_twitter_df.loc[232, 'rating_numerator'] = 8
c_twitter_df.loc[363, 'rating_numerator'] = 13.5
c_twitter_df.loc[1198, 'rating_numerator'] = 12
c_twitter_df.loc[1328, 'rating_numerator'] = 8
c_twitter_df.loc[1457, 'rating_numerator'] = 9
c_twitter_df.loc[1275, 'rating_numerator'] = 9
c_twitter_df['rating_numerator'].value_counts()
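A more robust alternative to slicing the two characters before '/10' would be a regular expression that also captures decimal ratings such as 13.5; a sketch, not the approach used above:

import re
# Matches an integer or decimal numerator immediately before '/10'
rating_re = re.compile(r'(\d+(?:\.\d+)?)/10')
def parse_numerator(text):
    match = rating_re.search(text)
    return float(match.group(1)) if match else None
parse_numerator('13.5/10 would pet')  # 13.5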


Joining the tables
Since c_pred holds a different kind of information (image predictions), it does not need to be merged into c_twitter_df; only c_scrapped does.

print(len(c_twitter_df))
print(len(c_pred))
print(len(c_scrapped))

# Remove unwanted information
c_pred = c_pred.drop(columns=['p1_dog','p2_dog','p3_dog'])
c_scrapped = c_scrapped.drop(columns=['created_at','favorited','retweeted'])
c_twitter_df = c_twitter_df.drop(columns=['text'])
# Inner-join c_twitter_df with c_scrapped on tweet_id
main_df = c_twitter_df.merge(c_scrapped, on='tweet_id',how='inner')
main_df.head()
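A quick sanity check on the join result; a small sketch:

print(len(main_df))                   # rows that survived the inner join
print(main_df['tweet_id'].is_unique)  # expect True after the earlier de-duplication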

Storing

Save the cleaned data

main_df.to_csv('../data/twitter_archive_master.csv', index=False) 
c_pred.to_csv('../data/image_archive_master.csv', index=False)

Analysis

tweet_df = pd.read_csv('../data/twitter_archive_master.csv')
tweet_df.head()

image_df = pd.read_csv('../data/image_archive_master.csv')
image_df.head()


We analyze the following questions:

  1. How does the rating affect the retweet count?
  2. How do retweet and favorite counts change over time?
  3. How does the prediction model perform?
  4. What is the most popular dog name?
  5. What types of dogs are there?

1. How does the rating affect the retweet count?

import seaborn as sns
plt.style.use('ggplot')
# Drop rows whose numerator is 20 or more (outliers)
ratings = tweet_df[tweet_df['rating_numerator'] < 20]['rating_numerator']
count = tweet_df[tweet_df['rating_numerator'] < 20]['retweet_count']
sns.scatterplot(x=ratings, y=count)
plt.xlabel('Rating')
plt.ylabel('Retweet count')
plt.title('How does the rating affect the retweet count?')
plt.show()

  • We can see that tweets rated below 10 receive very few retweets, tweets rated above 10 receive many more, and tweets rated 13 receive the most.
  • However, this may be caused by other factors; it does not establish a causal relationship between the two.

2. How do retweet and favorite counts change over time?

time = pd.to_datetime(tweet_df[tweet_df.favorite_count < 10000].timestamp).dt.date

fig, ax = plt.subplots(figsize=(20,10))
plt.plot_date(time,tweet_df[tweet_df.favorite_count < 10000].retweet_count,color='b' , label='retweet_count')
plt.plot_date(time,tweet_df[tweet_df.favorite_count < 10000].favorite_count, color='r', label='favorite_count')
plt.xlabel('Time')
plt.ylabel('Count')
plt.title('How do retweet and favorite counts change over time?')
plt.legend()
ax.xaxis.set_tick_params(rotation=90, labelsize=10)
plt.show()

  • One obvious trend: at the beginning, favorite counts and retweet counts were similar, with retweets slightly higher. By 2016 and 2017 the number of tweets had fallen (the blue and red points become sparser), while the counts per tweet grew larger and larger. Another noteworthy trend is that favorite counts increased dramatically over the period, while retweet counts never exceeded about 5,000.
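The daily points are noisy; resampling to monthly means makes these trends easier to read. A sketch using the same columns:

ts = tweet_df.copy()
ts['timestamp'] = pd.to_datetime(ts['timestamp'])
monthly = ts.set_index('timestamp')[['retweet_count', 'favorite_count']].resample('M').mean()
monthly.plot(figsize=(12, 6))
plt.show()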

3. How does the model perform?

image_df.describe()

  • Some tweets have a p1_conf value of 1, which shows the neural network is very confident about them.
  • But the second and third predictions, p2_conf and p3_conf, are made with much lower confidence.

Let's look at an image whose p1_conf value is 1.

tweet = image_df[image_df.p1_conf == 1.0]
tweet

from IPython.display import IFrame
IFrame('https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg',width='100%', height=500)


We found that the model decided the picture was not a dog, but opening the URL manually shows a dog looking at a puzzle. The model probably missed the dog and focused on the puzzle instead.

4. What is the most popular dog name?

from collections import Counter

x = tweet_df['name']

count = Counter(x)
count.most_common(11)
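pandas' own value_counts gives the same ranking without the extra import; note that the 'None' placeholder dominates, which is why we look at the top 11:

tweet_df['name'].value_counts().head(11)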

5. What types of dogs are there?

types = image_df[image_df.Prediction == 'dog'].p1.value_counts()
types
plt.bar(types[0:10].index,types[0:10])
plt.xticks(rotation='vertical')
plt.xlabel('Dog Types')
plt.ylabel('Number')
plt.title('What types of dogs are there?')
plt.show()

As you can see, the most common breed is the golden retriever, followed by the Pembroke (corgi) and the Labrador retriever.
Through this project we see the importance of data visualization: for many people it is hard to find patterns in a pile of text data, but visualization makes both the patterns and the errors easy to spot.
The initial data exploration is just as important: it helps us find problems in the data and supports further analysis and understanding.
