Five data analysis case studies: hands-on data processing (basics and details)


The data files used in this chapter can be downloaded from the GitHub repository accompanying Python for Data Analysis (2nd Edition).

Example 1. Analysis of time zone data

In 2011, Bitly, a URL-shortening service, partnered with the U.S. government website USA.gov to provide anonymous data collected from users who shorten links ending in .gov or .mil.

Taking the hourly snapshot as an example, each line of the file is in JSON (JavaScript Object Notation, a common web data format). If we read only the first line of the file, the result looks like this:

path = '../datasets/bitly_usagov/example.txt'
open(path).readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python has both built-in and third-party modules for converting JSON strings into Python dictionary objects. Here I use the built-in json module and its loads function to load the data file line by line:

import json
path = '../datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]

Now the records object is a list of Python dictionaries:

records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'c': 'US',
 'nk': 1,
 'tz': 'America/New_York',
 'gr': 'MA',
 'g': 'A6qOVH',
 'h': 'wfLQtf',
 'l': 'orofrog',
 'al': 'en-US,en;q=0.8',
 'hh': '1.usa.gov',
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
 't': 1331923247,
 'hc': 1331822918,
 'cy': 'Danvers',
 'll': [42.576698, -70.954903]}

1.1 Counting time zones in pure Python

Suppose we want to know which time zone (the tz field) is the most common in the dataset. There are many ways to get the answer. First, we use a list comprehension to extract a list of time zones:

time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-5-f3fbbc37f129> in <module>
----> 1 time_zones = [rec['tz'] for rec in records]


<ipython-input-5-f3fbbc37f129> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]


KeyError: 'tz'

It turns out that not all of the records have a time zone field. We can handle this by adding an if check that tests whether 'tz' is present in each record:

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]

['America/New_York',
 'America/Denver',
 'America/New_York',
 'America/Sao_Paulo',
 'America/New_York',
 'America/New_York',
 'Europe/Warsaw',
 '',
 '',
 '']

Looking at just the first 10 time zones, we see that some are unknown (empty strings). They could be filtered out, but we keep them for now. Next, to count the time zones, I will show two approaches: a harder one using only the Python standard library, and a simpler one using pandas. One way to count is to accumulate counts in a dictionary while iterating over the time zones:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Using more advanced tools from the Python standard library, the same code can be written more succinctly:

from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

I put the logic into functions to make it more reusable. To apply it to the time zones, simply pass in time_zones:

counts = get_counts(time_zones)
counts['America/New_York']

1251
len(time_zones)

3440
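
The defaultdict-based get_counts2 defined above is a drop-in replacement for get_counts; a quick sketch to confirm it yields the same counts:

# get_counts2 should produce the same result as get_counts (defaultdict only removes the membership test)
counts2 = get_counts2(time_zones)
counts2['America/New_York']   # expected to match the 1251 obtained above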

If we want the top 10 time zones and their counts, we need a little dictionary manipulation:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

Then we have:

top_counts(counts)

[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]

If you search the Python standard library, you will find the collections.Counter class, which makes this task even easier:

from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)

[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]

1.2 Counting time zones with pandas

Creating a DataFrame from the original set of records is as simple as passing the list of records to pandas.DataFrame:

import pandas as pd
frame = pd.DataFrame(records)
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
a              3440 non-null object
c              2919 non-null object
nk             3440 non-null float64
tz             3440 non-null object
gr             2919 non-null object
g              3440 non-null object
h              3440 non-null object
l              3440 non-null object
al             3094 non-null object
hh             3440 non-null object
r              3440 non-null object
u              3440 non-null object
t              3440 non-null float64
hc             3440 non-null float64
cy             2919 non-null object
ll             2919 non-null object
_heartbeat_    120 non-null float64
kw             93 non-null object
dtypes: float64(4), object(14)
memory usage: 500.8+ KB
frame['tz'][:10]

0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object

The output shown for frame is a summary view, used mainly for large DataFrame objects. We can then use the value_counts method on the tz Series:

tz_counts = frame['tz'].value_counts()
tz_counts[:10]

America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64

We can visualize this data with matplotlib. To do so, we first fill in a substitute value for unknown or missing time zones in the records. The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing:

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]

America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64

At this point, we can use the seaborn package to make a horizontal bar plot:

import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)

<matplotlib.axes._subplots.AxesSubplot at 0x164d5acf2c8>

The a field contains information about browsers, devices, and applications that perform URL shortening:

frame['a'][1]

'GoogleMaps/RochesterNY'
frame['a'][50]

'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51][:50] # First 50 characters of information

'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

Parsing all of the information in these "agent" strings can be tedious. One strategy is to split off the first token of the string (corresponding roughly to the browser capability) and produce another summary of user behavior:

results = pd.Series([x.split()[0] for x in frame.a.dropna()])
results[:5]

0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object
results.value_counts()[:8]

Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64

Now, suppose you want to break down the time zone statistics into Windows and non-Windows users. For simplicity, we say a user is on Windows if the string "Windows" appears in the agent string. Since some agents are missing, we first exclude those records from the data:

cframe = frame[frame.a.notnull()]

Then compute, for each row, whether the agent string contains "Windows":

import numpy as np
cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
                        'Windows', 'Not Windows')

C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
cframe['os'][:5]

0        Windows
1    Not Windows
2        Windows
3    Not Windows
4        Windows
Name: os, dtype: object
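
The SettingWithCopyWarning above appears because cframe is a slice of frame. A minimal sketch of a warning-free alternative (not part of the original notebook's output) is to take an explicit copy before adding the column:

# Taking an explicit copy avoids assigning into a view of frame
cframe = frame[frame.a.notnull()].copy()
cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
                        'Windows', 'Not Windows')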

Next, you can group the data by the time zone column and the new operating system column:

by_tz_os = cframe.groupby(['tz', 'os'])

The group counts, analogous to value_counts, can be computed with size, and the result reshaped into a table with unstack:

agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]

os                              Not Windows  Windows
tz
                                      245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0

Finally, let's select the most frequently occurring time zones. To do so, I construct an indirect index array from the row totals of agg_counts:

indexer = agg_counts.sum(1).argsort()
indexer[:10]

tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64

I then use take to reorder the rows in that order and slice off the last 10 (the rows with the largest totals):

count_subset = agg_counts.take(indexer[-10:])
count_subset

os                   Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0

pandas has a simple method nlargest, which can do the same work:

agg_counts.sum(1).nlargest(10)

tz
America/New_York       1251.0
                        521.0
America/Chicago         400.0
America/Los_Angeles     382.0
America/Denver          191.0
Europe/London            74.0
Asia/Tokyo               37.0
Pacific/Honolulu         36.0
Europe/Madrid            35.0
America/Sao_Paulo        33.0
dtype: float64

Then, as the following code shows, the data can be plotted as a bar chart. I pass an extra argument to seaborn's barplot function to split the bars by operating system:

count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]

                  tz           os  total
0  America/Sao_Paulo  Not Windows   13.0
1  America/Sao_Paulo      Windows   20.0
2      Europe/Madrid  Not Windows   16.0
3      Europe/Madrid      Windows   19.0
4   Pacific/Honolulu  Not Windows    0.0
5   Pacific/Honolulu      Windows   36.0
6         Asia/Tokyo  Not Windows    2.0
7         Asia/Tokyo      Windows   35.0
8      Europe/London  Not Windows   43.0
9      Europe/London      Windows   31.0
sns.barplot(x='total', y='tz', hue='os',  data=count_subset)

<matplotlib.axes._subplots.AxesSubplot at 0x164d60101c8>

[Figure: counts for the top time zones, split by Windows and non-Windows users]

This plot makes it hard to see the relative proportion of Windows users within the smaller groups, so let's normalize each group's percentages to sum to 1:

def norm_total(group):
    group['normed_total'] = group.total / group.total.sum()
    return group

results = count_subset.groupby('tz').apply(norm_total)
sns.barplot(x='normed_total', y='tz', hue='os',  data=results)

<matplotlib.axes._subplots.AxesSubplot at 0x164d615d988>

[Figure: the same time zone counts with each group normalized to sum to 1]

We could also compute the normalized sums more efficiently with groupby's transform method:

g = count_subset.groupby('tz')
results2 = count_subset.total / g.total.transform('sum')
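
As a quick sketch (assuming count_subset is unchanged from above), the transform-based result can be attached and used to reproduce the same normalized chart:

# results2 shares count_subset's index, so it can be assigned directly as a column
count_subset['normed_total'] = results2
sns.barplot(x='normed_total', y='tz', hue='os', data=count_subset)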

Example 2. Analysis of movie rating data

The MovieLens project provides a collection of movie rating data contributed by its users. The data include movie ratings, movie metadata, and demographic information about the users (age, zip code, gender, occupation). Here I'll show how to slice and dice this data into the form you need.

The MovieLens 1M dataset contains one million ratings of 4,000 movies by 6,000 users. It is split across three tables: ratings, user information, and movie information. After extracting the data from the zip file, you can use pandas.read_table to read each table into a pandas DataFrame object:

import pandas as pd
pd.options.display.max_rows = 10 # Reduce the content of the exhibition
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../datasets/movielens/users.dat', sep='::',
                      header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../datasets/movielens/ratings.dat', sep='::',
                        header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)

C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  """
C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  
C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:11: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  # This is added back by InteractiveShellApp.init_path()
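
As the warning itself suggests, passing engine='python' explicitly avoids it. A sketch for the users table (the other two reads would be analogous):

# Explicitly choosing the python parsing engine silences the ParserWarning for the '::' separator
users = pd.read_table('../datasets/movielens/users.dat', sep='::',
                      header=None, names=unames, engine='python')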

Using Python's slicing syntax, you can verify that the data loaded correctly by looking at the first few rows of each DataFrame:

users[:5]

user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
ratings[:5]

user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
movies[:5]

movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

Note that ages and occupations are given in coded form; see the dataset's README file for their meanings. Analyzing data spread across three tables is awkward; if we want to compute mean ratings of a movie by gender and age, it is much easier to merge everything into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with movies. pandas infers which columns to use as merge (or join) keys from the overlapping column names:

data = pd.merge(pd.merge(ratings, users), movies)
data.head()

user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama
data.iloc[0]

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

To compute the mean rating of each movie grouped by gender, we can use the pivot_table method:

mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings[:5]

gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024

This produces another DataFrame whose values are mean ratings, whose rows are movie titles (the index), and whose columns are genders. Next, I want to filter out movies with fewer than 250 ratings. To do this, I first group the data by title and use size() to get a Series with the group size for each title:

ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

active_titles now holds the titles with at least 250 ratings, and we can use it to select the corresponding rows from mean_ratings:

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

gender F M
title
'burbs, The (1989) 2.793478 2.962085
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
... ... ...
Young Guns (1988) 3.371795 3.425620
Young Guns II (1990) 2.934783 2.904025
Young Sherlock Holmes (1985) 3.514706 3.363344
Zero Effect (1998) 3.864407 3.723140
eXistenZ (1999) 3.098592 3.289086

1216 rows × 2 columns

To see the top movies among female viewers, we can sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]

gender F M
title
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248

2.1 Measuring rating disagreement

Suppose we want to find the movies on which male and female viewers disagree the most. One way is to add a column to mean_ratings holding the difference in mean ratings, then sort by it:

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

By "diff", you can get the most divergent movies that female audiences prefer:

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Anastasia (1997) 3.800000 3.281609 -0.518391
Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
Color Purple, The (1985) 4.158192 3.659341 -0.498851
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Free Willy (1993) 2.921348 2.438776 -0.482573

Reversing the order of the rows and taking the first 10 gives the movies preferred by men:

sorted_by_diff[::-1][:10]

gender F M diff
title
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
Hidden, The (1987) 3.137931 3.745098 0.607167
Rocky III (1982) 2.361702 2.943503 0.581801
Caddyshack (1980) 3.396135 3.969737 0.573602
For a Few Dollars More (1965) 3.409091 3.953795 0.544704

If you want the movies that elicited the most disagreement among viewers regardless of gender, you can compute the variance or standard deviation of the ratings:

rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
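
The text above also mentions variance; since the variance is just the square of the standard deviation, ranking by it gives the same ordering. A minimal sketch:

# Rank by variance instead of standard deviation (same ordering, squared values)
rating_var_by_title = data.groupby('title')['rating'].var()
rating_var_by_title.loc[active_titles].sort_values(ascending=False)[:10]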

Example 3. Analysis of US baby names from 1880 to 2010

The United States Social Security Administration (SSA) provides data on the frequency of baby names from 1880 to the present.

Since the files are in a standard comma-separated format, they can be loaded into a DataFrame with pandas.read_csv:

import pandas as pd
names1880 =pd.read_csv('../datasets/babynames/yob1880.txt',names=['name', 'sex', 'births'])
names1880.head()

name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

These files contain only names with at least five occurrences in each year. For simplicity, we can use the sum of the births column by sex as the total number of births in that year:

names1880.groupby('sex').births.sum()

sex
F     90993
M    110493
Name: births, dtype: int64

Since the dataset is split into one file per year, the first task is to assemble all of the data into a single DataFrame and add a year field. This can be done with pandas.concat:

years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = '../datasets/babynames/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
names.head()

name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

There are a couple of things to note here. First, concat combines the DataFrames row-wise by default. Second, ignore_index=True must be passed because we do not want to keep the original row numbers returned by read_csv. We now have a single, very large DataFrame containing all of the name data.

With this data in hand, we can aggregate it at the year and sex level using groupby or pivot_table:

total_births = names.pivot_table('births', index='year',columns='sex', aggfunc=sum)
total_births.tail()

sex F M
year
2006 1896468 2050234
2007 1916888 2069242
2008 1883645 2032310
2009 1827643 1973359
2010 1759010 1898382
total_births.plot(title='Total births by sex and year')

<matplotlib.axes._subplots.AxesSubplot at 0x164ddf12948>

[Figure: total births by sex and year]

Next, let's insert a prop column holding the fraction of babies given each name relative to the total number of births; a prop value of 0.02 means that 2 out of every 100 babies were given a particular name. We group the data by year and sex, then add the new column to each group:

def add_prop(group):
    group['prop'] = group.births / group.births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
names

name sex births year prop
0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
... ... ... ... ... ...
1690779 Zymaire M 5 2010 0.000003
1690780 Zyonne M 5 2010 0.000003
1690781 Zyquarius M 5 2010 0.000003
1690782 Zyran M 5 2010 0.000003
1690783 Zzyzx M 5 2010 0.000003

1690784 rows × 5 columns

When doing a group operation like this, it is usually worth running a sanity check, such as verifying that the prop column sums to 1 within each group:

names.groupby(['year', 'sex']).prop.sum()

year  sex
1880  F      1.0
      M      1.0
1881  F      1.0
      M      1.0
1882  F      1.0
            ... 
2008  M      1.0
2009  F      1.0
      M      1.0
2010  F      1.0
      M      1.0
Name: prop, Length: 262, dtype: float64
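
The same check can be done programmatically; a small sketch using numpy.allclose (np was imported earlier):

# Each year/sex group's proportions should sum to 1 up to floating-point error
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)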

The check passes. To facilitate further analysis, I extract a subset of the data: the top 1,000 names for each sex/year combination. This is another group operation:

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
# Drop the group index, not needed
top1000.reset_index(inplace=True, drop=True)
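
An equivalent, slightly more concise way to build the same subset is DataFrame.nlargest; a sketch (results can differ only in how ties at the 1,000th place are broken):

# Take the 1,000 rows with the most births within each year/sex group
top1000_alt = names.groupby(['year', 'sex']).apply(
    lambda g: g.nlargest(1000, 'births'))
top1000_alt.reset_index(inplace=True, drop=True)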

Now the result data set is much smaller:

top1000

name sex births year prop
0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
... ... ... ... ... ...
261872 Camilo M 194 2010 0.000102
261873 Destin M 194 2010 0.000102
261874 Jaquan M 194 2010 0.000102
261875 Jaydan M 194 2010 0.000102
261876 Maxton M 193 2010 0.000102

261877 rows × 5 columns

The analyses that follow focus on this top1000 dataset.

3.1 Analyzing name trends

With the full dataset and the top1000 subset in hand, we can start analyzing various naming trends. First, split the top 1,000 names into the boy and girl portions:

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

These are simple time series that, with a little reshaping, can be plotted (for example, the number of babies named John or Mary each year). We first make a pivot table of total births by year and name:

total_births = top1000.pivot_table('births', index='year',columns='name',aggfunc=sum)

Now let's plot a handful of names using the DataFrame's plot method:

total_births.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6868 entries, Aaden to Zuri
dtypes: float64(6868)
memory usage: 6.9 MB
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")

array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD26748>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD2A948>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x00000164DE59F988>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD8F808>],
      dtype=object)

[Figure: number of births per year for the names John, Harry, Mary, and Marilyn]

The plot suggests that these names have fallen out of favor with Americans. But the story is not quite that simple, as the next section shows.

Measuring the increase in naming diversity

One explanation for the decline is that fewer and fewer parents are choosing common names for their children. This hypothesis can be checked against the data. One measure is the proportion of births represented by the top 1,000 most popular names, which I aggregate and plot by year and sex:

table = top1000.pivot_table('prop', index='year',
                            columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

<matplotlib.axes._subplots.AxesSubplot at 0x164eb0ed788>

[Figure: sum of top-1000 name proportions by year and sex]

The figure shows that name diversity has indeed increased (the proportion covered by the top 1,000 names has decreased). Another measure is the number of distinct names, taken in order of popularity, that make up the top 50% of births. This is a bit trickier to compute. Let's consider only the 2010 boys' names:

df = boys[boys.year == 2010]
df

name sex births year prop
260877 Jacob M 21875 2010 0.011523
260878 Ethan M 17866 2010 0.009411
260879 Michael M 17133 2010 0.009025
260880 Jayden M 17030 2010 0.008971
260881 William M 16870 2010 0.008887
... ... ... ... ... ...
261872 Camilo M 194 2010 0.000102
261873 Destin M 194 2010 0.000102
261874 Jaquan M 194 2010 0.000102
261875 Jaydan M 194 2010 0.000102
261876 Maxton M 193 2010 0.000102

1000 rows × 5 columns

After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. Writing a for loop would work, but NumPy offers a cleverer vectorized approach: take the cumulative sum (cumsum) of prop, then use the searchsorted method to find the position where 0.5 would need to be inserted to keep the array sorted:

prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum

260877    0.011523
260878    0.020934
260879    0.029959
260880    0.038930
260881    0.047817
            ...   
261872    0.842748
261873    0.842850
261874    0.842953
261875    0.843055
261876    0.843156
Name: prop, Length: 1000, dtype: float64
prop_cumsum.values.searchsorted(0.5)

116

Since arrays are zero-indexed, we add 1 to this result, giving a final answer of 117. By comparison, the same figure for 1900 is much smaller:

df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1

25

You can now apply this calculation to every year/sex combination: group by those two fields and apply a function that computes the value for each group:

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')

The resulting diversity DataFrame now has two time series, one per sex, indexed by year. In IPython you can inspect its contents and plot it as before:

diversity.head()

sex F M
year
1880 38 14
1881 38 14
1882 38 15
1883 39 15
1884 39 16
diversity.plot(title="Number of popular names in top 50%")

<matplotlib.axes._subplots.AxesSubplot at 0x164eaf06308>

[Figure: number of popular names making up the top 50% of births, by year and sex]

As the plot shows, girls' names have always been more diverse than boys' names, and the gap has only widened over time.

The 'last letter' revolution

In 2007, baby name researcher Laura Wattenberg pointed out on her website that the distribution of boys' names by final letter has changed significantly over the last hundred years. To see this, I first aggregate all of the birth data by year, sex, and final letter:

get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters,
                          columns=['sex', 'year'], aggfunc=sum)

Then I select three representative years and print the first few rows:

subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()

sex                 F                              M
year             1910      1960      2010      1910      1960      2010
last_letter
a            108376.0  691247.0  670605.0     977.0    5204.0   28438.0
b                 NaN     694.0     450.0     411.0    3912.0   38859.0
c                 5.0      49.0     946.0     482.0   15476.0   23125.0
d              6750.0    3729.0    2607.0   22111.0  262112.0   44398.0
e            133569.0  435013.0  313833.0   28655.0  178823.0  129012.0

Next, we normalize the table by the total number of births, so that each value is the proportion of births (by sex) whose names end in that letter:

subtable.sum()

sex  year
F    1910     396416.0
     1960    2022062.0
     2010    1759010.0
M    1910     194198.0
     1960    2132588.0
     2010    1898382.0
dtype: float64
letter_prop = subtable / subtable.sum()
letter_prop

sex                 F                             M
year             1910      1960      2010      1910      1960      2010
last_letter
a            0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
b                 NaN  0.000343  0.000256  0.002116  0.001834  0.020470
c            0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
d            0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
e            0.336941  0.215133  0.178415  0.147556  0.083853  0.067959
...               ...       ...       ...       ...       ...       ...
v                 NaN  0.000060  0.000117  0.000113  0.000037  0.001434
w            0.000020  0.000031  0.001182  0.006329  0.007711  0.016148
x            0.000015  0.000037  0.000727  0.003965  0.001851  0.008614
y            0.110972  0.152569  0.116828  0.077349  0.160987  0.058168
z            0.002439  0.000659  0.000704  0.000170  0.000184  0.001831

26 rows × 6 columns

With these letter proportions in hand, we can make a bar chart for each sex, broken down by year:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)

<matplotlib.axes._subplots.AxesSubplot at 0x164eafad088>

[Figure: proportion of boys' and girls' names ending in each letter, for 1910, 1960, and 2010]

The charts show that boys' names ending in the letter "n" have grown markedly since the 1960s. Going back to the full table created earlier, I normalize it by year and sex, select a few letters for boys' names, and transpose so that each column is a time series:

letter_prop = table / table.sum()
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.head()

last_letter d n y
year
1880 0.083055 0.153213 0.075760
1881 0.083247 0.153214 0.077451
1882 0.085340 0.149560 0.077537
1883 0.084066 0.151646 0.079144
1884 0.086120 0.149915 0.080405

With this DataFrame of time series, a trend plot can be made with its plot method:

dny_ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1648145b448>

[Figure: proportion of boys' names ending in d, n, and y over time]

Boy names that became girl names (and vice versa)

Another interesting trend is that some names popular for boys early in the sample have "switched sexes" in more recent years, such as Lesley or Leslie. Going back to the top1000 dataset, we find the set of names starting with "lesl":

all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
lesley_like

632     Leslie
2294    Lesley
4262    Leslee
4728     Lesli
6103     Lesly
dtype: object

We then use this result to filter the dataset down to just those names, and sum births by name to see relative frequencies:

filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()

name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

Next, we aggregate by sex and year, and normalize within each year:

table = filtered.pivot_table('births', index='year',columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)
table.tail()

sex F M
year
2006 1.0 NaN
2007 1.0 NaN
2008 1.0 NaN
2009 1.0 NaN
2010 1.0 NaN

Finally, it is now easy to plot the breakdown by sex over time:

table.plot(style={'M': 'k-', 'F': 'k--'})

<matplotlib.axes._subplots.AxesSubplot at 0x164812f6508>

[Figure: proportion of Lesley-like names recorded for each sex over time]

Example 4. Analysis of the USDA food database

The United States Department of Agriculture (USDA) publishes a database of food nutrient information. Ashley Williams produced a JSON version of this data, with records that look like this:

{
  "id": 21441,
  "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY,
Wing, meat and skin with breading",
  "tags": ["KFC"],
  "manufacturer": "Kentucky Fried Chicken",
"group": "Fast Foods",
  "portions": [
    {
      "amount": 1,
      "unit": "wing, with skin",
      "grams": 68.0
    },

    ...
  ],
  "nutrients": [
    {
      "value": 20.8,
      "units": "g",
      "description": "Protein",
      "group": "Composition"
    },

    ...
  ]
}

Each food has a number of identifying attributes along with two lists, one of nutrients and one of portion sizes. Data in this form is not well suited to analysis, so we need to do some work to wrangle it into a better shape.

You can load the file into Python with any JSON library you like; I use the built-in json module:

import json
db = json.load(open('../datasets/usda_food/database.json'))
len(db)

6636

Each entry in db is a dictionary containing all of the data for one food. The nutrients field is a list of dictionaries, one per nutrient:

db[0].keys()

dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
db[0]['nutrients'][0]

{'value': 25.18,
 'units': 'g',
 'description': 'Protein',
 'group': 'Composition'}
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients

value units description group
0 25.180 g Protein Composition
1 29.200 g Total lipid (fat) Composition
2 3.060 g Carbohydrate, by difference Composition
3 3.280 g Ash Other
4 376.000 kcal Energy Energy
... ... ... ... ...
157 1.472 g Serine Amino Acids
158 93.000 mg Cholesterol Other
159 18.584 g Fatty acids, total saturated Other
160 8.275 g Fatty acids, total monounsaturated Other
161 0.830 g Fatty acids, total polyunsaturated Other

162 rows × 4 columns

When converting a list of dictionaries to a DataFrame, you can specify just the fields you want to extract. Here we take the food's name, group, id, and manufacturer:

info_keys = ['description', 'group', 'id', 'manufacturer']
info = pd.DataFrame(db, columns=info_keys)
info[:5]

                          description                   group    id manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008
1                     Cheese, cheddar  Dairy and Egg Products  1009
2                        Cheese, edam  Dairy and Egg Products  1018
3                        Cheese, feta  Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028
info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description     6636 non-null object
group           6636 non-null object
id              6636 non-null int64
manufacturer    5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB

Using value_counts, you can check the distribution of food groups:

pd.value_counts(info.group)[:10]

Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Fast Foods                           365
Legumes and Legume Products          365
Lamb, Veal, and Game Products        345
Sweets                               341
Pork Products                        328
Fruits and Fruit Juices              328
Name: group, dtype: int64

Now, to analyze all of the nutrient data, it is easiest to assemble the nutrients for every food into one large table. This takes a few steps: first, convert each food's list of nutrients to a DataFrame, add a column with the food's id, and append that DataFrame to a list; finally, concatenate everything with concat:

pieces = []
for i in db:
    frame = pd.DataFrame(i['nutrients'])
    frame['id'] = i['id']
    pieces.append(frame)
nutrients = pd.concat(pieces,ignore_index=True)
nutrients

value units description group id
0 25.180 g Protein Composition 1008
1 29.200 g Total lipid (fat) Composition 1008
2 3.060 g Carbohydrate, by difference Composition 1008
3 3.280 g Ash Other 1008
4 376.000 kcal Energy Energy 1008
... ... ... ... ... ...
389350 0.000 mcg Vitamin B-12, added Vitamins 43546
389351 0.000 mg Cholesterol Other 43546
389352 0.072 g Fatty acids, total saturated Other 43546
389353 0.028 g Fatty acids, total monounsaturated Other 43546
389354 0.041 g Fatty acids, total polyunsaturated Other 43546

389355 rows × 5 columns

It turns out this DataFrame contains some duplicate rows, so it is easiest to drop them:

nutrients.duplicated().sum() # Count duplicates

14179
nutrients = nutrients.drop_duplicates() # Delete duplicates

Since both DataFrame objects have "group" and "description" columns, we rename them so it is clear which is which:

col_mapping = {'description' : 'food','group': 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
food            6636 non-null object
fgroup          6636 non-null object
id              6636 non-null int64
manufacturer    5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB
col_mapping = {'description' : 'nutrient','group' : 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients

value units nutrient nutgroup id
0 25.180 g Protein Composition 1008
1 29.200 g Total lipid (fat) Composition 1008
2 3.060 g Carbohydrate, by difference Composition 1008
3 3.280 g Ash Other 1008
4 376.000 kcal Energy Energy 1008
... ... ... ... ... ...
389350 0.000 mcg Vitamin B-12, added Vitamins 43546
389351 0.000 mg Cholesterol Other 43546
389352 0.072 g Fatty acids, total saturated Other 43546
389353 0.028 g Fatty acids, total monounsaturated Other 43546
389354 0.041 g Fatty acids, total polyunsaturated Other 43546

375176 rows × 5 columns

With that done, we can merge nutrients with info:

ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns (total 8 columns):
value           375176 non-null float64
units           375176 non-null object
nutrient        375176 non-null object
nutgroup        375176 non-null object
id              375176 non-null int64
food            375176 non-null object
fgroup          375176 non-null object
manufacturer    293054 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 25.8+ MB
ndata.iloc[30000]

value                                             0.04
units                                                g
nutrient                                       Glycine
nutgroup                                   Amino Acids
id                                                6158
food            Soup, tomato bisque, canned, condensed
fgroup                      Soups, Sauces, and Gravies
manufacturer                                          
Name: 30000, dtype: object

We can now plot median values by food group and nutrient type:

result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result['Zinc, Zn'].sort_values().plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x164d65f0b48>

[Figure: median zinc content by food group]

With only a little more work, we can find which food is most dense in each nutrient:

by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

# make the food a little smaller
max_foods.food = max_foods.food.str[:50]
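
get_minimum is defined above but not used; it could be applied in exactly the same way to find the food with the smallest amount of each nutrient (a sketch):

# Analogous table of the least dense food for each nutrient
min_foods = by_nutrient.apply(get_minimum)[['value', 'food']]
min_foods.food = min_foods.food.str[:50]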

The max_foods DataFrame is too large to print in full; here is just the "Amino Acids" nutrient group:

max_foods.loc['Amino Acids']['food']

nutrient
Alanine                          Gelatins, dry powder, unsweetened
Arginine                              Seeds, sesame flour, low-fat
Aspartic acid                                  Soy protein isolate
Cystine               Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                  Soy protein isolate
                                       ...                        
Serine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan        Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, Length: 19, dtype: object

Example 5. Analysis of the 2012 US Federal Election Commission database

The US Federal Election Commission publishes data on contributions to political campaigns, including contributor names, occupations, employers, addresses, and contribution amounts.

The data can be loaded with pandas.read_csv:

fec = pd.read_csv('../datasets/fec/P00000001-ALL.csv')
fec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
fec.head()

cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 250.0 20-JUN-11 NaN NaN NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 50.0 23-JUN-11 NaN NaN NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e+08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NaN NaN NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e+08 NONE RETIRED 250.0 01-AUG-11 NaN NaN NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e+08 NONE RETIRED 300.0 20-JUN-11 NaN NaN NaN SA17A 736166

The records in this DataFrame are as follows:

fec.iloc[123456]

cmte_id             C00431445
cand_id             P80003338
cand_nm         Obama, Barack
contbr_nm         ELLMAN, IRA
contbr_city             TEMPE
                    ...      
receipt_desc              NaN
memo_cd                   NaN
memo_text                 NaN
form_tp                 SA17A
file_num               772372
Name: 123456, Length: 16, dtype: object

There are many ways you might slice this campaign contribution data to extract statistics about donors and contribution patterns. I'll show several different analyses that apply the techniques covered so far.

You can see that the data contains no party affiliation information, so it is useful to add it. Using unique, you can get the full list of candidates:

unique_cands = fec.cand_nm.unique()
unique_cands

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)
unique_cands[2]

'Obama, Barack'

One way to identify party information is to use a dictionary:

parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}

Now, using this mapping and the Series map method, you can compute an array of party affiliations from the candidate names:

fec.cand_nm[123456:123461]

123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)

123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
# Add it as a column
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()

Democrat      593746
Republican    407985
Name: party, dtype: int64

There are a couple of data preparation details to note. First, the data includes both contributions and refunds (negative contribution amounts):

(fec.contb_receipt_amt > 0).value_counts()

True     991475
False     10256
Name: contb_receipt_amt, dtype: int64

To simplify the analysis process, I limit the dataset to only positive contributions:

fec = fec[fec.contb_receipt_amt > 0]

Second, since Barack Obama and Mitt Romney were the two main candidates, I also prepare a subset containing only contributions to their campaigns:

fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]

5.1 Donation statistics by occupation and employer

Donation statistics by occupation are another frequently studied topic. For example, attorneys tend to donate more to Democrats, while business executives tend to donate more to Republicans. Don't take my word for it; you can see for yourself in the data. First, the number of donations by occupation is easy to get:

fec.contbr_occupation.value_counts()[:10]

RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64

You can see that many occupations refer to the same basic job type, or are variants of the same thing. The following code cleans up a few of them by mapping one occupation to another. Note the trick of using dict.get to allow occupations with no mapping to "pass through" unchanged:

occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}
# If no mapping provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)

I do the same with employer information:

emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}
# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)

Now you can use pivot_table to aggregate the data by party and occupation, then filter down to occupations that donated more than $2 million overall:

by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm

party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
CEO 2074974.79 4.211041e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
... ... ...
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06

17 rows × 2 columns

This data is easier to look at as a bar chart ('barh' means horizontal bar plot):

over_2mm.plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x164d535f088>

[Figure: total donations by party for occupations contributing over $2 million]

You might also want to know the occupations and employers that donated the most to Obama and Romney. To do this, first group by candidate, then use a helper function along the lines of the top-n approaches used earlier in the chapter:

def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)

Then aggregate by occupation and employer:

grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)

cand_nm        contbr_occupation    
Obama, Barack  RETIRED                  25305116.38
               ATTORNEY                 11141982.97
               INFORMATION REQUESTED     4866973.96
               HOMEMAKER                 4248875.80
               PHYSICIAN                 3735124.94
                                           ...     
Romney, Mitt   HOMEMAKER                 8147446.22
               ATTORNEY                  5364718.82
               PRESIDENT                 2491244.89
               EXECUTIVE                 2300947.03
               C.E.O.                    1968386.11
Name: contb_receipt_amt, Length: 14, dtype: float64
grouped.apply(get_top_amounts, 'contbr_employer', n=10)

cand_nm        contbr_employer      
Obama, Barack  RETIRED                  22694358.85
               SELF-EMPLOYED            17080985.96
               NOT EMPLOYED              8586308.70
               INFORMATION REQUESTED     5053480.37
               HOMEMAKER                 2605408.54
                                           ...     
Romney, Mitt   CREDIT SUISSE              281150.00
               MORGAN STANLEY             267266.00
               GOLDMAN SACH & CO.         238250.00
               BARCLAYS CAPITAL           162750.00
               H.I.G. CAPITAL             139500.00
Name: contb_receipt_amt, Length: 20, dtype: float64

5.2 Bucketing donation amounts

Another useful analysis is to use the cut function to discretize the contribution amounts into buckets by size:

bins = np.array([0, 1, 10, 100, 1000, 10000,100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels

411         (10, 100]
412       (100, 1000]
413       (100, 1000]
414         (10, 100]
415         (10, 100]
             ...     
701381      (10, 100]
701382    (100, 1000]
701383        (1, 10]
701384      (10, 100]
701385    (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

We can then group the Obama and Romney data by candidate name and bin label to get a frequency table by donation size:

grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt
(0, 1]                       493.0          77.0
(1, 10]                    40070.0        3681.0
(10, 100]                 372280.0       31853.0
(100, 1000]               153991.0       43357.0
(1000, 10000]              22284.0       26186.0
(10000, 100000]                2.0           1.0
(100000, 1000000]              3.0           NaN
(1000000, 10000000]            4.0           NaN

This data shows that Obama received far more small donations than Romney. You can also sum the contribution amounts and normalize within each bucket to visualize each candidate's share of donations by size:

bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN
normed_sums[:-2].plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x1648ee66d08>

[Figure: each candidate's share of total donations, by donation-size bucket]

I excluded the two largest bins, since those amounts are not donations by individuals.

Many refinements and improvements could be made to this analysis. For example, you could aggregate by donor name and zip code to distinguish donors who made many small contributions from those who made one or more large ones. I strongly encourage you to download the data and explore it yourself.
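
A minimal sketch of that donor-level aggregation, using the column names shown in fec.info() above:

# Count and sum contributions per donor name/zip combination;
# sorting by count surfaces donors who gave many (often small) contributions
donor_totals = fec_mrbo.groupby(['contbr_nm', 'contbr_zip']).contb_receipt_amt.agg(['count', 'sum'])
donor_totals.sort_values(by='count', ascending=False)[:10]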

5.3 Donation statistics by state

Aggregating data based on candidates and states is a routine operation:

grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]

cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 281840.15 86204.24
AL 543123.48 527303.51
AR 359247.28 105556.00
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FL 7318178.58 8338458.81

Dividing each row by the total amount donated gives each candidate's share of donations within each state:

percent = totals.div(totals.sum(1), axis=0)
percent[:10]

cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 0.765778 0.234222
AL 0.507390 0.492610
AR 0.772902 0.227098
AZ 0.443745 0.556255
CA 0.679498 0.320502
CO 0.585970 0.414030
CT 0.371476 0.628524
DC 0.810113 0.189887
DE 0.802776 0.197224
FL 0.467417 0.532583
