Article catalog
- Data analysis with Python (Second Edition)
- Example 1. Data analysis of time zone
- Example 2. Analysis of film rating data
- Example 3. Data analysis of infant names in the United States from 1880 to 2010
- Example 4. Data analysis of USDA food database
- Example 5. Data analysis of the 2012 United States Federal Election Commission database
The data files used in this chapter can be downloaded from GitHub.
Data analysis with Python (Second Edition)
Example 1. Data analysis of time zone
In 2011, the URL-shortening service Bitly partnered with the U.S. government website USA.gov to provide a feed of anonymized data gathered from users who shorten links ending in .gov or .mil.
Taking one of the hourly snapshots as an example, each line of the file is JSON (JavaScript Object Notation, a common web data format). If we read just the first line of the file, the result looks like this:
path = '../datasets/bitly_usagov/example.txt'
open(path).readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Python has built-in and third-party modules for converting a JSON string into a Python dictionary. Here I'll use the json module and its loads function, applied line by line to the downloaded data file:
import json
path = '../datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
The records object is now a list of Python dictionaries:
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11', 'c': 'US', 'nk': 1, 'tz': 'America/New_York', 'gr': 'MA', 'g': 'A6qOVH', 'h': 'wfLQtf', 'l': 'orofrog', 'al': 'en-US,en;q=0.8', 'hh': '1.usa.gov', 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf', 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991', 't': 1331923247, 'hc': 1331822918, 'cy': 'Danvers', 'll': [42.576698, -70.954903]}
1.1 pure python time zone count
Suppose we want to know which time zone (the tz field) is the most common in the dataset. There are many ways to answer this. First, we extract a list of time zones using a list comprehension:
time_zones = [rec['tz'] for rec in records]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-5-f3fbbc37f129> in <module> ----> 1 time_zones = [rec['tz'] for rec in records] <ipython-input-5-f3fbbc37f129> in <listcomp>(.0) ----> 1 time_zones = [rec['tz'] for rec in records] KeyError: 'tz'
It turns out that not all of the records have a time zone field. We can handle this by adding a check, if 'tz' in rec, at the end of the list comprehension:
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
['America/New_York', 'America/Denver', 'America/New_York', 'America/Sao_Paulo', 'America/New_York', 'America/New_York', 'Europe/Warsaw', '', '', '']
Looking at just the first 10 time zones, we see that some of them are unknown (empty strings). These could be filtered out, but we'll leave them in for now. Next, to count the time zones, I'll show two approaches: the harder way (using only the Python standard library) and the easier way (using pandas). One way to do the counting is to store the counts in a dictionary while iterating over the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
Using more advanced tools from the Python standard library, you can write the same thing more succinctly:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
I put the logic in a function to make it more reusable. To use it on the time zones, just pass the time_zones list:
counts = get_counts(time_zones)
counts['America/New_York']
1251
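get_counts2 can be used the same way; as a quick check (a small sketch, not part of the original notebook), it should agree with get_counts:

counts2 = get_counts2(time_zones)
counts2['America/New_York']  # should also be 1251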
len(time_zones)
3440
If we want to get the top 10 time zones and their count values, we need to use some dictionary processing skills:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
Then there are:
top_counts(counts)
[(33, 'America/Sao_Paulo'), (35, 'Europe/Madrid'), (36, 'Pacific/Honolulu'), (37, 'Asia/Tokyo'), (74, 'Europe/London'), (191, 'America/Denver'), (382, 'America/Los_Angeles'), (400, 'America/Chicago'), (521, ''), (1251, 'America/New_York')]
If you search the Python standard library, you'll find the collections.Counter class, which makes this task even simpler:
from collections import Counter

counts = Counter(time_zones)
counts.most_common(10)
[('America/New_York', 1251), ('', 521), ('America/Chicago', 400), ('America/Los_Angeles', 382), ('America/Denver', 191), ('Europe/London', 74), ('Asia/Tokyo', 37), ('Pacific/Honolulu', 36), ('Europe/Madrid', 35), ('America/Sao_Paulo', 33)]
1.2 time zone counting with pandas
Creating a DataFrame from the original set of records is as simple as passing the list of records to pandas.DataFrame:
import pandas as pd

frame = pd.DataFrame(records)
frame.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3560 entries, 0 to 3559 Data columns (total 18 columns): a 3440 non-null object c 2919 non-null object nk 3440 non-null float64 tz 3440 non-null object gr 2919 non-null object g 3440 non-null object h 3440 non-null object l 3440 non-null object al 3094 non-null object hh 3440 non-null object r 3440 non-null object u 3440 non-null object t 3440 non-null float64 hc 3440 non-null float64 cy 2919 non-null object ll 2919 non-null object _heartbeat_ 120 non-null float64 kw 93 non-null object dtypes: float64(4), object(14) memory usage: 500.8+ KB
frame['tz'][:10]
0 America/New_York 1 America/Denver 2 America/New_York 3 America/Sao_Paulo 4 America/New_York 5 America/New_York 6 Europe/Warsaw 7 8 9 Name: tz, dtype: object
The output shown for frame is the summary view, used for large DataFrame objects. We can then use the value_counts method on the tz Series:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York 1251 521 America/Chicago 400 America/Los_Angeles 382 America/Denver 191 Europe/London 74 Asia/Tokyo 37 Pacific/Honolulu 36 Europe/Madrid 35 America/Sao_Paulo 33 Name: tz, dtype: int64
We can visualize this data with matplotlib. To do so, we first fill in a substitute value for the unknown or missing time zones in the records. The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing:
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
America/New_York 1251 Unknown 521 America/Chicago 400 America/Los_Angeles 382 America/Denver 191 Missing 120 Europe/London 74 Asia/Tokyo 37 Pacific/Honolulu 36 Europe/Madrid 35 Name: tz, dtype: int64
At this point, we can use the seaborn package to make a horizontal bar plot:
import seaborn as sns

subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)
<matplotlib.axes._subplots.AxesSubplot at 0x164d5acf2c8>
The a field contains information about browsers, devices, and applications that perform URL shortening:
frame['a'][1]
'GoogleMaps/RochesterNY'
frame['a'][50]
'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51][:50] # First 50 characters of information
'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'
Parsing all of the interesting information in these agent strings can be tedious. One strategy is to split off the first token of the string (corresponding roughly to the browser) and produce another summary of user behavior:
results = pd.Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
0 Mozilla/5.0 1 GoogleMaps/RochesterNY 2 Mozilla/4.0 3 Mozilla/5.0 4 Mozilla/5.0 dtype: object
results.value_counts()[:8]
Mozilla/5.0 2594 Mozilla/4.0 601 GoogleMaps/RochesterNY 121 Opera/9.80 34 TEST_INTERNET_AGENT 24 GoogleProducer 21 Mozilla/6.0 5 BlackBerry8520/5.0.0.681 4 dtype: int64
Now, suppose you want to break down the time zone statistics into Windows and non-Windows users. For simplicity, we'll say that a user is a Windows user if the agent string contains 'Windows'. Since some of the agent strings are missing, those records are excluded first:
cframe = frame[frame.a.notnull()]
Then compute, for each row, whether the agent string contains 'Windows':
import numpy as np

cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
                        'Windows', 'Not Windows')
C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy This is separate from the ipykernel package so we can avoid doing imports until
cframe['os'][:5]
0 Windows 1 Not Windows 2 Windows 3 Not Windows 4 Windows Name: os, dtype: object
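The SettingWithCopyWarning above appears because cframe is a view of a slice of frame. A common way to avoid it (shown here as a sketch, not part of the original notebook) is to take an explicit copy before adding the new column:

cframe = frame[frame.a.notnull()].copy()
cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
                        'Windows', 'Not Windows')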
Next, you can group the data according to the time zone and the new operating system list:
by_tz_os = cframe.groupby(['tz', 'os'])
The group counts, analogous to the value_counts results above, can be computed with size, and the result reshaped into a table with unstack:
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
os | Not Windows | Windows |
---|---|---|
tz | ||
 | 245.0 | 276.0 |
Africa/Cairo | 0.0 | 3.0 |
Africa/Casablanca | 0.0 | 1.0 |
Africa/Ceuta | 0.0 | 2.0 |
Africa/Johannesburg | 0.0 | 1.0 |
Africa/Lusaka | 0.0 | 1.0 |
America/Anchorage | 4.0 | 1.0 |
America/Argentina/Buenos_Aires | 1.0 | 0.0 |
America/Argentina/Cordoba | 0.0 | 1.0 |
America/Argentina/Mendoza | 0.0 | 1.0 |
Finally, let's select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:
indexer = agg_counts.sum(1).argsort()
indexer[:10]
tz 24 Africa/Cairo 20 Africa/Casablanca 21 Africa/Ceuta 92 Africa/Johannesburg 87 Africa/Lusaka 53 America/Anchorage 54 America/Argentina/Buenos_Aires 57 America/Argentina/Cordoba 26 America/Argentina/Mendoza 55 dtype: int64
Then I use take to select the rows in that order and slice off the last 10 rows (the largest values):
count_subset = agg_counts.take(indexer[-10:])
count_subset
os | Not Windows | Windows |
---|---|---|
tz | ||
America/Sao_Paulo | 13.0 | 20.0 |
Europe/Madrid | 16.0 | 19.0 |
Pacific/Honolulu | 0.0 | 36.0 |
Asia/Tokyo | 2.0 | 35.0 |
Europe/London | 43.0 | 31.0 |
America/Denver | 132.0 | 59.0 |
America/Los_Angeles | 130.0 | 252.0 |
America/Chicago | 115.0 | 285.0 |
 | 245.0 | 276.0 |
America/New_York | 339.0 | 912.0 |
pandas has a simple method nlargest, which can do the same work:
agg_counts.sum(1).nlargest(10)
tz America/New_York 1251.0 521.0 America/Chicago 400.0 America/Los_Angeles 382.0 America/Denver 191.0 Europe/London 74.0 Asia/Tokyo 37.0 Pacific/Honolulu 36.0 Europe/Madrid 35.0 America/Sao_Paulo 33.0 dtype: float64
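For reference, the same top-10 subset selected earlier with take can also be pulled out through the index of the nlargest result. This is a small sketch rather than original notebook code; the rows may come back in descending rather than ascending order:

top10 = agg_counts.sum(1).nlargest(10)
count_subset_alt = agg_counts.loc[top10.index]  # same ten time zones as count_subset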
Then, as the following code shows, the data can be plotted in a grouped bar chart; I pass an extra argument to seaborn's barplot function so the bars are grouped by operating system:
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]
tz | os | total | |
---|---|---|---|
0 | America/Sao_Paulo | Not Windows | 13.0 |
1 | America/Sao_Paulo | Windows | 20.0 |
2 | Europe/Madrid | Not Windows | 16.0 |
3 | Europe/Madrid | Windows | 19.0 |
4 | Pacific/Honolulu | Not Windows | 0.0 |
5 | Pacific/Honolulu | Windows | 36.0 |
6 | Asia/Tokyo | Not Windows | 2.0 |
7 | Asia/Tokyo | Windows | 35.0 |
8 | Europe/London | Not Windows | 43.0 |
9 | Europe/London | Windows | 31.0 |
sns.barplot(x='total', y='tz', hue='os', data=count_subset)
<matplotlib.axes._subplots.AxesSubplot at 0x164d60101c8>
[Figure: bar plot of the top time zones broken down by Windows and non-Windows users]
It's hard to see the relative proportion of Windows users in the smaller groups in this plot, so let's normalize the counts within each group to sum to 1:
def norm_total(group):
    group['normed_total'] = group.total / group.total.sum()
    return group

results = count_subset.groupby('tz').apply(norm_total)
sns.barplot(x='normed_total', y='tz', hue='os', data=results)
<matplotlib.axes._subplots.AxesSubplot at 0x164d615d988>
[Figure: proportion of Windows and non-Windows users within each of the top time zones, normalized to 1]
We could also have computed the normalized sums more efficiently with groupby's transform method:
g = count_subset.groupby('tz')
results2 = count_subset.total / g.total.transform('sum')
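Since results2 is index-aligned with count_subset, it could be attached as a column and plotted exactly like normed_total; the following is a sketch reusing the objects defined above, not output from the original notebook:

# Attach the transform-based proportions; they match the apply-based normed_total.
count_subset['normed_total2'] = results2
sns.barplot(x='normed_total2', y='tz', hue='os', data=count_subset)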
Example 2. Analysis of film rating data
The MovieLens project provides a collection of movie rating data gathered from its users. The data include movie ratings, movie metadata, and demographic data about the users (age, zip code, gender, and occupation). Here I'll show how to slice and dice this data into the exact form you need.
The MovieLens 1M dataset contains 1 million ratings collected from 6,000 users on 4,000 movies. It is spread across three tables: ratings, user information, and movie information. After extracting the data from the zip file, you can use pandas.read_table to load each table into a pandas DataFrame object:
import pandas as pd
pd.options.display.max_rows = 10  # show fewer rows when displaying

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../datasets/movielens/users.dat', sep='::',
                      header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../datasets/movielens/ratings.dat', sep='::',
                        header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)
C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'. """ C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'. C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:11: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'. # This is added back by InteractiveShellApp.init_path()
Using Python's slicing syntax, you can verify that the data loading is all right by looking at the first few lines of each DataFrame:
users[:5]
user_id | gender | age | occupation | zip | |
---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
ratings[:5]
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
movies[:5]
movie_id | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
Note that the ages and occupations are given in coded form. For their specific meanings, please refer to the README file of the dataset. It's not easy to analyze the data scattered in three tables. If we want to calculate the average score of a movie based on gender and age, it's much easier to combine all the data into one table. We use the merge function of pandas to merge ratings with users, and then movies. Pandas infers which columns are merge (or join) keys based on the overlap of column names:
data = pd.merge(pd.merge(ratings, users), movies)
data.head()
user_id | movie_id | rating | timestamp | gender | age | occupation | zip | title | genres | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |
data.iloc[0]
user_id 1 movie_id 1193 rating 5 timestamp 978300760 gender F age 1 occupation 10 zip 48067 title One Flew Over the Cuckoo's Nest (1975) genres Drama Name: 0, dtype: object
To get the mean ratings for each movie grouped by gender, we can use the pivot_table method:
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings[:5]
gender | F | M |
---|---|---|
title | ||
$1,000,000 Duck (1971) | 3.375000 | 2.761905 |
'Night Mother (1986) | 3.388889 | 3.352941 |
'Til There Was You (1997) | 2.675676 | 2.733333 |
'burbs, The (1989) | 2.793478 | 2.962085 |
...And Justice for All (1979) | 3.828571 | 3.689024 |
This produced another DataFrame containing the mean ratings, with movie titles as the row index and gender as the columns. Next, I'm going to filter down to movies that received at least 250 ratings. To do this, I first group the data by title and use size() to get a Series of group sizes for each title:
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
title $1,000,000 Duck (1971) 37 'Night Mother (1986) 70 'Til There Was You (1997) 52 'burbs, The (1989) 303 ...And Justice for All (1979) 199 1-900 (1994) 2 10 Things I Hate About You (1999) 700 101 Dalmatians (1961) 565 101 Dalmatians (1996) 364 12 Angry Men (1957) 616 dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)', '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)', '13th Warrior, The (1999)', '2 Days in the Valley (1996)', '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)', '2010 (1984)', ... 'X-Men (2000)', 'Year of Living Dangerously (1982)', 'Yellow Submarine (1968)', 'You've Got Mail (1998)', 'Young Frankenstein (1974)', 'Young Guns (1988)', 'Young Guns II (1990)', 'Young Sherlock Holmes (1985)', 'Zero Effect (1998)', 'eXistenZ (1999)'], dtype='object', name='title', length=1216)
active_titles is the index of titles that received at least 250 ratings; we can then use it to select the corresponding rows from mean_ratings:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
gender | F | M |
---|---|---|
title | ||
'burbs, The (1989) | 2.793478 | 2.962085 |
10 Things I Hate About You (1999) | 3.646552 | 3.311966 |
101 Dalmatians (1961) | 3.791444 | 3.500000 |
101 Dalmatians (1996) | 3.240000 | 2.911215 |
12 Angry Men (1957) | 4.184397 | 4.328421 |
... | ... | ... |
Young Guns (1988) | 3.371795 | 3.425620 |
Young Guns II (1990) | 2.934783 | 2.904025 |
Young Sherlock Holmes (1985) | 3.514706 | 3.363344 |
Zero Effect (1998) | 3.864407 | 3.723140 |
eXistenZ (1999) | 3.098592 | 3.289086 |
1216 rows × 2 columns
To see the top films among female viewers, we can sort by the F column in descending order:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]
gender | F | M |
---|---|---|
title | ||
Close Shave, A (1995) | 4.644444 | 4.473795 |
Wrong Trousers, The (1993) | 4.588235 | 4.478261 |
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572650 | 4.464589 |
Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563107 | 4.385075 |
Schindler's List (1993) | 4.562602 | 4.491415 |
Shawshank Redemption, The (1994) | 4.539075 | 4.560625 |
Grand Day Out, A (1992) | 4.537879 | 4.293255 |
To Kill a Mockingbird (1962) | 4.536667 | 4.372611 |
Creature Comforts (1990) | 4.513889 | 4.272277 |
Usual Suspects, The (1995) | 4.513317 | 4.518248 |
2.1 measuring rating disagreement
Suppose we want to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in mean ratings, then sort by it:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
Sorting by 'diff' gives the movies with the greatest rating difference that were preferred by women:
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
gender | F | M | diff |
---|---|---|---|
title | |||
Dirty Dancing (1987) | 3.790378 | 2.959596 | -0.830782 |
Jumpin' Jack Flash (1986) | 3.254717 | 2.578358 | -0.676359 |
Grease (1978) | 3.975265 | 3.367041 | -0.608224 |
Little Women (1994) | 3.870588 | 3.321739 | -0.548849 |
Steel Magnolias (1989) | 3.901734 | 3.365957 | -0.535777 |
Anastasia (1997) | 3.800000 | 3.281609 | -0.518391 |
Rocky Horror Picture Show, The (1975) | 3.673016 | 3.160131 | -0.512885 |
Color Purple, The (1985) | 4.158192 | 3.659341 | -0.498851 |
Age of Innocence, The (1993) | 3.827068 | 3.339506 | -0.487561 |
Free Willy (1993) | 2.921348 | 2.438776 | -0.482573 |
Reversing the order of the rows and taking the first 10 gives the movies preferred by male audiences:
sorted_by_diff[::-1][:10]
gender | F | M | diff |
---|---|---|---|
title | |||
Good, The Bad and The Ugly, The (1966) | 3.494949 | 4.221300 | 0.726351 |
Kentucky Fried Movie, The (1977) | 2.878788 | 3.555147 | 0.676359 |
Dumb & Dumber (1994) | 2.697987 | 3.336595 | 0.638608 |
Longest Day, The (1962) | 3.411765 | 4.031447 | 0.619682 |
Cable Guy, The (1996) | 2.250000 | 2.863787 | 0.613787 |
Evil Dead II (Dead By Dawn) (1987) | 3.297297 | 3.909283 | 0.611985 |
Hidden, The (1987) | 3.137931 | 3.745098 | 0.607167 |
Rocky III (1982) | 2.361702 | 2.943503 | 0.581801 |
Caddyshack (1980) | 3.396135 | 3.969737 | 0.573602 |
For a Few Dollars More (1965) | 3.409091 | 3.953795 | 0.544704 |
If you just want to find out the most divergent movies (regardless of gender), you can calculate the variance or standard deviation of score data:
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]
title Dumb & Dumber (1994) 1.321333 Blair Witch Project, The (1999) 1.316368 Natural Born Killers (1994) 1.307198 Tank Girl (1995) 1.277695 Rocky Horror Picture Show, The (1975) 1.260177 Eyes Wide Shut (1999) 1.259624 Evita (1996) 1.253631 Billy Madison (1995) 1.249970 Fear and Loathing in Las Vegas (1998) 1.246408 Bicentennial Man (1999) 1.245533 Name: rating, dtype: float64
Example 3. Data analysis of infant names in the United States from 1880 to 2010
The U.S. Social Security Administration (SSA) makes available data on the frequency of baby names from 1880 through the present.
Since this is a nicely formatted comma-separated file, you can load it into a DataFrame with pandas.read_csv:
import pandas as pd

names1880 = pd.read_csv('../datasets/babynames/yob1880.txt',
                        names=['name', 'sex', 'births'])
names1880.head()
name | sex | births | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
These files only contain names that occurred at least five times in a given year. For simplicity, we can use the sum of the births column grouped by sex as the total number of births in that year:
names1880.groupby('sex').births.sum()
sex F 90993 M 110493 Name: births, dtype: int64
Since the dataset is split into files by year, the first task is to assemble all of the data into a single DataFrame and add a year field. This can be done with pandas.concat:
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = '../datasets/babynames/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)
names.head()
name | sex | births | year | |
---|---|---|---|---|
0 | Mary | F | 7065 | 1880 |
1 | Anna | F | 2604 | 1880 |
2 | Emma | F | 2003 | 1880 |
3 | Elizabeth | F | 1939 | 1880 |
4 | Minnie | F | 1746 | 1880 |
There are a couple of things to note here. First, concat combines the DataFrame objects row-wise by default; second, ignore_index=True must be passed because we don't want to preserve the original row numbers returned by read_csv (a tiny illustration of ignore_index follows the preview below). We now have a single, very large DataFrame containing all of the names data:
names.head()
name | sex | births | year | |
---|---|---|---|---|
0 | Mary | F | 7065 | 1880 |
1 | Anna | F | 2604 | 1880 |
2 | Emma | F | 2003 | 1880 |
3 | Elizabeth | F | 1939 | 1880 |
4 | Minnie | F | 1746 | 1880 |
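To see what ignore_index=True does, here is a tiny illustration on two throwaway DataFrames (hypothetical data, not part of the babynames files):

a = pd.DataFrame({'name': ['Mary', 'Anna'], 'births': [7, 3]})
b = pd.DataFrame({'name': ['John', 'William'], 'births': [9, 4]})
pd.concat([a, b]).index.tolist()                     # [0, 1, 0, 1] -- original labels repeat
pd.concat([a, b], ignore_index=True).index.tolist()  # [0, 1, 2, 3] -- fresh RangeIndex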
With this data in hand, we can use groupby or pivot_table to aggregate it at the year and sex level:
total_births = names.pivot_table('births', index='year',
                                 columns='sex', aggfunc=sum)
total_births.tail()
sex | F | M |
---|---|---|
year | ||
2006 | 1896468 | 2050234 |
2007 | 1916888 | 2069242 |
2008 | 1883645 | 2032310 |
2009 | 1827643 | 1973359 |
2010 | 1759010 | 1898382 |
total_births.plot(title='Total births by sex and year')
<matplotlib.axes._subplots.AxesSubplot at 0x164ddf12948>
[Figure: Total births by sex and year]
Let's insert a prop column to hold the ratio of the number of babies with a given name to the total number of births. A prop value of 0.02 means that 2 out of every 100 babies have the current name. Therefore, we first group by year and sex, and then add new columns to each group:
def add_prop(group):
    group['prop'] = group.births / group.births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
names
name | sex | births | year | prop | |
---|---|---|---|---|---|
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
... | ... | ... | ... | ... | ... |
1690779 | Zymaire | M | 5 | 2010 | 0.000003 |
1690780 | Zyonne | M | 5 | 2010 | 0.000003 |
1690781 | Zyquarius | M | 5 | 2010 | 0.000003 |
1690782 | Zyran | M | 5 | 2010 | 0.000003 |
1690783 | Zzyzx | M | 5 | 2010 | 0.000003 |
1690784 rows × 5 columns
When performing a group operation like this, it's usually worth doing a sanity check, such as verifying that the prop column sums to 1 within each group:
names.groupby(['year', 'sex']).prop.sum()
year sex 1880 F 1.0 M 1.0 1881 F 1.0 M 1.0 1882 F 1.0 ... 2008 M 1.0 2009 F 1.0 M 1.0 2010 F 1.0 M 1.0 Name: prop, Length: 262, dtype: float64
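Because these are floating-point sums, a slightly more robust check uses np.allclose, which tolerates tiny rounding error (a sketch, assuming the names DataFrame built above):

import numpy as np
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)  # True if every group sums to ~1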
Done. To facilitate further analysis, I next extract a subset of the data: the top 1,000 names for each sex/year combination. This is yet another group operation:
def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
# Drop the group index, not needed
top1000.reset_index(inplace=True, drop=True)
Now the result data set is much smaller:
top1000
name | sex | births | year | prop | |
---|---|---|---|---|---|
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
... | ... | ... | ... | ... | ... |
261872 | Camilo | M | 194 | 2010 | 0.000102 |
261873 | Destin | M | 194 | 2010 | 0.000102 |
261874 | Jaquan | M | 194 | 2010 | 0.000102 |
261875 | Jaydan | M | 194 | 2010 | 0.000102 |
261876 | Maxton | M | 193 | 2010 | 0.000102 |
261877 rows × 5 columns
The next data analysis work focuses on the top1000 dataset.
3.1 analyzing naming trends
With the complete dataset and the top1000 dataset just generated, we can start to analyze various naming trends. First, divide the first 1000 names into two parts: male and female:
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
These are two simple time series that need only a bit of munging before plotting (for example, the number of babies named John or Mary each year). So we first form a pivot table of total births by year and name:
total_births = top1000.pivot_table('births', index='year',columns='name',aggfunc=sum)
Now let's use DataFrame's plot method to plot a handful of names:
total_births.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 131 entries, 1880 to 2010 Columns: 6868 entries, Aaden to Zuri dtypes: float64(6868) memory usage: 6.9 MB
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD26748>, <matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD2A948>, <matplotlib.axes._subplots.AxesSubplot object at 0x00000164DE59F988>, <matplotlib.axes._subplots.AxesSubplot object at 0x00000164EAD8F808>], dtype=object)
[Figure: Number of births per year for the names John, Harry, Mary, and Marilyn]
From the plots it looks as though these names have fallen out of favor with Americans. But the story is actually more complicated than that, as we'll see in the next section.
Measuring the increase in naming diversity
One explanation for the decrease seen in the plots is that fewer parents are choosing common names for their children. This hypothesis can be explored in the data. One measure is the proportion of births represented by the top 1,000 most popular names, which I aggregate and plot by year and sex:
table = top1000.pivot_table('prop', index='year',
                            columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13),
           xticks=range(1880, 2020, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x164eb0ed788>
[Figure: Sum of table1000.prop by year and sex]
You can see from the figure that name diversity has indeed increased (the total proportion covered by the top 1,000 names has decreased). Another interesting metric is the number of distinct names, taken in order of popularity, needed to account for the top 50% of births. This number is a bit trickier to compute. Let's consider just the boys' names from 2010:
df = boys[boys.year == 2010]
df
name | sex | births | year | prop | |
---|---|---|---|---|---|
260877 | Jacob | M | 21875 | 2010 | 0.011523 |
260878 | Ethan | M | 17866 | 2010 | 0.009411 |
260879 | Michael | M | 17133 | 2010 | 0.009025 |
260880 | Jayden | M | 17030 | 2010 | 0.008971 |
260881 | William | M | 16870 | 2010 | 0.008887 |
... | ... | ... | ... | ... | ... |
261872 | Camilo | M | 194 | 2010 | 0.000102 |
261873 | Destin | M | 194 | 2010 | 0.000102 |
261874 | Jaquan | M | 194 | 2010 | 0.000102 |
261875 | Jaydan | M | 194 | 2010 | 0.000102 |
261876 | Maxton | M | 193 | 2010 | 0.000102 |
1000 rows × 5 columns
After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. Writing a for loop would work, but NumPy offers a vectorized alternative: take the cumulative sum (cumsum) of prop, then use the searchsorted method to find the position where 0.5 would have to be inserted to keep the array sorted:
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum
260877 0.011523 260878 0.020934 260879 0.029959 260880 0.038930 260881 0.047817 ... 261872 0.842748 261873 0.842850 261874 0.842953 261875 0.843055 261876 0.843156 Name: prop, Length: 1000, dtype: float64
prop_cumsum.values.searchsorted(0.5)
116
Since arrays are zero-indexed, we add 1 to this result, giving a final answer of 117. By comparison, the figure for 1900 is much smaller:
df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1
25
You can now apply this calculation to every year/sex combination: groupby on those two fields, then apply a function computing the value for each group:
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
The resulting DataFrame diversity now has two time series, one for each sex, indexed by year. We can inspect it and plot it as before:
diversity.head()
sex | F | M |
---|---|---|
year | ||
1880 | 38 | 14 |
1881 | 38 | 14 |
1882 | 38 | 15 |
1883 | 39 | 15 |
1884 | 39 | 16 |
diversity.plot(title="Number of popular names in top 50%")
<matplotlib.axes._subplots.AxesSubplot at 0x164eaf06308>
[Figure: Number of popular names in top 50%]
As the plot shows, girls' names have always been more diverse than boys' names, and the gap has only widened over time.
'last letter' revolution
In 2007, baby name researcher Laura Wattenberg pointed out on her website that the distribution of boys' names by final letter has changed significantly over the last hundred years. To see this, I first aggregate all of the births by year, sex, and the last letter of each name:
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'

table = names.pivot_table('births', index=last_letters,
                          columns=['sex', 'year'], aggfunc=sum)
Then I select three representative years and print the first few rows:
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()
sex | F | M | ||||
---|---|---|---|---|---|---|
year | 1910 | 1960 | 2010 | 1910 | 1960 | 2010 |
last_letter | ||||||
a | 108376.0 | 691247.0 | 670605.0 | 977.0 | 5204.0 | 28438.0 |
b | NaN | 694.0 | 450.0 | 411.0 | 3912.0 | 38859.0 |
c | 5.0 | 49.0 | 946.0 | 482.0 | 15476.0 | 23125.0 |
d | 6750.0 | 3729.0 | 2607.0 | 22111.0 | 262112.0 | 44398.0 |
e | 133569.0 | 435013.0 | 313833.0 | 28655.0 | 178823.0 | 129012.0 |
Next, we normalize the table by the total births to compute, for each sex, the proportion of total births whose names end in each letter:
subtable.sum()
sex year F 1910 396416.0 1960 2022062.0 2010 1759010.0 M 1910 194198.0 1960 2132588.0 2010 1898382.0 dtype: float64
letter_prop = subtable / subtable.sum()
letter_prop
sex | F | M | ||||
---|---|---|---|---|---|---|
year | 1910 | 1960 | 2010 | 1910 | 1960 | 2010 |
last_letter | ||||||
a | 0.273390 | 0.341853 | 0.381240 | 0.005031 | 0.002440 | 0.014980 |
b | NaN | 0.000343 | 0.000256 | 0.002116 | 0.001834 | 0.020470 |
c | 0.000013 | 0.000024 | 0.000538 | 0.002482 | 0.007257 | 0.012181 |
d | 0.017028 | 0.001844 | 0.001482 | 0.113858 | 0.122908 | 0.023387 |
e | 0.336941 | 0.215133 | 0.178415 | 0.147556 | 0.083853 | 0.067959 |
... | ... | ... | ... | ... | ... | ... |
v | NaN | 0.000060 | 0.000117 | 0.000113 | 0.000037 | 0.001434 |
w | 0.000020 | 0.000031 | 0.001182 | 0.006329 | 0.007711 | 0.016148 |
x | 0.000015 | 0.000037 | 0.000727 | 0.003965 | 0.001851 | 0.008614 |
y | 0.110972 | 0.152569 | 0.116828 | 0.077349 | 0.160987 | 0.058168 |
z | 0.002439 | 0.000659 | 0.000704 | 0.000170 | 0.000184 | 0.001831 |
26 rows × 6 columns
With these letter proportions in hand, we can make bar plots for each sex, broken down by year:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x164eafad088>
[Figure: proportion of boys' and girls' names ending in each letter, for 1910, 1960, and 2010]
As you can see, boys' names ending in 'n' have experienced significant growth since the 1960s. Going back to the full table created earlier, I again normalize by year and sex, select a few letters from the boys' names, and transpose so that each column becomes a time series:
letter_prop = table / table.sum()
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.head()
last_letter | d | n | y |
---|---|---|---|
year | |||
1880 | 0.083055 | 0.153213 | 0.075760 |
1881 | 0.083247 | 0.153214 | 0.077451 |
1882 | 0.085340 | 0.149560 | 0.077537 |
1883 | 0.084066 | 0.151646 | 0.079144 |
1884 | 0.086120 | 0.149915 | 0.080405 |
With this DataFrame of time series in hand, we can plot the trends with its plot method:
dny_ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1648145b448>
[Figure: proportion of boys' names ending in d, n, and y over time]
Boys' names that became girls' names (and vice versa)
Another fun trend is looking at names that were more popular with boys earlier in the sample but have 'changed sexes' in recent years, such as Lesley or Leslie. Going back to the top1000 dataset, we find the set of names containing 'lesl':
all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
lesley_like
632 Leslie 2294 Lesley 4262 Leslee 4728 Lesli 6103 Lesly dtype: object
Then we filter down to just those names and sum births grouped by name to see the relative frequencies:
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
name Leslee 1082 Lesley 35022 Lesli 929 Leslie 370429 Lesly 10067 Name: births, dtype: int64
Next, we aggregate by sex and year and normalize within each year:
table = filtered.pivot_table('births', index='year',
                             columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)
table.tail()
sex | F | M |
---|---|---|
year | ||
2006 | 1.0 | NaN |
2007 | 1.0 | NaN |
2008 | 1.0 | NaN |
2009 | 1.0 | NaN |
2010 | 1.0 | NaN |
Finally, it's easy to plot the breakdown by sex over the years:
table.plot(style={'M': 'k-', 'F': 'k--'})
<matplotlib.axes._subplots.AxesSubplot at 0x164812f6508>
[Figure: proportion of Lesley-like names recorded as male versus female, by year]
Example 4. Data analysis of USDA food database
The United States Department of Agriculture (USDA) produces a database of food nutrient information. Ashley Williams made a JSON version of this data available; records look like this:
{
  "id": 21441,
  "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading",
  "tags": ["KFC"],
  "manufacturer": "Kentucky Fried Chicken",
  "group": "Fast Foods",
  "portions": [
    { "amount": 1, "unit": "wing, with skin", "grams": 68.0 },
    ...
  ],
  "nutrients": [
    { "value": 20.8, "units": "g", "description": "Protein", "group": "Composition" },
    ...
  ]
}
Each food has several identifying attributes and two lists of nutrients and quantities. This form of data is not very suitable for analysis, so we need to do some regularization to make it have a better form.
You can load it into Python with any JSON library you like. I'll use Python's built-in json module:
import json

db = json.load(open('../datasets/usda_food/database.json'))
len(db)
6636
Every entry in db is a dictionary that contains all the data of a certain food. The nutrients field is a list of dictionaries, each of which corresponds to a nutrient:
db[0].keys()
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
db[0]['nutrients'][0]
{'value': 25.18, 'units': 'g', 'description': 'Protein', 'group': 'Composition'}
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients
value | units | description | group | |
---|---|---|---|---|
0 | 25.180 | g | Protein | Composition |
1 | 29.200 | g | Total lipid (fat) | Composition |
2 | 3.060 | g | Carbohydrate, by difference | Composition |
3 | 3.280 | g | Ash | Other |
4 | 376.000 | kcal | Energy | Energy |
... | ... | ... | ... | ... |
157 | 1.472 | g | Serine | Amino Acids |
158 | 93.000 | mg | Cholesterol | Other |
159 | 18.584 | g | Fatty acids, total saturated | Other |
160 | 8.275 | g | Fatty acids, total monounsaturated | Other |
161 | 0.830 | g | Fatty acids, total polyunsaturated | Other |
162 rows × 4 columns
When converting a list of dictionaries to a DataFrame, you can specify a subset of fields to extract. Here we'll take the food's name, group, id, and manufacturer:
info_keys = ['description', 'group', 'id', 'manufacturer']
info = pd.DataFrame(db, columns=info_keys)
info[:5]
description | group | id | manufacturer | |
---|---|---|---|---|
0 | Cheese, caraway | Dairy and Egg Products | 1008 | |
1 | Cheese, cheddar | Dairy and Egg Products | 1009 | |
2 | Cheese, edam | Dairy and Egg Products | 1018 | |
3 | Cheese, feta | Dairy and Egg Products | 1019 | |
4 | Cheese, mozzarella, part skim milk | Dairy and Egg Products | 1028 |
info.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6636 entries, 0 to 6635 Data columns (total 4 columns): description 6636 non-null object group 6636 non-null object id 6636 non-null int64 manufacturer 5195 non-null object dtypes: int64(1), object(3) memory usage: 207.5+ KB
With value_counts, you can check the distribution of food groups:
pd.value_counts(info.group)[:10]
Vegetables and Vegetable Products 812 Beef Products 618 Baked Products 496 Breakfast Cereals 403 Fast Foods 365 Legumes and Legume Products 365 Lamb, Veal, and Game Products 345 Sweets 341 Pork Products 328 Fruits and Fruit Juices 328 Name: group, dtype: int64
Now, to do some analysis across all of the nutrient data, it's easiest to assemble the nutrients of every food into a single large table. This takes a few steps: first, convert each food's list of nutrients to a DataFrame, add a column for the food id, and append the DataFrame to a list; then concatenate everything together with concat:
pieces = []
for i in db:
    frame = pd.DataFrame(i['nutrients'])
    frame['id'] = i['id']
    pieces.append(frame)

nutrients = pd.concat(pieces, ignore_index=True)
nutrients
value | units | description | group | id | |
---|---|---|---|---|---|
0 | 25.180 | g | Protein | Composition | 1008 |
1 | 29.200 | g | Total lipid (fat) | Composition | 1008 |
2 | 3.060 | g | Carbohydrate, by difference | Composition | 1008 |
3 | 3.280 | g | Ash | Other | 1008 |
4 | 376.000 | kcal | Energy | Energy | 1008 |
... | ... | ... | ... | ... | ... |
389350 | 0.000 | mcg | Vitamin B-12, added | Vitamins | 43546 |
389351 | 0.000 | mg | Cholesterol | Other | 43546 |
389352 | 0.072 | g | Fatty acids, total saturated | Other | 43546 |
389353 | 0.028 | g | Fatty acids, total monounsaturated | Other | 43546 |
389354 | 0.041 | g | Fatty acids, total polyunsaturated | Other | 43546 |
389355 rows × 5 columns
I noticed that this DataFrame contains some duplicate entries, so it's easiest to simply drop them:
nutrients.duplicated().sum() # Count duplicates
14179
nutrients = nutrients.drop_duplicates() # Delete duplicates
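A quick sanity check (not shown in the original output) confirms that the duplicates are gone and that the row count dropped accordingly:

nutrients.duplicated().sum()  # expected: 0
len(nutrients)                # expected: 389355 - 14179 = 375176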
Since both DataFrame objects have 'group' and 'description' columns, we rename them to make it clear which is which:
col_mapping = {'description': 'food',
               'group': 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6636 entries, 0 to 6635 Data columns (total 4 columns): food 6636 non-null object fgroup 6636 non-null object id 6636 non-null int64 manufacturer 5195 non-null object dtypes: int64(1), object(3) memory usage: 207.5+ KB
col_mapping = {'description': 'nutrient',
               'group': 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients
value | units | nutrient | nutgroup | id | |
---|---|---|---|---|---|
0 | 25.180 | g | Protein | Composition | 1008 |
1 | 29.200 | g | Total lipid (fat) | Composition | 1008 |
2 | 3.060 | g | Carbohydrate, by difference | Composition | 1008 |
3 | 3.280 | g | Ash | Other | 1008 |
4 | 376.000 | kcal | Energy | Energy | 1008 |
... | ... | ... | ... | ... | ... |
389350 | 0.000 | mcg | Vitamin B-12, added | Vitamins | 43546 |
389351 | 0.000 | mg | Cholesterol | Other | 43546 |
389352 | 0.072 | g | Fatty acids, total saturated | Other | 43546 |
389353 | 0.028 | g | Fatty acids, total monounsaturated | Other | 43546 |
389354 | 0.041 | g | Fatty acids, total polyunsaturated | Other | 43546 |
375176 rows × 5 columns
With all of that done, we can merge info with nutrients:
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 375176 entries, 0 to 375175 Data columns (total 8 columns): value 375176 non-null float64 units 375176 non-null object nutrient 375176 non-null object nutgroup 375176 non-null object id 375176 non-null int64 food 375176 non-null object fgroup 375176 non-null object manufacturer 293054 non-null object dtypes: float64(1), int64(1), object(6) memory usage: 25.8+ MB
ndata.iloc[30000]
value 0.04 units g nutrient Glycine nutgroup Amino Acids id 6158 food Soup, tomato bisque, canned, condensed fgroup Soups, Sauces, and Gravies manufacturer Name: 30000, dtype: object
We can now make a plot of median values by food group and nutrient type:
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result['Zinc, Zn'].sort_values().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x164d65f0b48>
[Figure: median zinc ('Zinc, Zn') content by food group]
With a little more effort, you can find which food is most dense in each nutrient:
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

# make the food a little smaller
max_foods.food = max_foods.food.str[:50]
The resulting DataFrame is a bit too large to print in full; here is just the 'Amino Acids' nutrient group:
max_foods.loc['Amino Acids']['food']
nutrient Alanine Gelatins, dry powder, unsweetened Arginine Seeds, sesame flour, low-fat Aspartic acid Soy protein isolate Cystine Seeds, cottonseed flour, low fat (glandless) Glutamic acid Soy protein isolate ... Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Tryptophan Sea lion, Steller, meat with fat (Alaska Native) Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Name: food, Length: 19, dtype: object
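The code above also defined get_minimum, which was never used; an analogous table of minimum values could be built the same way. This is a sketch under the same assumptions (many of the minima are likely to be zero-valued entries):

min_foods = by_nutrient.apply(get_minimum)[['value', 'food']]
min_foods.food = min_foods.food.str[:50]
min_foods.loc['Amino Acids']['food']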
Example 5. Data analysis of the 2012 United States Federal Election Commission database
The Federal Election Commission publishes data on contributions to political campaigns. This includes contributor names, occupation and employer, address, and contribution amount.
Use pandas.read_csv to load the data:
fec = pd.read_csv('../datasets/fec/P00000001-ALL.csv')
fec.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1001731 entries, 0 to 1001730 Data columns (total 16 columns): cmte_id 1001731 non-null object cand_id 1001731 non-null object cand_nm 1001731 non-null object contbr_nm 1001731 non-null object contbr_city 1001712 non-null object contbr_st 1001727 non-null object contbr_zip 1001620 non-null object contbr_employer 988002 non-null object contbr_occupation 993301 non-null object contb_receipt_amt 1001731 non-null float64 contb_receipt_dt 1001731 non-null object receipt_desc 14166 non-null object memo_cd 92482 non-null object memo_text 97770 non-null object form_tp 1001731 non-null object file_num 1001731 non-null int64 dtypes: float64(1), int64(1), object(14) memory usage: 122.3+ MB
fec.head()
cmte_id | cand_id | cand_nm | contbr_nm | contbr_city | contbr_st | contbr_zip | contbr_employer | contbr_occupation | contb_receipt_amt | contb_receipt_dt | receipt_desc | memo_cd | memo_text | form_tp | file_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 250.0 | 20-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
1 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 50.0 | 23-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
2 | C00410118 | P20002978 | Bachmann, Michelle | SMITH, LANIER | LANETT | AL | 3.68633e+08 | INFORMATION REQUESTED | INFORMATION REQUESTED | 250.0 | 05-JUL-11 | NaN | NaN | NaN | SA17A | 749073 |
3 | C00410118 | P20002978 | Bachmann, Michelle | BLEVINS, DARONDA | PIGGOTT | AR | 7.24548e+08 | NONE | RETIRED | 250.0 | 01-AUG-11 | NaN | NaN | NaN | SA17A | 749073 |
4 | C00410118 | P20002978 | Bachmann, Michelle | WARDENBURG, HAROLD | HOT SPRINGS NATION | AR | 7.19016e+08 | NONE | RETIRED | 300.0 | 20-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
A sample record from this DataFrame looks like this:
fec.iloc[123456]
cmte_id C00431445 cand_id P80003338 cand_nm Obama, Barack contbr_nm ELLMAN, IRA contbr_city TEMPE ... receipt_desc NaN memo_cd NaN memo_text NaN form_tp SA17A file_num 772372 Name: 123456, Length: 16, dtype: object
You can probably think of many ways to slice and dice this campaign contribution data to extract statistics about donors and contribution patterns. I'll show a number of different analyses that apply the techniques covered so far.
You can see that there is no party affiliation information in the data, so it would be useful to add it. You can get a list of all the unique candidates using unique:
unique_cands = fec.cand_nm.unique() unique_cands
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack', "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy', 'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon', 'Perry, Rick'], dtype=object)
unique_cands[2]
'Obama, Barack'
One way to identify party information is to use a dictionary:
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}
Now, with this mapping and the Series object's map method, you can get a set of party information based on the candidate's name:
fec.cand_nm[123456:123461]
123456 Obama, Barack 123457 Obama, Barack 123458 Obama, Barack 123459 Obama, Barack 123460 Obama, Barack Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)
123456 Democrat 123457 Democrat 123458 Democrat 123459 Democrat 123460 Democrat Name: cand_nm, dtype: object
# Add it as a column
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()
Democrat 593746 Republican 407985 Name: party, dtype: int64
A couple of data preparation points: first, this data includes both contributions and refunds (negative contribution amounts):
(fec.contb_receipt_amt > 0).value_counts()
True 991475 False 10256 Name: contb_receipt_amt, dtype: int64
To simplify the analysis process, I limit the dataset to only positive contributions:
fec = fec[fec.contb_receipt_amt > 0]
Since Barack Obama and Mitt Romney were the two main candidates, I'll also prepare a subset containing only contributions to their campaigns:
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]
5.1 donation statistics by occupation and employer
Donation statistics by occupation is another oft-studied statistic. For example, attorneys tend to donate more to Democrats, while business executives tend to donate more to Republicans. You have no reason to believe me; you can see for yourself in the data. First, the total number of donations by occupation is easy to get:
fec.contbr_occupation.value_counts()[:10]
RETIRED 233990 INFORMATION REQUESTED 35107 ATTORNEY 34286 HOMEMAKER 29931 PHYSICIAN 23432 INFORMATION REQUESTED PER BEST EFFORTS 21138 ENGINEER 14334 TEACHER 13990 CONSULTANT 13273 PROFESSOR 12555 Name: contbr_occupation, dtype: int64
It's not hard to see that many occupations refer to the same basic job type, or that there are several variants of the same thing. The following code snippet cleans up a few of them by mapping from one occupation to another. Note the clever use of dict.get, which allows occupations with no mapping to 'pass through' unchanged:
occ_mapping = {
    'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
    'INFORMATION REQUESTED': 'NOT PROVIDED',
    'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
    'C.E.O.': 'CEO'
}

# If no mapping provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)
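To see the 'pass-through' behavior, here is a quick illustration with hypothetical inputs (not original output):

f('C.E.O.')    # -> 'CEO' (mapped)
f('ENGINEER')  # -> 'ENGINEER' (no mapping, passes through unchanged)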
I do the same with employer information:
emp_mapping = {
    'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
    'INFORMATION REQUESTED': 'NOT PROVIDED',
    'SELF': 'SELF-EMPLOYED',
    'SELF EMPLOYED': 'SELF-EMPLOYED',
}

# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)
Now you can use pivot_table to aggregate the data by party and occupation, then filter down to the occupations that donated at least $2 million overall:
by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm
party | Democrat | Republican |
---|---|---|
contbr_occupation | ||
ATTORNEY | 11141982.97 | 7.477194e+06 |
CEO | 2074974.79 | 4.211041e+06 |
CONSULTANT | 2459912.71 | 2.544725e+06 |
ENGINEER | 951525.55 | 1.818374e+06 |
EXECUTIVE | 1355161.05 | 4.138850e+06 |
... | ... | ... |
PRESIDENT | 1878509.95 | 4.720924e+06 |
PROFESSOR | 2165071.08 | 2.967027e+05 |
REAL ESTATE | 528902.09 | 1.625902e+06 |
RETIRED | 25305116.38 | 2.356124e+07 |
SELF-EMPLOYED | 672393.40 | 1.640253e+06 |
17 rows × 2 columns
It can be easier to look at this data graphically as a bar plot ('barh' means horizontal bar plot):
over_2mm.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x164d535f088>
[Figure: total donations by party for the top occupations]
You might also be interested in the top donor occupations and top employers for Obama and Romney. To find these, first group by candidate name, then use a variant of the top-n method used earlier in this chapter:
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)
Then aggregate by occupation and employer:
grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)
cand_nm contbr_occupation Obama, Barack RETIRED 25305116.38 ATTORNEY 11141982.97 INFORMATION REQUESTED 4866973.96 HOMEMAKER 4248875.80 PHYSICIAN 3735124.94 ... Romney, Mitt HOMEMAKER 8147446.22 ATTORNEY 5364718.82 PRESIDENT 2491244.89 EXECUTIVE 2300947.03 C.E.O. 1968386.11 Name: contb_receipt_amt, Length: 14, dtype: float64
grouped.apply(get_top_amounts, 'contbr_employer', n=10)
cand_nm contbr_employer Obama, Barack RETIRED 22694358.85 SELF-EMPLOYED 17080985.96 NOT EMPLOYED 8586308.70 INFORMATION REQUESTED 5053480.37 HOMEMAKER 2605408.54 ... Romney, Mitt CREDIT SUISSE 281150.00 MORGAN STANLEY 267266.00 GOLDMAN SACH & CO. 238250.00 BARCLAYS CAPITAL 162750.00 H.I.G. CAPITAL 139500.00 Name: contb_receipt_amt, Length: 20, dtype: float64
5.2 bucketing donation amounts
Another useful analysis: use the cut function to discretize the contribution amounts into buckets by size:
bins = np.array([0, 1, 10, 100, 1000, 10000,
                 100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
411 (10, 100] 412 (100, 1000] 413 (100, 1000] 414 (10, 100] 415 (10, 100] ... 701381 (10, 100] 701382 (100, 1000] 701383 (1, 10] 701384 (10, 100] 701385 (100, 1000] Name: contb_receipt_amt, Length: 694282, dtype: category Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
Now we can group the Obama and Romney data by candidate name and bin label to get a histogram of donation sizes:
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)
cand_nm | Obama, Barack | Romney, Mitt |
---|---|---|
contb_receipt_amt | ||
(0, 1] | 493.0 | 77.0 |
(1, 10] | 40070.0 | 3681.0 |
(10, 100] | 372280.0 | 31853.0 |
(100, 1000] | 153991.0 | 43357.0 |
(1000, 10000] | 22284.0 | 26186.0 |
(10000, 100000] | 2.0 | 1.0 |
(100000, 1000000] | 3.0 | NaN |
(1000000, 10000000] | 4.0 | NaN |
This data shows that Obama received a significantly larger number of small donations than Romney. You can also sum the contribution amounts and normalize within each bucket to visualize the percentage of total donations of each size received by each candidate:
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums
cand_nm | Obama, Barack | Romney, Mitt |
---|---|---|
contb_receipt_amt | ||
(0, 1] | 0.805182 | 0.194818 |
(1, 10] | 0.918767 | 0.081233 |
(10, 100] | 0.910769 | 0.089231 |
(100, 1000] | 0.710176 | 0.289824 |
(1000, 10000] | 0.447326 | 0.552674 |
(10000, 100000] | 0.823120 | 0.176880 |
(100000, 1000000] | 1.000000 | NaN |
(1000000, 10000000] | 1.000000 | NaN |
normed_sums[:-2].plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1648ee66d08>
[Figure: percentage of total donation amounts by size bucket for each candidate]
I excluded the two largest bins, because those amounts are not donations by individuals.
Many refinements and improvements could be made to this analysis. For example, you could aggregate donations by donor name and zip code to sort out who made many small donations versus one or more large ones (a rough sketch follows). I strongly encourage you to download the data and explore it yourself.
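As a starting point for that suggestion, here is one possible sketch; the column names come from the dataset loaded above, but the grouping choice and thresholds are arbitrary illustrations:

# Count and total the donations per (contributor name, zip code) pair.
by_donor = fec_mrbo.groupby(['contbr_nm', 'contbr_zip']).contb_receipt_amt
donor_stats = by_donor.agg(['count', 'sum'])

# Illustrative cutoffs: frequent small donors vs. one-off large donors.
repeat_small = donor_stats[(donor_stats['count'] >= 5) & (donor_stats['sum'] < 1000)]
one_off_large = donor_stats[(donor_stats['count'] == 1) & (donor_stats['sum'] >= 2500)]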
5.3 donation statistics by state
Aggregating the data by candidate and state is a routine operation:
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]
cand_nm | Obama, Barack | Romney, Mitt |
---|---|---|
contbr_st | ||
AK | 281840.15 | 86204.24 |
AL | 543123.48 | 527303.51 |
AR | 359247.28 | 105556.00 |
AZ | 1506476.98 | 1888436.23 |
CA | 23824984.24 | 11237636.60 |
CO | 2132429.49 | 1506714.12 |
CT | 2068291.26 | 3499475.45 |
DC | 4373538.80 | 1025137.50 |
DE | 336669.14 | 82712.00 |
FL | 7318178.58 | 8338458.81 |
If you divide each row by the total contribution amount, you get the relative percentage of total donations by candidate for each state:
percent = totals.div(totals.sum(1), axis=0)
percent[:10]
cand_nm | Obama, Barack | Romney, Mitt |
---|---|---|
contbr_st | ||
AK | 0.765778 | 0.234222 |
AL | 0.507390 | 0.492610 |
AR | 0.772902 | 0.227098 |
AZ | 0.443745 | 0.556255 |
CA | 0.679498 | 0.320502 |
CO | 0.585970 | 0.414030 |
CT | 0.371476 | 0.628524 |
DC | 0.810113 | 0.189887 |
DE | 0.802776 | 0.197224 |
FL | 0.467417 | 0.532583 |