Fundamentals of Machine Learning - Basic Use of pandas
Keywords:
Python
Excel
Introduction to pandas
pandas (Python Data Analysis Library) is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, providing the tools needed to operate on large data sets efficiently, and it offers a large number of functions and methods that let us process data quickly and conveniently.
Data structures in pandas:
Series: A one-dimensional array, similar to a one-dimensional array in NumPy and also to Python's basic List structure. The difference is that the elements of a List can have different data types, while an ndarray or Series only stores elements of the same data type, so memory can be used more effectively and operations run more efficiently.
Time-Series: Series indexed by time.
DataFrame: A two-dimensional tabular data structure. Many of its functions are similar to data.frame in R. A DataFrame can be understood as a container for Series. The DataFrame is the main topic below.
Panel: A three-dimensional array, which can be understood as a container for DataFrames. (A short construction example follows this list.)
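To make these structures concrete, here is a minimal sketch that builds a Series and a DataFrame by hand (the column names and values are invented for illustration):
import pandas as pd
s = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c']) # one-dimensional, homogeneous dtype, with an index
df = pd.DataFrame({'name': ['BUTTER WITH SALT', 'CHEESE BRIE'], 'Energ_Kcal': [717, 334]}) # two-dimensional: a container of Series sharing one index
print(s)
print(df)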
This article mainly introduces DataFrame and Series, including how to load data into a DataFrame.
The data files used in this article: Basic use of pandas.zip
This article only introduces the basic use of pandas through examples. For further study, please refer to the official pandas documentation.
DataFrame in pandas
Using pandas, we can easily perform routine operations on two-dimensional tabular data.
1. Using pandas to read CSV (or Excel, etc.) files
import pandas
food_info = pandas.read_csv("food_info.csv") # read a csv file
# to read an Excel file, use pandas.read_excel() instead
print(type(food_info)) # food_info is a DataFrame object
print(food_info.dtypes) # the data type of each column
<class 'pandas.core.frame.DataFrame'>
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
Protein_(g) float64
Lipid_Tot_(g) float64
Ash_(g) float64
Carbohydrt_(g) float64
Fiber_TD_(g) float64
Sugar_Tot_(g) float64
Calcium_(mg) float64
Iron_(mg) float64
Magnesium_(mg) float64
Phosphorus_(mg) float64
Potassium_(mg) float64
Sodium_(mg) float64
Zinc_(mg) float64
Copper_(mg) float64
Manganese_(mg) float64
Selenium_(mcg) float64
Vit_C_(mg) float64
Thiamin_(mg) float64
Riboflavin_(mg) float64
Niacin_(mg) float64
Vit_B6_(mg) float64
Vit_B12_(mcg) float64
Vit_A_IU float64
Vit_A_RAE float64
Vit_E_(mg) float64
Vit_D_mcg float64
Vit_D_IU float64
Vit_K_(mcg) float64
FA_Sat_(g) float64
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object
Output
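As the comment above notes, Excel files are read the same way with pandas.read_excel(); a minimal sketch, assuming a hypothetical food_info.xlsx exists (reading .xlsx files also requires an engine such as openpyxl to be installed):
import pandas
food_info_xlsx = pandas.read_excel("food_info.xlsx") # returns a DataFrame, just like read_csv
print(food_info_xlsx.dtypes)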
2. Accessing data
food_info.head(10) # get the first 10 rows of data; the default is 5 rows
# first_rows = food_info.head()
# first_rows
# food_info.tail(8) # get the last 8 rows of data; the default is 5 rows
# print(food_info.tail())
print(food_info.columns) # get the field names (i.e. the table header) of food_info
# print(food_info.shape) # the structure of this file is 8618 rows × 36 columns
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
Output 1
# print(food_info.loc[0]) # get the data in row 0
print(food_info.loc[6000]) # get the data in row 6000
# food_info.loc[10000] would try to get row 10000; since that exceeds the length of the data file, it raises KeyError: 'the label [10000] is not in the [index]'
NDB_No 18995
Shrt_Desc KELLOGG'S EGGO BISCUIT SCRAMBLERS BACON EGG & CHS
Water_(g) 42.9
Energ_Kcal 258
Protein_(g) 8.8
Lipid_Tot_(g) 7.9
Ash_(g) NaN
Carbohydrt_(g) 38.3
Fiber_TD_(g) 2.1
Sugar_Tot_(g) 4.7
Calcium_(mg) 124
Iron_(mg) 2.7
Magnesium_(mg) 14
Phosphorus_(mg) 215
Potassium_(mg) 225
Sodium_(mg) 610
Zinc_(mg) 0.5
Copper_(mg) NaN
Manganese_(mg) NaN
Selenium_(mcg) NaN
Vit_C_(mg) NaN
Thiamin_(mg) 0.3
Riboflavin_(mg) 0.26
Niacin_(mg) 2.4
Vit_B6_(mg) 0.02
Vit_B12_(mcg) 0.1
Vit_A_IU NaN
Vit_A_RAE NaN
Vit_E_(mg) 0
Vit_D_mcg 0
Vit_D_IU 0
Vit_K_(mcg) NaN
FA_Sat_(g) 4.1
FA_Mono_(g) 1.5
FA_Poly_(g) 1.1
Cholestrl_(mg) 27
Name: 6000, dtype: object
Output 2
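.loc selects rows by label; pandas also provides .iloc, which selects by integer position. The two coincide here only because the default index is 0, 1, 2, …; note also that label slices include both endpoints while position slices exclude the end. A short sketch:
print(food_info.iloc[6000]) # position-based; same row as loc[6000] under the default integer index
print(food_info.loc[3:6].shape) # label slice: both endpoints included, 4 rows
print(food_info.iloc[3:6].shape) # position slice: end excluded, 3 rows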
# food_info.loc[3:6] # gets rows 3 to 6
two_five_ten = [2, 5, 10]
print(food_info.loc[two_five_ten]) # get rows 2, 5 and 10
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
5 1006 CHEESE BRIE 48.42 334 20.75
10 1011 CHEESE COLBY 38.20 394 23.76
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
2 99.48 0.00 0.00 0.0 0.00
5 27.68 2.70 0.45 0.0 0.45
10 32.11 3.36 2.57 0.0 0.52
... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
2 ... 3069.0 840.0 2.80 1.8 73.0
5 ... 592.0 174.0 0.24 0.5 20.0
10 ... 994.0 264.0 0.28 0.6 24.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
2 8.6 61.924 28.732 3.694 256.0
5 2.3 17.410 8.013 0.826 100.0
10 2.7 20.218 9.280 0.953 95.0
Output 3
# food_info['Shrt_Desc'] # gets the column whose field name is 'Shrt_Desc'
ndb_col = food_info['NDB_No'] # gets the column whose field name is 'NDB_No'
# print(ndb_col)
col_name = 'Shrt_Desc'
print(food_info[col_name])
0 BUTTER WITH SALT
1 BUTTER WHIPPED WITH SALT
2 BUTTER OIL ANHYDROUS
3 CHEESE BLUE
4 CHEESE BRICK
5 CHEESE BRIE
6 CHEESE CAMEMBERT
7 CHEESE CARAWAY
8 CHEESE CHEDDAR
9 CHEESE CHESHIRE
10 CHEESE COLBY
11 CHEESE COTTAGE CRMD LRG OR SML CURD
12 CHEESE COTTAGE CRMD W/FRUIT
13 CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD
14 CHEESE COTTAGE LOWFAT 2% MILKFAT
15 CHEESE COTTAGE LOWFAT 1% MILKFAT
16 CHEESE CREAM
17 CHEESE EDAM
18 CHEESE FETA
19 CHEESE FONTINA
20 CHEESE GJETOST
21 CHEESE GOUDA
22 CHEESE GRUYERE
23 CHEESE LIMBURGER
24 CHEESE MONTEREY
25 CHEESE MOZZARELLA WHL MILK
26 CHEESE MOZZARELLA WHL MILK LO MOIST
27 CHEESE MOZZARELLA PART SKIM MILK
28 CHEESE MOZZARELLA LO MOIST PART-SKIM
29 CHEESE MUENSTER
...
8588 BABYFOOD CRL RICE W/ PEARS & APPL DRY INST
8589 BABYFOOD BANANA NO TAPIOCA STR
8590 BABYFOOD BANANA APPL DSSRT STR
8591 SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL)
8592 CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD
8593 POPCORN MICROWAVE LOFAT&NA
8594 BABYFOOD FRUIT SUPREME DSSRT
8595 CHEESE SWISS LOW FAT
8596 BREAKFAST BAR CORN FLAKE CRUST W/FRUIT
8597 CHEESE MOZZARELLA LO NA
8598 MAYONNAISE DRSNG NO CHOL
8599 OIL CORN PEANUT AND OLIVE
8600 SWEETENERS TABLETOP FRUCTOSE LIQ
8601 CHEESE FOOD IMITATION
8602 CELERY FLAKES DRIED
8603 PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX
8604 BABYFOOD GRAPE JUC NO SUGAR CND
8605 JELLIES RED SUGAR HOME PRESERVED
8606 PIE FILLINGS BLUEBERRY CND
8607 COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ
8608 PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX
8609 PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX
8610 PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY...
8611 VITAL WHEAT GLUTEN
8612 FROG LEGS RAW
8613 MACKEREL SALTED
8614 SCALLOP (BAY&SEA) CKD STMD
8615 SYRUP CANE
8616 SNAIL RAW
8617 TURTLE GREEN RAW
Name: Shrt_Desc, Length: 8618, dtype: object
Output 4
columns = ['Water_(g)', 'Shrt_Desc']
zinc_copper = food_info[columns] # get the columns named 'Water_(g)' and 'Shrt_Desc'
print(zinc_copper)
# get the columns whose names end with "(mg)"
col_names = food_info.columns.tolist()
# print(col_names)
milligram_columns = []
for items in col_names:
    if items.endswith("(mg)"):
        milligram_columns.append(items)
milligram_df = food_info[milligram_columns]
print(milligram_df)
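The same column filter can be written more compactly with a list comprehension; a sketch equivalent to the loop above:
milligram_columns = [c for c in food_info.columns if c.endswith("(mg)")]
milligram_df = food_info[milligram_columns]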
3. Simple data processing
import pandas
food_info = pandas.read_csv('food_info.csv')
# food_info.head(3)
# print(food_info.shape)
# print(food_info['Iron_(mg)'])
# the Iron_(mg) column is in milligrams; to convert it to grams, divide its values by 1000
div_1000 = food_info['Iron_(mg)'] / 1000
# print(div_1000)
# combine two columns arithmetically, row by row
water_energy = food_info['Water_(g)'] * food_info['Energ_Kcal']
# print(food_info.shape)
# insert a column with the field name 'water_energy' into the DataFrame; its values are the water_energy data
food_info['water_energy'] = water_energy
# print(food_info[['Water_(g)', 'Energ_Kcal', 'water_energy']])
# print(food_info.shape)
# Finding the Maximum of a Column
max_calories = food_info['Energ_Kcal'].max()
# print(max_calories)
# sort by the specified field; inplace=False returns a new sorted DataFrame, inplace=True sorts the original in place; ascending sort is the default
# food_info.sort_values('Sodium_(mg)', inplace=True)
# print(food_info['Sodium_(mg)'])
a = food_info.sort_values('Sodium_(mg)', inplace=False, ascending=False) # ascending=False sorts in descending order
# print(food_info['Sodium_(mg)'])
# print(a['Sodium_(mg)'])
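sort_values can also sort by several fields at once; a minimal sketch, with an arbitrary choice of tie-breaker column:
b = food_info.sort_values(['Sodium_(mg)', 'Energ_Kcal'], ascending=[False, True]) # sodium descending, ties broken by calories ascending
print(b[['Sodium_(mg)', 'Energ_Kcal']].head())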
4. Routine operations on data
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
# titanic_survival.head()
age = titanic_survival['Age']
# print(age.loc[0:10])
age_is_null = pd.isnull(age) # element-wise test for null values; the resulting boolean Series can be used as an index
# print(age_is_null)
age_null_true = age[age_is_null] # get the entries whose value is null
# print(age_null_true)
print(len(age_null_true)) # count how many null values there are in total
# compute the mean using only the entries that are not null
good_ages = age[age_is_null == False] # get the entries whose values are not null
# print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages) # compute the mean
print(correct_mean_age)
# or use pandas' built-in mean function, which automatically skips null values
correct_mean_age = age.mean() # compute the mean, discarding null values
print(correct_mean_age)
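# instead of discarding nulls, they can also be filled in; a minimal sketch of
# simple mean imputation:
age_filled = age.fillna(correct_mean_age) # replace NaN entries with the mean age
print(pd.isnull(age_filled).sum()) # prints 0: no nulls remain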
# pivot_table averages by default, so when the mean is what we want, the aggfunc parameter can be omitted
# index tells the method which column to group by
# values is the column that we want to apply the calculation to
# aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean) # compute the mean of Survived separately for each Pclass value
print(passenger_survival)
# group-wise summation over several columns
# port_stats = titanic_survival.pivot_table(index="Embarked", values=['Fare', 'Survived'], aggfunc=np.sum) # sum the fares and the survivor counts for each port
# print(port_stats)
# discard null data
drop_na_columns = titanic_survival.dropna(axis=1, inplace=False) # axis=1 drops every column that contains a null value; inplace=False returns a new DataFrame instead of modifying the current one
# print(drop_na_columns)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=['Age', 'Sex'], inplace=False) # axis=0 drops every row whose 'Age' or 'Sex' field is null; subset specifies the columns to check
# print(new_titanic_survival)
# locate a value by row label and column name
row_index_83_age = titanic_survival.loc[83, 'Age']
row_index_766_pclass = titanic_survival.loc[766, 'Pclass']
print(row_index_83_age)
print(row_index_766_pclass)
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False) # sort rows by Age in descending order
print(new_titanic_survival[0:10])
print('------------------------>')
titanic_reindexed = new_titanic_survival.reset_index(drop=True) # reset the row index; drop=True discards the old index instead of keeping it as a column
print(titanic_reindexed[0:20])
# applying custom functions to each row or column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count, axis=0) # apply the custom function column by column (axis=0) to count the nulls in each column
print(column_null_count)
def which_class(row):
    pclass = row['Pclass']
    if pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    elif pclass == 3:
        return 'Third Class'
    else:
        return 'Unknown'
classes = titanic_survival.apply(which_class, axis=1) # apply the custom function row by row, mapping each Pclass value to a label; note axis=1
print(classes)
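For a simple lookup like this, the same labels can be produced without apply; a sketch using Series.map (any Pclass value missing from the dictionary would come out as NaN rather than 'Unknown'):
class_labels = {1: 'First Class', 2: 'Second Class', 3: 'Third Class'}
print(titanic_survival['Pclass'].map(class_labels)) # vectorized lookup, typically faster than apply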
5. Preprocessing loaded data with NumPy
import pandas as pd
import numpy as np
fandango = pd.read_csv('fandango_score_comparison.csv')
# print(type(fandango))
# set_index returns a new DataFrame indexed by the values of the given column; drop=True discards that column from the data, drop=False keeps it
fandango_films = fandango.set_index('FILM', drop=False)
# fandango_films
# print(fandango_films.index)
# Obtain data by index
fandango_films["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]
fandango_films.loc['Southpaw (2015)']
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
# def func(coloumn):
# return np.std(coloumn)
types = fandango_films.dtypes
# print(types)
float_columns = types[types.values == 'float64'].index # get the index (column names) of the float64 columns
# print(float_columns)
float_df = fandango_films[float_columns] # get only the float64 columns
# print(float_df.dtypes)
# float_df
# print(float_df)
deviations = float_df.apply(lambda x: np.std(x)) # Calculate standard deviation per column
print(deviations)
# print('----------------------->')
# print(float_df.apply(func))
# help(np.std)
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
print(rt_mt_user.apply(np.std, axis=1)) # Calculate standard deviation per row of data
# rt_mt_user.apply(np.std, axis=0)
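pandas DataFrames also have a built-in std method; note that it defaults to the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0), so the two only match when the ddof argument is aligned. A short sketch:
print(float_df.std(ddof=0)) # per column, matching the apply(np.std) result above
print(rt_mt_user.std(axis=1, ddof=0)) # per row, matching apply(np.std, axis=1)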
Series in DataFrame
A Series is the data structure of a single row or column of a DataFrame.
1. Getting a Series object
import pandas as pd
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM'] # get the FILM column of fandango
# print(type(series_film))
print(series_film[0:5])
series_rt = fandango['RottenTomatoes'] # get the RottenTomatoes column of fandango
print(series_rt[0:5])
2. Routine operations on Series objects
file_names = series_film.values # get all values of series_film; the return value is a <class 'numpy.ndarray'>
# print(type(file_names))
# print(file_names)
rt_scores = series_rt.values
# print(rt_scores)
series_custom = Series(rt_scores, index=file_names) # build a new Series indexed by file_names, with rt_scores as the values
# help(Series)
print(series_custom[['Top Five (2014)', 'Night at the Museum: Secret of the Tomb (2014)']]) # get data by index
# print(type(series_custom))
print('--------------------------------->')
print(series_custom[5:10]) # Slicing operation
# print(series_custom[["'71 (2015)"]])
original_index = series_custom.index.tolist() # get all index values and convert them to a list
# print(original_index)
sorted_index = sorted(original_index) # sort the list
# print(sorted_index)
sorted_by_index = series_custom.reindex(sorted_index) # reindex series_custom with the sorted list
print(sorted_by_index)
sc2 = series_custom.sort_index() # sort the whole of series_custom by index, in ascending order
# print(sc2)
sc3 = series_custom.sort_values(ascending=False) # sort the whole of series_custom by values, in descending order
print(sc3)
import numpy as np
# print(np.add(series_custom, series_custom)) # use numpy to operate on series_custom element-wise
print(np.sin(series_custom))
print(np.max(series_custom))
# series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50] # get the entries of series_custom whose value is greater than 50
# series_greater_than_50
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two] # get the entries of series_custom whose value is greater than 50 and less than 75
print(both_criteria)
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2 # add rt_critics and rt_users value by value and divide by 2
print(rt_mean)
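The addition above works because arithmetic between Series aligns values by index rather than by position; an index label present in only one operand produces NaN. A tiny sketch with made-up values:
s1 = Series([1, 2], index=['a', 'b'])
s2 = Series([10, 20], index=['b', 'c'])
print(s1 + s2) # a: NaN, b: 12.0, c: NaN -- only 'b' exists in both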