Fundamentals of Machine Learning - Basic Use of pandas

Keywords: Python, pandas, Excel

Introduction to pandas

pandas (the Python Data Analysis Library) is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and several standard data models, and provides the tools needed to operate on large data sets efficiently. pandas offers a large number of functions and methods that let us process data quickly and conveniently.

Data structure of pandas:

Series: A one-dimensional array, similar to a one-dimensional array in NumPy and also to Python's built-in List. The difference is that the elements of a List may have different data types, while an Array or Series only stores elements of the same data type, so memory can be used more effectively and operations run faster.

Time-Series: Series indexed by time.

DataFrame: A two-dimensional tabular data structure. Many of its functions are similar to R's data.frame. A DataFrame can be understood as a container for Series. The DataFrame is the main focus below.

Panel: A three-dimensional array; it can be understood as a container for DataFrames.
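As a quick illustration of the first two structures, both can be constructed directly from plain Python data (a minimal sketch with made-up values):

```python
import pandas as pd

# A Series: one-dimensional, homogeneously typed, with an index
s = pd.Series([1.0, 2.5, 3.3], index=["a", "b", "c"])

# A DataFrame: a two-dimensional table; each column is itself a Series
df = pd.DataFrame({"name": ["butter", "cheese"], "kcal": [717, 371]})

print(s["b"])             # access a Series value by index label
print(type(df["kcal"]))   # each DataFrame column is a pandas Series
```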

This article mainly introduces DataFrame and Series, including how to load data into a DataFrame.

The address of the data files used in this article: Basic use of pandas.zip

This article only introduces the basic use of pandas through examples. For further study, please refer to the official pandas documentation.

DataFrame in pandas

Using pandas, we can easily perform routine operations on two-dimensional table structures.

1. Use pandas to read CSV (or Excel, etc.) files

import pandas
food_info = pandas.read_csv("food_info.csv")          # read the csv file
# to read an Excel file, use pandas.read_excel() instead
print(type(food_info))           # food_info is a DataFrame object
print(food_info.dtypes)          # data type of each column
<class 'pandas.core.frame.DataFrame'>
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object
Output
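If the data file is not at hand, read_csv also accepts any file-like object, so the same call can be tried on an in-memory string. A sketch with two made-up rows mimicking the shape of food_info.csv:

```python
import io
import pandas as pd

# hypothetical rows; the real food_info.csv has 36 columns
csv_text = (
    "NDB_No,Shrt_Desc,Water_(g)\n"
    "1001,BUTTER WITH SALT,15.87\n"
    "1002,BUTTER WHIPPED WITH SALT,16.72\n"
)
df = pd.read_csv(io.StringIO(csv_text))   # read_csv accepts paths and file-like objects
print(df.dtypes)   # pandas infers int64 / object / float64 per column
```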

2. Accessing data

food_info.head(10)     # get the first 10 rows of data; the default is 5 rows
# first_rows = food_info.head()
# first_rows
# food_info.tail(8)     # get the last 8 rows of data; the default is 5 rows
# print(food_info.tail())
print(food_info.columns)    # get each field name of food_info (i.e. the table header)
# print(food_info.shape)    # get the shape of this file: 8618 rows × 36 columns
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
Output 1:

# print(food_info.loc[0])  # get the data in row 0
print(food_info.loc[6000])  # get the data in row 6000
# food_info.loc[10000] would go beyond the length of the data file and raise KeyError: 'the label [10000] is not in the [index]'
NDB_No                                                         18995
Shrt_Desc          KELLOGG'S EGGO BISCUIT SCRAMBLERS BACON EGG & CHS
Water_(g)                                                       42.9
Energ_Kcal                                                       258
Protein_(g)                                                      8.8
Lipid_Tot_(g)                                                    7.9
Ash_(g)                                                          NaN
Carbohydrt_(g)                                                  38.3
Fiber_TD_(g)                                                     2.1
Sugar_Tot_(g)                                                    4.7
Calcium_(mg)                                                     124
Iron_(mg)                                                        2.7
Magnesium_(mg)                                                    14
Phosphorus_(mg)                                                  215
Potassium_(mg)                                                   225
Sodium_(mg)                                                      610
Zinc_(mg)                                                        0.5
Copper_(mg)                                                      NaN
Manganese_(mg)                                                   NaN
Selenium_(mcg)                                                   NaN
Vit_C_(mg)                                                       NaN
Thiamin_(mg)                                                     0.3
Riboflavin_(mg)                                                 0.26
Niacin_(mg)                                                      2.4
Vit_B6_(mg)                                                     0.02
Vit_B12_(mcg)                                                    0.1
Vit_A_IU                                                         NaN
Vit_A_RAE                                                        NaN
Vit_E_(mg)                                                         0
Vit_D_mcg                                                          0
Vit_D_IU                                                           0
Vit_K_(mcg)                                                      NaN
FA_Sat_(g)                                                       4.1
FA_Mono_(g)                                                      1.5
FA_Poly_(g)                                                      1.1
Cholestrl_(mg)                                                    27
Name: 6000, dtype: object
Output 2
# food_info.loc[3:6]  # get rows 3 to 6
two_five_ten = [2, 5, 10]
print(food_info.loc[two_five_ten])  # get rows 2, 5 and 10
    NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
2     1003  BUTTER OIL ANHYDROUS       0.24         876         0.28   
5     1006           CHEESE BRIE      48.42         334        20.75   
10    1011          CHEESE COLBY      38.20         394        23.76   

    Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
2           99.48     0.00            0.00           0.0           0.00   
5           27.68     2.70            0.45           0.0           0.45   
10          32.11     3.36            2.57           0.0           0.52   

         ...        Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \
2        ...          3069.0      840.0        2.80        1.8      73.0   
5        ...           592.0      174.0        0.24        0.5      20.0   
10       ...           994.0      264.0        0.28        0.6      24.0   

    Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
2           8.6      61.924       28.732        3.694           256.0  
5           2.3      17.410        8.013        0.826           100.0  
10          2.7      20.218        9.280        0.953            95.0  
Output 3
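One point worth noting here: loc selects by label, while iloc selects by integer position; the two only coincide when the index happens to be 0, 1, 2, …. A toy frame (hypothetical data, index labels deliberately out of order) makes the difference visible:

```python
import pandas as pd

# index labels in scrambled order
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

print(df.loc[2, "x"])     # label-based: the row labelled 2
print(df.iloc[2]["x"])    # position-based: the third row
```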

# food_info['Shrt_Desc']  # get the column whose field name is 'Shrt_Desc'
ndb_col = food_info['NDB_No']  # get the column whose field name is 'NDB_No'
# print(ndb_col)
col_name = 'Shrt_Desc'
print(food_info[col_name])
0                                        BUTTER WITH SALT
1                                BUTTER WHIPPED WITH SALT
2                                    BUTTER OIL ANHYDROUS
3                                             CHEESE BLUE
4                                            CHEESE BRICK
5                                             CHEESE BRIE
6                                        CHEESE CAMEMBERT
7                                          CHEESE CARAWAY
8                                          CHEESE CHEDDAR
9                                         CHEESE CHESHIRE
10                                           CHEESE COLBY
11                    CHEESE COTTAGE CRMD LRG OR SML CURD
12                            CHEESE COTTAGE CRMD W/FRUIT
13       CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD
14                       CHEESE COTTAGE LOWFAT 2% MILKFAT
15                       CHEESE COTTAGE LOWFAT 1% MILKFAT
16                                           CHEESE CREAM
17                                            CHEESE EDAM
18                                            CHEESE FETA
19                                         CHEESE FONTINA
20                                         CHEESE GJETOST
21                                           CHEESE GOUDA
22                                         CHEESE GRUYERE
23                                       CHEESE LIMBURGER
24                                        CHEESE MONTEREY
25                             CHEESE MOZZARELLA WHL MILK
26                    CHEESE MOZZARELLA WHL MILK LO MOIST
27                       CHEESE MOZZARELLA PART SKIM MILK
28                   CHEESE MOZZARELLA LO MOIST PART-SKIM
29                                        CHEESE MUENSTER
                              ...                        
8588           BABYFOOD CRL RICE W/ PEARS & APPL DRY INST
8589                       BABYFOOD BANANA NO TAPIOCA STR
8590                       BABYFOOD BANANA APPL DSSRT STR
8591         SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL)
8592    CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD
8593                           POPCORN MICROWAVE LOFAT&NA
8594                         BABYFOOD FRUIT SUPREME DSSRT
8595                                 CHEESE SWISS LOW FAT
8596               BREAKFAST BAR CORN FLAKE CRUST W/FRUIT
8597                              CHEESE MOZZARELLA LO NA
8598                             MAYONNAISE DRSNG NO CHOL
8599                            OIL CORN PEANUT AND OLIVE
8600                     SWEETENERS TABLETOP FRUCTOSE LIQ
8601                                CHEESE FOOD IMITATION
8602                                  CELERY FLAKES DRIED
8603             PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX
8604                      BABYFOOD GRAPE JUC NO SUGAR CND
8605                     JELLIES RED SUGAR HOME PRESERVED
8606                           PIE FILLINGS BLUEBERRY CND
8607                 COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ
8608              PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX
8609    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX
8610    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY...
8611                                   VITAL WHEAT GLUTEN
8612                                        FROG LEGS RAW
8613                                      MACKEREL SALTED
8614                           SCALLOP (BAY&SEA) CKD STMD
8615                                           SYRUP CANE
8616                                            SNAIL RAW
8617                                     TURTLE GREEN RAW
Name: Shrt_Desc, Length: 8618, dtype: object
Output 4

columns = ['Water_(g)', 'Shrt_Desc']
zinc_copper = food_info[columns]  # get the columns named 'Water_(g)' and 'Shrt_Desc'
print(zinc_copper)


# Get the columns ending with "(mg)"
col_names = food_info.columns.tolist()
# print(col_names)
milligram_columns = []
for items in col_names:
    if items.endswith("(mg)"):
        milligram_columns.append(items)
milligram_df = food_info[milligram_columns]
print(milligram_df)
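The same suffix filter can be written more compactly with a list comprehension. A sketch on a toy frame with made-up column names:

```python
import pandas as pd

# hypothetical columns standing in for food_info
df = pd.DataFrame({"Iron_(mg)": [2.7], "Water_(g)": [42.9], "Zinc_(mg)": [0.5]})

# one-line equivalent of the append loop
mg_cols = [c for c in df.columns if c.endswith("(mg)")]
mg_df = df[mg_cols]
print(mg_cols)
```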

 

3. Simple data processing:

import pandas

food_info = pandas.read_csv('food_info.csv')
# food_info.head(3)
# print(food_info.shape) 

# print(food_info['Iron_(mg)'])
# The Iron_(mg) column is in milligrams; to convert it to grams, divide its values by 1000
div_1000 = food_info['Iron_(mg)'] / 1000
# print(div_1000) 

# Combine two columns, operating row by row
water_energy = food_info['Water_(g)'] * food_info['Energ_Kcal']  
# print(food_info.shape)
# Insert a new column with the field name 'water_energy' into the DataFrame, using the values of water_energy
food_info['water_energy'] = water_energy
# print(food_info[['Water_(g)', 'Energ_Kcal', 'water_energy']])
# print(food_info.shape)

# Finding the Maximum of a Column
max_calories = food_info['Energ_Kcal'].max()
# print(max_calories)

# Sort by the specified field. inplace=False returns a new sorted DataFrame; inplace=True sorts the original in place. The default sort order is ascending
# food_info.sort_values('Sodium_(mg)', inplace=True)
# print(food_info['Sodium_(mg)'])
a = food_info.sort_values('Sodium_(mg)', inplace=False, ascending=False)  # ascending=False gives a descending sort
# print(food_info['Sodium_(mg)'])
# print(a['Sodium_(mg)'])
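The inplace distinction can be checked on a toy column (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"Sodium_(mg)": [30, 10, 20]})
out = df.sort_values("Sodium_(mg)", ascending=False, inplace=False)

print(out["Sodium_(mg)"].tolist())   # sorted copy, descending
print(df["Sodium_(mg)"].tolist())    # original left untouched
```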

 

4. Routine operations on data

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
# titanic_survival.head()

age = titanic_survival['Age']
# print(age.loc[0:10])
age_is_null = pd.isnull(age)    # element-wise null check; the boolean result can be used as an index
# print(age_is_null)
age_null_true = age[age_is_null]   # get the entries whose values are null
# print(age_null_true)
print(len(age_null_true))     # count how many null values there are altogether


# Compute the mean using only the non-null values
good_ages = age[age_is_null == False]     # get the entries whose values are not null
# print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)   # compute the mean manually
print(correct_mean_age)
# Or use the built-in pandas mean function, which skips null values automatically
correct_mean_age = age.mean()   # compute the mean, ignoring null values
print(correct_mean_age)
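That mean() skips nulls can be verified on a small hand-made Series:

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])

good = age[age.notnull()]
manual = sum(good) / len(good)    # manual mean over the 3 non-null values
print(manual, age.mean())         # both give the same result
```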


# pivot_table averages by default, so when the goal is a mean the aggfunc parameter can be omitted
# index tells the method which column to group by
# values is the column that we want to apply the calculation to
# aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)  # compute the mean of Survived separately for each value of the index
print(passenger_survival)

# Group-to-column summation
# port_stats = titanic_survival.pivot_table(index="Embarked", values=['Fare', "Survived"], aggfunc=np.sum)  # sum the fares and the number of survivors, respectively
# print(port_stats)
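Here is the same grouping on a four-row toy frame, where the per-class means are easy to check by hand (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2], "Survived": [1, 0, 1, 1]})
pt = df.pivot_table(index="Pclass", values="Survived")   # aggfunc defaults to the mean
print(pt)
```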


# Discard null data
drop_na_columns = titanic_survival.dropna(axis=1, inplace=False)    # axis=1: drop every column that contains a null value; inplace=False returns a new DataFrame instead of modifying the current one
# print(drop_na_columns)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=['Age', 'Sex'], inplace=False)  # axis=0: drop every row that has a null value in any of the specified subset columns
# print(new_titanic_survival)
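The two axis settings can be checked side by side on a tiny frame (hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan], "Sex": ["m", "f"], "Cabin": [np.nan, np.nan]})

print(df.dropna(axis=1).columns.tolist())       # only 'Sex' has no nulls, so only it survives
print(len(df.dropna(axis=0, subset=["Age"])))   # one row survives the Age filter
```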


# Locate a single value by row and column
row_index_83_age = titanic_survival.loc[83, 'Age']
row_index_766_pclass = titanic_survival.loc[766, 'Pclass']
print(row_index_83_age)
print(row_index_766_pclass)


new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)   # sort the rows by Age in descending order
print(new_titanic_survival[0:10])
print('------------------------>')
titanic_reindexed = new_titanic_survival.reset_index(drop=True)    # Reset the index value for each row
print(titanic_reindexed[0:20])


# Custom functions for each row or column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count, axis=0)    # use the custom function to count the null values in each column
print(column_null_count)


def which_class(row):
    pclass = row['Pclass'] 
    if pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    elif pclass == 3:
        return 'Third Class'
    else:
        return 'Unknown'
classes = titanic_survival.apply(which_class, axis=1)    # apply the custom function to each row to map its Pclass value; note axis=1
print(classes)
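A compact check of the row-wise apply on two made-up rows, with the same mapping written as a dict lookup:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3]})

def which_class(row):
    names = {1: "First Class", 2: "Second Class", 3: "Third Class"}
    return names.get(row["Pclass"], "Unknown")

labels = df.apply(which_class, axis=1)   # axis=1: one call per row
print(labels.tolist())
```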

 

5. Preprocessing data with NumPy after loading

import pandas as pd
import numpy as np

fandango = pd.read_csv('fandango_score_comparison.csv')
# print(type(fandango))
# Returns a new DataFrame indexed by the values of the given column. drop=True discards the column used as the index; drop=False keeps it
fandango_films = fandango.set_index('FILM', drop=False)
# fandango_films
# print(fandango_films.index)

# Obtain data by index
fandango_films["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]

fandango_films.loc['Southpaw (2015)']

movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
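The label-based lookups above can be reproduced on a two-row toy frame (the film names and scores are made up):

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["A (2015)", "B (2015)"], "score": [80, 90]})
by_film = df.set_index("FILM", drop=False)   # drop=False keeps the FILM column in place

print(by_film.loc["B (2015)", "score"])      # look up a row by film title
```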




# def func(coloumn):
#     return np.std(coloumn)

types = fandango_films.dtypes
# print(types)

float_columns = types[types.values == 'float64'].index  # get the names of the columns with a particular data type
# print(float_columns)
float_df = fandango_films[float_columns]        # select only the columns of that type
# print(float_df.dtypes)
# float_df
# print(float_df)
deviations = float_df.apply(lambda x: np.std(x))    # Calculate standard deviation per column
print(deviations)
# print('----------------------->')
# print(float_df.apply(func))
# help(np.std)

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
print(rt_mt_user.apply(np.std, axis=1))   # Calculate standard deviation per row of data
# rt_mt_user.apply(np.std, axis=0)
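The axis convention (0 = per column, 1 = per row) can be verified on a 2×2 toy frame; note that np.std on a plain array uses the population formula (ddof=0):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 4.0]})

per_col = df.apply(lambda s: np.std(s.values), axis=0)   # std of each column
per_row = df.apply(lambda s: np.std(s.values), axis=1)   # std of each row
print(per_col.tolist(), per_row.tolist())
```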

Series in DataFrame

A Series is the data structure of a single row or column in a DataFrame

1. Get a Series object

import pandas as pd
from pandas import Series

fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']   # get the FILM column of fandango
# print(type(series_film))
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']  # get the RottenTomatoes column of fandango
print(series_rt[0:5])

 

2. Some routine operations on Series objects

file_names = series_film.values  # get all values of series_film; the return type is <class 'numpy.ndarray'>
# print(type(file_names))
# print(file_names)
rt_scores = series_rt.values
# print(rt_scores)
series_custom = Series(rt_scores, index=file_names)   # build a new Series with file_names as the index and rt_scores as the values
# help(Series)
print(series_custom[['Top Five (2014)', 'Night at the Museum: Secret of the Tomb (2014)']])  # get data by index
# print(type(series_custom))
print('--------------------------------->')
print(series_custom[5:10])   # Slicing operation


# print(series_custom[["'71 (2015)"]])
original_index = series_custom.index.tolist()   # get all index values as a list
# print(original_index)
sorted_index = sorted(original_index)   # sort the list
# print(sorted_index)
sorted_by_index = series_custom.reindex(sorted_index)   # reindex series_custom with the sorted list
print(sorted_by_index)



sc2 = series_custom.sort_index()     # sort the whole of series_custom by index, ascending
# print(sc2)
sc3 = series_custom.sort_values(ascending=False)   # sort the whole of series_custom by value, descending
print(sc3)
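Both sorts are easy to check on a three-element toy Series:

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])

print(s.sort_index().index.tolist())             # labels in ascending order
print(s.sort_values(ascending=False).tolist())   # values in descending order
```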



import numpy as np
# print(np.add(series_custom, series_custom))   # numpy functions operate on series_custom element-wise
print(np.sin(series_custom))
print(np.max(series_custom))


# series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]    # get the entries of series_custom whose values are greater than 50
# series_greater_than_50



criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]   # get the entries of series_custom whose values are greater than 50 and less than 75
print(both_criteria)
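A minimal check of the combined filter on hand-made values. Note that boolean masks are combined with & (not the Python `and`), and each comparison needs its own parentheses because & binds more tightly:

```python
import pandas as pd

s = pd.Series([40, 60, 80])
both = s[(s > 50) & (s < 75)]   # & combines the two boolean masks element-wise
print(both.tolist())
```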




rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2  # the sum of the rt_critics and rt_users values, divided by 2
print(rt_mean)
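One detail hidden in this addition: Series arithmetic aligns on index labels, not on positions, and a label present on only one side produces NaN. A toy check (film names are made up):

```python
import pandas as pd

a = pd.Series([80, 90], index=["Film A", "Film B"])
b = pd.Series([70, 50], index=["Film B", "Film C"])

mean = (a + b) / 2
print(mean)   # only the shared label 'Film B' gets a number; the others are NaN
```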

Posted by toms on Sat, 11 May 2019 12:04:20 -0700