Basic usage of Pandas

Keywords: Python Lambda

First, Series

pandas.Series(): used to create a one-dimensional array of ndarray s with "axis labels"

Call method:

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)

data: it can be a one-dimensional array created by list, dictionary, scalar value and numpy

Index: the "label" of data is allocated from 0 by default. The length of index and data are the same. If data is a dictionary and index is given, index will replace the key of the dictionary and become the label of value. The default label and the given label coexist.

import pandas as pd
import numpy as np
#List creation a1 = pd.Series([1,2,3], index=['a','b','c']) print(a1) #Scalar value creation a2 = pd.Series(10, index=list('12345')) print(a2) #Dictionary creation a3 = pd.Series({'a':1, 'b':2, 'c':3}) print(a3) a4 = pd.Series({'a':1, 'b':2, 'c':3}, index=['e','b','a']) #index automatically matches the key of dict print(a4) #numpy Establish a5 = pd.Series(np.arange(1,5), index=np.arange(1,5)) print(a5)

Common properties:

index: returns the label of the array

values: returns the value of the array

Name: returns the name of the Series. It can also be used to modify the name of the Series

size: returns the number of elements in the array

value_counts(self, normalize=False, sort=True, ascending=True, bins=None, dropna=True): returns the number of elements. When normalize is True, the proportion of each element is counted; when bins is an integer, the array is discretized into integer segments according to the integer value; when dropna is False, the number of NaN in the array is counted

Common methods:

Series.add(self, other, fill_value=None)

If the tags are the same, the tags will return NaN if they are different; if the fill value is given, the missing value will be filled with fill value

import pandas as pd
import numpy as np

a = pd.Series(np.arange(1,5), index=list('abcd'))
b = pd.Series(np.arange(5,9), index=list('abce'))
print(a.add(b))  #6.0, 8.0, 10.0, NaN, NaN
print(a.add(b, fill_value=0))  #6.0, 8.0, 10.0, 4.0, 8.0

Series.copy(self, deep=True)

Copy of array. The default is deep copy. data and index are shared between the shallow copy array and the original array. If one of them is modified, the other will also be modified. The deep copy will not be affected

import pandas as pd
import numpy as np

a = pd.Series(np.arange(1,5), index=list('abcd'))
b = a.copy()
c = a.copy(deep=False)

a['a'] = 10
print(a)  #10, 2, 3, 4
print(b)  #1, 2, 3, 4
print(c)  #10, 2, 3, 4

Series.get(self, key, default=None)

Get the value corresponding to the key in the array (for example, the column of DataFrame). If not found, return the default value

Indexes:

When an index is given, you can use either the given index or the default index. You can slice through a custom index list

import pandas as pd

a = pd.Series([1,2,3], index=list('abd'))
#Cannot index like this on DataFrame
print(a[0]) #1 
print(a['a']) #1
#Both of the following methods can be used for slicing
print(a['a':'d'])
print(a[['a','b','d']])

Two. DataFrame

pandas.DataFrame(): used to create a tabular data type with row label and column label

Call method:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Data: 2D ndarray array, including Series, array, tuple, list and other data type Dictionaries

index: row label. If it is not defined or data is not provided, it starts from 0 by default

columns: column label. If it is not defined, it starts from 0 by default

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns=list('abc'))
print(df)
# Create from dictionary
d1 = {'one':[1,2,3], 'two':[4,5,6]}
df1 = pd.DataFrame(d1, index=list('abc'))
print(df1)
# Change column index
df2 = pd.DataFrame(d1, columns=['two', 'one'])
print(df2)

A way to change a DataFrame into a list:

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('abcd'))

frame_to_list1 = frame.values.tolist()
frame_to_list2 = frame['a'].tolist()

print(frame_to_list1) #[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(frame_to_list2) #[0, 4, 8]

Common properties:

index: return row label

columns: return column labels

values: returns the value of an element in an array

Shape: returns the shape of the DataFrame

size: returns the number of elements in the ndarray array

at: accessing individual elements through Tags

iat: accessing individual elements through integers

iloc: accessing DataFrame data through integers

loc: access DataFrame data through tags or Boolean arrays

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3,4), 
index=list('abc'), columns=['num','name','sex','age']) #Single label print(frame.loc['a']) #Label list, return a DataFrame print(frame.loc[['a','b'],['num','name']]) #Single label of row and column, output corresponding single element print(frame.loc['a','num']) #When slicing, the start and end of the slice are included in it print(frame.loc['a':'c']) #Boolean list print(frame.loc[[False,True,True]])
#Filter by criteria
print(frame[frame.age>4].iloc[:, :3])
print(frame[frame['age']>4].iloc[:, :3])

#To get the data of a row in DataFrame, use iloc or loc instead of row label

Add a new column:

import pandas as pd

dict = {'one':[1,2,3], 'two':[4,5,6]}
df1 = pd.DataFrame(dict, index=['a','b','c'])
col = pd.Series([7,8,9], index=df1.index)
df1['three'] = col
print(df1)

Common methods:

drop: delete the specified row or column label

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape(3,4),
index=list('123'),columns=list('abcd')) #Delete a,b The two column print(frame.drop(['a','b'],axis=1)) #Delete 1,2 Two elements print(frame.drop(index=['1','2']))
#Permanently delete a column
del frame['d']
print(frame.columns)

reindex: reorder or add a newly defined index

Call method:

DataFrame.reindex(self,labels = None,index = None,column = None,axis = None,method = None,copy = True,fill_value = nan)

index\columns: Custom index for new columns

Fill value: the fill value of the new row and column

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape(3,4),index=list('123'),columns=list('abcd'))
#Reorder row indexes
print(frame.reindex(index=list('321')))
#Add new index column
newc = frame.columns.insert(3,'e')
print(frame.reindex(columns=newc,fill_value='20'))
#Create a new index
new_index = ['0','2','3']
print(frame.reindex(new_index))

head: returns the first n rows of data. The default is 5 rows

Arithmetic operation (index corresponding operation):

add(self, other, axis='columns', level=None, fill_value=None): add operation

sub(self, other, axis='columns', level=None, fill_value=None): subtraction

div(self, other, axis='columns', level=None, fill_value=None): division operation

mul(self, other, axis='columns', level=None, fill_value=None): multiplication

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(frame1)
frame2 = pd.DataFrame(np.arange(9).reshape(3,3))
print(frame2)
#frame1 And frame2 Add corresponding, add the non corresponding place with the filled value, no data loss of the filled value
print(frame1.add(frame2, fill_value=100))

series = pd.Series(np.arange(3))
print(series)
#default frame1 Add each line of series
print(frame1.add(series))
#frame1 Subtract from each column of series
print(frame1.sub(series,axis=0))

append: add a DataFrame to the end of another DataFrame to return a new object

Call method: append(self, other, ignore_index=False)

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(6).reshape(3,2))
frame2 = pd.DataFrame(np.arange(4).reshape(2,2))
print(frame1.append(frame2)) 
#Discard original index, row index starts from 0
print(frame1.append(frame2, ignore_index=True))

info: print brief information of DataFrame

Where: keep the original value when the given condition is true; when the given condition is not true, replace the value where the condition is not true with the given value

corr(self, method='pearson ', min_periods=1): calculates the pairwise correlation of columns, excluding NA and null values

sort_values(by, axis=0, ascending=True): sorts the values. By is a list of labels in one column or several columns

sort_index(axis=0, ascending=True): sort labels

Function application and mapping apply(func, axis=0):

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(3,4),
                     index=list('abc'), columns=['name','sex','age','score'])
print(frame)
frame1 = frame.apply(lambda x: x.max()-x.min())#Line by line, top down
print(frame1)
frame2 = frame.apply((lambda x: x.max()-x.min()), axis=1)
print(frame2)

#Series.map(arg)

import
numpy as np import pandas as pd frame = pd.DataFrame(np.random.randn(3,4), index=list('abc'), columns=['name','sex','age','score']) print(frame) frame1 = frame['score'].map(lambda x: '%.2f'%x) print(frame1)

Group by (by = none, axis = 0): group data as required, and then perform numerical operations

Missing value handling:

dropna(axis=0, how=any): delete the missing value. Axis can be {0, 'index', 1, 'columns'}; how can be {' any ',' all '}. All means delete if all values are missing

fillna(value=None, method=None, axis=None): fill in the missing value in some way. Value cannot be list if scalar, dict, Series, DataFame

replace(to_replace=None,value=None) : fill the value of to replace with the value of value. To replace can be scalar, list, dict

Posted by mikeashfield on Thu, 05 Mar 2020 19:53:46 -0800