Getting started with Pandas 1 (DataFrame+Series read / write / Index+Select+Assign)

Keywords: Python less

Article catalog

1. Creating, Reading and Writing

one point one Dataframe data framework
one point two Series series
one point three Reading reading reading data

2. Indexing, Selecting, Assigning

two point one Access to python like mode
two point two Unique access mode of pandas

2.2.1 Iloc access based on index
2.2.2 LOC access based on label

two point three set_ Index() set index column
two point four Conditional selection

2.4.1 Boolean ` &, |==`
2.4.2 Pandas built in symbols ` isin, isnull, notnull`

two point five Assigning data assignment

2.5.1 Assignment constant
2.5.2 Sequence of assignment iterations

learn from https://www.kaggle.com/learn/pandas

Next: Getting started with Pandas 2 (DataFunctions+Maps+groupby+sort_values)

1. Creating, Reading and Writing

one point one Dataframe data framework

Create a DataFrame, which is a table with a dictionary inside, key: [value_1,...,value_n]

#%%
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/16 21:10
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: pandasExercise.ipynb
# @Reference: https://www.kaggle.com/learn/pandas
import pandas as pd

#%%
pd.DataFrame({'Yes':[50,22],"No":[131,2]})

fruits = pd.DataFrame([[30, 21],[40, 22]], columns=['Apples', 'Bananas'])

The value in the dictionary can also be: String

pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']})

Index data, index=['index1','index2 ',...]

pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']},
             index=['people1 say','people2 say'])

one point two Series series

Series is a series of data, which can be regarded as a list

pd.Series([5,2,0,1,3,1,4])

0    5
1    2
2    0
3    1
4    3
5    1
6    4
dtype: int64

You can also assign data to Series, but Series has no column name, only the total name
DataFrame is essentially multiple Series glued together

pd.Series([30,40,50],index=['2018 sales volume','2019 sales volume','2020 sales volume'],
          name='Blog visits')
          
2018sales volume    30
2019sales volume    40
2020sales volume    50
Name: Blog visits, dtype: int64

one point three Reading reading reading data

Read the csv ("comma separated values") file, pd.read_csv('file '), stored in a DataFrame

wine_rev = pd.read_csv("winemag-data-130k-v2.csv")

wine_rev.shape # size
(129971, 14)

wine_rev.head() # View top 5 lines

Index columns can be customized_ Col =, can be the sequence number of the column, or the name of the column

wine_rev = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
wine_rev.head()

(the following figure has one less column than the above one, because the index column is defined as 0)

Save, to_csv('xxx.csv )

wine_rev.to_csv('XXX.csv')

2. Indexing, Selecting, Assigning

two point one Access to python like mode

item.col_name # Disadvantage: cannot access the column with the name of space, [] operation can
item['col_name']

wine_rev.country
wine_rev['country']

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

wine_rev['country'][0]   # 'Italy', first column, then row
wine_rev.country[1]  # 'Portugal'

two point two Unique access mode of pandas

2.2.1 Iloc access based on index

To select the first row of data in the DataFrame, we can use the following code:
wine_rev.iloc[0]

country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O'Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

loc and iloc are both row first and column second, which is the opposite of the python operation above

wine_rev.iloc[:,0], get the first column,: indicates all

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

wine_rev.iloc[:3,0],: 3 for [0:3) row 0,1,2

0       Italy
1    Portugal
2          US
Name: country, dtype: object

You can also use the discrete list to get lines, wine_rev.iloc[[1,2],0]

1    Portugal
2          US
Name: country, dtype: object

Take the last few lines, wine_rev.iloc[-5:], line 5 to the end

2.2.2 LOC access based on label

wine_rev.loc[0, 'country'], rows can also use [0,1] to represent discrete rows, and columns cannot use index

'Italy'

wine_ rev.loc [: 3, 'country'], different from iloc, where line 3 is included, and LOC contains the last

0       Italy
1    Portugal
2          US
3          US
Name: country, dtype: object

wine_ rev.loc [1: 3, ['country ',' points']], multiple columns are enclosed in a list

The advantages of LOC, such as the line with the string index, df.loc ['apples':'positions'] can be selected

two point three set_ Index() set index column

set_index() can reset the index, wine_rev.set_index("title")

two point four Conditional selection

2.4.1 Boolean &, |==

wine_ rev.country =='US', find by country, generate Series of True/False, which can be used for loc

0         False
1         False
2          True
3          True
4          True
          ...  
129966    False
129967     True
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

wine_ rev.loc [wine_ rev.country =='US'], select all US lines

wine_ rev.loc [(wine_ rev.country == 'US') & (wine_ rev.points >=90)], US's & with a score of more than 90
You can also use | to represent or (like C + + bit operators)

2.4.2 Pandas built-in symbols isin, isnull, notnull

wine_rev.loc[wine_rev.country.isin(['US','Italy ']]), select only rows of US and Italy

wine_rev.loc[wine_rev.price.notnull(), price is not empty
wine_rev.loc[wine_rev.price.isnull(), with NaN price

two point five Assigning data assignment

2.5.1 Assignment constant

wine_ Rev ['critical '] ='michael', adding a new column
wine_ rev.country ='Ming', the value of the existing column will be directly overwritten

2.5.2 Sequence of assignment iterations

wine_rev['test_id'] = range(len(wine_rev),0,-1)