Getting started with Pandas 1 (DataFrame+Series read / write / Index+Select+Assign)

Keywords: Python less

Article catalog

learn from https://www.kaggle.com/learn/pandas

Next: Getting started with Pandas 2 (DataFunctions+Maps+groupby+sort_values)

1. Creating, Reading and Writing

one point one Dataframe data framework

  • Create a DataFrame, which is a table with a dictionary inside, key: [value_1,...,value_n]
#%%
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/16 21:10
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: pandasExercise.ipynb
# @Reference: https://www.kaggle.com/learn/pandas
import pandas as pd

#%%
pd.DataFrame({'Yes':[50,22],"No":[131,2]})

fruits = pd.DataFrame([[30, 21],[40, 22]], columns=['Apples', 'Bananas'])

  • The value in the dictionary can also be: String
pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']})

  • Index data, index=['index1','index2 ',...]
pd.DataFrame({"Michael":['handsome','good'],"Ming":['love basketball','coding']},
             index=['people1 say','people2 say'])

one point two Series series

  • Series is a series of data, which can be regarded as a list
pd.Series([5,2,0,1,3,1,4])

0    5
1    2
2    0
3    1
4    3
5    1
6    4
dtype: int64
  • You can also assign data to Series, but Series has no column name, only the total name
  • DataFrame is essentially multiple Series glued together
pd.Series([30,40,50],index=['2018 sales volume','2019 sales volume','2020 sales volume'],
          name='Blog visits')
          
2018sales volume    30
2019sales volume    40
2020sales volume    50
Name: Blog visits, dtype: int64

one point three Reading reading reading data

  • Read the csv ("comma separated values") file, pd.read_csv('file '), stored in a DataFrame
wine_rev = pd.read_csv("winemag-data-130k-v2.csv")
wine_rev.shape # size
(129971, 14)
wine_rev.head() # View top 5 lines

  • Index columns can be customized_ Col =, can be the sequence number of the column, or the name of the column
wine_rev = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
wine_rev.head()

(the following figure has one less column than the above one, because the index column is defined as 0)

  • Save, to_csv('xxx.csv )
wine_rev.to_csv('XXX.csv')

2. Indexing, Selecting, Assigning

two point one Access to python like mode

item.col_name # Disadvantage: cannot access the column with the name of space, [] operation can
item['col_name']
wine_rev.country
wine_rev['country']

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object
wine_rev['country'][0]   # 'Italy', first column, then row
wine_rev.country[1]  # 'Portugal'

two point two Unique access mode of pandas

2.2.1 Iloc access based on index

  • To select the first row of data in the DataFrame, we can use the following code:

  • wine_rev.iloc[0]

country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O'Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

loc and iloc are both row first and column second, which is the opposite of the python operation above

  • wine_rev.iloc[:,0], get the first column,: indicates all
0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object
  • wine_rev.iloc[:3,0],: 3 for [0:3) row 0,1,2
0       Italy
1    Portugal
2          US
Name: country, dtype: object
  • You can also use the discrete list to get lines, wine_rev.iloc[[1,2],0]
1    Portugal
2          US
Name: country, dtype: object
  • Take the last few lines, wine_rev.iloc[-5:], line 5 to the end

2.2.2 LOC access based on label

  • wine_rev.loc[0, 'country'], rows can also use [0,1] to represent discrete rows, and columns cannot use index
'Italy'
  • wine_ rev.loc [: 3, 'country'], different from iloc, where line 3 is included, and LOC contains the last
0       Italy
1    Portugal
2          US
3          US
Name: country, dtype: object
  • wine_ rev.loc [1: 3, ['country ',' points']], multiple columns are enclosed in a list

  • The advantages of LOC, such as the line with the string index, df.loc ['apples':'positions'] can be selected

two point three set_ Index() set index column

  • set_index() can reset the index, wine_rev.set_index("title")

two point four Conditional selection

2.4.1 Boolean &, |==

  • wine_ rev.country =='US', find by country, generate Series of True/False, which can be used for loc
0         False
1         False
2          True
3          True
4          True
          ...  
129966    False
129967     True
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool
  • wine_ rev.loc [wine_ rev.country =='US'], select all US lines

  • wine_ rev.loc [(wine_ rev.country == 'US') & (wine_ rev.points >=90)], US's & with a score of more than 90
  • You can also use | to represent or (like C + + bit operators)

2.4.2 Pandas built-in symbols isin, isnull, notnull

  • wine_rev.loc[wine_rev.country.isin(['US','Italy ']]), select only rows of US and Italy

  • wine_rev.loc[wine_rev.price.notnull(), price is not empty
  • wine_rev.loc[wine_rev.price.isnull(), with NaN price

two point five Assigning data assignment

2.5.1 Assignment constant

  • wine_ Rev ['critical '] ='michael', adding a new column
  • wine_ rev.country ='Ming', the value of the existing column will be directly overwritten

2.5.2 Sequence of assignment iterations

  • wine_rev['test_id'] = range(len(wine_rev),0,-1)

Next: Getting started with Pandas 2 (DataFunctions+Maps+groupby+sort_values)

Posted by MetaDark on Tue, 19 May 2020 00:11:01 -0700