Pandas 15 - missing data NaN

Pandas 15 - missing data NaN

Article directory

Reprinted and adapted from:

https://www.yiibai.com/pandas/python_pandas_missing_data.html

Data loss is always a problem in real life. In the fields of machine learning and data mining, the data quality is poor due to the lack of data, so there are serious problems in the accuracy of model prediction. In these areas, missing value processing is the key to make the model more accurate and effective.

1, When and why is data lost?

Imagine an online survey of a product.

Some people will share their use experience, a few will share how long they have used it, and will not share their personal information.

Many times, people don't share all the information about them. Few people share their experience, but not how long they use the product;

Few people share the time and experience of using the product, but not their personal contact information.

Therefore, in some way or other ways, some data will always be lost, which is a very common phenomenon.

Now let's see how to deal with missing values (such as NA or NaN) using Pandas.

Two, use

0. Data preparation

  • Using reindexing, a DataFrame with missing values is created.
  • In the output, NaN represents a value that is not a number.
df = pd.DataFrame([[1,3,5,7],[2,4,6,8],
                   [11,18,13,21]],
                  index = ['a', 'c', 'e'],
                  columns = ['one', 'two', 'three', 'four'])

df = df.reindex(['a', 'b', 'c', 'd', 'e',])
print(df)
'''
    one   two  three  four
a   1.0   3.0    5.0   7.0
b   NaN   NaN    NaN   NaN
c   2.0   4.0    6.0   8.0
d   NaN   NaN    NaN   NaN
e  11.0  18.0   13.0  21.0
'''

1. Isnull() & notnull() check for missing values

To make it easier to detect missing values (and dtypes across different arrays), Pandas provides isnull() and notnull() functions, which are also methods for Series and DataFrame objects-

Example 1

print (df['one'].isnull())
'''
a    False
b     True
c    False
d     True
e    False
Name: one, dtype: bool
'''


print (df['one'].notnull())
'''
a     True
b    False
c     True
d    False
e     True
Name: one, dtype: bool
'''

2. Calculation of missing data

  • When summing data, NA will be treated as 0
  • If the data is all NA, the result will be NA

Example 1

print (df['one'].sum())  # 14.0

Example 2

df1 = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
print(df1)
'''
   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
'''

print(df1['one'].sum())  # 0

3. Clean / fill missing data

Pandas provides various methods to clear missing values.

The fillna() function can "fill" NA values with non null data in several ways, as you will learn and use in the following sections.

3.1 replace NaN with scalar value

The following procedure shows how to replace NaN with 0.

print (df.fillna(0))
'''
    one   two  three  four
a   1.0   3.0    5.0   7.0
b   0.0   0.0    0.0   0.0
c   2.0   4.0    6.0   8.0
d   0.0   0.0    0.0   0.0
e  11.0  18.0   13.0  21.0
'''

3.2 fill in NA forward and backward

Fill in missing values using the fill concept discussed in the refactoring index chapter.

Method action
pad/fill Fill method forward
bfill/backfill Fill method backward

Example 1

print (df.fillna(method='pad'))
'''
    one   two  three  four
a   1.0   3.0    5.0   7.0
b   1.0   3.0    5.0   7.0
c   2.0   4.0    6.0   8.0
d   2.0   4.0    6.0   8.0
e  11.0  18.0   13.0  21.0
'''

Example 2

print (df.fillna(method='backfill'))
'''
    one   two  three  four
a   1.0   3.0    5.0   7.0
b   2.0   4.0    6.0   8.0
c   2.0   4.0    6.0   8.0
d  11.0  18.0   13.0  21.0
e  11.0  18.0   13.0  21.0

4. Missing value

If you want to exclude only missing values, use the dropna function and axis parameters. By default, axis = 0, which is applied on the row, meaning that if any value in the row is NA, the entire row is excluded.

Example 1

print (df.dropna())
'''
    one   two  three  four
a   1.0   3.0    5.0   7.0
c   2.0   4.0    6.0   8.0
e  11.0  18.0   13.0  21.0
'''

Example 2

print (df.dropna(axis=1))
'''
Empty DataFrame
Columns: []
Index: [a, b, c, d, e]
'''

5. Replace missing (or) common values

Many times, you have to replace a generic value with a specific value. This can be achieved by applying replacement methods.

Replacing NA with a scalar value is the equivalent behavior of the fillna() function.

Example 1

print (df.replace({1:10,2:60}))
'''
    one   two  three  four
a  10.0   3.0    5.0   7.0
b   NaN   NaN    NaN   NaN
c  60.0   4.0    6.0   8.0
d   NaN   NaN    NaN   NaN
e  11.0  18.0   13.0  21.0
'''
Published 15 original articles, won praise 1, visited 1672
Private letter follow

Posted by tomtimms on Mon, 27 Jan 2020 04:28:47 -0800