Pandas 15 - missing data NaN
Article directory
Reprinted and adapted from:
https://www.yiibai.com/pandas/python_pandas_missing_data.html
Data loss is always a problem in real life. In the fields of machine learning and data mining, the data quality is poor due to the lack of data, so there are serious problems in the accuracy of model prediction. In these areas, missing value processing is the key to make the model more accurate and effective.
1, When and why is data lost?
Imagine an online survey of a product.
Some people will share their use experience, a few will share how long they have used it, and will not share their personal information.
Many times, people don't share all the information about them. Few people share their experience, but not how long they use the product;
Few people share the time and experience of using the product, but not their personal contact information.
Therefore, in some way or other ways, some data will always be lost, which is a very common phenomenon.
Now let's see how to deal with missing values (such as NA or NaN) using Pandas.
Two, use
0. Data preparation
- Using reindexing, a DataFrame with missing values is created.
- In the output, NaN represents a value that is not a number.
df = pd.DataFrame([[1,3,5,7],[2,4,6,8], [11,18,13,21]], index = ['a', 'c', 'e'], columns = ['one', 'two', 'three', 'four']) df = df.reindex(['a', 'b', 'c', 'd', 'e',]) print(df) ''' one two three four a 1.0 3.0 5.0 7.0 b NaN NaN NaN NaN c 2.0 4.0 6.0 8.0 d NaN NaN NaN NaN e 11.0 18.0 13.0 21.0 '''
1. Isnull() & notnull() check for missing values
To make it easier to detect missing values (and dtypes across different arrays), Pandas provides isnull() and notnull() functions, which are also methods for Series and DataFrame objects-
Example 1
print (df['one'].isnull()) ''' a False b True c False d True e False Name: one, dtype: bool ''' print (df['one'].notnull()) ''' a True b False c True d False e True Name: one, dtype: bool '''
2. Calculation of missing data
- When summing data, NA will be treated as 0
- If the data is all NA, the result will be NA
Example 1
print (df['one'].sum()) # 14.0
Example 2
df1 = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two']) print(df1) ''' one two 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN 5 NaN NaN ''' print(df1['one'].sum()) # 0
3. Clean / fill missing data
Pandas provides various methods to clear missing values.
The fillna() function can "fill" NA values with non null data in several ways, as you will learn and use in the following sections.
3.1 replace NaN with scalar value
The following procedure shows how to replace NaN with 0.
print (df.fillna(0)) ''' one two three four a 1.0 3.0 5.0 7.0 b 0.0 0.0 0.0 0.0 c 2.0 4.0 6.0 8.0 d 0.0 0.0 0.0 0.0 e 11.0 18.0 13.0 21.0 '''
3.2 fill in NA forward and backward
Fill in missing values using the fill concept discussed in the refactoring index chapter.
Method | action |
---|---|
pad/fill | Fill method forward |
bfill/backfill | Fill method backward |
Example 1
print (df.fillna(method='pad')) ''' one two three four a 1.0 3.0 5.0 7.0 b 1.0 3.0 5.0 7.0 c 2.0 4.0 6.0 8.0 d 2.0 4.0 6.0 8.0 e 11.0 18.0 13.0 21.0 '''
Example 2
print (df.fillna(method='backfill')) ''' one two three four a 1.0 3.0 5.0 7.0 b 2.0 4.0 6.0 8.0 c 2.0 4.0 6.0 8.0 d 11.0 18.0 13.0 21.0 e 11.0 18.0 13.0 21.0
4. Missing value
If you want to exclude only missing values, use the dropna function and axis parameters. By default, axis = 0, which is applied on the row, meaning that if any value in the row is NA, the entire row is excluded.
Example 1
print (df.dropna()) ''' one two three four a 1.0 3.0 5.0 7.0 c 2.0 4.0 6.0 8.0 e 11.0 18.0 13.0 21.0 '''
Example 2
print (df.dropna(axis=1)) ''' Empty DataFrame Columns: [] Index: [a, b, c, d, e] '''
5. Replace missing (or) common values
Many times, you have to replace a generic value with a specific value. This can be achieved by applying replacement methods.
Replacing NA with a scalar value is the equivalent behavior of the fillna() function.
Example 1
print (df.replace({1:10,2:60})) ''' one two three four a 10.0 3.0 5.0 7.0 b NaN NaN NaN NaN c 60.0 4.0 6.0 8.0 d NaN NaN NaN NaN e 11.0 18.0 13.0 21.0 '''