How to replace nan in dataframe?

Keywords: Lambda

In data cleaning and other work, the essential link is missing value processing. When using pandas to read or process data, the missing value of dataframe is filled with nan by default. But in most cases, we need None or Null instead of nan. So, how to replace nan in dataframe?

There are many ways to replace nan. This paper summarizes three methods. dataframe.fillna() method, dataframe.applymap() and dataframe.where() methods. This paper analyzes each method through test data. The final choice of which method to use can be combined with the analysis of this paper, according to the real scene for choice.

1. Prepare test data

types = {'name': str, 'cost': float, 'age': float, 'phone': str}
# Note that age cannot be specified as int type,
# Read data of specified type
data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0, dtype=types)
#View data
print(data)
# View data types for each column
print(data.dtypes)

Note: the age column contains missing values, so it cannot be specified as int type, otherwise an error will be reported: valueerror: unable to convert column age to type < class' Int '>

Output results:

  name          cost   age        phone
0   indigo plant           NaN  25.0  12341234123
1   Xiao Hong 1.628771e+06  23.0  12341234124
2   Xiao Ming 8.438553e+05  26.0  12341234125
3   Xiao Li 1.699444e+06  25.0  12341234126
4   Xiao Wang 2.635745e+06   NaN  12341234127
5   Xiao Zhang 1.168142e+06  25.0          nan
6   Xiao Liu 1.607670e+06  28.0  12341234129
7  nan  1.234523e+04  27.0          nan
name      object
cost     float64
age      float64
phone     object
dtype: object

Process finished with exit code 0

2.dataframe.fillna(); this method populates the NA / NaN value with the specified method

values = {'name': '', 'cost': 0, 'age': -1, 'phone': ''}
data.fillna(value=values,inplace=True)
print(data)

Output results:

  name          cost   age        phone
0   Xiaoqing 0.000000e+00  25.0  12341234123
1   Xiao Hong 1.628771e+06  23.0  12341234124
2   Xiao Ming 8.438553e+05  26.0  12341234125
3   Xiao Li 1.699444e+06  25.0  12341234126
4   Xiao Wang 2.635745e+06  -1.0  12341234127
5   Xiao Zhang 1.168142e+06  25.0          nan
6   Xiao Liu 1.607670e+06  28.0  12341234129
7  nan  1.234523e+04  27.0          nan

Process finished with exit code 0

From the above output, we can see that the fillna() method must specify a value that is filled and not None, and the filling of object type does not take effect

3.dataframe.applymap() custom processing method

data = data.applymap(lambda x: x if str(x) != 'nan' else None)
# View data
print(data)

Output results:

   name          cost   age        phone
0    indigo plant           NaN  25.0  12341234123
1    Xiao Hong 1.628771e+06  23.0  12341234124
2    Xiao Ming 8.438553e+05  26.0  12341234125
3    Xiao Li 1.699444e+06  25.0  12341234126
4    Xiao Wang 2.635745e+06   NaN  12341234127
5    Xiao Zhang 1.168142e+06  25.0         None
6    Xiao Liu 1.607670e+06  28.0  12341234129
7  None  1.234523e+04  27.0         None

Process finished with exit code 0

From the above output results, we can see that this method only works for data of string type, and data of numeric type can't work

4.dataframe.where() filters the required data. If it meets the requirements, it returns the original value. If it does not meet the requirements, it fills in with the value of the parameter other. The default value of other is numpy.nan

4.1 do not specify data type

 

data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0)
data = data.where((data.notna()),None)
# View data
print(data)

Output results:

   name         cost   age        phone
0    indigo plant         None    25  1.23412e+10
1    Xiao Hong 1.62877e+06    23  1.23412e+10
2    Xiao Ming       Eight hundred and forty-three thousand eight hundred and fifty-five    261.23412e+10
3    Xiao Li 1.69944e+06    25  1.23412e+10
4    Xiao Wang 2.63574e+06  None  1.23412e+10
5    Xiao Zhang 1.16814e+06    25         None
6    Xiao Liu 1.60767e+06    28  1.23412e+10
7  None      12345.2    27         None

Process finished with exit code 0

Note: when the where method is used in the above code, the name is not specified as str type when reading the test data. If it is specified as str type, the missing value of pandas when reading the column is filled with string 'Nan'. Data.notna() returns True. If the type of name column is not specified, the missing value of the column is filled with numpy.nan of float type, and data.notna() returns False.

4.2 specify the data type and the cond parameter of where method

types = {'name': str, 'cost': float, 'age': float, 'phone': str}
# Read data of specified type
data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0, dtype=types)
data = data.where((data.applymap(lambda x: True if str(x) != 'nan' else False)), None)
# View data
print(data)

Output results:

   name         cost   age        phone
0    indigo plant         None    25  12341234123
1    Xiao Hong 1.62877e+06    23  12341234124
2    Xiao Ming       Eight hundred and forty-three thousand eight hundred and fifty-five    26  12341234125
3    Xiao Li 1.69944e+06    25  12341234126
4    Xiao Wang 2.63574e+06  None  12341234127
5    Xiao Zhang 1.16814e+06    25         None
6    Xiao Liu 1.60767e+06    28  12341234129
7  None      12345.2    27         None

 

96 original articles published, 62 praised, 230000 visitors+
Private letter follow

Posted by JessePHP on Mon, 20 Jan 2020 03:33:10 -0800