In data cleaning and other work, the essential link is missing value processing. When using pandas to read or process data, the missing value of dataframe is filled with nan by default. But in most cases, we need None or Null instead of nan. So, how to replace nan in dataframe?
There are many ways to replace nan. This paper summarizes three methods. dataframe.fillna() method, dataframe.applymap() and dataframe.where() methods. This paper analyzes each method through test data. The final choice of which method to use can be combined with the analysis of this paper, according to the real scene for choice.
1. Prepare test data
types = {'name': str, 'cost': float, 'age': float, 'phone': str} # Note that age cannot be specified as int type, # Read data of specified type data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0, dtype=types) #View data print(data) # View data types for each column print(data.dtypes)
Note: the age column contains missing values, so it cannot be specified as int type, otherwise an error will be reported: valueerror: unable to convert column age to type < class' Int '>
Output results:
name cost age phone 0 indigo plant NaN 25.0 12341234123 1 Xiao Hong 1.628771e+06 23.0 12341234124 2 Xiao Ming 8.438553e+05 26.0 12341234125 3 Xiao Li 1.699444e+06 25.0 12341234126 4 Xiao Wang 2.635745e+06 NaN 12341234127 5 Xiao Zhang 1.168142e+06 25.0 nan 6 Xiao Liu 1.607670e+06 28.0 12341234129 7 nan 1.234523e+04 27.0 nan name object cost float64 age float64 phone object dtype: object Process finished with exit code 0
2.dataframe.fillna(); this method populates the NA / NaN value with the specified method
values = {'name': '', 'cost': 0, 'age': -1, 'phone': ''} data.fillna(value=values,inplace=True) print(data)
Output results:
name cost age phone 0 Xiaoqing 0.000000e+00 25.0 12341234123 1 Xiao Hong 1.628771e+06 23.0 12341234124 2 Xiao Ming 8.438553e+05 26.0 12341234125 3 Xiao Li 1.699444e+06 25.0 12341234126 4 Xiao Wang 2.635745e+06 -1.0 12341234127 5 Xiao Zhang 1.168142e+06 25.0 nan 6 Xiao Liu 1.607670e+06 28.0 12341234129 7 nan 1.234523e+04 27.0 nan Process finished with exit code 0
From the above output, we can see that the fillna() method must specify a value that is filled and not None, and the filling of object type does not take effect
3.dataframe.applymap() custom processing method
data = data.applymap(lambda x: x if str(x) != 'nan' else None) # View data print(data)
Output results:
name cost age phone 0 indigo plant NaN 25.0 12341234123 1 Xiao Hong 1.628771e+06 23.0 12341234124 2 Xiao Ming 8.438553e+05 26.0 12341234125 3 Xiao Li 1.699444e+06 25.0 12341234126 4 Xiao Wang 2.635745e+06 NaN 12341234127 5 Xiao Zhang 1.168142e+06 25.0 None 6 Xiao Liu 1.607670e+06 28.0 12341234129 7 None 1.234523e+04 27.0 None Process finished with exit code 0
From the above output results, we can see that this method only works for data of string type, and data of numeric type can't work
4.dataframe.where() filters the required data. If it meets the requirements, it returns the original value. If it does not meet the requirements, it fills in with the value of the parameter other. The default value of other is numpy.nan
4.1 do not specify data type
data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0) data = data.where((data.notna()),None) # View data print(data)
Output results:
name cost age phone 0 indigo plant None 25 1.23412e+10 1 Xiao Hong 1.62877e+06 23 1.23412e+10 2 Xiao Ming Eight hundred and forty-three thousand eight hundred and fifty-five 261.23412e+10 3 Xiao Li 1.69944e+06 25 1.23412e+10 4 Xiao Wang 2.63574e+06 None 1.23412e+10 5 Xiao Zhang 1.16814e+06 25 None 6 Xiao Liu 1.60767e+06 28 1.23412e+10 7 None 12345.2 27 None Process finished with exit code 0
Note: when the where method is used in the above code, the name is not specified as str type when reading the test data. If it is specified as str type, the missing value of pandas when reading the column is filled with string 'Nan'. Data.notna() returns True. If the type of name column is not specified, the missing value of the column is filled with numpy.nan of float type, and data.notna() returns False.
4.2 specify the data type and the cond parameter of where method
types = {'name': str, 'cost': float, 'age': float, 'phone': str} # Read data of specified type data = pd.read_excel('../../test.xlsx', sheet_name='Sheet1', header=0, dtype=types) data = data.where((data.applymap(lambda x: True if str(x) != 'nan' else False)), None) # View data print(data)
Output results:
name cost age phone 0 indigo plant None 25 12341234123 1 Xiao Hong 1.62877e+06 23 12341234124 2 Xiao Ming Eight hundred and forty-three thousand eight hundred and fifty-five 26 12341234125 3 Xiao Li 1.69944e+06 25 12341234126 4 Xiao Wang 2.63574e+06 None 12341234127 5 Xiao Zhang 1.16814e+06 25 None 6 Xiao Liu 1.60767e+06 28 12341234129 7 None 12345.2 27 None