## Dealing with NaN

### Introduction

NaN was introduced, at least officially, by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is a technical standard for floating-point computation established in 1985 - many years before Python was invented, and even a longer time befor Pandas was created - by the Institute of Electrical and Electronics Engineers (IEEE). It was introduced to solve problems found in many floating point implementations that made them difficult to use reliably and portably.

This standard added NaN to the arithmetic formats: "arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special 'not a number' values (NaNs)"

### 'nan' in Python

Python knows NaN values as well. We can create it with "float":

n1 = float("nan") n2 = float("Nan") n3 = float("NaN") n4 = float("NAN") print(n1, n2, n3, n4)

nan nan nan nan

"nan" is also part of the math module since Python 3.5:

import math n1 = math.nan print(n1) print(math.isnan(n1))

nan True

Warning: Do not perform comparison between "NaN" values or "Nan" values and regular numbers. A simple or simplified reasoning is this: Two things are "not a number", so they can be anything but most probably not the same. Above all there is no way of ordering NaNs:

print(n1 == n2) print(n1 == 0) print(n1 == 100) print(n2 < 0)

False False False False

### NaN in Pandas

#### Example without NaNs

Before we will work with NaN data, we will process a file without any NaN values. The data file temperatures.csv contains the temperature data of six sensors taken every 15 minuts between 6:00 to 19.15 o'clock.

Reading in the data file can be done with the read_csv function:

import pandas as pd df = pd.read_csv("data1/temperatures.csv", sep=";", decimal=",") df.loc[:3]After having executed the Python code above we received the following result:

time | sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | |
---|---|---|---|---|---|---|---|

0 | 06:00:00 | 14.3 | 13.7 | 14.2 | 14.3 | 13.5 | 13.6 |

1 | 06:15:00 | 14.5 | 14.5 | 14.0 | 15.0 | 14.5 | 14.7 |

2 | 06:30:00 | 14.6 | 15.1 | 14.8 | 15.3 | 14.0 | 14.2 |

3 | 06:45:00 | 14.8 | 14.5 | 15.6 | 15.2 | 14.7 | 14.6 |

We want to calculate the avarage temperatures per measuring point over all the sensors. We can use the DataFrame method 'mean'. If we use 'mean' without parameters it will sum up the sensor columns, which isn't what we want, but it may be interesting as well:

df.mean()The code above returned the following:

sensor1 19.775926 sensor2 19.757407 sensor3 19.840741 sensor4 20.187037 sensor5 19.181481 sensor6 19.437037 dtype: float64

average_temp_series = df.mean(axis=1) print(average_temp_series[:8])

0 13.933333 1 14.533333 2 14.666667 3 14.900000 4 15.083333 5 15.116667 6 15.283333 7 15.116667 dtype: float64

sensors = df.columns.values[1:] # all columns except the time column will be removed: df = df.drop(sensors, axis=1) print(df[:5])

time 0 06:00:00 1 06:15:00 2 06:30:00 3 06:45:00 4 07:00:00

We will assign now the average temperature values as a new column 'temperature':

# best practice: df = df.assign(temperature=average_temp_series) # inplace option not available # alternatively: #df.loc[:,"temperature"] = average_temp_series

df[:3]The Python code above returned the following:

time | temperature | |
---|---|---|

0 | 06:00:00 | 13.933333 |

1 | 06:15:00 | 14.533333 |

2 | 06:30:00 | 14.666667 |

#### Example with NaNs

We will use now a data file similar to the previous temperature csv, but this time we will have to cope with NaN data, when the sensors malfunctioned.

We will create a temperature DataFrame, in which some data is not defined, i.e. NaN.

We will use and change the data from the the temperatures.csv file:

temp_df = pd.read_csv("data1/temperatures.csv", sep=";", index_col=0, decimal=",")

We will randomly assign some NaN values into the data frame. For this purpose, we will use the where method from DataFrame. If we apply where to a DataFrame object df, i.e. df.where(cond, other_df), it will return an object of same shape as df and whose corresponding entries are from df where the corresponding element of cond is True and otherwise are taken from other_df.

Before we continue with our task, we will demonstrate the way of working of where with some simple examples:

s = pd.Series(range(5)) s.where(s > 0)

import numpy as np A = np.random.randint(1, 30, (4, 2)) df = pd.DataFrame(A, columns=['Foo', 'Bar']) m = df % 2 == 0 df.where(m, -df, inplace=True) dfThe code above returned the following:

Foo | Bar | |
---|---|---|

0 | -9 | 4 |

1 | -7 | 24 |

2 | -29 | 12 |

3 | -15 | 2 |

For our task, we need to create a DataFrame 'nan_df', which consists purely of NaN values and has the same shape as our temperature DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also need a DataFrame with the conditions "df_bool" as True values. For this purpose we will create a DataFrame with random values between 0 and 1 and by applying 'random_df < 0.8' we get the df_bool DataFrame, in which about 20 % of the values will be True:

random_df = pd.DataFrame(np.random.random(size=temp_df.shape), columns=temp_df.columns.values, index=temp_df.index) nan_df = pd.DataFrame(np.nan, columns=temp_df.columns.values, index=temp_df.index) df_bool = random_df<0.8 df_bool[:5]We received the following output:

sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | |
---|---|---|---|---|---|---|

time | ||||||

06:00:00 | False | True | True | True | True | True |

06:15:00 | True | True | True | True | True | True |

06:30:00 | True | True | False | False | False | False |

06:45:00 | True | False | True | True | True | True |

07:00:00 | False | True | True | False | True | True |

Finally, we have everything toghether to create our DataFrame with distrubed measurements:

disturbed_data = temp_df.where(df_bool, nan_df) disturbed_data.to_csv("data1/temperatures_with_NaN.csv") disturbed_data[:10]We received the following output:

sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | |
---|---|---|---|---|---|---|

time | ||||||

06:00:00 | NaN | 13.7 | 14.2 | 14.3 | 13.5 | 13.6 |

06:15:00 | 14.5 | 14.5 | 14.0 | 15.0 | 14.5 | 14.7 |

06:30:00 | 14.6 | 15.1 | NaN | NaN | NaN | NaN |

06:45:00 | 14.8 | NaN | 15.6 | 15.2 | 14.7 | 14.6 |

07:00:00 | NaN | 14.9 | 15.7 | NaN | 14.0 | 15.3 |

07:15:00 | NaN | 15.2 | 14.6 | 15.3 | 15.5 | 14.9 |

07:30:00 | 15.4 | 15.3 | 15.6 | 15.6 | NaN | 15.1 |

07:45:00 | 15.5 | 14.8 | 15.4 | 15.5 | 14.6 | 14.9 |

08:00:00 | 15.7 | 15.6 | 15.9 | NaN | 15.4 | NaN |

08:15:00 | 15.9 | 15.8 | NaN | 16.9 | NaN | 16.2 |

### Using dropna on the DataFrame

'dropna' is a DataFrame method. If we call this method without arguments, it will return an object where every row is ommitted, in which data are missing, i.e. some value is NaN:

df = disturbed_data.dropna() dfThe previous Python code returned the following result:

sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | |
---|---|---|---|---|---|---|

time | ||||||

06:15:00 | 14.5 | 14.5 | 14.0 | 15.0 | 14.5 | 14.7 |

07:45:00 | 15.5 | 14.8 | 15.4 | 15.5 | 14.6 | 14.9 |

09:00:00 | 16.8 | 17.3 | 17.7 | 17.8 | 15.9 | 16.1 |

10:00:00 | 19.0 | 19.7 | 18.8 | 18.9 | 17.5 | 18.9 |

12:30:00 | 23.6 | 24.2 | 23.6 | 24.1 | 22.1 | 22.5 |

13:45:00 | 22.7 | 23.3 | 22.2 | 22.2 | 21.0 | 23.0 |

14:00:00 | 22.5 | 23.0 | 22.1 | 24.1 | 22.5 | 22.7 |

15:30:00 | 21.4 | 21.3 | 21.7 | 21.9 | 21.0 | 21.7 |

16:15:00 | 20.9 | 20.6 | 20.8 | 20.5 | 20.3 | 21.6 |

17:30:00 | 20.1 | 20.5 | 19.7 | 19.7 | 18.7 | 19.7 |

19:00:00 | 19.2 | 18.7 | 20.1 | 19.9 | 18.3 | 19.3 |

19:15:00 | 19.0 | 19.7 | 18.9 | 19.2 | 18.5 | 19.4 |

'dropna' can also be used to drop all columns in which some values are NaN. This can be achieved by assigning 1 to the axis parameter. The default value is False, as we have seen in our previous example. As every column from our sensors contain NaN values, they will all disappear:

df = disturbed_data.dropna(axis=1) df[:5]The previous code returned the following:

time |
---|

06:00:00 |

06:15:00 |

06:30:00 |

06:45:00 |

07:00:00 |

Let us change our task: We only want to get rid of all the rows, which contain more than one NaN value. The parameter 'thresh' is ideal for this task. It can be set to the minimum number. 'thresh' is set to an integer value, which defines the minimum number of non-NaN values. We have six temperature values in every row. Setting 'thresh' to 5 makes sure that we will have at least 5 valid floats in every remaining row:

cleansed_df = disturbed_data.dropna(thresh=5, axis=0) cleansed_df[:7]This gets us the following output:

sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | |
---|---|---|---|---|---|---|

time | ||||||

06:15:00 | NaN | 14.5 | 14.0 | 15.0 | 14.5 | 14.7 |

06:45:00 | 14.8 | NaN | 15.6 | 15.2 | 14.7 | 14.6 |

07:00:00 | 15.0 | 14.9 | 15.7 | 15.6 | 14.0 | 15.3 |

07:30:00 | 15.4 | 15.3 | 15.6 | 15.6 | 14.7 | 15.1 |

07:45:00 | 15.5 | 14.8 | 15.4 | 15.5 | 14.6 | 14.9 |

08:15:00 | 15.9 | 15.8 | 15.9 | 16.9 | NaN | 16.2 |

08:30:00 | 16.1 | 15.7 | 16.1 | 15.9 | 14.9 | 15.2 |

Now we will calculate the mean values again, but this time on the DataFrame 'cleansed_df', i.e. where we have taken out all the rows, where more than one NaN value occurred.

average_temp_series = cleansed_df.mean(axis=1) sensors = cleansed_df.columns.values df = cleansed_df.drop(sensors, axis=1) # best practice: df = df.assign(temperature=average_temp_series) # inplace option not available df[:6]This gets us the following output:

temperature | |
---|---|

time | |

06:15:00 | 14.540000 |

06:45:00 | 14.980000 |

07:00:00 | 15.083333 |

07:30:00 | 15.283333 |

07:45:00 | 15.116667 |

08:15:00 | 16.140000 |