How to remove NaN values from data in Pandas

Let's import the library first.

import pandas as pd

I downloaded the movies dataset from Kaggle for this exercise. Let's read the data and look at the first few rows using head(), which returns the first 5 rows by default...

df = pd.read_csv("movies_metadata.csv")
df.head()
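
As a small aside, here is a sketch of how head() behaves with and without an argument. The `demo` frame and its contents are made up for illustration and stand in for the real movies data.

```python
import pandas as pd

# Hypothetical stand-in for the movies data: 12 rows, one column.
demo = pd.DataFrame({"title": [f"Movie {i}" for i in range(12)]})

default_rows = len(demo.head())   # head() with no argument returns 5 rows
ten_rows = len(demo.head(10))     # pass n explicitly to get the first 10 rows
print(default_rows, ten_rows)
```

So if you want to see 10 rows of the real data, df.head(10) should do it.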

Let's find out the names of the columns in the data by using df.columns.

df.columns
Out[12]:
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

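Let's see how much data we have...

df.size
Out[47]:
1091184

Note that df.size returns the total number of cells (rows times columns), not the number of rows. With the 24 columns listed above, that works out to 1091184 / 24 = 45466 rows. To get the row count directly, use len(df) or df.shape[0].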
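
To make the difference between size and the row count concrete, here is a minimal sketch on a tiny made-up frame (the `demo` data is hypothetical, not taken from the movies file). size counts cells, while shape and len() report rows.

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame used only to compare the attributes.
demo = pd.DataFrame({"title": ["A", "B", None], "budget": [1, 2, 3]})

n_cells = demo.size           # rows * columns
n_rows, n_cols = demo.shape   # (rows, columns)
print(n_cells, n_rows, n_cols, len(demo))
```

Dividing size by the number of columns gives the row count, which is how the cell count of the movies data can be converted into rows.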

Let's run a simple query on the data and find all the rows whose title contains "Toy Story". Here is the query to do that...

df[df.title.str.contains('Toy Story', case=False)]

But I got the following error...

ValueError: cannot index with vector containing NA / NaN values
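
The likely cause is that str.contains returns NaN, rather than True or False, for rows where the title is missing, so the mask is not purely boolean. Here is a small sketch of that behaviour. The `titles` Series is made up, and it is built with dtype=object on the assumption that the title column was read as an object column; other string dtypes or pandas versions may behave differently.

```python
import pandas as pd

# Hypothetical titles; dtype=object mimics a text column holding a missing value.
titles = pd.Series(["Toy Story", None, "Jumanji"], dtype=object)

# No na= argument, so the missing title stays missing in the result.
mask = titles.str.contains("Toy Story", case=False)
has_missing = bool(mask.isna().any())

print(mask.tolist())
print(has_missing)
```

A mask that still contains a missing entry cannot be used directly inside df[...], which is what the error message is complaining about.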

To fix the above error, we can either ignore the NaN values when running the command or remove them altogether. Let's try the first idea, which is to ignore the NaN values. The command to do that is the following...

df[df.title.str.contains('Toy Story', case=False) & (df.title.isna()==False)]

To find out how many records we get, we can call Python's len() on the result, since the length of a DataFrame is its number of rows.

len(df[df.title.str.contains('Toy Story', case=False) & (df.title.isna()==False)])
Out[52]:
5

We got 5 rows.
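
Another way to ignore the missing titles, sketched here on made-up data, is the na parameter of str.contains. Passing na=False treats missing values as non-matches, so no separate isna() check is needed. The `demo` frame below is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical frame with one missing title.
demo = pd.DataFrame({
    "title": pd.Series(["Toy Story", None, "Toy Story 2", "Jumanji"], dtype=object)
})

# na=False fills the result for missing titles with False,
# so the mask is fully boolean and safe to index with.
mask = demo["title"].str.contains("Toy Story", case=False, na=False)
matches = demo[mask]
print(len(matches))
```

On the real data the equivalent call would be df[df.title.str.contains('Toy Story', case=False, na=False)].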

The above method ignores the NaN values in the title column. We can also remove all the rows which have NaN values...

df1 = df.dropna()
df1.size
Out[46]:
16632

As we can see above, dropna() removes every row that has at least one NaN value. The size has dropped to 16632 cells, which with 24 columns is 16632 / 24 = 693 rows.
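
Dropping every row with any NaN can throw away a lot of data. If only the title column matters, one option is to restrict dropna() to that column with the subset parameter, and to check first how many values are missing per column. Here is a minimal sketch on a made-up frame (the `demo` data and its values are hypothetical).

```python
import pandas as pd

# Hypothetical frame: one missing title and one missing tagline.
demo = pd.DataFrame({
    "title": ["Toy Story", None, "Jumanji"],
    "tagline": [None, "tagline b", "tagline c"],
})

missing_per_column = demo.isna().sum()     # NaN count for each column
all_complete = demo.dropna()               # drops rows with a NaN in any column
has_title = demo.dropna(subset=["title"])  # drops only rows whose title is NaN

print(missing_per_column.to_dict())
print(len(all_complete), len(has_title))
```

On the movies data, df.dropna(subset=['title']) would keep rows that have a title even when other columns such as homepage or tagline are empty.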