A Guide to Data Cleaning in Python

For an updated version of this guide, please visit Data Cleaning Techniques in Python: the Ultimate Guide.

Before fitting a machine learning or statistical model, we always have to clean the data. No models create meaningful results with messy data.

Data cleaning or cleansing is the process of detecting and correcting [or removing] corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

What a long definition! In practice, data cleaning is certainly not fun, and it can be very time-consuming.

Source: kdnuggets.com

To make it easier, we created this new complete step-by-step guide in Python. You’ll learn techniques on how to find and clean:

  • Missing Data
  • Irregular Data [Outliers]
  • Unnecessary Data — Repetitive Data, Duplicates and more
  • Inconsistent Data — Capitalization, Addresses and more

Further learning: we’ve launched a course Python for Data Analysis: step-by-step with projects, which includes a fully updated section on data cleaning, and more!
Please check it out here.

Within this guide, we use the Russian housing dataset from Kaggle. The goal of this project is to predict housing price fluctuations in Russia. We are not cleaning the entire dataset but will show examples from it.

Before we jump into the cleaning process, let’s take a brief look at the data.
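Here is a minimal sketch of loading and inspecting the data with pandas. The filename sberbank.csv is an assumption; point read_csv at wherever you saved the Kaggle file.

```python
import pandas as pd

# Load the Russian housing training data (filename assumed).
df = pd.read_csv('sberbank.csv')

print(df.shape)    # number of rows and columns
print(df.dtypes)   # numeric vs. categorical (object) features
```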

From these results, we learn that the dataset has 30,471 rows and 292 columns. We can also identify whether each feature is a numeric or categorical variable. This is all useful information.

Now we can run through the checklist of “dirty” data types and fix them one by one.

Let’s get started.


Missing data

Dealing with missing data/values is one of the trickiest yet most common parts of data cleaning. While many models can live with other problems in the data, most models don’t accept missing data.

How to find out?

We cover three techniques to learn more about missing data in our dataset.

  • Technique #1: Missing Data Heatmap

When there is a smaller number of features, we can visualize the missing data via heatmap.
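One possible way to draw such a heatmap with seaborn; the two colors are an assumption chosen to match the description below.

```python
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Plot missingness for the first 30 columns: yellow = missing, blue = present.
cols = df.columns[:30]
cmap = ListedColormap(['#000099', '#ffff00'])
sns.heatmap(df[cols].isnull(), cmap=cmap, cbar=False)
plt.show()
```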

The chart below demonstrates the missing data patterns of the first 30 features. The horizontal axis shows the feature names; the vertical axis shows the observations/rows; yellow indicates missing data, while blue indicates data that is present.

For example, we see that the life_sq feature has missing values throughout many rows, while the floor feature has only a few missing values around the 7,000th row.

Missing Data Heatmap
  • Technique #2: Missing Data Percentage List

When there are many features in the dataset, we can make a list of missing data % for each feature.
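A simple sketch of building that list with pandas:

```python
# Print the share of missing values for each feature.
for col in df.columns:
    pct_missing = df[col].isnull().mean()
    print(f'{col} - {pct_missing:.1%}')
```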

This produces a list below showing the percentage of missing values for each of the features.

Specifically, we see that the life_sq feature has 21% missing, while floor has only 1% missing. This list is a useful summary that can complement the heatmap visualization.

Missing Data % List — the first 30 features
  • Technique #3: Missing Data Histogram

A missing data histogram is another useful technique when we have many features.

To learn more about the missing value patterns among observations, we can visualize them with a histogram.
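One way to sketch this: count the missing values in each row and plot the counts.

```python
import matplotlib.pyplot as plt

# Number of missing features per observation, plotted as a histogram.
missing_per_row = df.isnull().sum(axis=1)
missing_per_row.hist(bins=50)
plt.xlabel('Missing features per observation')
plt.ylabel('Number of observations')
plt.show()
```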

This histogram helps to identify the missing values situations among the 30,471 observations.

For example, there are over 6000 observations with no missing values and close to 4000 observations with one missing value.

Missing Data Histogram

What to do?

There are NO agreed-upon solutions to dealing with missing data. We have to study the specific feature and dataset to decide the best way of handling them.

Below covers the four most common methods of handling missing data. But, if the situation is more complicated than usual, we need to be creative to use more sophisticated methods such as missing data modeling.

  • Solution #1: Drop the Observation

In statistics, this method is called the listwise deletion technique. In this solution, we drop the entire observation as long as it contains a missing value.

We should only do this when we are sure that the missing data is not informative. Otherwise, we should consider other solutions.

There could be other criteria to use to drop the observations.

For example, from the missing data histogram, we notice that only a minimal amount of observations have over 35 features missing altogether. We may create a new dataset df_less_missing_rows deleting observations with over 35 missing features.
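A minimal sketch of that filtering step, using the threshold of 35 from the histogram above:

```python
# Keep only observations with at most 35 missing features.
missing_per_row = df.isnull().sum(axis=1)
df_less_missing_rows = df[missing_per_row <= 35].copy()
print(df_less_missing_rows.shape)
```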

  • Solution #2: Drop the Feature

Similar to Solution #1, we only do this when we are confident that this feature doesn’t provide useful information.

For example, from the missing data % list, we notice that hospital_beds_raion has a high missing value percentage of 47%. We may drop the entire feature.
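For instance, something like the following; the new name df_less_hos_beds is only illustrative:

```python
# Drop a feature that is roughly 47% missing.
df_less_hos_beds = df.drop(columns=['hospital_beds_raion'])
print(df_less_hos_beds.shape)
```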

  • Solution #3: Impute the Missing

When the feature is a numeric variable, we can conduct missing data imputation. We replace the missing values with the mean or median computed from the non-missing values of the same feature.

When the feature is a categorical variable, we may impute the missing data by the mode [the most frequent value].

Using life_sq as an example, we can replace the missing values of this feature by its median.
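A short sketch of median imputation for life_sq:

```python
# Fill missing life_sq values with the feature's median.
med = df['life_sq'].median()
df['life_sq'] = df['life_sq'].fillna(med)
```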

Moreover, we can apply the same imputation strategy for all the numeric features at once.
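One possible way to do this, assuming the numeric columns can be selected by dtype:

```python
# Median imputation for every numeric column in one pass.
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```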

Luckily, our dataset has no missing values for categorical features. Yet, we can still apply the mode imputation strategy to all the categorical features at once.
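A sketch of mode imputation applied to all object (categorical) columns:

```python
# Fill each categorical column with its most frequent value.
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    most_frequent = df[col].mode()
    if not most_frequent.empty:
        df[col] = df[col].fillna(most_frequent[0])
```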

  • Solution #4: Replace the Missing

For categorical features, we can add a new category with a value such as “_MISSING_”. For numerical features, we can replace missing values with a particular value such as -999.
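Here is one possible sketch; it works on a fresh copy of the data so it doesn’t mix with the imputation from Solution #3:

```python
# Keep the missingness as explicit information.
df_flagged = df.copy()

# Numeric features: replace missing values with a sentinel such as -999.
num_cols = df_flagged.select_dtypes(include='number').columns
df_flagged[num_cols] = df_flagged[num_cols].fillna(-999)

# Categorical (object) features: add an explicit '_MISSING_' category.
cat_cols = df_flagged.select_dtypes(include='object').columns
df_flagged[cat_cols] = df_flagged[cat_cols].fillna('_MISSING_')
```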

This way, we are still keeping the missing values as valuable information.

Irregular data [Outliers]

Outliers are data points that are distinctly different from other observations. They could be genuine outliers or mistakes.

How to find out?

Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution to detect outliers.

  • Technique #1: Histogram/Box Plot

When the feature is numeric, we can use a histogram and box plot to detect outliers.

Below is the histogram of feature life_sq.
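A quick way to draw it; the number of bins is arbitrary:

```python
import matplotlib.pyplot as plt

# Histogram of life_sq to inspect its distribution.
df['life_sq'].hist(bins=100)
plt.show()
```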

The data looks highly skewed with the possible existence of outliers.

Histogram

To study the feature more closely, let’s make a box plot.
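For example:

```python
import matplotlib.pyplot as plt

# A box plot makes extreme life_sq values stand out.
df.boxplot(column=['life_sq'])
plt.show()
```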

In this plot, we can see there is an outlier at a value of over 7000.

Box Plot
  • Technique #2: Descriptive Statistics

Also, for numeric features, the outliers can be so distinct that the box plot can’t visualize them clearly. Instead, we can look at their descriptive statistics.
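For example, pandas’ describe() gives a quick summary:

```python
# Summary statistics reveal extreme values even when plots are hard to read.
print(df['life_sq'].describe())
```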

For example, for the feature life_sq again, we can see that the maximum value is 7,478, while the 75th percentile is only 43. The value 7,478 is an outlier.

  • Technique #3: Bar Chart

When the feature is categorical, we can use a bar chart to learn about its categories and distribution.
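One way to draw it, using the ecology feature discussed next:

```python
import matplotlib.pyplot as plt

# Bar chart of category frequencies for the ecology feature.
df['ecology'].value_counts().plot(kind='bar')
plt.show()
```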

For example, the feature ecology has a reasonable distribution. But if there is a category with only one value called “other”, then that would be an outlier.

Bar Chart
  • Other Techniques: Many other techniques can spot outliers as well, such as scatter plot, z-score, and clustering. This article does not cover all of those.

What to do?

While outliers are not hard to detect, we have to determine the right solutions to handle them. It highly depends on the dataset and the goal of the project.

The methods for handling outliers are somewhat similar to those for missing data: we either drop, adjust, or keep them. We can refer back to the missing data section for possible solutions.

Unnecessary data

After all the hard work done for missing data and outliers, let’s look at unnecessary data, which is more straightforward.

All the data feeding into the model should serve the purpose of the project. Data is unnecessary when it doesn’t add value. We cover three main types of unnecessary data, which arise for different reasons.

Unnecessary type #1: Uninformative / Repetitive

Sometimes a feature is uninformative because too many of its rows share the same value.

How to find out?

We can create a list of features with a high percentage of the same value.

For example, below we list the features for which over 95% of rows share the same value.
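A possible sketch of that check, with 95% as the threshold:

```python
# List features where a single value accounts for more than 95% of the rows.
num_rows = len(df)
low_information_cols = []

for col in df.columns:
    counts = df[col].value_counts(dropna=False)
    top_pct = counts.iloc[0] / num_rows
    if top_pct > 0.95:
        low_information_cols.append(col)
        print(f'{col}: {top_pct:.1%} of rows share the value "{counts.index[0]}"')
```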

We can look into these variables one by one to see whether they are informative or not. We won’t show the details here.

What to do?

We need to understand the reasons behind the repetitive feature. When they are genuinely uninformative, we can toss them out.

Unnecessary type #2: Irrelevant

Again, the data needs to provide valuable information for the project. If the features are not related to the question we are trying to solve in the project, they are irrelevant.

How to find out?

We need to skim through the features to identify irrelevant ones.

For example, a feature recording the temperature in Toronto doesn’t provide any useful insights to predict Russian housing prices.

What to do?

When the features are not serving the project’s goal, we can remove them.

Unnecessary type #3: Duplicates

Duplicate data exists when copies of the same observation appear in the dataset.

There are two main types of duplicate data.

  • Duplicates type #1: All Features based

How to find out?

This type of duplicate occurs when the values of all features are identical across observations. It is easy to find.

We first remove the unique identifier id in the dataset. Then we create a dataset called df_dedupped by dropping the duplicates. We compare the shapes of the two datasets [df and df_dedupped] to find out the number of duplicated rows.
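A sketch of those steps:

```python
# Drop the unique identifier, then deduplicate across all remaining columns.
df_dedupped = df.drop(columns=['id']).drop_duplicates()

# Comparing shapes shows how many exact duplicate rows were removed.
print(df.shape)
print(df_dedupped.shape)
```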

10 rows are complete duplicate observations.

What to do?

We should remove these duplicates, which we already did.

  • Duplicates type #2: Key Features based

How to find out?

Sometimes it is better to remove duplicate data based on a set of unique identifiers.

For example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.

We can set up a group of critical features as unique identifiers for transactions. We include timestamp, full_sq, life_sq, floor, build_year, num_room, price_doc. We check if there are duplicates based on them.
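A possible sketch of that check; the key list follows the features named above:

```python
# Key features that together should uniquely identify a transaction.
key = ['timestamp', 'full_sq', 'life_sq', 'floor',
       'build_year', 'num_room', 'price_doc']

# Show every observation that shares the same values for all key features.
dup_mask = df.duplicated(subset=key, keep=False)
print(df[dup_mask].sort_values(by=key))
```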

There are 16 duplicates based on this set of key features.

What to do?

We can drop these duplicates based on the key features.
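For example, reusing the key list from the previous sketch:

```python
# Keep the first occurrence of each key-feature combination.
df_dedupped2 = df.drop_duplicates(subset=key)

print(df.shape)
print(df_dedupped2.shape)
```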

We dropped the 16 duplicates within the new dataset named df_dedupped2.

Inconsistent data

It is also crucial to have the dataset follow specific standards to fit a model. We need to explore the data in different ways to find out the inconsistent data. Much of the time, it depends on observations and experience. There is no set code to run and fix them all.

Below we cover four inconsistent data types.

Inconsistent type #1: Capitalization

Inconsistent use of upper and lower case in categorical values is a common mistake. It can cause issues, since analyses in Python are case sensitive.

How to find out?

Let’s look at the sub_area feature.
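For instance, we can list its values with value_counts:

```python
# Inspect the distinct values of sub_area.
print(df['sub_area'].value_counts(dropna=False))
```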

It stores the name of different areas and looks very standardized.

But sometimes there is inconsistent capitalization usage within the same feature. The “Poselenie Sosenskoe” and “pOseleNie sosenskeo” could refer to the same area.

What to do?

To avoid this, we can convert all letters to lower case [or upper case].
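A small sketch of that normalization; the new column name sub_area_lower is only illustrative:

```python
# Normalize capitalization by converting every value to lower case.
df['sub_area_lower'] = df['sub_area'].str.lower()
print(df['sub_area_lower'].value_counts(dropna=False))
```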

Inconsistent type #2: Formats

Another standardization we often need to perform concerns data formats. One example is converting a feature from string to DateTime format.

How to find out?

The feature timestamp is stored as strings even though it represents dates.

What to do?

We can convert it and extract the date or time values using the code below. After this, it’s easier to analyze transaction volume grouped by year or month.
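A possible sketch of that conversion; the '%Y-%m-%d' format string is an assumption about how the dates are stored:

```python
import pandas as pd

# Convert the string timestamp to datetime, then derive year and month.
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month

print(df[['timestamp', 'timestamp_dt', 'year', 'month']].head())
```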

Related article: How To Manipulate Date And Time In Python Like A Boss

Inconsistent type #3: Categorical Values

Inconsistent categorical values are another inconsistent type we cover. A categorical feature has a limited number of values. Sometimes there may be other values due to reasons such as typos.

How to find out?

We need to observe the feature to find out this inconsistency. Let’s show this with an example.

We create a new dataset below, since we don’t have such a problem in the real estate dataset. For instance, the value of city was typed by mistake as “torontoo” and “tronto”, but both refer to the correct value “toronto”.
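A small sketch of such a toy dataset; the exact values are made up for illustration:

```python
import pandas as pd

# A hypothetical city column containing a few typos.
df_city_ex = pd.DataFrame({'city': ['torontoo', 'toronto', 'tronto', 'vancouver',
                                    'vancover', 'vancouvr', 'montreal', 'calgary']})
print(df_city_ex)
```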

A simple way to identify them is fuzzy matching [or edit distance]. It measures how many letters [the distance] we would need to change in one value’s spelling to match another value.

We know that the categories should only have four values: “toronto”, “vancouver”, “montreal”, and “calgary”. We calculate the distance between all the values and the word “toronto” [and “vancouver”]. We can see that the values likely to be typos have a smaller distance from the correct word, since they only differ by a couple of letters.
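One possible sketch, using a plain Levenshtein edit distance implemented by hand so that no extra package is needed:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-letter edits."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Distance of each value from the two reference spellings.
df_city_ex['distance_toronto'] = df_city_ex['city'].apply(edit_distance, b='toronto')
df_city_ex['distance_vancouver'] = df_city_ex['city'].apply(edit_distance, b='vancouver')
print(df_city_ex)
```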

What to do?

We can set criteria to convert these typos to the correct values. For example, the below code sets all the values within 2 letters distance from “toronto” to be “toronto”.
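For example, using the distance column from the previous sketch:

```python
# Set every value within edit distance 2 of 'toronto' back to 'toronto';
# the same pattern applies to 'vancouver' and the other correct spellings.
msk = df_city_ex['distance_toronto'] <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'

print(df_city_ex['city'].value_counts())
```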

Inconsistent type #4: Addresses

The address feature can be a headache for many of us, because people entering data into a database often don’t follow a standard format.

How to find out?

We can find messy address data by looking at it. Even though sometimes we can’t spot any issues, we can still run code to standardize them.

There is no address column in our dataset for privacy reasons. So we create a new dataset df_add_ex with feature address.
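A possible sketch of such a toy dataset; the addresses are made up for illustration:

```python
import pandas as pd

# A hypothetical, messy address column.
df_add_ex = pd.DataFrame(['123 MAIN St Apartment 15',
                          '123 Main Street Apt 12   ',
                          '543 FirSt Av',
                          '  876 FIRST Ave.'], columns=['address'])
print(df_add_ex)
```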

As we can see, the address feature is quite messy.

What to do?

We run the below code to lowercase the letters, remove white space, delete periods and standardize wordings.
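One way this could look with pandas string methods; the particular wording substitutions are assumptions:

```python
# Lowercase, trim whitespace, drop periods, and standardize common wordings.
df_add_ex['address_std'] = (df_add_ex['address']
                            .str.lower()
                            .str.strip()
                            .str.replace('.', '', regex=False)
                            .str.replace(r'\bstreet\b', 'st', regex=True)
                            .str.replace(r'\bapartment\b', 'apt', regex=True)
                            .str.replace(r'\bav\b', 'ave', regex=True))
print(df_add_ex)
```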

It looks much better now.

We did it! What a long journey it has been.

Further learning: if you enjoyed this article, you might like our new course Python for Data Analysis: step-by-step with projects, which includes a fully updated section on data cleaning, and more!
Please check it out here.

Clear away all the “dirty” data blocking your way to fitting the model.

Be the boss of cleaning!


Thank you for reading.

I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.
