Replace outliers with mean python

Question

I have a python data-frame in which there are some outlier values. I would like to replace them with the median values of the data, had those values not been there.

Nội dung chính Show

Introduction
Outlier Identification
Identifying Outliers with Interquartile Range (IQR)
Identifying Outliers with Skewness
Identifying Outliers with Visualization
Outlier Treatment
Quantile-based Flooring and Capping
Log Transformation
Replacing Outliers with Median Values
Can you replace outliers with mean?
How do you replace outliers in Python?
How do you replace outliers in multiple columns in Python?
How do you replace outliers using IQR in Python?

id         Age
10236    766105
11993       288
9337        205
38189        88
35555        82
39443        75
10762        74
33847        72
21194        70
39450        70

So, I want to replace all the values > 75 with the median value of the dataset of the remaining dataset, i.e., the median value of 70,70,72,74,75.

I'm trying to do the following:

Replace with 0, all the values that are greater than 75
Replace the 0s with median value.

But somehow, the below code not working

df['age'].replace(df.age>75,0,inplace=True)

Introduction

The difference between a good and an average machine learning model is often its ability to clean data. One of the biggest challenges in data cleaning is the identification and treatment of outliers. In simple terms, outliers are observations that are significantly different from other data points. Even the best machine learning algorithms will underperform if outliers are not cleaned from the data because outliers can adversely affect the training process of a machine learning algorithm, resulting in a loss of accuracy.

In this guide, you will learn about techniques for outlier identification and treatment in Python.

Data

In this guide, we will be using a fictitious dataset of loan applications containing 600 observations and 6 variables:

1. Income - Annual income of the applicant (in US dollars)

2. Loan_amount - Loan amount (in US dollars) for which the application was submitted

3. Term_months - Tenure of the loan (in months)

4. Credit_score - Whether the applicant's credit score was good ("1") or not ("0")

5. Age - The applicant’s age in years

6. Approval_status - Whether the loan application was approved ("1") or not ("0")

Let's start by loading the required libraries and the data.

1# Import required libraries
2import pandas as pd
3import numpy as np 
4import matplotlib.pyplot as plt
5
6# Reading the data
7df = pd.read_csv("data_out.csv")
8print(df.shape)
9print(df.info())

python

1    (600, 6)
2    
3    RangeIndex: 600 entries, 0 to 599
4    Data columns (total 6 columns):
5    Income             600 non-null int64
6    Loan_amount        600 non-null int64
7    Term_months        600 non-null int64
8    Credit_score       600 non-null int64
9    approval_status    600 non-null int64
10    Age                600 non-null int64
11    dtypes: int64(6)
12    memory usage: 28.2 KB
13    None

The above output shows that there are 600 observations of 6 variables. All the variables have 600 records, indicating that there is no missing value in the data.

Outlier Identification

There can be many reasons for the presence of outliers in the data. Sometimes the outliers may be genuine, while in other cases, they could exist because of data entry errors. It is important to understand the reasons for the outliers before cleaning them.

We will start the process of finding outliers by running the summary statistics on the variables. This is done using the describe() function below, which provides a statistical summary of all the quantitative variables.

1|       	| Income        	| Loan_amount 	| Term_months 	| Credit_score 	| approval_status 	| Age        	|
2|-------	|---------------	|-------------	|-------------	|--------------	|-----------------	|------------	|
3| count 	| 600.000000    	| 600.000000  	| 600.00000   	| 600.000000   	| 600.000000      	| 600.000000 	|
4| mean  	| 7210.720000   	| 161.571667  	| 367.10000   	| 0.788333     	| 0.686667        	| 51.766667  	|
5| std   	| 8224.445086   	| 93.467598   	| 63.40892    	| 0.408831     	| 0.464236        	| 21.240704  	|
6| min   	| 200.000000    	| 10.000000   	| 36.00000    	| 0.000000     	| 0.000000        	| 0.000000   	|
7| 25%   	| 3832.500000   	| 111.000000  	| 384.00000   	| 1.000000     	| 0.000000        	| 36.000000  	|
8| 50%   	| 5075.000000   	| 140.000000  	| 384.00000   	| 1.000000     	| 1.000000        	| 51.000000  	|
9| 75%   	| 7641.500000   	| 180.500000  	| 384.00000   	| 1.000000     	| 1.000000        	| 64.000000  	|
10| max   	| 108000.000000 	| 778.000000  	| 504.00000   	| 1.000000     	| 1.000000        	| 200.000000 	|

Looking at the 'Age' variable, it is easy to detect outliers resulting from incorrect data. The minimum and maximum ages are 0, and 200, respectively. These are incorrect, and we will treat them later in the guide. These outliers were easy to detect, butthat will not always be the case. In other cases, mathematical and visualization techniques must be used. These techniques are discussed in the following sections.

Identifying Outliers with Interquartile Range (IQR)

The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1. The lines of code below calculate and print the interquartile range for each of the variables in the dataset.

1Q1 = df.quantile(0.25)
2Q3 = df.quantile(0.75)
3IQR = Q3 - Q1
4print(IQR)

python

1    Income             3809.0
2    Loan_amount          69.5
3    Term_months           0.0
4    Credit_score          0.0
5    approval_status       1.0
6    Age                  28.0
7    dtype: float64

The above output prints the IQR scores, which can be used to detect outliers. The code below generates an output with the 'True' and 'False' values. Points where the values are 'True' represent the presence of the outlier.

1print(df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))

python

1 2    0     False 3    1     False 4    2     False 5    3     False 6    4     False 7    5     False 8    6     False 9    7     False 10    8     False 11    9     False 12    10    False 13    11    False 14    12    False 15    13    False 16    14    False 17    15    False 18    16    False 19    17    False 20    18    False 21    19    False 22    20    False 23    21    False 24    22    False 25    23    False 26    24    False 27    25    False 28    26    False 29    27    False 30    28    False 31    29    False 32    ..      ... 33    570   False 34    571   False 35    572   False 36    573   False 37    574   False 38    575   False 39    576   False 40    577   False 41    578   False 42    579   False 43    580   False 44    581   False 45    582   False 46    583   False 47    584   False 48    585   False 49    586   False 50    587   False 51    588   False 52    589   False 53    590   False 54    591   False 55    592   False 56    593   False 57    594   False 58    595   False 59    596   False 60    597   False 61    598   False 62    599   False 63 64    [600

Income  Loan_amount  Term_months  Credit_score  approval_status    Age False        False         False            False  False False         True         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False         True          True            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False         True          True            False  False False        False         False            False  False False        False          True            False  False ...          ...           ...              ...    ... False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False          True            False  False False        False         False            False  False False         True         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False         False            False  False False        False          True            False  False False        False         False            False  False rows x 6 columns]

Identifying Outliers with Skewness

Several machine learning algorithms make the assumption that the data follow a normal (or Gaussian) distribution. This is easy to check with the skewness value, which explains the extent to which the data is normally distributed. Ideally, the skewness value should be between -1 and +1, and any major deviation from this range indicates the presence of extreme values.

The first line of code below prints the skewness value for the 'Income' variable, while the second line prints the summary statistics.

1print(df['Income'].skew())
2df['Income'].describe()

python

1    6.499
2
3    count       600.000000
4    mean       7210.720000
5    std        8224.445086
6    min         200.000000
7    25%        3832.500000
8    50%        5075.000000
9    75%        7641.500000
10    max      108000.000000
11    Name: Income, dtype: float64

The skewness value of 6.5 shows that the variable 'Income' has a right-skewed distribution, indicating the presence of extreme higher values. The maximum 'Income' value of USD 108,000 proves this point.

Identifying Outliers with Visualization

In the previous section, we used quantitative methods for outlier identification. This can also be achieved with visualization. Some of the common plots used for outlier detection are discussed below.

1. Box Plot

The box plot is a standardized way of displaying the distribution of data based on the five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It is often used to identify data distribution and detect outliers. The line of code below plots the box plot of the numeric variable 'Loan_amount'.

1plt.boxplot(df["Loan_amount"])
2plt.show()

python

In the above output, the circles indicate the outliers, and there are many. It is also possible to identify outliers using more than one variable. We can modify the above code to visualize outliers in the 'Loan_amount' variable by the approval status.

1df.boxplot(column='Loan_amount', by='approval_status')

python

The output shows that the number of outliers is higher for approved loan applicants (denoted by the label '1') than for rejected applicants (denoted by the label '0').

2. Histogram

A histogram is used to visualize the distribution of a numerical variable. An outlier will appear outside the overall pattern of distribution. The line of code below plots a histogram of the 'Income' variable, using the hist() function.

The above chart shows that the distribution is right-skewed, and there are extreme higher values at the right of the histogram. This step can be repeated for other variables as well.

3. Scatterplot

A scatterplot visualizes the relationship between two quantitative variables. The data are displayed as a collection of points, and any points that fall outside the general clustering of the two variables may indicate outliers. The lines of code below generate a scatterplot between the variables 'Income' and 'Loan_amount'.

1fig, ax = plt.subplots(figsize=(12,6))
2ax.scatter(df['Income'], df['Loan_amount'])
3ax.set_xlabel('Income of applicants in USD')
4ax.set_ylabel('Loan amount applied for in USD')
5plt.show()

python

The above chart indicates that most of the data points are clustered in the lower half of the plot. The points located to the extreme right of the x-axis or the y-axis indicate outliers.

Outlier Treatment

In the previous sections, we learned about techniques for outlier detection. However, this is only half of the task. Once we have identified the outliers, we need to treat them. There are several techniques for this, and we will discuss the most widely used ones below.

Quantile-based Flooring and Capping

In this technique, we will do the flooring (e.g., the 10th percentile) for the lower values and capping (e.g., the 90th percentile) for the higher values. The lines of code below print the 10th and 90th percentiles of the variable 'Income', respectively. These values will be used for quantile-based flooring and capping.

1print(df['Income'].quantile(0.10))
2print(df['Income'].quantile(0.90))

python

Now we will remove the outliers, as shown in the lines of code below. Finally, we calculate the skewness value again, which comes out much better now.

1df["Income"] = np.where(df["Income"] <2960.0, 2960.0,df['Income'])
2df["Income"] = np.where(df["Income"] >12681.0, 12681.0,df['Income'])
3print(df['Income'].skew())

python

Trimming

In this method, we completely remove data points that are outliers. Consider the 'Age' variable, which had a minimum value of 0 and a maximum value of 200. The first line of code below creates an index for all the data points where the age takes these two values. The second line drops these index rows from the data, while the third line of code prints summary statistics for the variable.

After trimming, the number of observations is reduced from 600 to 594, and the minimum and maximum values are much more acceptable.

1index = df[(df['Age'] >= 100)|(df['Age'] <= 18)].index
2df.drop(index, inplace=True)
3df['Age'].describe()

python

1    count    594.000000
2    mean      50.606061
3    std       16.266324
4    min       22.000000
5    25%       36.000000
6    50%       50.500000
7    75%       64.000000
8    max       80.000000
9    Name: Age, dtype: float64

IQR Score

This technique uses the IQR scores calculated earlier to remove outliers. The rule of thumb is that anything not in the range of (Q1 - 1.5 IQR) and (Q3 + 1.5 IQR) is an outlier, and can be removed. The first line of code below removes outliers based on the IQR range and stores the result in the data frame 'df_out'. The second line prints the shape of this data, which comes out to be 375 observations of 6 variables. This shows that for our data, a lot of records get deleted if we use the IQR method.

1df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
2print(df_out.shape)

python

Log Transformation

Transformation of the skewed variables may also help correct the distribution of the variables. These could be logarithmic, square root, or square transformations. The most common is the logarithmic transformation, which is done on the 'Loan_amount' variable in the first line of code below. The second and third lines of code print the skewness value before and after the transformation.

1df["Log_Loanamt"] = df["Loan_amount"].map(lambda i: np.log(i) if i > 0 else 0) 
2print(df['Loan_amount'].skew())
3print(df['Log_Loanamt'].skew())

python

1    2.8146019248106815
2    -0.17792641310111373

The above output shows that the skewness value came down from 2.8 to -0.18, confirming that the distribution has been treated for extreme values.

Replacing Outliers with Median Values

In this technique, we replace the extreme values with median values. It is advised to not use mean values as they are affected by outliers. The first line of code below prints the 50th percentile value, or the median, which comes out to be 140. The second line prints the 95th percentile value, which comes out to be around 326. The third line of code below replaces all those values in the 'Loan_amount' variable, which are greater than the 95th percentile, with the median value. Finally, the fourth line prints summary statistics after all these techniques have been employed for outlier treatment.

1print(df['Loan_amount'].quantile(0.50)) 
2print(df['Loan_amount'].quantile(0.95)) 
3df['Loan_amount'] = np.where(df['Loan_amount'] > 325, 140, df['Loan_amount'])
4df.describe()

python

1    140.0
2    325.7500000000001
3
4
5|       	| Income       	| Loan_amount 	| Term_months 	| Credit_score 	| approval_status 	| Age        	| Log_Loanamt 	|
6|-------	|--------------	|-------------	|-------------	|--------------	|-----------------	|------------	|-------------	|
7| count 	| 594.000000   	| 594.000000  	| 594.000000  	| 594.000000   	| 594.000000      	| 594.000000 	| 594.000000  	|
8| mean  	| 6112.375421  	| 144.289562  	| 366.929293  	| 0.787879     	| 0.688552        	| 50.606061  	| 4.957050    	|
9| std   	| 3044.257269  	| 53.033735   	| 63.705994   	| 0.409155     	| 0.463476        	| 16.266324  	| 0.494153    	|
10| min   	| 2960.000000  	| 10.000000   	| 36.000000   	| 0.000000     	| 0.000000        	| 22.000000  	| 2.302585    	|
11| 25%   	| 3831.500000  	| 111.000000  	| 384.000000  	| 1.000000     	| 0.000000        	| 36.000000  	| 4.709530    	|
12| 50%   	| 5050.000000  	| 140.000000  	| 384.000000  	| 1.000000     	| 1.000000        	| 50.500000  	| 4.941642    	|
13| 75%   	| 7629.000000  	| 171.000000  	| 384.000000  	| 1.000000     	| 1.000000        	| 64.000000  	| 5.192957    	|
14| max   	| 12681.000000 	| 324.000000  	| 504.000000  	| 1.000000     	| 1.000000        	| 80.000000  	| 6.656727    	|

Conclusion

In this guide, you have learned methods of identifying outliersusing bothquantitative and visualization techniques. You have also learnedtechniques for treating the identified outliers. Your usage of these techniques will depend on the data, the problem statement, and the machine learning algorithm selected for building the model.

To learn more about data preparation and building machine learning models using Python, please refer to the following guides:

To learn more about building deep learning models using Keras, please refer to the following guides:

Can you replace outliers with mean?

in this technique, we replace the extreme values with the mode value, you can use median or mean value but it is advised not to use the mean values because it is highly susceptible to outliers.

How do you replace outliers in Python?

Using IQR, we can follow the below approach to replace the outliers with a NULL value:.

Calculate the first and third quartile (Q1 and Q3)..

Further, evaluate the interquartile range, IQR = Q3-Q1..

Estimate the lower bound, the lower bound = Q1*1.5..

Estimate the upper bound, upper bound = Q3*1.5..

How do you replace outliers in multiple columns in Python?

“pandas remove outliers for multiple columns” Code Answer.

def cap_data(df):.

for col in df. columns:.

print("capping the ",col).

if (((df[col]. dtype)=='float64') | ((df[col]. ... .

percentiles = df[col]. ... .

df[col][df[col] <= percentiles[0]] = percentiles[0].

df[col][df[col] >= percentiles[1]] = percentiles[1].

How do you replace outliers using IQR in Python?

Outlier detection using IQR method and Box plot in Python.

# method 1. Q1 = np.percentile(grades , 25) Q3 = np. ... .

IQR = Q3 - Q1. ul = Q3+1.5*IQR. ll = Q1-1.5*IQR. ... .

outliers = grades[(grades > ul) | (grades < ll)] outliers. ... .

fig = plt.figure(figsize=(6,5)) hypo = np.random.randint(20, 81, size=500).

programming python Remove outliers MATLAB Remove outlier Python Detect outliers Filter outlier

Replace outliers with mean python

Introduction

Data

Outlier Identification

Identifying Outliers with Interquartile Range (IQR)

Identifying Outliers with Skewness

Identifying Outliers with Visualization

1. Box Plot

2. Histogram

3. Scatterplot

Outlier Treatment

Quantile-based Flooring and Capping

Trimming

IQR Score

Log Transformation

Replacing Outliers with Median Values

Conclusion

Can you replace outliers with mean?

How do you replace outliers in Python?

How do you replace outliers in multiple columns in Python?

How do you replace outliers using IQR in Python?

Bài Viết Liên Quan

Quảng Cáo

Có thể bạn quan tâm

Toplist được quan tâm

Quảng cáo

Xem Nhiều

Quảng cáo

Chúng tôi

Điều khoản

Trợ giúp

Mạng xã hội