Everything about outliers in a data set

Sayantan Sadhu
2 min read · Jun 4, 2021


So first, what is an outlier?


An outlier is a data point that lies far away from the other values in a random sample of a population. In other words, a data point that falls outside the overall distribution of the data is called an outlier. Detecting and handling outliers is a very important part of exploratory data analysis. Two common rules of thumb are used: a value is treated as an outlier if it lies more than three standard deviations from the mean, or if it lies more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile.

Why are these so important?

Outliers pull the mean toward themselves and inflate the standard deviation of a data set, which causes problems for statistical analysis. They can also distort the predictions of a model trained on that data.
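To see the effect concretely, here is a small sketch that compares the mean and standard deviation of the sample list used in this article with and without the value 200. The variable names are made up for illustration.

```python
import numpy as np

# Sample values from this article; 200 is the suspected outlier
with_outlier = [1, 4, 6, 9, 11, 15, 18, 19, 21, 25, 27, 29, 200]
without_outlier = [x for x in with_outlier if x != 200]

mean_with = np.mean(with_outlier)
mean_without = np.mean(without_outlier)
std_with = np.std(with_outlier)
std_without = np.std(without_outlier)

# Print "without -> with" so the shift caused by one value is visible
print(f"mean: {mean_without:.2f} -> {mean_with:.2f}")
print(f"std:  {std_without:.2f} -> {std_with:.2f}")
```

On this list a single extreme value nearly doubles the mean and makes the standard deviation more than five times larger.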

Statistics for understanding outliers

1. Population: The entire data set taken together is called the population. For example, if there is data on 10,000 people, all of it together is the population.

2. Random sample: A part of the data set that is picked at random in order to estimate properties of the population. For example, if 100 examples are drawn at random from the population above, they form a random sample.

3. Percentile: The percentage of data points that lie below a given data point in a sorted array. For example, in A = [1, 4, 6, 9, 11, 15, 18, 19, 21, 25, 27, 29, 200], 7 of the 13 values lie below 19, so the percentile of 19 is (7/13)*100 ≈ 54%. A number like 200 is an outlier here.

4. Z-score: The z-score tells how many standard deviations a data point lies from the mean. Its formula is (x - mean) / standard deviation.

5. Interquartile range: The range of values between the 25th percentile and the 75th percentile, that is, Q3 - Q1.
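The last three definitions can be checked on the example list A. This is only a rough sketch: `percentile_rank` and `z_score_of` are hypothetical helper names, and `percentile_rank` follows the simple "share of values behind the point" definition from the list above, which is a different convention from the interpolation that `np.percentile` uses.

```python
import numpy as np

A = [1, 4, 6, 9, 11, 15, 18, 19, 21, 25, 27, 29, 200]

def percentile_rank(values, x):
    """Percentage of values strictly smaller than x (hypothetical helper)."""
    values = np.asarray(values)
    return 100 * np.sum(values < x) / len(values)

def z_score_of(values, x):
    """Number of standard deviations x lies from the mean of values."""
    return (x - np.mean(values)) / np.std(values)

# 25th and 75th percentiles, and the interquartile range between them
q1, q3 = np.percentile(A, [25, 75])
iqr_value = q3 - q1

print(percentile_rank(A, 19))
print(z_score_of(A, 200))
print(q1, q3, iqr_value)
```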


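Detecting outliers using z-score:

import numpy as np

data = [1, 4, 6, 9, 11, 15, 18, 19, 21, 200, 25, 27, 29]

def z_score_outliers(data, threshold=3):
    # Collect every value whose absolute z-score exceeds the threshold
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for x in data:
        z = (x - mean) / std
        if np.abs(z) > threshold:
            outliers.append(x)
    return outliers

print(z_score_outliers(data))

Output : [200]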
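The same z-score check can also be written without an explicit loop, using a NumPy boolean mask. This is an alternative sketch rather than the code shown above; the function name and the `threshold` parameter are assumptions for illustration.

```python
import numpy as np

def z_score_outliers_vectorized(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds threshold."""
    values = np.asarray(data, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return []
    z = (values - values.mean()) / std
    return values[np.abs(z) > threshold].tolist()

sample = [1, 4, 6, 9, 11, 15, 18, 19, 21, 200, 25, 27, 29]
print(z_score_outliers_vectorized(sample))  # [200.0]
```

One thing to keep in mind with this rule: because the outlier itself inflates the standard deviation, a list with fewer than ten values can never produce a z-score beyond 3, so the check is only meaningful on reasonably sized samples.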

Detecting outliers using interquartile range:

import numpy as np

data = [1, 4, 6, 9, 11, 300, 15, 18, 19, 21, 200, 25, 27, 29]
data = sorted(data)

def iqr_outliers(data, k=1.5):
    # Values outside [Q1 - k*IQR, Q3 + k*IQR] are treated as outliers
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (k * iqr)
    upper_bound = q3 + (k * iqr)
    outliers = []
    for x in data:
        if x < lower_bound or x > upper_bound:
            outliers.append(x)
    return outliers

print(iqr_outliers(data))

Output : [200, 300]
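It can be instructive to run both checks on the same list. The sketch below is an illustration and not part of the walkthrough above, and the function names are assumptions. On this particular list the two large values inflate the standard deviation so much that neither of them crosses a z-score of 3, while the IQR bounds still flag both.

```python
import numpy as np

def z_score_flags(data, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    values = np.asarray(data, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold].tolist()

def iqr_flags(data, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    mask = (values < q1 - k * spread) | (values > q3 + k * spread)
    return values[mask].tolist()

data = [1, 4, 6, 9, 11, 300, 15, 18, 19, 21, 200, 25, 27, 29]
print(z_score_flags(data))      # []
print(sorted(iqr_flags(data)))  # [200.0, 300.0]
```

The quartiles barely move when a few extreme values are added, which is why the IQR rule tends to hold up better than the z-score rule when a list contains more than one outlier.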

