Thursday 29 February 2024

Spotting the Exception: Classical Methods for Outlier Detection in Data Science

Outliers are unique in that they often don’t play by the rules. These data points, which differ significantly from the rest, can skew your analyses and make your predictive models less accurate. Although detecting outliers is critical, there is no universally agreed-upon method for doing so. While some advanced techniques like machine learning offer solutions, in this post, we will focus on the foundational Data Science methods that have been in use for decades.

Let’s get started.

Overview

This post is divided into three parts; they are:

  • Understanding Outliers and Their Impact
  • Traditional Methods for Outlier Detection
  • Detecting Outliers in the Ames Dataset

Understanding Outliers and Their Impact

Outliers can emerge for a variety of reasons, from data entry errors to genuine anomalies. Their presence can be attributed to factors like:

  • Measurement errors
  • Data processing errors
  • Genuine extreme observations

Understanding the source of an outlier is crucial for determining whether to keep, modify, or discard it. The impact of outliers on statistical analyses can be profound. They can change the results of data visualizations, central tendency measurements, and other statistical tests. Outliers can also influence the assumptions of normality, linearity, and homoscedasticity in a dataset, leading to unreliable and spurious conclusions.

Traditional Methods for Outlier Detection

In the realm of Data Science, several classical methods exist for detecting outliers. These can be broadly categorized into:

  • Visual methods: Plots and graphs, such as scatter plots, box plots, and histograms, provide an intuitive feel of the data distribution and any extreme values.
  • Statistical methods: Techniques like the Z-score, IQR (Interquartile Range), and the modified Z-score are mathematical methods used to define outliers based on data distribution.
  • Probabilistic and statistical models: These leverage the probability distribution of data, such as the Gaussian distribution, to detect unlikely observations.
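As a quick illustration of the statistical rules above, here is a toy comparison of the Z-score, the IQR rule, and the modified Z-score on a small sample. The cutoffs used (|z| > 3, 1.5 × IQR, modified Z-score > 3.5) are common conventions, not values taken from the text:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 12,
                 11, 13, 12, 14, 13, 12, 11, 10, 13, 102], dtype=float)

# Z-score: distance from the mean, measured in standard deviations
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: more than 1.5 * IQR below Q1 or above Q3
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Modified Z-score: built on the median and MAD, so the outlier
# itself inflates the scale estimate far less than with the mean/std
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
mz_outliers = data[np.abs(modified_z) > 3.5]

print(z_outliers, iqr_outliers, mz_outliers)
```

Notice that the three rules disagree on the value 15: the IQR rule flags it, while both Z-score variants do not — a first hint that the choice of method matters.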

It’s essential to understand that the choice of method often depends on the nature of your dataset and the specific problem at hand.

Detecting Outliers in the Ames Dataset

In this section, you’ll dive into the practical application of detecting outliers using the Ames Housing Dataset. Specifically, you’ll explore three features: Lot Area, Sales Price, and Total Rooms Above Ground.

Visual Inspection

Visual methods are a quick and intuitive way to identify outliers. Let’s start with box plots for your chosen features.
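Here is a minimal sketch of these box plots with pandas and Matplotlib. Since the dataset itself isn’t reproduced here, a tiny synthetic DataFrame stands in for it; the file name Ames.csv and the column names LotArea, SalePrice, and TotRmsAbvGrd are assumptions to adjust for your copy of the data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# With the real data you would load it from disk, e.g.:
#   Ames = pd.read_csv("Ames.csv")   # file name is an assumption
# A small synthetic DataFrame stands in for it here, and the column
# names (LotArea, SalePrice, TotRmsAbvGrd) are assumptions too.
Ames = pd.DataFrame({
    "LotArea": [8000, 9500, 10200, 9100, 11000, 9800, 215000],
    "SalePrice": [150000, 180000, 210000, 175000, 250000, 190000, 650000],
    "TotRmsAbvGrd": [6, 7, 6, 5, 8, 7, 14],
})

features = ["LotArea", "SalePrice", "TotRmsAbvGrd"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, feature in zip(axes, features):
    Ames.boxplot(column=feature, ax=ax)  # dots beyond the whiskers are outliers
plt.tight_layout()
plt.show()
```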

These plots provide immediate insights into potential outliers in your data. The dots you see beyond the whiskers represent data points that are considered outliers, lying outside 1.5 times the Interquartile Range (IQR) from the first and third quartiles. For instance, you might notice properties with exceptionally large lot areas or homes with a large number of rooms above ground.

Statistical Methods: IQR

The dots in the box plots above lie more than 1.5 times the Interquartile Range (IQR) below the first quartile or above the third quartile. This IQR rule is a robust way to identify outliers quantitatively, and you can find and count these points directly from the pandas DataFrame without drawing a box plot:
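A sketch of this count, again using a small synthetic DataFrame in place of the real dataset (the file name Ames.csv and the column names LotArea, SalePrice, and TotRmsAbvGrd are assumptions):

```python
import pandas as pd

# Synthetic stand-in for the Ames data; in practice you would use
#   Ames = pd.read_csv("Ames.csv")   # file and column names are assumptions
Ames = pd.DataFrame({
    "LotArea": [8000, 9500, 10200, 9100, 11000, 9800, 215000],
    "SalePrice": [150000, 180000, 210000, 175000, 250000, 190000, 650000],
    "TotRmsAbvGrd": [6, 7, 6, 5, 8, 7, 14],
})

outlier_counts = {}
for feature in ["LotArea", "SalePrice", "TotRmsAbvGrd"]:
    Q1 = Ames[feature].quantile(0.25)
    Q3 = Ames[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    is_outlier = (Ames[feature] < lower) | (Ames[feature] > upper)
    outlier_counts[feature] = int(is_outlier.sum())
    print(f"{feature}: {outlier_counts[feature]} outliers")
```

Run against the full dataset, a loop like this produced the counts discussed next.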

Using the Interquartile Range (IQR) method on the Ames Housing Dataset, you identified 113 outliers in the “Lot Area” feature, 116 in the “Sales Price” feature, and 35 in the “Total Rooms Above Ground” feature. These are the dots beyond the whiskers in the box plots, which typically extend to 1.5 times the IQR from the first and third quartiles. This is just one possible definition of an outlier; such values should be investigated further or treated appropriately in subsequent analyses.

Probabilistic and Statistical Models

The natural distribution of data can sometimes help you identify outliers. One of the most common assumptions about data distribution is that it follows a Gaussian (or normal) distribution. In a perfectly Gaussian distribution, about 68% of the data lies within one standard deviation from the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Data points that fall far away from the mean (typically beyond three standard deviations) can be considered outliers.

This method is particularly effective when the dataset is large and is believed to be normally distributed. Let’s apply this technique to your Ames Housing Dataset and see what you find.
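A sketch of the three-standard-deviation rule, once more with a synthetic stand-in for the dataset (file and column names are assumptions). As in the text, only the upper threshold is checked:

```python
import pandas as pd

# Synthetic stand-in for the Ames data; with the real dataset you would use
#   Ames = pd.read_csv("Ames.csv")   # file and column names are assumptions
Ames = pd.DataFrame({
    "LotArea": [9000 + 100 * i for i in range(30)] + [200000],
    "SalePrice": [150000 + 1000 * i for i in range(30)] + [1500000],
    "TotRmsAbvGrd": [5 + (i % 6) for i in range(30)] + [20],
})

outlier_counts = {}
for feature in ["LotArea", "SalePrice", "TotRmsAbvGrd"]:
    mean, std = Ames[feature].mean(), Ames[feature].std()
    upper = mean + 3 * std  # upper threshold only, as in the text
    outliers = Ames[Ames[feature] > upper]
    outlier_counts[feature] = len(outliers)
    print(f"{feature}: threshold {upper:.2f}, {len(outliers)} outliers")
```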

Running this displays the distribution of each feature and prints a summary of the outliers it finds.

Upon applying the Gaussian model for outlier detection, you observed that there are outliers in the “Lot Area,” “Sales Price,” and “Total Rooms Above Ground” features. These outliers are identified based on the upper threshold of three standard deviations from the mean:

  • Lot Area: Any observation with a lot area larger than 34,505.44 square feet is considered an outlier. You found 24 such outliers in the dataset.
  • Sales Price: Any observation above US$403,188.39 is considered an outlier. Your analysis revealed 42 outliers in the “Sales Price” feature.
  • Total Rooms Above Ground: Observations with more than 10.99 rooms above ground are considered outliers. You identified 35 outliers using this criterion.

The counts differ from those of the earlier IQR method because each technique defines an outlier differently, which emphasizes the value of applying multiple techniques for a more comprehensive understanding. The visualizations accentuate these outliers, allowing a clear distinction from the main distribution of the data. Such discrepancies underscore the need for domain expertise and context when deciding how best to manage outliers.

To enhance your understanding and facilitate further analysis, it’s valuable to compile a comprehensive list of identified outliers. This list provides a clear overview of the specific data points that deviate significantly from the norm. In the following section, you’ll see how to systematically organize and list these outliers in a DataFrame for each feature: “Lot Area,” “Sales Price,” and “Total Rooms Above Ground.” This tabulated format allows for easy inspection and potential actions, such as further investigation or targeted data treatment.

Let’s explore the approach that accomplishes this task.
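One possible shape for such code is sketched below. The function name list_outliers, the descending sort, and the synthetic data are all assumptions; only the num_rows behavior (None returns everything, an integer truncates the display) follows the text:

```python
import pandas as pd

def list_outliers(df, feature, num_rows=None):
    """List rows whose value in `feature` exceeds mean + 3 standard deviations.

    num_rows=None returns every outlier; an integer limits the result.
    """
    upper = df[feature].mean() + 3 * df[feature].std()
    outliers = df[df[feature] > upper].sort_values(by=feature, ascending=False)
    return outliers if num_rows is None else outliers.head(num_rows)

# Synthetic stand-in; with the real data: Ames = pd.read_csv("Ames.csv")
Ames = pd.DataFrame({
    "LotArea": [9000 + 100 * i for i in range(30)] + [200000, 250000],
})
print(list_outliers(Ames, "LotArea", num_rows=7))
```

Returning a DataFrame rather than printing inside the function keeps the outliers available for later treatment, such as capping or removal.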

Now, before you unveil the results, it’s essential to note that the code snippet allows for user customization. By adjusting the parameter num_rows, you have the flexibility to define the number of rows you want to see in each DataFrame. In the example shared earlier, you used num_rows=7 for a concise display, but the default setting is num_rows=None, which prints the entire DataFrame. Feel free to tailor this parameter to suit your preferences and the specific requirements of your analysis.

In this exploration of probabilistic and statistical models for outlier detection, you focused on the Gaussian model applied to the Ames Housing Dataset, specifically utilizing a threshold of three standard deviations. By leveraging the insights provided by visualizations and statistical methods, you identified outliers and demonstrated their listing in a customizable DataFrame.

Summary

Outliers, stemming from diverse causes, significantly impact statistical analyses. Recognizing their origins is crucial as they can distort visualizations, central tendency measures, and statistical tests. Classical Data Science methods for outlier detection encompass visual, statistical, and probabilistic approaches, with the choice dependent on dataset nature and specific problems.

Application of these methods on the Ames Housing Dataset, focusing on Lot Area, Sales Price, and Total Rooms Above Ground, revealed insights. Visual methods like box plots provided quick outlier identification. The Interquartile Range (IQR) method quantified outliers, revealing 113, 116, and 35 outliers for Lot Area, Sales Price, and Total Rooms Above Ground. Probabilistic models, particularly the Gaussian model with three standard deviations, found 24, 42, and 35 outliers in the respective features.

These results underscore the need for a multifaceted approach to outlier detection. Beyond identification, systematically organizing and listing outliers in tabulated DataFrames facilitates in-depth inspection. Customizability, demonstrated by the num_rows parameter, ensures flexibility in presenting tailored results. In conclusion, this exploration enhances understanding and provides practical guidance for managing outliers in real-world datasets.

Specifically, you learned:

  • The significance of outliers and their potential impact on data analyses.
  • The various traditional methods used in Data Science for outlier detection.
  • How to apply these methods in a real-world dataset, using the Ames Housing Dataset as an example.
  • Systematic organization and listing of identified outliers into customizable DataFrames for detailed inspection and further analysis.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
