Artificial Intelligence, Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog. Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and changing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Tuesday, 3 March 2026

The Strategic Use of Sequential Feature Selector for Housing Price Predictions

To understand housing prices better, simplicity and clarity in our models are key. Our aim with this post is to demonstrate how straightforward yet powerful techniques in feature selection and engineering can lead to creating an effective, simple linear regression model. Working with the Ames dataset, we use a Sequential Feature Selector (SFS) to identify the most impactful numeric features and then enhance our model’s accuracy through thoughtful feature engineering.

Kick-start your project with my book Next-Level Data Science. It provides self-study tutorials with working code.

Let’s get started.

Overview

This post is divided into three parts; they are:

Identifying the Most Predictive Numeric Feature
Evaluating Individual Features’ Predictive Power
Enhancing Predictive Accuracy with Feature Engineering

Identifying the Most Predictive Numeric Feature

In the first part, we identify the single most predictive numeric feature in the Ames dataset. We do this with scikit-learn's Sequential Feature Selector (SFS), which, with n_features_to_select=1, evaluates each candidate feature and keeps the one that gives our model the best cross-validated score. To keep the analysis clean, we use only numeric columns and drop any column that contains missing values:

# Load only the numeric columns from the Ames dataset
import pandas as pd
Ames = pd.read_csv('Ames.csv').select_dtypes(include=['int64', 'float64'])
 
# Drop any columns with missing values
Ames = Ames.dropna(axis=1)
 
# Import Linear Regression and Sequential Feature Selector from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
 
# Initializing the Linear Regression model
model = LinearRegression()
 
# Perform Sequential Feature Selector
sfs = SequentialFeatureSelector(model, n_features_to_select=1)
X = Ames.drop('SalePrice', axis=1)  # Features
y = Ames['SalePrice']  # Target variable
sfs.fit(X,y)           # Uses a default of cv=5
selected_feature = X.columns[sfs.get_support()]
print("Feature selected for highest predictability:", selected_feature[0])

This will output:

Feature selected for highest predictability: OverallQual

This result challenges the initial presumption that area would be the most predictive feature for housing prices. Instead, it highlights overall quality: on its own, the quality rating explains more of the variation in sale price than any other numeric feature. Note that the Sequential Feature Selector uses cross-validation with a default of five folds (cv=5) to evaluate each candidate feature subset. The selected feature is therefore the one with the highest mean cross-validation R² score, which makes the choice less dependent on any single train/test split and more likely to hold up on unseen data.
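To make the selection rule concrete, here is a minimal sketch on synthetic data (not the Ames dataset). The column names quality, area, and noise, and the coefficients used to generate the target, are assumptions made up for illustration. It scores each feature by hand with 5-fold cross-validation and checks that SFS with n_features_to_select=1 picks the same feature:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a housing table (hypothetical columns, not Ames)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "quality": rng.integers(1, 11, size=n).astype(float),
    "area": rng.normal(1500, 400, size=n),
    "noise": rng.normal(size=n),
})
# Assumed data-generating process: price depends mostly on quality
y_demo = (20000 * X_demo["quality"] + 50 * X_demo["area"]
          + rng.normal(0, 10000, size=n))

model = LinearRegression()

# Score every single-feature model by its mean 5-fold CV R²
manual_scores = {
    col: cross_val_score(model, X_demo[[col]], y_demo, cv=5).mean()
    for col in X_demo.columns
}
best_by_hand = max(manual_scores, key=manual_scores.get)

# Forward SFS with one feature to select should agree with the manual argmax
sfs = SequentialFeatureSelector(model, n_features_to_select=1,
                                direction="forward", cv=5)
sfs.fit(X_demo, y_demo)
selected = X_demo.columns[sfs.get_support()][0]

print("Best by manual CV:", best_by_hand)
print("Selected by SFS:  ", selected)
```

When more than one feature is requested, forward selection repeats this step, each time adding the feature that most improves the mean CV score of the subset chosen so far.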

Evaluating Individual Features’ Predictive Power

Building on this finding, we rank the features by their individual predictive power. Using 5-fold cross-validation, we fit a simple linear regression on each feature independently and record its mean R² score, which tells us how much of the variation in sale price each feature can explain on its own.

# Building on the earlier block of code:
from sklearn.model_selection import cross_val_score
 
# Dictionary to hold feature names and their corresponding mean CV R² scores
feature_scores = {}
 
# Iterate over each feature, perform CV, and store the mean R² score
for feature in X.columns:
    X_single = X[[feature]]
    cv_scores = cross_val_score(model, X_single, y, cv=5)
    feature_scores[feature] = cv_scores.mean()
 
# Sort features based on their mean CV R² scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)
 
# Print the top 3 features and their scores
top_3 = sorted_features[0:3]
for feature, score in top_3:
    print(f"Feature: {feature}, Mean CV R²: {score:.4f}")

This will output:

Feature: OverallQual, Mean CV R²: 0.6183
Feature: GrLivArea, Mean CV R²: 0.5127
Feature: 1stFlrSF, Mean CV R²: 0.3957

These findings underline the key role of overall quality (“OverallQual”), as well as the importance of living area (“GrLivArea”) and first-floor space (“1stFlrSF”) in the context of housing price predictions.
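As a side note on what these scores measure: for a simple linear regression with an intercept, the in-sample R² equals the squared Pearson correlation between the feature and the target, while the mean cross-validation R² is computed on held-out folds and need not match it exactly. The following sketch illustrates this on synthetic data; the slope and noise level are arbitrary assumptions, not values from the Ames dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic single-feature data (assumed slope and noise, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y_demo = 3.0 * x + rng.normal(scale=2.0, size=300)
X_demo = x.reshape(-1, 1)

# In-sample R² of the fitted one-feature model
in_sample_r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

# Squared Pearson correlation between the feature and the target
pearson_r = np.corrcoef(x, y_demo)[0, 1]

# Mean R² over 5 held-out folds, as used for the ranking above
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5).mean()

print(f"In-sample R²:      {in_sample_r2:.4f}")
print(f"Pearson r squared: {pearson_r ** 2:.4f}")
print(f"Mean 5-fold CV R²: {cv_r2:.4f}")
```

Ranking by mean CV R² therefore resembles ranking by squared correlation, with the difference that every score comes from data the model did not see during fitting.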

Enhancing Predictive Accuracy with Feature Engineering

In the final part, we use feature engineering to create a new feature, “Quality Weighted Area” (stored in the column ‘QualityArea’), by multiplying ‘OverallQual’ by ‘GrLivArea’. The product aims to be a stronger single predictor by capturing both the quality and the size of a property.

# Building on the earlier blocks of code:
Ames['QualityArea'] = Ames['OverallQual'] * Ames['GrLivArea']
 
# Setting up the feature and target variable for the new 'QualityArea' feature
X = Ames[['QualityArea']]  # New feature
y = Ames['SalePrice']
 
# 5-Fold CV on Linear Regression
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5)
 
# Calculating the mean of the CV scores
mean_cv_score = cv_scores.mean()
print(f"Mean CV R² score using 'Quality Weighted Area': {mean_cv_score:.4f}")

This will output:

Mean CV R² score using 'Quality Weighted Area': 0.7484

The mean CV R² rises from 0.6183 for the best single original feature (“OverallQual”) to 0.7484 for the engineered feature. This increase demonstrates how combining features can capture aspects of the data that neither feature captures alone, and it makes a compelling case for the thoughtful application of feature engineering in predictive modeling.
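The same comparison can be reproduced in a self-contained way on synthetic data. In this sketch the column names and the data-generating process are assumptions: the target is deliberately built from the product of quality and area, so the setup favors the engineered feature and only illustrates the mechanism, not the Ames result:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data where price is driven by quality * area (an assumption)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "quality": rng.integers(1, 11, size=n).astype(float),
    "area": rng.uniform(600, 3000, size=n),
})
df["price"] = 15 * df["quality"] * df["area"] + rng.normal(0, 20000, size=n)

# Engineered feature: the product of the two original features
df["quality_area"] = df["quality"] * df["area"]

def mean_cv_r2(column):
    """Mean 5-fold CV R² of a one-feature linear regression on `column`."""
    model = LinearRegression()
    return cross_val_score(model, df[[column]], df["price"], cv=5).mean()

scores = {col: mean_cv_r2(col) for col in ["quality", "area", "quality_area"]}
for col, score in scores.items():
    print(f"{col}: mean CV R² = {score:.4f}")
```

A product term helps when the effect of one feature depends on the level of the other, for example when an extra square foot adds more value in a high-quality house than in a low-quality one.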

Summary

In this three-part exploration, you worked through the process of finding and improving predictors for housing prices with an emphasis on simplicity. Because our goal was the best simple linear regression model, we excluded categorical features and kept only numeric columns without missing values. We first used a Sequential Feature Selector (SFS) to identify overall quality as the most predictive single feature, then ranked the individual features by mean cross-validation R² to see the contributions of living area and first-floor space. Finally, creating “Quality Weighted Area,” a feature blending quality with size, raised the mean CV R² from 0.6183 to 0.7484. With the right techniques, even simple models can yield useful insights into datasets like the Ames housing prices.

Specifically, you learned:

The value of Sequential Feature Selection in revealing the most important predictors for housing prices.
The importance of quality over size when predicting housing prices in Ames, Iowa.
How merging features into a “Quality Weighted Area” enhances model accuracy.

Do you have experiences with feature selection or engineering you would like to share, or questions about the process? Please ask your questions or give us feedback in the comments below, and I will do my best to answer.
