The real estate market is a complex ecosystem driven by numerous variables such as location, property features, market trends, and economic indicators. One dataset that offers a deep dive into this complexity is the Ames Housing dataset. Originating from Ames, Iowa, this dataset comprises various properties and their characteristics, ranging from the type of alley access to the property’s overall condition.
In this post, you’ll take a closer look at this dataset using data science techniques. Specifically, you’ll focus on how to identify categorical and numerical variables, since understanding these variables is crucial for any data-driven decision-making process.
Let’s get started.
Overview
This post is divided into three parts; they are:
- The Importance of a Data Dictionary
- Identifying Categorical and Numerical Variables
- Missing Data Imputation
The Importance of a Data Dictionary
Before we delve into the code and dataset, let’s discuss the value of a data dictionary. A data dictionary is essentially a “map” that describes the nature of the data we’re dealing with. It explains each variable, indicating whether it is categorical or numerical and, in the case of categorical variables, what the codes mean. This is particularly helpful when working with comprehensive datasets like the Ames Housing dataset.
Understanding the levels of measurement is fundamental in this context. The four levels, namely nominal, ordinal, interval, and ratio, provide a framework for interpreting variables:
- Nominal variables represent categories without any inherent order (such as a user ID).
- Ordinal variables have a meaningful order but unequal intervals (such as a ranking).
- Interval variables have consistent intervals but no true zero point, so division is not meaningful (such as a timestamp).
- Ratio variables have a meaningful order, consistent intervals, and a true zero point (such as length and weight).

This distinction is crucial because it guides the appropriate statistical analyses and informs the interpretation of the dataset. In our case, the Ames Data Dictionary provides invaluable insights into the variables we will be analyzing, aiding in the accurate classification of variables and subsequent data analysis.
For example, the variable MSSubClass is explained as “Identifies the type of dwelling involved in the sale,” with codes like “20” for “1-STORY 1946 & NEWER ALL STYLES” and “30” for “1-STORY 1945 & OLDER.” Despite its numerical appearance, MSSubClass is a categorical variable. Without additional context or domain knowledge, one might treat the numerical codes as labels without inherent order, making them “nominal”; after all, the average of codes 20 and 30 doesn’t make any sense.
However, it’s essential to note that the accurate classification of MSSubClass hinges on a nuanced understanding of the housing domain. With domain knowledge, one could discern a meaningful order among the categories, potentially reclassifying MSSubClass as an ordinal variable, in which case one could conclude that a code of 20 ranks above 30. This emphasizes the critical role of domain expertise in refining the interpretation of variables, ensuring a more accurate representation of their nature and relationships.
Identifying Categorical and Numerical Variables
Identifying the nature of variables in a dataset is a crucial step in any data analysis task. While it’s tempting to consider any variable with numerical values as a numerical variable, that’s not always the case. As we saw with MSSubClass in the Ames data dictionary, some numerical variables are actually categorical in nature—they represent codes, not measurable quantities.
A Few Basic Methods to Classify Data Types
- select_dtypes(): Find the columns in a DataFrame of certain data types.
- describe(): Gather basic descriptive statistics for numerical columns while ignoring non-numerical ones, so you can shortlist all the numerical columns from its output.
- nunique(): Count the number of unique values in each column; categorical columns typically have only a few unique values.
- value_counts(): Count the occurrences of each unique value in a column. If a feature has a limited set of unique values, it might be categorical.
- info(): Print a concise summary of the DataFrame, including the count of non-null values and the data type of each column.
In summary, the following code combines all the methods above to investigate a DataFrame. It is a minimal sketch that assumes the dataset is stored as a CSV file named Ames.csv:
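```python
import pandas as pd

# Load the Ames Housing dataset ("Ames.csv" is an assumed filename)
Ames = pd.read_csv("Ames.csv")

# Count the columns of each data type
print(Ames.dtypes.value_counts())

# describe() skips non-numerical columns, so its output shortlists the numerical ones
print(Ames.describe().columns)

# Count unique values per column; categorical columns usually have few
print(Ames.nunique().sort_values())

# Explore the distribution of one candidate categorical feature
print(Ames["MSSubClass"].value_counts())

# Concise summary: non-null counts and the data type of each column
Ames.info()
```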
Running the above code prints the data type counts, the shortlisted numerical columns, the unique-value counts, and a concise summary of each column; the exact output depends on your copy of the dataset.
Converting Numerical Features to Categorical Features Upon Careful Consideration
Most of the basic methods above would classify MSSubClass as a numerical feature. However, as highlighted earlier, this feature is in fact a categorical variable. Similarly, features like MoSold (Month Sold) and YrSold (Year Sold) are numerical in nature, but they can often be treated as categorical variables, especially when there is no interest in performing mathematical operations on them. We can use the astype() method in pandas to convert MSSubClass, MoSold, and YrSold to categorical features.
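A minimal sketch of the conversion, continuing with the Ames DataFrame from above:

```python
# Reclassify these numerical codes as categorical ("object") features
Ames["MSSubClass"] = Ames["MSSubClass"].astype("object")
Ames["MoSold"] = Ames["MoSold"].astype("object")
Ames["YrSold"] = Ames["YrSold"].astype("object")

# Recount the columns of each data type after the conversion
print(Ames.dtypes.value_counts())
```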
After performing this conversion, the count of columns with the “object” data type has increased to 47 (from the previous 44), while “int64” has dropped to 24 (from 27).
A careful assessment of the data dictionary, the nature of the dataset, and domain expertise can contribute to properly reclassifying data types.
The following is the complete code to count the number of columns of each data type, again assuming the dataset is stored as Ames.csv:
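```python
import pandas as pd

# Load the dataset ("Ames.csv" is an assumed filename)
Ames = pd.read_csv("Ames.csv")

# Reclassify numerical codes as categorical features
Ames["MSSubClass"] = Ames["MSSubClass"].astype("object")
Ames["MoSold"] = Ames["MoSold"].astype("object")
Ames["YrSold"] = Ames["YrSold"].astype("object")

# Count the number of columns of each data type
print(Ames.dtypes.value_counts())
```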
Missing Data Imputation
Dealing with missing data is a challenge that every data scientist faces. Ignoring missing values or handling them inadequately can lead to skewed analyses and incorrect conclusions. The choice of imputation technique often depends on the nature of the data, categorical or numerical. In addition, the data dictionary can be useful, as in the case of Pool Quality, where a missing value (“NA”) has a meaning: the absence of that feature for a particular property.
Data Imputation For Categorical Features
You can identify categorical data types and rank them by how much missing data affects each one. Here is a sketch, continuing with the Ames DataFrame from earlier:
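```python
# Rank categorical (object) columns by their missing-value counts
missing_categorical = Ames.select_dtypes(include=["object"]).isnull().sum()
missing_categorical = missing_categorical[missing_categorical > 0]
print(missing_categorical.sort_values(ascending=False))
```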
The data dictionary indicates that missing values for the entire list of categorical features above signify the absence of that feature for a given property, except for “Electrical”. With this insight, we can impute the one missing data point for the electrical system with the mode, and impute all the others using "None" (with quotations to make it a Python string).
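A sketch of the mode imputation for the “Electrical” column:

```python
# Impute the single missing "Electrical" value with the column's mode
mode_value = Ames["Electrical"].mode()[0]
Ames["Electrical"] = Ames["Electrical"].fillna(mode_value)

print(mode_value)                         # expected: SBrkr
print(Ames["Electrical"].isnull().sum())  # expected: 0
```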
This verifies that the mode value is "SBrkr" and that the missing value in the “Electrical” column is fixed; the printed missing count should now be zero.
Afterward, you can replace all remaining missing values with the string "None":
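```python
# Replace the remaining missing categorical values with the string "None"
categorical_columns = Ames.select_dtypes(include=["object"]).columns
Ames[categorical_columns] = Ames[categorical_columns].fillna("None")
```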
You can then confirm that there are no more missing values for categorical features; the check below should print zero:
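```python
# Confirm that no missing values remain among categorical features
print(Ames[categorical_columns].isnull().sum().sum())  # expected: 0
```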
The complete code, again assuming the dataset is stored as Ames.csv, is as follows:
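```python
import pandas as pd

# Complete sketch of the categorical imputation ("Ames.csv" is assumed)
Ames = pd.read_csv("Ames.csv")

# Treat numerical codes as categorical features
Ames["MSSubClass"] = Ames["MSSubClass"].astype("object")
Ames["MoSold"] = Ames["MoSold"].astype("object")
Ames["YrSold"] = Ames["YrSold"].astype("object")

# "Electrical" is the one feature where NA does not mean absence: use the mode
Ames["Electrical"] = Ames["Electrical"].fillna(Ames["Electrical"].mode()[0])

# For every other categorical feature, NA means absence: use the string "None"
categorical_columns = Ames.select_dtypes(include=["object"]).columns
Ames[categorical_columns] = Ames[categorical_columns].fillna("None")

# Verify that no missing categorical values remain
print(Ames[categorical_columns].isnull().sum().sum())
```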
Data Imputation For Numerical Features
We can apply the same technique demonstrated above to identify numerical data types and rank them by how much missing data affects each one.
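A sketch of the same ranking, applied to the numerical columns:

```python
# Rank numerical columns by their missing-value counts
missing_numerical = Ames.select_dtypes(include=["number"]).isnull().sum()
missing_numerical = missing_numerical[missing_numerical > 0]
print(missing_numerical.sort_values(ascending=False))
```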
The above illustrates that there are fewer instances of missing numerical data than missing categorical data. However, the data dictionary is not as useful here for a straightforward imputation. Whether or not to impute missing data largely depends on the goal of the analysis. Often, a data scientist may generate multiple imputations to account for the uncertainty in the imputation process. Common imputation methods include (but are not limited to) mean, median, and regression imputation. As a baseline, we will illustrate mean imputation here; other techniques may be preferable depending on the task at hand.
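A minimal sketch of the baseline mean imputation:

```python
# Impute each affected numerical column with its mean
numerical_columns = Ames.select_dtypes(include=["number"]).columns
for col in numerical_columns:
    if Ames[col].isnull().any():
        mean_value = Ames[col].mean()
        Ames[col] = Ames[col].fillna(mean_value)
        print(f"{col}: imputed with mean {mean_value:.2f}")
```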
This prints each imputed column together with the mean used for it; the exact values depend on your copy of the dataset.
At times, we may also opt to leave missing values without any imputation, retaining the authenticity of the original dataset, and remove observations that lack complete and accurate data if required. Alternatively, you may try to build a machine learning model that guesses the missing value based on other data in the same row, which is the principle behind regression imputation. As a final step of the baseline imputation above, let us cross-check whether any missing values remain:
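```python
# Cross-check: count missing values across the entire DataFrame
print(Ames.isnull().sum().sum())
```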
You should see a total of zero, confirming that no missing values remain.
Congratulations! We have successfully imputed every missing value in the Ames dataset using baseline operations. It’s important to note that numerous other techniques exist for imputing missing data. As a data scientist, exploring various options and determining the most appropriate method for the given context is crucial to producing reliable and meaningful results.
The complete code, from loading the dataset through both imputation steps and assuming the dataset is stored as Ames.csv, is as follows:
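```python
import pandas as pd

# Complete sketch: load the data, then impute categorical and numerical features
Ames = pd.read_csv("Ames.csv")  # "Ames.csv" is an assumed filename

# Treat numerical codes as categorical features
for col in ["MSSubClass", "MoSold", "YrSold"]:
    Ames[col] = Ames[col].astype("object")

# Categorical imputation: mode for "Electrical", the string "None" elsewhere
Ames["Electrical"] = Ames["Electrical"].fillna(Ames["Electrical"].mode()[0])
categorical_columns = Ames.select_dtypes(include=["object"]).columns
Ames[categorical_columns] = Ames[categorical_columns].fillna("None")

# Numerical imputation: baseline mean for every affected column
numerical_columns = Ames.select_dtypes(include=["number"]).columns
for col in numerical_columns:
    if Ames[col].isnull().any():
        Ames[col] = Ames[col].fillna(Ames[col].mean())

# Final cross-check: should print 0
print(Ames.isnull().sum().sum())
```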
Summary
In this tutorial, we explored the Ames Housing dataset through the lens of data science techniques. We discussed the importance of a data dictionary in understanding the dataset’s variables and dove into Python code snippets that help identify and handle these variables effectively.
Understanding the nature of the variables you’re working with is crucial for any data-driven decision-making process. As we’ve seen, the Ames data dictionary serves as a valuable guide in this respect. Coupled with Python’s powerful data manipulation libraries, navigating complex datasets like the Ames Housing dataset becomes a much more manageable task.
Specifically, you learned:
- The importance of a data dictionary when assessing data types and imputation strategies.
- Identification and reclassification methods for numerical and categorical features.
- How to impute missing categorical and numerical features using the pandas library.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
