Friday, 27 February 2026
5 DIY Python Functions for Data Cleaning
Image by Author | Midjourney
Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it.
It’s what we signed up for. There’s no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools handy for data cleaning is essential. To that end, here are 5 DIY functions to give you some examples and starting points for building up your own data cleaning tool chest.
The functions are well-documented, and include explicit descriptions of function parameters and return types. Type hinting is also employed to ensure both that the functions are called in the manner they were intended, and that they can be well understood by you, the reader.
Before we get started, let’s take care of the imports.
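A minimal set of imports for what follows, offered as a sketch: `re` is required by the regex-based functions shown in this post, while `datetime`, `typing`, and `pandas` are assumptions based on the date-handling and dataframe tasks described later.

```python
import re                          # regular expressions, used by the text-cleaning functions
from datetime import datetime      # assumed: date parsing for the date-standardization step
from typing import List, Optional  # assumed: type hints beyond the built-in types

import pandas as pd                # assumed: dataframes for missing values and outliers
```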
1. Remove Multiple Spaces
Our first DIY function is meant to remove excessive whitespace from text. If we want neither multiple spaces within a string, nor excessive leading or trailing spaces, this single-line function will take care of it for us. We make use of a regular expression for internal spaces, as well as strip() for leading/trailing whitespace.
def clean_spaces(text: str) -> str:
    """
    Remove multiple spaces from a string and trim leading/trailing spaces.
    :param text: The input string to clean
    :returns: A string with multiple spaces removed and trimmed
    """
    return re.sub(' +', ' ', str(text).strip())
Testing:
messy_text = "This   has   too    many spaces"
clean_text = clean_spaces(messy_text)
print(clean_text)
Output:
This has too many spaces
2. Standardize Date Formats
Do you have datasets with dates running the gamut of internationally acceptable formats? This function will standardize them all to our specified format (YYYY-MM-DD).
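One way to approach this, as a minimal sketch rather than a definitive implementation: try a list of candidate formats in order and return the first successful parse. The function name `standardize_date` and the `DATE_FORMATS` list are assumptions made for illustration; extend the list to match the formats in your own data.

```python
from datetime import datetime
from typing import Optional

# Assumed list of accepted input formats. Order matters for ambiguous dates:
# 03/04/2023 is read as month/day/year because that pattern comes first.
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]


def standardize_date(date_string: str) -> Optional[str]:
    """
    Convert a date string in one of several known formats to YYYY-MM-DD.
    :param date_string: The input date string
    :returns: The date as YYYY-MM-DD, or None if no format matches
    """
    cleaned = str(date_string).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match, try the next one
    return None


for raw in ["2023-04-01", "01-04-2023", "13/04/2023", "not a date"]:
    print(raw, "->", standardize_date(raw))
```

Returning None for unparseable input, instead of raising, lets the caller decide whether to drop, flag, or inspect those rows. Note that `%B` matches month names according to the current locale.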
3. Handle Missing Values
Let’s deal with those pesky missing values. We can specify the strategy to use for numeric data (‘mean’, ‘median’, or ‘mode’), as well as the strategy for categorical data (‘mode’ or ‘dummy’).
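A function along these lines might look like the following sketch, assuming the data lives in a pandas dataframe. The name `handle_missing`, its parameter names, and the 'Unknown' placeholder used for the 'dummy' strategy are assumptions made for illustration.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame,
                   numeric_strategy: str = "mean",
                   categorical_strategy: str = "mode") -> pd.DataFrame:
    """
    Fill missing values column by column.
    :param df: The input dataframe
    :param numeric_strategy: 'mean', 'median', or 'mode' for numeric columns
    :param categorical_strategy: 'mode' or 'dummy' for non-numeric columns
    :returns: A new dataframe with missing values filled
    """
    if numeric_strategy not in ("mean", "median", "mode"):
        raise ValueError(f"Unknown numeric strategy: {numeric_strategy}")
    if categorical_strategy not in ("mode", "dummy"):
        raise ValueError(f"Unknown categorical strategy: {categorical_strategy}")

    result = df.copy()  # leave the caller's dataframe untouched
    for column in result.columns:
        series = result[column]
        if not series.isna().any():
            continue  # nothing to fill in this column
        is_numeric = pd.api.types.is_numeric_dtype(series)
        strategy = numeric_strategy if is_numeric else categorical_strategy
        if strategy == "mean":
            fill_value = series.mean()
        elif strategy == "median":
            fill_value = series.median()
        elif strategy == "mode":
            modes = series.mode()
            if modes.empty:
                continue  # column is entirely missing, no mode to use
            fill_value = modes.iloc[0]
        else:
            fill_value = "Unknown"  # assumed placeholder for the 'dummy' strategy
        result[column] = series.fillna(fill_value)
    return result


df = pd.DataFrame({"age": [25.0, None, 35.0], "city": ["Paris", None, "Paris"]})
cleaned = handle_missing(df, numeric_strategy="mean", categorical_strategy="mode")
print(cleaned)
```

Working on a copy keeps the raw data available for comparison, which is useful when checking how much a chosen strategy shifts a column's distribution.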
4. Remove Outliers
Outliers causing you problems? Not anymore. This DIY function uses the IQR method to remove outliers from our data. You pass in the data and specify the columns to check for outliers, and it returns an outlier-free dataframe.
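The IQR method can be sketched as follows, assuming a pandas dataframe. A value counts as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The name `remove_outliers_iqr` and the `factor` parameter are assumptions made for illustration.

```python
from typing import List

import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame,
                        columns: List[str],
                        factor: float = 1.5) -> pd.DataFrame:
    """
    Drop rows with outliers in the given columns, using the IQR rule.
    :param df: The input dataframe
    :param columns: Numeric columns to check for outliers
    :param factor: Multiplier applied to the IQR (1.5 is the usual choice)
    :returns: A new dataframe containing only the rows within bounds
    """
    keep = pd.Series(True, index=df.index)
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        # Rows with a missing value in this column are dropped as well,
        # because NaN never falls inside the bounds.
        keep &= df[column].between(lower, upper)
    return df[keep]


df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 100]})
cleaned = remove_outliers_iqr(df, ["value"])
print(cleaned["value"].tolist())  # [10, 12, 11, 13, 12]
```

In this sketch the bounds for every column are computed on the full input, so the result does not depend on the order in which columns are listed.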
5. Normalize Text
Let’s get normal! When you want to convert all text to lowercase, strip whitespace, and remove special characters, this DIY function will do the trick.
def normalize_text(text: str) -> str:
    """
    Normalize text data by converting to lowercase, removing special characters, and extra spaces.
    :param text: The input text to normalize
    :returns: Normalized text
    """
    # Convert to lowercase
    text = str(text).lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
Testing:
messy_text = "This is MESSY!!! Text with $pecial ch@racters."
clean_text = normalize_text(messy_text)
print(clean_text)
Output:
this is messy text with pecial chracters
Final Thoughts
Well, that’s that. We presented 5 different DIY functions that perform specific data cleaning tasks. We test drove them all and checked their results. You should now have some idea of where to go on your own from here, and don’t forget to save these functions for later use.