
Thursday, 19 March 2026

Making Sense of Text with Decision Trees

 


In this article, you will learn:

  • How to build a decision tree classifier for spam email detection that works on text data.
  • How to incorporate text representation techniques like TF-IDF and embeddings when training your decision tree.
  • How to evaluate the resulting classifier and compare it with other text classifiers, such as Naive Bayes, using Scikit-learn.

Introduction

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, typically on structured, tabular data. However, when combined with the right tools, decision trees can also be powerful predictors for unstructured data, such as text or images, and even for time series.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Building Decision Trees for Text Classification

The following hands-on tutorial uses the publicly available UCI dataset for spam classification: a collection of text-label pairs in which each email message is labeled as spam or ham (“ham” is a colloquial term for non-spam messages).

The following code requests, decompresses, and loads the dataset via its public repository URL into a Pandas DataFrame object named df:
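A sketch of that loading step, assuming the dataset is the UCI SMS Spam Collection archive at the URL below (adjust the URL if the file has moved):

```python
import io
import urllib.request
import zipfile

import pandas as pd

# Assumed location of the UCI spam collection (zip with a tab-separated file)
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00228/smsspamcollection.zip")

# Download the archive into memory and decompress it
raw = urllib.request.urlopen(URL).read()
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    with archive.open("SMSSpamCollection") as f:
        # Two columns: label ("ham"/"spam") and the message text
        df = pd.read_csv(f, sep="\t", header=None, names=["label", "text"])

print(df.shape)
```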

As a quick first check, let’s view the count of spam versus ham emails:
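One way to get those counts with Pandas (a small stand-in frame is included here so the snippet runs on its own; in the tutorial, use the df loaded above):

```python
import pandas as pd

# Stand-in for the dataset loaded in the previous step
df = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam"],
    "text": ["hi there", "see you soon", "meeting at 10", "win free cash"],
})

print(df["label"].value_counts())                # absolute counts per class
print(df["label"].value_counts(normalize=True))  # class proportions
```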

There are 4,825 ham emails (86%) and 747 spam emails (14%). This indicates we are dealing with a class-imbalanced dataset. Keep this in mind, as a simple metric like accuracy won’t be the best standalone measure for evaluation.

Next, we split the dataset (both input texts and labels) into training and test subsets. Due to the class imbalance, we will use stratified sampling to maintain the same class proportions in both subsets, which helps in training more generalizable models.
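The stratified split can be done with Scikit-learn's `train_test_split` and its `stratify` argument (a tiny stand-in frame is included so the snippet runs on its own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for the full dataset; the real df comes from the loading step
df = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 2,
    "text": ["regular message"] * 8 + ["win free cash"] * 2,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.2,
    stratify=df["label"],   # keep the same ham/spam ratio in both subsets
    random_state=42,
)
```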

Now, we are ready to train our first decision tree model. A key aspect here is encoding the text data into a structured format that decision trees can handle. One common approach is TF-IDF vectorization. TF-IDF maps each text into a sparse numerical vector, where each dimension (feature) represents a term from the existing vocabulary, weighted by its TF-IDF score.

Scikit-learn’s Pipeline class provides an elegant way to chain these steps. We’ll create a pipeline that first applies TF-IDF vectorization using TfidfVectorizer and then trains a DecisionTreeClassifier.
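A minimal sketch of that pipeline. The tiny stand-in corpus below is only there so the snippet runs on its own; in the tutorial, the train/test subsets come from the stratified split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in corpus; use the real train/test split in practice
X_train = ["win a free prize now", "free cash offer", "meeting at noon",
           "lunch tomorrow?", "claim your free money", "project update attached"]
y_train = ["spam", "spam", "ham", "ham", "spam", "ham"]
X_test = ["free prize inside", "see you at the meeting"]
y_test = ["spam", "ham"]

tree_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                     # text -> sparse TF-IDF vectors
    ("tree", DecisionTreeClassifier(random_state=42)),
])

tree_pipeline.fit(X_train, y_train)
y_pred = tree_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```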

Results:

The results aren’t bad, but they are slightly inflated by the dominant ham class. If catching all spam is critical, we should pay special attention to recall for the spam class, which is only 0.83 here. Spam precision is higher, meaning very few ham emails are incorrectly marked as spam; high precision is the priority if we want to avoid important messages ending up in the spam folder.

Our second decision tree will use an alternative approach for representing text: embeddings. Embeddings are vector representations of words or sentences such that similar texts are associated with vectors close together in space, capturing semantic meaning and contextual relationships beyond mere word counts.

A simple way to generate embeddings for our text is to use pretrained models like GloVe. We can map each word in an email to its corresponding dense GloVe vector and then represent the entire email by averaging these word vectors. This results in a compact, dense numerical representation for each message.

The following code implements this process. It defines a text_to_embedding() function, applies it to the training and test sets, and then trains and evaluates a new decision tree.
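A sketch of that step. Loading real GloVe vectors (for example via gensim's downloader, shown commented out) is an assumption on my part; a toy three-dimensional vocabulary is used below so the snippet runs on its own:

```python
import numpy as np

# In practice you would load pretrained GloVe vectors, for example:
#   import gensim.downloader as api
#   glove = api.load("glove-wiki-gigaword-100")   # word -> 100-d vector
# Any mapping supporting `word in glove` and `glove[word]` works below.

def text_to_embedding(text, glove, dim=100):
    """Represent a text as the average of its words' embedding vectors."""
    words = text.lower().split()
    vectors = [glove[w] for w in words if w in glove]
    if not vectors:                 # no in-vocabulary words: zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy "embeddings" just to show the mechanics
toy_glove = {"free": np.array([1.0, 0.0, 0.0]),
             "money": np.array([0.0, 1.0, 0.0])}
vec = text_to_embedding("FREE money now", toy_glove, dim=3)  # "now" is unknown

# The embedded train/test sets would then feed a DecisionTreeClassifier:
#   X_train_emb = np.vstack([text_to_embedding(t, glove) for t in X_train])
```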

Results:

Unfortunately, this simple averaging approach can cause significant information loss, sometimes called representation loss, which explains the overall drop in performance compared to the TF-IDF model. Decision trees often work better with sparse, high-signal features like those from TF-IDF: individual word-level features can act as strong discriminators (e.g. classifying an email as spam based on the presence of words like “free” or “million”).

Comparison with a Naive Bayes Text Classifier

Finally, let’s compare our results with another popular text classification model: the Naive Bayes classifier. While not tree-based, it works well with TF-IDF features. The process is very similar to our first model:
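A sketch of that pipeline, swapping the decision tree for `MultinomialNB` (the same tiny stand-in corpus is used so the snippet runs on its own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus; use the real train/test split in practice
X_train = ["win a free prize now", "free cash offer", "meeting at noon",
           "lunch tomorrow?", "claim your free money", "project update attached"]
y_train = ["spam", "spam", "ham", "ham", "spam", "ham"]
X_test = ["free prize inside", "see you at the meeting"]
y_test = ["spam", "ham"]

nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),        # Naive Bayes instead of a decision tree
])

nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```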

Results:

Comparing our first decision tree model (the TF-IDF-based one) with this Naive Bayes model, we see little difference in how they classify ham emails. For the spam class, the Naive Bayes model achieves perfect precision (1.00), meaning every email it identifies as spam is indeed spam. However, it performs worse on recall (0.70), missing about 30% of the actual spam messages in the test data. If recall is our most critical performance indicator, we would lean towards the first decision tree model combined with TF-IDF. We could then try to optimize it further, for instance through hyperparameter tuning or by using more training data.

Wrapping Up

This article demonstrated how to train decision tree models for text data, tackling spam email classification using common text representation approaches like TF-IDF and vector embeddings.

Synthetic Data Generator with Faker

 


In this article, you will learn:

  • how to use the Faker library in Python to generate various types of synthetic data.
  • methods to create single data records and full datasets, and how to export them into different formats.
  • how to simulate real-world data imperfections (e.g. missing values, duplicates) for testing purposes, especially in ETL pipelines.

Introduction

The over-used phrase “data is the new oil (or gold)” still holds true in many scenarios where obtaining enough high-quality data for insightful analyses or effective machine learning models is a challenge. Synthetic data generation is a common way to cope with this challenge. Thanks to Python libraries like Faker, generating synthetic data for purposes such as bootstrapping existing datasets, testing, or anonymizing sensitive information is easier than ever.

This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records, how to generate full datasets in one go, and how to export them into different formats. The code walkthrough adopts a twofold perspective:

  1. Learning: We will gain a basic understanding of the data types that can be generated and how to prepare them for further processing, aided by popular data-intensive libraries like Pandas.
  2. Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.

Don’t feel intimidated: we’ll take a gentle approach and showcase Faker’s basic capabilities from an introductory, uncomplicated perspective.

Step-by-Step Data Generation

1. Install and Import Faker

If this is your first time using Faker, you’ll need to install it as follows:

Now we will import the necessary libraries and modules:

2. Create Your Very Own “Faker”

The first thing to do is create an instance of the Faker class, which is able to generate many types of “fake” data. (I personally prefer the term synthetic over fake, so I’ll stick to it from here on.) We will also set a fixed seed for the random number generator: a cornerstone of synthetic data generation. Setting a seed makes the code reproducible and easier to debug, if necessary.

3. Write a Data-Generating Function

Next comes the most critical part of the code: the function that will generate synthetic, real-world-like instances of data. Concretely, we will generate bank customer records containing basic personal and socio-demographic attributes.

That’s probably a lot to digest, so let’s analyze the code further, line by line:

  • The function generates and returns a Python dictionary representing a bank customer: the dictionary keys are the attribute names, and the dictionary values are the corresponding attribute values.
  • The "id" attribute contains a unique identifier generated with the uuid4() function.
  • The "name" attribute contains a randomly generated customer name, courtesy of the name() function.
  • Similarly, the "email" attribute contains a randomly generated email address, but here the email() function has a 10% chance of not being used, so roughly 10% of the records have a missing value for this attribute. This is a handy way to simulate real-world data imperfections. Note that email addresses and customer names are generated independently; if you wanted them to be related, you would need to derive the email address from the customer’s name rather than generate it randomly.
  • The remaining attribute values are also generated with dedicated Faker functions, which offer plenty of flexibility across data types and even support customization, as seen with the age range specified for the date-of-birth attribute. The choice() function picks a categorical value from a limited set of options.
  • The "income" attribute is a floating-point value within a specified range, rounded to two decimal places. In addition, there is a 5% chance it is set to -1000, indicating an invalid or missing value: again, a way to simulate real-world data imperfections or errors.

With a single line of code, we can now call this function repeatedly to create any number of customer records and store them in a Pandas DataFrame object.

4. Call the Function to Create Data

Let’s do so for 100 such customers:

The first few data instances should look like this:

Synthetically generated bank customer data


Use Case: ETL Pipeline Testing

Suppose another scenario in which we are interested in testing an ETL pipeline that ingests bank transactional data. The following code generates some simplified customer instances with fewer attributes than in the previous example, plus a new dataset containing bank transactions associated with some of these customers.

Note that the transaction data generation is designed so that 2% of transactions are duplicated, and negative amounts are also allowed, to simulate both income and payments.

Let’s look at some generated transactions:

Synthetically generated bank transaction data


Areas to focus on when testing data quality in a scenario like this, especially if transactions are ingested through an ETL pipeline, include:

  • Orphaned transactions, that is, possible transactions whose user_id attribute is not associated with any user ID in the customers dataset
  • Duplicate transaction IDs
  • Invalid currency codes
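The three checks above can be sketched with plain Pandas filters. The tiny frames and the set of valid currency codes below are illustrative stand-ins:

```python
import pandas as pd

# Stand-ins for the customers and transactions generated earlier
customers_df = pd.DataFrame({"user_id": ["u1", "u2"]})
tx_df = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t2"],
    "user_id": ["u1", "u3", "u3"],
    "currency": ["USD", "XXX", "XXX"],
})

VALID_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed whitelist

# 1) Orphaned transactions: user_id not present in the customers dataset
orphans = tx_df[~tx_df["user_id"].isin(customers_df["user_id"])]

# 2) Duplicate transaction IDs (keep=False flags every duplicated row)
dup_ids = tx_df[tx_df["transaction_id"].duplicated(keep=False)]

# 3) Invalid currency codes
bad_ccy = tx_df[~tx_df["currency"].isin(VALID_CURRENCIES)]
```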

Wrapping Up

This article presented Python’s Faker library and outlined its capabilities for generating synthetic datasets: a common need in real-world scenarios where real data is too scarce for analysis, testing, or training machine learning models on sufficiently realistic data.
