Wednesday, 18 March 2026
Further Applications with Context Vectors
Context vectors are powerful representations generated by transformer models that capture the meaning of words in their specific contexts. In our previous tutorials, we explored how to generate these vectors and some basic applications. Now, we’ll focus on building practical applications that leverage context vectors to solve real-world problems.
In this tutorial, we’ll implement several applications to demonstrate the power and versatility of context vectors. We’ll use the Hugging Face transformers library to extract context vectors from pre-trained models and build applications around them. Specifically, you will learn:
Building a semantic search engine with context vectors
Creating a document clustering and topic modeling application
Photo by Matheus Bertelli. Some rights reserved.
Overview
This post is divided into three parts; they are:
Building a Semantic Search Engine
Document Clustering
Document Classification
Building a Semantic Search Engine
If you want to find a specific document within a collection, you might use a simple keyword search. However, this approach is limited by the precision of keyword matching. You might not remember the exact wording used in the document, only what it was about. In such cases, semantic search is more effective.
Semantic search allows you to search by meaning rather than by keywords. Each document is represented by a context vector that captures its meaning, and the query is also represented as a context vector. The search engine then finds the documents most similar to the query, using a similarity measure such as L2 distance or cosine similarity.
Since you’ve already learned how to generate context vectors using a transformer model, let’s implement a simple semantic search engine:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
In this example, the context vector is created using the get_context_vector() function. You pass in the text as a string or a list of strings, and the tokenizer and model produce a tensor output. This output is a matrix of shape (batch size, sequence length, hidden size). Not all tokens in the sequence are valid, so you use the attention mask produced by the tokenizer to identify valid tokens.
Each input string’s context vector is computed as the mean of all valid token embeddings. Note that other methods to create context vectors are possible, such as using the [CLS] token or different pooling strategies.
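The pooling logic described above can be sketched in isolation. The tensors below are dummies standing in for the tokenizer's attention mask and the model's last hidden state; only the masked averaging is the point:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average the embeddings of valid (non-padding) tokens only."""
    # Expand the mask to the hidden dimension: (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).float()
    # Zero out padding tokens, then divide by the count of valid tokens
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Dummy model output: batch of 2 sequences, 4 tokens each, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],   # last token is padding
                     [1, 1, 0, 0]])  # last two tokens are padding
vectors = mean_pool(hidden, mask)
print(vectors.shape)  # torch.Size([2, 3])
```

In the real pipeline, `last_hidden_state` comes from the model's forward pass and `attention_mask` from the tokenizer; the averaging itself is unchanged.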
In this example, you begin with a collection of documents and a query string. You generate context vectors for both, and in semantic_search(), compare the query vector with all document vectors using cosine similarity to find the top-k most similar documents.
The output of the above code is:
Query: How do computers learn from data?
Result 1 (Similarity: 0.7573):
Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.
Result 2 (Similarity: 0.7342):
Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos.
You can see that the semantic search engine understands the meaning behind queries, rather than just matching keywords. However, the quality of results depends on how well the context vectors represent the documents and queries, as well as the similarity metric used.
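The comparison step itself does not depend on how the vectors were produced. Here is a minimal sketch under that assumption, with small hand-made vectors standing in for real context vectors (this `semantic_search()` is an illustrative re-implementation, not the exact code above):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query_vec, doc_vecs, top_k=2):
    """Return (index, score) pairs for the top_k documents most similar to the query."""
    sims = cosine_similarity(query_vec.reshape(1, -1), doc_vecs)[0]
    top = np.argsort(sims)[::-1][:top_k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in top]

# Toy vectors standing in for real context vectors
docs = np.array([[1.0, 0.0, 0.1],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.2]])
query = np.array([1.0, 0.05, 0.05])
results = semantic_search(query, docs)
for idx, score in results:
    print(f"Doc {idx}: similarity {score:.4f}")
```

Swapping cosine similarity for negative L2 distance only requires changing the scoring line; the ranking logic stays the same.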
Document Clustering
Document clustering groups similar documents together. It is useful when organizing a large collection of documents. While you could classify documents manually, that approach is time-consuming. Clustering is an automatic, unsupervised process—you don’t need to provide any labels. The algorithm groups documents into clusters based on their similarity.
With a context vector for each document, you can use any standard clustering algorithm. Below, we use K-means clustering:
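Before walking through the full example, the clustering step itself can be sketched in isolation, with placeholder vectors standing in for real context vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder vectors standing in for document context vectors;
# two loose groups are built in so the clusters are recoverable
doc_vectors = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, 0.2],   # group A
                        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])  # group B
documents = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(doc_vectors)

for cluster in range(2):
    print(f"Cluster {cluster + 1}:")
    for doc, label in zip(documents, labels):
        if label == cluster:
            print(f" - {doc}")
```

With real context vectors the arrays are higher-dimensional, but the fit-and-group loop is identical.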
In this example, the same get_context_vector() function is used to generate context vectors for a corpus of documents. Each document is transformed into a fixed-size context vector. Then, the K-means clustering algorithm groups the documents. The number of clusters is set to 3, but you can experiment with other values to see what makes the most sense.
The output of the above code is:
Cluster 1:
- Deep learning uses neural networks with many layers to learn representations of data with multiple levels of abstraction.
- Neural networks are computing systems inspired by the biological neural networks that constitute animal brains.
- Convolutional neural networks are deep neural networks most commonly applied to analyzing visual imagery.
- Sentiment analysis uses NLP to identify and extract opinions within text to determine writer's attitude.
Cluster 2:
- Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.
- Named entity recognition is a subtask of information extraction that seeks to locate and classify named entities in text.
- Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images.
- Image recognition is the ability of software to identify objects, places, people, writing and actions in images.
- Object detection is a computer technology related to computer vision and image processing.
Cluster 3:
- Machine learning algorithms build models based on sample data to make predictions without being explicitly programmed.
The quality of clustering depends on the context vectors and the clustering algorithm. To evaluate the results, you can visualize the clusters in 2D using Principal Component Analysis (PCA). PCA reduces the vectors to their first two principal components, which can be plotted in a scatter plot:
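The reduction step can be sketched as follows; placeholder vectors again stand in for real context vectors, and the scatter plot itself is left out so the sketch stays self-contained:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder high-dimensional vectors standing in for context vectors
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10, 768))  # 10 documents, 768-dim embeddings

pca = PCA(n_components=2)
reduced = pca.fit_transform(doc_vectors)  # shape (10, 2)
print(reduced.shape)

# reduced[:, 0] and reduced[:, 1] can now be fed to a scatter plot,
# e.g. matplotlib's plt.scatter, optionally colored by cluster label
```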
If you don’t see clear clusters—as in this case—it suggests the clustering isn’t ideal. You may need to adjust how you generate context vectors. However, the issue might also be that all the documents are related to machine learning, so forcing them into three distinct clusters may not be meaningful.
In general, document clustering helps automatically discover topics in a collection. For good results, you need a moderately large and diverse corpus with clear topic distinctions.
Document Classification
If you happen to have labels for the documents, you can use them to train a classifier. This goes one step beyond clustering. With labels, you control how documents are grouped.
You may need more data to train a reliable classifier. Below, we’ll use a logistic regression classifier to categorize documents.
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
The context vectors are generated the same way as in the previous example. Instead of clustering or manually comparing similarities, you provide a list of labels (one per document) to a logistic regression classifier. Using the implementation from scikit-learn, we train the model on the training set and evaluate it on the test set.
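The training and evaluation steps can be sketched with synthetic vectors standing in for real context vectors (the category labels and cluster centers below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic vectors standing in for context vectors, with one
# separable center per category so the classifier has signal
rng = np.random.default_rng(42)
n_per_class = 20
centers = {"Business": [3, 0], "Health": [0, 3], "Technology": [-3, -3]}
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, 2))
               for c in centers.values()])
y = np.repeat(list(centers.keys()), n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Predicting a new vector follows the same path: embed, then classify
new_vec = np.array([[2.8, 0.2]])  # lies near the "Business" center
print(clf.predict(new_vec))
```

In the real pipeline, `X` holds the context vectors from `get_context_vector()` and `y` the human-assigned labels; nothing else changes.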
The classification_report() function from scikit-learn provides metrics like precision, recall, F1 score, and accuracy. The result looks like this:
              precision    recall  f1-score   support

    Business       0.50      1.00      0.67         1
      Health       0.00      0.00      0.00         1
  Technology       1.00      1.00      1.00         1

    accuracy                           0.67         3
   macro avg       0.50      0.67      0.56         3
weighted avg       0.50      0.67      0.56         3
To use the trained classifier, follow the same workflow: use the get_context_vector() function to convert new text into context vectors, then pass them to the classifier to predict categories. When you run the above code, you should see:
Text: The central bank has decided to keep interest rates unchanged.
Category: Business
Text: A new study shows that regular exercise can reduce the risk of heart disease.
Category: Health
Text: The new laptop has a faster processor and more memory than previous models.
Category: Technology
Note that the classifier is trained on context vectors, which ideally capture the meaning of the text rather than just surface keywords. As a result, it should generalize more effectively to new inputs, even those that use unseen keywords.
Summary
In this post, you’ve explored how to build practical applications using context vectors generated by transformer models. Specifically, you’ve implemented:
A semantic search engine to find documents most similar to a query
A document clustering application to group documents into meaningful categories
A document classification system to categorize documents into predefined categories
These applications highlight the power and versatility of context vectors for understanding and processing text. By leveraging the semantic capabilities of transformer models, you can build sophisticated NLP systems that go beyond simple keyword matching or rule-based methods.