SpaCy For Entity Extraction
The first version of spaCy was released in 2015, and it quickly became a standard framework for enterprise-grade entity extraction (also known as named entity recognition, or NER).
If you have a piece of unstructured text (from the web, for example) and you want to extract structured data from it, such as dates, names, or places, spaCy is a very good solution.
SpaCy is interesting because several pre-trained models are available in around 20 languages (see more here). It means that you do not necessarily have to train your own model for entity extraction. It also means that, if you want to train your own model, you can start from a pre-trained model instead of starting from scratch, which might save you a lot of time.
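To give an idea of what using a pre-trained model looks like, here is a minimal sketch. It assumes spaCy is installed and that the small English model `en_core_web_sm` has been downloaded; the function names `extract_entities` and `group_by_label` are illustrative helpers, not part of spaCy.

```python
from collections import defaultdict


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (entity text, label) pairs found by a pre-trained spaCy pipeline."""
    # Imported here so the grouping helper below stays usable without spaCy.
    import spacy

    # Assumption: the model package has already been downloaded.
    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(pairs):
    """Group (entity text, label) pairs into {label: [texts]}, keeping order."""
    grouped = defaultdict(list)
    for entity_text, label in pairs:
        grouped[label].append(entity_text)
    return dict(grouped)


# Usage (requires `pip install spacy` and
# `python -m spacy download en_core_web_sm`):
# pairs = extract_entities("Apple hired Jane Doe in London in March 2021.")
# print(group_by_label(pairs))
```

In a real service you would load the model once and reuse it, since loading is much slower than processing a single text.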
SpaCy is considered a "production-grade" framework because it is very fast and reliable, and it comes with comprehensive documentation.
However, if the default entities supported by spaCy's pre-trained models are not enough, you will need to work on "data annotation" (also known as "data labelling") in order to train your own model. This process is extremely time-consuming, and many enterprise entity extraction projects fail because of this challenge.
Let's say that you want to extract job titles from a piece of text (from a resume, for example, or from a company web page). As spaCy's pre-trained models do not support such an entity by default, you will need to teach spaCy how to recognize job titles. You will need to create a training dataset that contains several thousand job title extraction examples (and maybe many more!). You may use a paid annotation tool like Prodigy (made by the spaCy team), but it still involves a lot of human work. It is actually quite common to see companies hire a team of contractors for several months in order to carry out a data annotation project. Such a job is so repetitive and boring that the resulting datasets often contain a lot of mistakes.
Figure: Data annotation example
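To make the annotation effort concrete, here is a rough sketch of what a single annotated example can look like. It uses the character-offset format `(text, {"entities": [(start, end, label)]})` that spaCy has traditionally accepted for NER training data. The `JOB_TITLE` label, the `annotate` helper, and the sample sentence are made up for illustration.

```python
def annotate(text, spans):
    """Build one training example from (substring, label) pairs.

    Substrings must be listed in the order they appear in the text.
    Returns (text, {"entities": [(start, end, label), ...]}) with
    character offsets, and raises ValueError if a substring is missing.
    """
    entities = []
    cursor = 0
    for substring, label in spans:
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError(f"{substring!r} not found after position {cursor}")
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end  # keep searching after the previous match
    return (text, {"entities": entities})


# Hypothetical example with a custom JOB_TITLE label.
example = annotate(
    "Maria Lopez joined Acme as Chief Technology Officer in 2019.",
    [("Chief Technology Officer", "JOB_TITLE")],
)
```

Now imagine producing several thousand of these by hand, each with exact offsets: this is where most of the time (and most of the mistakes) comes from.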
Let's see which alternative solutions you could try in 2023!
Stanford CoreNLP
The first version of Stanford CoreNLP was released in 2013. It is a Java framework (while spaCy is a Python one) that allows you to perform entity extraction with very good results.
Stanford CoreNLP offers pre-trained models too, but fewer than spaCy (see more here).
The accuracy of this framework is similar to spaCy's, but it depends on the data you are analyzing. For example, Stanford CoreNLP gives better results on legal data. It is also worth noting that some entities are handled slightly differently than in spaCy (this is the case for the GPE entity, for example).
When it comes to performance, Stanford CoreNLP seems clearly slower than spaCy, which might be a problem if you need very high throughput.
Flair
Flair is a more recent Python framework (released in 2018) based on the PyTorch deep learning framework.
It is gaining a lot of popularity because it reaches higher accuracy than spaCy in many languages. Several pre-trained models are available (see more here).
However, this accuracy improvement comes at the cost of speed: your throughput will be much lower than with spaCy.
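Since speed depends heavily on your hardware and your data, it is worth measuring throughput yourself before choosing a framework. Here is a minimal sketch of such a measurement. The `measure_throughput` helper is hypothetical, and `extract` stands for any function that takes a text and runs your entity extraction on it (spaCy, Flair, a call to Stanford CoreNLP, etc.).

```python
import time


def measure_throughput(extract, texts, repeats=3):
    """Return the best observed rate, in texts per second, for `extract`."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            extract(text)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(texts) / elapsed)
    return best


# Stand-in extractor so the sketch runs on its own; replace it with a
# function that calls the framework you want to evaluate.
sample_texts = ["Jane Doe works in Paris."] * 1000
rate = measure_throughput(lambda text: text.split(), sample_texts)
print(f"{rate:.0f} texts per second")
```

Run it with the same texts for each framework, and preferably with texts that look like your production data.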
Generative AI Models (GPT-J, GPT-3...)
A couple of years ago, a new kind of AI model started to appear: generative models. These models were initially created for text generation (you write the beginning of a piece of text and let the model generate the rest), but people quickly realized that they were very good at all sorts of natural language processing use cases, including entity extraction.
The most popular generative models today are GPT-3, GPT-J, GPT-NeoX, T5, and Bloom. All these deep learning models use the Transformer architecture, invented by Google in 2017.
This new generation of AI models is very heavy and expensive to run. These models usually require high-end hardware with one or several GPUs, and they are slower than frameworks like spaCy. But thanks to them, it is now possible to extract any kind of entity without training a dedicated model!
Extracting any entity without creating a dedicated model is possible thanks to few-shot learning. This technique consists of quickly showing the model what you want it to do by giving it just a couple of examples at runtime. Learn more about few-shot learning here.
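In practice, few-shot learning means building a prompt that contains a handful of solved examples followed by the text you want to process. Here is a minimal sketch of such a prompt builder. The `[Text]:` / `###` layout and the `build_few_shot_prompt` helper are just one possible convention chosen for illustration, not a format required by any particular model.

```python
def build_few_shot_prompt(examples, new_text, entity_name="job titles"):
    """Build a few-shot prompt from (text, [entities]) example pairs.

    The prompt ends with an open "[entity_name]:" line so that the model
    completes it with the entities found in `new_text`.
    """
    blocks = []
    for text, entities in examples:
        found = ", ".join(entities) if entities else "none"
        blocks.append(f"[Text]: {text}\n[{entity_name}]: {found}")
    blocks.append(f"[Text]: {new_text}\n[{entity_name}]:")
    return "\n###\n".join(blocks)


# Made-up examples for the job title use case.
examples = [
    ("Helena Smith founded Core.ai 2 years ago. She is now the CEO.", ["CEO"]),
    ("The weather was nice in Lisbon last week.", []),
]
prompt = build_few_shot_prompt(
    examples, "Tom is a data scientist and Anna is our new VP of Sales."
)
print(prompt)
```

You would then send this prompt to the generative model and read the entities from the text it generates.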
Getting back to our job title extraction example: if you want to extract job titles with a model like GPT-J, you will not need to annotate any data, which can save you weeks or months of human work. Accuracy will also most likely be much higher than with spaCy.
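One practical detail: a generative model returns free text, so you have to turn its answer back into structured data yourself. Here is a rough sketch of that step. It assumes the model was prompted to answer with a comma-separated list, to write "none" when nothing is found, and to use "###" as a separator between examples. All of these conventions, and the `parse_entities` helper, are assumptions made for illustration.

```python
def parse_entities(completion, stop="###", empty_marker="none"):
    """Turn the text generated by the model into a list of entities."""
    # Ignore anything the model generated after the separator.
    answer = completion.split(stop, 1)[0].strip()
    # Keep only the first line, which is expected to hold the entities.
    first_line = answer.splitlines()[0] if answer else ""
    items = [item.strip() for item in first_line.split(",")]
    return [item for item in items if item and item.lower() != empty_marker]


# Made-up completion, as a model might return it.
completion = " data scientist, VP of Sales\n###\n[Text]: Another text"
print(parse_entities(completion))
```

Generated text is not guaranteed to follow the expected format, so it is safer to validate the result (for example, by checking that each entity really appears in the input text).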
See our article about how to easily perform entity extraction with GPT models.
Conclusion
SpaCy is a great natural language processing framework that is used in production by many companies today for entity extraction tasks.
However, spaCy and alternatives like Stanford CoreNLP or Flair are limited in terms of accuracy, and they require tedious annotation work in order to extract new entities. In 2023, several alternative models based on text generation, such as GPT-J, GPT-NeoX, and GPT-3, can be used for entity extraction without any annotation. These new models will help more and more companies succeed in their entity extraction projects.
If you want to use GPT-J and GPT-NeoX, don't hesitate to have a try on