Pre-trained entity extraction models based on spaCy or NLTK give great results, but detecting non-native entities like job titles, VAT numbers, or drug names requires a tedious annotation and training process. Thanks to large language models like GPT-3, GPT-J, and GPT-NeoX, it is now possible to extract any type of entity with few-shot learning, without any annotation or training. In this article, we show how to do that.
NER (named entity recognition, also called entity extraction) is essentially about extracting structured information from unstructured text.
NER with spaCy and NLTK: the traditional way
SpaCy has pretty much become the de facto standard for NER in recent years (see the spaCy website). SpaCy is an attractive framework because it is easy to use, and its speed makes it well suited for production use.
SpaCy is a Python natural language processing framework that offers many pre-trained models in multiple languages, so it is easy to extract several entity types (companies, cities, addresses, dates, etc.) in your own language without having to train your own model.
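For instance, here is a minimal sketch of out-of-the-box NER with spaCy, assuming the small English model en_core_web_sm is installed:

import spacy

# Load a pre-trained English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Franck Riboud is the CEO at Danone, a company headquartered in Paris.")

# Print every entity detected by the pre-trained model
for ent in doc.ents:
    print(ent.text, ent.label_)

Note that such a pre-trained model returns generic labels like PERSON, ORG, or GPE: it will not return "CEO" as a job title, which is exactly the limitation discussed below.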
NLTK is also an interesting choice for entity extraction with Python, but it supports fewer entity types by default, and in general NLTK is not recommended for production (it is more of an educational and research framework).
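For comparison, here is a minimal sketch of entity extraction with NLTK's built-in chunker, assuming the required NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded:

import nltk

sentence = "Franck Riboud is the CEO at Danone."

# Tokenize, tag parts of speech, then chunk named entities
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Named entities are subtrees carrying a label like PERSON or ORGANIZATION
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(entity, subtree.label())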
However, you will quickly hit a limit with these frameworks: the number of natively supported entity types is small. Most companies want to leverage NER in order to extract specific business information like personal details, financial data, medical treatments, etc. These entities are of course not supported by default by the spaCy pre-trained models, so to extract them you have to create your own dataset and train your own model on it.
Training your own spaCy model involves a long and tedious annotation process: one or several people need to collaborate to create a large set of good examples and annotate them. The model needs a very large volume of examples in order to learn properly. Good annotation tools exist (like Prodigy, by the spaCy team), but annotation remains a painful task that causes many NLP projects to be abandoned.
Good news: with the rise of large language models like GPT-3, GPT-J, and GPT-NeoX, it is now possible to extract any entity without annotating data and training a new model!
Text Generation with GPT-3, GPT-J, and GPT-NeoX
Large language models for text generation started appearing with GPT-3 (see more about GPT-3 on OpenAI's website). When OpenAI released GPT-3, a model made up of 175 billion parameters, it was a revolution: it paved the way for many cutting-edge AI applications based on natural language processing without requiring any additional training.
The initial goal of GPT models like GPT-3 is to generate text: simply give an input to the model and let it generate the rest for you. Based on text generation, pretty much any natural language processing use case can be achieved: classification, summarization, conversational AI, paraphrasing... and of course entity extraction!
Since GPT-3 is not an open-source model, the open-source community has worked on alternatives, and we now have two great open-source equivalents: GPT-J and GPT-NeoX. They are still not as big as GPT-3, but it is no doubt only a matter of time before the open-source community catches up with OpenAI.
Properly leveraging these models requires a new technique called "few-shot learning".
Few-shot Learning
These GPT models are so large that they can learn a new task very quickly, from just a couple of examples provided directly in the prompt.
Let's say you want GPT-3 to generate a short product description for you. Here is an example without few-shot learning:
Generate a product description containing these specific keywords: t-shirt, men, $50
The response you will get will be useless. It could be something like this for example:
Generate a product description containing these specific keywords: t-shirt, men, $50 and short.
The product description needs to be a few words long. Don’t use plurals, use the keywords in the order they are
Good news: you can achieve much better results by simply giving a couple of examples to the model!
Generate a product description containing specific keywords.
Keywords: shoes, women, $59
Result: Beautiful shoes for women at the price of $59.
###
Keywords: trousers, men, $69
Result: Modern trousers for men, for $69 only.
###
Keywords: gloves, winter, $19
Result: Amazingly hot gloves for cold winters, at $19.
###
Keywords: gpu, gaming, $1499
Result:
The result will be something like this:
Generate a product description containing specific keywords.
Keywords: shoes, women, $59
Result: Beautiful shoes for women at the price of $59.
###
Keywords: trousers, men, $69
Result: Modern trousers for men, for $69 only.
###
Keywords: gloves, winter, $19
Result: Amazingly hot gloves for cold winters, at $19.
###
Keywords: gpu, gaming, $1499
Result: The best gaming GPU on the market, at the price of $1,499 only.
As you can see, the response from the model is now perfectly on point, thanks to the 3 examples we gave it first. Yet, this model was never trained on this kind of product description generation task. This is what the "few-shot learning" technique is about: you perform a sort of "transfer learning" on the fly, with only a couple of examples. Normally you would expect to train a natural language processing model on tons of examples to achieve this kind of result, but not here.
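In practice it is handy to assemble such prompts programmatically rather than by hand. Here is a minimal sketch; the build_prompt helper and the example list are hypothetical, not part of any library:

def build_prompt(instruction, examples, new_input, separator="###"):
    """Assemble a few-shot prompt: an instruction, a few solved
    examples, and the new input left open for the model to complete."""
    parts = [instruction]
    for keywords, result in examples:
        parts.append(f"Keywords: {keywords}\nResult: {result}\n{separator}")
    parts.append(f"Keywords: {new_input}\nResult:")
    return "\n".join(parts)

examples = [
    ("shoes, women, $59", "Beautiful shoes for women at the price of $59."),
    ("trousers, men, $69", "Modern trousers for men, for $69 only."),
    ("gloves, winter, $19", "Amazingly hot gloves for cold winters, at $19."),
]

prompt = build_prompt(
    "Generate a product description containing specific keywords.",
    examples,
    "gpu, gaming, $1499",
)
print(prompt)  # Send this prompt to GPT-3, GPT-J, or GPT-NeoX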
Entity Extraction With Few-shot Learning
Now we will perform entity extraction thanks to few-shot learning.
Let's say that you want to extract job titles from websites. Simply give a couple of job title extraction examples before making your actual request:
Extract job titles from the following sentences.
Sentence: John Doe has been working for Microsoft for 20 years as a Linux Engineer.
Job title: Linux Engineer
###
Sentence: John Doe has been working for Microsoft for 20 years and he loved it.
Job title: none
###
Sentence: Marc Simoncini | Director | Meetic
Job title: Director
###
Sentence: Franck Riboud was born on 7 November 1955 in Lyon. He is the son of Antoine Riboud, who transformed the former European glassmaker BSN Group into a leading player in the food industry. He is the CEO at Danone.
Job title: CEO
###
Sentence: Damien is the CTO of Platform.sh, he was previously the CTO of Commerce Guys, a leading ecommerce provider.
Job title:
The result will be the following:
Extract job titles from the following sentences.
Sentence: John Doe has been working for Microsoft for 20 years as a Linux Engineer.
Job title: Linux Engineer
###
Sentence: John Doe has been working for Microsoft for 20 years and he loved it.
Job title: none
###
Sentence: Marc Simoncini | Director | Meetic
Job title: Director
###
Sentence: Franck Riboud was born on 7 November 1955 in Lyon. He is the son of Antoine Riboud, who transformed the former European glassmaker BSN Group into a leading player in the food industry. He is the CEO at Danone.
Job title: CEO
###
Sentence: Damien is the CTO of Platform.sh, he was previously the CTO of Commerce Guys, a leading ecommerce provider.
Job title: CTO
As you noticed, we have to be smart about how we create our few-shot examples. It can happen that no job title is found at all, which is why we created an example returning "none" (this avoids false positives). Maybe you want to extract several job titles at the same time? In that case it is also important to create examples returning several job titles (comma-separated job titles, for example) and to parse the output accordingly, as sketched below.
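Here is a small, hypothetical sketch of how the model's answer can be post-processed under those conventions ("none" when nothing is found, commas when several job titles are returned):

def parse_job_titles(raw_answer):
    """Turn the raw completion into a list of job titles.
    Assumes the few-shot examples taught the model to answer
    'none' when nothing is found and to separate multiple
    job titles with commas."""
    answer = raw_answer.strip()
    if answer.lower() == "none":
        return []
    return [title.strip() for title in answer.split(",") if title.strip()]

print(parse_job_titles("CTO"))                  # ['CTO']
print(parse_job_titles("CTO, Linux Engineer"))  # ['CTO', 'Linux Engineer']
print(parse_job_titles("none"))                 # []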
You will get even better results by adding more examples, and it is important that your examples are as close as possible to your actual final request. For example, if you know you are going to analyze entire paragraphs instead of single sentences, it is best to create examples with paragraphs too.
If you don't have access to a GPT model, you can simply use the NLP Cloud API. Several clients are available (Python, Go, Node.js, Ruby, PHP...). Here is an example using GPT-J with the Python client:
import nlpcloud

# Connect to the GPT-J model on a GPU-backed instance
client = nlpcloud.Client("gpt-j", "your API token", gpu=True)

# Send the few-shot prompt: 4 solved examples, then the sentence to analyze
generation = client.generation("""Extract job titles from the following sentences.
Sentence: John Doe has been working for Microsoft for 20 years as a Linux Engineer.
Job title: Linux Engineer
###
Sentence: John Doe has been working for Microsoft for 20 years and he loved it.
Job title: none
###
Sentence: Marc Simoncini | Director | Meetic
Job title: Director
###
Sentence: Franck Riboud was born on 7 November 1955 in Lyon. He is the son of Antoine Riboud, who transformed the former European glassmaker BSN Group into a leading player in the food industry. He is the CEO at Danone.
Job title: CEO
###
Sentence: Damien is the CTO of Platform.sh, he was previously the CTO of Commerce Guys, a leading ecommerce provider.
Job title:""",
    top_p=0.1,
    length_no_input=True,
    remove_input=True,
    end_sequence="###",
    remove_end_sequence=True
)

# The extracted job title is returned in the generated text
print(generation["generated_text"])
The result will be: CTO
Let me give you a quick explanation of the text generation parameters we just used.
We set a very low top_p value because we don't want GPT-J to produce overly creative results: we just want it to stick to what it saw in the request.
"length_no_input" means that the maximum length value should not take the input text into account.
"remove_input" means that the input text should be removed from the result.
"end_sequence" means that when the model generates this string, it should stop generating text. Since our few-shot examples end each answer with "###", the model will automatically generate "###" after its response and stop there.
"remove_end_sequence" means that we want to remove "###" from the response.
You can see more details in the NLP Cloud documentation: see it here.
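If you plan to call this repeatedly, you could wrap it in a small helper. The extract_job_title function and the shortened prompt below are an illustrative sketch (not part of the nlpcloud client), assuming the response dictionary exposes the completion under a generated_text key:

import nlpcloud

# Shortened few-shot prompt with a placeholder for the sentence to analyze
FEW_SHOT_PROMPT = """Extract job titles from the following sentences.
Sentence: John Doe has been working for Microsoft for 20 years as a Linux Engineer.
Job title: Linux Engineer
###
Sentence: John Doe has been working for Microsoft for 20 years and he loved it.
Job title: none
###
Sentence: Marc Simoncini | Director | Meetic
Job title: Director
###
Sentence: {sentence}
Job title:"""

client = nlpcloud.Client("gpt-j", "your API token", gpu=True)

def extract_job_title(sentence):
    """Insert the sentence into the few-shot prompt and return the model's answer."""
    response = client.generation(
        FEW_SHOT_PROMPT.format(sentence=sentence),
        top_p=0.1,
        length_no_input=True,
        remove_input=True,
        end_sequence="###",
        remove_end_sequence=True,
    )
    return response["generated_text"].strip()

print(extract_job_title("Damien is the CTO of Platform.sh."))  # Expected: CTO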
Performance Considerations
Performing entity extraction with a GPT model gives you a lot of freedom, as any new entity can be extracted on the fly even if the model was never trained for it!
However, this comes at a cost: these large language models are huge and relatively slow.
For example, if you want to use GPT-J or GPT-NeoX, you will need a huge GPU with a lot of VRAM, like an NVIDIA RTX A6000 or A40, and there will be some latency (extracting an entity takes around 500ms). By contrast, spaCy or NLTK will be much faster and much less costly from an infrastructure standpoint.
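If you want to get a feel for the latency difference on your own setup, here is a rough, hypothetical timing sketch (the API timing will be dominated by network latency and model size, so treat the numbers as an order of magnitude only):

import time

import nlpcloud
import spacy

sentence = "Damien is the CTO of Platform.sh."

# Local spaCy pipeline with a pre-trained model
nlp = spacy.load("en_core_web_sm")
start = time.perf_counter()
doc = nlp(sentence)
print(f"spaCy: {time.perf_counter() - start:.3f}s, entities: {[ent.text for ent in doc.ents]}")

# GPT-J through the NLP Cloud API (a minimal prompt, just to measure a round trip)
client = nlpcloud.Client("gpt-j", "your API token", gpu=True)
start = time.perf_counter()
client.generation(
    "Extract job titles from the following sentences.\nSentence: " + sentence + "\nJob title:",
    top_p=0.1,
    length_no_input=True,
    remove_input=True,
    end_sequence="###",
    remove_end_sequence=True,
)
print(f"GPT-J via API: {time.perf_counter() - start:.3f}s")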
Conclusion
In 2022, it is possible to perform advanced NER very easily, without any annotation or training! This will greatly help companies deliver their entity extraction projects faster, and it also enables more cutting-edge applications based on natural language processing.
However, large language models like GPT-3, GPT-J, and GPT-NeoX are costly to run, so you should not underestimate the infrastructure costs involved.
I hope this article will help you save time and money!