Artificial Intelligence, Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog. Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Tuesday, 12 November 2024

Datasets for Natural Language Processing

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take long to fit models on. It also helps to use standard datasets that are well understood and widely used, so that you can compare your results against published ones and see whether you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

Text Classification
Language Modeling
Image Captioning
Machine Translation
Question Answering
Speech Recognition
Document Summarization

I have tried to provide a mixture of datasets that are modest in size and popular in academic papers.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Let’s get started.

1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.

Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987 indexed by categories. Also see RCV1, RCV2 and TRC2.
IMDB Movie Review Sentiment Classification (Stanford). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.

For more, see the post:

Datasets for single-label text categorization.
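To make the task concrete, here is a minimal sketch of sentiment classification using a multinomial Naive Bayes model with add-one smoothing, written with the standard library only. The four training sentences are invented for illustration and merely stand in for a real dataset such as the movie reviews above; the function names are hypothetical too.

```python
import math
from collections import Counter, defaultdict

# Toy labeled examples (invented for illustration, not taken from any dataset).
TRAIN = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull and boring film", "neg"),
    ("terrible acting and a boring story", "neg"),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior for the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict("a wonderful story", model))    # pos
print(predict("boring and terrible", model))  # neg
```

A simple count-based baseline like this is useful to have before fitting a deep learning model on the same data, because it gives you a score to beat.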

2. Language Modeling

Language modeling involves developing a statistical model that predicts the next word in a sentence, or the next letter in a word, given whatever has come before. It is a precursor to tasks like speech recognition and machine translation.

Below are some good beginner language modeling datasets.

Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.

There are more formal corpora that are well studied; for example:

Brown University Standard Corpus of Present-Day American English. A large sample of English words.
Google 1 Billion Word Corpus.
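As a rough illustration of predicting the next word from what came before, the sketch below builds a bigram model, which conditions only on the single previous word. The tiny text and the function names are assumptions made for this example; in practice you would substitute tokens from one of the corpora above.

```python
from collections import Counter, defaultdict

# Tiny invented text, pre-tokenized by whitespace (illustration only).
TEXT = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def train_bigram(tokens):
    """Count how often each word follows each other word."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following


def next_word_probs(following, prev):
    """Return the estimated distribution P(next | prev) as a dict."""
    counts = following.get(prev, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def predict_next(following, prev):
    """Return the most frequent follower of prev, or None if prev is unseen."""
    counts = following.get(prev)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


bigrams = train_bigram(TEXT.split())
print(predict_next(bigrams, "the"))     # cat
print(next_word_probs(bigrams, "sat"))  # {'on': 1.0}
```

A neural language model replaces the count table with learned parameters and a longer context, but the prediction task is the same.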

3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

For more, see the post:

Exploring Image Captioning Datasets, 2016

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of European languages.

There are a ton of standard datasets used for the annual machine translation challenges; see:

Statistical Machine Translation
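Parallel corpora like the ones above are often distributed as two line-aligned text files, where line N of one file is the translation of line N of the other. That layout is an assumption here, so check the documentation of the dataset you download. A minimal sketch of loading such pairs, using in-memory stand-ins for the two files and a hypothetical helper name:

```python
import io

# In-memory stand-ins for two line-aligned files (invented content).
EN_FILE = io.StringIO("Hello.\nThank you.\n\nGood night.\n")
FR_FILE = io.StringIO("Bonjour.\nMerci.\n\nBonne nuit.\n")


def read_pairs(source, target):
    """Pair up lines from two aligned files, skipping pairs with an empty side."""
    pairs = []
    for src_line, tgt_line in zip(source, target):
        src_line, tgt_line = src_line.strip(), tgt_line.strip()
        if src_line and tgt_line:
            pairs.append((src_line, tgt_line))
    return pairs


pairs = read_pairs(EN_FILE, FR_FILE)
for english, french in pairs:
    print(english, "->", french)
```

With real files, you would pass two handles opened with `open(path, encoding="utf-8")` in place of the `io.StringIO` objects.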

5. Question Answering

Question answering is the task of answering questions about a provided sentence or passage of text.

Below are some good beginner question answering datasets.

Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
DeepMind Question Answering Corpus. Question answering about news articles from the Daily Mail.
Amazon question/answer data. Question answering about Amazon products.

For more, see the post:

Datasets: How can I get corpus of a question-answering website like Quora or Yahoo Answers or Stack Overflow for analyzing answer quality?
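Question answering datasets typically pair a passage with questions and answer spans. The sketch below walks a nested structure modeled on the SQuAD v1.1 JSON layout (articles, then paragraphs with a `context`, then `qas` with `answers` that carry `text` and `answer_start`). Treat the field names as an assumption to verify against the file you download; the sample content is invented.

```python
# Invented passage and answer; the start offset is computed so it cannot drift.
CONTEXT = "Project Gutenberg is a large collection of free books."
ANSWER = "a large collection of free books"

# A small in-memory sample shaped like a SQuAD v1.1 file (assumed layout).
SAMPLE = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": CONTEXT,
                    "qas": [
                        {
                            "id": "example-0",
                            "question": "What is Project Gutenberg?",
                            "answers": [
                                {
                                    "text": ANSWER,
                                    "answer_start": CONTEXT.index(ANSWER),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}


def iter_examples(dataset):
    """Yield (context, question, answer_text, answer_start) tuples."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (
                        context,
                        qa["question"],
                        answer["text"],
                        answer["answer_start"],
                    )


examples = list(iter_examples(SAMPLE))
for context, question, text, start in examples:
    # The answer is a character span inside the context.
    assert context[start:start + len(text)] == text
    print(question, "->", text)
```

To read a downloaded file instead, you would load it with `json.load` and pass the resulting dictionary to `iter_examples`.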

6. Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English and associated transcription.
VoxForge. Project to build an open source database for speech recognition.
LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short meaningful description of a larger document.

Below are some good beginner document summarization datasets.

Legal Case Reports Data Set. A collection of 4 thousand legal cases and their summaries.
TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.
