INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Language is the most important communication tool invented by human civilization. Whether spoken or written, it consists of the use of words in a structured and conventional way. Language helps us share our thoughts and understand others.
Natural Language Processing (NLP), a form of artificial intelligence, is all about analyzing and understanding written or spoken language and the context in which it is used. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.
Wikipedia defines NLP as “a subfield of AI concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”
DATA PREPROCESSING – NLP
In this session, we will show you ways of cleaning text to prepare a dataset for NLP. We will use built-in Python functions in this section, and we will introduce the NLTK library in the next session. Data preprocessing in NLP involves steps like splitting documents into sentences and words, and there are various ways to split text. Here we will go through some of them:
1) Split by White Spaces
Splitting by white space refers to splitting a document or text into words. Calling split() with no arguments splits the text on white space only. It does not take apostrophes into account. Example: look at how “who’s” is split.
text = 'Albert Einstein is one of the most brilliant scientists who’s ever lived.'
# split into words by white space
words = text.split()
print(words)
['Albert', 'Einstein', 'is', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who’s', 'ever', 'lived.']
2) Split by Words
The title already makes the function clear. Do you know the difference between splitting by words and splitting by white space? Notice the difference in “who’s”.
import re
# the sentence used in the remaining examples
text = 'Albert Einstein is widely celebrated as one of the most brilliant scientists who’s ever lived.'
# split based on words only
words = re.split(r'\W+', text)
print(words)
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived', '']
3) Normalization
In NLP, we often convert all uppercase characters to lowercase. We don’t recommend applying this step to every dataset, because normalizing the words can change the entire meaning. Example: “Orange” is a French telecom company, whereas “orange” is a fruit.
# split based on words only
words = re.split(r'\W+', text)
# convert every word to lowercase
words = [word.lower() for word in words]
print(words)
['albert', 'einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived', '']
NLTK
NLTK, the Natural Language Toolkit, is an open-source Python platform for working on Natural Language Processing. The library requires Python 3.5, 3.6, 3.7, or 3.8.
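If you don’t have the library yet, a minimal setup sketch looks like this (assuming pip is available; 'punkt' and 'stopwords' are the standard NLTK data packages needed by the tokenizers and stopword lists used later in this post):
# install the library first, e.g. with: pip install nltk
import nltk
# download the data packages used in the examples below
nltk.download('punkt')      # pre-trained Punkt sentence tokenizer
nltk.download('stopwords')  # stopword lists for several languages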
1) SPLIT BY SENTENCE
The sentence tokenizer divides a text into a list of sentences using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. It must be trained on a large collection of plain text in the target language before it can be used. The NLTK data package includes a pre-trained Punkt tokenizer for the English language.
from nltk import sent_tokenize
# split the text into a list of sentences
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)
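The pre-trained English model works well for ordinary text. If your documents come from a specialized domain with many unusual abbreviations, you can also train your own Punkt model. Here is a minimal sketch, assuming training_text is a placeholder variable holding a large plain-text corpus in the target language:
from nltk.tokenize.punkt import PunktSentenceTokenizer
# passing raw training text to the constructor trains a new model
custom_tokenizer = PunktSentenceTokenizer(training_text)
print(custom_tokenizer.tokenize('Dr. Smith arrived. He sat down.'))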
2) SPLIT BY WORDS
Make sure you check out the output and spot the differences in “who’s”.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', '’', 's', 'ever', 'lived', '.']
3) FILTERING
Python strings include the built-in method isalpha(), which can be used to determine whether a scanned word is purely alphabetical or something else (numerical, punctuation, special characters, etc.). Make sure you check out the output and spot the differences.
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words)
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived']
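To see why the apostrophe token and the final period disappear, here is a quick illustrative check of isalpha() on a few kinds of tokens (not from the original example):
# isalpha() is True only when every character is a letter
print('Einstein'.isalpha())  # True
print('3.14'.isalpha())      # False: digits and a period
print('who’s'.isalpha())     # False: contains an apostrophe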
4) REMOVE STOPWORDS
Stopwords are words that do not add much meaning to a sentence. They
can safely be ignored without sacrificing the meaning of the sentence.
The most common are short function words such as the, is, at, which, and on.
In this case, removing stopwords can cause problems when searching
for phrases that include them, particularly in names such as “The Who”
or “Take That”.
Removing the word “not” also changes the entire meaning of a sentence (try it on “this code is not good”), as the sketch below shows.
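Here is a quick illustration of that caveat (a minimal sketch; it assumes the stopwords corpus has already been downloaded):
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
sentence = 'this code is not good'
# 'this', 'is' and 'not' are all in the English stopword list
filtered = [w for w in sentence.split() if w not in stop_words]
print(filtered)  # ['code', 'good'] – the negation is gone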
# let's list all the stopwords for NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)
As you can see, the stopwords are all lowercase and have no punctuation. If we are to compare them with our tokens, we need to make sure our text is prepared the same way. The next cell recaps all that we have previously learnt in this Colab: tokenizing, lowercasing, and checking for alphabetic words.
tokens = word_tokenize(text)
# convert every token to lowercase
tokens = [w.lower() for w in tokens]
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
# remove the stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words)
['albert', 'einstein', 'widely', 'celebrated', 'one', 'brilliant', 'scientists', 'ever', 'lived']
STEMMING IN NATURAL LANGUAGE PROCESSING
Stemming refers to the process of reducing each word to its root or base. NLTK provides two suffix-stripping stemmers, Porter and Lancaster; each has its own algorithm, and they sometimes produce different outputs.
from nltk.stem.porter import PorterStemmer
# stem the filtered words from the previous cell
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed)
['albert', 'einstein', 'wide', 'celebr', 'one', 'brilliant', 'scientist', 'ever', 'live']
from nltk.stem.lancaster import LancasterStemmer
# the Lancaster algorithm is more aggressive than Porter
lancaster = LancasterStemmer()
stemmed = [lancaster.stem(word) for word in words]
print(stemmed)
['albert', 'einstein', 'wid', 'celebr', 'on', 'bril', 'sci', 'ev', 'liv']
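To compare the two algorithms word by word, here is a small sketch (assuming the words list from the recap cell above):
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
# print each word next to its Porter and Lancaster stems
for word in words:
    print(f'{word:12} porter: {porter.stem(word):12} lancaster: {lancaster.stem(word)}')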
This is all for today. Stay Safe & Happy Coding.