How to remove stop words with NLTK in Python
In this tutorial, we will learn how to write a Python program that removes stop words from text using the NLTK (Natural Language Toolkit) library.
What are stop words?
Stop words are the most commonly used words in a language, such as a, an, the, and in. They usually carry little meaning on their own.
First, we import stopwords and word_tokenize. Next, we load the English stop words into a set and split the sentence into words (tokens). Then we use a for loop to keep only the tokens that are not in the stop-word set. This program uses English, but NLTK provides stop-word lists for other languages as well.
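The steps above assume that the NLTK data files are already installed. As a minimal sketch, the helper below loads the stop-word list for a given language and downloads the corpus on first use if it is missing. The name load_stop_words is a hypothetical helper for illustration, not part of NLTK. Note that word_tokenize needs its own tokenizer data as well (the punkt package, or punkt_tab on newer NLTK releases), which this sketch does not fetch.

```python
def load_stop_words(language="english"):
    """Return NLTK's stop-word list for `language` as a set.

    Hypothetical helper: downloads the 'stopwords' corpus once
    if NLTK reports that it is not installed yet.
    """
    # Imported here so that defining the helper does not require NLTK
    import nltk
    from nltk.corpus import stopwords

    try:
        return set(stopwords.words(language))
    except LookupError:
        # Raised by NLTK when the corpus files are not found locally
        nltk.download("stopwords")
        return set(stopwords.words(language))
```

For example, load_stop_words("spanish") would return the Spanish list, and stopwords.fileids() lists the languages available in the installed corpus.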
Program:
from nltk.corpus import stopwords
from nltk import word_tokenize

# Load the English stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))
print(stop_words)

# Split the sentence into word tokens
text = word_tokenize("The quick brown fox jumps over the lazy dog")

# Keep only the tokens that are not stop words (the comparison is case-sensitive)
new_sentence = []
for w in text:
    if w not in stop_words:
        new_sentence.append(w)

print(text)
print(new_sentence)
Output:
{'whom', "you'd", 'them', 've', "isn't", 'some', 'was', 'are', 'been', "don't", "shan't", 'myself', 'by', 'until', 'who', 'is', "needn't", "shouldn't", "wouldn't", 'won', 'just', 'did', 'themselves', 'how', 'nor', 'over', 'before', 'further', 'above', 'same', 'haven', 'or', 'of', 're', 'shan', "mustn't", 'ourselves', 'yourself', 'being', 'be', "won't", 's', 'its', 'so', 'up', 'now', 'where', 'theirs', 'do', 'more', 'too', 'here', 'should', 'herself', 'at', 'off', 'there', 'she', 'has', 'to', "hasn't", "couldn't", 'wouldn', 'ain', 'because', 'for', 'not', 'mustn', 't', 'again', 'hasn', 'itself', 'can', 'isn', 'ours', 'had', 'their', "it's", 'no', 'his', 'down', 'after', "wasn't", 'does', 'on', 'all', 'me', 'him', 'll', 'you', 'shouldn', "you're", 'once', "doesn't", 'an', 'her', 'below', 'this', 'didn', 'y', "didn't", 'each', "should've", 'weren', 'with', "hadn't", 'in', 'against', 'hers', 'doesn', 'your', 'o', 'have', 'the', 'out', 'into', 'why', "aren't", 'what', 'but', 'hadn', 'few', 'from', 'any', 'than', "haven't", 'himself', "you'll", 'own', 'he', 'very', 'as', 'ma', 'yourselves', 'those', 'about', 'we', 'our', 'needn', 'having', 'most', 'wasn', 'mightn', 'which', 'while', 'then', 'will', 'during', "weren't", 'm', 'both', 'a', 'these', 'couldn', "she's", 'that', 'doing', 'if', 'aren', 'were', 'i', 'yours', 'when', 'and', 'through', "you've", 'only', 'don', "mightn't", 'am', 'my', 'such', 'under', 'd', 'between', 'it', "that'll", 'they', 'other'}
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
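In the output above, the capitalized word The is still present. The stop-word list is all lowercase and the comparison in the loop is case-sensitive, so only the lowercase the is removed. If that is not what you want, one option is to lowercase each token before checking it. The following is a minimal sketch of that idea. The function name remove_stop_words and the small demo_stop_words set are assumptions made for illustration; in a real program you would pass set(stopwords.words('english')) instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Small hand-written stand-in for set(stopwords.words('english')),
# used only so this sketch runs without the NLTK corpus installed.
demo_stop_words = {"the", "over", "a", "an", "in"}

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The list comprehension does the same job as the for loop in the program, and the original spelling of each kept token is preserved because only the comparison uses the lowercase form.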
Advantages:
In tasks such as sentiment analysis of movie reviews or Twitter data, removing stop words strips out words that carry little information. This can help the analysis focus on the meaningful words and can lead to better models.
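To illustrate the point, the sketch below counts word frequencies in a made-up review before and after stop-word removal. The review text and the small demo_stop_words set are invented for this example, and str.split is used as a simplified stand-in for word_tokenize so that the sketch runs without NLTK data.

```python
from collections import Counter

# Hand-written stand-in for NLTK's English stop-word set (assumption)
demo_stop_words = {"the", "was", "and", "a", "but", "it"}

# Made-up review text, already lowercase
review = "the plot was slow and the acting was flat but the ending was great"
tokens = review.split()  # simplified tokenization for the demo

# Frequencies with and without stop words
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in demo_stop_words)

print(all_counts.most_common(2))  # [('the', 3), ('was', 3)]
print(sorted(content_counts))     # ['acting', 'ending', 'flat', 'great', 'plot', 'slow']
```

Before filtering, the most frequent words are stop words that say nothing about the reviewer's opinion. After filtering, only the content words remain.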