How to remove stop words with NLTK in Python


This tutorial shows how to write a Python program that removes stop words from text using the NLTK library.

What are Stop words?

Stop words are the most commonly used words in a language, such as a, an, the, and in. They usually carry little meaning on their own.

First, we import stopwords and word_tokenize. We load the English stop words into a set, then split the sentence into words with word_tokenize. Finally, we use a for loop to keep only the words that are not in the stop word set. This program uses English, but NLTK also provides stop word lists for other languages.

Program:

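from nltk.corpus import stopwords
from nltk import word_tokenize

# If the NLTK data is missing, run nltk.download('stopwords') and
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').

# Build a set of English stop words for fast membership tests
stop_words = set(stopwords.words('english'))
print(stop_words)

# Split the sentence into words
text = word_tokenize("The quick brown fox jumps over the lazy dog")

# Keep only the words that are not stop words
new_sentence = []
for w in text:
    if w not in stop_words:
        new_sentence.append(w)

print(text)
print(new_sentence)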

Output:

{'whom', "you'd", 'them', 've', "isn't", 'some', 'was', 'are', 'been', "don't", "shan't", 'myself', 'by', 'until', 'who', 'is', "needn't", "shouldn't", "wouldn't", 'won', 'just', 'did', 'themselves', 'how', 'nor', 'over', 'before', 'further', 'above', 'same', 'haven', 'or', 'of', 're', 'shan', "mustn't", 'ourselves', 'yourself', 'being', 'be', "won't", 's', 'its', 'so', 'up', 'now', 'where', 'theirs', 'do', 'more', 'too', 'here', 'should', 'herself', 'at', 'off', 'there', 'she', 'has', 'to', "hasn't", "couldn't", 'wouldn', 'ain', 'because', 'for', 'not', 'mustn', 't', 'again', 'hasn', 'itself', 'can', 'isn', 'ours', 'had', 'their', "it's", 'no', 'his', 'down', 'after', "wasn't", 'does', 'on', 'all', 'me', 'him', 'll', 'you', 'shouldn', "you're", 'once', "doesn't", 'an', 'her', 'below', 'this', 'didn', 'y', "didn't", 'each', "should've", 'weren', 'with', "hadn't", 'in', 'against', 'hers', 'doesn', 'your', 'o', 'have', 'the', 'out', 'into', 'why', "aren't", 'what', 'but', 'hadn', 'few', 'from', 'any', 'than', "haven't", 'himself', "you'll", 'own', 'he', 'very', 'as', 'ma', 'yourselves', 'those', 'about', 'we', 'our', 'needn', 'having', 'most', 'wasn', 'mightn', 'which', 'while', 'then', 'will', 'during', "weren't", 'm', 'both', 'a', 'these', 'couldn', "she's", 'that', 'doing', 'if', 'aren', 'were', 'i', 'yours', 'when', 'and', 'through', "you've", 'only', 'don', "mightn't", 'am', 'my', 'such', 'under', 'd', 'between', 'it', "that'll", 'they', 'other'}
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
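
In the output above, 'The' is still present while 'the' was removed. The NLTK list is all lowercase and the membership test is case-sensitive, so the capitalized word does not match. One possible fix, sketched here, is to lowercase each word only for the comparison. The remove_stop_words helper and the small sample_stop_words set are assumptions made for illustration; in the real program you would pass set(stopwords.words('english')) instead.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [w for w in tokens if w.lower() not in stop_words]


# Hand-picked stand-in for set(stopwords.words('english')), used only
# so that this sketch runs without downloading the NLTK data.
sample_stop_words = {"the", "over", "a", "an", "in"}

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The original words are returned unchanged, so any capitalization in the words that remain is preserved.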

 

Advantages:

When doing sentiment analysis on movie reviews, Twitter data, or similar text, removing stop words from the input is a common preprocessing step. It reduces noise in the data, which can help produce more accurate analysis and better models.
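
One caveat for sentiment analysis: the English list printed in the output includes negations such as 'not', 'no' and 'nor', and dropping them can flip the meaning of a review. A minimal sketch of one way to handle this is to subtract the words you want to keep from the stop word set before filtering. The negations set and the small sample_stop_words stand-in are assumptions for illustration, not part of NLTK.

```python
# Hand-picked stand-in for set(stopwords.words('english')), used only
# so that this sketch runs without downloading the NLTK data.
sample_stop_words = {"the", "was", "not", "no", "nor", "a", "this"}

# Words we choose to keep because they carry sentiment (an assumption).
negations = {"not", "no", "nor"}
sentiment_stop_words = sample_stop_words - negations

review = ["this", "movie", "was", "not", "good"]
kept = [w for w in review if w.lower() not in sentiment_stop_words]
print(kept)  # ['movie', 'not', 'good']
```

Whether to keep negations depends on the model and the data, so it is worth comparing results with and without them.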

admin
