How to remove punctuation and stopwords in Python NLTK
In this tutorial, you will learn how to write a program that removes punctuation and stopwords in Python using the NLTK library.
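How to remove punctuation in Python NLTK
We will use a regular expression with NLTK's RegexpTokenizer. The pattern \w+ matches runs of letters, digits, and underscores, so punctuation and whitespace never make it into the list of tokens.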
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
result = tokenizer.tokenize('hey! how are you ? buddy')
print(result)
Output:
['hey', 'how', 'are', 'you', 'buddy']
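To see what the \w+ pattern is doing, the same tokens can be produced with Python's built-in re module. This is only a minimal sketch, not NLTK's own code: tokenize_words is a hypothetical helper name, and it assumes the tokenizer simply collects every match of the pattern.

```python
import re

def tokenize_words(text):
    """Return runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)

tokens = tokenize_words("hey! how are you ? buddy")
print(tokens)

# The apostrophe is not matched by \w, so a contraction is split in two.
contraction = tokenize_words("don't stop")
print(contraction)
```

Note the second call: because the apostrophe counts as punctuation, "don't" comes back as 'don' and 't'. If contractions matter for your text, the pattern needs to allow apostrophes inside words.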
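How to remove stopwords in Python NLTK
NLTK ships a list of English stopwords in its stopwords corpus. The example below tokenizes a sentence with word_tokenize and keeps only the words that are not in that list. The corpus and the tokenizer models must be downloaded once, for example with nltk.download('stopwords') and nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).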
from nltk.corpus import stopwords
from nltk import word_tokenize
stop_words = set(stopwords.words('english'))
text = word_tokenize("The quick brown fox jumps over the lazy dog")
new_sentence = []
for w in text:
    # The comparison is case-sensitive and the stopword list is lowercase,
    # so a capitalized word such as 'The' is kept.
    if w not in stop_words:
        new_sentence.append(w)
print(text)
print(new_sentence)
Output:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
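Notice that 'The' survives in the output above while 'the' is removed. The stopword list is lowercase and the check is case-sensitive. One way to handle this is to lowercase each word before the lookup. The sketch below also combines both steps from this tutorial. So that it runs without any downloads, it makes two assumptions: SAMPLE_STOP_WORDS is a small hand-written stand-in for set(stopwords.words('english')), and remove_stopwords is a hypothetical helper name.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english'));
# the real NLTK list is much longer.
SAMPLE_STOP_WORDS = {"the", "over", "a", "an", "is", "are"}

def remove_stopwords(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stopword."""
    return [w for w in tokens if w.lower() not in stop_words]

# Step 1: drop punctuation by keeping only runs of word characters.
tokens = re.findall(r"\w+", "The quick brown fox jumps over the lazy dog!")

# Step 2: drop stopwords, ignoring case.
filtered = remove_stopwords(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

With the real NLTK list you would pass stop_words from the example above as the second argument. The original casing of the remaining words is preserved, since only the lookup is lowercased.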