In this article we are going to explore code for the basic NLP operations using NLTK and spaCy.
NLTK
NLTK is an open-source library that is well suited for teaching, and working in, computational linguistics using Python.
It also provides wrappers for industrial-strength NLP libraries.
spaCy
spaCy is an open-source library for advanced NLP in Python.
It is designed specifically for production use and can handle large volumes of text, whereas NLTK and CoreNLP were created primarily for teaching and research.
spaCy provides advanced NLP techniques that are widely used in complex applications such as text summarization, text-to-speech, domain-specific NER, Q&A, emotion detection, etc.
I am planning to explore these one by one and share them with you in a series of posts.
First Releases
Open-Source Library | Year of First Release
spaCy | 2015
NLTK | 2001
It is widely mentioned in blog posts and articles that spaCy is faster and has almost all the features provided by other libraries (NLTK, CoreNLP, etc.), with more or less similar accuracy.
In this article we are going to analyze and compare code for the very basic NLP operations in spaCy and NLTK.
We are not going to compare the speed and accuracy of these libraries; however, knowing the code and the results from both may help in future research.
# spaCy
import spacy
# NLTK
import nltk
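Before running the snippets below, both libraries need a little one-time setup; here is a minimal sketch, assuming the small English model for spaCy and the standard NLTK data packages:
# One-time setup, run from a shell:
#   pip install spacy nltk
#   python -m spacy download en_core_web_sm

import nltk

# NLTK ships its models and corpora separately; grab the ones used below.
nltk.download('punkt')                       # sentence/word tokenizers
nltk.download('stopwords')                   # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('maxent_ne_chunker')           # named-entity chunker
nltk.download('words')                       # word list used by the chunker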
Word Tokenization
text = "He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari. About a week ago he slipped on the driveway at home and sustained an injury to his left ankle. He was seen at My-City Hospital and was told he had a fracture. He was placed in an air splint and advised to be partial weight bearing, and he is using a cane. He is here for routine follow-up."
In spaCy:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Output:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
In NLTK:
from nltk.tokenize import word_tokenize
print(word_tokenize(text))
Output:
['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
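Both tokenizers give the same result on this text, but their rules differ, so they will not always agree. As a quick sketch you can try (the sample sentence is just for illustration), spaCy's default rules split hyphenated words such as "My-City" into separate tokens, while NLTK's Treebank-style tokenizer tends to keep them together:
sample = "He was seen at My-City Hospital for follow-up."

# Compare the two tokenizers side by side on hyphenated words.
print([token.text for token in nlp(sample)])
print(word_tokenize(sample))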
Sentence Tokenization
In spaCy:
text = "He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari. About a week ago he slipped on the driveway at home and sustained an injury to his left ankle. He was seen at My-City Hospital and was told he had a fracture. He was placed in an air splint and advised to be partial weight bearing, and he is using a cane. He is here for routine follow-up."
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Output:
He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.
About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.
He was seen at My-City Hospital and was told he had a fracture.
He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.
He is here for routine follow-up.
In NLTK:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))
Output:
['He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.', 'About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.', 'He was seen at My-City Hospital and was told he had a fracture.', 'He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.', 'He is here for routine follow-up.']
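A note on spaCy: doc.sents relies on the loaded pipeline (en_core_web_sm derives sentence boundaries from its parser). If sentence splitting is all you need, a lighter option, assuming spaCy v3, is a blank pipeline with the rule-based sentencizer:
import spacy

# A blank pipeline with only the sentencizer is much faster
# than a full statistical model for plain sentence splitting.
nlp_light = spacy.blank("en")
nlp_light.add_pipe("sentencizer")

for sent in nlp_light(text).sents:
    print(sent)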
Stopword Removal
In spaCy:
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
text_without_sw = [token.text for token in doc if not token.is_stop]
print(text_without_sw)
Output:
['outlay', 'home', '.', 'surprise', ',', '.', 'Samsung', 'expanded', 'overseas', ',', 'South', 'Korea', 'host', 'factories', 'research', 'engineers', '.']
In NLTK:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# Note: text.split() keeps punctuation attached to its word (e.g. 'home.'),
# which is why this output differs slightly from spaCy's token-based result.
words = text.split()
text_without_sw = [token for token in words if token not in stop_words]
print(text_without_sw)
Output:
['Most', 'outlay', 'home.', 'No', 'surprise', 'there,', 'either.', 'While', 'Samsung', 'expanded', 'overseas,', 'South', 'Korea', 'still', 'host', 'factories', 'research', 'engineers.']
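Both libraries expose their stopword lists, so you can inspect them or add your own entries. A minimal sketch (adding "samsung" as a stopword is purely illustrative):
from nltk.corpus import stopwords

# Inspect the built-in lists; the two libraries ship different sets,
# which is another reason their outputs above can differ.
print(len(nlp.Defaults.stop_words))      # spaCy's English stopwords
print(len(stopwords.words('english')))   # NLTK's English stopwords

# Add a custom stopword in spaCy (illustrative only).
nlp.Defaults.stop_words.add("samsung")
nlp.vocab["samsung"].is_stop = True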
POS Tagging
In spaCy:
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
tokens_with_POS = [token.text + " - " + token.pos_ for token in doc]
print(tokens_with_POS)
Output:
['Most - ADJ', 'of - ADP', 'the - DET', 'outlay - NOUN', 'will - AUX', 'be - VERB', 'at - ADP', 'home - NOUN', '. - PUNCT', 'No - DET', 'surprise - NOUN', 'there - ADV', ', - PUNCT', 'either - ADV', '. - PUNCT', 'While - SCONJ', 'Samsung - PROPN', 'has - AUX', 'expanded - VERB', 'overseas - ADV', ', - PUNCT', 'South - PROPN', 'Korea - PROPN', 'is - AUX', 'still - ADV', 'host - NOUN', 'to - ADP', 'most - ADJ', 'of - ADP', 'its - PRON', 'factories - NOUN', 'and - CCONJ', 'research - NOUN', 'engineers - NOUN', '. - PUNCT']
In NLTK:
from nltk import pos_tag, word_tokenize

sent = word_tokenize(text)
sent = pos_tag(sent)
print(sent)
Output:
[('Most', 'JJS'), ('of', 'IN'), ('the', 'DT'), ('outlay', 'NN'), ('will', 'MD'), ('be', 'VB'), ('at', 'IN'), ('home', 'NN'), ('.', '.'), ('No', 'DT'), ('surprise', 'NN'), ('there', 'RB'), (',', ','), ('either', 'DT'), ('.', '.'), ('While', 'IN'), ('Samsung', 'NNP'), ('has', 'VBZ'), ('expanded', 'VBN'), ('overseas', 'RB'), (',', ','), ('South', 'NNP'), ('Korea', 'NNP'), ('is', 'VBZ'), ('still', 'RB'), ('host', 'VBN'), ('to', 'TO'), ('most', 'JJS'), ('of', 'IN'), ('its', 'PRP$'), ('factories', 'NNS'), ('and', 'CC'), ('research', 'NN'), ('engineers', 'NNS'), ('.', '.')]
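Notice that the tag sets differ: spaCy's token.pos_ gives coarse Universal POS tags (NOUN, VERB, ...), while NLTK's pos_tag returns Penn Treebank tags (NN, VBZ, ...). spaCy also stores fine-grained Treebank-style tags in token.tag_, and both libraries can explain what a tag means; a quick sketch:
# spaCy: coarse (UPOS) vs fine-grained (Penn Treebank-style) tags.
for token in nlp("Samsung has expanded overseas."):
    print(token.text, token.pos_, token.tag_)

# Look up a tag description in either library.
print(spacy.explain("NNP"))     # e.g. 'noun, proper singular'
nltk.help.upenn_tagset("NNP")   # may first need nltk.download('tagsets')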
Named Entity Recognition
NER using spaCy:
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Samsung 69 76 ORG
South Korea 100 111 GPE
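spaCy also bundles displacy, a small visualizer that highlights the entities inline, which is handy for eyeballing NER results in a notebook:
from spacy import displacy

# Highlights entity spans in the text; in a plain script,
# use displacy.serve(doc, style="ent") instead.
displacy.render(doc, style="ent")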
NER using NLTK:
Download NLTK words package:
nltk.download('words')
Output:
[nltk_data] Downloading package words to
[nltk_data] C:\Users\aasha\AppData\Roaming\nltk_data...
[nltk_data] Package words is already up-to-date!
Using ne_chunk, pos_tag:
The pos_tag function takes a tokenized sentence and returns the words tagged with their parts of speech (noun, verb, etc.).
ne_chunk takes a POS-tagged sentence and returns a tree object in which named entities are labeled as PERSON, GPE (location), ORGANIZATION, and so on.
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
sent = "What is the weather in Chicago today?"
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))))
Output:
(S
What/WP
is/VBZ
the/DT
weather/NN
in/IN
(GPE Chicago/NNP)
today/NN
?/.)
You can see that the named entities are tagged: Chicago is labeled GPE (geo-political entity).
Complete code to extract Named entity using ne_chunk and pos_tag:
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Output:
PERSON Samsung
GPE South Korea
Conclusion:
Now we have seen how the basic NLP operations can be done in both NLTK and spaCy. Comparing the code side by side is a useful, easy way to understand the basic features of the two libraries. Note that they do not always agree: in the last example, spaCy labeled Samsung as an ORG while NLTK's ne_chunk labeled it a PERSON.
We will see more about NLP techniques and their applications in this series.
Thank you for reading our article, and I hope you enjoyed it. 😊 Try all these techniques and play with words.
Like to support? Just click the like button ❤️.
Happy Learning! 👩💻