Top 5: Best Python Libraries to Extract Keywords From Text Automatically

Check out our collection of 5 of the best open-source keyword extraction libraries for Python.

In data science, keyword extraction is a text analysis technique that lets you obtain the most important insights from a text in a short span of time. It helps you find the relevant keywords in any text and saves you the time of reading the whole document. There are many situations in different projects where you can use this kind of library. One example is automating keyword extraction when you write a blog post: if you're feeling lazy or less creative than usual, you can generate the keywords automatically from your original text. Another useful real-world case is a product published in a store that receives reviews. You can use keyword extraction to identify the problems of the product automatically by analyzing thousands of reviews without reading all of them.

In this top 5, I will share with you 5 of the most useful Python libraries to automatically extract keywords from any text in multiple languages.

5. RAKE

A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley & Sons.
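
To give an idea of what the algorithm does, here is a minimal pure-Python sketch of RAKE's core idea: stopwords and punctuation split the text into candidate phrases, every word is scored by its degree divided by its frequency, and the score of a phrase is the sum of the scores of its words. This is a simplified illustration and not the code of the library above; the function name rake_keywords and the tiny stopword list are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real use needs a complete one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "by", "for", "from", "in", "is",
    "it", "of", "on", "or", "that", "the", "this", "to", "with",
}


def rake_keywords(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = []
    # Punctuation always ends a candidate phrase, so split on it first.
    for fragment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        current = []
        for word in re.findall(r"\w+(?:['-]\w+)*", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases containing the word
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample = (
    "Keyword extraction is a text analysis technique. "
    "It saves the time of reading the whole document."
)
print(rake_keywords(sample)[:3])
# [('text analysis technique', 9.0), ('keyword extraction', 4.0), ('whole document', 4.0)]
```

Note how longer phrases tend to get higher scores, which is why a well-built stopword list matters so much for the quality of the result.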

4. YAKE

YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical features extracted from a single document to select the most important keywords of a text. It does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, the size of the text, its language or its domain. To demonstrate the merits of their approach, the authors compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank) and one supervised method (KEA). According to the authors, experimental results on twenty datasets show that YAKE! significantly outperforms these state-of-the-art methods across collections of different sizes, languages and domains.

3. PKE

PKE is an open-source, Python-based keyphrase extraction toolkit that provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. It also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset. This library can be installed with the following pip command (it requires Python 3.6+):

pip install git+https://github.com/boudinfl/pke.git

It also requires some extra resources to work (NLTK data and a spaCy language model):

python -m nltk.downloader stopwords
python -m nltk.downloader universal_tagset
python -m spacy download en_core_web_sm # download the English model

PKE provides a standardized API for extracting keyphrases from a document. It can be used as shown in the following script:

# script.py
import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input='/path/to/input.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

To use another model, simply replace pke.unsupervised.TopicRank with any other model from the library's list of implemented models.
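
Since the four calls in the script stay the same whichever model you pick, the pipeline can be wrapped in a small helper that receives an already created extractor. The helper name extract_keyphrases is hypothetical, and the sketch assumes that the chosen model works with the default arguments of candidate_selection and candidate_weighting, as TopicRank does in the script above.

```python
def extract_keyphrases(extractor, path, language="en", n=10):
    """Run the pke pipeline steps on the given extractor.

    `extractor` is any object exposing the same methods as
    pke.unsupervised.TopicRank in the script above.
    Returns the n best (keyphrase, score) tuples.
    """
    # 1. load and preprocess the document
    extractor.load_document(input=path, language=language)
    # 2. select the keyphrase candidates
    extractor.candidate_selection()
    # 3. weight the candidates
    extractor.candidate_weighting()
    # 4. keep the n highest scored candidates
    return extractor.get_n_best(n=n)
```

For example, passing pke.unsupervised.TextRank() instead of pke.unsupervised.TopicRank() as the first argument should be enough to compare two models on the same file.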

2. MultiRake

MultiRake is a Multilingual Rapid Automatic Keyword Extraction (RAKE) library for Python that features:

Automatic keyword extraction from text written in any language
No need to know the language of the text beforehand
No need to have a list of stopwords
26 languages are currently available; for the rest, stopwords are generated from the provided text
Just configure Rake, plug in the text and get the keywords

This implementation differs from others in its multilingual support. Basically, you can provide text without knowing its language (it should be written in the Cyrillic or Latin alphabet) and without an explicit list of stopwords, and still get a decent result, though the best results are achieved with a carefully constructed list of stopwords. When initializing RAKE, the language is specified by its code only. The supported codes are:

bg - Bulgarian
cs - Czech
da - Danish
de - German
el - Greek
en - English
es - Spanish
fi - Finnish
fr - French
ga - Irish
hr - Croatian
hu - Hungarian
id - Indonesian
it - Italian
lt - Lithuanian
lv - Latvian
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sv - Swedish
tr - Turkish
uk - Ukrainian
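
For languages that are not in this list, the stopwords are generated from the provided text. One plausible way to do that is to treat short words that appear very often as stopwords. The sketch below illustrates this idea only: it is an assumption and not MultiRake's actual implementation, and the function name generate_stopwords and its parameters are hypothetical.

```python
import re
from collections import Counter


def generate_stopwords(text, max_len=3, min_freq=2, top_share=0.2):
    """Guess stopwords: short words among the most frequent ones in the text."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Look only at the most frequent share of the distinct words.
    limit = max(1, int(len(counts) * top_share))
    frequent = [word for word, _ in counts.most_common(limit)]
    # Keep the frequent words that are short and repeated often enough.
    return {w for w in frequent if len(w) <= max_len and counts[w] >= min_freq}


text = "the cat and the dog and the bird saw the fox"
print(generate_stopwords(text, top_share=0.3))
```

Here "the" and "and" are the only words that are both short and repeated, so they are the ones returned. A heuristic like this is the reason why a curated stopword list still gives better results than a generated one.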

1. KeyBERT

KeyBERT is without a doubt one of the easiest libraries to use on this list. It is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document. The library can be installed with pip using the following command:

pip install keybert

After installing it, you can use it in your own scripts as shown in the following example: import the KeyBERT model and, once it's loaded, use it to extract the keywords from a variable that contains the plain text:

# script.py
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function that
    maps an input to an output based on example input-output pairs. It infers a
    function from labeled training data consisting of a set of training examples.
    In supervised learning, each example is a pair consisting of an input object
    (typically a vector) and a desired output value (also called the supervisory signal). 
    A supervised learning algorithm analyzes the training data and produces an inferred function, 
    which can be used for mapping new examples. An optimal scenario will allow for the 
    algorithm to correctly determine the class labels for unseen instances. This requires 
    the learning algorithm to generalize from the training data to unseen situations in a 
    'reasonable' way (see inductive bias).
"""

kw_model = KeyBERT()

# extract single-word keywords (1-grams) without removing stop words
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
print(keywords)
#[
#    ('learning', 0.4604),
#    ('algorithm', 0.4556),
#    ('training', 0.4487),
#    ('class', 0.4086),
#    ('mapping', 0.3700)
#]

And that's it! As stated by the author of the library, that's the goal of KeyBERT: a quick and easy method for creating keywords and keyphrases. For more information about this library, please visit the official repository or read the article on Medium written by its author.
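
If you are curious about what "most similar to a document" means, the idea can be sketched without any model at all: the document and every candidate keyword are turned into vectors, and the candidates are ranked by their cosine similarity to the document vector. The vectors below are made-up toy values that stand in for real BERT embeddings, and the function names are hypothetical; this only illustrates the ranking step, not KeyBERT's internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc_vector, candidate_vectors, top_n=5):
    """Return the top_n (candidate, similarity) pairs, most similar first."""
    scored = [
        (candidate, round(cosine(doc_vector, vector), 4))
        for candidate, vector in candidate_vectors.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy 3-dimensional vectors (hypothetical values, not real embeddings).
doc_vector = [0.9, 0.1, 0.3]
candidates = {
    "learning": [0.8, 0.2, 0.3],
    "weather": [0.1, 0.9, 0.0],
    "training": [0.7, 0.0, 0.5],
}
print(rank_candidates(doc_vector, candidates, top_n=2))
```

With these toy vectors, "learning" and "training" point in almost the same direction as the document, so they are returned ahead of "weather".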

If you know of another awesome library for extracting keywords from plain text, please share it with the community in the comment box.