In data science, keyword extraction is a text analysis technique that lets you obtain the most important insights from a text in a short span of time. It helps you pull the relevant keywords out of any text and saves you the time of reading the whole document. There are multiple situations in different projects where you can use this kind of library. One example is automating keyword extraction when you write a blog post: if you're feeling lazy or less creative than usual, you can generate the keywords automatically from your original text. Another useful real-world case is when you publish a product in a store and it receives reviews. You can use this feature to extract the problems with the product automatically by analyzing thousands of reviews without reading all of them.
In this top 5, I will share with you 5 of the most useful Python libraries for automatically extracting keywords from any text in multiple languages.
This library is a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm, as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley & Sons.
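To give an idea of what RAKE does under the hood, here is a minimal pure-Python sketch of the algorithm as it is usually described: the text is cut into candidate phrases at punctuation and stopwords, every word gets a score of degree divided by frequency, and a phrase scores the sum of its words. This is not the code of the library itself. The tiny stopword list and the function names `candidate_phrases` and `rake_scores` are my own assumptions for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real setups use a much longer one).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from",
             "in", "is", "of", "on", "or", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Any run of punctuation ends the current fragment.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases the word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Highest score first; ties are broken alphabetically.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "Rapid keyword extraction from text. Keyword extraction is fast."
    for phrase, score in rake_scores(sample):
        print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that tend to appear in long phrases end up at the top, which is why RAKE favours multi-word keyphrases over single words.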
YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical features extracted from a single document to select the most important keywords of a text. The system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, an external corpus, the size of the text, its language or its domain. To demonstrate the merits of their proposal, the authors compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank) and one supervised method (KEA). According to the authors, experimental results on twenty datasets show that YAKE! significantly outperforms these methods across collections of different sizes, languages and domains.
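As a rough sketch of how this looks in practice, the snippet below wraps the `yake` package in a small helper. It assumes that `yake` is installed, that `yake.KeywordExtractor` accepts the `lan`, `n` and `top` arguments, and that `extract_keywords` returns `(keyword, score)` tuples, as shown in the project's README. The helper name `yake_keywords` and its `extractor` parameter are my own additions, not part of the library.

```python
def yake_keywords(text, language="en", max_ngram_size=3, top=10, extractor=None):
    """Return (keyword, score) pairs, most relevant first.

    `extractor` lets you reuse an already configured yake.KeywordExtractor
    across many documents; when omitted, one is created from the arguments.
    """
    if extractor is None:
        import yake  # third-party package, assumed to be installed

        extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=top)
    scored = extractor.extract_keywords(text)
    # In YAKE! a lower score means a more relevant keyword, so sort ascending.
    return sorted(scored, key=lambda pair: pair[1])
```

Calling `yake_keywords(my_text)` on a string of your own should give you up to ten keyphrases of at most three words each. Keep in mind that the scores are not comparable with those of other libraries, because here smaller is better.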
PKE is an open source, Python-based keyphrase extraction toolkit that provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. It also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset. This library can be installed with the following pip command (it requires Python 3.6+):
pip install git+https://github.com/boudinfl/pke.git
It also requires some extra resources to work (NLTK data and a spaCy model):
python -m nltk.downloader stopwords
python -m nltk.downloader universal_tagset
python -m spacy download en_core_web_sm  # download the English model
PKE provides a standardized API for extracting keyphrases from a document. It can be used as shown in the following script:
# script.py
import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input='/path/to/input.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
To use another model, simply replace pke.unsupervised.TopicRank with another model (see the list of implemented models).
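Swapping the model can be sketched as a small helper that receives the model name as a string. It assumes `pke` is installed and that the graph-based models named in `GRAPH_MODELS` are exposed under `pke.unsupervised` and work with the default arguments of each step. The helper `extract_keyphrases` and the `GRAPH_MODELS` tuple are hypothetical names of mine, and some models accept extra model-specific arguments that this sketch does not cover.

```python
# Graph-based unsupervised models assumed to be exposed under pke.unsupervised.
GRAPH_MODELS = ("TopicRank", "TextRank", "SingleRank", "PositionRank", "MultipartiteRank")


def extract_keyphrases(path, model_name="TopicRank", n=10, language="en"):
    """Run the same four-step pipeline as the script above with a chosen model."""
    if model_name not in GRAPH_MODELS:
        raise ValueError(f"Unknown model {model_name!r}; expected one of {GRAPH_MODELS}")
    import pke  # third-party package, assumed to be installed

    # Look the model class up by name, e.g. pke.unsupervised.TextRank.
    extractor = getattr(pke.unsupervised, model_name)()
    extractor.load_document(input=path, language=language)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    return extractor.get_n_best(n=n)
```

With this in place, comparing two models on the same file is a matter of calling the helper twice, for example with `model_name="TopicRank"` and `model_name="TextRank"`.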
MultiRake is a Multilingual Rapid Automatic Keyword Extraction (RAKE) library for Python that features:
- Automatic keyword extraction from text written in any language
- No need to know language of text beforehand
- No need to have list of stopwords
- 26 languages are currently available; for the rest, stopwords are generated from the provided text
- Just configure rake, plug in text and get keywords (see implementation details)
This implementation differs from others in its multilingual support. Basically, you can provide text without knowing its language (it should be written in the Cyrillic or Latin alphabet) and without an explicit list of stopwords, and still get a decent result, although the best results are achieved with a thoroughly constructed list of stopwords. During RAKE initialization, a language is referred to only by its language code:
- bg - Bulgarian
- cs - Czech
- da - Danish
- de - German
- el - Greek
- en - English
- es - Spanish
- fi - Finnish
- fr - French
- ga - Irish
- hr - Croatian
- hu - Hungarian
- id - Indonesian
- it - Italian
- lt - Lithuanian
- lv - Latvian
- nl - Dutch
- no - Norwegian
- pl - Polish
- pt - Portuguese
- ro - Romanian
- ru - Russian
- sk - Slovak
- sv - Swedish
- tr - Turkish
- uk - Ukrainian
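A minimal usage sketch with one of the codes above could look like the following. It assumes the package is installed and importable as `multi_rake`, that `Rake` accepts a `language_code` argument, and that `apply` returns a list of ranked results, as described in the project's README. The helper `multirake_keywords`, the `SUPPORTED_LANGUAGE_CODES` constant and the up-front validation are my own additions.

```python
# The 26 language codes listed above.
SUPPORTED_LANGUAGE_CODES = frozenset(
    "bg cs da de el en es fi fr ga hr hu id it lt lv nl no pl pt ro ru sk sv tr uk".split()
)


def multirake_keywords(text, language_code=None, top=10):
    """Return the `top` highest-ranked results from MultiRake for `text`.

    With language_code=None the library is left to work out the language itself.
    """
    if language_code is not None and language_code not in SUPPORTED_LANGUAGE_CODES:
        raise ValueError(
            f"{language_code!r} is not one of the 26 listed codes; "
            "pass None to let the library handle the language on its own"
        )
    from multi_rake import Rake  # third-party package, assumed to be installed

    rake = Rake(language_code=language_code)
    return rake.apply(text)[:top]
```

Validating the code before creating the `Rake` object is only a convenience so that a typo such as `"eng"` fails early with a readable message.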
KeyBERT is without a doubt one of the easiest libraries to use on this list. It is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document. This library can be easily installed with the following command using pip:
pip install keybert
After installing it, you can use it as a library in your scripts, as in the following example. You need to import the KeyBERT model and, once it's loaded, use it to extract the keywords from a variable that contains the plain text:
# script.py
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None))

#[
# ('learning', 0.4604),
# ('algorithm', 0.4556),
# ('training', 0.4487),
# ('class', 0.4086),
# ('mapping', 0.3700)
#]
And that's it! As specified by the author of the library, that's the goal of KeyBERT: a quick and easy method for creating keywords and keyphrases. For more information about this library, please visit the official repository or read the article on Medium written by the author of the library.
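If you are curious about what "most similar to a document" means, the ranking step can be sketched in a few lines of plain Python: every candidate word and the document are represented as vectors, and candidates are ordered by cosine similarity to the document. This is only an illustration of the idea and not KeyBERT's actual code. The 3-dimensional vectors below are invented toy values, whereas real BERT embeddings have hundreds of dimensions, and the function names are mine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(doc_vector, candidate_vectors, top_n=5):
    """Return the top_n (word, similarity) pairs, most similar first."""
    scored = [
        (word, cosine_similarity(doc_vector, vector))
        for word, vector in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for illustration only (not real embeddings).
doc_vector = [1.0, 1.0, 0.0]
candidates = {
    "learning": [1.0, 0.9, 0.1],
    "algorithm": [0.8, 0.2, 0.5],
    "reasonable": [0.0, 0.1, 1.0],
}

if __name__ == "__main__":
    for word, similarity in rank_candidates(doc_vector, candidates):
        print(f"{word}: {similarity:.4f}")
```

The candidate whose vector points in almost the same direction as the document vector comes first, which is the same intuition behind the scores printed by KeyBERT in the script above.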
If you know of another awesome library that allows the extraction of keywords from plain text, please share it with the community in the comment box.