In data science, keyword extraction is a text analysis technique that lets you obtain important insights from a text in a short span of time. It helps you pull the relevant keywords out of any text and saves you the time of reading the whole document. There are multiple scenarios in different projects where you can use this kind of library. One example is automating keyword extraction when you write a blog post: if you're feeling lazy or less creative than usual, you can generate the keywords automatically from your original text. Another real-world case is a product published in a store that receives reviews. You can use keyword extraction to identify the problems of the product automatically by analyzing thousands of reviews without reading all of them.
In this top 5, I will share with you 5 of the most useful Python libraries to automatically extract keywords from any text in multiple languages.
5. RAKE
A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.
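To give an idea of how the RAKE algorithm works, here is a minimal pure-Python sketch of it. It is not the code of the library mentioned above: the tiny stopword list, the function names and the tokenization rules are assumptions made only for illustration. Candidate phrases are obtained by splitting the text at stopwords and punctuation, each word is scored as degree / frequency, and each phrase is scored as the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real use needs a fuller list).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "from", "in", "is",
             "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split the text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation acts as a hard phrase boundary.
    for fragment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, stopwords=STOPWORDS, top_n=5):
    """Return the top_n (phrase, score) pairs, best first."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases containing it
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    phrase_score = {
        " ".join(phrase): sum(word_score[word] for word in phrase)
        for phrase in set(phrases)
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(phrase_score.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Calling rake_keywords(text) on any plain text returns a list of (phrase, score) tuples. Longer phrases made of words that co-occur with many other words tend to rank first, which is the typical behavior of RAKE.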
4. YAKE
YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from single documents to select the most important keywords of a text. It does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, the size of the text, its language or its domain. To demonstrate the merits of their proposal, the authors compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank) and one supervised method (KEA). According to the authors, experimental results on twenty datasets show that YAKE! significantly outperforms these methods across collections of different sizes, languages and domains.
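A minimal usage sketch of YAKE could look like the following. It assumes the package is installed as yake (for example with pip install yake). The wrapper function extract_with_yake and its default values are hypothetical names chosen for this example, not part of the library.

```python
def extract_with_yake(text, language="en", max_ngram_size=3, top=10):
    """Return the keywords found by YAKE as a list of pairs."""
    # Imported lazily so that this snippet loads even if yake is missing.
    import yake

    extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=top)
    # Each result pairs a keyword with its score.
    return extractor.extract_keywords(text)
```

In recent versions of the library, a lower score means a more relevant keyword, so the first items of the returned list are the best candidates.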
3. PKE
PKE is an open-source, Python-based keyphrase extraction toolkit that provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. It also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset. This library can be installed with pip directly from its GitHub repository (it requires Python 3.6+):
pip install git+https://github.com/boudinfl/pke.git
It also requires some extra resources to work, typically NLTK data (stopwords and universal_tagset) and a spaCy language model for the language of your text.
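PKE provides a standardized API for extracting keyphrases from a document. A typical script creates an extractor (for example pke.unsupervised.TopicRank), loads the document, selects the keyphrase candidates, weights them and finally retrieves the best-scoring candidates.
To use another model, simply replace pke.unsupervised.TopicRank with another model (see the list of implemented models in the official repository).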
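The PKE workflow can be sketched as follows. This is a minimal example that assumes pke and its language resources are already installed. The wrapper function extract_with_topicrank and its defaults are hypothetical names introduced for this example, and the accepted arguments may differ slightly between pke versions.

```python
def extract_with_topicrank(text, n=10, language="en"):
    """Return the n best keyphrases of a text as (keyphrase, score) tuples."""
    # Imported lazily so that this snippet loads even if pke is missing.
    import pke

    # 1. Create the extractor (swap TopicRank for another model if needed).
    extractor = pke.unsupervised.TopicRank()
    # 2. Load the content of the document (raw text here).
    extractor.load_document(input=text, language=language)
    # 3. Select the keyphrase candidates.
    extractor.candidate_selection()
    # 4. Weight the candidates (a random walk over topics for TopicRank).
    extractor.candidate_weighting()
    # 5. Retrieve the n highest-scored candidates.
    return extractor.get_n_best(n=n)
```

Because every model exposes the same steps, trying another algorithm usually only requires changing the line that creates the extractor.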
2. MultiRake
MultiRake is a Multilingual Rapid Automatic Keyword Extraction (RAKE) library for Python that features:
- Automatic keyword extraction from text written in any language
- No need to know the language of the text beforehand
- No need to have a list of stopwords
- 26 languages are currently available; for the rest, stopwords are generated from the provided text
- Just configure rake, plug in text and get keywords (see implementation details)
This implementation differs from others in its multilingual support. Basically, you may provide a text without knowing its language (it should be written in the Cyrillic or Latin alphabet) and without an explicit list of stopwords, and still get a decent result. The best result, however, is achieved with a thoroughly constructed list of stopwords. If you do specify the language during RAKE initialization, only its language code should be used:
- bg - Bulgarian
- cs - Czech
- da - Danish
- de - German
- el - Greek
- en - English
- es - Spanish
- fi - Finnish
- fr - French
- ga - Irish
- hr - Croatian
- hu - Hungarian
- id - Indonesian
- it - Italian
- lt - Lithuanian
- lv - Latvian
- nl - Dutch
- no - Norwegian
- pl - Polish
- pt - Portuguese
- ro - Romanian
- ru - Russian
- sk - Slovak
- sv - Swedish
- tr - Turkish
- uk - Ukrainian
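As a minimal sketch of how MultiRake is typically used, assuming the package is installed as multi_rake (for example with pip install multi_rake): the wrapper function extract_with_multi_rake is a hypothetical name introduced for this example. When language_code is left as None, the library tries to detect the language on its own.

```python
def extract_with_multi_rake(text, language_code=None):
    """Return the keywords found by MultiRake as (keyword, score) tuples."""
    # Imported lazily so that this snippet loads even if multi_rake is missing.
    from multi_rake import Rake

    # Pass one of the codes listed above (e.g. "en"), or None to autodetect.
    rake = Rake(language_code=language_code)
    return rake.apply(text)
```

Passing an explicit code such as "es" or "ru" skips the language detection step, which can help with short texts where detection is less reliable.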
1. KeyBERT
KeyBERT is without a doubt one of the easiest libraries to use on this list. It is a minimal keyword extraction technique that leverages BERT embeddings to create the keywords and keyphrases that are most similar to a document. This library can be easily installed with pip by running pip install keybert.
After installing it, import the KeyBERT model in your script. Once it is loaded, you can use it to extract the keywords from a variable that contains the plain text.
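A minimal sketch of such a script could look like this. The wrapper function extract_with_keybert and the chosen parameter values are assumptions made for this example. Creating the model downloads a pretrained embedding model the first time it runs, so an internet connection is needed.

```python
def extract_with_keybert(text, top_n=5):
    """Return the top_n keywords of a text as (keyword, similarity) tuples."""
    # Imported lazily so that this snippet loads even if keybert is missing.
    from keybert import KeyBERT

    # Load the default embedding model.
    kw_model = KeyBERT()
    return kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
        stop_words="english",          # drop English stopwords from candidates
        top_n=top_n,
    )
```

The keyphrase_ngram_range argument controls the length of the extracted phrases: (1, 1) returns single keywords only, while a wider range also allows longer keyphrases.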
And that's it! As stated by the author of the library, that's the goal of KeyBERT: a quick and easy method for creating keywords and keyphrases. For more information about this library, please visit the official repository or read the article on Medium written by the author of the library.
If you know another awesome library that allows the extraction of keywords from plain text, please share it with the community in the comment box.