https://github.com/ivan-bilan/The-NLP-Pandect A comprehensive reference for all topics related to Natural Language Processing
This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language
Processing that is available online.
Compendiums and awesome lists on the topic of NLP:
NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
Conferences
NLP Progress and NLP Tasks:
NLP Datasets:
Word and Sentence embeddings:
Notebooks, Scripts and Repositories
Non-English resources and compendiums
Pre-trained NLP models
SQuAD - Stanford Question Answering Dataset (SQuAD)
GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
CodeXGLUE - A benchmark dataset for code intelligence
XTREME - Massively Multilingual Multi-task Benchmark
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
RACE - ReAding Comprehension dataset collected from English Examinations
XQuAD - Cross-lingual Question Answering Dataset (XQuAD) for cross-lingual question answering
BLURB - Biomedical Language Understanding and Reasoning Benchmark
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
BLUE - Biomedical Language Understanding Evaluation benchmark
General
Embeddings
Repositories
Blogs
Cross-lingual Word Embeddings
vecmap - VecMap (cross-lingual word embedding mappings) [GitHub ~500 stars]
Byte Pair Encoding
bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub ~800 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub ~1500 stars]
python-bpe - Byte Pair Encoding for Python [GitHub ~100 stars]
Transformer-based Architectures
General
Transformer
The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
Transformers from Scratch [Blog, 2019]
Universal Transformers by Mostafa Dehghani [Blog, 2019]
Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub ~300 stars]
BERT
T5
GPT-family
General
GPT-3
BigBird
Other
Distillation, Pruning and Quantization
Automated Summarization
Rule-based NLP
LemmInflect - A python module for English lemmatization and inflection
Transformer-based Architectures
Embeddings as a Service
NLP Recipes Industrial Applications:
NLP Applications in Bio, Finance, Legal and other industries
General Speech Recognition
wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub ~14k stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub ~9k stars]
awesome-kaldi - resources for using Kaldi [GitHub ~300 stars]
Text to Speech
FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub ~500 stars]
Blogs
Frameworks for Topic Modeling
gensim - framework for topic modeling [GitHub ~11k stars]
Spark NLP [GitHub ~1k stars]
Repositories
General Purpose
spaCy by Explosion AI [GitHub ~17k stars]
flair by Zalando [GitHub ~9k stars]
AllenNLP by AI2 [GitHub ~9k stars]
stanza (formerly Stanford NLP) [GitHub ~4k stars]
spaCy stanza [GitHub ~400 stars]
nltk [GitHub ~9k stars]
gensim - framework for topic modeling [GitHub ~11k stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub ~2.5k stars]
polyglot - Multi-lingual NLP Framework [GitHub ~2k stars]
FARM [GitHub ~1k stars]
gobbli by RTI International [GitHub ~200 stars]
headliner - training and deployment of seq2seq models [GitHub ~200 stars]
SyferText - A privacy preserving NLP framework [GitHub ~100 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub ~600 stars]
TextHero - Text preprocessing, representation and visualization [GitHub ~2k stars]
textblob - TextBlob: Simplified Text Processing [GitHub ~7k stars]
AdaptNLP - A high level framework and library for NLP [GitHub ~200 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub ~800 stars]
textacy - NLP, before and after spaCy [GitHub ~1.5k stars]
Non-English oriented
textblob-de - TextBlob: Simplified Text Processing for German [GitHub ~100 stars]
Kashgari - Transfer Learning with a focus on Chinese [GitHub ~2k stars]
Underthesea - Vietnamese NLP Toolkit [GitHub ~800 stars]
Transformer-oriented
transformers by HuggingFace [GitHub ~28k stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub ~150 stars]
haystack - Transformers at scale for question answering & neural search [GitHub ~1k stars]
Dialog Systems and Speech
DeepPavlov by MIPT [GitHub ~4k stars]
ParlAI by FAIR [GitHub ~6k stars]
rasa - Framework for Conversational Agents [GitHub ~9k stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]
Distributed NLP
Machine Translation
COMET - A Neural Framework for MT Evaluation [GitHub ~50 stars]
Books
Courses
Tutorials
General
Tokenization
tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub ~3k stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub ~4k stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub ~100 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
WildNLP - Text manipulation library to test NLP models [GitHub ~100 stars]
snorkel - Framework to generate training data [GitHub ~4k stars]
NLPAug - Data augmentation for NLP [GitHub ~1k stars]
Blogs and Tutorials
Keyword Extraction
Text Rank
PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub ~1.3k stars]
textrank - TextRank implementation for Python 3 [GitHub ~1k stars]
RAKE - Rapid Automatic Keyword Extraction
rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub ~700 stars]
yake - Single-document unsupervised keyword extraction [GitHub ~400 stars]
RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub ~700 stars]
Other
flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub ~4.4k stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub ~100 stars]
NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub ~100 stars]
NLP Interpretability
Ethics, Bias, and Equality in NLP
License CC0
Attributions
Resources
All linked resources belong to original authors
Icons
Fonts