ivan-bilan / The-NLP-Pandect

Thursday, October 15, 2020, 00:24:09

https://github.com/ivan-bilan/The-NLP-Pandect

A comprehensive reference for all topics related to Natural Language Processing

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

Compendiums and awesome lists on the topic of NLP:

Awesome NLP by keon [GitHub ~10k stars]
Speech and Natural Language Processing Awesome List by elaboshira [GitHub ~2k stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub ~1k stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub ~300 stars]
Made with ML List by madewithml.com
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub ~600 stars]

NLP Conferences, Paper Summaries and Paper Compendiums:

TWIML AI [Years: 2016 - now, Status: active]
The Super Data Science Podcast [Years: 2016 - now, Status: active]
NLP Highlights [Years: 2017 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
Data Hack Radio [Years: 2018 - now, Status: active]
AI Game Changers [Years: 2020 - now, Status: active]

SQuAD - Stanford Question Answering Dataset (SQuAD)
GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
CodeXGLUE - A benchmark dataset for code intelligence
XTREME - Massively Multilingual Multi-task Benchmark
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
RACE - ReAding Comprehension dataset collected from English Examinations
XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
BLURB - Biomedical Language Understanding and Reasoning Benchmark
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
BLUE - Biomedical Language Understanding Evaluation benchmark

General

A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub ~1k stars]
sense2vec - Contextually-keyed word vectors [GitHub ~1k stars]
wikipedia2vec [GitHub ~500 stars]
StarSpace [GitHub ~3k stars]
fastText [GitHub ~21k stars]

Blogs

Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub ~500 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub ~800 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub ~1500 stars]
python-bpe - Byte Pair Encoding for Python [GitHub ~100 stars]

Transformer-based Architectures

General

The Transformer Family by Lilian Weng [Blog, 2020]
Keeping up with the BERTs: a review of the main NLP benchmarks by Manuel Tonneau [Blog, 2020]
Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
Attention? Attention! by Lilian Weng [Blog, 2018]
the transformer … “explained”? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
Understanding and Applying Self-Attention for NLP [Talk, 2018]

Transformer

The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
TRANSFORMERS FROM SCRATCH [Blog, 2019]
Universal Transformers by Mostafa Dehghani [Blog, 2019]
Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub ~300 stars]

BERT

A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
Understanding searches better than ever before [Blog, 2019]
Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub ~100 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub ~200 stars]

T5

T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
T5: the Text-To-Text Transfer Transformer [Blog, 2020]

GPT-family

General

The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
The Annotated GPT-2 by Aman Arora
OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub ~1.5k stars]
Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
OpenAI API - API Demo to use GPT-3 for commercial applications

BigBird

Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]

Other

What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
Turing NLG by Microsoft
Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
ELECTRA [GitHub ~1k stars]

Distillation, Pruning and Quantization

Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
David over Goliath: towards smaller models for cheaper, faster, and greener NLP by Manuel Tonneau [Blog, 2020]

Automated Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]

Rule-based NLP

LemmInflect - A python module for English lemmatization and inflection

Transformer-based Architectures

Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Embeddings as a Service

embedding-as-service [GitHub ~100 stars]
Bert-as-service [GitHub ~8k stars]

NLP Recipes Industrial Applications:

NLP Recipes by microsoft [GitHub ~5k stars]
NLP with Python by susanli2016 [GitHub ~1.5k stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub ~2k stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub ~300 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub ~600 stars]
FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub ~100 stars]
LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub ~400 stars]
NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP

General Speech Recognition

wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub ~14k stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub ~9k stars]
awesome-kaldi - resources for using Kaldi [GitHub ~300 stars]

Text to Speech

FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub ~500 stars]

Blogs

Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub ~11k stars]
Spark NLP [GitHub ~1k stars]

Repositories

Top2Vec [GitHub ~150 stars]
Anchored Correlation Explanation Topic Modeling [GitHub ~300 stars]
Topic Modeling in Embedding Spaces [GitHub ~200 stars, Paper]
TopicNet - A high-level interface for BigARTM library [GitHub ~100 stars]

General Purpose

spaCy by Explosion AI [GitHub ~17k stars]
flair by Zalando [GitHub ~9k stars]
AllenNLP by AI2 [GitHub ~9k stars]
stanza (formerly Stanford NLP) [GitHub ~4k stars]
spaCy stanza [GitHub ~400 stars]
nltk [GitHub ~9k stars]
gensim - framework for topic modeling [GitHub ~11k stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub ~2.5k stars]
polyglot - Multi-lingual NLP Framework [GitHub ~2k stars]
FARM [GitHub ~1k stars]
gobbli by RTI International [GitHub ~200 stars]
headliner - training and deployment of seq2seq models [GitHub ~200 stars]
SyferText - A privacy preserving NLP framework [GitHub ~100 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub ~600 stars]
TextHero - Text preprocessing, representation and visualization [GitHub ~2k stars]
textblob - TextBlob: Simplified Text Processing [GitHub ~7k stars]
AdaptNLP - A high level framework and library for NLP [GitHub ~200 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub ~800 stars]
textacy - NLP, before and after spaCy [GitHub ~1.5k stars]

Non-English oriented

textblob-de - TextBlob: Simplified Text Processing for German [GitHub ~100 stars]
Kashgari - Transfer learning with a focus on Chinese [GitHub ~2k stars]
Underthesea - Vietnamese NLP Toolkit [GitHub ~800 stars]

Transformer-oriented

transformers by HuggingFace [GitHub ~28k stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub ~150 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub ~1k stars]

Dialog Systems and Speech

DeepPavlov by MIPT [GitHub ~4k stars]
ParlAI by FAIR [GitHub ~6k stars]
rasa - Framework for Conversational Agents [GitHub ~9k stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]

Distributed NLP

Spark NLP [GitHub ~1k stars]

Machine Translation

COMET - A Neural Framework for MT Evaluation [GitHub ~50 stars]

Books

Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)

Courses

Tutorials

nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub ~1000 stars]
Hands-On NLTK Tutorial [GitHub ~300 stars]

r/LanguageTechnology - NLP Reddit forum

General

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub ~2k stars]

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub ~3k stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub ~4k stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub ~100 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP - Text manipulation library to test NLP models [GitHub ~100 stars]
snorkel - Framework to generate training data [GitHub ~4k stars]
NLPAug - Data augmentation for NLP [GitHub ~1k stars]

Blogs and Tutorials

A Visual Survey of Data Augmentation in NLP [Blog, 2020]
Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Keyword Extraction

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub ~1.3k stars]
textrank - TextRank implementation for Python 3 [GitHub ~1k stars]

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub ~700 stars]
yake - Single-document unsupervised keyword extraction [GitHub ~400 stars]
RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub ~700 stars]

Other

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub ~4.4k stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub ~100 stars]
NLP Profiler - A simple NLP library for profiling datasets with text columns [GitHub ~100 stars]

NLP Interpretability

Language Interpretability Tool (LIT) [GitHub ~150 stars]
Toolkit to help visualise what lies in word embeddings [GitHub ~150 stars]

Ethics, Bias, and Equality in NLP

Computational Ethics for NLP - course resources from Carnegie Mellon University [Lecture Notes, Spring 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track

License CC0

Attributions

Resources

All linked resources belong to original authors

Icons

Akropolis by parkjisun from the Noun Project
Book of Ester by Gilad Sotil from the Noun Project
quill by Juan Pablo Bravo from the Noun Project
acting by Flatart from the Noun Project
olympic by supalerk laipawat from the Noun Project
aristocracy by Eucalyp from the Noun Project
Horn by Eucalyp from the Noun Project
temple by Eucalyp from the Noun Project
constellation by Eucalyp from the Noun Project
ancient greek round pattern by Olena Panasovska from the Noun Project
Harp by Vectors Point from the Noun Project
Atlas by parkjisun from the Noun Project
Parthenon by Eucalyp from the Noun Project
papyrus by IconMark from the Noun Project

Fonts

Dalek Font
