ivan-bilan / The-NLP-Pandect

  • Thursday, October 15, 2020, 00:24:09
https://github.com/ivan-bilan/The-NLP-Pandect


A comprehensive reference for all topics related to Natural Language Processing



The-NLP-Pandect

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

The-NLP-Resources

Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries
Conferences

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and compendiums

Pre-trained NLP models

The-NLP-Podcasts

The-NLP-Newsletter

The-NLP-Meetups

The-NLP-Youtube

The-NLP-Benchmarks

  • SQuAD - Stanford Question Answering Dataset
  • GLUE - General Language Understanding Evaluation benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • CodeXGLUE - A benchmark dataset for code intelligence
  • XTREME - Massively Multilingual Multi-task Benchmark
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • RACE - ReAding Comprehension dataset collected from English Examinations
  • XQuAD - Cross-lingual Question Answering Dataset
  • BLURB - Biomedical Language Understanding and Reasoning Benchmark
  • IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
  • BLUE - Biomedical Language Understanding Evaluation benchmark

The-NLP-Research

General

Embeddings

Repositories

Blogs

Cross-lingual Word Embeddings

  • vecmap - VecMap (cross-lingual word embedding mappings) [GitHub ~500 stars]

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub ~800 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub ~1500 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub ~100 stars]

Transformer-based Architectures

General

Transformer

BERT

T5

GPT-family

General
GPT-3

BigBird

Other

Distillation, Pruning and Quantization

Automated Summarization

Rule-based NLP

  • LemmInflect - A Python module for English lemmatization and inflection

The-NLP-Industry

Transformer-based Architectures

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

The-NLP-Speech

General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub ~14k stars]
  • Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Toolkit for speech recognition [GitHub ~9k stars]
  • awesome-kaldi - resources for using Kaldi [GitHub ~300 stars]

Text to Speech

  • FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub ~500 stars]

The-NLP-Topics

Blogs

Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub ~11k stars]
  • Spark NLP [GitHub ~1k stars]

Repositories

The-NLP-Frameworks

General Purpose

  • spaCy by Explosion AI [GitHub ~17k stars]
  • flair by Zalando [GitHub ~9k stars]
  • AllenNLP by AI2 [GitHub ~9k stars]
  • stanza (formerly Stanford NLP) [GitHub ~4k stars]
  • spaCy stanza [GitHub ~400 stars]
  • nltk [GitHub ~9k stars]
  • gensim - framework for topic modeling [GitHub ~11k stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub ~2.5k stars]
  • polyglot - Multilingual NLP Framework [GitHub ~2k stars]
  • FARM [GitHub ~1k stars]
  • gobbli by RTI International [GitHub ~200 stars]
  • headliner - training and deployment of seq2seq models [GitHub ~200 stars]
  • SyferText - A privacy-preserving NLP framework [GitHub ~100 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub ~600 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub ~2k stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub ~7k stars]
  • AdaptNLP - A high-level framework and library for NLP [GitHub ~200 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub ~800 stars]
  • textacy - NLP, before and after spaCy [GitHub ~1.5k stars]

Non-English oriented

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub ~100 stars]
  • Kashgari - Transfer learning with a focus on Chinese [GitHub ~2k stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub ~800 stars]

Transformer-oriented

  • transformers by HuggingFace [GitHub ~28k stars]
  • Adapter Hub and its documentation - Adapter modules for Transformers [GitHub ~150 stars]
  • haystack - Transformers at scale for question answering & neural search. [GitHub ~1k stars]

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub ~4k stars]
  • ParlAI by FAIR [GitHub ~6k stars]
  • rasa - Framework for Conversational Agents [GitHub ~9k stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]

Distributed NLP

Machine Translation

  • COMET - A Neural Framework for MT Evaluation [GitHub ~50 stars]

The-NLP-Learning

Books

Courses

Tutorials

The-NLP-Communities

Other-NLP-Topics

General

Tokenization

  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub ~3k stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub ~4k stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub ~100 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks
  • WildNLP - Text manipulation library for testing NLP models [GitHub ~100 stars]
  • snorkel - Framework for generating training data [GitHub ~4k stars]
  • NLPAug - Data augmentation for NLP [GitHub ~1k stars]
Blogs and Tutorials

Keyword Extraction

Text Rank
  • PyTextRank - Python implementation of TextRank as a spaCy pipeline extension [GitHub ~1.3k stars]
  • textrank - TextRank implementation for Python 3 [GitHub ~1k stars]
RAKE - Rapid Automatic Keyword Extraction
  • rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub ~700 stars]
  • yake - Single-document unsupervised keyword extraction [GitHub ~400 stars]
  • RAKE-tutorial - A Python implementation of the Rapid Automatic Keyword Extraction algorithm [GitHub ~700 stars]
Other
  • flashtext - Extract keywords from sentences or replace keywords in sentences [GitHub ~4.4k stars]
  • BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub ~100 stars]
  • NLP Profiler - A simple NLP library for profiling datasets with text columns [GitHub ~100 stars]

NLP Interpretability

Ethics, Bias, and Equality in NLP

License CC0

Attributions

Resources

  • All linked resources belong to original authors

Icons

Fonts