deepset-ai / haystack
- Sunday, August 2, 2020 at 00:24:31
Python
🔍 Transformers at scale for question answering & search
The performance of modern Question Answering models (BERT, ALBERT, ...) has improved drastically within the last year, enabling many new opportunities for accessing information more efficiently. However, those models are designed to find answers within rather small text passages. Haystack lets you scale QA models to large collections of documents! While QA is the primary use case for Haystack, we will address further options around neural search in the future (re-ranking, most-similar search, ...).
Haystack is designed in a modular way and lets you use any models trained with FARM or Transformers.
PyPI:
pip install farm-haystack
Master branch (if you wanna try the latest features):
git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .
To update your installation, just do a git pull. The --editable flag ensures that changes take effect immediately.
Haystack offers different options for storing your documents for search. We recommend Elasticsearch, but also offer lightweight options for fast prototyping, and will soon add DocumentStores optimized for embeddings (FAISS & Co).
haystack.database.elasticsearch.ElasticsearchDocumentStore
You can get started by running a single Elasticsearch node using docker:
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2
Or if docker is not possible for you:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.6.2
elasticsearch-7.6.2/bin/elasticsearch
See Tutorial 1 on how to go on with indexing your docs.
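As a quick preview of indexing, documents are written to the DocumentStore as plain dicts with a "text" field and optional "meta" fields that can later be used for filtering. A minimal sketch (the `ElasticsearchDocumentStore` constructor arguments and the dict convention follow the tutorials, but may differ between versions; the example texts are hypothetical):

```python
# Documents are plain dicts: "text" holds the passage, "meta" holds
# arbitrary fields that can later be used for filtering at query time.
docs = [
    {"text": "Revenue increased by 20% in 2019, driven by cloud sales.",
     "meta": {"name": "annual_report.txt", "years": "2019"}},
    {"text": "Arya Stark is the daughter of Eddard Stark.",
     "meta": {"name": "got.txt"}},
]

# With a running Elasticsearch node (see above), indexing would look like:
# from haystack.database.elasticsearch import ElasticsearchDocumentStore
# document_store = ElasticsearchDocumentStore(host="localhost", index="document")
# document_store.write_documents(docs)
```

Tutorial 1 walks through this end to end, including fetching and cleaning example texts.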
haystack.database.sql.SQLDocumentStore
& haystack.database.memory.InMemoryDocumentStore
These DocumentStores are mainly intended to simplify the first development steps or to test a prototype on an existing SQL database containing your texts. The SQLDocumentStore initializes a local file-based SQLite database by default. However, you can easily configure it for PostgreSQL or MySQL, since our implementation is based on SQLAlchemy. Limitation: retrieval (e.g. via the TfidfRetriever) happens in-memory here and will therefore only work efficiently on small datasets.
Using dense embeddings (i.e. vector representations) of texts is a powerful alternative for scoring text similarity. This retriever uses two BERT models: one to embed your query, one to embed your passage. It's based on the work of Karpukhin et al. and is an especially powerful alternative if there's no direct token overlap between your queries and your texts.
Example
retriever = DensePassageRetriever(document_store=document_store,
embedding_model="dpr-bert-base-nq",
do_lower_case=True, use_gpu=True)
retriever.retrieve(query="Why did the revenue increase?")
# returns: [Document, Document]
Scoring text similarity via sparse bag-of-words representations is a strong and well-established baseline in Information Retrieval.
The default ElasticsearchRetriever
uses Elasticsearch's native scoring (BM25), but can be extended easily with custom queries or filtering.
Example
retriever = ElasticsearchRetriever(document_store=document_store, custom_query=None)
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
# returns: [Document, Document]
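For intuition, the core of BM25 scoring can be sketched in a few lines of pure Python (a deliberately simplified, self-contained version; Elasticsearch's implementation adds further normalization and tuning options):

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized doc against a tokenized query (simplified BM25)."""
    avgdl = sum(len(d) for d in docs) / len(docs)      # average doc length
    score = 0.0
    for term in query:
        n_t = sum(1 for d in docs if term in d)        # docs containing term
        idf = math.log((len(docs) - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = doc.count(term)                           # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["revenue", "increased", "in", "2019"],
          ["costs", "decreased", "in", "2018"]]
scores = [bm25_score(["revenue", "increased"], d, corpus) for d in corpus]
# the first document matches both query terms and scores higher
```

The `k1` and `b` defaults here mirror common BM25 settings: `k1` controls term-frequency saturation, `b` controls document-length normalization.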
This retriever uses a single model to embed your query and passage (e.g. Sentence-BERT) and finds similar texts by using cosine similarity. This works well if your query and passage are a similar type of text, e.g. you want to find the most similar question in your FAQ given a user question.
Example
retriever = EmbeddingRetriever(document_store=document_store,
embedding_model="deepset/sentence_bert",
model_format="farm")
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
# returns: [Document, Document]
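The similarity computation behind this retriever is plain cosine similarity between embedding vectors. A minimal sketch with toy three-dimensional vectors (real embeddings come from the model above and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_emb = [0.9, 0.1, 0.2]            # toy query embedding
doc_embs = [[0.8, 0.2, 0.1],           # embedding of a similar passage
            [0.1, 0.9, 0.3]]           # embedding of a dissimilar passage

# Rank passages by similarity to the query, most similar first
ranked = sorted(range(len(doc_embs)),
                key=lambda i: cosine_similarity(query_emb, doc_embs[i]),
                reverse=True)
```

In the FAQ use case, the "passages" would be the stored FAQ questions, and the top-ranked one determines which canned answer to return.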
A basic in-memory retriever that fetches texts from the DocumentStore, creates TF-IDF representations in memory, and lets you query them. A simple baseline for quick prototypes; not recommended for production.
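The underlying TF-IDF weighting can be sketched in pure Python (a simplified illustration; the actual TfidfRetriever uses its own vectorization and normalization details):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weight dicts for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)
        # term frequency (normalized by doc length) times inverse document frequency
        vectors.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return vectors

def score(query_tokens, vec):
    """Sum the TF-IDF weights of the query terms present in the document."""
    return sum(vec.get(t, 0.0) for t in query_tokens)

docs = [["revenue", "grew", "strongly"], ["headcount", "grew", "slowly"]]
vectors = tfidf_vectors(docs)
scores = [score(["revenue"], v) for v in vectors]
# only the first document contains "revenue", so it scores higher
```

Note that a term appearing in every document (like "grew" here) gets an IDF of zero, which is exactly why TF-IDF down-weights uninformative terms.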
Neural networks (i.e. mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. Both Readers can load either a local model or any public model from Hugging Face's model hub.
Implementing various QA models via the FARM Framework.
Example
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
use_gpu=False, no_ans_boost=-10, context_window_size=500,
top_k_per_candidate=3, top_k_per_sample=1,
num_processes=8, max_seq_len=256, doc_stride=128)
# Optional: Training & eval
reader.train(...)
reader.eval(...)
# Predict
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)
This Reader comes with:
Implementing various QA models via the pipeline class of the Transformers framework.
Example
reader = TransformersReader(model="distilbert-base-uncased-distilled-squad",
tokenizer="distilbert-base-uncased",
context_window_size=500,
use_gpu=-1)
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)
A simple REST API based on FastAPI is provided to:
To serve the API, adjust the values in rest_api/config.py
and run:
gunicorn rest_api.application:app -b 0.0.0.0:80 -k uvicorn.workers.UvicornWorker
You will find the Swagger API documentation at http://127.0.0.1:80/docs
Haystack has basic converters to extract text from PDF and Docx files. While it's almost impossible to cover all types, layouts and special cases in PDFs, the implementation covers the most common formats and provides basic cleaning functions to remove headers, footers, and tables. Multi-column text layouts are also supported. The converters are easily extendable, so you can customize them for your files if needed.
Example:
# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page
# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)