NVIDIA / nv-ingest
- Sunday, January 12, 2025 at 00:00:05
NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems.
NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.
NVIDIA Ingest parallelizes the process of splitting documents into pages, where contents are classified (as tables, charts, images, or text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well-defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and can also optionally manage storing them in the Milvus vector database.
A microservice that:
A service that:
| GPU | Family | Memory | # of GPUs (min.) |
|---|---|---|---|
| H100 | SXM or PCIe | 80GB | 2 |
| A100 | SXM or PCIe | 80GB | 2 |
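As a quick pre-flight check against the table above, the following sketch parses the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and lists the GPUs that have roughly 80 GB of memory. The function name and the 80,000 MiB threshold are illustrative assumptions (80 GB cards typically report slightly more than that in MiB); neither is part of nv-ingest.

```python
def gpus_meeting_minimum(csv_text, min_mem_mib=80_000):
    """Return the names of GPUs whose total memory is at least min_mem_mib.

    csv_text is the text printed by:
      nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
    """
    names = []
    for line in csv_text.strip().splitlines():
        # Each line looks like "NVIDIA H100 80GB HBM3, 81559".
        name, _, mem = line.rpartition(",")
        if name and int(mem.strip()) >= min_mem_mib:
            names.append(name.strip())
    return names
```

If the returned list has fewer than two entries, the host probably does not meet the minimum in the table.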
NVIDIA Driver >= 535, CUDA >= 12.2
To get started using NVIDIA Ingest, you need to do a few things:
Optional:
This example demonstrates how to use the provided docker-compose.yaml to start all needed services with a few commands.
Important
NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
If preferred, you can also start services one by one, or run on Kubernetes via our Helm chart. Also of note are additional environment variables you may wish to configure.
Git clone the repo:
git clone https://github.com/nvidia/nv-ingest
Change directory to the cloned repo:
cd nv-ingest
Generate API keys and authenticate with NGC using the docker login command:
# This is required to access pre-built containers and NIM microservices
$ docker login nvcr.io
Username: $oauthtoken
Password: <Your Key>
Note
During the early access (EA) phase, your API key must be created as a member of nemo-microservice / ea-participants, which you may join by applying for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join. When approved, switch your profile to this org / team; the key you then generate will have access to the resources outlined below.
# Container images must access resources from NGC.
NGC_API_KEY=... # Optional, set this if you are deploying NIMs locally from NGC
NVIDIA_BUILD_API_KEY=... # Optional, set this if you are using build.nvidia.com NIMs
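The following sketch checks a .env-style file for the two keys shown above before the services are started. It is a minimal illustration: the parser handles only simple KEY=value lines with optional trailing # comments, and the function names are hypothetical rather than part of nv-ingest.

```python
def parse_env(text):
    """Parse simple KEY=value lines, ignoring blank lines and # comments.

    Note: a value that itself contains '#' would be truncated by this sketch.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env, candidates=("NGC_API_KEY", "NVIDIA_BUILD_API_KEY")):
    """Return the candidate keys that are absent, empty, or still '...'."""
    return [key for key in candidates if env.get(key, "") in ("", "...")]
```

Because both keys are optional, an entry in the result is a prompt to double-check the file rather than an error.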
Note
As configured by default in docker-compose.yaml, the DePlot NIM is on a dedicated GPU. All other NIMs and the nv-ingest container itself share a second GPU. This is to avoid DePlot and other NIMs competing for VRAM on the same device.
Change the CUDA_VISIBLE_DEVICES pinnings as desired for your system within docker-compose.yaml.
Important
Before running the docker compose command, make sure NVIDIA is set as your default container runtime by running:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
Start all services:
docker compose up
Tip
By default we have configured log levels to be verbose.
You can watch service startup proceed in the many log messages. To disable verbose logging, set NIM_TRITON_LOG_VERBOSE=0 for each NIM in docker-compose.yaml.
If you want to build from source, use docker compose up --build instead. This will build from your repo's code rather than from an already published container.
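Since NIM containers can take 10-15 minutes to load on first startup, it can help to wait for the service port before submitting jobs. The sketch below polls a TCP port (7670 is the REST port used in the client examples in this document) until it accepts connections. The function name and the 900-second default are assumptions for illustration, and an open port only shows that the container is listening, not that every NIM has finished loading its models.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=900.0, interval_s=5.0):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 7670)` returns True once the port is reachable and False if it is still closed after the timeout.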
Once the services have started, nvidia-smi should show processes like the following:
# If it's taking > 1m for `nvidia-smi` to return, it's likely the bus is still busy setting up the models.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1352957 C tritonserver 762MiB |
| 1 N/A N/A 1322081 C /opt/nim/llm/.venv/bin/python3 63916MiB |
| 2 N/A N/A 1355175 C tritonserver 478MiB |
| 2 N/A N/A 1367569 C ...s/python/triton_python_backend_stub 12MiB |
| 3 N/A N/A 1321841 C python 414MiB |
| 3 N/A N/A 1352331 C tritonserver 478MiB |
| 3 N/A N/A 1355929 C ...s/python/triton_python_backend_stub 424MiB |
| 3 N/A N/A 1373202 C tritonserver 414MiB |
+---------------------------------------------------------------------------------------+
Observe the started containers with docker ps:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f2f86615ea5 nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.10 "/opt/conda/bin/tini…" 35 seconds ago Up 33 seconds 0.0.0.0:7670->7670/tcp, :::7670->7670/tcp nv-ingest-nv-ingest-ms-runtime-1
de44122c6ddc otel/opentelemetry-collector-contrib:0.91.0 "/otelcol-contrib --…" 14 hours ago Up 24 seconds 0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, :::8888-8889->8888-8889/tcp, 0.0.0.0:13133->13133/tcp, :::13133->13133/tcp, 55678/tcp, 0.0.0.0:32849->9411/tcp, :::32848->9411/tcp, 0.0.0.0:55680->55679/tcp, :::55680->55679/tcp nv-ingest-otel-collector-1
02c9ab8c6901 nvcr.io/ohlfw0olaadg/ea-participants/cached:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 24 seconds 0.0.0.0:8006->8000/tcp, :::8006->8000/tcp, 0.0.0.0:8007->8001/tcp, :::8007->8001/tcp, 0.0.0.0:8008->8002/tcp, :::8008->8002/tcp nv-ingest-cached-1
d49369334398 nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8012->8000/tcp, :::8012->8000/tcp, 0.0.0.0:8013->8001/tcp, :::8013->8001/tcp, 0.0.0.0:8014->8002/tcp, :::8014->8002/tcp nv-ingest-embedding-1
508715a24998 nvcr.io/ohlfw0olaadg/ea-participants/nv-yolox-structured-images-v1:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp nv-ingest-yolox-1
5b7a174a0a85 nvcr.io/ohlfw0olaadg/ea-participants/deplot:1.0.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8003->8000/tcp, :::8003->8000/tcp, 0.0.0.0:8004->8001/tcp, :::8004->8001/tcp, 0.0.0.0:8005->8002/tcp, :::8005->8002/tcp nv-ingest-deplot-1
430045f98c02 nvcr.io/ohlfw0olaadg/ea-participants/paddleocr:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 24 seconds 0.0.0.0:8009->8000/tcp, :::8009->8000/tcp, 0.0.0.0:8010->8001/tcp, :::8010->8001/tcp, 0.0.0.0:8011->8002/tcp, :::8011->8002/tcp nv-ingest-paddle-1
8e587b45821b grafana/grafana "/run.sh" 14 hours ago Up 33 seconds 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp grafana-service
aa2c0ec387e2 redis/redis-stack "/entrypoint.sh" 14 hours ago Up 33 seconds 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 8001/tcp nv-ingest-redis-1
bda9a2a9c8b5 openzipkin/zipkin "start-zipkin" 14 hours ago Up 33 seconds (healthy) 9410/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp nv-ingest-zipkin-1
ac27e5297d57 prom/prometheus:latest "/bin/prometheus --w…" 14 hours ago Up 33 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp nv-ingest-prometheus-1
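To check a listing like the one above without reading it by eye, the following sketch compares the output of `docker ps --format '{{.Names}}'` against a set of expected container names. The names are taken from the listing above and depend on the compose project name, so treat the set (and the helper name) as assumptions to adjust for your setup.

```python
EXPECTED = {
    "nv-ingest-nv-ingest-ms-runtime-1",
    "nv-ingest-yolox-1",
    "nv-ingest-deplot-1",
    "nv-ingest-cached-1",
    "nv-ingest-paddle-1",
    "nv-ingest-embedding-1",
    "nv-ingest-redis-1",
}


def missing_containers(names_output, expected=EXPECTED):
    """Return the expected container names that are absent from the output.

    names_output is the text printed by: docker ps --format '{{.Names}}'
    """
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return sorted(expected - running)
```

An empty result means every expected container is running.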
Tip
nv-ingest is in Early Access mode, meaning the codebase gets frequent updates. To build an updated nv-ingest service container with the latest changes you can:
docker compose build
After the image is built, run docker compose up as described above.
You can interact with the nv-ingest service from the host, or by using docker exec to open a shell in the nv-ingest container.
To interact from the host, you'll need a Python environment with the client dependencies installed:
# conda not required, but makes it easy to create a fresh python environment
conda env create --name nv-ingest-dev --file ./conda/environments/nv_ingest_environment.yml
conda activate nv-ingest-dev
cd client
pip install .
# When not using Conda, pip dependencies for the client can be installed directly via pip. Pip based installation of
# the ingest service is not supported.
cd client
pip install -r requirements.txt
pip install .
Note
Interacting from the host depends on the appropriate port being exposed from the nv-ingest container to the host as defined in docker-compose.yaml.
If you prefer, you can disable exposing that port, and interact with the nv-ingest service directly from within its container.
To interact within the container:
docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash
You'll be in the /workspace directory, which has DATASET_ROOT from the .env file mounted at ./data. The pre-activated morpheus conda environment has all the Python client libraries pre-installed:
(morpheus) root@aba77e2a4bde:/workspace#
From the bash prompt above, you can run nv-ingest-cli and Python examples described below.
You can submit jobs programmatically in Python or via the nv-ingest-cli tool.
In the below examples, we are doing text, chart, table, and image extraction:
- extract_text - uses PDFium to find and extract text from pages
- extract_images - uses PDFium to extract images
- extract_tables - uses YOLOX to find tables and charts. Uses PaddleOCR for table extraction, and Deplot and CACHED for chart extraction
- extract_charts - (optional) enables or disables the use of Deplot and CACHED for chart extraction.
Important
extract_tables controls extraction for both tables and charts. You can optionally disable chart extraction by setting extract_charts to false.
import logging
import time

from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.util.file_processing.extract import extract_file_content

logger = logging.getLogger("nv_ingest_client")

file_name = "data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)

# A JobSpec is an object that defines a document and how it should
# be processed by the nv-ingest service.
job_spec = JobSpec(
    document_type=file_type,
    payload=file_content,
    source_id=file_name,
    source_name=file_name,
    extended_options={
        "tracing_options": {
            "trace": True,
            "ts_send": time.time_ns(),
        }
    },
)

# Configure desired extraction modes here. Multiple extraction
# methods can be defined for a single JobSpec.
extract_task = ExtractTask(
    document_type=file_type,
    extract_text=True,
    extract_images=True,
    extract_tables=True,
)
job_spec.add_task(extract_task)

# Create the client and inform it about the JobSpec we want to process.
client = NvIngestClient(
    message_client_hostname="localhost",  # Host where nv-ingest-ms-runtime is running
    message_client_port=7670,  # REST port, defaults to 7670
)
job_id = client.add_job(job_spec)
client.submit_job(job_id, "morpheus_task_queue")
result = client.fetch_job_result(job_id, timeout=60)
print(f"Got {len(result)} results")
Using nv-ingest-cli (you can find more nv-ingest-cli examples here):
nv-ingest-cli \
  --doc ./data/multimodal_test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true"}' \
  --client_host=localhost \
  --client_port=7670
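The --task value in the command above is a task name followed by a JSON object, which is easy to mis-quote by hand. The sketch below builds the same argument list in Python, serializing the options with json.dumps. The helper name build_task_arg is hypothetical; the flags and option keys are copied from the command above.

```python
import json
import shlex


def build_task_arg(task_name, options):
    """Return a --task=... argument with the options serialized as JSON."""
    return f"--task={task_name}:{json.dumps(options)}"


options = {
    "document_type": "pdf",
    "extract_method": "pdfium",
    "extract_tables": "true",
    "extract_images": "true",
}
argv = [
    "nv-ingest-cli",
    "--doc", "./data/multimodal_test.pdf",
    "--output_directory", "./processed_docs",
    build_task_arg("extract", options),
    "--client_host=localhost",
    "--client_port=7670",
]
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The same argv list could also be passed to subprocess.run, which avoids shell quoting altogether.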
You should notice output indicating document processing status, followed by a breakdown of time spent during job execution:
INFO:nv_ingest_client.nv_ingest_cli:Processing 1 documents.
INFO:nv_ingest_client.nv_ingest_cli:Output will be written to: ./processed_docs
Processing files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.47s/file, pages_per_sec=0.29]
INFO:nv_ingest_client.cli.util.processing:dedup_images: Avg: 1.02 ms, Median: 1.02 ms, Total Time: 1.02 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:dedup_images_channel_in: Avg: 1.44 ms, Median: 1.44 ms, Total Time: 1.44 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor: Avg: 0.66 ms, Median: 0.66 ms, Total Time: 0.66 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor_channel_in: Avg: 1.09 ms, Median: 1.09 ms, Total Time: 1.09 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images: Avg: 0.84 ms, Median: 0.84 ms, Total Time: 0.84 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images_channel_in: Avg: 7.75 ms, Median: 7.75 ms, Total Time: 7.75 ms, Total % of Trace Computation: 0.07%
INFO:nv_ingest_client.cli.util.processing:job_counter: Avg: 2.13 ms, Median: 2.13 ms, Total Time: 2.13 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:job_counter_channel_in: Avg: 2.05 ms, Median: 2.05 ms, Total Time: 2.05 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:metadata_injection: Avg: 14.48 ms, Median: 14.48 ms, Total Time: 14.48 ms, Total % of Trace Computation: 0.14%
INFO:nv_ingest_client.cli.util.processing:metadata_injection_channel_in: Avg: 0.22 ms, Median: 0.22 ms, Total Time: 0.22 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor: Avg: 10332.97 ms, Median: 10332.97 ms, Total Time: 10332.97 ms, Total % of Trace Computation: 99.45%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor_channel_in: Avg: 0.44 ms, Median: 0.44 ms, Total Time: 0.44 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor: Avg: 1.19 ms, Median: 1.19 ms, Total Time: 1.19 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor_channel_in: Avg: 0.98 ms, Median: 0.98 ms, Total Time: 0.98 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:redis_source_network_in: Avg: 12.27 ms, Median: 12.27 ms, Total Time: 12.27 ms, Total % of Trace Computation: 0.12%
INFO:nv_ingest_client.cli.util.processing:redis_task_sink_channel_in: Avg: 2.16 ms, Median: 2.16 ms, Total Time: 2.16 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:redis_task_source: Avg: 8.00 ms, Median: 8.00 ms, Total Time: 8.00 ms, Total % of Trace Computation: 0.08%
INFO:nv_ingest_client.cli.util.processing:Unresolved time: 82.82 ms, Percent of Total Elapsed: 0.79%
INFO:nv_ingest_client.cli.util.processing:Processed 1 files in 10.47 seconds.
INFO:nv_ingest_client.cli.util.processing:Total pages processed: 3
INFO:nv_ingest_client.cli.util.processing:Throughput (Pages/sec): 0.29
INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.10
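To find the dominant stage in a trace breakdown like the one above, the per-stage lines can be parsed with a regular expression. This is a sketch that assumes the log format shown here stays as printed; the pattern and function names are illustrative and not part of the nv-ingest client.

```python
import re

# Matches lines such as:
# INFO:...processing:pdf_content_extractor: Avg: 1.0 ms, Median: 1.0 ms,
#   Total Time: 1.0 ms, Total % of Trace Computation: 0.01%
TRACE_LINE = re.compile(
    r":(?P<stage>\w+): Avg: [\d.]+ ms, Median: [\d.]+ ms, "
    r"Total Time: (?P<total_ms>[\d.]+) ms, "
    r"Total % of Trace Computation: [\d.]+%"
)


def stage_totals(log_text):
    """Map each pipeline stage name to its total time in milliseconds."""
    totals = {}
    for line in log_text.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            totals[match["stage"]] = float(match["total_ms"])
    return totals


def slowest_stage(log_text):
    """Return (stage, total_ms) for the stage with the largest total time.

    Raises ValueError if the text contains no per-stage lines.
    """
    return max(stage_totals(log_text).items(), key=lambda item: item[1])
```

Applied to the output above, this singles out pdf_content_extractor, which accounts for 99.45% of the trace computation.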
After the ingestion steps above have completed, you should be able to find text, image, and structured subfolders inside your processed docs folder. Each will contain JSON-formatted extracted content and metadata.
ls -R processed_docs/
processed_docs/:
image structured text
processed_docs/image:
multimodal_test.pdf.metadata.json
processed_docs/structured:
multimodal_test.pdf.metadata.json
processed_docs/text:
multimodal_test.pdf.metadata.json
You can view the full JSON extracts and the metadata definitions here.
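For a quick look at what was written, the sketch below walks the output directory and counts the top-level entries in each *.metadata.json file. It assumes each file holds a JSON list of extracted items; that is an assumption about the output layout, so check it against the metadata definitions before relying on it. The function name is hypothetical.

```python
import json
from pathlib import Path


def summarize_outputs(output_dir):
    """Return {relative_path: entry_count} for every *.metadata.json file."""
    summary = {}
    root = Path(output_dir)
    for path in sorted(root.rglob("*.metadata.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: each file is a JSON list; anything else counts as one entry.
        count = len(data) if isinstance(data, list) else 1
        summary[path.relative_to(root).as_posix()] = count
    return summary
```

Calling `summarize_outputs("./processed_docs")` after the CLI run returns one count per file under the text, image, and structured subfolders.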
First, install tkinter by running the command for your OS.
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install python3-tk
# Fedora/RHEL
sudo dnf install python3-tkinter
# macOS (Homebrew)
brew install python-tk
Then run the following command to execute the script for inspecting the extracted image:
python src/util/image_viewer.py --file_path ./processed_docs/image/multimodal_test.pdf.metadata.json
Tip
Beyond inspecting the results, you can load them into retrieval pipelines built with tools such as llama-index or langchain.
Please also check out our demo using a retrieval pipeline on build.nvidia.com to query over document content pre-extracted with NVIDIA Ingest.
Beyond the relevant documentation, examples, and other links above, below is a description of contents in this repo's folders:
If configured to do so, this project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use:
https://pypi.org/project/pdfservices-sdk/
INSTALL_ADOBE_SDK: If set to true, the Adobe SDK will be installed in the container at launch time. This is required if you want to use the Adobe extraction service for PDF decomposition. Please review the license agreement for the pdfservices-sdk before enabling this option.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
Any contribution which contains commits that are not Signed-Off will not be accepted.
To sign off on a commit you simply use the --signoff (or -s) option when committing your changes:
$ git commit -s -m "Add cool feature."
This will append the following to your commit message:
Signed-off-by: Your Name <your@email.com>
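A contributor could check locally that a commit message carries this trailer before pushing. The sketch below is a hypothetical helper, not part of this repository's tooling; it only tests for a line of the form shown above, for example on the text printed by `git log -1 --format=%B`.

```python
import re

# One full line of the form "Signed-off-by: Name <user@host>".
SIGNOFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def is_signed_off(commit_message):
    """Return True if the message contains a Signed-off-by trailer line."""
    return bool(SIGNOFF.search(commit_message))
```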
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.