opendatalab / MinerU
- Sunday, July 28, 2024 at 00:00:03
A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, web pages, and multi-format e-books.
MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.🚀🚀🚀
PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction🔥🔥🔥
MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:
Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
Key features include:
Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable. For example:
conda create -n MinerU python=3.10
conda activate MinerU
Install the full-feature package with pip:
Note: The pip-installed package supports CPU-only inference and is ideal for quick tests.
For CUDA/MPS acceleration in production, see Acceleration Using CUDA or MPS.
pip install magic-pdf[full-cpu]
The full-feature package depends on detectron2, which must be compiled during installation.
If you need to compile it yourself, please refer to facebookresearch/detectron2#5114
Alternatively, you can install our precompiled wheel directly (limited to Python 3.10):
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
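One way to confirm that both installs succeeded is to ask the interpreter whether it can locate the packages. This is a minimal sketch that assumes the import names are `magic_pdf` and `detectron2`; the helper `check_modules` is hypothetical and not part of either package.

```python
import importlib.util


def check_modules(names):
    """Return a mapping of top-level module name -> whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for the two packages installed above.
    for name, found in check_modules(["magic_pdf", "detectron2"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Because `find_spec` only locates a module without importing it, the check is fast and does not load any model code.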
For details on obtaining the model weights, see how_to_download_models.
After downloading the model weights, move the 'models' directory to a disk with plenty of free space, preferably an SSD.
The magic-pdf.template.json file is located in the repository root directory. Copy it to your home directory:
cp magic-pdf.template.json ~/magic-pdf.json
In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.
{
"models-dir": "/tmp/models"
}
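If you prefer to script this step rather than edit the file by hand, the setting can be changed with a few lines of Python. The sketch below is only an illustration: `set_config_value` is a hypothetical helper, and it assumes the configuration file is plain JSON with top-level keys, as in the fragment above.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, value):
    """Set one top-level key in a JSON config file, keeping the other keys."""
    path = Path(config_path).expanduser()
    # Start from the existing file if there is one, otherwise from an empty config.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config[key] = value
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `set_config_value("~/magic-pdf.json", "models-dir", "/tmp/models")` reproduces the setting shown above while leaving any other keys in the file untouched.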
If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
Install the PyTorch build that matches your CUDA version.
This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
{
"device-mode":"cuda"
}
For macOS users with M-series chip devices, you can use MPS for inference acceleration.
You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
{
"device-mode":"mps"
}
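To decide which "device-mode" value fits a given machine, you can ask PyTorch what it detects. This is a hedged sketch: `pick_device_mode` is a hypothetical helper, and the fallback value "cpu" is an assumption about what the configuration accepts when no accelerator is present.

```python
def pick_device_mode():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed; fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is only present in newer PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"  # assumed value for CPU-only inference


print(pick_device_mode())
```

The printed value can then be written to the "device-mode" key in magic-pdf.json.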
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
You can find the corresponding xxx_model.json file in the markdown directory.
If you intend to do secondary development on the post-processing pipeline, you can use the command:
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
This way, you do not need to re-run model inference, which makes debugging more convenient.
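The two invocations above can also be assembled from Python, for instance in a debugging script. The sketch below only builds the argument list from the flags shown in this section; `find_model_json` and `build_command` are hypothetical helpers, and the recursive search assumes that model files end in `_model.json` somewhere under the output directory.

```python
from pathlib import Path


def find_model_json(output_dir="/tmp/magic-pdf"):
    """Return every *_model.json file below output_dir, sorted by path."""
    return sorted(Path(output_dir).rglob("*_model.json"))


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf argument list for a first run or a re-run."""
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        # First run: let the tool run its built-in models.
        cmd += ["--inside_model", "true"]
    else:
        # Re-run: reuse the saved model output instead of running inference.
        cmd += ["--model", str(model_json_path)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.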
magic-pdf --help
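# Local file example.
# pdf_bytes (raw bytes of the PDF) and local_image_dir (where extracted
# images are written) must be supplied by the caller.
import os
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
# An empty model_list means the built-in models are used for parsing.
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")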
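# Object storage (S3) example.
# The access keys, secret keys, endpoints, and s3_pdf_path are placeholders
# that must be supplied by the caller.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
# An empty model_list means the built-in models are used for parsing.
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")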
For a complete example, see demo.py.
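The snippets above leave two steps to the caller: loading `pdf_bytes` and persisting `md_content`. A minimal sketch of both follows. The helper names are hypothetical, and the code assumes `md_content` is a string; the list branch is only a defensive guess in case a version returns the Markdown in parts.

```python
from pathlib import Path


def read_pdf_bytes(pdf_path):
    """Read a local PDF into memory, ready to pass as pdf_bytes."""
    return Path(pdf_path).read_bytes()


def save_markdown(md_content, out_path):
    """Write Markdown text to out_path as UTF-8 and return the path."""
    # Assumption: md_content is a str. If it is a list of strings,
    # join the parts with newlines before writing.
    if isinstance(md_content, (list, tuple)):
        md_content = "\n".join(md_content)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(md_content, encoding="utf-8")
    return out
```

Writing with an explicit UTF-8 encoding keeps non-ASCII text in the extracted Markdown intact regardless of the platform default.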
Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
Key Features Include:
Web Page Extraction
E-Book Document Extraction
Language Type Identification
The project currently relies on PyMuPDF for advanced functionality; however, PyMuPDF is licensed under the AGPL, which may restrict certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.
@misc{2024mineru,
title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
author={MinerU Contributors},
howpublished = {\url{https://github.com/opendatalab/MinerU}},
year={2024}
}