funstory-ai / BabelDOC
- Sunday, April 6, 2025 at 00:00:04
Yet Another Document Translator
PDF scientific paper translation and bilingual comparison library.
We recommend using the Tool feature of uv to install yadt.
First, follow the uv installation instructions to install uv and set up the PATH
environment variable as prompted.
Use the following command to install yadt:
uv tool install --python 3.12 BabelDOC
babeldoc --help
Use the babeldoc command. For example:
babeldoc --bing --files example.pdf
# multiple files
babeldoc --bing --files example1.pdf --files example2.pdf
We still recommend using uv to manage virtual environments.
First, follow the uv installation instructions to install uv and set up the PATH
environment variable as prompted.
Use the following command to install yadt:
# clone the project
git clone https://github.com/funstory-ai/BabelDOC
# enter the project directory
cd BabelDOC
# install dependencies and run babeldoc
uv run babeldoc --help
Use the uv run babeldoc command. For example:
uv run babeldoc --files example.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"
# multiple files
uv run babeldoc --files example.pdf --files example2.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"
Tip
The absolute path is recommended.
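When babeldoc is driven from a script, the tip above can be applied by resolving every input to an absolute path before the argument list is built. The following is a minimal sketch, not part of BabelDOC: build_babeldoc_command is a hypothetical helper, and it relies only on the --files flag shown in the examples above.

```python
from pathlib import Path


def build_babeldoc_command(files, extra_args=()):
    """Build an argument list for the babeldoc CLI (hypothetical helper).

    Each input path is made absolute and passed with its own --files flag.
    """
    command = ["babeldoc", *extra_args]
    for name in files:
        # resolve() yields an absolute path even if the file does not exist yet.
        absolute = Path(name).expanduser().resolve()
        command += ["--files", str(absolute)]
    return command


if __name__ == "__main__":
    # Only print the command here; pass the list to subprocess.run to execute it.
    print(build_babeldoc_command(["example1.pdf", "example2.pdf"], ["--bing"]))
```

Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.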
Note
This CLI is mainly for debugging purposes. Although end users can use this CLI to translate files, we do not provide any technical support for this purpose.
End users should use the online service directly: the beta version of Immersive Translate - BabelDOC offers 1000 free pages per month.
End users who need self-deployment should use PDFMathTranslate.
If you find that an option is not listed below, it means that this option is a debugging option for maintainers. Please do not use these options.
--lang-in, -li: Source language code (default: en)
--lang-out, -lo: Target language code (default: zh)
Tip
Currently, this project mainly focuses on English-to-Chinese translation, and other scenarios have not been tested yet.
(2025.3.1 update): Basic English target language support has been added, primarily to minimize line breaks within words ([0-9A-Za-z]+).
HELP WANTED: Collecting word regular expressions for more languages
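To make the role of the word pattern concrete, the sketch below uses [0-9A-Za-z]+ to find the positions in a string where a line could be broken without splitting a word. It is an illustration only: break_positions is a hypothetical function and does not reproduce BabelDOC's typesetting logic.

```python
import re

# Word pattern quoted in the tip above.
WORD = re.compile(r"[0-9A-Za-z]+")


def break_positions(text):
    """Return the indexes at which a break would not fall inside a word."""
    blocked = set()
    for match in WORD.finditer(text):
        # Every index strictly inside a matched word is off limits.
        blocked.update(range(match.start() + 1, match.end()))
    return [i for i in range(1, len(text)) if i not in blocked]


if __name__ == "__main__":
    print(break_positions("ab cd"))
```

A pattern contributed for another language would replace WORD; the rest of the sketch stays the same.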
--files: One or more file paths to input PDF documents.
--pages, -p: Specify pages to translate (e.g., "1,2,1-,-3,3-5"). If not set, translate all pages.
--split-short-lines: Force split short lines into different paragraphs (may cause poor typesetting & bugs)
--short-line-split-factor: Split threshold factor (default: 0.8). The actual threshold is the median length of all lines on the current page * this factor
--skip-clean: Skip PDF cleaning step
--dual-translate-first: Put translated pages first in dual PDF mode (default: original pages first)
--disable-rich-text-translate: Disable rich text translation (may help improve compatibility with some PDFs)
--enhance-compatibility: Enable all compatibility enhancement options (equivalent to --skip-clean --dual-translate-first --disable-rich-text-translate)
--use-alternating-pages-dual: Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order. When disabled (default), original and translated pages are shown side by side on the same page.
--watermark-output-mode: Control watermark output mode: 'watermarked' (default) adds watermark to translated PDF, 'no_watermark' doesn't add watermark, 'both' outputs both versions.
--max-pages-per-part: Maximum number of pages per part for split translation. If not set, no splitting will be performed.
--no-watermark: [DEPRECATED] Use --watermark-output-mode=no_watermark instead.
--translate-table-text: Translate table text (experimental, default: False)
--skip-scanned-detection: Skip scanned document detection (default: False). When using split translation, only the first part performs detection if not skipped.
Tip
--skip-clean and --dual-translate-first may help improve compatibility with some PDF readers.
--disable-rich-text-translate can also help with compatibility by simplifying translation input.
--skip-clean will result in larger file sizes.
If compatibility problems occur, try --enhance-compatibility first.
Use --max-pages-per-part for large documents to split them into smaller parts for translation and automatically merge them back.
Use --skip-scanned-detection to speed up processing when you know your document is not a scanned PDF.
--qps: QPS (Queries Per Second) limit for translation service (default: 4)
--ignore-cache: Ignore translation cache and force retranslation
--no-dual: Do not output bilingual PDF files
--no-mono: Do not output monolingual PDF files
--min-text-length: Minimum text length to translate (default: 5)
--openai: Use OpenAI for translation (default: False)
Tip
Models served through an OpenAI-compatible API can be used, such as glm-4-flash, deepseek-chat, etc.
--openai-model: OpenAI model to use (default: gpt-4o-mini)
--openai-base-url: Base URL for OpenAI API
--openai-api-key: API key for OpenAI service
Tip
A custom OpenAI-compatible endpoint can be set as the base URL (e.g. https://xxx.custom.xxx/v1).
Where the endpoint does not check the key, a placeholder value can be passed (e.g. --openai-api-key a).
--output, -o: Output directory for translated files. If not set, use current working directory.
--debug, -d: Enable debug logging level and export detailed intermediate results in ~/.cache/yadt/working.
--report-interval: Progress report interval in seconds (default: 0.1).
--generate-offline-assets: Generate an offline assets package in the specified directory. This creates a zip file containing all required models and fonts.
--restore-offline-assets: Restore an offline assets package from the specified file. This extracts models and fonts from a previously generated package.
Tip
Generate a package with babeldoc --generate-offline-assets /path/to/output/dir and then distribute it.
Restore the package with babeldoc --restore-offline-assets /path/to/offline_assets_*.zip.
If a directory path is passed to --restore-offline-assets, the tool will automatically look for the correct offline assets package file in that directory.
--config, -c: Configuration file path. Use the TOML format.
Example Configuration:
[babeldoc]
# Basic settings
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 10
output = "/path/to/output/dir"
# PDF processing options
split-short-lines = false
short-line-split-factor = 0.8
skip-clean = false
dual-translate-first = false
disable-rich-text-translate = false
use-alternating-pages-dual = false
watermark-output-mode = "watermarked" # Choices: "watermarked", "no_watermark", "both"
max-pages-per-part = 50 # Automatically split the document for translation and merge it back.
# no-watermark = false # DEPRECATED: Use watermark-output-mode instead
skip-scanned-detection = false # Skip scanned document detection for faster processing
# Translation service
openai = true
openai-model = "gpt-4o-mini"
openai-base-url = "https://api.openai.com/v1"
openai-api-key = "your-api-key-here"
# Output control
no-dual = false
no-mono = false
min-text-length = 5
report-interval = 0.5
# Offline assets management
# Uncomment one of these options as needed:
# generate-offline-assets = "/path/to/output/dir"
# restore-offline-assets = "/path/to/offline_assets_package.zip"
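Two of the options documented above describe small algorithms: the --pages syntax and the threshold derived from --short-line-split-factor. The sketch below shows one way to read them. It is not BabelDOC's implementation; parse_pages and short_line_threshold are hypothetical names, page ranges are assumed to be 1-based and inclusive, and line lengths are assumed to be plain numbers in any consistent unit.

```python
from statistics import median


def parse_pages(spec, total_pages):
    """Expand a page spec such as "1,2,1-,-3,3-5" into sorted page numbers.

    Assumes 1-based, inclusive ranges: "N-" runs to the last page and
    "-N" starts from the first page.
    """
    if not spec:
        return list(range(1, total_pages + 1))  # option not set: all pages
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Clamp to the document so out-of-range requests are ignored.
        pages.update(range(max(start, 1), min(end, total_pages) + 1))
    return sorted(pages)


def short_line_threshold(line_lengths, factor=0.8):
    """Median line length on a page multiplied by the split factor."""
    return median(line_lengths) * factor


if __name__ == "__main__":
    print(parse_pages("2,4-5", 10))
    print(short_line_threshold([100, 80, 120, 40]))
```

Under these assumptions, a line shorter than the returned threshold would count as a short line when --split-short-lines is enabled.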
Tip
Before pdf2zh 2.0 is released, you can temporarily use BabelDOC's Python API. However, after pdf2zh 2.0 is released, please directly use pdf2zh's Python API.
This project's Python API does not guarantee any compatibility. However, the Python API from pdf2zh will guarantee a certain level of compatibility.
You can refer to the example in main.py to use BabelDOC's Python API.
Please note:
Make sure to call babeldoc.high_level.init() before using the API.
The current TranslationConfig does not fully validate input parameters, so you need to ensure the validity of input parameters yourself.
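Because validation is left to the caller, a small pre-check before building the configuration can catch obvious mistakes. The sketch below is an assumption about what such a check might cover. check_translation_inputs is a hypothetical helper, the rules it applies are illustrative, and it deliberately does not call TranslationConfig, whose parameters are not documented here.

```python
from pathlib import Path


def check_translation_inputs(input_file, lang_in, lang_out, qps):
    """Return a list of problems found in the inputs (hypothetical pre-check)."""
    problems = []
    path = Path(input_file)
    if path.suffix.lower() != ".pdf":
        problems.append(f"not a PDF file: {path}")
    elif not path.is_file():
        problems.append(f"file not found: {path}")
    if not lang_in or not lang_out:
        problems.append("language codes must be non-empty")
    if not isinstance(qps, int) or qps <= 0:
        problems.append("qps must be a positive integer")
    return problems


if __name__ == "__main__":
    for problem in check_translation_inputs("notes.txt", "en", "zh", 4):
        print(problem)
```

An empty list means the inputs passed these basic checks; it does not guarantee that the translation itself will succeed.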
For offline assets management, you can use the following functions:
# Generate an offline assets package
from pathlib import Path
import babeldoc.assets.assets
# Generate package to a specific directory
# path is optional, default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.generate_offline_assets_package(Path("/path/to/output/dir"))
# Restore from a package file
# path is optional, default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/offline_assets_package.zip"))
# You can also restore from a directory containing the offline assets package
# The tool will automatically find the correct package file based on the hash
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/directory"))
There are a lot of projects and teams working to make document editing and translation easier, such as:
There are also solutions that address specific parts of the problem, such as:
This project hopes to promote a standard pipeline and interface to solve the problem.
In fact, a PDF parser or translator has two main stages: parsing and rendering.
A service like mathpix parses the PDF into a structure, possibly in an XML format, and then renders it in a single-column reading order, as layoutreader does. The bad news is that the original structure is lost.
Some people use Adobe PDF Parser because it generates a Word document and keeps the original structure, but it is somewhat expensive. Moreover, a PDF or Word document is not a good format for reading on mobile devices.
We offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. The pipeline is also a plugin-based system to which anyone can add a new model, OCR, renderer, etc.
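The idea of an intermediate representation that passes through pluggable stages can be sketched as follows. This is a toy illustration under stated assumptions, not BabelDOC's actual data model or plugin interface: Paragraph, Document, Stage, run_pipeline, fake_translate and render_text are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Paragraph:
    """One block of text in the hypothetical intermediate representation."""
    page: int
    text: str
    translated: Optional[str] = None


@dataclass
class Document:
    """Parser output: an ordered list of paragraphs."""
    paragraphs: List[Paragraph] = field(default_factory=list)


# A stage (plugin) is any callable that takes a Document and returns a Document.
Stage = Callable[[Document], Document]


def run_pipeline(document: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order to the intermediate representation."""
    for stage in stages:
        document = stage(document)
    return document


def fake_translate(document: Document) -> Document:
    """Stand-in translator plugin: upper-cases text instead of translating."""
    for paragraph in document.paragraphs:
        paragraph.translated = paragraph.text.upper()
    return document


def render_text(document: Document) -> str:
    """Stand-in renderer: emits plain text, one line per paragraph."""
    return "\n".join(
        f"[p{p.page}] {p.translated or p.text}" for p in document.paragraphs
    )


if __name__ == "__main__":
    doc = Document([Paragraph(1, "hello"), Paragraph(2, "world")])
    print(render_text(run_pipeline(doc, [fake_translate])))
```

Keeping the renderer separate from the stages is what would let the same parsed result be written out as a PDF or as another format.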
Our goal for version 1.0 is to complete a translation of the PDF Reference, Version 1.7 into the following languages:
And meet the following requirements:
We encourage you to contribute to YADT! Please check out the CONTRIBUTING guide.
Everyone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT Code of Conduct.
Immersive Translation sponsors monthly Pro membership redemption codes for active contributors to this project, see details at: CONTRIBUTOR_REWARD.md