Process a Book

library-pipeline · the manual walk-through, scan to shelf

02_cleaned.md is the irreplaceable file. Everything after it (M4B audio) can be regenerated. Nothing before it can be.

CODE FOLDER: C:\Projects\library-pipeline (the program)
DATA FOLDER: C:\scans\... or C:\library-pipeline\... (this book's files)


0. Stage your original scans

Skip this if C:\scans\<Title>\ already has your scanned pages in it.

Before anything else, get a copy of your scanned pages into C:\scans\<Title>\. This folder is permanent, untouched storage — the pipeline never writes to it, and you never sort or organize anything here. Just get the raw pages in.

Wherever your scans currently live (scanner output folder, a temp folder, a USB drive — anywhere), copy them — don't move them — so your true originals stay put no matter what happens later.

Replace both paths with your real ones. Example:

DATA FOLDER

xcopy "C:\path\to\your\scanner\output" "C:\scans\The Book Title" /E /I

C:\scans\<Title>\ contains your scanned page images, and your original copies elsewhere are untouched.
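If you'd rather confirm the copy than eyeball it, here is a minimal Python sketch that compares the two folders. The function name compare_trees is made up for illustration and is not part of library-pipeline. It only checks that every file exists in the copy with the same size, which catches an interrupted copy but is not a full checksum.

```python
from pathlib import Path


def compare_trees(source: Path, copy: Path) -> list[str]:
    """Return a list of files that are missing from the copy or differ in size."""
    problems = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src_file.relative_to(source)
        dst_file = copy / rel
        if not dst_file.is_file():
            problems.append(f"missing: {rel}")
        elif dst_file.stat().st_size != src_file.stat().st_size:
            problems.append(f"size differs: {rel}")
    return problems
```

An empty list means every original file has a same-sized twin in the copy. Call it with your real scanner folder first and C:\scans\<Title>\ second.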

1. Set up the working folder (temporary step)

Skip this if C:\library-pipeline\<Title>\book_config.toml already exists.

"Temporary" means: chapter-folder sorting (below) goes away once automatic chapter detection is built. Until then, this is the manual setup every new book needs.

1a. Create the working folder. Use the exact same title you'll use everywhere else in this checklist — capitalization and punctuation included.

DATA FOLDER

mkdir "C:\library-pipeline\The Book Title"

1b. Copy the sample config into it. This file always lives in the same spot for every book — inside that book's own working folder, never in the code repo.

CODE FOLDER

copy "C:\Projects\library-pipeline\examples\sample_book_config.toml" "C:\library-pipeline\The Book Title\book_config.toml"

1c. Open that new book_config.toml (Notepad or VS Code) and edit two lines — everything else can stay as shipped:

title = "The Book Title"
author = "The Author Name"

1d. Set ignore_zones (ask Claude to measure it for you). Almost every book has a repeating header or footer (book title at the top, "Page X of Y" at the bottom, etc.) that you don't want baked into the cleaned text on every page. Rather than measuring pixels by hand, upload 1–2 sample page images to Claude and ask it to find the coordinates:

What to say to Claude: upload 1–2 of the raw page images from C:\scans\<Title>\, then ask: "Here are 1-2 sample pages from this book. Can you measure the pixel coordinates of any repeating header/footer band and give me an ignore_zones entry for book_config.toml?" Claude will measure the actual image (not guess from looking) and hand back a ready-to-paste block like:

ignore_zones = [
    [0, 0, 1920, 90],
    [0, 1130, 1920, 1200],
]

Paste that block into book_config.toml in place of the commented-out example, matching the indentation already there.

No header or footer on this book at all? Leave ignore_zones commented out — nothing to exclude.
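Before pasting a measured block, a quick sanity check can catch swapped or out-of-range numbers. The sketch below assumes each zone is [left, top, right, bottom] in pixels, which is how the sample values read for a 1920 x 1200 page. That ordering is an assumption, not something this checklist confirms, and check_ignore_zones is a hypothetical helper, not a pipeline function.

```python
def check_ignore_zones(zones, page_width, page_height):
    """Return a list of problems found in an ignore_zones list.

    Assumes each zone is [left, top, right, bottom] in pixels (an assumption).
    """
    problems = []
    for i, zone in enumerate(zones):
        if len(zone) != 4:
            problems.append(f"zone {i}: expected 4 numbers, got {len(zone)}")
            continue
        left, top, right, bottom = zone
        if left >= right or top >= bottom:
            problems.append(f"zone {i}: empty or inverted rectangle")
        if left < 0 or top < 0 or right > page_width or bottom > page_height:
            problems.append(f"zone {i}: extends past the page edge")
    return problems
```

For the sample block, check_ignore_zones([[0, 0, 1920, 90], [0, 1130, 1920, 1200]], 1920, 1200) returns an empty list. If your scans are a different size than the page Claude measured, the "extends past the page edge" message is the one to watch for.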

1e. Organize scans into chapter folders. Create 00_source\Chapter_01\, Chapter_02\, etc. (zero-padded) inside the working folder, and copy pages from C:\scans\<Title>\ into the matching chapter folder. Front matter goes in Chapter_00\.

DATA FOLDER

mkdir "C:\library-pipeline\The Book Title\00_source\Chapter_01"
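For a book with many chapters, typing one mkdir per folder gets tedious. A small Python sketch like this creates Chapter_00 through whatever the last chapter is in one go. The helper name make_chapter_folders is invented here; only the 00_source\Chapter_NN layout comes from this checklist.

```python
from pathlib import Path


def make_chapter_folders(working_folder: Path, last_chapter: int) -> list[Path]:
    """Create 00_source/Chapter_00 .. Chapter_NN (zero-padded) and return them."""
    created = []
    for number in range(0, last_chapter + 1):
        folder = working_folder / "00_source" / f"Chapter_{number:02d}"
        # exist_ok=True makes it safe to re-run on a half-prepared book.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Example: make_chapter_folders(Path(r"C:\library-pipeline\The Book Title"), 12) creates Chapter_00 (front matter) through Chapter_12. Existing folders and their contents are left alone.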

Don't want to sort by chapter right now? That's fine. Put every page into a single Chapter_01\ folder instead, in correct reading order. You'll get one unbroken 02_cleaned.md with no chapter breaks, which you can split later. This only works cleanly if your filenames already sort into correct page order (sequential numbers or timestamps).
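The classic way filenames fail to sort into page order is unpadded numbers: page_10 lands before page_2 in a plain alphabetical sort. Assuming the pipeline reads pages in plain filename order (not confirmed here), this sketch flags that case by comparing the alphabetical order against a number-aware order. Both function names are illustrative.

```python
import re


def natural_key(name):
    """Split a name into text and number runs so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


def sorts_in_page_order(filenames):
    """True when plain alphabetical order matches number-aware order."""
    return sorted(filenames) == sorted(filenames, key=natural_key)
```

If this returns False, rename the files with zero-padded numbers (page_002, page_010) before copying them into Chapter_01\.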

book_config.toml exists with the right title/author, and Chapter_01 (at minimum) has page images in it.

2. Turn Umi-OCR on

Open the Umi-OCR app (desktop shortcut, or double-click the .exe directly — see note below if you don't have a shortcut yet). Inside the app, toggle the HTTP service ON. It defaults to off every time you open it.

No desktop shortcut? Open C:\tools\Umi-OCR_Paddle_v2.1.5\ in File Explorer, find Umi-OCR.exe, right-click it → Send to → Desktop (create shortcut). One-time fix.

Sanity check it's actually listening — this works from anywhere, no folder matters:

Invoke-WebRequest -Uri http://127.0.0.1:1224/api/ocr -Method POST -ContentType "application/json" -Body '{"base64":""}' -UseBasicParsing

You get a response back (even a "decode failed" error counts — that means the API is up).
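The same check can be done from Python if that's easier to keep next to the run-scripts. This is a sketch using only the standard library. The function name ocr_service_is_up is invented, and the URL and request body are simply the ones from the PowerShell line. It treats any HTTP answer, including an error status, as "up", and only a refused or timed-out connection as "down".

```python
import json
import urllib.error
import urllib.request


def ocr_service_is_up(url="http://127.0.0.1:1224/api/ocr", timeout=5.0):
    """Return True if anything answers HTTP at the URL, False if nothing is listening."""
    body = json.dumps({"base64": ""}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler keeps this local check from going through a system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is listening.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: the service is not reachable.
        return False
```

A False result almost always means the HTTP service toggle is still off.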

3. Point both run-scripts at this book

Open these two files in a text editor (paths shown are exact, full paths — paste straight into File Explorer's address bar if that's easier than browsing):

C:\Projects\library-pipeline\scripts\run_extract.py

C:\Projects\library-pipeline\scripts\run_clean_assemble.py

In both files, find the line near the top that looks like this, and change the text in quotes to match your working-folder name exactly — same capitalization, same punctuation, no extra words:

BOOK_TITLE = "The Book Title"

This is the single most common source of confusing errors in this whole process. The folder name and BOOK_TITLE in both files must match character-for-character. Re-check all three any time you resume work on a book, even if you set them correctly last time.

BOOK_TITLE is identical in both files, and matches the folder name under C:\library-pipeline\ exactly.
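Since this three-way match is so easy to get wrong, it can be checked mechanically. The sketch below pulls the quoted value out of a BOOK_TITLE = "..." line and compares both scripts against the folder name. It assumes the line looks exactly like the one shown above (double quotes, at the start of a line). read_book_title and titles_match are hypothetical helpers.

```python
import re

# Matches a line such as: BOOK_TITLE = "The Book Title"
TITLE_LINE = re.compile(r'^BOOK_TITLE\s*=\s*"([^"]*)"', re.MULTILINE)


def read_book_title(script_text):
    """Return the quoted BOOK_TITLE value from a script's text, or None if absent."""
    match = TITLE_LINE.search(script_text)
    return match.group(1) if match else None


def titles_match(extract_text, assemble_text, folder_name):
    """True only when both scripts and the folder name agree character-for-character."""
    extract_title = read_book_title(extract_text)
    assemble_title = read_book_title(assemble_text)
    return extract_title is not None and extract_title == assemble_title == folder_name
```

To use it, read each script with Path(...).read_text() and pass the working folder's name as the third argument. The comparison is deliberately strict, so a trailing space or a different capital letter counts as a mismatch.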

4. Extract (OCR)

This command must run from the code folder, the place the actual program lives, not the per-book working folder. Run both lines below in the same PowerShell window, one after the other. The second line uses a relative path (scripts/run_extract.py), so it only works once the cd has put you in the code folder.

CODE FOLDER

cd "C:\Projects\library-pipeline"

CODE FOLDER

uv run python scripts/run_extract.py

This reads every page in this book's 00_source\ (over in the data folder) and writes one OCR text file per page into 01_extracted\, right next to it. You don't need to be "in" that folder for this to work — the script finds it using the BOOK_TITLE you set in Step 3.

It prints "No failures." and "Done: N pages -> <output_dir>".
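For a second opinion on "one OCR text file per page", you can count both sides. This sketch counts image files under 00_source\ and files under 01_extracted\. The list of image extensions and the idea that 01_extracted\ holds nothing but per-page output are assumptions, and count_pages_and_outputs is not a pipeline function.

```python
from pathlib import Path

# Assumed set of scan formats; adjust to whatever your scanner produces.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def count_pages_and_outputs(book_folder: Path) -> tuple[int, int]:
    """Return (page images under 00_source, files under 01_extracted)."""
    source = book_folder / "00_source"
    extracted = book_folder / "01_extracted"
    pages = sum(
        1
        for p in source.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    outputs = 0
    if extracted.is_dir():
        outputs = sum(1 for p in extracted.rglob("*") if p.is_file())
    return pages, outputs
```

If the two numbers differ, compare them with the N in the "Done: N pages" line before assuming anything is broken, since the output folder may hold extra files this sketch doesn't know about.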

5. Clean + assemble

Same code folder as Step 4. If you closed your terminal, run the cd line again first:

CODE FOLDER

cd "C:\Projects\library-pipeline"

CODE FOLDER

uv run python scripts/run_clean_assemble.py

This writes the pivot file, 02_cleaned.md, into this book's data folder, plus a book_metadata.toml with stats.

It prints "Written: <path>" and "Size: <N> bytes".

6. Review the pivot file

Open this exact file in Notepad — it's in the data folder, not the code folder:

C:\library-pipeline\The Book Title\02_cleaned.md

Skim for red-underline spell-check marks — you're scanning, not reading every word. Some misses are fine.

If anything appears before the first heading, add ## Front Matter as the very first line of the file, or the audio step will truncate the opening.
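The front-matter rule is simple enough to express in a few lines. This sketch assumes a "heading" means a Markdown line starting with #, which is what the ## Front Matter example suggests but is not spelled out for the v1 audio script. ensure_front_matter is a hypothetical helper that works on the file's text, not on the file itself.

```python
def ensure_front_matter(text: str) -> str:
    """Prepend '## Front Matter' when real text appears before the first heading."""
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines before the first content don't count
        if line.lstrip().startswith("#"):
            return text  # the file already opens with a heading
        return "## Front Matter\n\n" + text
    return text  # empty or whitespace-only file: nothing to fix
```

Writing the result back to 02_cleaned.md counts as a hand edit, so expect the edit-protection check on the next Step 5 run.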

Still seeing repeated junk text at the top or bottom of pages (page numbers, book title, chapter name, "Page X of Y" repeating)? Either Step 1d's ignore_zones was skipped, or it was set too loosely. Fix loop:

1. Upload 1–2 sample page images to Claude (same as Step 1d) and ask it to re-measure. Share a page where the leak is visible.
2. Paste the corrected ignore_zones block into book_config.toml (same file as Step 1, data folder).
3. Re-run both Step 4 (extract) and Step 5 (clean+assemble), not just Step 5. ignore_zones only takes effect during extract, so skipping straight to Step 5 will reuse the old, unfiltered OCR text.
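To confirm a leak is gone without reading every page, you can look for lines that repeat across many pages of OCR text. The sketch below takes a list of page texts (for example, the contents of the files in 01_extracted\) and reports lines that show up on at least half of them. The function name and the 50% threshold are arbitrary choices for illustration.

```python
from collections import Counter


def repeated_lines(pages, min_fraction=0.5):
    """Return lines that appear on at least min_fraction of the pages (minimum 2)."""
    counts = Counter()
    for page_text in pages:
        # Count each distinct line once per page, ignoring blank lines.
        unique = {line.strip() for line in page_text.splitlines() if line.strip()}
        counts.update(unique)
    threshold = max(2, int(len(pages) * min_fraction))
    return sorted(line for line, seen in counts.items() if seen >= threshold)
```

An exact-match count like this catches a running book title but not "Page 3 of 200" style footers, since the number changes on every page. Those still need a visual spot check.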

Re-running extract on a long book takes just as long as the first time — currently there's no partial reuse, even for a config-only fix. Budget the same wait.

"has been edited since the last clean run" error on Step 5? This can fire even if you never changed the file's text — just opening certain editors can update the file's timestamp. If you're sure you didn't make edits worth keeping, it's safe to add force=True to the assemble_book(...) call in run_clean_assemble.py for this one run. Remove it again right after — leaving it on permanently disables the protection for edits you DO want to keep later.

If you hand-edit this file, re-running Step 5 afterward is blocked unless you pass force=True — that's intentional, to protect your edits. Don't force it unless you mean to throw the edits away.
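Because force=True throws away whatever is in the file, a timestamped backup beforehand is cheap insurance. This is a minimal sketch: backup_pivot_file is an invented name, and whether the pipeline ignores an extra .bak.md file sitting in the book's folder is an assumption you should verify (or just move the backup elsewhere).

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_pivot_file(pivot: Path) -> Path:
    """Copy the pivot file next to itself with a timestamp, and return the copy's path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # 02_cleaned.md becomes 02_cleaned.<stamp>.bak.md
    backup = pivot.with_name(f"{pivot.stem}.{stamp}.bak{pivot.suffix}")
    shutil.copy2(pivot, backup)  # copy2 also preserves the file's timestamps
    return backup
```

Run it on C:\library-pipeline\<Title>\02_cleaned.md right before the forced run. If the forced run turns out to have discarded something you wanted, the backup still has it.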

The prose reads cleanly, and the first line is "## Front Matter" if needed.

7. Render audio

This step leaves the new pipeline entirely. The audio renderer isn't built yet, so audio still comes from the old v1 script, which takes 02_cleaned.md as its input — the same file you just reviewed.

Open item, not yet confirmed: exactly how to hand the file to v1 — as a command-line argument, a copied file, or an edited path inside the script. Confirm this once and it's done for every future book.

Once it runs, the audio gets packaged with ffmpeg. It must re-encode, not stream-copy:

DATA FOLDER

ffmpeg -f concat -safe 0 -i concat_list.txt -c:a aac -b:a 64k output.m4b
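This checklist doesn't say how concat_list.txt gets made. Assuming it is a list in the format ffmpeg's concat demuxer reads (one file '...' line per audio file, in playback order), a sketch for generating its text looks like this. concat_list_text is a hypothetical helper, and the v1 script may already produce the list on its own.

```python
def concat_list_text(audio_files):
    """Build concat-demuxer list text: one file '...' line per input, in order."""
    lines = []
    for path in audio_files:
        # Inside single quotes, a literal quote is written by closing the quote,
        # adding an escaped quote, and reopening: '\''
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

Pass the per-chapter files already sorted into chapter order and write the returned text to concat_list.txt in the folder where you run ffmpeg.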

If a partial .m4b or .mp3 already exists, the resume logic will skip the work — delete the partial file first to force a redo.

A finished .m4b (or per-chapter .mp3 files) exists.

8. Ship to the NAS

Copy the finished audio to the NAS, then trigger an Audiobookshelf rescan.
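Large files copied over a network are worth verifying before you trust them. Here is a minimal sketch that copies one file and compares SHA-256 checksums afterward. The function names are made up, and the destination is whatever path your NAS share is mounted at, which this checklist doesn't specify.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large audio files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(audio_file: Path, destination_folder: Path) -> Path:
    """Copy the file, then raise OSError if the copy's checksum differs."""
    destination_folder.mkdir(parents=True, exist_ok=True)
    target = destination_folder / audio_file.name
    shutil.copy2(audio_file, target)
    if sha256_of(target) != sha256_of(audio_file):
        raise OSError(f"copy of {audio_file.name} does not match the original")
    return target
```

Trigger the Audiobookshelf rescan only after this returns without an error, so the library never indexes a half-copied file.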

The title shows up in Audiobookshelf after the rescan.