Question 1

How accurate is browser-side OCR?

Accepted Answer

Tesseract.js is the open-source Tesseract engine compiled to WebAssembly. For clean scans of printed text at 300 DPI it hits 95%+ accuracy on Latin scripts and 90%+ on Korean (한글). Handwriting, low-resolution scans, or skewed pages can drop accuracy noticeably — straighten and rescan at 300 DPI for best results. Tables, multi-column layouts, and pages with mixed fonts (titles, body, footnotes) sometimes produce slightly misordered output because Tesseract reads top-to-bottom by default.

Question 2

Why is OCR slow (~10 seconds per page)?

Accepted Answer

Real OCR is computationally heavy: Tesseract analyzes every character shape, runs neural network classifiers, and post-processes with a language model. We render each page at 2× zoom for accuracy, which quadruples the pixels Tesseract must read (twice the width and twice the height). The first page is the slowest because the language data (~10 MB for English, ~15 MB for Korean) downloads on first run; it is then cached for the rest of the session. On a modern laptop expect 5-8 seconds per page after warm-up; mobile devices are 2-3× slower.
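To see how render scale affects the workload, it helps to count pixels: width and height both scale, so the pixel count grows with the square of the zoom factor. The sketch below is illustrative only; the helper name and the US Letter page size (612 × 792 PDF points) are assumptions, not values taken from the tool.

```typescript
// Pixel count of a page rendered at a given zoom factor.
// widthPt / heightPt are the page size in PDF points (1 pt = 1/72 inch).
function renderedPixels(widthPt: number, heightPt: number, scale: number): number {
  const widthPx = Math.round(widthPt * scale);
  const heightPx = Math.round(heightPt * scale);
  return widthPx * heightPx;
}

// US Letter is an assumed example size.
const pixelsAtScale1 = renderedPixels(612, 792, 1); // 484704
const pixelsAtScale2 = renderedPixels(612, 792, 2); // 1938816
console.log(pixelsAtScale2 / pixelsAtScale1); // 4
```

Dropping the scale would cut the work per page, at the cost of the recognition accuracy the higher resolution is there to protect.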

Question 3

Does this produce a searchable PDF?

Accepted Answer

No, only a plain .txt of the recognized text. Embedding a transparent text layer back into the original PDF requires a heavier pipeline (precise per-word coordinates, font matching, layer compositing) that we did not want to ship as 50+ MB of WebAssembly. If you only need to grep, Ctrl+F, or paste the contents elsewhere, the .txt is what you want. For a searchable PDF, Adobe Acrobat Pro is the most reliable choice; ABBYY FineReader is also excellent.

Question 4

Does my PDF leave my browser during OCR?

Accepted Answer

No. The PDF is parsed locally with pdfjs-dist (the same engine Firefox uses for its built-in PDF viewer), each page is rendered to an in-memory canvas, and Tesseract.js processes the canvas pixels directly. The only network traffic on first use is the one-time download of the Tesseract WASM core (~2 MB) and the language data files. Open DevTools → Network and you will see those downloads happen once; after that the request list stays empty as you process subsequent pages.
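The flow described above (render a page, hand the pixels to the recognizer, repeat) can be sketched as a simple loop. This is a minimal sketch, not the tool's actual code: `Bitmap`, `RenderPage`, `Recognize`, and `ocrPages` are hypothetical names, and the two callbacks stand in for whatever wraps pdfjs-dist rendering and the Tesseract.js recognizer in the browser.

```typescript
// Hypothetical shapes: renderPage would wrap PDF page rendering to a canvas,
// recognize would wrap the OCR engine. Neither touches the network here.
type Bitmap = { width: number; height: number; data: Uint8ClampedArray };
type RenderPage = (pageNumber: number, scale: number) => Promise<Bitmap>;
type Recognize = (image: Bitmap) => Promise<string>;

// Process pages one at a time, in order, entirely in memory.
async function ocrPages(
  pageCount: number,
  renderPage: RenderPage,
  recognize: Recognize,
  scale: number = 2, // the 2x zoom mentioned in the FAQ
): Promise<string[]> {
  const texts: string[] = [];
  for (let page = 1; page <= pageCount; page++) {
    const bitmap = await renderPage(page, scale);
    texts.push(await recognize(bitmap));
  }
  return texts;
}
```

Passing the renderer and recognizer in as parameters keeps the loop easy to test with fakes.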

Question 5

What languages are supported?

Accepted Answer

English, Korean (한국어), Chinese (Simplified and Traditional), Japanese, Spanish, French, and German. Each language pack is a separate download triggered on first use of that language. You can OCR a multilingual document by selecting two languages: Tesseract will load both packs (~25 MB combined for English and Korean) and use them together in a single recognition pass. Mixed Korean-English documents (common in Korean textbooks and IT manuals) work especially well in dual-language mode.
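Tesseract's convention for multi-language recognition is to join language codes with `+`, for example `kor+eng`. The sketch below builds such a string from a selection; the codes are the standard Tesseract language data names for the languages listed above, while `buildLangString` and the way the tool maps its language menu to these codes are assumptions.

```typescript
// Standard Tesseract language data names for the listed languages.
const SUPPORTED_LANGS = ["eng", "kor", "chi_sim", "chi_tra", "jpn", "spa", "fra", "deu"] as const;
type LangCode = (typeof SUPPORTED_LANGS)[number];

// Join the selected codes with "+", dropping duplicates but keeping order.
function buildLangString(codes: LangCode[]): string {
  if (codes.length === 0) {
    throw new Error("select at least one language");
  }
  const unique = codes.filter((code, index) => codes.indexOf(code) === index);
  return unique.join("+");
}

console.log(buildLangString(["kor", "eng"])); // "kor+eng"
```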

Question 6

Why are some characters wrong in my Korean output?

Accepted Answer

Korean OCR via Tesseract is solid but not perfect. Common failure modes include confusing 의 with 외 or 0 with 〇, and stylized 한글 fonts (handwriting fonts, retro typography) producing nonsense. The training data favors modern printed 명조체 (Myeongjo, serif) and 고딕체 (Gothic, sans-serif). If your source uses an unusual font, accuracy drops. Try rescanning at higher resolution (600 DPI), and if the document was printed before 1980 expect to need manual correction on roughly 10% of characters.

Question 7

Why would I use this instead of Adobe Acrobat OCR?

Accepted Answer

Privacy and price. Acrobat Pro OCR is excellent: Adobe has invested heavily in its OCR pipeline and it remains state of the art for searchable PDF output. But it costs $20/month and uploads your PDF to Adobe Document Cloud. Our tool is free, runs locally, and handles 90% of everyday OCR tasks (extracting text from a scanned receipt, an old book page, a Korean administrative document). For one-off use, paying $20 to read three receipts is not the right trade.

Question 8

Can I OCR a 200-page scanned book?

Accepted Answer

Technically yes, practically not in one sitting. At 8 seconds per page that is ~27 minutes of continuous CPU work and your laptop fan will run hard. The bigger issue is browser tab stability — keeping a tab busy for half an hour increases the chance of an unexpected reload or memory spike. Process in batches of 20-30 pages, save the .txt output between batches, and concatenate at the end. For really large books, a desktop Tesseract install via command line is faster.
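The batching advice and the time estimate above can be worked through in a few lines. This is a rough planning sketch; `planBatches` and `estimateMinutes` are hypothetical helpers, and the batch size of 25 is just one value inside the suggested 20-30 range.

```typescript
type Batch = { first: number; last: number }; // 1-based, inclusive page range

// Split pages 1..totalPages into consecutive ranges of at most batchSize pages.
function planBatches(totalPages: number, batchSize: number): Batch[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const plan: Batch[] = [];
  for (let first = 1; first <= totalPages; first += batchSize) {
    plan.push({ first, last: Math.min(first + batchSize - 1, totalPages) });
  }
  return plan;
}

// Continuous OCR time in minutes at a given per-page cost.
function estimateMinutes(pages: number, secondsPerPage: number): number {
  return (pages * secondsPerPage) / 60;
}

const bookPlan = planBatches(200, 25);
console.log(bookPlan.length); // 8
console.log(bookPlan[7]); // { first: 176, last: 200 }
console.log(estimateMinutes(200, 8).toFixed(1)); // "26.7"
```

Each batch of 25 pages then takes a little over three minutes at 8 seconds per page, which is a much shorter window for a tab reload to interrupt.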

Question 9

My OCR output has weird line breaks. How do I clean it?

Accepted Answer

Tesseract preserves line breaks as it sees them on the page, which means a paragraph spanning 8 lines on the scan becomes 8 short lines in the .txt. For prose, run a quick regex find-and-replace: replace single newlines (not double) with spaces. In VS Code: search "(?<!\n)\n(?!\n)" with regex enabled, replace with a space. For tabular data, keep the line breaks because they preserve row structure.
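If you would rather script the cleanup than do it in an editor, the same idea (turn a lone newline into a space, leave blank-line paragraph breaks alone) fits in a small function. This is a sketch; `joinWrappedLines` is a hypothetical helper, and the Windows line-ending normalization is an added assumption about the input.

```typescript
// Join hard-wrapped lines into paragraphs.
// A newline is replaced only when it is neither preceded nor followed by
// another newline, so blank lines between paragraphs survive.
function joinWrappedLines(text: string): string {
  const normalized = text.replace(/\r\n/g, "\n"); // treat CRLF as LF
  return normalized.replace(/(?<!\n)\n(?!\n)/g, " ");
}

const wrapped = "first line\nsecond line\n\nnext paragraph";
console.log(joinWrappedLines(wrapped));
// first line second line
//
// next paragraph
```

Run it on prose only; applied to tabular output it would flatten each table into a single line.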

Question 10

What happens if the PDF has both text and scanned pages?

Accepted Answer

We OCR every page uniformly. Text-based pages get OCR'd unnecessarily — slower but still correct, because rendering text to canvas and reading it back is reliable. If you know in advance which pages are scanned, splitting the PDF first and OCR-ing only the scanned subset saves time. A smarter "detect text vs scan per page" feature is on the roadmap but adds complexity (and false-positive risk) that we have not yet shipped.
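If you want to decide by hand which pages to split out, one rough heuristic is to look at how much embedded text each page already carries. The sketch below is not the tool's implementation (the detection feature is unshipped); `pagesNeedingOcr` and the 20-character threshold are assumptions, and the input is taken to be the text layer of each page as extracted by any PDF library.

```typescript
// Hypothetical heuristic: a page with fewer than minChars non-whitespace
// characters in its embedded text layer is treated as a scan.
// Returns 1-based page numbers that should go through OCR.
function pagesNeedingOcr(embeddedText: string[], minChars: number = 20): number[] {
  const scanned: number[] = [];
  embeddedText.forEach((text, index) => {
    const visibleChars = text.replace(/\s+/g, "").length;
    if (visibleChars < minChars) {
      scanned.push(index + 1);
    }
  });
  return scanned;
}

console.log(pagesNeedingOcr(["A full paragraph of embedded text here.", "", "  \n "]));
// [ 2, 3 ]
```

The threshold is where the false-positive risk lives: a scanned page with a stamped page number or watermark carries a little embedded text, and a text page holding only a short heading carries very little.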

Feature	OhMyPDF	Adobe Acrobat	SmallPDF	Google Drive
Free tier	Yes, unlimited	7-day trial	2/day free	Free with Google account
No upload to server	Yes	No (Adobe Cloud)	No	No (Google servers)
Korean OCR support	Yes	Yes	Limited	Yes (excellent)
Output as searchable PDF	No (.txt only)	Yes	Yes	Yes
No signup required	Yes	Adobe ID	Limited	Google account
Source code inspectable	Yes	No	No	No
Accuracy on clean scans	~95%	~98%	~95%	~98%

