OhMyPDF

PDF OCR

Convert a scanned PDF into searchable text using Tesseract.js. Everything runs in your browser — your PDF is never uploaded.

Why I built this

In February 2026 I scanned a 30-page Korean rental contract (전세계약서, a jeonse lease) that I needed to translate clause by clause into English for an immigration lawyer. The scan was a PDF, but the text was an image, so copy-paste was impossible. Adobe Acrobat Pro would have handled it, but I had let my subscription lapse, and re-subscribing for one document felt absurd. The first "free PDF OCR" sites I tried wanted me to upload the contract, which carried my name, my resident registration number (주민등록번호), my seal (도장) imprint, and the landlord's banking details, to a server I had never heard of.

Two of those sites required signup. Another returned only the first 3 pages for free and demanded $14.99 for the full document. Yet another produced output that was clearly machine-translated nonsense: it had run cheap OCR plus auto-translate, ruining the source text. None of them said how my PDF would be stored or for how long.

So I built this with Tesseract.js, the WebAssembly port of Google's open-source Tesseract OCR engine. Three rules: (1) the PDF stays in your browser; (2) every supported language is free, including Korean and Japanese; (3) no page limit and no monthly quota. For sensitive administrative documents (행정서류) such as contracts, certificates, and medical records, local OCR is the only sane choice.

How it works under the hood

When you drop a PDF, pdfjs-dist (the engine Firefox built for its native PDF viewer) parses the file and renders each page to an off-screen canvas at 2× scale. We chose 2× after experiments: 1× left character edges fuzzy and accuracy dropped to 88%, while 4× doubled processing time without a measurable accuracy gain. The canvas pixel buffer is handed directly to Tesseract.js, which runs entirely in a Web Worker so the main thread stays responsive.
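To make the scale trade-off concrete, here is a small TypeScript sketch of the pixel arithmetic. The helper name `renderSize` and the A4 page size are illustrative assumptions, not taken from the tool's source. It only shows how the number of pixels handed to the OCR engine grows with the render scale.

```typescript
// Illustrative helper (hypothetical name): canvas size for a PDF page
// rendered at a given scale. PDF page sizes are expressed in points (1/72 inch).
interface RenderSize {
  width: number;  // canvas width in pixels
  height: number; // canvas height in pixels
  pixels: number; // total pixels the OCR engine has to read
}

function renderSize(widthPt: number, heightPt: number, scale: number): RenderSize {
  if (scale <= 0) {
    throw new RangeError("scale must be positive");
  }
  const width = Math.ceil(widthPt * scale);
  const height = Math.ceil(heightPt * scale);
  return { width, height, pixels: width * height };
}

// An A4 page is 595 x 842 points.
const at1x = renderSize(595, 842, 1);
const at2x = renderSize(595, 842, 2);
const at4x = renderSize(595, 842, 4);

console.log(at2x.width, at2x.height);   // 1190 1684
console.log(at2x.pixels / at1x.pixels); // 4
console.log(at4x.pixels / at2x.pixels); // 4
```

Each doubling of the scale quadruples the pixel count, because both dimensions double. Measured OCR time does not have to grow in the same proportion, which is why the scale was chosen by experiment rather than by arithmetic alone.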

Tesseract.js is a WebAssembly compilation of the C++ Tesseract 5.x engine. On first use it downloads the core WASM binary (~2 MB) and your selected language data (~10 MB for English, ~15 MB for Korean). These are cached in the browser's IndexedDB storage so subsequent visits skip the download. The OCR itself runs in three stages: page segmentation (finding text regions and lines), text recognition (an LSTM neural network reading each text line), and post-processing (language-model-guided correction of ambiguous characters).

The bottleneck is CPU. Tesseract is single-threaded inside its worker — adding cores does not speed up a single page, only lets multiple pages run in parallel (which we do not currently exploit). On an M1 MacBook, expect ~6 seconds per page after warm-up. On a 4-year-old laptop, 10-15 seconds. On a phone, 20-30 seconds.
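The timing figures above can be turned into a rough estimate. The sketch below is an assumption-laden illustration: `estimateSeconds` is a hypothetical helper, and the `workers` parameter models a parallel worker pool that the tool does not currently use. It also assumes perfect scaling across workers, which real devices rarely deliver because of memory and thermal limits.

```typescript
// Rough wall-clock estimate for OCR-ing a document.
// `workers` models a hypothetical pool of parallel OCR workers; the tool
// described above processes pages one at a time (workers = 1).
function estimateSeconds(pages: number, secondsPerPage: number, workers: number = 1): number {
  if (!Number.isInteger(pages) || pages < 0) {
    throw new RangeError("pages must be a non-negative integer");
  }
  if (!Number.isInteger(workers) || workers < 1) {
    throw new RangeError("workers must be a positive integer");
  }
  if (secondsPerPage < 0) {
    throw new RangeError("secondsPerPage must not be negative");
  }
  // Pages are indivisible, so the busiest worker handles ceil(pages / workers).
  const pagesOnBusiestWorker = Math.ceil(pages / workers);
  return pagesOnBusiestWorker * secondsPerPage;
}

// A 30-page contract at ~6 seconds per page.
console.log(estimateSeconds(30, 6));    // 180
// The same document on a hypothetical pool of 4 workers.
console.log(estimateSeconds(30, 6, 4)); // 48
```

With one worker the estimate is simply pages times seconds per page; extra workers only help when there are multiple pages to spread across them.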

Verify it yourself: open DevTools → Network tab. On first OCR, you will see the WASM and language data files load (one time only). Clear the log and OCR a second page: the request list stays empty. Your PDF's contents never appear in any request payload.

Real use cases

  • Korean contract translation prep (전세계약서 lease contracts, 근로계약서 employment contracts): Extract text so you can paste paragraphs into DeepL or send them to a human translator. Far easier than retyping a 30-page document by hand.
  • Receipt archiving for tax filing: Many small businesses still issue paper receipts. OCR the scans to .txt, then grep for vendor names or amounts when comprehensive income tax (종합소득세) season arrives.
  • Researching old documents: Library archives often provide scanned-PDF copies of journal articles or books pre-1990. OCR turns those into searchable text for quote-finding and citation work.
  • Reading inaccessible PDFs aloud: Screen readers cannot read scan-only PDFs because there is no text layer. OCR the PDF to .txt and feed that into a text-to-speech tool — far more accessible for visually impaired users.
  • Digitizing handwritten family documents: genealogy records (족보), diaries (일기), old letters from parents or grandparents. Tesseract struggles with handwriting but does reasonably well on clearly written Hangul (한글) in documents from the 1980s onward.

vs other PDF OCR tools

How OhMyPDF OCR compares to popular OCR services as of May 2026:

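Feature                  | OhMyPDF        | Adobe Acrobat    | SmallPDF   | Google Drive
-------------------------|----------------|------------------|------------|-------------------------
Free tier                | Yes, unlimited | 7-day trial      | 2/day free | Free with Google account
No upload to server      | Yes            | No (Adobe Cloud) | No         | No (Google servers)
Korean OCR support       | Yes            | Yes              | Limited    | Yes (excellent)
Output as searchable PDF | No (.txt only) | Yes              | Yes        | Yes
No signup required       | Yes            | No (Adobe ID)    | Limited    | No (Google account)
Source code inspectable  | Yes            | No               | No         | No
Accuracy on clean scans  | ~95%           | ~98%             | ~95%       | ~98%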

Where competitors win: Adobe Acrobat Pro produces searchable PDFs (a text layer overlaid on the original scan), which is the gold standard for archive work. Google Drive's "Open with Google Docs" runs Google's neural OCR, which is arguably the best available, especially for non-Latin scripts. For sensitive documents that should not reach Google's or Adobe's servers, browser-side OCR remains the right trade-off: slightly lower accuracy in exchange for genuine privacy.

What this can't do

  • Produce a searchable PDF. Output is plain .txt only. For a text-layered PDF, use Adobe Acrobat Pro or ABBYY FineReader.
  • Recognize handwriting. Tesseract is trained on printed text. Handwriting accuracy drops to 30-50%. Google Cloud Vision or specialty tools handle handwriting better but require server upload.
  • Preserve layout (columns, tables, headers). Output is a linear stream of text. Tables come out as space-separated rows — usable but not pretty.
  • OCR languages outside our supported list. Arabic, Thai, Hebrew, Vietnamese, and many others are not currently bundled. Tesseract supports them; we just have not added the language packs yet.
  • Handle PDFs over 100 MB. Browser memory caps make larger files unstable. Split first, OCR per segment.

If you need searchable PDF output for archival work, use Adobe Acrobat Pro. If you need the best possible accuracy regardless of privacy cost, Google Drive's "Open with Google Docs" feature is exceptional.

Tips for best results

Scan at 300 DPI minimum. Below 200 DPI, character edges become fuzzy and accuracy collapses. If your source is a phone-camera photo, hold the camera level (not at an angle) and ensure even lighting — Tesseract handles flat scans much better than perspective-warped phone shots.

Straighten the page first. A page rotated 5° loses ~10% accuracy. If the scan is skewed, use a PDF rotation tool (or a desktop scan utility) to deskew before running OCR.

Choose the right language. Running English OCR on a Korean document produces garbage. Running Korean OCR on an English document produces garbage plus a slow run. For mixed documents, pick the dual-language option.

Process in batches. For documents over 50 pages, split into chunks of 20-30 pages, OCR each, and concatenate the .txt outputs. This avoids tab memory pressure and lets you save progress if something goes wrong.
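The batching advice can be sketched as a page-range calculation. This is an illustration only: `splitIntoBatches` and the `PageRange` shape are hypothetical names, and actually splitting the PDF is left to whichever split tool you use.

```typescript
// 1-based, inclusive page range for one OCR batch.
interface PageRange {
  start: number;
  end: number;
}

// Splits a document of `totalPages` pages into consecutive batches of at most
// `batchSize` pages. The last batch may be shorter.
function splitIntoBatches(totalPages: number, batchSize: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    batches.push({ start, end: Math.min(start + batchSize - 1, totalPages) });
  }
  return batches;
}

// A 70-page document in chunks of 25 pages.
const batches = splitIntoBatches(70, 25);
console.log(JSON.stringify(batches));
// [{"start":1,"end":25},{"start":26,"end":50},{"start":51,"end":70}]

// After OCR, the per-batch .txt contents can be joined in order.
const combined = ["text of batch 1", "text of batch 2", "text of batch 3"].join("\n\n");
console.log(combined.split("\n\n").length); // 3
```

Saving each batch's .txt before starting the next one means a crashed tab costs at most one batch of work.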

Proofread numbers. Tesseract sometimes confuses 0/O, 1/l/I, 5/S. For documents where numbers matter (invoices, tax forms), eyeball the output against the original.
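A quick script can narrow down where to look. The sketch below is a heuristic under stated assumptions, not part of the tool: `suspiciousTokens` is a hypothetical helper that flags whitespace-separated tokens mixing digits with the look-alike letters named above (O, l, I, S). It will miss errors where every character was misread, and it will flag legitimate codes such as part numbers.

```typescript
// Letters that Tesseract sometimes produces in place of 0, 1, or 5.
const LOOKALIKE = /[OlIS]/;
const DIGIT = /[0-9]/;

// Returns tokens that contain at least one digit AND at least one look-alike
// letter, e.g. "2O24" or "1,5S0". These are candidates for manual checking.
function suspiciousTokens(text: string): string[] {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0 && DIGIT.test(token) && LOOKALIKE.test(token));
}

const flagged = suspiciousTokens("Invoice 2O24 total 1,5S0 KRW paid to ISO Corp on 3 May");
console.log(flagged); // [ '2O24', '1,5S0' ]
```

"ISO" is not flagged because it has no digit, and "3" is not flagged because it has no look-alike letter. The flagged tokens still need to be checked against the original scan by eye.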

FAQ

How accurate is browser-side OCR? Tesseract.js is the open-source Tesseract engine compiled to WebAssembly. For clean scans of printed text at 300 DPI it hits 95%+ accuracy on Latin scripts and 90%+ on Korean (한글). Handwriting, low-resolution scans, or skewed pages can drop accuracy noticeably — straighten and rescan at 300 DPI for best results. Tables, multi-column layouts, and pages with mixed fonts (titles, body, footnotes) sometimes produce slightly misordered output because Tesseract reads top-to-bottom by default.

Why is OCR slow (~10 seconds per page)? Real OCR is computationally heavy: Tesseract analyzes every character shape, runs neural network classifiers, and post-processes with a language model. We render each page at 2× zoom for accuracy, which quadruples the pixels Tesseract must read (2× in each dimension). The first page is the slowest because the language data (~10 MB for English, ~15 MB for Korean) downloads on first run; it is then cached in the browser's IndexedDB, so later pages and later visits skip the download. On a modern laptop expect 5-8 seconds per page after warm-up; mobile devices are 2-3× slower.

Does this produce a searchable PDF? No, only a plain .txt of the recognized text. Embedding a transparent text layer back into the original PDF requires a heavier pipeline (precise per-word coordinates, font matching, layer compositing) that we did not want to ship as 50+ MB of WebAssembly. If you only need to grep, Ctrl+F, or paste the contents elsewhere, the .txt is what you want. For a searchable PDF, Adobe Acrobat Pro is the most reliable choice; ABBYY FineReader is also excellent.

Does my PDF leave my browser during OCR? No. The PDF is parsed locally with pdfjs-dist (the same engine Firefox uses for its built-in PDF viewer), each page is rendered to an in-memory canvas, and Tesseract.js processes the canvas pixels directly. The only network traffic on first use is the one-time download of the Tesseract WASM core (~2 MB) and the language data files. Open DevTools → Network and you will see those downloads happen once; after that the request list stays empty as you process subsequent pages.

What languages are supported? English, Korean (한국어), Chinese (Simplified and Traditional), Japanese, Spanish, French, and German. Each language pack is a separate download triggered on first use of that language. You can OCR a multilingual document by selecting two languages: Tesseract loads both packs (~25 MB combined for English plus Korean) and recognizes text using both. Mixed Korean-English documents (common in Korean textbooks and IT manuals) work especially well in dual-language mode.
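For readers curious how a dual-language selection might be expressed, Tesseract identifies languages by traineddata codes and accepts several at once joined with "+" (for example "kor+eng"). The sketch below maps the languages listed above to their standard codes; the `LANGUAGE_CODES` table and `toTesseractLangs` helper are hypothetical names for illustration, not the tool's actual code.

```typescript
// Standard Tesseract traineddata codes for the languages listed above.
const LANGUAGE_CODES = {
  English: "eng",
  Korean: "kor",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
  Japanese: "jpn",
  Spanish: "spa",
  French: "fra",
  German: "deu",
} as const;

type LanguageName = keyof typeof LANGUAGE_CODES;

// Builds the "+"-joined language string Tesseract uses for multi-language runs.
function toTesseractLangs(selected: LanguageName[]): string {
  if (selected.length === 0) {
    throw new Error("select at least one language");
  }
  const codes: string[] = selected.map((name) => LANGUAGE_CODES[name]);
  // Drop duplicates while keeping the order the user chose.
  const unique = codes.filter((code, index) => codes.indexOf(code) === index);
  return unique.join("+");
}

console.log(toTesseractLangs(["Korean", "English"])); // kor+eng
console.log(toTesseractLangs(["Japanese"]));          // jpn
```

Typing the selection as `LanguageName[]` lets the compiler reject a language that has no bundled pack.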

Why are some characters wrong in my Korean output? Korean OCR via Tesseract is solid but not perfect. Common failure modes: confusing 의 with 외, 0 with 〇, or stylized 한글 fonts (handwriting fonts, retro typography) producing nonsense. The training data favors modern printed 명조체 and 고딕체. If your source uses an unusual font, accuracy drops. Try rescanning at higher resolution (600 DPI), and if the document is older than 1980 expect to need manual correction on roughly 10% of characters.

Why would I use this instead of Adobe Acrobat OCR? Privacy and price. Acrobat Pro OCR is excellent: Adobe has invested heavily in its OCR pipeline, and it remains state of the art for searchable PDF output. But it costs $20/month and uploads your PDF to Adobe Document Cloud. Our tool is free, runs locally, and handles 90% of everyday OCR tasks (extracting text from a scanned receipt, an old book page, a Korean administrative document (행정서류)). For one-off use, paying $20 to read three receipts is not the right trade.

Can I OCR a 200-page scanned book? Technically yes, practically not in one sitting. At 8 seconds per page that is ~27 minutes of continuous CPU work and your laptop fan will run hard. The bigger issue is browser tab stability — keeping a tab busy for half an hour increases the chance of an unexpected reload or memory spike. Process in batches of 20-30 pages, save the .txt output between batches, and concatenate at the end. For really large books, a desktop Tesseract install via command line is faster.

My OCR output has weird line breaks. How do I clean it? Tesseract preserves line breaks as it sees them on the page, which means a paragraph spanning 8 lines on the scan becomes 8 short lines in the .txt. For prose, run a quick regex find-and-replace: replace single newlines (not double) with spaces. In VS Code: search "(?<!\n)\n(?!\n)" with regex enabled, replace with a space. For tabular data, keep the line breaks — they preserve row structure.
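If you would rather script the cleanup than use an editor, the same idea can be written as a small function. This is a sketch with one deliberate variation: it uses a capture group instead of the lookbehind in the regex above, and it leaves a trailing newline at the end of the file alone. The name `unwrapParagraphs` is illustrative.

```typescript
// Joins hard-wrapped lines into paragraphs: a single newline becomes a space,
// while blank lines (two or more newlines in a row) stay as paragraph breaks.
function unwrapParagraphs(text: string): string {
  return text
    // Normalize Windows line endings first.
    .replace(/\r\n/g, "\n")
    // Replace a newline that follows a non-newline character and is not
    // followed by another newline or the end of the text.
    .replace(/([^\n])\n(?!\n|$)/g, "$1 ");
}

const scanned =
  "This paragraph was\nwrapped by the scan\nacross three lines.\n\n" +
  "Second paragraph\nstarts here.\n";

console.log(JSON.stringify(unwrapParagraphs(scanned)));
// "This paragraph was wrapped by the scan across three lines.\n\nSecond paragraph starts here.\n"
```

As with the editor regex, do not run this on tabular output, where each line break marks a row.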

What happens if the PDF has both text and scanned pages? We OCR every page uniformly. Text-based pages get OCR'd unnecessarily — slower but still correct, because rendering text to canvas and reading it back is reliable. If you know in advance which pages are scanned, splitting the PDF first and OCR-ing only the scanned subset saves time. A smarter "detect text vs scan per page" feature is on the roadmap but adds complexity (and false-positive risk) that we have not yet shipped.
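One way such a per-page check could work is to count the characters in each page's embedded text layer and OCR only the pages where that count is near zero. The sketch below is purely hypothetical and is not the roadmap feature itself: the `PageTextInfo` shape, the `pagesNeedingOcr` name, and the 20-character threshold are all assumptions. The character counts would have to come from a PDF parser's text extraction (pdf.js exposes one through `getTextContent()`).

```typescript
// Hypothetical per-page summary of the PDF's embedded text layer.
interface PageTextInfo {
  pageNumber: number;    // 1-based page number
  embeddedChars: number; // characters found in the text layer (0 for a pure scan)
}

// Pages whose text layer is nearly empty are treated as scans that need OCR.
// The default threshold of 20 characters is an arbitrary illustrative value.
function pagesNeedingOcr(pages: PageTextInfo[], minChars: number = 20): number[] {
  return pages
    .filter((page) => page.embeddedChars < minChars)
    .map((page) => page.pageNumber);
}

const summary: PageTextInfo[] = [
  { pageNumber: 1, embeddedChars: 1840 }, // normal text page
  { pageNumber: 2, embeddedChars: 0 },    // pure scan
  { pageNumber: 3, embeddedChars: 12 },   // scan with a stamped page number
  { pageNumber: 4, embeddedChars: 2210 }, // normal text page
];

console.log(pagesNeedingOcr(summary)); // [ 2, 3 ]
```

The threshold is where the false-positive risk lives: a nearly blank text page would be sent to OCR needlessly, and a scan carrying a long stamped header could be skipped by mistake.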

Part of the OhMy* tools family