Step-by-step
- 1. Open the PDF to Text tool.
- 2. Upload your PDF.
- 3. Click Extract text.
- 4. Copy all to clipboard, or Download as .txt.
Text-based vs scanned PDFs
Text-based PDFs (exported from Word, Pages, or a browser) contain real text, so extraction is usually accurate and complete.
Scanned PDFs are images of text. Our extractor will return empty or garbled content for those. You'll need OCR (Optical Character Recognition) — try our AI Text Extractor for scanned files.
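If you extract text in your own script, a simple heuristic can flag a PDF that is probably scanned: its pages yield almost no characters. The sketch below is only an illustration. The function name and the `minCharsPerPage` threshold are assumptions, not values used by this tool, and it only catches the "empty" case, not garbled output.

```typescript
// Heuristic sketch: flag a PDF as probably scanned when its pages yield almost no text.
// `minCharsPerPage` is an assumed threshold; tune it for your own documents.
function looksScanned(pageTexts: string[], minCharsPerPage = 20): boolean {
  if (pageTexts.length === 0) return false; // no pages, nothing to judge

  // Count non-whitespace characters across all pages.
  let totalChars = 0;
  for (const text of pageTexts) {
    totalChars += text.replace(/\s/g, "").length;
  }

  // Average characters per page below the threshold suggests image-only pages.
  return totalChars / pageTexts.length < minCharsPerPage;
}
```

A result of `true` means OCR is likely the right next step. A short cover page in an otherwise normal PDF will not trigger it, because the average is taken over all pages.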
Common uses
- Pasting research paper text into ChatGPT for a summary.
- Copying a contract clause into an email without retyping.
- Migrating an old PDF into a Notion / Obsidian note.
- Building a training dataset of clean text.
Tips for cleaner output
- PDFs with columns may interleave text — paste into a wide text editor to spot issues.
- Tables come out as rows of words; use a table extractor if you need cells.
- Headers and footers repeat on every page; a quick find-and-replace removes them.
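For the repeated headers and footers mentioned in the last tip, a script can do the find-and-replace for you when the text is available per page. This is a minimal sketch under that assumption. The function name and the `minShare` cutoff are hypothetical, and the approach only removes lines that are identical across pages.

```typescript
// Sketch: drop lines that occur on most pages (likely running headers or footers).
// `pages` is the extracted text split per page; `minShare` is an assumed cutoff.
function stripRepeatedLines(pages: string[], minShare = 0.8): string[] {
  // How many pages each distinct non-empty line appears on.
  const pageCount: { [line: string]: number } = Object.create(null);
  for (const page of pages) {
    const seen: { [line: string]: boolean } = Object.create(null);
    for (const raw of page.split("\n")) {
      const line = raw.trim();
      if (line === "" || seen[line]) continue; // count each line once per page
      seen[line] = true;
      pageCount[line] = (pageCount[line] || 0) + 1;
    }
  }

  // A line must repeat on at least two pages and on `minShare` of all pages.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page
      .split("\n")
      .filter((raw) => (pageCount[raw.trim()] || 0) < threshold)
      .join("\n")
  );
}
```

Footers that change on every page, such as "Page 3 of 12", are not identical lines and would need a pattern match instead.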
Why text extraction sometimes returns garbage
Even text-based PDFs occasionally extract as jumbled letters or with missing characters. The most common culprit is custom font encoding: some PDFs embed fonts that use private character codes (CIDs) without a ToUnicode table, so the extractor can identify which glyphs to draw but has no mapping back to Unicode characters. Other causes include heavily ligatured fonts, vertical text, and PDFs where the visible text is drawn as vector paths (outlined fonts), in which case there is no text layer at all.
When extraction fails, OCR is the fallback. Run the PDF through our Text Extractor (OCR), which reads the rendered page as an image and re-types the visible glyphs as Unicode. OCR is slower and less accurate than a clean text layer, but it works on any text that is visible on the page.
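One rough way to decide when to fall back to OCR is to measure how much of the output looks like unmapped glyphs. The sketch below assumes that failed mappings surface as the replacement character (U+FFFD), control characters, or Private Use Area code points (U+E000 to U+F8FF). That is common but not guaranteed, and the function name is hypothetical.

```typescript
// Sketch: share of characters that suggest a failed glyph-to-Unicode mapping.
// Assumption: failures show up as U+FFFD, control characters, or
// Private Use Area code points (U+E000-U+F8FF). This is a heuristic only.
function suspiciousShare(text: string): number {
  let total = 0;
  let suspicious = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    // Skip ordinary whitespace so layout does not dilute the ratio.
    if (code === 0x20 || code === 0x0a || code === 0x0d || code === 0x09) continue;
    total++;
    const isControl = code < 0x20 || code === 0x7f;
    const isPrivateUse = code >= 0xe000 && code <= 0xf8ff;
    const isReplacement = code === 0xfffd;
    if (isControl || isPrivateUse || isReplacement) suspicious++;
  }
  return total === 0 ? 0 : suspicious / total;
}
```

A high share suggests trying OCR. A low share does not prove the text is correct: a font that maps glyphs to the wrong ordinary letters produces jumbled output that this check cannot see.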
Preserving structure: tables, columns, lists
Plain text extraction loses visual structure by design — there's no way to know which words form a table cell or a bullet item from the raw text stream. If you need structure preserved, use PDF to Word: it converts the PDF into a .docx with headings, paragraphs, lists and simple tables intact. That's a better starting point than plain text if your downstream tool is Word, Google Docs, Notion or any editor that understands rich text.
For multi-column layouts (academic journals, magazines, newspapers), text often interleaves between columns. Paste the output into a wide text editor and look for sudden topic shifts — that's where columns merged. Splitting at those points usually restores the reading order.
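If you have access to text positions rather than only the plain-text output, the column split described above can be done by coordinate instead of by eye. The following is a minimal sketch for a two-column page. The `TextFragment` shape, the function name, and the `splitX` gutter position are assumptions for illustration. PDF.js text items expose `str` and a six-number `transform` whose last two entries are the x and y position, which could be mapped onto this shape.

```typescript
// Minimal shape of a positioned text fragment (hypothetical, for illustration).
interface TextFragment {
  str: string;
  x: number; // left edge, in page units
  y: number; // baseline in PDF coordinates (larger y = higher on the page)
}

// Sketch for a two-column page: read the whole left column, then the right.
// `splitX` (the gutter position) is an assumption the caller must supply.
function readTwoColumns(fragments: TextFragment[], splitX: number): string {
  // Top to bottom first, then left to right within the same line.
  const byReadingOrder = (a: TextFragment, b: TextFragment): number =>
    b.y - a.y || a.x - b.x;

  // filter() returns new arrays, so the caller's input is not reordered.
  const left = fragments.filter((f) => f.x < splitX).sort(byReadingOrder);
  const right = fragments.filter((f) => f.x >= splitX).sort(byReadingOrder);

  return left.concat(right).map((f) => f.str).join("\n");
}
```

This handles only a fixed two-column grid. Full-width titles, figures that span both columns, and three-column layouts need more careful handling.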
FAQ
- Is my PDF uploaded to a server?
- No. Text extraction runs entirely in your browser using PDF.js. Nothing leaves your device.
- Why is the extracted text empty?
- Your PDF is probably scanned — it's an image, not real text. You need OCR.
- Can I extract text from a password-protected PDF?
- Unlock it first using our PDF Password Remover, then extract.
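For readers curious what in-browser extraction with PDF.js looks like, here is a minimal sketch. It is not this tool's actual code. The `DocLike`, `PageLike` and `TextItemLike` interfaces are hypothetical names that describe only the small part of the PDF.js API used here (`numPages`, `getPage`, `getTextContent`, and the `str` and `hasEOL` fields of text items), so the function can be read and tested without loading the library.

```typescript
// Minimal structural types covering the part of the PDF.js API used here.
// A document from `pdfjsLib.getDocument(src).promise` is expected to fit `DocLike`.
interface TextItemLike { str?: string; hasEOL?: boolean }
interface PageLike { getTextContent(): Promise<{ items: TextItemLike[] }> }
interface DocLike { numPages: number; getPage(pageNumber: number): Promise<PageLike> }

// Extract text page by page; page numbers are 1-indexed in PDF.js.
async function extractPages(doc: DocLike): Promise<string[]> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    let text = "";
    for (const item of content.items) {
      // Marked-content items carry no `str`; skip them.
      if (typeof item.str !== "string") continue;
      // `hasEOL` marks the end of a line; older PDF.js versions may not set it.
      text += item.str + (item.hasEOL ? "\n" : "");
    }
    pages.push(text);
  }
  return pages;
}
```

Because the file is read and parsed by script running in the page, no upload request is needed for this step. An empty string for every page is the typical sign of a scanned PDF.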