Does it work on scanned PDFs (page images, not real text)?

**No**. This tool **does not do OCR**: it reads only the text that is already stored as text inside the PDF. If your document is a scan of paper or a "Print to PDF" export of images, every page will come back **empty** and you will see a warning. For scans you need a separate OCR tool (Google Drive, Adobe Acrobat, Tesseract). Run OCR first, then upload the resulting PDF here; that combination works.

How accurate is the extraction? Do I get the exact same text as the PDF?

**Very accurate for normal documents** (Word, Google Docs, LaTeX, browser exports). The text comes through 1:1. **Issues can happen with**: tables (column order may drift), multi-column newspapers (columns may interleave), forms (text fields are extracted separately from their labels), and PDFs produced by pre-2010 PDF printer drivers that embed custom font encodings.

What about paragraph breaks and line endings?

A PDF **does not store paragraphs** the way Word does. Each line is a separate positioned fragment. pdf.js joins fragments by their position so reading order is preserved, but **the result is rarely perfectly paragraphed**. In practice most documents come out clean. Where a paragraph arrives as a cluster of short lines, the easiest fix is a find-and-replace in your editor: turn each single \n into a space and leave each double \n alone.
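
The find-and-replace described above can also be scripted. This is a minimal sketch, not part of the tool: `unwrapParagraphs` is a hypothetical helper name, and it assumes that any run of two or more newlines is a real break (paragraph or page) while a single newline is only a hard wrap.

```typescript
// Hypothetical helper (not part of the tool): joins hard-wrapped lines
// while keeping blank-line breaks intact.
function unwrapParagraphs(text: string): string {
  return text
    .replace(/\r\n/g, "\n") // normalise Windows line endings first
    .split(/\n{2,}/) // two or more newlines mark a paragraph or page break
    .map((para) => para.replace(/\s*\n\s*/g, " ").trim()) // single newline -> space
    .filter((para) => para.length > 0) // drop empty chunks
    .join("\n\n");
}

const sample =
  "A PDF does not\nstore paragraphs.\n\nEach line is a\nseparate fragment.";
console.log(unwrapParagraphs(sample));
```

The sketch does not rejoin words hyphenated across a line break, so those still need a manual pass.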

Are pages clearly separated in the export?

**Yes**. In the "Plain text" tab pages are separated by **two newlines** (\n\n) so the boundary is visible. In the "By page" tab each page is its own card with the page number, character count, and a per-page download button. Internally we use the ASCII **form-feed character (\f, U+000C)** as the page separator, then render it as a visible blank gap in the final output for readability.
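
As an illustration of the separator handling described above, here is a small sketch. The function names are hypothetical, and it assumes the raw text holds one form feed between consecutive pages.

```typescript
// Form feed (U+000C), used here as the internal page separator.
const PAGE_SEPARATOR = "\f";

// Split raw extracted text into one string per page.
function splitPages(raw: string): string[] {
  return raw.split(PAGE_SEPARATOR);
}

// Build the plain-text view: pages joined by two newlines.
function toPlainText(raw: string): string {
  return splitPages(raw)
    .map((page) => page.trim())
    .join("\n\n");
}

const raw = "Page one text\fPage two text\fPage three text";
console.log(splitPages(raw).length);
console.log(toPlainText(raw));
```

Keeping the form feed internally means page boundaries cannot be confused with ordinary blank lines inside a page.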

What about password-protected (encrypted) PDFs?

**Some yes, some no**. PDFs have two protection types: **owner** (print / copy lock) and **user** (open password). The first only sets permission flags: the file still opens without a password, so we read it and ignore the flags. The second cannot be bypassed because the content cannot be decrypted without the password, and you will get an "encryptedPdf" error. Workaround: open in Adobe Reader, enter the password, "Save a copy" as an unprotected PDF, then upload that.

Why do some special characters come out as garbage?

PDF has no single character encoding: each embedded font can carry **its own mapping** from glyphs to Unicode. Documents from professional tools (Word, LaTeX, InDesign) include a "ToUnicode CMap" and everything decodes cleanly. **Old PDF printers from the 2000s** (Acrobat Distiller 5, networked copiers) skip that, and accented or non-Latin characters come out as broken sequences. The only fix for those files is to run OCR on the PDF, treating it as if it were a scan.
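
If you process many files, a rough heuristic can flag output that probably suffered from a missing Unicode mapping. This sketch is an assumption of ours, not something the tool does: it counts replacement characters (U+FFFD), Private Use Area code points, and stray control characters. It will miss text that was mapped to valid but wrong letters.

```typescript
// Hypothetical heuristic: share of non-whitespace characters that look
// like decoding failures rather than real text.
function suspiciousCharRatio(text: string): number {
  let total = 0;
  let suspicious = 0;
  for (const ch of Array.from(text)) {
    if (/\s/.test(ch)) continue; // ignore whitespace entirely
    const cp = ch.codePointAt(0);
    if (cp === undefined) continue;
    total += 1;
    const isReplacement = cp === 0xfffd; // U+FFFD replacement character
    const isPrivateUse = cp >= 0xe000 && cp <= 0xf8ff; // BMP Private Use Area
    const isControl = cp < 0x20; // control characters other than whitespace
    if (isReplacement || isPrivateUse || isControl) suspicious += 1;
  }
  return total === 0 ? 0 : suspicious / total;
}

console.log(suspiciousCharRatio("Plain readable text"));
console.log(suspiciousCharRatio("a\uFFFDb\uE001"));
```

A threshold for "garbage" is a judgment call; any cut-off you pick should be checked against your own documents.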

Why do some pages come back empty or with just a single character?

**Three common reasons**: (1) the page contains only images or diagrams with no text layer, (2) it is a blank section divider (typical in annual reports with "Chapter 3" splash pages), (3) the text uses a non-standard font without a Unicode mapping. We flag every such page with a yellow **Empty** badge in the "By page" tab so you can immediately spot where extraction failed.
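
The empty-page flag can be pictured with a short sketch. The `PageText` shape and function names are hypothetical, and the rule used here (at most one non-whitespace character counts as empty) is our assumption based on the question above, not the tool's documented threshold.

```typescript
// Hypothetical shape for one extracted page.
interface PageText {
  pageNumber: number;
  text: string;
}

// Assumption: a page is "empty" when it has at most one non-whitespace character.
function isEmptyPage(page: PageText): boolean {
  return page.text.replace(/\s/g, "").length <= 1;
}

// Page numbers that would get the Empty badge.
function emptyPageNumbers(pages: PageText[]): number[] {
  return pages.filter(isEmptyPage).map((page) => page.pageNumber);
}

// If every page is empty, the file is most likely a scan that needs OCR.
function looksScanned(pages: PageText[]): boolean {
  return pages.length > 0 && pages.every(isEmptyPage);
}

const pages: PageText[] = [
  { pageNumber: 1, text: "Annual report 2023" },
  { pageNumber: 2, text: "  \n " },
  { pageNumber: 3, text: "3" },
];
console.log(emptyPageNumbers(pages));
console.log(looksScanned(pages));
```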

What happens to my file after extraction? Is it stored anywhere?

**No**. The file goes into **Node process memory** as a buffer, pdf.js parses it, we send the result back, and the buffer is released by the garbage collector. **Nothing touches disk**, nothing goes to a database, nothing stays in a cache. Once the request finishes there is no trace your PDF was ever here. The only state that outlives a request is the 30-files-per-hour rate limit: an anonymous per-IP counter held in memory and wiped on restart.
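
To show what an in-memory counter of this kind can look like, here is a minimal fixed-window sketch. It is not the tool's actual code: the class name, the fixed-window strategy, and keying by a plain string are all assumptions made for illustration.

```typescript
// Hypothetical fixed-window rate limiter held entirely in process memory.
interface Counter {
  count: number;
  windowStart: number; // epoch milliseconds when the current window began
}

class InMemoryRateLimiter {
  private readonly counters = new Map<string, Counter>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 30, windowMs: number = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and records it.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.counters.get(key);
    if (entry === undefined || now - entry.windowStart >= this.windowMs) {
      // First request, or the previous window expired: start a new one.
      this.counters.set(key, { count: 1, windowStart: now });
      return true;
    }
    if (entry.count >= this.limit) return false; // over the limit
    entry.count += 1;
    return true;
  }
}

const limiter = new InMemoryRateLimiter();
console.log(limiter.allow("203.0.113.7"));
```

Because the map lives in the process, restarting the server clears every counter, which matches the "wiped on restart" behaviour described above.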

What is the maximum file size?

**20 MB and 500 pages**. That covers most everyday documents: contracts, manuals, reports, theses, ebooks, technical specs. Beyond that we refuse because (a) parsing starts taking tens of seconds and would block other users, (b) very large PDFs are often high-resolution scans where you would need OCR anyway. If your file is bigger, split it into sections with a free PDF splitter (or Adobe Reader's "Extract pages") before uploading.
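
If you want to pre-check files before uploading, the two limits can be expressed as a small function. This is a sketch under assumptions: the reason strings are hypothetical (the tool's real error names are not documented here), and 1 MB is taken as 1024 × 1024 bytes.

```typescript
// Limits stated above. Assumption: 1 MB = 1024 * 1024 bytes.
const MAX_BYTES = 20 * 1024 * 1024;
const MAX_PAGES = 500;

// Hypothetical result type; the reason names are illustrative only.
type LimitResult =
  | { ok: true }
  | { ok: false; reason: "fileTooLarge" | "tooManyPages" };

function checkLimits(sizeBytes: number, pageCount: number): LimitResult {
  if (sizeBytes > MAX_BYTES) return { ok: false, reason: "fileTooLarge" };
  if (pageCount > MAX_PAGES) return { ok: false, reason: "tooManyPages" };
  return { ok: true };
}

console.log(checkLimits(5 * 1024 * 1024, 120));
console.log(checkLimits(25 * 1024 * 1024, 120));
```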

PDF Text Extractor - free

What PDF text extraction is

PDF text extraction is the process of pulling just the textual content out of a PDF file into plain text you can copy, paste, search, or feed into another tool. We do it server-side because PDFs are not simple: text inside a PDF is not stored as a single string. It lives as hundreds of small fragments placed at exact positions by whoever produced the document (Word, InDesign, a PDF printer).

You upload a PDF and we parse it with pdf.js, the same engine Firefox uses to render PDFs in the browser. You get the full text, a per-page breakdown, and the metadata (title, author, creation date). No installs, no third-party cloud, no account.
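
To make the idea of positioned fragments concrete, here is a simplified sketch of turning them back into lines. It is not how pdf.js works internally: the `Fragment` shape, the function name, and the vertical tolerance are assumptions, and fragments on one line are simply joined with a space.

```typescript
// Hypothetical fragment: a piece of text with its position on the page.
// PDF coordinates start at the bottom-left, so a larger y is higher up.
interface Fragment {
  text: string;
  x: number;
  y: number;
}

function fragmentsToLines(fragments: Fragment[], yTolerance: number = 2): string[] {
  // Top of the page first, then left to right.
  const sorted = [...fragments].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines: { y: number; parts: Fragment[] }[] = [];
  for (const frag of sorted) {
    const last = lines.length > 0 ? lines[lines.length - 1] : undefined;
    if (last !== undefined && Math.abs(last.y - frag.y) <= yTolerance) {
      last.parts.push(frag); // close enough vertically: same line
    } else {
      lines.push({ y: frag.y, parts: [frag] }); // start a new line
    }
  }
  // Within each line, order fragments left to right and join them.
  return lines.map((line) =>
    line.parts
      .sort((a, b) => a.x - b.x)
      .map((f) => f.text)
      .join(" ")
  );
}

const fragments: Fragment[] = [
  { text: "world", x: 60, y: 700 },
  { text: "Hello", x: 10, y: 700.5 },
  { text: "Second line", x: 10, y: 686 },
];
console.log(fragmentsToLines(fragments));
```

A single top-to-bottom pass like this is also why multi-column layouts and tables can come out interleaved: fragments from different columns share the same vertical position.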

How to use it

Drop a PDF file onto the upload area or click to pick one from disk. Single file at a time, up to 20 MB and 500 pages.
Wait a few seconds. Large documents (200+ pages) can take 10 to 20 seconds because each page is parsed individually.
After extraction you get three tabs: Plain text (everything joined), By page (each page separately), and Metadata (title, author, dates).
In the "Plain text" tab use Copy to put the content on your clipboard or Download .txt to save it locally.
In the "By page" tab you will see which pages are empty (marked with a yellow badge). Each page can also be downloaded on its own.
In the "Metadata" tab you can check who authored the document and when, with what software (Producer), and whether the file was encrypted.
If every page comes back empty, you get a warning. That means the PDF is a scanned image and you need a separate OCR tool first.

When this is useful

Seven typical PDF-to-text scenarios:

Copying from a PDF that blocks selection: some documents have the "copy" function disabled in the reader. This tool pulls the text out anyway.
Pasting quotes into Word or Google Docs: no more retyping whole paragraphs from a PDF you have open on screen.
Preparing text for a language model (ChatGPT, Claude): you copy the result and drop it into the chat window instead of fighting with PDF formatting.
Full-text search across a document archive: extracted text can be indexed by grep, ripgrep, Notion, or Obsidian for fast lookup later.
Translating a contract or manual: clean text pastes into DeepL or Google Translate without the layout artefacts a PDF would carry over.
Conversion to other formats: a .txt baseline is the starting point for Markdown, HTML, CSV, or whatever the next step in your pipeline needs.
Pulling tables of numbers out of a report: a PDF full of figures becomes text you can paste into a spreadsheet and sort.

Companion tools: HTML to Markdown converter, JSON formatter, regex tester.

Questions and answers

Why does extraction run on the server instead of in my browser?

The full pdf.js bundle is over 3 MB of JavaScript, and loading it in every visitor's browser would slow the page noticeably. Server-side, the library stays warm in the Node process and you only upload the file. The latency win is significant, especially on slow networks. The file is not persisted: once the response is sent, the buffer is dropped.

Related tools

PDF and image converter

DOCX to Markdown Converter

HTML / Markdown Converter

JSON formatter

Regex tester
