What PDF text extraction is
PDF text extraction is the process of pulling just the textual content out of a PDF file into plain text you can copy, paste, search, or feed into another tool. We do it server-side because PDFs are not simple: the text inside a PDF is not stored as a single string. It lives as hundreds of small fragments, each placed at an exact position by the software that produced the document (Word, InDesign, a PDF printer).
You upload a PDF and we parse it with pdf.js, the same engine Firefox uses to render PDFs in the browser. You get the full text, a per-page breakdown, and the metadata (title, author, creation date). No installs, no third-party cloud, no account.
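To make the "hundreds of small fragments" point concrete, here is a minimal sketch of how positioned fragments can be stitched back into readable lines. It is not the pdf.js API and not necessarily how this tool does it: the `(x, y, text)` fragment shape, the `fragments_to_text` name, and the `y_tolerance` value are assumptions made for illustration.

```python
# Hypothetical fragment shape: (x, y, text). Real PDF parsers expose more
# fields (font, width, transform); these three are enough to show the idea.
Fragment = tuple[float, float, str]


def fragments_to_text(fragments: list[Fragment], y_tolerance: float = 2.0) -> str:
    """Group fragments into lines by vertical position, then read left to right."""
    lines: list[tuple[float, list[Fragment]]] = []
    # PDF coordinates grow upward, so a larger y is higher on the page.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= y_tolerance:
            # Close enough vertically to the current line: same line.
            lines[-1][1].append(frag)
        else:
            # Start a new line anchored at this fragment's y position.
            lines.append((frag[1], [frag]))
    return "\n".join(
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for _, line in lines
    )


sample = [
    (120.0, 700.0, "world"),
    (72.0, 700.5, "Hello"),
    (72.0, 680.0, "Second"),
    (130.0, 680.0, "line"),
]
print(fragments_to_text(sample))
```

The fragments arrive in arbitrary order and with slightly different baselines, which is why the sketch sorts by position and allows a small vertical tolerance instead of comparing y values exactly.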
How to use it
- Drop a PDF file onto the upload area or click to pick one from disk. Single file at a time, up to 20 MB and 500 pages.
- Wait a few seconds. Large documents (200+ pages) can take 10 to 20 seconds because each page is parsed individually.
- After extraction you get three tabs: Plain text (everything joined), By page (each page separately), and Metadata (title, author, dates).
- In the "Plain text" tab use Copy to put the content on your clipboard or Download .txt to save it locally.
- In the "By page" tab you will see which pages are empty (marked with a yellow badge). Each page can also be downloaded on its own.
- In the "Metadata" tab you can check who authored the document and when, which software produced it (Producer), and whether the file was encrypted.
- If every page comes back empty, you get a warning. That usually means the PDF is a scanned image with no text layer, so you need to run it through a separate OCR tool first.
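The outputs described in the steps above (joined plain text, empty-page flags, the scanned-PDF warning) and the upload limits can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: `ExtractionResult`, `within_limits`, and `summarise` are hypothetical names, and treating 20 MB as 20 * 1024 * 1024 bytes is a guess.

```python
from dataclasses import dataclass

MAX_BYTES = 20 * 1024 * 1024  # 20 MB limit; the exact byte count is an assumption
MAX_PAGES = 500               # page limit from the upload step


@dataclass
class ExtractionResult:
    plain_text: str          # "Plain text" tab: all non-empty pages joined
    empty_pages: list[int]   # 1-based numbers of pages with no text
    looks_scanned: bool      # True when every page is empty


def within_limits(size_bytes: int, page_count: int) -> bool:
    """Check a file against the upload limits before parsing it."""
    return size_bytes <= MAX_BYTES and page_count <= MAX_PAGES


def summarise(pages: list[str]) -> ExtractionResult:
    """Build the combined view from a list of per-page texts."""
    empty = [n for n, text in enumerate(pages, start=1) if not text.strip()]
    plain = "\n\n".join(text.strip() for text in pages if text.strip())
    # A document with pages but no text at all is probably a scan.
    looks_scanned = bool(pages) and len(empty) == len(pages)
    return ExtractionResult(plain, empty, looks_scanned)


result = summarise(["Intro", "   ", "End"])
print(result.empty_pages, result.looks_scanned)
```

Whitespace-only pages count as empty here, since a page containing nothing but spaces or line breaks has no text worth copying.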
When this is useful
Seven typical PDF-to-text scenarios:
- Copying from a PDF that blocks selection: some documents have the "copy" function disabled in the reader. This tool pulls the text out anyway.
- Pasting quotes into Word or Google Docs: no more retyping whole paragraphs from a PDF you have open on screen.
- Preparing text for a language model (ChatGPT, Claude): copy the result and paste it into the chat window instead of fighting with PDF formatting.
- Full-text search across a document archive: extracted text can be indexed by grep, ripgrep, Notion, or Obsidian for fast lookup later.
- Translating a contract or manual: clean text pastes into DeepL or Google Translate without the layout artefacts a PDF would carry over.
- Conversion to other formats: a .txt baseline is the starting point for Markdown, HTML, CSV, or whatever the next step in your pipeline needs.
- Pulling tables of numbers out of a report: a PDF full of figures becomes text you can paste into a spreadsheet and sort.
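As one example of the archive-search scenario from the list above, here is a small sketch that scans a folder of downloaded .txt files for a phrase. It assumes the files were saved as UTF-8 into a single folder; `search_archive` is a hypothetical helper name, and tools like grep or ripgrep do the same job faster from the command line.

```python
from pathlib import Path


def search_archive(folder: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for every case-insensitive match."""
    hits: list[tuple[str, int, str]] = []
    needle_lower = needle.lower()
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the search going if a file has stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle_lower in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```

Calling `search_archive("extracted", "termination")` on a folder named `extracted` (a placeholder) would list every line mentioning the word, with the file and line number to jump to.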
Companion tools: HTML to Markdown converter, JSON formatter, regex tester.