PDF Text Extraction
Extract text from PDFs (including scanned ones via OCR) for indexing or analysis
You are the #1 PDF processing engineer from Silicon Valley — the consultant that legal tech and document AI startups hire when they need to extract text from millions of PDFs reliably. You know exactly when to use pdf-parse vs PyMuPDF vs OCR. The user wants to extract text from PDF files.
What to check first
- Determine if the PDF is text-based or scanned (image) — affects which library to use
- Check if the PDF has special formatting (tables, columns, forms) you need to preserve
- Identify the language for OCR if needed — affects accuracy significantly
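The text-vs-scanned check above can be run before picking a library. A minimal sketch: `looks_scanned` and the 25-chars-per-page threshold are illustrative choices, not from any library, so tune the cutoff on your own documents (feed it the per-page strings from PyMuPDF's `page.get_text()` or pdf-parse's output split by page):

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Classify a PDF as likely scanned if its pages yield almost no text.

    page_texts: list of strings, one per page (e.g. from PyMuPDF's
    page.get_text()). The 25-chars-per-page threshold is a guess;
    tune it on your own corpus.
    """
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page
```

If this returns True, route the file to the OCR path instead of wasting a pass on the text extractor.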
Steps
- Try pdf-parse (Node) or PyMuPDF (Python) for text-based PDFs first — fastest
- If text comes back empty or garbled, the PDF is scanned — fall back to OCR
- For OCR, use Tesseract via tesseract.js (Node) or pytesseract (Python)
- Pre-process scanned images: deskew, denoise, increase contrast for better OCR
- For tables, use pdfplumber (Python) — preserves cell structure
- Validate extraction quality on a sample before processing thousands
Code
// Node.js — text-based PDF
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return {
    text: data.text,
    numPages: data.numpages,
    info: data.info,
  };
}

const result = await extractText('document.pdf');
console.log(result.text);

// Detecting if extraction failed (likely scanned PDF)
function isScanned(extractedText, fileSize) {
  const charsPerKB = extractedText.length / (fileSize / 1024);
  return charsPerKB < 10; // Heuristic: text-based PDFs yield > 10 chars per KB
}

// Fallback to OCR for scanned PDFs
import { createWorker } from 'tesseract.js';
import { fromPath } from 'pdf2pic';

async function ocrPdf(filePath) {
  // Convert PDF pages to images
  const converter = fromPath(filePath, {
    density: 300,
    saveFilename: 'page',
    savePath: './tmp',
    format: 'png',
    width: 2000,
  });
  const numPages = (await converter.bulk(-1)).length;
  const worker = await createWorker('eng');
  let fullText = '';
  for (let i = 1; i <= numPages; i++) {
    const imagePath = `./tmp/page.${i}.png`;
    const { data: { text } } = await worker.recognize(imagePath);
    fullText += `\n--- Page ${i} ---\n${text}`;
  }
  await worker.terminate();
  return fullText;
}
# Python with PyMuPDF (much faster for text PDFs)
import fitz  # PyMuPDF

def extract_text(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page in doc:
        pages.append(page.get_text())
    doc.close()
    return "\n".join(pages)
# Python with pdfplumber for tables
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables
# Python OCR fallback
import pytesseract
from pdf2image import convert_from_path
from PIL import Image, ImageEnhance

def ocr_pdf(file_path, language='eng'):
    images = convert_from_path(file_path, dpi=300)
    text_pages = []
    for i, image in enumerate(images):
        # Enhance contrast for better OCR
        enhancer = ImageEnhance.Contrast(image)
        enhanced = enhancer.enhance(2.0)
        text = pytesseract.image_to_string(enhanced, lang=language)
        text_pages.append(f"--- Page {i+1} ---\n{text}")
    return "\n".join(text_pages)
# Layout-aware extraction (preserves position info)
import fitz

def extract_with_positions(file_path):
    doc = fitz.open(file_path)
    blocks = []
    for page_num, page in enumerate(doc):
        page_blocks = page.get_text("dict")["blocks"]
        for block in page_blocks:
            if block.get("type") == 0:  # text block
                for line in block["lines"]:
                    for span in line["spans"]:
                        blocks.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],
                            "font": span["font"],
                            "size": span["size"],
                        })
    doc.close()
    return blocks
Common Pitfalls
- Trying OCR on text-based PDFs — wastes 100x the CPU for worse results
- Using low DPI for OCR — Tesseract needs at least 200 DPI, ideally 300+
- Forgetting to specify language for OCR — defaults to English, gets non-Latin scripts wrong
- Not handling multi-column layouts — text comes out interleaved
- Memory issues on huge PDFs — process page by page, not all at once
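The multi-column pitfall can be mitigated with a naive column split. This sketch assumes the block dicts produced by the layout-aware extractor (anything with a `bbox` of `(x0, y0, x1, y1)`), and it hard-codes two columns split at the page midline, which is an assumption that will not hold for every layout:

```python
def two_column_order(blocks, page_width):
    """Naive fix for two-column layouts: read the left column fully,
    then the right column, each top-to-bottom. A plain top-to-bottom
    sort would interleave the columns row by row."""
    left = [b for b in blocks if b["bbox"][0] < page_width / 2]
    right = [b for b in blocks if b["bbox"][0] >= page_width / 2]
    by_top = lambda b: b["bbox"][1]  # sort by top y-coordinate
    return sorted(left, key=by_top) + sorted(right, key=by_top)
```

For arbitrary layouts you would cluster blocks by x-coordinate instead of splitting at the midline, but the same read-column-by-column idea applies.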
When NOT to Use This Skill
- When you control the source — get the original document instead of OCR'ing the PDF
- For one-off extractions — manual copy-paste is faster than coding
How to Verify It Worked
- Sample-test on PDFs with known content and verify extraction matches
- Check for common issues: missing characters, joined words, wrong order
- For OCR, measure character error rate (CER) on a labeled sample
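CER can be computed without extra dependencies: it is plain Levenshtein edit distance over characters, normalized by the reference length. A minimal sketch (the idea that anything above roughly 10% CER means your OCR settings need work is a rule of thumb, not a standard):

```python
def char_error_rate(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length.

    0.0 means a perfect match; 1.0 means every reference character
    needed an edit. Computed with the standard two-row DP.
    """
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, 1):
        curr = [i]
        for j, hc in enumerate(hypothesis, 1):
            curr.append(min(
                prev[j] + 1,               # delete from reference
                curr[j - 1] + 1,           # insert from hypothesis
                prev[j - 1] + (rc != hc),  # substitute (or match)
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Run it on a handful of manually transcribed pages; if the rate is high, revisit DPI, pre-processing, and the language model before scaling up.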
Production Considerations
- Cache extracted text by file hash — don't re-extract the same PDF
- Use a worker queue (Bull, Celery) for OCR — it's CPU-intensive
- Validate language detection on incoming PDFs to pick the right OCR model
- Set timeouts on extraction — corrupted PDFs can hang forever
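The hash-based cache can be sketched with nothing but the standard library. `CACHE_DIR`, the JSON-on-disk layout, and `extract_cached` are illustrative choices, not an established API; in production you would likely swap the directory for Redis or a database table keyed by the same content hash:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./pdf_text_cache")  # hypothetical cache location

def file_sha256(path):
    """Hash the file contents in 1 MB chunks so huge PDFs don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def extract_cached(path, extract_fn):
    """Return cached text if this exact file was seen before; otherwise
    run extract_fn (e.g. one of the extractors above) and store the
    result keyed by content hash, so renamed copies still hit the cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{file_sha256(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["text"]
    text = extract_fn(path)
    cache_file.write_text(json.dumps({"text": text}))
    return text
```

Because the key is a content hash rather than a filename, re-uploads and duplicates across users all resolve to one extraction.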