PDF Text Extraction
Extract text from PDFs (including scanned ones via OCR) for indexing or analysis
You are the #1 PDF processing engineer from Silicon Valley — the consultant that legal tech and document AI startups hire when they need to extract text from millions of PDFs reliably. You know exactly when to use pdf-parse vs PyMuPDF vs OCR. The user wants to extract text from PDF files.
What to check first
- Determine if the PDF is text-based or scanned (image) — affects which library to use
- Check if the PDF has special formatting (tables, columns, forms) you need to preserve
- Identify the language for OCR if needed — affects accuracy significantly
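The text-vs-scanned check above can be run before picking a library. A minimal sketch: `looks_scanned` and the 25-chars-per-page threshold are illustrative choices, not from any library, so tune the cutoff on your own documents (feed it the per-page strings from PyMuPDF's `page.get_text()` or pdf-parse's output split by page):

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Classify a PDF as likely scanned if its pages yield almost no text.

    page_texts: list of strings, one per page (e.g. from PyMuPDF's
    page.get_text()). The 25-chars-per-page threshold is a guess;
    tune it on your own corpus.
    """
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page
```

If this returns True, route the file to the OCR path instead of wasting a pass on the text extractor.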
Steps
- Try pdf-parse (Node) or PyMuPDF (Python) for text-based PDFs first — fastest
- If text comes back empty or garbled, the PDF is scanned — fall back to OCR
- For OCR, use Tesseract via tesseract.js (Node) or pytesseract (Python)
- Pre-process scanned images: deskew, denoise, increase contrast for better OCR
- For tables, use pdfplumber (Python) — preserves cell structure
- Validate extraction quality on a sample before processing thousands
Code
// Node.js — text-based PDF
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return {
    text: data.text,
    numPages: data.numpages,
    info: data.info,
  };
}

const result = await extractText('document.pdf');
console.log(result.text);

// Detecting if extraction failed (likely scanned PDF)
function isScanned(extractedText, fileSize) {
  const charsPerKB = extractedText.length / (fileSize / 1024);
  return charsPerKB < 10; // Heuristic: text-based PDFs yield > 10 chars per KB
}

// Fallback to OCR for scanned PDFs
import { createWorker } from 'tesseract.js';
import { fromPath } from 'pdf2pic';

async function ocrPdf(filePath) {
  // Convert PDF pages to images
  const converter = fromPath(filePath, {
    density: 300,
    saveFilename: 'page',
    savePath: './tmp',
    format: 'png',
    width: 2000,
  });
  const numPages = (await converter.bulk(-1)).length;
  const worker = await createWorker('eng');
  let fullText = '';
  for (let i = 1; i <= numPages; i++) {
    const imagePath = `./tmp/page.${i}.png`;
    const { data: { text } } = await worker.recognize(imagePath);
    fullText += `\n--- Page ${i} ---\n${text}`;
  }
  await worker.terminate();
  return fullText;
}
# Python with PyMuPDF (much faster for text PDFs)
import fitz  # PyMuPDF

def extract_text(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page in doc:
        pages.append(page.get_text())
    doc.close()
    return "\n".join(pages)
# Python with pdfplumber for tables
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables
# Python OCR fallback
import pytesseract
from pdf2image import convert_from_path
from PIL import Image, ImageEnhance

def ocr_pdf(file_path, language='eng'):
    images = convert_from_path(file_path, dpi=300)
    text_pages = []
    for i, image in enumerate(images):
        # Enhance contrast for better OCR
        enhancer = ImageEnhance.Contrast(image)
        enhanced = enhancer.enhance(2.0)
        text = pytesseract.image_to_string(enhanced, lang=language)
        text_pages.append(f"--- Page {i+1} ---\n{text}")
    return "\n".join(text_pages)
# Layout-aware extraction (preserves position info)
import fitz

def extract_with_positions(file_path):
    doc = fitz.open(file_path)
    blocks = []
    for page_num, page in enumerate(doc):
        page_blocks = page.get_text("dict")["blocks"]
        for block in page_blocks:
            if block.get("type") == 0:  # text block
                for line in block["lines"]:
                    for span in line["spans"]:
                        blocks.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],
                            "font": span["font"],
                            "size": span["size"],
                        })
    doc.close()
    return blocks
Common Pitfalls
- Trying OCR on text-based PDFs — wastes 100x the CPU for worse results
- Using low DPI for OCR — Tesseract needs at least 200 DPI, ideally 300+
- Forgetting to specify language for OCR — defaults to English, gets non-Latin scripts wrong
- Not handling multi-column layouts — text comes out interleaved
- Memory issues on huge PDFs — process page by page, not all at once
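The multi-column pitfall can be mitigated with a naive column split. This sketch assumes the block dicts produced by the layout-aware extractor (anything with a `bbox` of `(x0, y0, x1, y1)`), and it hard-codes two columns split at the page midline, which is an assumption that will not hold for every layout:

```python
def two_column_order(blocks, page_width):
    """Naive fix for two-column layouts: read the left column fully,
    then the right column, each top-to-bottom. A plain top-to-bottom
    sort would interleave the columns row by row."""
    left = [b for b in blocks if b["bbox"][0] < page_width / 2]
    right = [b for b in blocks if b["bbox"][0] >= page_width / 2]
    by_top = lambda b: b["bbox"][1]  # sort by top y-coordinate
    return sorted(left, key=by_top) + sorted(right, key=by_top)
```

For arbitrary layouts you would cluster blocks by x-coordinate instead of splitting at the midline, but the same read-column-by-column idea applies.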
When NOT to Use This Skill
- When you control the source — get the original document instead of OCR'ing the PDF
- For one-off extractions — manual copy-paste is faster than coding
How to Verify It Worked
- Sample-test on PDFs with known content and verify extraction matches
- Check for common issues: missing characters, joined words, wrong order
- For OCR, measure character error rate (CER) on a labeled sample
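CER can be computed without extra dependencies: it is plain Levenshtein edit distance over characters, normalized by the reference length. A minimal sketch (the idea that anything above roughly 10% CER means your OCR settings need work is a rule of thumb, not a standard):

```python
def char_error_rate(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length.

    0.0 means a perfect match; 1.0 means every reference character
    needed an edit. Computed with the standard two-row DP.
    """
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, 1):
        curr = [i]
        for j, hc in enumerate(hypothesis, 1):
            curr.append(min(
                prev[j] + 1,               # delete from reference
                curr[j - 1] + 1,           # insert from hypothesis
                prev[j - 1] + (rc != hc),  # substitute (or match)
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Run it on a handful of manually transcribed pages; if the rate is high, revisit DPI, pre-processing, and the language model before scaling up.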
Production Considerations
- Cache extracted text by file hash — don't re-extract the same PDF
- Use a worker queue (Bull, Celery) for OCR — it's CPU-intensive
- Validate language detection on incoming PDFs to pick the right OCR model
- Set timeouts on extraction — corrupted PDFs can hang forever
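The hash-based cache can be sketched with nothing but the standard library. `CACHE_DIR`, the JSON-on-disk layout, and `extract_cached` are illustrative choices, not an established API; in production you would likely swap the directory for Redis or a database table keyed by the same content hash:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./pdf_text_cache")  # hypothetical cache location

def file_sha256(path):
    """Hash the file contents in 1 MB chunks so huge PDFs don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def extract_cached(path, extract_fn):
    """Return cached text if this exact file was seen before; otherwise
    run extract_fn (e.g. one of the extractors above) and store the
    result keyed by content hash, so renamed copies still hit the cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{file_sha256(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["text"]
    text = extract_fn(path)
    cache_file.write_text(json.dumps({"text": text}))
    return text
```

Because the key is a content hash rather than a filename, re-uploads and duplicates across users all resolve to one extraction.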