Information Extraction — OCR and LLM combined for optimal performance

Summary

This project extracts key information from customer-provided PDFs or images and outputs a structured JSON object containing customer details, product descriptions, and claim history.

  • Inputs: Images or PDFs (scanned or born-digital)
  • Output: JSON object with the fields described below
  • Primary challenge: PDFs/images vary widely; some require OCR to access text

Why Text Extraction Sometimes Fails

PDFs and scanned documents can be complex. Here are the main reasons extraction can fail or degrade:

  1. PDF Structure Variability
    • Text-based PDFs: Contain selectable text with fonts and coordinates.
    • Image-based PDFs (scans): Contain pictures of text; there's no embedded text layer.
  2. No OCR Layer — If a PDF is image-based, standard text extraction libraries (e.g., PyMuPDF, pdfminer) won't find text. OCR (Optical Character Recognition) is required (e.g., Tesseract, EasyOCR).
  3. Mixed Content — Some PDFs contain both text and images. You may extract part of the content via text parsing and need OCR for the rest.
  4. Encoding & Fonts — Even text-based PDFs can be tricky if characters are represented as glyphs without correct encoding, leading to garbled output.
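One way to decide whether parsed text is usable or whether the page should go to OCR is a simple character-quality check. The sketch below is a heuristic under stated assumptions, not a tuned implementation: the function names and the thresholds (`min_chars=20`, `min_ratio=0.8`) are illustrative choices that would need calibrating on real documents.

```python
import unicodedata


def text_quality(text: str) -> float:
    """Fraction of non-whitespace characters that look like ordinary text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    readable = 0
    for c in chars:
        category = unicodedata.category(c)
        # Letters (L*), numbers (N*), punctuation (P*), currency (Sc) and
        # math symbols (Sm) count as readable. Control characters, private-use
        # code points and U+FFFD (category "So") do not.
        if category[0] in ("L", "N", "P") or category in ("Sc", "Sm"):
            readable += 1
    return readable / len(chars)


def is_low_quality(text: str, min_chars: int = 20, min_ratio: float = 0.8) -> bool:
    """Assumed heuristic: too little text, or too many unreadable characters."""
    non_space = sum(1 for c in text if not c.isspace())
    return non_space < min_chars or text_quality(text) < min_ratio
```

An empty string (typical of a scanned page with no text layer) and a string dominated by replacement characters (typical of broken font encoding) both come back as low quality, so both cases would be routed to OCR.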

Target Fields to Extract

The goal is to extract the following fields from a transfer certificate:

  • a_id (string or integer)
  • w_id (string)
  • created_at (ISO 8601 timestamp)
  • name (string)
  • additional_people (array of strings)
  • product_description (string)
  • history (array of claim objects — see below)
  • source_file_name (string)
  • is_document (boolean)
  • confidence (0.0–1.0)
  • errors (array of strings; optional)
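The field list above can be checked mechanically before a record is emitted. The following is a minimal sketch using only the standard library; `REQUIRED_FIELDS` and `validate_record` are hypothetical names introduced here, and the type rules are read directly from the list (with `errors` treated as optional).

```python
from datetime import datetime
from typing import List

# Expected Python types per field, taken from the field list above.
REQUIRED_FIELDS = {
    "a_id": (str, int),
    "w_id": (str,),
    "created_at": (str,),
    "name": (str,),
    "additional_people": (list,),
    "product_description": (str,),
    "history": (list,),
    "source_file_name": (str,),
    "is_document": (bool,),
    "confidence": (int, float),
}


def validate_record(record: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, types in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int, so reject it unless bool is expected.
        if (isinstance(value, bool) and bool not in types) or not isinstance(value, types):
            problems.append(f"wrong type for {field}: {type(value).__name__}")

    created = record.get("created_at")
    if isinstance(created, str):
        try:
            datetime.fromisoformat(created)
        except ValueError:
            problems.append("created_at is not an ISO 8601 timestamp")

    conf = record.get("confidence")
    if isinstance(conf, (int, float)) and not isinstance(conf, bool):
        if not 0.0 <= conf <= 1.0:
            problems.append("confidence outside 0.0-1.0")

    errors = record.get("errors", [])
    if not isinstance(errors, list) or not all(isinstance(e, str) for e in errors):
        problems.append("errors must be an array of strings")
    return problems
```

Any problems returned here could be appended to the record's own `errors` array rather than raised, so that a partially extracted document still produces output.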

Example (Improved JSON)

{
  "a_id": 1,
  "w_id": "1a",
  "created_at": "2025-11-06T13:58:04.604+11:00",
  "name": "Anna",
  "additional_people": ["Sebastian", "Anthony"],
  "product_description": "Product A",
  "history": [
    {
      "person_name": "Anna",
      "service_type": "OTHER",
      "amount": 25.00,
      "currency": "AUD"
    },
    {
      "person_name": "Anthony",
      "service_type": "GENERAL",
      "amount": 432.00,
      "currency": "AUD"
    }
  ],
  "source_file_name": "Anna.pdf",
  "is_document": true,
  "confidence": 0.88,
  "errors": []
}

System Architecture (High-Level)

  1. Ingestion

    • Accept uploaded PDFs or images.
    • Normalize file names and store metadata (size, page count, format).
  2. Pre-processing

    • Detect and correct orientation (OSD or heuristic rotation).
    • Optional: de-skew, denoise, adjust contrast; ensure DPI ~200–300 for OCR.
  3. Text Extraction

    • Text-based PDFs: Extract text with a PDF parser (e.g., PyMuPDF).
    • Image-based PDFs or images: Use OCR (e.g., Tesseract, EasyOCR).
    • Mixed PDFs: Combine both approaches, deciding per page whether to parse the text layer or run OCR.
  4. Document Classification

    • Confirm whether the file is a document of interest (here, a transfer certificate), using rule-based checks plus ML/LLM text classification.
  5. Information Extraction

    • Extract name, additional_people, product_description, history.
    • Combine pattern-based extraction (regex, keywords) with an LLM for robustness.
  6. Post-processing & Validation

    • Normalize date formats, currency, names; remove duplicates.
    • Compute a confidence score; attach errors if fields are missing.
  7. Output & Persistence

    • Emit the JSON schema shown above.
    • Persist to a database (optional), or return via API.
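To make the pattern-based half of steps 5 and 6 concrete, here is a minimal sketch of extracting and normalizing `history` entries with a regular expression. The line layout it expects (`person | service | CUR amount`) is purely an assumption for illustration; real certificates will need their own patterns, and lines the pattern misses would be candidates for the LLM pass, which is not shown.

```python
import re
from typing import Dict, List

# Assumed layout, one claim per line, e.g. "Anna | Other | AUD 25.00".
CLAIM_LINE = re.compile(
    r"^\s*(?P<person>[A-Za-z][A-Za-z .'-]*?)\s*\|\s*"
    r"(?P<service>[A-Za-z ]+?)\s*\|\s*"
    r"(?P<currency>[A-Z]{3})\s*(?P<amount>\d[\d,]*(?:\.\d{1,2})?)\s*$"
)


def extract_claims(raw_text: str) -> List[Dict]:
    """Parse claim lines, normalize them, and drop exact duplicates."""
    claims = []
    seen = set()
    for line in raw_text.splitlines():
        match = CLAIM_LINE.match(line)
        if not match:
            continue  # not a claim line under the assumed layout
        claim = {
            "person_name": match.group("person").strip(),
            # Normalize service type: trimmed and upper-cased.
            "service_type": match.group("service").strip().upper(),
            # Remove thousands separators before converting.
            "amount": float(match.group("amount").replace(",", "")),
            "currency": match.group("currency"),
        }
        key = tuple(claim.values())
        if key not in seen:  # OCR of multi-page scans can repeat rows
            seen.add(key)
            claims.append(claim)
    return claims
```

Normalizing inside the extractor keeps stray whitespace and mixed casing out of values such as `service_type`, and de-duplicating on the normalized tuple covers the "remove duplicates" part of step 6.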

Reference Pipeline (Pseudocode)

def extract_certificate(file_path: str) -> dict:
    # 1) Ingestion
    meta = get_file_metadata(file_path)

    # 2) Pre-processing and text extraction
    if is_image(file_path):
        img = load_image(file_path)
        angle = detect_orientation(img)  # OSD or heuristic
        img = rotate_if_needed(img, angle)
        raw_text = run_ocr(img)
    elif is_pdf(file_path):
        text = extract_pdf_text(file_path)  # PyMuPDF
        if not text or is_low_quality(text):
            # No usable text layer: rasterize the pages and fall back to OCR
            pages = rasterize_pdf_to_images(file_path, dpi=200)
            raw_text = " ".join(run_ocr(p) for p in pages)
        else:
            raw_text = text
    else:
        return error_json("Unsupported file format")

    # 3) Classification
    is_tc, tc_conf = classify_certificate(raw_text)
    if not is_tc:
        return {
            "is_document": False,
            "confidence": tc_conf,
            "errors": ["This is not a certificate"]
        }

    # 4) Information extraction (hybrid approach)
    fields = {
        "name": extract_name(raw_text),
        "additional_people": extract_dependants(raw_text),
        "product_description": extract_product_description(raw_text),
        "history": extract_claims(raw_text)
    }

    # 5) Validation & normalization
    fields = normalize_fields(fields)
    conf = compute_confidence(fields, tc_conf)

    # 6) Assemble output (keys match the target JSON schema)
    return {
        "a_id": meta.get("attachment_id"),
        "w_id": meta.get("work_id"),
        "created_at": meta.get("created_at_iso"),
        "name": fields["name"],
        "additional_people": fields["additional_people"],
        "product_description": fields["product_description"],
        "history": fields["history"],
        "source_file_name": meta.get("file_name"),
        "is_document": True,
        "confidence": conf,
        "errors": fields.get("errors", [])
    }
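The pipeline leaves `compute_confidence` undefined. One possible scheme, shown here only as a sketch, scales the classifier confidence by the share of key fields that were actually found. The choice of key fields, the multiplicative formula, and the helper `missing_field_errors` are all assumptions rather than the project's actual scoring.

```python
from typing import List

# Assumption: these fields must be non-empty for a complete extraction.
# additional_people is excluded because an empty list can be legitimate.
KEY_FIELDS = ("name", "product_description", "history")


def missing_field_errors(fields: dict) -> List[str]:
    """Build the error strings to attach when key fields are empty."""
    return [f"missing field: {k}" for k in KEY_FIELDS if not fields.get(k)]


def compute_confidence(fields: dict, classifier_conf: float) -> float:
    """Scale classifier confidence by field completeness, clamped to [0, 1]."""
    found = sum(1 for k in KEY_FIELDS if fields.get(k))
    completeness = found / len(KEY_FIELDS)
    score = classifier_conf * completeness
    return round(min(max(score, 0.0), 1.0), 2)
```

With this scheme a confidently classified certificate that yields no name still scores lower than a complete one, which makes the `confidence` value usable as a threshold for manual review.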