Japfa IDP

Period: Jul 2025 – Dec 2025
Role: AI Engineer Intern
Client: PT. Japfa Comfeed Indonesia

Problem

The logistics and finance teams at Japfa process thousands of invoices, bills of lading, and receipts every month. Documents arrive as scanned PDFs or phone photos, in mixed formats, sometimes crooked, sometimes with handwritten annotations. Manual data entry was the bottleneck: it was slow and error-prone, and the team couldn't scale it to match the document volume.

We needed a pipeline that could take a raw document image and return structured data the team could paste directly into their existing spreadsheets and ERP system. Table accuracy was non-negotiable.

Approach

We built a 4-stage pipeline, modularized into three services so each stage could be tested and improved independently:

Pre-processing (Python + OpenCV): deskew, denoise, and contrast-normalize the image. We found that even a 50 ms OpenCV pass improved downstream OCR accuracy by 8-12% on phone-photo documents.
OCR + structure extraction (Tesseract for fast text, Docling for complex tables): the document is first OCR'd for raw text, then Docling extracts the document structure: table boundaries, line items, and headers. Docling's table model was the breakthrough on the table accuracy problem.
Field extraction (rule-based + fuzzy matching): for each document type, we defined the fields to extract (invoice number, date, total, line items) and a set of regex + fuzzy-search rules to find them. We deliberately avoided LLMs for field extraction because they were 10x slower and not measurably more accurate on a well-defined schema.
API + UI (FastAPI + a small internal web app): the pipeline is exposed as a REST API; the web app batches documents and shows the extracted fields for human review before they hit the ERP.
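
The field-extraction stage above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual rule set: the label variants, regex patterns, the 0.8 threshold, and the "Label: value" line layout are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rules for one document type (invoice): accepted label spellings
# plus a regex describing what a valid value looks like.
FIELD_RULES = {
    "invoice_number": {
        "labels": ["invoice no", "invoice number"],
        "pattern": re.compile(r"[A-Z]{2,5}-\d{4}-[A-Z0-9]{3,}"),
    },
    "date": {
        "labels": ["date", "invoice date"],
        "pattern": re.compile(r"\d{4}-\d{2}-\d{2}"),
    },
    "total": {
        "labels": ["total", "grand total"],
        "pattern": re.compile(r"\d[\d.,]*\d|\d"),
    },
}


def label_score(text: str, label: str) -> float:
    """Similarity in [0, 1] between an OCR'd label and a known label."""
    return SequenceMatcher(None, text.lower().strip(), label).ratio()


def extract_fields(lines, threshold=0.8):
    """Extract fields from OCR text lines shaped like 'Label: value'.

    A label is accepted when it is close enough to a known spelling,
    and the value is then pulled out with the field's regex.
    The first accepted match wins for each field.
    """
    fields = {}
    for line in lines:
        if ":" not in line:
            continue  # not a 'Label: value' line
        raw_label, _, raw_value = line.partition(":")
        for name, rule in FIELD_RULES.items():
            if name in fields:
                continue
            best = max(label_score(raw_label, known) for known in rule["labels"])
            if best < threshold:
                continue
            match = rule["pattern"].search(raw_value)
            if match:
                fields[name] = match.group(0)
    return fields


ocr_lines = [
    "Invoce No: INV-2024-001",   # label misread by OCR
    "Date : 2024-03-15",
    "Grand Tota1: 1,250,000.00",  # 'l' read as '1'
    "Thank you",
]
fields = extract_fields(ocr_lines)
```

Matching the label fuzzily and the value strictly keeps the rules tolerant of OCR noise in the captions while still rejecting values that do not fit the schema.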

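from docling.document_converter import DocumentConverter
from mlflow import log_artifact, start_run

def extract_invoice(image_path: str) -> dict:
    """Convert one document image with Docling and return its structure as a dict."""
    converter = DocumentConverter()
    result = converter.convert(image_path)

    # Store the input file with the run so an extraction can be traced to its source.
    # artifact_path is a directory inside the run's artifact store, not a file name.
    with start_run(run_name="invoice-extract"):
        log_artifact(image_path, artifact_path="inputs")

    return result.document.export_to_dict()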

We wired in fuzzy search for OCR typo tolerance (so a misread "INV-2024-O01" still matches the real "INV-2024-001"), anchor-based positioning for tables (so we could extract a line item without cropping), and document grouping so a multi-page PDF produces one record, not many.
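
A minimal sketch of the typo-tolerant matching and page grouping described above, assuming a list of known invoice IDs to match against. The confusable-character table, the 0.9 threshold, and the `source`, `page`, and `line_items` keys are hypothetical choices for this example, not the project's actual implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Characters OCR commonly confuses, mapped to one canonical form.
# Illustrative table only; a real one would be tuned on observed errors.
OCR_CONFUSABLES = str.maketrans({"O": "0", "I": "1", "L": "1", "S": "5", "B": "8"})


def normalize(code: str) -> str:
    """Uppercase the code and collapse look-alike characters."""
    return code.upper().translate(OCR_CONFUSABLES)


def match_known_id(ocr_text, known_ids, min_ratio=0.9):
    """Return the known ID closest to the OCR'd text, or None if none is close enough."""
    target = normalize(ocr_text)
    best_id, best_ratio = None, 0.0
    for known in known_ids:
        ratio = SequenceMatcher(None, target, normalize(known)).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = known, ratio
    return best_id if best_ratio >= min_ratio else None


def group_pages(pages):
    """Merge per-page results that share a source file into one record."""
    grouped = defaultdict(lambda: {"pages": 0, "line_items": []})
    for page in sorted(pages, key=lambda p: (p["source"], p["page"])):
        record = grouped[page["source"]]
        record["pages"] += 1
        record["line_items"].extend(page["line_items"])
    return dict(grouped)


known_ids = ["INV-2024-001", "INV-2024-017", "INV-2023-001"]
matched = match_known_id("INV-2024-O01", known_ids)  # letter O read in place of zero

records = group_pages([
    {"source": "scan_a.pdf", "page": 2, "line_items": ["feed 50kg"]},
    {"source": "scan_a.pdf", "page": 1, "line_items": ["corn 25kg"]},
    {"source": "scan_b.pdf", "page": 1, "line_items": ["premix 5kg"]},
])
```

Normalizing look-alike characters on both sides before comparing means the most common OCR substitutions cost nothing, so the similarity threshold can stay strict for everything else.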

Outcome

What I learned:

Production ML systems are 80% data engineering and 20% modeling. MLflow logging, document grouping, fuzzy search, and deskewing made up 80% of the work and 100% of what made the system actually usable.
The right level of modularization is the one your team can maintain. We split into 3 services because the team had 3 people who needed to work in parallel. For a 1-person team, 1 service would have been right.
Hand off early and often. I wrote the user manual and architecture flowchart before the final demo, not after. The next intern batch was up to speed in a week.