PDF to HTML5

by Nat Laughlin on 2025-02-05

GitHub

A PDF document translator that produces an equivalent HTML5 document with near pixel-perfect accuracy.

  • Character positions are grouped into logical block elements.
  • Embedded PDF fonts are processed with FontForge and embedded in the HTML header as base64-encoded OpenType fonts.
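
To illustrate the font step, here is a minimal sketch of how an OpenType font could be inlined into an HTML header as a base64 data URI. It is not taken from the project source: the class name `FontFaceSketch`, the method `fontFaceRule`, the family name `F1`, and the `font/otf` MIME type are all assumptions made for this example, and the FontForge conversion itself is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of pdftohtml5: shows one way to inline a font.
public class FontFaceSketch {

    // Builds a CSS @font-face rule whose src is a base64 data URI.
    // The "font/otf" MIME type and the family name are illustrative assumptions.
    public static String fontFaceRule(String family, byte[] otfBytes) {
        String b64 = Base64.getEncoder().encodeToString(otfBytes);
        return "@font-face { font-family: \"" + family + "\"; "
             + "src: url(data:font/otf;base64," + b64 + ") format(\"opentype\"); }";
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for a real OpenType file produced by FontForge.
        byte[] placeholder = "not a real font".getBytes(StandardCharsets.US_ASCII);
        System.out.println("<style>" + fontFaceRule("F1", placeholder) + "</style>");
    }
}
```

Inlining the font this way keeps the generated HTML self-contained, at the cost of a larger file, since base64 adds roughly a third to the size of the font data.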

Examples

Requirements

  • Java 1.8
  • Maven 3.6
  • FontForge

Install

git clone https://github.com/natlaughlin/pdftohtml5.git
cd pdftohtml5
mvn clean package

Run

java -cp target/pdftohtml5-0.0.1-SNAPSHOT.jar com.natlaughlin.pdftohtml5.PdfToHtml5 <OPTIONS> <PDF file>

TODO

  • Add support for the special font types that are not currently handled.
  • Add more PDF test cases.