PDF Hacks

RubyPDF Releases pdf2htmlEX Windows Version

pdf2htmlEX is an open source tool that converts PDF to HTML without losing text or formatting. The source code has been available for a long time, but there was still no Windows port. Now rubypdf.com gives us a chance to use this tool on Windows: a static win32 build, consisting of a single exe and a few necessary resource files.

For details, please visit:

pdf2htmlEX Windows Version

pdf2htmlEX v0.9 Windows Version Release

 

btw, rubypdf.com has also released a Windows version of mktemp, a small tool for safe temporary file creation from shell scripts.

August 20, 2013 | Posted in PDF News, Software

PDFMiner: Python PDF parser and analyzer

PDFMiner is a suite of programs that help extract and analyze text data from PDF documents. Unlike other PDF-related tools, it can obtain the exact location of text on a page, as well as extra information such as fonts or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

Features:

  • Written entirely in Python. (for version 2.4 or newer)
  • PDF-1.7 specification support. (well, almost)
  • Non-ASCII languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Infers the running order of text using a clustering technique.

For downloads and details, please visit http://www.unixuser.org/~euske/python/pdfminer/index.html.

btw,

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py extracts text content from a PDF file. It extracts all text that is rendered programmatically; it cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document that does not have extraction permission.

For non-ASCII languages, you can specify the output encoding (such as UTF-8).
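
As a rough illustration of the options described above, here is a small Python sketch that assembles a pdf2txt.py command line. The flags used (-o for the output file, -t for the output type, -c for the output encoding, -P for the password) are taken from the pdf2txt.py usage text; check them against the version you have installed. The helper names build_pdf2txt_command and convert are made up for this example and are not part of PDFMiner.

```python
import subprocess

# Output types accepted by the -t flag, per the pdf2txt.py usage text
# (an assumption to verify against your installed version).
OUTPUT_TYPES = ("text", "html", "xml", "tag")


def build_pdf2txt_command(pdf_path, output_path, output_type="text",
                          codec="utf-8", password=None,
                          script="pdf2txt.py"):
    """Return the argument list for one pdf2txt.py invocation.

    This is a hypothetical helper for illustration only.
    """
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: %r" % (output_type,))
    command = [script, "-o", output_path, "-t", output_type, "-c", codec]
    if password is not None:
        # Only needed for password-protected documents.
        command += ["-P", password]
    command.append(pdf_path)
    return command


def convert(pdf_path, output_path, **options):
    """Run pdf2txt.py; raises CalledProcessError if the tool fails."""
    subprocess.check_call(
        build_pdf2txt_command(pdf_path, output_path, **options))
```

Calling convert("report.pdf", "report.html", output_type="html") would then run the tool, provided pdf2txt.py is on your PATH. On Windows you may need to pass the interpreter and full script path yourself instead of relying on the script default.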

dumppdf.py dumps the internal contents of a PDF file in a pseudo-XML format. This program is primarily for debugging purposes, but it can also extract some meaningful content (such as images).
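
For debugging sessions, dumppdf.py can be driven from Python in a similar way. The sketch below assumes the -a flag (dump all objects) and the -T flag (dump the outline, i.e. the table of contents) from the dumppdf.py usage text; confirm both against your installed version. The function names are hypothetical.

```python
import subprocess

# Mapping from a readable name to the dumppdf.py flag (assumed from the
# tool's usage text; verify before relying on it).
DUMP_FLAGS = {"all": "-a", "outline": "-T"}


def build_dumppdf_command(pdf_path, what="all", script="dumppdf.py"):
    """Return the argument list for one dumppdf.py invocation."""
    if what not in DUMP_FLAGS:
        raise ValueError("unknown dump target: %r" % (what,))
    return [script, DUMP_FLAGS[what], pdf_path]


def dump(pdf_path, what="all"):
    """Run dumppdf.py and return whatever it writes to stdout, as bytes."""
    return subprocess.check_output(build_dumppdf_command(pdf_path, what))
```

The returned bytes are the pseudo-XML dump, which you can save to a file or search while debugging.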

October 19, 2009 | Posted in Linux, Open Source, Software, Windows