rubypdf

PDFMiner-Python PDF parser and analyzer

PDFMiner is a suite of programs that help extract and analyze text data from PDF documents. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as extra information such as font details or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features:

Written entirely in Python. (for version 2.4 or newer)
PDF-1.7 specification support. (well, almost)
Non-ASCII languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Text flow inference using a clustering technique.

For downloads and details, please visit http://www.unixuser.org/~euske/python/pdfminer/index.html.

btw,

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py extracts text contents from a PDF file. It extracts all the text that is rendered programmatically; it cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document that does not have extraction permission.

For non-ASCII languages, you can specify the output encoding (such as UTF-8).

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it’s also possible to extract some meaningful contents (such as images).

October 19, 2009 Posted by rubypdf | Linux, Open Source, Software, Windows | dumpdf, pdf parser and analyzer, pdf2html, pdf2text, pdf2txt, pdftohtml, pdftotext, pdftotxt, Python, toc | Leave a comment

Some PDF Tools developed in Python

While searching WordPress.com, I noticed an article, PDF Tools, that introduces some small PDF tools, all developed in Python.

pdf-parser.py

This tool will parse a PDF document to identify the fundamental elements used in the analyzed file. It will not render a PDF document. The code of the parser is quick-and-dirty; I’m not recommending this as a textbook case for PDF parsers, but it gets the job done.

You can see the parser in action in this screencast.

The stats option displays statistics of the objects found in the PDF document. Use this to identify PDF documents with unusual/unexpected objects, or to classify PDF documents. For example, I generated statistics for 2 malicious PDF files, and although they were very different in content and size, the statistics were identical, proving that they used the same attack vector and shared the same origin.

The search option searches for a string in indirect objects (not inside the stream of indirect objects). The search is not case-sensitive, and is susceptible to the obfuscation techniques I documented (as I’ve yet to encounter these obfuscation techniques in the wild, I decided not to resort to canonicalization).

The filter option applies the filter(s) to the stream. For the moment, only FlateDecode is supported (i.e. zlib decompression).
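To give a rough idea of what applying a FlateDecode filter involves, here is a minimal sketch using Python's standard zlib module. It is not taken from pdf-parser.py; the function name flate_decode and the sample stream content are made up for the example.

```python
import zlib


def flate_decode(stream_bytes):
    """Inflate the raw bytes of a stream that uses the /FlateDecode filter."""
    return zlib.decompress(stream_bytes)


# Hypothetical stream content, compressed here only so there is something to decode.
sample = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
print(flate_decode(sample))
```

A real stream may also declare predictor parameters or chain several filters, which this sketch does not handle.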

The raw option makes pdf-parser output raw data (i.e. not the printable Python representation).

The object option outputs the data of the indirect object whose ID was specified. This ID is not version dependent. If more than one object has the same ID (disregarding the version), all of these objects will be output.

The reference option allows you to select all objects referencing the specified indirect object. This ID is not version dependent.

The type option allows you to select all objects of a given type. The type is a Name and as such is case-sensitive and must start with a slash character (/).

Download:

pdf-parser_V0_3_1.zip (https)

MD5: 07CDA54844CD6567473CBF2B0DFC601C

SHA256: 7614AEC453502EEF43F9EA04A82092C4ACDD32AB86D1C4D744B7B590C74152EC
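The published hashes can be checked after downloading. The following is a minimal sketch using Python's standard hashlib module; the helper names file_digest and matches are made up for the example, and it assumes the archive was saved under the file name listed above.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the uppercase hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()


def matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published value, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().upper()
```

For example, matches("pdf-parser_V0_3_1.zip", "7614AEC453502EEF43F9EA04A82092C4ACDD32AB86D1C4D744B7B590C74152EC") should return True for an intact download. Pass "md5" as the third argument to check the MD5 value instead, although SHA256 is the stronger check.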

make-pdf tools
make-pdf-javascript.py allows one to create a simple PDF document with embedded JavaScript that will execute upon opening of the PDF document. It’s essentially glue-code for the mPDF.py module which contains a class with methods to create headers, indirect objects, stream objects, trailers and XREFs.

If you execute it without options, it will generate a PDF document with JavaScript to display a message box (calling app.alert).

To provide your own JavaScript, use option --javascript for a script on the command line, or --javascriptfile for a script contained in a file.

Download:

make-pdf_V0_1_1.zip (https)

MD5: 9AF2E343B78553021C989E8E22355531

SHA256: C604679ABEB0469C1463159E02E74F12487B2755A6096B416A8F4F638DEB8AA9

pdfid.py
This tool is not a PDF parser, but it will scan a file to look for certain PDF keywords, allowing you to identify PDF documents that contain (for example) JavaScript or execute an action when opened. PDFiD will also handle name obfuscation.

The idea is to use this tool first to triage PDF documents, and then analyze the suspicious ones with my pdf-parser.

An important design criterion for this program is simplicity. Parsing a PDF document completely requires a very complex program, and hence it is bound to contain many (security) bugs. To avoid the risk of getting exploited, I decided to keep this program very simple (it is even simpler than pdf-parser.py).

PDFiD will scan a PDF document for a given list of strings and count the occurrences (total and obfuscated) of each word:

obj
endobj
stream
endstream
xref
trailer
startxref
/Page
/Encrypt
/ObjStm
/JS
/JavaScript
/AA
/OpenAction
/JBIG2Decode

Almost every PDF document will contain the first 7 words (obj through startxref), and to a lesser extent stream and endstream. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document).

/Page gives an indication of the number of pages in the PDF document. Most malicious PDF documents have only one page.

/Encrypt indicates that the PDF document has DRM or needs a password to be read.

/ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).

/JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intent.

/AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.

The combination of automatic action and JavaScript makes a PDF document very suspicious.

/JBIG2Decode indicates whether the PDF document uses JBIG2 compression. This is not necessarily an indication of a malicious PDF document, but requires further investigation.

A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hexcodes, e.g. /JBIG#32Decode).
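The hexcode obfuscation mentioned above can be undone by replacing each #xx escape with the character it encodes. The sketch below is only an illustration and not PDFiD's own code; the function names decode_pdf_name and is_obfuscated are made up for the example.

```python
import re

# A '#' followed by exactly two hex digits encodes one character of a PDF name.
_HEX_ESCAPE = re.compile(r"#([0-9A-Fa-f]{2})")


def decode_pdf_name(name):
    """Replace each #xx escape in a PDF name with the character it encodes."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def is_obfuscated(name):
    """True when the name contains at least one #xx escape."""
    return _HEX_ESCAPE.search(name) is not None


print(decode_pdf_name("/JBIG#32Decode"))  # /JBIG2Decode
```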

BTW, all the counters can be skewed if the PDF document is saved with incremental updates.

Because PDFiD is just a string scanner (supporting name obfuscation), it will also generate false positives. For example, a simple text file starting with %PDF-1.1 and containing words from the list will also be identified as a PDF document.
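To see why a plain string scanner produces such false positives, consider this simplified sketch of a keyword counter. It is not PDFiD's implementation: it ignores name obfuscation and the non-name keywords (obj, stream, and so on), and the names KEYWORDS, NAME_TOKEN and count_keywords are made up for the example.

```python
import re
from collections import Counter

KEYWORDS = {"/Page", "/Encrypt", "/ObjStm", "/JS", "/JavaScript",
            "/AA", "/OpenAction", "/JBIG2Decode"}

# A name token: a slash followed by characters up to whitespace or a delimiter.
NAME_TOKEN = re.compile(rb"/[^\s/<>\[\](){}%]+")


def count_keywords(data):
    """Count how often each keyword appears as a whole name token in data."""
    counts = Counter()
    for match in NAME_TOKEN.finditer(data):
        name = match.group(0).decode("latin-1")
        if name in KEYWORDS:
            counts[name] += 1
    return counts


# A plain text file, not a real PDF, still yields keyword hits.
text_file = b"%PDF-1.1\nnotes: /OpenAction and /JS are worth checking, /JS twice\n"
print(count_keywords(text_file))
```

Matching whole tokens rather than substrings keeps /Pages from being counted as /Page, but the scanner still has no way of knowing whether the bytes around a keyword form a valid PDF structure.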

Download:

pdfid_v0_0_9.zip (https)

MD5: 1C731D6204C09AAFF219876A8FB5E834

SHA256: 24A9B16E67A84E85488A16879CB611128B2E5921044E48EFB60D784BD785CBD0

October 19, 2009 Posted by rubypdf | Linux, Open Source, Software, Tutorials, Windows | adobe pdf, PDF Parser, pdf tools, Python | Leave a comment

How to Freely Convert PDF online

PDF (Portable Document Format) is a format, originally proprietary, for transferring designs across multiple computer platforms. PDF is a universal electronic file format, modeled after the PostScript language, and is device- and resolution-independent. Documents in PDF format can be viewed, navigated, and printed from any computer regardless of the fonts or software used to create the original.

Now that almost all of us use PDF files, it’s sometimes easier to use an online PDF converter so that we don’t need to install any program. Just connect to the net and search for a free PDF utility. Here are some of them.

PDFTextOnline – Extracts text from a PDF and makes the text copyable.
ShowPDF – A PDF-to-HTML converter.
FreePDFConvert – Convert MS Office, image, web page, and vector graphics files to PDF, or convert PDF to Word (doc) or Excel (xls) documents, and extract images from PDF.
Document Converter eXPress – Convert files to PDF or image without the need to install special software.
Web2PDF Online – A free HTML-to-PDF conversion service for your websites that allows your visitors to quickly save useful information in your blogs and websites to PDF files.
Lettos – DOC to ODT and PDF, ODT to PDF and DOC, PDF to TXT.
PDFIt – A Firefox extension that allows you to convert any page into a PDF through an online service provided by Touchpdf.com.
htm2PDF – A service to convert your web pages to PDF.
Zamzar – A free online file conversion service that can convert PDF to many document formats.
PDFOnline – Convert documents to PDF, PDF service for iPhone, web to PDF and PDF to Word.
KoolWire – Just send your documents to pdf@koolwire.com, then you will receive a converted file in PDF format.
HTML2PDF.BIZ – Web service and API that converts your website into PDF.
ExpressPDF – An online service that lets you convert your Microsoft Office documents to PDF.
Online PDF Converter – Convert your PDF file to text or image (JPEG, PNG, GIF, TIFF) absolutely free.
PDFText – Converts your PDF (Acrobat) file to plain text.
RSS2PDF – Free online RSS, Atom or OPML to PDF generator.
PDF-o-matic – A simple PHP script that uses HTMLDOC to convert the web page of your choice.
LOOP – A free web-based service that allows you to convert and combine files to PDF.
FeedJournal – Converts RSS and Atom feeds into a PDF newspaper.
BookletCreator – A free online tool that lets you create a booklet from a PDF document.

–from 20 PDF Online Tool Converter for FREE

October 19, 2009 Posted by rubypdf | Hacks, Software, Tutorials | adobe pdf, pdf, PDF Converter | 2 Comments

Some Adobe Acrobat Tutorials and Videos

The Acrobat User Community is the perfect way to learn more about the latest features, meet other users, and share ideas with other members and Acrobat experts. Our goal is to provide the type of educational resources and user-to-user support that appeal to Acrobat users of all levels and professional backgrounds.

Learn how to work within your PDF documents to implement simple changes—without having to edit the original source file—with our ‘how to’ tutorials and videos.

Tutorials

Acrobat 9 to the Rescue
by Donna Baker
Understanding Acrobat’s Optimizer
by Duff Johnson
Cleaning up your PDF Documents
by Donna Baker
Optimizing a PDF Document
by Donna Baker
How to Rearrange Pages in a PDF Document
by Adobe Systems
Optimize PDF files with better results
by James Dempsey
How do I copy-and-paste editable text from PDF to Word?
by Lori DeFurio
Sharing PDF files online
by Donna Baker
Using Bookmarks for Navigating Documents
by Ted Padova

Videos

Adding Watermarks to your PDF Documents
by Janet Frick
Bookmark Basics: Part 1
by Geoff Blake
Bookmark Basics: Part 2
by Geoff Blake
Adding Headers/Footers to your PDF Documents
by Tim Plummer
Two methods to replace pages in a PDF file
by Rick Borstein
Optimizing your PDF Document
by Ian Campbell

October 19, 2009 Posted by rubypdf | Tutorials | Acrobat Tutorials, Acrobat Videos, Adobe Acrobat, adobe pdf, pdf | Leave a comment

An Easy and Free Way to Download Books From Google

Google Book Downloader is a small utility (developed in .NET) that lets you save books from Google as PDFs on your local filesystem. Its features include:

Download any book from Google Books marked as ‘Full view’
Partially download any book from Google Books marked as ‘Limited preview’
Access to any book available only for US citizens (instructions)
Searching for hidden pages (not indexed by Google Books)

The Google Book Downloader application allows users to enter a book’s ISBN or Google link to pull up the desired book and begin a download, finishing off by exporting the file to a PDF.

October 19, 2009 Posted by rubypdf | Books, Hacks, Microsoft, Open Source, Software, Tutorials, Windows | .NET FrameWork, Google, Google Book Downloader, Google Books, Google Books Library project, ISBN | 1 Comment

JPEG 2000 for Object Pascal-Delphi and FreePascal

JPEG 2000 for Pascal is a library for Object Pascal (Delphi and Free Pascal) developers who want to use JPEG 2000 images in their applications. It is based on the OpenJpeg library, written in C (BSD license). The C library is precompiled (using C++ Builder for Delphi and GCC for FPC) for several platforms, and a Pascal header is provided. Some higher-level classes for easier manipulation of JPEG 2000 images are part of JPEG 2000 for Pascal as well.

Library Contents

Cross-platform Pascal interface to OpenJpeg – low level access to precompiled library. Currently supported platforms: Windows 32bit, Linux 32/64bit, and Mac OS X.
VCL wrapper for Delphi (TBitmap descendant) enabling easy loading and saving of JPEG 2000 images.
Samples that demonstrate usage of all library interfaces, and a few test images in various data formats.

Installation

Delphi: Just add the JPEG 2000 for Pascal units you want to use to your uses clause (they must be in your search path) and the precompiled library will be linked automatically.

Free Pascal: OpenJpeg is compiled into static libraries so you have to set library search path when compiling your project. Libraries are located in J2KObjects directory.

Please visit the source page below to download the latest version.

source: http://galfar.vevb.net/wp/projects/jpeg2000-for-pascal/

September 10, 2009 Posted by rubypdf | Delphi, Free Pascal, Lazarus, Object Pascal, Open Source | BCC32, Cross-platform, FPC, FreePascal, GCC, Graphics, Image, J2K, JPEG2000, JPEG2000 for Delphi, JPEG2000 for Pascal, Lazarus, Liunx, Mac OS X, Open Source, OpenJPEG, Windows | Leave a comment

Convert RSS feeds to printable PDF newspapers

RSS has finally caught on to the point where even my non-geek friends have downloaded readers and subscribed to a few feeds. The thing is, you can’t really hand out RSS feeds at a rally, post them on a bulletin board, or leave them on a table where someone will pick them up and read them. That’s why it’s nice that fivefilters.org has provided a free way to turn your favorite feeds into printable PDF newspapers.

I’m as anti-paper as the next guy – heck, I haven’t owned a printer in years – but I know my mom’s not going to read my blog if I don’t hand it to her in paper form. I could do all the formatting myself, but Five Filters takes care of it automatically. The only major limitation is that it can only draw from one feed URL per PDF, but you can work around that by combining feeds using Yahoo! Pipes or a similar tool. It would be nice to pick and choose individual items from a feed reader to go into each newspaper, but this tool gets the basic job done, and the price is certainly right.

source: http://www.downloadsquad.com/2009/07/13/convert-rss-feeds-to-printable-pdf-newspapers/

September 9, 2009 Posted by rubypdf | Hacks, PDF News | newspaper, pdf, printable, rss, rss-to-pdf | Leave a comment

PDF Hacks: 100 Industrial-Strength Tips & Tools

PDF Hacks is ideal for anyone who works with PDFs on a regular basis. Learn how to create PDF documents that are far more powerful than simple representations of paper pages. Hacks cover the full range of PDF functionality, including generating, manipulating, annotating, and consuming PDF information. Far more than another guide to Adobe Acrobat, the book covers a variety of readily available tools for generating, deploying, and editing PDF.

Product Description
PDF: to most of the world it stands for that rather tiresome format used for documents downloaded from the web. Slow to load and slower to print, hopelessly unsearchable, and all but impossible to cut and paste from, the Portable Document Format doesn’t inspire much affection in the average user. But PDF done right is another story. Those who know the ins and outs of this format know that it can be much more than electronic paper. Flexible, compact, interactive, and even searchable, PDF is the ideal way to present content across multiple platforms. PDF Hacks unveils the true promise of Portable Document Format, going way beyond the usual PDF as paged output mechanism. PDF expert Sid Steward draws from his years of analyzing, extending, authoring, and embellishing PDF documents to present 100 clever hacks: tools, tips, quick-and-dirty or not-so-obvious solutions to common problems.

PDF Hacks will show you how to create PDF documents that are far more powerful than simple representations of paper pages. The hacks in the book cover the full range of PDF functionality, from the simple to the more complex, including generating, manipulating, annotating, and consuming PDF information. You’ll learn how to manage content in PDF, navigate it, and reuse it as necessary. Far more than another guide to Adobe Acrobat, the book covers a variety of readily available tools for generating, deploying, and editing PDF.

The little-known tips and tricks in this book are ideal for anyone who works with PDF on a regular basis, including web developers, pre-press users, forms creators, and those who generate PDF for distribution. Whether you want to fine-tune and debug your existing PDF documents or explore the full potential the format offers, PDF Hacks will turn you into a PDF power user.

About the Author
For over five years, Sid Steward has analyzed, extended, secured, cracked, authored, converted, embellished and consumed PDF. He maintained and created custom software for Thomson Financial’s Investext and then EBSCO. He then worked with SoftLock (d/b/a Digital Goods) to create their proprietary PDF security model and integrate it with their larger digital rights system. This project required pushing the envelope of Acrobat API programming. At the same time, he had been privately working on a semi-automated PDF to HTML conversion workflow. This toolset became the core of his PDF conversion service bureau: Boundless Books, Inc., d/b/a AccessPDF. He also performs PDF “finishing” which includes optimizing PDF file size and adding navigation features.

Paperback: 296 pages
Publisher: O’Reilly Media, Inc. (August 16, 2004)
Language: English
ISBN-10: 0596006551
ISBN-13: 978-0596006556
Product Dimensions: 8.7 x 6 x 0.7 inches

September 9, 2009 Posted by rubypdf | Books, Hacks | Acrobat, Adobe Acrobat, adobe pdf, Adobe Reader, ebook, fdf data, Free PDF Samples, Free Software, Indexing Service, java, joboptions files, Paperback, PDFTK | 1 Comment

Run J2SE and J2ME on Windows Mobile

Mysaifu JVM is a Java Virtual Machine which runs on Windows Mobile. It is free software under the GPLv2 (GNU General Public License, version 2).

The following operating systems are supported by the latest version of this JVM.

Windows Mobile 6.0
Windows Mobile 5.0
Windows Mobile 2003 Second Edition software for Pocket PC (Pocket PC 2003 SE)
Windows Mobile 2003 software for Pocket PC (Pocket PC 2003)

For details, please visit http://www2s.biglobe.ne.jp/~dat/java/project/jvm/index_en.html

phoneME for Windows CE, PocketPC and Windows Mobile is an implementation of the phoneME open source Java ME application platform for your Windows Mobile phone or handheld device. There are two different platforms of the phoneME virtual machine: phoneME Feature and phoneME Advanced.
Beyond precompiled binaries of these VMs for WinCE based operating systems (including PocketPC 2002, Windows Mobile 2003 and Windows Mobile 5), this website provides information, patches and instructions in order to compile the phoneME sources yourself.

phoneME Feature

phoneME Feature targets the low-end range of Windows Mobile devices and allows you to run CLDC and MIDP based applications, i.e. midlets.
CLDC builds more or less out of the box once you have correctly set up your build environment. The MIDP stack compiles too if you do not use the Javacall porting layer, but the PCSL and MIDP sources require a bit more patching in order to build and run midlets. Currently, many optional JSRs have not been ported to the WinCE platform and are hence unsupported.

phoneME Advanced

phoneME Advanced targets Windows Mobile devices with more resources (memory, cpu, storage) and can run CDC based applications that are compatible with a subset of the J2SE 1.4 stack. Other profiles like Foundation Profile and Personal Profile are also supported.
The CDC and Foundation profiles require little patching. While the Personal Basis Profile does not compile at all, the Personal Profile works reasonably well. phoneME Advanced also provides a dual stack implementation with MIDP support. It is based on the Foundation Profile and runs basic midlets pretty well, but as with phoneME Feature, JSR support is rather limited.

More information about the phoneME project and these two types of virtual machines can be found at https://phoneme.dev.java.net.

September 3, 2009 Posted by rubypdf | Open Source | CLDC, GPLv2, J2ME, J2SE, MIDP, phoneME, PocketPC, Windows CE, Windows Mobile | 1 Comment

Acrobat 9.1 SDK Update Released

The Navigator SDK for PDF Portfolio Layouts is finally out of beta and is available for download. Samples, updated documentation and other resources are available in the Acrobat 9.1 SDK August 2009 update. You can get it by following the link below.

NOTE: If you have already created navigators for your PDF Portfolios, it is highly recommended that you replace the AcrobatAPI.swc file in your project, recompile and update your .NAV files. Be sure to increment the “version” attribute in your navigator.xml file to make applying your updated navigator to your existing PDF Portfolios easier.

Go to the Acrobat SDK Download Page

source: http://blogs.adobe.com/pdfdevjunkie/2009/08/acrobat_91_sdk_update_released.html

August 27, 2009 Posted by rubypdf | PDF News | Acrobat, Acrobat 9.1 SDK, Acrobat SDK, SDK | Leave a comment
