PDF Hacks

Using pdfsizeopt to Optimize & Reduce PDF File Size

pdfsizeopt is an open source project hosted on Google Code; its main feature is optimizing PDF file size. It is based on the following tools:

  • pdfsizeopt.py
  • Python
  • Ghostscript
  • Java
  • sam2p
  • jbig2
  • png22pnm
  • pngtopnm
  • Multivalent.jar
  • PNGOUT

pdfsizeopt is a collection of best practices and scripts for Unix to optimize the size of PDF files, with focus on PDFs created from TeX and LaTeX documents. pdfsizeopt is developed on a Linux system, and it depends on existing tools such as Python 2.4, Ghostscript 8.50, jbig2enc (optional), sam2p, pngtopnm, pngout (optional), and the Multivalent PDF compressor (optional) written in Java.
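As a rough illustration of how a run might be scripted, here is a minimal Python sketch. It assumes the pdfsizeopt executable is on your PATH and takes the input and output paths as positional arguments; the helper names build_command, percent_saved and optimize are made up for this example, so check the help output of your installed version before relying on it.

```python
import os
import subprocess


def build_command(input_pdf, output_pdf, executable="pdfsizeopt"):
    """Return the argument list for one optimizer run.

    Assumption: the executable accepts the input and output paths as
    positional arguments.
    """
    return [executable, input_pdf, output_pdf]


def percent_saved(original_size, optimized_size):
    """Return how much smaller the optimized file is, in percent."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - optimized_size) / original_size


def optimize(input_pdf, output_pdf):
    """Run the optimizer and return the percentage of bytes saved."""
    subprocess.check_call(build_command(input_pdf, output_pdf))
    return percent_saved(os.path.getsize(input_pdf),
                         os.path.getsize(output_pdf))
```

Calling optimize("thesis.pdf", "thesis.small.pdf") would run the tool once and report the saving; the file names are placeholders.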

For details, please visit pdfsizeopt: a Free and Open Source PDF Manipulation Tool to Reduce PDF File Size.

References:

pdfsizeopt home page
Convert JBIG2 to PDF with free and open source software agl’s jbig2enc
Windows version JBIG2 Encoder-Jbig2.exe

 

October 30, 2009 | Hacks, Linux, Open Source, Software, Tutorials, Windows

PDFMiner: Python PDF parser and analyzer

PDFMiner is a suite of programs that help extract and analyze text data from PDF documents. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as other information such as fonts or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features:

  • Written entirely in Python. (for version 2.4 or newer)
  • PDF-1.7 specification support. (well, almost)
  • Non-ASCII languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Infers the running order of text using a clustering technique.

For downloads and details, please visit http://www.unixuser.org/~euske/python/pdfminer/index.html.

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py extracts text contents from a PDF file. It extracts all the text that is rendered programmatically; it cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when access is restricted. You cannot extract any text from a PDF document that does not have extraction permission.

For non-ASCII languages, you can specify the output encoding (such as UTF-8).

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it’s also possible to extract some meaningful contents (such as images).

October 19, 2009 | Linux, Open Source, Software, Windows

Some PDF Tools developed in Python

While searching WordPress.com, I noticed an article, PDF Tools, that introduces some small PDF tools, all developed in Python.

pdf-parser.py

This tool will parse a PDF document to identify the fundamental elements used in the analyzed file. It will not render a PDF document. The code of the parser is quick-and-dirty; I’m not recommending this as a textbook case for PDF parsers, but it gets the job done.

You can see the parser in action in this screencast.

The stats option displays statistics of the objects found in the PDF document. Use this to identify PDF documents with unusual/unexpected objects, or to classify PDF documents. For example, I generated statistics for 2 malicious PDF files, and although they were very different in content and size, the statistics were identical, proving that they used the same attack vector and shared the same origin.

The search option searches for a string in indirect objects (not inside the stream of indirect objects). The search is not case-sensitive, and is susceptible to the obfuscation techniques I documented (as I’ve yet to encounter these obfuscation techniques in the wild, I decided not to resort to canonicalization).

The filter option applies the filter(s) to the stream. For the moment, only FlateDecode is supported (i.e. zlib decompression).
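Since FlateDecode data is a zlib stream, the decoding step can be sketched with Python's standard zlib module. This is only an illustration and not pdf-parser's own code: the sample bytes are made up, flate_decode is a name chosen for this example, and real streams may additionally need predictor handling, which the sketch ignores.

```python
import zlib

# Stand-in for the bytes between "stream" and "endstream" of an object
# whose dictionary says /Filter /FlateDecode (made up for this example).
original = b"BT /F1 12 Tf 72 712 Td (Hello, PDF) Tj ET"
encoded = zlib.compress(original)


def flate_decode(data):
    """Undo a FlateDecode filter: the data is a plain zlib stream."""
    return zlib.decompress(data)


decoded = flate_decode(encoded)
print(decoded.decode("ascii"))
```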

The raw option makes pdf-parser output raw data (i.e. not the printable Python representation).

The objects option outputs the data of the indirect object whose ID was specified. This ID is not version dependent. If more than one object has the same ID (disregarding the version), all these objects will be output.

The reference option allows you to select all objects referencing the specified indirect object. This ID is not version dependent.

The type option allows you to select all objects of a given type. The type is a Name and as such is case-sensitive and must start with a slash character (/).

Download:

pdf-parser_V0_3_1.zip (https)

MD5: 07CDA54844CD6567473CBF2B0DFC601C

SHA256: 7614AEC453502EEF43F9EA04A82092C4ACDD32AB86D1C4D744B7B590C74152EC
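The published MD5 and SHA256 values are there so that you can check a download before running it. A minimal sketch of that check with Python's hashlib follows; the helper names read_chunks, digests and verify are invented for this example.

```python
import hashlib


def read_chunks(path, chunk_size=65536):
    """Yield the file at path in fixed-size binary chunks."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def digests(chunks):
    """Return (MD5, SHA256) as uppercase hex for an iterable of bytes."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    for chunk in chunks:
        md5.update(chunk)
        sha256.update(chunk)
    return md5.hexdigest().upper(), sha256.hexdigest().upper()


def verify(path, expected_md5, expected_sha256):
    """True only if both digests of the file match the published ones."""
    actual = digests(read_chunks(path))
    return actual == (expected_md5.upper(), expected_sha256.upper())
```

For the archive above, the call would be verify("pdf-parser_V0_3_1.zip", "07CDA54844CD6567473CBF2B0DFC601C", "7614AEC453502EEF43F9EA04A82092C4ACDD32AB86D1C4D744B7B590C74152EC"), assuming the file was saved under that name in the current directory.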

make-pdf tools

make-pdf-javascript.py allows one to create a simple PDF document with embedded JavaScript that will execute upon opening of the PDF document. It’s essentially glue-code for the mPDF.py module which contains a class with methods to create headers, indirect objects, stream objects, trailers and XREFs.

If you execute it without options, it will generate a PDF document with JavaScript to display a message box (calling app.alert).

To provide your own JavaScript, use option --javascript for a script on the command line, or --javascriptfile for a script contained in a file.

Download:

make-pdf_V0_1_1.zip (https)

MD5: 9AF2E343B78553021C989E8E22355531

SHA256: C604679ABEB0469C1463159E02E74F12487B2755A6096B416A8F4F638DEB8AA9

pdfid.py

This tool is not a PDF parser, but it will scan a file to look for certain PDF keywords, allowing you to identify PDF documents that contain (for example) JavaScript or execute an action when opened. PDFiD will also handle name obfuscation.

The idea is to use this tool first to triage PDF documents, and then analyze the suspicious ones with my pdf-parser.

An important design criterion for this program is simplicity. Parsing a PDF document completely requires a very complex program, and hence it is bound to contain many (security) bugs. To avoid the risk of getting exploited, I decided to keep this program very simple (it is even simpler than pdf-parser.py).

PDFiD will scan a PDF document for a given list of strings and count the occurrences (total and obfuscated) of each word:

  • obj
  • endobj
  • stream
  • endstream
  • xref
  • trailer
  • startxref
  • /Page
  • /Encrypt
  • /ObjStm
  • /JS
  • /JavaScript
  • /AA
  • /OpenAction
  • /JBIG2Decode

Almost every PDF document will contain the first 7 words (obj through startxref), and to a lesser extent stream and endstream. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document).

/Page gives an indication of the number of pages in the PDF document. Most malicious PDF documents have only one page.

/Encrypt indicates that the PDF document has DRM or needs a password to be read.

/ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).

/JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intent.

/AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.

The combination of automatic action and JavaScript makes a PDF document very suspicious.
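That triage rule is easy to write down. The sketch below assumes you already have keyword counts in a dictionary (for example, copied from a scanner's report); the function name looks_suspicious and the sample counts are made up for illustration.

```python
AUTO_ACTION_KEYWORDS = ("/AA", "/OpenAction")
JAVASCRIPT_KEYWORDS = ("/JS", "/JavaScript")


def looks_suspicious(counts):
    """Flag a document that has both an automatic action and JavaScript.

    counts maps a keyword to how often it was found. Keywords that are
    missing from the dictionary are treated as a count of 0.
    """
    has_action = any(counts.get(k, 0) > 0 for k in AUTO_ACTION_KEYWORDS)
    has_script = any(counts.get(k, 0) > 0 for k in JAVASCRIPT_KEYWORDS)
    return has_action and has_script


print(looks_suspicious({"/OpenAction": 1, "/JS": 1, "/Page": 1}))  # True
print(looks_suspicious({"/JavaScript": 2, "/Page": 12}))           # False
```

A positive result only means the document deserves a closer look with a real parser; JavaScript with an automatic action also occurs in harmless documents.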

/JBIG2Decode indicates if the PDF document uses JBIG2 compression. This is not necessarily an indication of a malicious PDF document, but requires further investigation.

A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hexcodes, e.g. /JBIG#32Decode).

BTW, all the counters can be skewed if the PDF document is saved with incremental updates.

Because PDFiD is just a string scanner (supporting name obfuscation), it will also generate false positives. For example, a simple text file starting with %PDF-1.1 and containing words from the list will also be identified as a PDF document.
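The counting scheme described above, including the obfuscated counter, can be sketched in a few lines of Python. This is not PDFiD's code: it is a simplified illustration that only handles names (keywords starting with a slash), and the function names and the sample bytes are invented for this example. Like any plain scanner, it looks at every byte of the file, so it shares the false-positive problem.

```python
import re

# A PDF name is a slash followed by regular characters (no whitespace
# and no delimiter characters).
NAME_TOKEN = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw):
    """Replace each #XX escape in a raw name with the byte it encodes."""
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def count_names(data, keywords):
    """Return {keyword: (total, obfuscated)} for the names in data."""
    counts = {keyword: [0, 0] for keyword in keywords}
    for match in NAME_TOKEN.finditer(data):
        raw = match.group(0)
        name = decode_name(raw).decode("latin-1")
        if name in counts:
            counts[name][0] += 1
            # The name was obfuscated if decoding changed its bytes.
            if raw != name.encode("latin-1"):
                counts[name][1] += 1
    return {k: tuple(v) for k, v in counts.items()}


sample = b"<< /Filter /JBIG#32Decode /OpenAction 5 0 R /JS (x) >>"
result = count_names(sample, ["/JBIG2Decode", "/OpenAction", "/JS", "/AA"])
for keyword, (total, hidden) in result.items():
    print("%s %d(%d)" % (keyword, total, hidden))
```

Because whole names are compared after decoding, /Page and /Pages are counted as different keywords in this sketch.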

Download:

pdfid_v0_0_9.zip (https)

MD5: 1C731D6204C09AAFF219876A8FB5E834

SHA256: 24A9B16E67A84E85488A16879CB611128B2E5921044E48EFB60D784BD785CBD0

October 19, 2009 | Linux, Open Source, Software, Tutorials, Windows

20 of the Best SEO Plugins for WordPress

Do you want to rank better in Google? If you have a WordPress blog, how do you do it? Installing some good SEO plugins may be the better way, but there are tons of plugins that claim to help with SEO, so how do you choose?

With more than 120 million blogs in existence, how do people find YOUR content on the Internet? The key starts with great search engine optimization (SEO), which is an art and a science that helps search engines discover your content and understand how relevant it is to specific search queries.

You can blog your heart out, but if you don’t have good SEO, then odds are you won’t have many readers. Luckily, the WordPress plugin community values SEO and has developed a number of plugins to help. Here are 20 of the best SEO plugins to help you choose the right tags, tell search robots what to work on, optimize your post titles and more.

Have another SEO plugin to recommend? Tell us more about it in the comments.

Nofollow Case by Case – This plugin allows you to strip the “nofollow” command from your comments, and then you can apply it to only the comments you don’t wish to support.

Platinum SEO Plugin – The Platinum SEO Plugin offers you such features as automatic 301 redirects for permalink changes, auto-generation of META tags, post slug optimization, help in avoiding duplicate content and a host of other features.

Redirection – For any number of reasons you sometimes need to move a page from one spot on your blog to another, but then you risk losing that page’s status in search results.  Redirection helps you with your 301 redirects, captures a log of 404s so you can work on correcting them, sets up an RSS feed for errors and more.

SEO Blogroll – Do you worry that the people you link to in your blogroll are feeding off of your PageRank?  With SEO Blogroll you can make separate sections for various groupings of links, with an unlimited number in each, and all of them will receive the “nofollow” attribute.

SEO for Paged Comments – With the introduction of paged comments in WordPress 2.7, there was a potential problem with search engines thinking you had duplicate content as the post would appear on each page.  This plugin aims to take care of this issue for you until the folks at WordPress change things up.

SEO friendly and HTML valid subheadings – Some themes for WordPress will confuse your sub-header tags based on the page they are to be displayed on, but this plugin will automatically reset them to make them more SEO friendly by moving them down one spot in the hierarchical tree.  In other words, h2 becomes h3, h3 becomes h4 and so on.

SEO Friendly Images – Images can be a great source of traffic as people search for images of various subjects, and this plugin helps you with making sure that you have “alt” and “title” tags on all of your images so that the search engines can properly index them.

SEO No Duplicate WordPress Plugin – If you must have duplicate content on your site for whatever reason, SEO No Duplicate will allow you to state which version of the post search engines should index while ignoring the others.

SEO Post Link – The post slug is the blog title you see in a browser’s URL bar, and if it’s too long, search engines won’t take a liking to it.  SEO Post Link comes with an already populated list of words to cut from a title when it turns into a URL to make your post addresses that much friendlier.  You can set it so that it’s limited to a certain number of characters, cut short words, cut unnecessary words and more.

SEO Smart Links – Interlinking your blog can be the key to getting more people to read more of your posts, but it is time consuming and tedious to do it by hand. SEO Smart Links does this for you automatically when you tell it what words to link to what URLs, and it also allows you to set “nofollow” and “open in window” commands for the links.

SEO Tag Cloud Widget – Love ‘em or hate ‘em, a lot of people use tag clouds on their blogs.  Since their inception they have been fairly unreadable by search engines, but with this plugin they will be converted to an SEO-friendly HTML markup that can be indexed.

SEO Title Tag – Your tags are an important part of your site for making sure that search engines know where to place your posts, and SEO Title Tag focuses exclusively on this.  Unlike some other plugins, and WordPress itself, this extension will allow you to add tags to your pages, your main page and even any URL anywhere on your site.

Simple Tags – An extremely popular plugin that focuses on helping you choose the best tags for your posts by offering suggestions, auto-completion of tags as you type, an AJAX admin interface, mass tag editing and a whole lot more.

Sitemap Generator – This is a more customizable sitemap generator than most with options to support multi-level categories and pages, category/page exclusion, permalink support, choices on what to display, options to show number of comments and more.

TGFI.net SEO WordPress Plugin – This particular plugin will do most of the usual SEO work of optimizing titles and keywords, but it adds a unique twist as it is mainly directed at people who use WordPress as a CMS.

All in One SEO Pack – One of the most popular plugins ever for WordPress, this plugin does a bit of everything for you from helping choose the best post title and keywords, to helping you avoid duplicate content and more.

Automatic SEO Links – Automatic SEO Links allows you to choose a word or phrase for automatic linking, both internal and external, set anchor text, choose if it should be “nofollow” or not, and more.  One of the best features of this plugin is that it will only do this for the first occurrence of a word in a post so you don’t have to worry about spamming your post with numerous links to the same thing.

Google XML Sitemaps – An essential tool in any blogger’s armory of SEO tools. While the name only mentions “Google,” this plugin creates an XML-sitemap that can be read by Ask, MSN and Yahoo also.

HeadSpace2 – This plugin allows you to install all sorts of meta-data, add specific JavaScript and CSS to pages, suggests tags for your posts and a whole lot more.

Meta Robots WordPress plugin – An easy solution for adding robot metadata to any page you choose on your blog.  You can use it to make your front page links into “nofollows,” prevent indexing of search pages, disable author and date-based archives, prevent indexing of your login page and numerous other features.

Source: http://mashable.com/2009/03/20/wordpress-seo-plugins/

August 19, 2009 | Linux, WordPress

Easy Way to Extract RPM with P7zip Under Linux

RPM (Red Hat Package Manager) is a package format; an RPM file is essentially a cpio archive with package metadata in front of it.

P7ZIP is a port of 7za.exe for POSIX systems like Unix (Linux, Solaris, OpenBSD, FreeBSD, Cygwin, AIX, …), MacOS X and BeOS. It supports many formats:

  • Packing / unpacking: 7z, ZIP, GZIP, BZIP2 and TAR
  • Unpacking only: RAR, CAB, ISO, ARJ, LZH, CHM, MSI, WIM, Z, CPIO, RPM, DEB and NSIS

So if you want to extract an RPM file, such as myrpm.rpm, you can do it this way:

7z x myrpm.rpm
7z x myrpm.cpio
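If you unpack packages often, the two steps can be wrapped in a small script. The following is a minimal Python sketch, assuming the 7z binary is on your PATH and that the intermediate payload is named after the package with a .cpio extension, as in the example above; some packages or 7z versions may produce a different intermediate name. The function names are made up for this example.

```python
import os
import subprocess


def extraction_steps(rpm_path, out_dir="."):
    """Return the two 7z command lines for unpacking an RPM.

    Step one unpacks the RPM into a cpio payload; step two unpacks the
    payload into the actual files. "x" is 7z's extract command, "-y"
    answers yes to prompts and "-o" sets the output directory.
    """
    base = os.path.splitext(os.path.basename(rpm_path))[0]
    # Assumption: step one writes <package name>.cpio into out_dir.
    cpio_path = os.path.join(out_dir, base + ".cpio")
    return [
        ["7z", "x", "-y", "-o" + out_dir, rpm_path],
        ["7z", "x", "-y", "-o" + out_dir, cpio_path],
    ]


def extract_rpm(rpm_path, out_dir="."):
    """Run both steps; raises CalledProcessError if 7z reports failure."""
    for command in extraction_steps(rpm_path, out_dir):
        subprocess.check_call(command)
```

Calling extract_rpm("myrpm.rpm", "unpacked") would leave the package contents in the unpacked directory.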

By the way, under Windows, you can use 7-Zip to do the same job.

References:

How to Unzip or Extract RPM under Linux

P7ZIP

7-Zip

August 19, 2009 | Hacks, Linux, Open Source, Software, Tutorials

Using Ruby Java Bindings on Dreamhost and Fill PDF Form with iText

I am familiar with iText but not with Ruby. I know Dreamhost supports Ruby on Rails (RoR), but I have never had a chance to run a real application there, even though I have a Dreamhost account.

Getting rjb, also known as “Ruby Java Bindings”, to work in a Dreamhost account can be somewhat problematic. Fortunately, it is also fairly straightforward. You just have to install all the dependencies in the user directory.

In my case, I was setting up a Rails Application that used the iText Java library to fill in PDF documents for user download. Of course, the server environment was not using Sun Java and the Java headers were not present, so gem install rjb failed. Joy.

The first course of action was to do a local install of Java.

Download jdk-6u7-linux-x64.bin and jre-6u7-linux-x64.bin from the Sun Java site. Then create an ~/opt directory and extract the JRE and JDK (you will need to chmod u+x both files then call them from the command line). Move the resulting folders to ~/opt . I renamed the folders to jdk and jre for simplicity.

Now ensure that user gems are enabled in cPanel. Then add the following 3 lines to ~/.bash_profile:

export JAVA_HOME=/home/username/opt/jdk
export GEM_PATH=/home/username/ruby/gems
export GEM_HOME=/home/username/ruby/gems

Run source ~/.bash_profile to load the paths.

Now you can run gem install rjb without any problems. You will likely have to re-install Rails and other gems because we will be telling our Rails app to load gems from the user directory. Just use the regular gem install syntax.

Add the following 2 lines at the top of your config/environment.rb:

ENV['GEM_PATH']='/home/username/ruby/gems'
ENV['JAVA_HOME']='/home/username/opt/jdk'

That’s pretty much it.

I will concede that these probably aren’t the best instructions, but this is the real meat of the solution. If you keep getting “no such file to load” errors, you will need to extract the gems to the vendor/plugins directory. cd RAILS_ROOT/vendor/plugins and gem unpack gem_name for each problematic gem. I believe this is an issue with Passenger.

Please correct me if I am wrong about any of this, it was a very long day!

Source: http://blog.patrick-morgan.net/2008/10/using-ruby-java-bindings-on-dreamhost.html

August 18, 2009 | Hacks, Linux, Tutorials

Is PDF Really Secure?

ElcomSoft said:

Here is what I was able to easily achieve, in mere seconds, on a regular PC.

  1. Remove the Master Password and all the restrictions it controls.
  2. Remove the User Password (File Open), 40 and 128-bit RC4 encryption.
  3. Remove DRM security from a PDF eBook that was locked to my system, and revert this PDF eBook to a regular PDF file that can be viewed and edited in Acrobat.

PDFCrack: RubyPDF supplies free software to recover the PDF User Password. It supports 40 and 128-bit RC4 encryption and 128-bit AES encryption, it is open source, and it runs on multiple platforms; the Windows version is compiled and supplied by RubyPDF.

PdfCrypt: RubyPDF also supplies free software to remove the PDF Master Password without needing to know it, and the PDF User Password with the help of PDFCrack.

If you have lost the original unencrypted PDF and want to remove certificate encryption from a PDF, you can also ask RubyPDF for help.

August 10, 2009 | Linux, PDF Security, Software, Windows

Recover PDF Passwords with the Free Software PDFCrack

PDFCrack is a GNU/Linux (other POSIX-compatible systems should work too) tool for recovering passwords and content from PDF-files. It is small, command line driven without external dependencies. The application is Open Source (GPL).

Features

  • Supports the standard security handler (revision 2, 3 and 4) on all known PDF-versions
  • Supports cracking both owner and userpasswords
  • Both wordlists and bruteforcing the password are supported
  • Simple permutations (currently only trying first character as Upper Case)
  • Save/Load a running job
  • Simple benchmarking
  • Optimised search for owner-password when user-password is known
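To make the wordlist and "simple permutations" features more concrete, here is a minimal Python sketch of candidate generation. It is not PDFCrack's implementation: the function names are invented, and is_correct is a hypothetical callback standing in for the real check against the PDF's security handler.

```python
def candidates(words):
    """Yield each word, then its first-letter-uppercase variant if new."""
    for word in words:
        yield word
        variant = word[:1].upper() + word[1:]
        if variant != word:
            yield variant


def recover(words, is_correct):
    """Return the first candidate accepted by is_correct, or None.

    is_correct is a hypothetical callback; a real tool would test the
    candidate against the encryption data stored in the PDF.
    """
    for guess in candidates(words):
        if is_correct(guess):
            return guess
    return None


wordlist = ["secret", "Report", "2009"]
print(list(candidates(wordlist)))
print(recover(wordlist, lambda guess: guess == "Secret"))
```

With the sample wordlist, the generator produces secret, Secret, Report and 2009, and the recovery loop stops at Secret.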

Here are some articles about PDFCrack:

  1. PDFCrack | RubyPdf Technologies

  2. PDFCrack 0.11 for Window Releases | RubyPDF Blog

  3. Free PDF Password Recovery for Windows

July 21, 2009 | Linux, Open Source, PDF Security, Software, Windows

   
