DiffPDF

Basic Usage

Click the File #1 button to choose one PDF file, then the File #2 button to choose another (ideally very similar) PDF file. Click the Compare button to perform the comparison. When it has finished, navigate through the pairs of differing pages using the View combobox or the Previous and Next buttons. Alternatively, drag two files (either separately or together) and drop them onto DiffPDF's view panels, then click the Compare button.

The Compare Button

When the Compare button is pressed, DiffPDF does a high-speed scan of every pair of pages (~100 pairs of pages per second on the author's machine). To make the scan as fast as possible DiffPDF does a very rough check of each pair of pages, so it may report some false positives (i.e., page pairs flagged as different that are really the same). False positives are quite rare. (There are no false negatives: differences are never missed.)

Words Comparison Mode

The default comparison mode is Words, which does a smart word-by-word text comparison for each pair of pages. This mode is fairly liberal regarding whitespace and tries to ignore layout changes (within a page) insofar as possible. It also treats all hyphens (soft hyphen, minus sign, etc.) the same, that is, as a plain hyphen. This mode is best for alphabetic languages like English.

Characters Comparison Mode

The Characters comparison mode does a smart character-by-character text comparison for each pair of pages. This mode is liberal regarding whitespace at the ends of lines and tries to ignore layout changes (within a page) insofar as possible. It also treats all hyphens (soft hyphen, minus sign, etc.) the same, that is, as a plain hyphen. This mode is best for logographic languages like Chinese and Japanese.
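
The normalization that the Words and Characters modes describe can be sketched in a few lines of Python. This is only an illustration, not DiffPDF's code: the set of hyphen-like characters, the function names, and the use of Python's difflib as the sequence matcher are all assumptions made for the example.

```python
import difflib

# Assumption: an illustrative set of hyphen-like characters (soft hyphen,
# Unicode hyphen, non-breaking hyphen, en dash, minus sign). The set that
# DiffPDF actually folds together is not listed in this document.
HYPHEN_LIKE = "\u00ad\u2010\u2011\u2013\u2212"


def normalize(text):
    """Replace every hyphen-like character with a plain ASCII hyphen."""
    return text.translate({ord(ch): "-" for ch in HYPHEN_LIKE})


def word_tokens(text):
    # split() with no argument collapses any run of whitespace, so
    # spacing and line-break changes do not show up as differences.
    return normalize(text).split()


def char_tokens(text):
    # Only whitespace at the ends of each line is dropped here.
    lines = [line.strip() for line in normalize(text).splitlines()]
    return list("\n".join(lines))


def differing_spans(tokens1, tokens2):
    """Return the non-equal opcodes between two token sequences."""
    matcher = difflib.SequenceMatcher(None, tokens1, tokens2, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]
```

With these helpers, two pages that differ only in line breaks and hyphen style produce no differing spans in the word-based comparison, while a changed word produces a single "replace" span.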

Appearance Comparison Mode

The Appearance comparison mode can be used to detect changes in fonts, diagrams, or any other visual aspects. This mode is absolutely strict and compares each pair of pages pixel for pixel. By default this mode shows differences using highlighting just like the Words and Characters modes do. However, it is also possible to compare using composition modes which can be useful to detect very small and subtle differences that aren't immediately apparent.

Zoning

Zoning is an experimental feature designed to produce more accurate results (i.e., fewer false positives). Its main use is for pages that have tables or that mix alphabetic and logographic text, since these can cause the underlying Poppler PDF library to provide the page's words mixed up. Warning: using zoning for large complex pages (bigger than A4, multiple columns, tables) in Characters mode can be very slow. (The current focus for the zoning code is functionality not efficiency.) Furthermore, in some cases zoning can cause an increase in false positives: the zoning code reorders the text that is fed to the sequence matcher, and sometimes the reordering is wrong. Getting this right is non-trivial; changing the tolerances may help.

The Tolerance/R value is the maximum distance between text (i.e., word) rectangles for the rectangles to be placed in the same zone. Lower values create more zones; higher values create fewer zones. More zones are expensive to compute but can produce more accurate results; fewer zones may reduce false positives. The Tolerance/Y value is used for rounding y coordinates to the nearest multiple of this value. For example, if Tolerance/Y is 5 and a word at position (452,137) is followed by a superscript at (468,140), both will be treated as having a y coordinate of 140.
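
The effect of Tolerance/Y can be illustrated with a small Python sketch. It is not taken from DiffPDF's source. It assumes that y coordinates are rounded up to the next multiple of the tolerance, because that is what the numbers in the example above imply (137 becomes 140, whereas strict nearest rounding would give 135). The function names and the treatment of a tolerance of 0 as "no rounding" are also assumptions.

```python
from collections import defaultdict


def round_up_to_multiple(y, tolerance_y):
    """Round y up to the next multiple of tolerance_y (assumed behaviour)."""
    if tolerance_y <= 0:
        return y  # assumption: a tolerance of 0 leaves coordinates as they are
    return -(-y // tolerance_y) * tolerance_y


def group_into_lines(words, tolerance_y):
    """Group (text, x, y) words by rounded y, ordering each line by x."""
    lines = defaultdict(list)
    for text, x, y in words:
        lines[round_up_to_multiple(y, tolerance_y)].append((x, text))
    return {y: [text for _, text in sorted(items)]
            for y, items in sorted(lines.items())}


words = [("note", 468, 140), ("word", 452, 137)]
print(group_into_lines(words, 5))  # {140: ['word', 'note']}
```

Without the rounding, the word and its superscript would land on two separate lines (y = 137 and y = 140) and could be fed to the sequence matcher in the wrong order.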

Page Ranges

By default DiffPDF compares every pair of pages in the two PDFs (or as many pairs of pages as the number of pages in the shorter PDF). It is also possible to compare particular pages or page ranges. For example, if there are two versions of a PDF file, one with pages 1-12 and the other with pages 1-13 because of an extra page having been added as page 4, they can be compared by specifying two page ranges, 1-12 for the first and 1-3, 5-13 for the second. This will make DiffPDF compare pages in the pairs (1, 1), (2, 2), (3, 3), (4, 5), (5, 6), and so on, to (12, 13).
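
The pairing in the example above can be reproduced with a short Python sketch. The parsing rules used here (comma-separated items, each a single page or a first-last range) are inferred from the example, and the function names are hypothetical; DiffPDF's own parser may accept other forms.

```python
def parse_page_ranges(spec):
    """Expand a spec such as "1-3, 5-13" into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return pages


def page_pairs(spec1, spec2):
    """Pair pages positionally; zip() stops at the shorter list."""
    return list(zip(parse_page_ranges(spec1), parse_page_ranges(spec2)))


pairs = page_pairs("1-12", "1-3, 5-13")
print(pairs[:5])   # [(1, 1), (2, 2), (3, 3), (4, 5), (5, 6)]
print(pairs[-1])   # (12, 13)
```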

Margins

It is possible to make DiffPDF ignore any text that is above a specified top margin, below a specified bottom margin, left of a specified left margin, or right of a specified right margin. To specify one or more of these margins, first check the Exclude Margins checkbox, then set whichever margins are needed. Margins can be set by clicking on a page view or by using the margin spinboxes.
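
The margin rule can be pictured as a simple rectangle filter, sketched here in Python. Several things are assumptions made for illustration: the Word type, the convention that y grows downwards from the top of the page, expressing each margin as the coordinate of its margin line, and keeping a word only when its whole rectangle lies inside all four margins. How DiffPDF treats a word that straddles a margin is not stated in this document.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word rectangle; y is measured from the top of the page."""
    text: str
    x: float
    y: float
    width: float
    height: float


def inside_margins(word, left, top, right, bottom):
    """True if the word's rectangle lies entirely within the four margin lines."""
    return (word.x >= left and word.y >= top
            and word.x + word.width <= right
            and word.y + word.height <= bottom)


def words_to_compare(words, left, top, right, bottom):
    """Drop every word that falls outside the margins before comparing."""
    return [w for w in words if inside_margins(w, left, top, right, bottom)]
```

A typical use is to set the top and bottom margins so that running headers, footers, and page numbers are excluded and only the body text is compared.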

Saving

Use the Save As button to pop up a Save dialog. This dialog lets you save a .pdf file with the highlighted changes, or individual image files (e.g., in .png or various other common image formats). The dialog supports saving the current or all left pages, right pages, or both pages.

The Options Dialog

This dialog is invoked by clicking the Options button. The dialog supports changing the highlighting color, whether to use a pen or fill or both, and the fill's opacity. The Square Size is used when doing Appearance mode comparisons: the smaller the size the more fine-grained the highlighting is—and the slower to compute. The Rule width determines the thickness of the margin rules which are used to indicate the vertical position of differences; the rules can be switched off using a Rule width of 0.
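
The Square Size trade-off can be seen in a minimal Python sketch of square-by-square comparison. It is not DiffPDF's implementation: pages are modelled as equal-sized lists of rows of pixel values, and the function name is hypothetical.

```python
def differing_squares(image1, image2, square_size):
    """Return the (left, top) corner of every square containing a differing pixel.

    Assumes both images are lists of rows with identical dimensions.
    """
    height = len(image1)
    width = len(image1[0]) if height else 0
    squares = []
    for top in range(0, height, square_size):
        for left in range(0, width, square_size):
            rows = range(top, min(top + square_size, height))
            cols = slice(left, min(left + square_size, width))
            if any(image1[r][cols] != image2[r][cols] for r in rows):
                squares.append((left, top))
    return squares


page1 = [[0] * 4 for _ in range(4)]
page2 = [row[:] for row in page1]
page2[3][1] = 255  # one changed pixel at x=1, y=3

print(differing_squares(page1, page2, 2))  # [(0, 2)]
print(differing_squares(page1, page2, 1))  # [(1, 3)]
```

A smaller square pins the highlight closer to the changed pixel, but the number of squares to check grows roughly with the inverse square of the size.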

Dock Windows

The Controls, Actions, Margins, Zoning, and Log views are in dock widgets—these can be dragged into other dock areas (in which case they will reshape themselves as necessary), or dragged to float free. The Margins, Zoning, and Log views can also be closed; right click a dock area splitter and check their checkbox to open them again. These views may be shown tabbed: if there is enough space they can be dragged out of their tabs and all shown in full.

Command Line Usage

Although DiffPDF is a GUI program, if it is run from a console with two PDF files listed on the command line, it will start up and immediately compare them. The comparison is done in Words mode by default, in Appearance mode if the file names are preceded by -a or --appearance, or in Characters mode if they are preceded by -c or --characters. Run DiffPDF with --help to see all the command line options. (This won't work on Windows, although the other command line options will.) Here is the --help output:

usage: diffpdf [options] [file1.pdf [file2.pdf]]

A GUI program that compares two PDF files and shows
their differences.

The files are optional and are normally set through
the user interface.

options:
--help             show this usage text and terminate (run the
                   program without this option and press F1 for
                   online help)
--appearance  -a   set the initial comparison mode to Appearance
--characters  -c   set the initial comparison mode to Characters
--words       -w   set the initial comparison mode to Words
--language=xx      set the program to use the given translation
                   language, e.g., en for English, cz for Czech;
                   English will be used if there is no translation
                   available
--debug=2          write the text fed to the sequence matcher into
                   temporary files (e.g., /tmp/page1.txt etc.)
--debug=3          as --debug=2 but also includes coordinates in
                   y, x order

The text reordering is done by the TextItems::columnZoneYxOrder() method in the textitem.cpp file: suggestions for improvement are welcome! (Note that when using --debug=3 coordinates are output in y, x order.)

If you're specifically looking for a command line PDF comparison tool, e.g., for automated testing, try comparepdf.
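
To open the DiffPDF GUI itself from a script, using only the options listed in the help text above, something like the following Python sketch could be used. The function names and the assumption that the executable is called diffpdf and is on the PATH are illustrative; the mode flags are those shown in the --help output.

```python
import subprocess

# Long-form mode options as listed in the --help output.
MODE_FLAGS = {
    "words": "--words",
    "characters": "--characters",
    "appearance": "--appearance",
}


def diffpdf_command(file1, file2, mode="words", executable="diffpdf"):
    """Build the argument list; the mode option precedes the file names."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown comparison mode: {mode!r}")
    return [executable, MODE_FLAGS[mode], file1, file2]


def launch_diffpdf(file1, file2, mode="words"):
    """Start the GUI without waiting for it to exit.

    Assumes a diffpdf executable can be found on the PATH.
    """
    return subprocess.Popen(diffpdf_command(file1, file2, mode))
```

For example, launch_diffpdf("old.pdf", "new.pdf", "appearance") would open DiffPDF and immediately compare the two files in Appearance mode.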