Willus.com's K2pdfopt Help Page

  OPTICAL CHARACTER RECOGNITION (OCR)
NOTE 1: In v1.63, k2pdfopt adds Unicode-16 support to OCR.
NOTE 2: In v1.51, the -wc command-line option has been replaced with -ocrvis.

As of v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to native ASCII characters so that the text in the output file can be searched or copied and pasted into other applications. As of v1.63, bitmapped text in any language that Tesseract supports (Chinese, for example) is converted to Unicode-16 values and can be copied and pasted into Unicode-aware applications (e.g. most web browsers and modern word processing software). See the examples below.

UPDATE: With k2pdfopt v2.x, if the source PDF document has searchable or highlightable text, then either k2pdfopt output type (native PDF or the default re-flowed text mode) should also have searchable text without having to resort to time-consuming OCR. OCR should only be necessary if the source document is scanned and does not already have a text/OCR layer.


(k2pdfopt -ocr pooh.pdf)

OCR ENGINE CHOICE: TESSERACT VS. GOCR
OCR is not turned on by default. You must select it with the -ocr command-line option (or via "oc" in the interactive menu). You can choose between two OCR engines to do the conversion to text. The default is Google's open-source Tesseract, which requires support files to be installed on your PC (see below). The other option is GOCR. GOCR requires no additional files and is more than ten times faster than Tesseract, but Tesseract is far more accurate, is still reasonably fast (~25 words per second on a modern PC), and supports multiple languages (GOCR supports only English / ASCII). Because of this, I decided to make Tesseract the default. See the examples below. The -ocrvis t option (new in v1.51) causes only the OCR'd text to show:

Tesseract 3.01
Conversion time: 15 s
k2pdfopt -ocr -ocrhmax 0.5 -ocrvis t pooh.pdf
       
GOCR 0.49
Conversion time: 3 s
k2pdfopt -ocr g -ocrhmax 0.5 -ocrvis t pooh.pdf


UNICODE-16 ALTERNATE LANGUAGE EXAMPLE (SIMPLIFIED CHINESE)
In k2pdfopt v1.63, any language Tesseract OCR supports can be converted to Unicode-16 characters. The example below shows the OCR results on simplified Chinese using Tesseract's simplified Chinese training data. Use the -ocrlang option to select your language. If no language is specified, the most recently dated training file in the Tesseract training folder is used. Note that if you use -ocrvis t with a language such as Chinese, the text will not display correctly in the PDF file because k2pdfopt does not embed Chinese fonts (or any other non-standard fonts) into the PDF file. If you copy and paste the text into a Unicode-16 compatible application, however, it will come out as Chinese characters.

(Source PDF file)
       
k2pdfopt -ocr t -ocrlang chi_sim -col 1 crouching_tiger.pdf
(Try copying and pasting the text from the PDF file.)
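
The default-language rule described above (the most recently dated training file wins when -ocrlang is not given) can be pictured with a short Python sketch. This is not k2pdfopt's own code. It assumes that "most recently dated" means the newest file modification time and that training files are named like eng.traineddata. The function name newest_training_language is made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def newest_training_language(tessdata_dir):
    """Return the language code of the most recently modified
    *.traineddata file in tessdata_dir, or None if there is none.

    Assumption: "most recently dated" means newest modification time.
    """
    files = list(Path(tessdata_dir).glob("*.traineddata"))
    if not files:
        return None
    newest = max(files, key=lambda p: p.stat().st_mtime)
    # "chi_sim.traineddata" -> "chi_sim"
    return newest.stem


if __name__ == "__main__":
    # Demo with empty placeholder files and explicit timestamps.
    with tempfile.TemporaryDirectory() as folder:
        for name, mtime in (("eng.traineddata", 1000),
                            ("chi_sim.traineddata", 2000)):
            path = Path(folder) / name
            path.touch()
            os.utime(path, (mtime, mtime))
        print(newest_training_language(folder))
```

If the language this picks is not the one you want, pass -ocrlang explicitly rather than relying on file dates.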


MULTIPLE LANGUAGES
You can specify multiple languages for OCR if you use Tesseract, e.g. English and Chinese using

   -ocrlang eng+chi_sim

(Be sure not to put any spaces in 'eng+chi_sim'.) In my tests, though, I haven't gotten good results with multiple languages.
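
If you generate k2pdfopt command lines from a script, a small helper can join the language codes with '+' and reject stray spaces. The following Python sketch only builds the argument list and does not run k2pdfopt. The helper name build_ocr_command is hypothetical, and it uses only options that appear on this page (-ocr t, -ocr g, -ocrlang).

```python
def build_ocr_command(pdf, engine="t", languages=None):
    """Build a k2pdfopt argument list for an OCR run.

    engine:    "t" for Tesseract or "g" for GOCR.
    languages: optional list of Tesseract language codes,
               e.g. ["eng", "chi_sim"].
    """
    if engine not in ("t", "g"):
        raise ValueError("engine must be 't' (Tesseract) or 'g' (GOCR)")
    cmd = ["k2pdfopt", "-ocr", engine]
    if languages:
        if engine != "t":
            raise ValueError("language selection is for Tesseract only")
        for lang in languages:
            # A space or '+' inside a code would corrupt the joined value.
            if not lang or "+" in lang or any(ch.isspace() for ch in lang):
                raise ValueError("bad language code: %r" % (lang,))
        cmd += ["-ocrlang", "+".join(languages)]
    cmd.append(pdf)
    return cmd


if __name__ == "__main__":
    args = build_ocr_command("crouching_tiger.pdf",
                             languages=["eng", "chi_sim"])
    print(" ".join(args))
```

The returned list can be handed to whatever launches the program. Passing the language value as a single list element keeps 'eng+chi_sim' together as one argument.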

TESSERACT INSTALLATION
See the Tesseract Wiki for an explanation of the Tesseract project and how to install language training files.

NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages.

I am hoping to eventually offer an easier way of installing Tesseract's language training files, but for now you'll need to download the one(s) you want from the Tesseract download page:

You can choose the training data for the language you prefer, for example, the English language data (circled in image above). Unfortunately, it is in the form of a gzipped tar file (ends in .tar.gz or .tgz), which Windows cannot extract without some help, so you may need to install something like 7-zip to extract the files from the downloaded archive. As an example, let's assume you extract the files to c:\tesseract-ocr\tessdata (this is probably as good a place as any). Then you'll want to set the environment variable TESSDATA_PREFIX to c:\tesseract-ocr\ as follows (be sure to put the trailing slash at the end!):
(You can see how to set an environment variable here.)
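
If you have Python installed, the two steps above (extracting the archive and checking TESSDATA_PREFIX) can also be scripted. The sketch below is only an illustration and is not part of k2pdfopt. It assumes the training files inside the archive are named like eng.traineddata and that those are the only files you need; if your archive carries other support files, extract the whole archive with 7-zip instead. The function names are made up for this example.

```python
import os
import tarfile
from pathlib import Path


def extract_training_files(archive, tessdata_dir):
    """Copy every *.traineddata member of a .tar.gz archive into
    tessdata_dir and return the list of file names written."""
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            # Drop any folder part stored in the archive.
            name = Path(member.name).name
            if member.isfile() and name.endswith(".traineddata"):
                source = tar.extractfile(member)
                (tessdata_dir / name).write_bytes(source.read())
                written.append(name)
    return written


def check_tessdata_prefix(prefix):
    """Return a list of problems found with a TESSDATA_PREFIX value.
    An empty list means no problem was found."""
    if not prefix:
        return ["TESSDATA_PREFIX is not set"]
    problems = []
    if not prefix.endswith(("/", "\\")):
        problems.append("TESSDATA_PREFIX should end with a trailing slash")
    tessdata = Path(prefix) / "tessdata"
    if not tessdata.is_dir():
        problems.append("no tessdata folder under %s" % prefix)
    elif not any(tessdata.glob("*.traineddata")):
        problems.append("no .traineddata files in %s" % tessdata)
    return problems


if __name__ == "__main__":
    # Report on whatever TESSDATA_PREFIX is currently set to.
    for problem in check_tessdata_prefix(os.environ.get("TESSDATA_PREFIX", "")):
        print("Problem:", problem)
```

Run as a script, this prints nothing when TESSDATA_PREFIX ends in a slash and points at a folder whose tessdata subfolder holds at least one training file.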

If you have correctly set up Tesseract, you'll see the Tesseract banner when you run k2pdfopt with OCR turned on, and the selected language will also show (as of v1.63):


 

This page last modified
Saturday, 30-Nov-2013 12:58:25 MST