OPTICAL CHARACTER RECOGNITION (OCR)
NOTE 1: In v1.63, k2pdfopt adds Unicode-16 support to OCR.
NOTE 2: In v1.51, the -wc command-line option has been replaced with -ocrvis.
As of v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to
native ASCII characters so that the text in the output file can be searched
or copied and pasted into other applications. As of v1.63, bitmapped text
in any language that Tesseract supports (including, for example, Chinese) is converted
to Unicode-16 values and can be copied and pasted into Unicode-aware applications
(e.g. most web browsers and modern word processing software).
See the examples below.
UPDATE: With k2pdfopt v2.x, if the source PDF document has
searchable or highlightable text (e.g. if it is computer-generated or scanned but has
an OCR layer), then k2pdfopt output of either type (native PDF or the default
re-flowed text mode) should also have searchable text without having to resort
to time-consuming OCR. OCR should only be necessary if the source document is
scanned and does not already have a text/OCR layer.
(k2pdfopt -ocr pooh.pdf)
OCR ENGINE CHOICE: TESSERACT VS. GOCR
OCR is not turned on by default. You must select it with the -ocr command-line option
(or via "oc" in the interactive menu).
You can choose from two different OCR engines to do the conversion to text. The
default is Google's open-source Tesseract. It requires support files to be installed on your PC
(see below). The other option is GOCR.
GOCR requires no additional files and is more than ten times faster than Tesseract.
Tesseract, however, is far more accurate, is still reasonably fast
(~25 words per second on a modern PC), and
supports multiple languages (GOCR supports only English / ASCII).
For these reasons, I made Tesseract the default.
See the examples below. The -ocrvis t option (new in v1.51) causes only the OCR'd text to show:
Conversion time: 15 s
k2pdfopt -ocr -ocrhmax 0.5 -ocrvis t pooh.pdf
Conversion time: 3 s
k2pdfopt -ocr g -ocrhmax 0.5 -ocrvis t pooh.pdf
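As a rough illustration of what the ~25 words per second Tesseract figure means in practice, the sketch below estimates OCR time from a word count. The helper name estimate_ocr_seconds and the 90,000-word document size are assumptions made up for this example; actual speed will depend on the PC and on the quality of the scan.

```shell
# Rough OCR time estimate: word count divided by throughput (words per second).
# estimate_ocr_seconds is a hypothetical helper, not part of k2pdfopt.
estimate_ocr_seconds() {
  words=$1
  rate=$2   # words per second; ~25 is the Tesseract figure quoted above
  # Integer division, rounded up so very short documents do not report 0 s.
  echo $(( (words + rate - 1) / rate ))
}

# Example: a 90,000-word book (an assumed size) at ~25 words per second.
estimate_ocr_seconds 90000 25   # prints 3600
```

At that rate a 90,000-word book would take about an hour with Tesseract. If GOCR is more than ten times faster, the same book would need under six minutes, at the cost of accuracy.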
UNICODE-16 ALTERNATE LANGUAGE EXAMPLE (SIMPLIFIED CHINESE)
In k2pdfopt v1.63, any language Tesseract OCR supports can be converted to Unicode-16
characters. The example below shows the OCR results on simplified Chinese using
Tesseract's simplified Chinese training data. Use the
-ocrlang option to select your language. If no language is specified, the
most recently dated training file in the Tesseract training folder is used. Note
that if you use -ocrvis t with a language such as Chinese,
the text will not display correctly in the PDF file because k2pdfopt does
not embed any Chinese fonts (or other non-standard fonts) into the PDF file.
If you copy and paste the text into a Unicode-16 compatible application, however, it will
come out as Chinese characters.
(Source PDF file)
k2pdfopt -ocr t -ocrlang chi_sim -col 1 crouching_tiger.pdf
(Try copying and pasting the text from the PDF file.)
You can specify multiple languages for OCR if you use Tesseract,
e.g. English and Chinese using -ocrlang eng+chi_sim.
(Be sure not to put any spaces in 'eng+chi_sim'.) In my tests, however, combining languages has not given good results.
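If you build the command in a script, the no-spaces rule for a multi-language string such as 'eng+chi_sim' can be enforced automatically. The following is a minimal sketch for a POSIX shell; join_langs is a hypothetical helper written for this example, not something k2pdfopt provides, and the final line only prints the command so you can inspect it before running it.

```shell
# join_langs is a hypothetical helper: it joins Tesseract language codes
# with '+' and refuses any code that is empty or contains whitespace.
join_langs() {
  result=""
  for lang in "$@"; do
    case $lang in
      ""|*[[:space:]]*)
        echo "invalid language code: '$lang'" >&2
        return 1
        ;;
    esac
    # Append "+lang" if result is non-empty, otherwise just "lang".
    result="${result:+$result+}$lang"
  done
  printf '%s\n' "$result"
}

# Print (do not run) the resulting command line for English plus simplified Chinese.
echo "k2pdfopt -ocr t -ocrlang $(join_langs eng chi_sim) crouching_tiger.pdf"
```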
See the Tesseract Wiki for an explanation of the Tesseract project and how to install language training files.
To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages.
I am hoping to eventually offer an easier way of installing Tesseract's language training files,
but for now you'll need to download the one(s) you want from the
Tesseract download page:
You can choose the training data for the language you prefer, for example, the English language data (circled in image above). Unfortunately, it is in the form
of a gzipped tar file (ends in .tar.gz or .tgz),
which Windows cannot extract without some help, so you may
need to install something like 7-zip to extract the files
from the downloaded archive.
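On systems that provide a tar command (Linux, OS X, or a Unix-like shell on Windows), the archive can also be unpacked from the command line. This is only a sketch: extract_tessdata is a hypothetical helper, and the archive name and destination in the usage comment are placeholders for whatever you actually downloaded.

```shell
# extract_tessdata is a hypothetical helper: it unpacks a downloaded
# training-data archive (.tar.gz or .tgz) into a destination folder.
extract_tessdata() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  # -x extract, -z gunzip, -f archive file, -C change to dest before extracting
  tar -xzf "$archive" -C "$dest"
}

# Usage (placeholder names):
#   extract_tessdata downloaded-language-data.tar.gz /some/folder
```

Check where the files end up afterward, since the archive may carry its own folder hierarchy.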
As an example, let's assume you extract the files to c:\tesseract-ocr\tessdata (this is probably as good a place as any). Then
you'll want to set the environment variable TESSDATA_PREFIX to c:\tesseract-ocr\ as follows
(be sure to put the trailing slash at the end!):
(You can see how to set an environment variable here.)
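At a Windows command prompt, the variable from the example above can be set for the current session with: set TESSDATA_PREFIX=c:\tesseract-ocr\

On Linux or OS X, or in a Unix-like shell on Windows, a quick sanity check of the setup might look like the sketch below. The helper check_tessdata_prefix is hypothetical, and it assumes the training files carry the .traineddata extension and sit in a tessdata folder directly under the prefix.

```shell
# check_tessdata_prefix is a hypothetical helper: it checks that the given
# prefix ends in a slash and that <prefix>tessdata/ holds training files.
check_tessdata_prefix() {
  prefix=$1
  case $prefix in
    */) ;;   # trailing slash present
    *)
      echo "prefix must end with a slash: $prefix" >&2
      return 1
      ;;
  esac
  # Look for at least one *.traineddata file (assumed file extension).
  for f in "${prefix}tessdata"/*.traineddata; do
    if [ -f "$f" ]; then
      return 0
    fi
  done
  echo "no .traineddata files found in ${prefix}tessdata" >&2
  return 1
}

# Usage, once the variable is set:
#   check_tessdata_prefix "$TESSDATA_PREFIX" && echo "looks OK"
```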
If you have correctly set up Tesseract, you'll see the Tesseract banner when you run k2pdfopt
with OCR turned on, and the selected language will also show (as of v1.63):