the future of libraries in the digital age
Your Name and Title:
Janusz S. Bień, professor
Library, School, or Organization Name:
University of Warsaw, Formal Linguistics Department
Co-Presenter Name(s):
To be decided later (possibly Jakub Wilk and/or Michał Rudolf)
Area of the World from Which You Will Present:
Warsaw, Poland
Language in Which You Will Present:
English
Target Audience(s):
Digital library staff
Short Session Description (one line):
Scanned publications in digital libraries: new Open Source DjVu tools
Full Session Description (as long as you would like):
The DjVu technology is described by its authors as "an image
compression technique, a document format, and a software platform for
delivering documents images over the Internet"; according to recent
statistics, about 80% of the documents stored in Polish digital
libraries are in this format. Besides the commercial software
supporting this technology, there is also the DjVuLibre suite of Open
Source tools and utilities, developed by the creators of the technology.
In the presentation another Open Source suite of programs will be
discussed. It consists of two sets.
The first set contains programs for the creation and improvement of
DjVu documents, including the results of Optical Character
Recognition. A typical OCR program outputs its results as a PDF
"sandwich" document containing text under the image (although since
version 11 ABBYY FineReader can save its output directly as a DjVu
file, the output in PDF form contains more information). The pdf2djvu
program conceived by Jakub Wilk (http://jwilk.net/software/pdf2djvu)
converts PDF files into DjVu, preserving all the features
(e.g. outlines) which are representable in the latter format.
The purpose of another program, also conceived by Jakub Wilk
(http://jwilk.net/software/didjvu), is the conversion of graphic files
into DjVu documents consisting of foreground (the printed text),
mask and background (e.g. illustrations) layers. Such separation not
only makes a high compression ratio possible, but also improves the
quality of OCR results, since OCR should operate only on the
foreground or the mask.
The third program, named ocrodjvu for historical reasons
(http://jwilk.net/software/ocrodjvu), is a wrapper for several Open
Source OCR programs including Tesseract, which achieves quality
comparable with that of commercial systems
(cf. e.g. the test results).
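As a rough illustration of how the three programs of the first set might be
chained, the following Python sketch only builds the command lines. The file
names are hypothetical, the language code "pol" is an assumption, and the
option names (-o for pdf2djvu, the "encode" subcommand with -o for didjvu,
--engine, --language and --in-place for ocrodjvu) are given as recalled from
the manual pages of the tools; please verify them with --help before use.

```python
import shlex


def pdf2djvu_cmd(pdf_path, djvu_path):
    """Command converting a PDF "sandwich" file into a DjVu document."""
    # Assumption: -o names the output file.
    return ["pdf2djvu", "-o", djvu_path, pdf_path]


def didjvu_cmd(image_path, djvu_path):
    """Command converting a scanned image into a layered DjVu document."""
    # Assumption: the "encode" subcommand writes a single DjVu file via -o.
    return ["didjvu", "encode", "-o", djvu_path, image_path]


def ocrodjvu_cmd(djvu_path, engine="tesseract", language="pol"):
    """Command adding a hidden text layer to an existing DjVu document."""
    # Assumption: --in-place stores the OCR result in the input file itself.
    return ["ocrodjvu", "--engine", engine, "--language", language,
            "--in-place", djvu_path]


if __name__ == "__main__":
    # Hypothetical file names; the commands are printed, not executed.
    for cmd in (
        pdf2djvu_cmd("book.pdf", "book.djvu"),
        didjvu_cmd("page001.png", "page001.djvu"),
        ocrodjvu_cmd("page001.djvu"),
    ):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the
corresponding program is installed and the input files exist.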
The second set of programs concerns the delivery of DjVu documents to
the users. It consists of a search engine server and two kinds of
clients: marasca, installable as a WWW site, and djview4poliqarp, a
standalone client installable on the user's computer. As the server is
based on the Poliqarp corpus tool, the whole set is called simply
Poliqarp for DjVu. The author of djview4poliqarp is Michał Rudolf; the
rest of the system was created by Jakub Wilk.
The tools have been developed within a project directed by the
present author; the results are available under the
GNU General Public License.
Websites / URLs Associated with Your Session:
http://djvu.org/
http://bc.klf.uw.edu.pl/177/
http://bc.klf.uw.edu.pl/173/
http://poliqarp.wbl.klf.uw.edu.pl
http://jwilk.net/software/
https://bitbucket.org/mrudolf/djview-poliqarp
https://bitbucket.org/jsbien/ndt/wiki/wyniki
Tags: 2.012ContentCreation
Reply by Janusz S. Bień on October 3, 2012 at 12:24pm
The slides for my presentation are already available in our digital library: http://bc.klf.uw.edu.pl/298/.
Regards
Janusz