Content-based retrieval of historical Ottoman documents stored as textual images

Abstract—There is an accelerating demand to access the visual content of documents stored in historical and cultural archives. Availability of electronic imaging tools and effective image processing techniques makes it feasible to process the multimedia data in large databases. In this paper, a fram...

Full description

Bibliographic Details
Main Authors: Ali Kemal Sinop, Özgür Ulusoy, A. Enis Çetin (Senior Member)
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language: English
Published: 2004
Subjects:
Online Access: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.63.7713
http://www.cs.bilkent.edu.tr/~oulusoy/ieee_tip.pdf
id ftciteseerx:oai:CiteSeerX.psu:10.1.1.63.7713
record_format openpolar
institution Open Polar
collection Unknown
op_collection_id ftciteseerx
language English
description Abstract: There is an accelerating demand to access the visual content of documents stored in historical and cultural archives. The availability of electronic imaging tools and effective image processing techniques makes it feasible to process the multimedia data in large databases. In this paper, a framework for content-based retrieval of historical documents in the Ottoman Empire archives is presented. The documents are stored as textual images, which are compressed by constructing a library (codebook) of the symbols occurring in a document; the symbols in the original image are then replaced with pointers into the codebook to obtain a compressed representation of the image. Features in the wavelet and spatial domains, based on the angular and distance span of shapes, are used to extract the symbols. To perform content-based retrieval in historical archives, a query is specified as a rectangular region in an input image, and the same symbol-extraction process is applied to the query region. The queries are processed on the codebook of documents, and the query images are identified in the resulting documents using the pointers in the textual images. The querying process does not require decompression of images. The new content-based retrieval framework is also applicable to many other document archives using different scripts. Index Terms: Angular and distance span, binary wavelet decomposition, content-based retrieval, historical document compression, partial symbol-wise matching.
author2 The Pennsylvania State University CiteSeerX Archives
format Text
author Ali Kemal Sinop
Özgür Ulusoy
A. Enis Çetin (Senior Member)
author_facet Ali Kemal Sinop
Özgür Ulusoy
A. Enis Çetin (Senior Member)
author_sort Ali Kemal Sinop
title Content-based retrieval of historical Ottoman documents stored as textual images
publishDate 2004
url http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.63.7713
http://www.cs.bilkent.edu.tr/~oulusoy/ieee_tip.pdf
genre The Pointers
genre_facet The Pointers
op_source http://www.cs.bilkent.edu.tr/~oulusoy/ieee_tip.pdf
op_relation http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.63.7713
http://www.cs.bilkent.edu.tr/~oulusoy/ieee_tip.pdf
op_rights Metadata may be used without restrictions as long as the oai identifier remains attached to it.
_version_ 1766216906724868096
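
Illustrative note: the abstract describes replacing each symbol in a page image with a pointer into a codebook, and answering queries on the codebook and pointers without decompressing the image. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes symbols have already been extracted, and it substitutes exact bitmap equality for the wavelet and angular/distance-span features named in the abstract. The names `compress`, `query`, `Bitmap`, and `Placed` are hypothetical.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]           # rows of '0'/'1' characters (assumed symbol format)
Placed = Tuple[int, int, Bitmap]   # (x, y, bitmap) of one already-extracted symbol
Pointer = Tuple[int, int, int]     # (x, y, codebook index)


def compress(symbols: List[Placed]) -> Tuple[List[Bitmap], List[Pointer]]:
    """Build a codebook of distinct bitmaps and replace each symbol by a pointer."""
    codebook: List[Bitmap] = []
    index: Dict[Bitmap, int] = {}  # bitmap -> position in the codebook
    pointers: List[Pointer] = []
    for x, y, bitmap in symbols:
        if bitmap not in index:
            index[bitmap] = len(codebook)
            codebook.append(bitmap)
        pointers.append((x, y, index[bitmap]))
    return codebook, pointers


def query(codebook: List[Bitmap], pointers: List[Pointer],
          query_symbols: List[Bitmap]) -> List[Tuple[int, int]]:
    """Return the (x, y) position of every occurrence of the query symbol sequence.

    Matching uses only the codebook and the pointer list; no page image is rebuilt.
    """
    if not query_symbols:
        return []
    try:
        wanted = [codebook.index(b) for b in query_symbols]
    except ValueError:
        return []                  # a query symbol does not occur in the codebook
    ids = [p[2] for p in pointers]
    n = len(wanted)
    return [pointers[i][:2]
            for i in range(len(ids) - n + 1)
            if ids[i:i + n] == wanted]


# Toy page with three distinct symbols in reading order.
A: Bitmap = ("010", "111", "101")
B: Bitmap = ("110", "110", "110")
C: Bitmap = ("111", "100", "111")
page: List[Placed] = [(0, 0, A), (4, 0, B), (8, 0, A), (12, 0, B), (16, 0, C)]

codebook, pointers = compress(page)
hits = query(codebook, pointers, [A, B])
print(len(codebook), hits)
```

In this sketch the five symbols on the toy page are stored as three codebook entries plus five pointers, and the query for the pair `A`, `B` is resolved by comparing codebook indices only. A real system would need tolerant (partial) symbol matching rather than exact equality.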