Classification of Multilingual Mathematical Papers in DML-CZ Preliminary Excursion

Bibliographic Details
Main Author: Petr Sojka
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language: English
Subjects: DML
Online Access: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.215.136
http://www.fi.muni.cz/usr/sojka/download/raslan2007/7.pdf
Description
Summary: Abstract. The growth of digital repositories of scientific documents is sped up by various digitisation activities. Almost all papers in mathematical journals are reviewed by either Mathematical Reviews or Zentralblatt MATH, summing up to more than 2,000,000 entries. In this paper we discuss the possibilities and the experiments we carried out on the data of the Czech Digital Mathematics Library, DML-CZ, with the goal of developing novel scalable methods of document classification and retrieval of multilingual mathematical papers.

1 Motivation – Project of Digital Mathematics Library

You always admire what you really don't understand. (Blaise Pascal)

Mathematicians from all over the world dream of a World Digital Mathematics Library [1], where (almost) all reviewed mathematical papers in all languages will be stored, indexed and searchable with today's leading-edge information retrieval machinery. Good resources towards this goal, in addition to the publishers' digital libraries, are twofold:

1. 'local' repositories of digitised papers such as NUMDAM [2] and DML-CZ [3], or born-digital archives such as CEDRAM [4] and arXiv.org > math;
2. two review services for the mathematical community: both Zentralblatt MATH and Mathematical Reviews have more than 2,000,000 entries (paper metadata and reviews) from more than 2300 mathematical serials and journals.

Google Scholar is becoming useful in the meantime, but it lacks specialised math search, and its metadata, guessed from parsing crawled papers, are of low quality (compared to the controlled repositories). Both review services agreed on the supported Mathematics Subject Classification (MSC) scheme, and the currently used MSC 2000 is being revised for use in