A bottom-up approach for XML document classification

Thesis (M.Sc.)--Memorial University of Newfoundland, 2009. Computer Science. Includes bibliographical references (leaves 61-64). Extensible Markup Language (XML) is a simple and flexible text format derived from Standard Generalized Markup Language (SGML) [1]. It has been widely accepted as a crucial...


Bibliographic Details
Main Author:	Wu, Junwei.
Other Authors:	Memorial University of Newfoundland. Dept. of Computer Science
Format:	Thesis
Language:	English
Published:	2009
Subjects:	Data mining; XML (Document markup language) > Classification; Handle The; Newfoundland studies; University of Newfoundland
Online Access:	http://collections.mun.ca/cdm/ref/collection/theses4/id/41292

id	ftmemorialunivdc:oai:collections.mun.ca:theses4/41292
record_format	openpolar
spelling	ftmemorialunivdc:oai:collections.mun.ca:theses4/41292 2023-05-15T17:23:33+02:00 A bottom-up approach for XML document classification Wu, Junwei. Memorial University of Newfoundland. Dept. of Computer Science 2009 viii, 64 leaves : ill. Image/jpeg; Application/pdf http://collections.mun.ca/cdm/ref/collection/theses4/id/41292 Eng eng Electronic Theses and Dissertations (8.26 MB) -- http://collections.mun.ca/PDFs/theses/Wu_Junwei.pdf a3243873 http://collections.mun.ca/cdm/ref/collection/theses4/id/41292 The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission. Paper copy kept in the Centre for Newfoundland Studies, Memorial University Libraries Data mining XML (Document markup language)--Classification Text Electronic thesis or dissertation 2009 ftmemorialunivdc 2015-08-06T19:21:57Z Thesis (M.Sc.)--Memorial University of Newfoundland, 2009. Computer Science. Includes bibliographical references (leaves 61-64). Extensible Markup Language (XML) is a simple and flexible text format derived from Standard Generalized Markup Language (SGML) [1]. It has been widely accepted as a crucial component of many information-retrieval-related applications, such as XML databases and web services. One of the reasons for its wide acceptance is that its format can be customized for data transmission or storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this thesis, we propose a method to classify XML documents under the assumption that they do not share a common schema and that a schema may or may not be available, which is closer to real-world cases. Our method is similarity-based. Its main characteristic is the way it handles the roles played by the text and the structural information. Unlike most existing methods, we use a bottom-up approach, i.e., we start from the text and then embed the structural information. This is based on the observation that in XML documents with diversified tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy can achieve better performance than the existing methods for documents from sources that exhibit heterogeneous structures. Thesis Newfoundland studies University of Newfoundland Memorial University of Newfoundland: Digital Archives Initiative (DAI) Handle The ENVELOPE(161.983,161.983,-78.000,-78.000)
institution	Open Polar
collection	Memorial University of Newfoundland: Digital Archives Initiative (DAI)
op_collection_id	ftmemorialunivdc
language	English
topic	Data mining XML (Document markup language)--Classification
spellingShingle	Data mining XML (Document markup language)--Classification Wu, Junwei. A bottom-up approach for XML document classification
topic_facet	Data mining XML (Document markup language)--Classification
description	Thesis (M.Sc.)--Memorial University of Newfoundland, 2009. Computer Science. Includes bibliographical references (leaves 61-64). Extensible Markup Language (XML) is a simple and flexible text format derived from Standard Generalized Markup Language (SGML) [1]. It has been widely accepted as a crucial component of many information-retrieval-related applications, such as XML databases and web services. One of the reasons for its wide acceptance is that its format can be customized for data transmission or storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this thesis, we propose a method to classify XML documents under the assumption that they do not share a common schema and that a schema may or may not be available, which is closer to real-world cases. Our method is similarity-based. Its main characteristic is the way it handles the roles played by the text and the structural information. Unlike most existing methods, we use a bottom-up approach, i.e., we start from the text and then embed the structural information. This is based on the observation that in XML documents with diversified tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy can achieve better performance than the existing methods for documents from sources that exhibit heterogeneous structures.
author2	Memorial University of Newfoundland. Dept. of Computer Science
format	Thesis
author	Wu, Junwei.
author_facet	Wu, Junwei.
author_sort	Wu, Junwei.
title	A bottom-up approach for XML document classification
title_short	A bottom-up approach for XML document classification
title_full	A bottom-up approach for XML document classification
title_fullStr	A bottom-up approach for XML document classification
title_full_unstemmed	A bottom-up approach for XML document classification
title_sort	bottom-up approach for xml document classification
publishDate	2009
url	http://collections.mun.ca/cdm/ref/collection/theses4/id/41292
long_lat	ENVELOPE(161.983,161.983,-78.000,-78.000)
geographic	Handle The
geographic_facet	Handle The
genre	Newfoundland studies University of Newfoundland
genre_facet	Newfoundland studies University of Newfoundland
op_source	Paper copy kept in the Centre for Newfoundland Studies, Memorial University Libraries
op_relation	Electronic Theses and Dissertations (8.26 MB) -- http://collections.mun.ca/PDFs/theses/Wu_Junwei.pdf a3243873 http://collections.mun.ca/cdm/ref/collection/theses4/id/41292
op_rights	The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
_version_	1766113233648746496

Similar Items