Heading-based sectional hierarchy identification for HTML documents


Pembe F. C., Güngör T.

22nd International Symposium on Computer and Information Sciences, ISCIS 2007, Ankara, Turkey, 7-9 November 2007, pp. 75-80

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/iscis.2007.4456839
  • City: Ankara
  • Country: Turkey
  • Page Numbers: pp.75-80
  • Istanbul Kültür University Affiliated: No

Abstract

Most documents on the Web are written in HTML, a format designed primarily for presenting data. As a result, limitations arise when these documents are processed automatically to interpret their content semantically. One such limitation is that HTML does not adequately represent a document's sectional hierarchy (i.e., its sections and subsections) or the headings within that hierarchy. Obtaining this information automatically is difficult because of the underlying format and the cluttered structure of most Web pages. In this paper, we propose a novel approach to extracting heading-based sectional hierarchies from HTML documents. This is the first part of our research, in which we aim to use this information in automatic summaries to improve the Web search experience of Internet users. ©2007 IEEE.