【摘要】Purpose:Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text such as news feed but not coherent texts such as documents of a digital library which have hierarchical structures. To overcome the focus on linear segmentation in document segmentation and to realize the purpose of hierarchical segmentation for a digital library’s structured resources, this paper aimed to propose a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.Design/methodology/approach:MHTSS adopts up-down segmentation strategy to divide a structured, digital library document into a document segmentation tree. Specifically, it works in a three-stage process, such as document parsing, coarse segmentation based on document access structures and fine-grained segmentation based on lexical cohesion.Findings:This paper analyzed limitations of document segmentation methods for the structured, digital library resources. Authors found that the combination of document access structures and lexical cohesion techniques should complement each other and allow for a better segmentation of structured, digital library resources. Based on this finding, this paper proposed the MHTSS for the structured, digital library resources. To evaluate it, MHTSS was compared to the TT and C99 algorithms on real-world digital library corpora. Through comparison, it was found that the MHTSS achieves top overall performance.Practical implications:With MHTSS, digital library users can get their relevant information directly in segments instead of receiving the whole document. This will improve retrieval performance as well as dramatically reduce information overload.Originality/value:This paper proposed MHTSS for the structured, digital library resources, which combines the document access structures and lexical cohesion techniques to decide section breaks. With this system, end-users can access a document by sections through a document structure tree.
【关键词】Hierarchical segmentation, Access structures, AIC, Digital library resources, Lexical cohesion, Optimum partitioning clustering, Structured segmentation
该文发表于《The Electronic Library》第35卷第1期