Marinai S., Fujisawa H. (eds.) Machine Learning in Document Analysis and Recognition


Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction

Floriana Esposito, Stefano Ferilli, Teresa M.A. Basile, and Nicola Di Mauro

Università degli Studi di Bari

Dipartimento di Informatica

Via Orabona, 4

70126 Bari - Italy

{esposito,ferilli,basile,ndm}@di.uniba.it

Summary. In recent years, the spread of computers and the Internet has made a significant number of documents available in digital format. Collecting them in digital repositories raises problems that go beyond simple acquisition issues and creates the need to organize and classify them in order to improve the effectiveness and efficiency of the retrieval procedure. The success of such a process is tightly related to the ability to understand the semantics of the document components and content. Since the obvious solution of manually creating and maintaining an updated index is clearly infeasible, due to the huge amount of data under consideration, there is a strong interest in methods that can automatically acquire such knowledge. This work presents a framework that intensively exploits intelligent techniques to support different tasks of automatic document processing, from acquisition to indexing and from categorization to storage and retrieval.

The prototypical version of the system DOMINUS is presented, whose main characteristic is the use of a Machine Learning Server, a suite of different inductive learning methods and systems, among which the most suitable for each specific document processing phase is chosen and applied. The core system is the incremental first-order logic learner INTHELEX. Thanks to incrementality, it can continuously update and refine the learned theories, dynamically extending its knowledge to handle even completely new classes of documents.

Since DOMINUS is general and flexible, it can be embedded as a document management engine into many different Digital Library systems. Experiments in a real-world domain scenario, scientific conference management, confirmed the good performance of the proposed prototype.

1 Introduction & Motivations

In the World Wide Web era, a huge number of documents in digital format are spread throughout the most diverse Web sites, and a specific research area, focused on principles and techniques for setting up and managing document collections in digital form, has quickly expanded. Usually, these large repositories of digital documents are defined as Digital Libraries, intended as distributed collections of textual and/or multimedia documents, whose main goal is the acquisition and the organization of the information contained therein.

F. Esposito et al.: Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction, Studies in Computational Intelligence (SCI) 90, 105–138 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008

During the past years, considerable effort has been spent on developing intelligent techniques to automatically transform paper documents into digital format, saving the original layout with the aim of reconstruction. Machine Learning techniques have been applied to attain this goal, and a successful application in preserving cultural heritage material is reported in [1].

Today, most documents are generated, stored and exchanged in a digital format, although it is still necessary to maintain the typing conventions of classical paper documents. The specific problem we deal with is the application of intelligent techniques to a system for managing a collection of digital documents on the Internet; such a system, aimed at automatically extracting significant information from the documents, is useful to properly store, retrieve and manage them in a Semantic Web perspective [2]. Indeed, organizing the documents on the grounds of the knowledge they contain is fundamental for being able to correctly access them according to the user’s particular needs. For instance, in the scientific papers domain, in order to identify the subject of a paper and its scientific context, an important role is played by the information available in components such as Title, Authors, Abstract and Bibliographic references. This last component in particular, with respect to the others, is a source of problems both because it is placed at the end of the paper, and because it is, in turn, made up of different sub-components containing various kinds of information, to be handled and exploited in different ways.

At the moment we are not aware of techniques able to automatically annotate the layout components of digital documents without reference to a specific template. We argue that a process is still necessary to identify the significant components of a digital document through three typical phases: Layout Analysis, Document Image Classification and Document Image Understanding. As is widely known, Layout Analysis is the perceptual organization process that aims at identifying the single blocks of a document and at detecting relations among them (Layout Structure); then, associating the proper logical role to each component yields the Document Logical Structure. Since the logical structure differs according to the kind of document, two steps are in charge of identifying such a structure: Document Image Classification, aiming at the categorization of the document (e.g., newspaper, scientific paper, email, technical report, call for papers), and Document Image Understanding, aiming at the identification of the significant layout components for that class. Once the class has been defined, it is possible to associate to each component a tag that expresses its role (e.g., signature, object, title, author, abstract, footnote, etc.). We propose to apply multistrategy Machine Learning techniques along these phases of document processing, where the classical statistical and numerical approaches to classification and learning may fail, being unable to deal with the lack of a strict layout regularity in the variety of documents available online.

The problem of Document Image Processing requires a first-order language representation for two reasons. First, classical attribute-value languages describe a document by means of a fixed set of features, each of which takes a value from a corresponding pre-specified value set; the exploitation of this language in this domain represents a limitation, since one cannot know a priori how many components make up a generic document. Second, in an attribute-value formalism it is not possible to represent and efficiently handle the relationships among components; the information coming from the topological structure of all components in a document turns out to be very useful in document understanding. For instance, in a scientific paper, it is useful to know that the acknowledgments usually appear above the references section and at the end of the document, or that the affiliation of the authors is generally reported at the beginning of the document, below or to the right of their names.

The continuous flow of new and different documents in a Web repository or in Digital Libraries calls for incremental abilities of the system, which must be able to update or revise faulty knowledge previously acquired for identifying the logical structure of a document. Traditionally, Machine Learning methods that automatically acquire knowledge in developing intelligent systems need to be provided with a set of training examples, belonging to a defined number of classes, and exploit them altogether in a batch way.

Although the term incremental is sometimes used to describe some learning methods [3, 4, 5, 6], incrementality generally refers to the possibility of adjusting some parameters in the model on the grounds of new observations that become available when the system is already operational. Thus, classical approaches require that the number of classes is defined and fixed from the beginning of the induction step: this prevents dealing with totally new instances, belonging to new classes, which require the ability to incrementally revise a domain theory as soon as new data are encountered. Indeed, Digital Libraries require autonomous or semi-autonomous operation and adaptation to changes in the domain, the context, or the user needs. If any of these changes happens, the classical approach requires that the entire learning process is restarted to produce a model capable of coping with the new scenario. Such requirements suggest that incremental learning, as opposed to classical batch learning, is needed whenever either incomplete information is available at the time of initial theory generation, or the nature (and the kinds) of the concepts evolves dynamically. This is the case, e.g., when the typing style of documents that nevertheless belong to the same class changes over time, or when a completely new class is introduced. Incremental processing allows for continuous responsiveness to changes in the context, can potentially improve efficiency, and deals with concept evolution. The incremental setting implicitly assumes that the information (observations) gained at any given moment is incomplete, and thus that any learned theory could be susceptible to changes.

This chapter presents the prototypical version of DOMINUS (DOcument Management INtelligent Universal System): such a system is characterized by the intensive exploitation of intelligent techniques in each step of document processing, from acquisition to indexing and from categorization to storage and retrieval. Since it is general and flexible, it can be embedded as a document management engine into many different Digital Library systems. In the following, after a brief description of the architecture of DOMINUS, the results of the layout analysis on digital documents are discussed, with the interesting improvements achieved by using kernel-based approaches and incremental first-order learning techniques: the satisfactory results in document layout correction, classification and understanding make it possible to start an effective structural metadata extraction. Then, the categorization, filing and indexing tasks are described, with the results obtained in the effective retrieval of scientific documents. Finally, the application of the system in a real-world domain scenario, scientific conference management, is reported and discussed.

2 The Document Management System Architecture

This section briefly presents the overall architecture of DOMINUS, reported in Figure 1. A central role is played by the Learning Server, which intervenes during different processing steps in order to continuously adapt the knowledge, taking into consideration new experimental evidence and changes in the context. The corresponding process flow performed by the system, from the acquisition of the original digital documents to text extraction and indexing, is reported in Figure 2.

Fig. 1. Document Management System Architecture

The layout analysis process on documents in digital format starts with the application of a pre-processing module, called WINE (Wrapper for the Interpretation of Non-uniform Electronic document formats), that rewrites basic PostScript operators to turn their drawing instructions into objects (see Section 3). It takes as input a digital document and produces (through an intermediate vector format) the initial XML basic representation of the document, which describes it as a set of pages made up of basic blocks. Due to the large number of basic blocks discovered by WINE, which often correspond to fragments of words, a first aggregation based on block overlapping or adjacency is necessary, yielding composite blocks corresponding to whole words. The number of blocks after this step is still large, thus a further aggregation (e.g., of words into lines) is needed. Since grouping techniques based on the mean distance between blocks proved unable to correctly handle multi-column documents, this task was cast as a multiple instance problem (see Section 3.1) and solved by exploiting the kernel-based method proposed in [7], implemented in a Learning Server module that is able to generate rewriting rules suggesting how to set some parameters in order to group word blocks together into lines. The inferred rules are stored in the Theories knowledge base for future exploitation by RARE (Rule Aggregation REwriter) and for modification.
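
As a rough illustration of the first aggregation step, the following Python sketch merges fragment boxes that overlap or nearly touch on the same text row. The function names, the tuple layout and the `max_gap` threshold are assumptions made for this example; they are not taken from DOMINUS, whose actual aggregation criteria are not detailed here.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1), upper-left and lower-right corners.
Box = Tuple[float, float, float, float]


def close_enough(a: Box, b: Box, max_gap: float) -> bool:
    """True if a and b share some vertical extent and overlap or nearly touch horizontally."""
    shares_row = min(a[3], b[3]) > max(a[1], b[1])
    horizontal_gap = max(a[0], b[0]) - min(a[2], b[2])  # negative when overlapping
    return shares_row and horizontal_gap <= max_gap


def union(a: Box, b: Box) -> Box:
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def aggregate_fragments(fragments: List[Box], max_gap: float = 1.0) -> List[Box]:
    """Repeatedly merge pairs of boxes until no pair is close enough (assumed criterion)."""
    blocks = list(fragments)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if close_enough(blocks[i], blocks[j], max_gap):
                    blocks[i] = union(blocks[i], blocks[j])
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks
```

A fixed gap threshold of this kind is exactly what tends to break down when words must be grouped into lines in multi-column layouts, which is why that later step is learned rather than hard-coded.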

Once such a line-block representation is generated, DOC (Document Organization Composer) collects the semantically related blocks into groups by identifying the surrounding frames based on white spaces and the results of the background structure analysis. This is an improvement of Breuel’s original algorithm [8], which iteratively finds the maximal white rectangles in a page: here the process is forced to stop before finding insignificant white spaces such as inter-word or inter-line ones (see Section 3.2).
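
The idea can be sketched in Python as a branch-and-bound search for the largest empty rectangle, repeated with each found rectangle added as an obstacle. This is only a minimal sketch: the names are hypothetical, and the stopping rule used here (a minimum width and height for acceptable white rectangles) is an assumption standing in for the criterion actually adopted by DOC, which is described in Section 3.2.

```python
import heapq
from itertools import count
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors of a and b intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def largest_white_rectangle(page: Rect, obstacles: List[Rect],
                            min_w: float, min_h: float) -> Optional[Rect]:
    """Largest obstacle-free rectangle in page; sub-rectangles below min_w x min_h are pruned."""
    tie = count()  # tie-breaker so the heap never compares rectangles or lists
    area = (page[2] - page[0]) * (page[3] - page[1])
    heap = [(-area, next(tie), page, [o for o in obstacles if overlaps(o, page)])]
    while heap:
        _, _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # largest candidate with no obstacle inside
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        # Pivot: the obstacle closest to the centre of the current rectangle.
        p = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2 + ((o[1] + o[3]) / 2 - cy) ** 2)
        # Split into the regions left of, right of, above and below the pivot.
        for s in ((r[0], r[1], p[0], r[3]), (p[2], r[1], r[2], r[3]),
                  (r[0], r[1], r[2], p[1]), (r[0], p[3], r[2], r[3])):
            w, h = s[2] - s[0], s[3] - s[1]
            if w > 0 and h > 0 and w >= min_w and h >= min_h:
                heapq.heappush(heap, (-w * h, next(tie), s, [o for o in obs if overlaps(o, s)]))
    return None


def white_rectangles(page: Rect, blocks: List[Rect], min_w: float, min_h: float,
                     max_count: int = 50) -> List[Rect]:
    """Collect white rectangles in decreasing size until none meets the size threshold."""
    found: List[Rect] = []
    obstacles = list(blocks)
    while len(found) < max_count:
        r = largest_white_rectangle(page, obstacles, min_w, min_h)
        if r is None:
            break
        found.append(r)
        obstacles.append(r)  # already-found white space must not be reported again
    return found
```

With two text columns on a page, a size threshold larger than the margins makes the search return only the gutter between the columns and then stop, which mirrors the intent of halting before insignificant white spaces are reached.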

At the end of this step, some blocks might not be correctly recognized. In such a case a layout correction phase is needed, which is automatically performed in DOCG (Document Organization Correction Generator) by exploiting embedded rules stored in the Theories knowledge base. Such rules were automatically learned, using the Learning Server, from previous manual corrections collected on some documents during the first trials.

Fig. 2. Document Management System Process Flow

Once the layout structure has been correctly and definitively identified, a semantic role must be associated to each significant component in order to perform the automatic extraction of the interesting text with the aim of improving document indexing. This step is performed by DLCC (Document and Layout Components Classifier) by first associating the document to a class that expresses its type and then associating to every significant layout component a tag expressing its role. Both these steps are performed thanks to theories previously learned and stored in the Theories knowledge base. In case of failure these theories can be properly updated. The theory revision step is performed by a first-order incremental learning system that runs on the new observations and tries to modify the old theories in the knowledge base. At the end of this step both the original document and its XML representation, enriched with class information and component annotations, are stored in the Internal Document Database (IDD).

Finally, the text is extracted from the significant components and the Indexing Server is called by the IGT (Index Generator for Text) module to manage such information, useful for an effective content-based document retrieval.


3 Layout Structure Recognition

Based on the ODA/ODIF standard, any document can be progressively partitioned into a hierarchy of abstract representations, called its layout structure. Here we describe an approach implemented for discovering a full layout hierarchy in digital documents based primarily on layout information.

The layout analysis process starts with a preliminary preprocessing step performed by a module that takes as input a generic digital document and produces a corresponding vectorial description. An algorithm for performing this task is PSTOEDIT [9], but it was discarded because it only applies to PostScript (PS) and Portable Document Format (PDF) documents and returns a description lacking sufficient detail for our purposes.

Thus, a module named WINE has been purposely developed. At the moment, it deals with digital documents in PS or PDF formats, which represent the current de facto standard for document interchange. The PostScript [10] language is a simple interpretative programming language with powerful graphical capabilities that allow any page to be described precisely. The PDF [11] language is an evolution of PostScript that rapidly gained acceptance as a file format for digital documents. Like PostScript, it is an open standard, enabling integrated solutions from a broad range of vendors. In particular, WINE consists of a rewriting of basic PostScript operators that turns the instructions into objects. For example, the PS instruction to display a text becomes an object describing a text with attributes for the geometry (location on the page) and appearance (font, color, etc.). The output of WINE is a vector format describing the initial digital document as a set of pages, each of which is composed of basic blocks. The descriptors used by WINE for representing a document are the following:

box(id,x0,y0,x1,y1,font,size,RGB,row,string): a piece of text in the document, represented by its bounding box;
stroke(id,x0,y0,x1,y1,RGB,thickness): a graphical (horizontal/vertical) line of the document;
fill(id,x0,y0,x1,y1,RGB): a closed area filled with one color;
image(id,x0,y0,x1,y1): a raster image;
page(n,w,h): page information;

where: id is the block identifier; (x0,y0) and (x1,y1) are respectively the upper-left and lower-right coordinates of the block (note that x0=x1 for vertical lines and y0=y1 for horizontal lines); font is the font type; size represents the text size; RGB is the color of the text, line or area in #rrggbb format; row is the index of the row in which the text appears; string is the text of the document contained in the block; thickness is the thickness of the line; n represents the page number; w and h are the page width and height, respectively. Figure 3 reports an extract of the vectorial transformation of the document.
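
To make the descriptors concrete, they can be mirrored by simple record types. The Python sketch below is an assumption-laden illustration: the class names, field types (for instance, integer identifiers) and the `orientation` helper are hypothetical, and only the field lists and the x0=x1 / y0=y1 convention come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """box(id,x0,y0,x1,y1,font,size,RGB,row,string): a piece of text."""
    id: int
    x0: float
    y0: float
    x1: float
    y1: float
    font: str
    size: float
    rgb: str      # "#rrggbb"
    row: int
    string: str


@dataclass(frozen=True)
class Stroke:
    """stroke(id,x0,y0,x1,y1,RGB,thickness): a graphical line."""
    id: int
    x0: float
    y0: float
    x1: float
    y1: float
    rgb: str
    thickness: float

    @property
    def orientation(self) -> str:
        # x0 = x1 for vertical lines, y0 = y1 for horizontal lines.
        if self.x0 == self.x1:
            return "vertical"
        if self.y0 == self.y1:
            return "horizontal"
        return "other"  # not expected for WINE strokes; kept as a safe default


@dataclass(frozen=True)
class Fill:
    """fill(id,x0,y0,x1,y1,RGB): a closed area filled with one color."""
    id: int
    x0: float
    y0: float
    x1: float
    y1: float
    rgb: str


@dataclass(frozen=True)
class Image:
    """image(id,x0,y0,x1,y1): a raster image."""
    id: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Page:
    """page(n,w,h): page number, width and height."""
    n: int
    w: float
    h: float
```

For example, `Stroke(1, 10, 20, 10, 200, "#000000", 0.5).orientation` evaluates to `"vertical"`, since both end points share the same x coordinate.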

Fig. 3. WINE output: Vectorial Transformation of the Document

Such a vectorial representation is translated into an XML basic representation, which is modified as the layout analysis process proceeds, in order to represent the document by means of increasingly complex aggregations of basic components progressively discovered by the various layout analysis phases.

3.1 A Kernel-Based Method to Group Basic Blocks

The first step in the document layout analysis concerns the identification of rules to automatically shift from the basic digital document description to a higher-level one. Indeed, by analyzing the PS or PDF source, the “elementary” blocks that make up the document, identified by WINE, often correspond just to fragments of words (see Figure 3), thus a first aggregation based on their overlapping or adjacency is needed in order to obtain blocks surrounding whole words (word-blocks). Subsequently, a further aggregation starting from the word-blocks can be performed to obtain blocks that group words into lines (line-blocks), and finally the line-blocks can be merged to build paragraphs (frames). As to the grouping of blocks into lines, since techniques based on the mean distance between blocks proved unable to correctly handle multi-column documents, we decided to apply Machine Learning approaches in order to automatically infer rewriting rules that suggest how to set some parameters in order to group rectangles (words) together into lines. To do this, RARE uses a kernel-based method to learn rewriting rules able to perform the bottom-up construction of the whole document, starting from the basic/word blocks up to the lines. Specifically, such a learning task was cast as a Multiple Instance Problem and solved by exploiting the kernel-based algorithm proposed in [7]. In our setting, each elementary block is described by means of a feature vector of the form:

[Block_Name, Page_No, X_i, X_f, Y_i, Y_f, C_x, C_y, H, W]

made up of parameters interpreted according to the representation in Figure 4, i.e.:

• Block_Name: the identifier of the considered block;
• Page_No: the number of the page in which the block is positioned;
• X_i and X_f: the x coordinate values, respectively, for the start and end point of the block;
• Y_i and Y_f: the y coordinate values, respectively, for the start and end point of the block;
• C_x and C_y: the x and y coordinate values, respectively, for the centroid of the block;
• H, W: the height and width of the block, i.e., the distances between its start and end points along, respectively, the y and x coordinates.

Fig. 4. Block Features
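
The block features listed above can be derived from a bounding box in a few lines. The following Python sketch is illustrative only: the type and function names are hypothetical, and it assumes that the centroid is the center of the bounding box.

```python
from typing import NamedTuple


class BlockFeatures(NamedTuple):
    block_name: str
    page_no: int
    x_i: float  # x of the start point
    x_f: float  # x of the end point
    y_i: float  # y of the start point
    y_f: float  # y of the end point
    c_x: float  # centroid, x coordinate
    c_y: float  # centroid, y coordinate
    h: float    # height: extent along y
    w: float    # width: extent along x


def block_features(name: str, page_no: int,
                   x0: float, y0: float, x1: float, y1: float) -> BlockFeatures:
    """Build the feature vector of a block from its bounding box corners."""
    return BlockFeatures(
        block_name=name,
        page_no=page_no,
        x_i=x0, x_f=x1,
        y_i=y0, y_f=y1,
        c_x=(x0 + x1) / 2,   # assumption: centroid = bounding-box center
        c_y=(y0 + y1) / 2,
        h=abs(y1 - y0),
        w=abs(x1 - x0),
    )
```

For a block named `b1` on page 1 with corners (10, 20) and (50, 30), this yields a centroid of (30, 25), a height of 10 and a width of 40.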

Starting with this description of the elementary blocks, the corresponding example descriptions, from which rewriting rules have to be learned, are built considering each block along with its Close Neighbor blocks. Given a block O_n and the Close Neighbor blocks CNO_{nk}, with their own descriptions:

[O_n, Page_No, X_{ni}, X_{nf}, Y_{ni}, Y_{nf}, C_{nx}, C_{ny}],
[CNO_{nk}, Page_No, X_{nki}, X_{nkf}, Y_{nki}, Y_{nkf}, C_{nkx}, C_{nky}]

we represent an example E by means of the template [O_n, CNO_{nk}], i.e.:

[New_Block_Name, Page_No, X_{ni}, X_{nf}, Y_{ni}, Y_{nf}, C_{nx}, C_{ny}, X_{nki}, X_{nkf}, Y_{nki}, Y_{nkf}, C_{nkx}, C_{nky}, D_x, D_y]

where New_Block_Name is a name for the new block built by appending the names of both the original blocks, the information about the x and y coordinates is the original one, and two new parameters, D_x and D_y, contain the information about the distances between the two blocks.
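
A possible way to assemble such an example from a block and one of its close neighbors is sketched below in Python. The tuple layout and function names are hypothetical. How the two distance parameters are measured is not specified in this passage, so the sketch assumes the gap between the facing edges of the two blocks along each axis (zero when their projections overlap); it also assumes that both blocks lie on the same page.

```python
from typing import List, Tuple, Union

# Hypothetical block description: (name, page_no, x_i, x_f, y_i, y_f, c_x, c_y)
Block = Tuple[str, int, float, float, float, float, float, float]


def axis_gap(start_a: float, end_a: float, start_b: float, end_b: float) -> float:
    """Gap between two intervals on one axis; 0.0 if they overlap (assumed distance)."""
    return max(0.0, max(start_a, start_b) - min(end_a, end_b))


def build_example(o: Block, cn: Block) -> List[Union[str, int, float]]:
    """Pair a block with a close neighbor following the [block, neighbor] template."""
    if o[1] != cn[1]:
        raise ValueError("blocks must lie on the same page")
    d_x = axis_gap(o[2], o[3], cn[2], cn[3])
    d_y = axis_gap(o[4], o[5], cn[4], cn[5])
    new_name = o[0] + cn[0]  # name built by appending the two block names
    # Original coordinates of both blocks, followed by the two distances.
    return [new_name, o[1], *o[2:], *cn[2:], d_x, d_y]
```

Under these assumptions, two word blocks on the same text row that are 5 units apart horizontally produce an example whose last two entries are 5 and 0.0.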

Given a fixed block O_n, the template [O_n, CNO_{nk}] is used to find, among all the word blocks in the document, every instance of close neighbors of the considered block O_n. Such an example (set of instances) will be labelled by an expert as positive for the target concept “the two blocks can be merged” if and only if the blocks O_n and CNO_{nk} are adjacent and belong to the same line in the original document, or as negative otherwise. Figure 5 reports an example of the selected close neighbor blocks for the block b1. All the blocks represented with dashed lines could eventually be merged, and hence they represent the positive instances for the concept merge, while dotted lines have been used to represent the blocks that could not be merged, and hence represent the negative instances for the target concept. It is worth noting that not every pair of adjacent blocks has to be considered a positive example, since they could belong to different frames in the considered document. Such a situation is reported in Figure 6. Indeed, typical cases in which a block is adjacent to the considered block but actually belongs to another frame are, e.g., when they belong to adjacent columns of a multi-column document (right part of Figure 6) or when they belong to two different frames of the original document (for example, the Title and the Authors frame, left part of Figure 6).

In such a representation, a block O_n has at least one close neighbor block and at most eight (CNO_{nk} with k ∈ {1, 2, ..., 8} or, top-down, from left to right: top left corner, top, top right corner, right, bottom right corner,

Fig. 5. Close Neighbor blocks for block b1

Fig. 6. Selection of positive and negative blocks according to the original document: one-column document on the left, two-column document on the right