CSLI Publications, 1998, 212 pp.
During the last few years, a new approach to language processing has started to emerge. This approach, which has become known under the name of "Data Oriented Parsing" or "DOP", embodies the assumption that human language comprehension and production work with representations of concrete past language experiences, rather than with abstract grammatical rules. The models that instantiate this approach therefore maintain corpora of linguistic representations of previously occurring utterances. New utterance-representations are constructed by productively combining (partial) structures from the corpus. A probability model is used to choose, from the collection of different structures of different sizes, those that make up the most appropriate representation of an utterance.
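The combination step described above can be made concrete with a minimal sketch of the simplest DOP probability model: each corpus fragment gets a probability proportional to its frequency among fragments with the same root label, and a derivation's probability is the product of the probabilities of the fragments it combines. The fragment inventory and counts below are invented toy data, not taken from the book's corpora.

```python
from collections import Counter
from functools import reduce

# Toy fragment inventory: (root label, fragment) -> corpus count.
# Fragments may be of any size, from single rules to larger
# lexicalized tree chunks -- hypothetical counts for illustration.
fragment_counts = Counter({
    ("S",  "S -> NP VP"): 2,
    ("S",  "S -> NP [VP saw NP]"): 1,  # a larger, partly lexicalized fragment
    ("NP", "NP -> she"): 3,
    ("NP", "NP -> the dog"): 1,
    ("VP", "VP -> saw NP"): 2,
})

def fragment_prob(root, frag):
    """Count of this fragment divided by the total count of
    fragments sharing the same root label."""
    total = sum(c for (r, _), c in fragment_counts.items() if r == root)
    return fragment_counts[(root, frag)] / total

def derivation_prob(fragments):
    """Probability of a derivation: the product of its fragments'
    probabilities (fragments are assumed to combine independently)."""
    return reduce(lambda p, rf: p * fragment_prob(*rf), fragments, 1.0)

# One derivation of "she saw the dog" built from four small fragments:
d = [("S", "S -> NP VP"), ("NP", "NP -> she"),
     ("VP", "VP -> saw NP"), ("NP", "NP -> the dog")]
print(derivation_prob(d))  # 2/3 * 3/4 * 1 * 1/4 = 0.125
```

Note that the same parse tree can typically be produced by several different derivations using fragments of different sizes; summing over those derivations is what lets larger, more specific fragments influence which analysis is most probable.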
In this book, DOP models for various kinds of linguistic representations are described, ranging from tree representations and compositional semantic representations to attribute-value representations and dialogue representations. These models are studied from formal, linguistic and computational perspectives and are tested on available language corpora. The main outcome of these tests suggests that the productive units of natural language cannot be defined in terms of a minimal set of rules (or constraints or principles), as is usually attempted in linguistic theory, but need to be defined in terms of a redundant set of previously experienced structures with virtually no restriction on their size and complexity. It is argued that this outcome has important consequences for linguistic theory, leading to an entirely new view of the nature of linguistic competence and the relationship between linguistic theory and models of performance. In particular, it means that the knowledge of a speaker/hearer cannot be understood as a grammar, but as a statistical ensemble of language experiences that changes slightly every time a new utterance is processed.
Although this book may seem primarily intended for readers with a background in computational linguistics, I have made every effort to make it comprehensible to all students and researchers of language, from theoretical linguists to psycholinguists and computer scientists. I believe that there is still a cultural gap to be bridged between natural language technology and theory. On the one hand, there is the Statistical Natural Language Processing community, which seems to have lost all links with current linguistic theory. On the other hand, there is the Theoretical Linguistics community, whose results are often ignored by natural language technology and psycholinguistics. In this book I argue that there can be no such thing as statistical linguistics without a theory of linguistic representation, and no adequate linguistic theory without statistical enrichment. If this book helps to bridge the gap between these two communities, its aim has been achieved. At the same time, I realize that I may easily be criticized by both communities, which is the consequence of being interdisciplinary.
The only background knowledge I assume throughout the book is (1) the basic notions of grammatical theory, such as context-free grammar, the Chomsky hierarchy and generative capacity; and (2) the basic notions of probability theory, such as the classical definitions of absolute, conditional and joint probability. Some knowledge of logic and compositional semantics is also helpful. I have tried to keep technical details to a minimum and referred to the relevant literature as much as possible. That said, my first aim has been to write a comprehensible book that can be read without the need for external literature.
Introduction: what are the productive units of natural language?
An experience-based model for phrase-structure
Formal Stochastic Language Theory
Parsing and disambiguation
Testing the model: can we restrict the productive units?
Learning new words
Learning new structures
An experience-based model for compositional semantic representations
Speech understanding and dialogue processing
Experience-based models for non-context-free representations