Dissertation, Columbia University, 2009, 184 pp.
Modern Methods for Search in Multimedia Data.
This thesis investigates a number of advanced directions and techniques in multimedia search, with a focus on search over visual content and its associated multimedia information. This topic is of interest because multimedia databases are rapidly growing in size and availability, and users increasingly need methods for indexing and accessing these collections in a variety of applications, including Web image search, personal photo collections, and biomedical applications, among others.
Multimedia search refers to retrieval over databases containing multimedia documents. The design principle is to leverage the diverse cues contained in these data sets to index the semantic visual content of the documents and to make them accessible through simple query interfaces. The goal of this thesis is to develop a general framework for conducting these semantic visual searches and to explore new cues that can be leveraged to enhance retrieval within this framework.
A promising aspect of multimedia retrieval is that multimedia documents contain a richness of relevant cues from a variety of sources. A problem emerges in deciding how to use each of these cues when executing a query: some cues may be more powerful than others, and these relative strengths may change from query to query. Recently, systems using classes of queries with similar optimal weightings have been proposed; however, the definition of the classes is left up to system designers and is subject to human error. We propose a framework for automatically discovering query-adaptive multimodal search methods. We develop and test this framework using a set of search cues and propose a new machine learning-based model for adapting the usage of each of the available search cues depending upon the type of query provided by the user. We evaluate the method against a large standardized video search test set and find that automatically discovered query classes can significantly outperform hand-defined classes.
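To make the mechanism concrete, the following is a minimal Python sketch of query-class discovery and class-dependent fusion. All names are hypothetical, and scikit-learn's KMeans stands in for the clustering step; the thesis's actual learning procedure is more involved.

    # Sketch of query-class-dependent fusion (hypothetical names).
    # Training queries are clustered by their performance-optimal cue
    # weights; a new query is assigned to the nearest class, and its
    # per-cue scores are fused with that class's weights.
    import numpy as np
    from sklearn.cluster import KMeans

    def discover_query_classes(optimal_weights, n_classes=4):
        """Cluster training queries by their optimal per-cue weights.

        optimal_weights: (n_queries, n_cues) array found by tuning each
        training query individually. The cluster centroids serve here,
        for simplicity, as the per-class fusion weights.
        """
        return KMeans(n_clusters=n_classes, n_init=10).fit(optimal_weights)

    def fuse_scores(query_vector, cue_scores, classes):
        """Fuse per-cue document scores with the query's class weights.

        query_vector: the query's position in the clustering space
        (assumed here to be its estimated optimal weight vector).
        cue_scores: (n_docs, n_cues) matrix of normalized cue scores.
        """
        class_id = classes.predict(query_vector.reshape(1, -1))[0]
        weights = classes.cluster_centers_[class_id]
        return cue_scores @ weights  # one fused score per document

Discovering the classes from data, rather than defining them by hand, is what removes the human-error component noted above.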
While multiple cues can give some insight into the content of an image, many existing search methods are subject to serious flaws. Searching the text around an image or piece of video can be helpful, but the text may not reflect the visual content. Querying with image examples can be powerful, but users are unlikely to adopt such a model of interaction. To address these problems, we examine the new direction of utilizing pre-defined, pre-trained visual concept detectors (such as "person" or "boat") to automatically describe the semantic content of the images in the search set. Textual search queries are then mapped into this space of semantic visual concepts, essentially allowing the user to employ a preferred method of interaction (typing in text keywords) to search against semantic visual content. We test this system against a standardized video search set. We find that larger concept lexicons logically improve retrieval performance, but with severely diminishing returns. We also propose an approach for leveraging many visual concepts by mining the co-occurrence of these concepts in some initial search results and find that this process can significantly increase retrieval performance.
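The following Python sketch illustrates the two steps just described: scoring documents against a text query through a concept lexicon, then reranking by mining concept co-occurrence in the top initial results. The naive exact-match mapping and all names are assumptions for illustration, not the thesis's actual query-to-concept mapping.

    # Sketch of concept-based text search with co-occurrence mining.
    import numpy as np

    def concept_scores(query_terms, lexicon, detector_scores):
        """Score documents against a text query via a concept lexicon.

        lexicon: list of concept names, e.g. ["person", "boat", ...].
        detector_scores: (n_docs, n_concepts) detector confidences.
        """
        # Naive mapping: keep concepts whose names appear in the query.
        matched = [i for i, c in enumerate(lexicon) if c in query_terms]
        if not matched:
            return np.zeros(detector_scores.shape[0])
        return detector_scores[:, matched].mean(axis=1)

    def cooccurrence_rerank(initial, detector_scores, top_k=100, alpha=0.5):
        """Boost concepts frequent in the top initial results.

        initial: (n_docs,) initial scores, assumed normalized to [0, 1].
        """
        top = np.argsort(-initial)[:top_k]
        # Treat the top results as pseudo-relevant and weight each
        # concept by its average confidence within that set.
        concept_weights = detector_scores[top].mean(axis=0)
        mined = detector_scores @ concept_weights
        mined /= mined.max() + 1e-9  # keep the score scales comparable
        return alpha * initial + (1 - alpha) * mined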
We observe that many traditional multimedia search systems are blind to structural cues in datasets authored by multiple contributors. Specifically, we find that many images in the news or on the Web are copied, manipulated, and reused. We propose that the most frequently copied images are inherently more "interesting" than others and that highly manipulated images can be of particular interest, representing drifts in ideological perspective. We use these cues to improve search and summarization. We develop a system for reranking image search results based on the number of times that images are reused within the initial search results and find that this reranking can significantly improve the accuracy of the returned list of images, especially for queries of popular named entities. We further develop a system to characterize the types of edits present between two copies of an image and infer cues about the image's edit history. Across a set of copies, these cues give rise to a sort of "family tree" for the image. We find that this method can identify the most-original and most-manipulated images within these sets, which may be useful for summarization.
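A minimal sketch of the reuse-based reranking follows, assuming a near-duplicate matcher is available (any copy-detection method could be plugged in; the counting and fusion below are illustrative simplifications):

    # Sketch of reranking image results by reuse within the result set.
    import itertools

    def rerank_by_reuse(results, is_near_duplicate, alpha=0.5):
        """results: list of (image_id, baseline_score), best first,
        with baseline scores assumed normalized to [0, 1].
        is_near_duplicate: pairwise matcher (assumed), id x id -> bool.
        Images copied many times inside the result set are promoted.
        """
        ids = [img for img, _ in results]
        copies = {img: 0 for img in ids}
        # O(n^2) pairwise matching over the initial result set.
        for a, b in itertools.combinations(ids, 2):
            if is_near_duplicate(a, b):
                copies[a] += 1
                copies[b] += 1
        n = max(1, len(ids) - 1)
        return sorted(
            results,
            key=lambda r: alpha * r[1] + (1 - alpha) * copies[r[0]] / n,
            reverse=True,
        )

The edit-history step would build on the same pairwise comparisons, replacing the boolean matcher with a classifier of edit types (e.g., cropping or overlay) whose directed relations assemble the "family tree" described above.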
The specific significant contributions of this thesis are as follows. (1) The first system to use a machine learning-based approach to discover classes of queries to be used for query-adaptive search, a process which we show to outperform humans conducting the same task. (2) An in-depth investigation of using visual concept lexicons to rank visual media against textual keywords, a promising approach which provides a keyword-based interface to users but indexes media based solely on its visual content. (3) A system that utilizes authors' image reuse behaviors (specifically, duplication) to enhance Web image retrieval. (4) The first system to attempt to recover the manipulation histories of images for the purposes of summarization and exploration.
Introduction.
Query-class-dependent Models for Multimodal Search.
Leveraging Concept Lexicons and Detectors for Semantic Visual Search.
Improved Search through Mining Multimedia Reuse Patterns.
Making Sense of Iconic Content in Search Results: Tracing Image Manipulation Histories.
Conclusions and Future Work.