
04 December 2005

Bridging the Semantic Gap through Feature-to-Semantic Mapping

This is the literature survey cum short proposal (if it qualifies as one) that I wrote... Just wondering... can I make stuff like this public?

Anywayz...

==========================

Bridging the Semantic Gap through Feature-to-Semantic Mapping


Content-based Video Retrieval (CBVR) deals with the extraction of relevant video data based on specific queries relating to video content. Early CBVR approaches simply extend Content-based Image Retrieval (CBIR) techniques, namely low-level feature extraction; a number of examples are provided in [3]. This, however, does not make for effective video retrieval, because video has properties that are more complex than those of still images.
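
To make "low-level features" a bit more concrete, here is a minimal sketch (assuming OpenCV and NumPy are available; the file name match.avi is only a placeholder) of the sort of per-frame color histograms that CBIR-style systems typically start from:

import cv2
import numpy as np

def frame_color_histograms(video_path, bins=32):
    """Yield a normalized color histogram feature vector for each frame."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # One histogram per BGR channel, concatenated into a single feature vector.
        hist = np.concatenate([
            cv2.calcHist([frame], [channel], None, [bins], [0, 256]).flatten()
            for channel in range(3)
        ])
        yield hist / (hist.sum() + 1e-9)
    cap.release()

# Hypothetical usage: one feature vector per frame.
features = list(frame_color_histograms("match.avi"))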

CBIR-style queries based on low-level features alone are unsuitable for video, because people prefer to look for meaningful events rather than to make sense of visual features in a particular video. For example, a soccer fan might look for a bicycle-kick goal scored in a match between Liverpool and Arsenal. Features such as ball color or a player's body shape are of no concern, since the interest lies in the bicycle-kick event itself. Moreover, low-level features bear very little semantic relationship to events, which makes them a weak basis for retrieval.

It is nevertheless undeniable that low-level feature extraction, which is prevalent in CBIR, is crucial for making sense of video data; used alone, however, it does not suffice for effective retrieval. There is thus a need to automatically bridge the semantic gap between the low-level features of a video and the high-level semantic concepts that describe it.

One notable piece of work, which adopts machine learning, attempts automatic labeling of broadcast video content with relevant semantic concepts [2]. Manual annotation of a video sample is initially performed to serve as training data. During training, multi-modal machine learning techniques assisted by minimal human interaction are applied to 1-5% of the video data to build semantic models. Based on these models, the remaining 95-99% is then annotated fully automatically. Besides lowering the cost of manual annotation, the authors report impressive semantic concept detection results.
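
As a rough, single-modality illustration of that annotate-a-little / auto-label-the-rest workflow (the real system in [2] is multi-modal; here every shot is reduced to one feature vector, and scikit-learn plus the names below are my own assumptions):

import numpy as np
from sklearn.svm import SVC

def auto_annotate(shot_features, manual_labels, annotated_fraction=0.05):
    """shot_features: (n_shots, n_dims) array of per-shot features.
    manual_labels: concept labels for the first ~5% of shots
    (the manually annotated training sample)."""
    n_train = int(len(shot_features) * annotated_fraction)
    model = SVC(kernel="rbf")                    # build a semantic concept model
    model.fit(shot_features[:n_train], manual_labels[:n_train])
    # Annotate the remaining 95-99% fully automatically.
    return model.predict(shot_features[n_train:])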

Another alternative suggests a four-layer video data model [3]. The first layer, containing raw data such as frame rate, color depth and video format, serves as the basis for automatically extracting domain-independent features (both static and dynamic) into the second layer; these features are then assigned to image regions. The object (third) layer maps one or more of these regions to logical concepts, which in turn are interpreted as a meaningful object type based on an established object grammar. Finally, the event layer, which describes object interactions in a spatio-temporal manner, uses an event grammar that combines object types with audio and spatio-temporal data, along with real-world relations from the domain knowledge. This approach provides a framework for automatic mapping from features to semantic concepts.
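
Purely as an illustration (my own rendering, not the paper's actual schema), the four layers could be represented with plain data structures along these lines:

from dataclasses import dataclass
from typing import List

@dataclass
class RawLayer:              # layer 1: raw video properties
    frame_rate: float
    color_depth: int
    video_format: str

@dataclass
class Feature:               # layer 2: domain-independent static/dynamic features
    name: str                # e.g. "color histogram", "motion vector"
    values: List[float]
    region_id: int           # the image region the feature is assigned to

@dataclass
class ObjectConcept:         # layer 3: regions interpreted as a meaningful object type
    object_type: str         # e.g. "player", "ball" (via the object grammar)
    region_ids: List[int]

@dataclass
class Event:                 # layer 4: spatio-temporal interactions between objects
    name: str                # e.g. "bicycle-kick goal" (defined by the event grammar)
    objects: List[ObjectConcept]
    start_frame: int
    end_frame: int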

Other approaches are also being experimented with and show promising results; a text-retrieval approach [1] and semantic classification based on motion descriptors [4] are two examples.
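
To give a flavor of the text-retrieval idea in [1]: local descriptors extracted from frames are quantized into "visual words", so that a frame can be indexed and matched much like a text document. A hedged sketch, assuming scikit-learn; descriptors_per_frame is a hypothetical input:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

def build_visual_word_index(descriptors_per_frame, vocab_size=1000):
    """descriptors_per_frame: list of (n_i, d) arrays of local descriptors."""
    all_descriptors = np.vstack(descriptors_per_frame)
    vocabulary = KMeans(n_clusters=vocab_size).fit(all_descriptors)
    # Bag-of-visual-words histogram for each frame.
    counts = np.array([
        np.bincount(vocabulary.predict(d), minlength=vocab_size)
        for d in descriptors_per_frame
    ])
    # Weight the histograms with tf-idf so frames can be matched like documents.
    return TfidfTransformer().fit_transform(counts)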

Bridging the semantic gap can prove crucial. Much of the literature focuses on mapping low-level features to semantics, which shows that this field has great potential. Areas that might benefit include security surveillance, news, sports and many others. Success would provide a natural way of querying video data for novice users and experts alike.

References

[1] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos", Proceedings of the International Conference on Computer Vision, 2003. http://www.robots.ox.ac.uk/~vgg/publications/html/sivic03-abstract.html

[2] J.R. Smith, M. Campbell, M. Naphade, A. Natsev, J. Tesic, "Learning and Classification of Semantic Concepts in Broadcast Video", https://analysis.mitre.org/proceedings/Final_Papers_Files/362_Camera_Ready_Paper.pdf.

[3] M. Petkovic, “Content-based video retrieval”, EDBT PhD Workshop, http://www.edbt2000.uni-konstanz.de/phd-workshop/papers/Petkovic.pdf.

[4] Y.F. Ma and H.J. Zhang, “Motion Pattern based Video Classification and Retrieval”, EURASIP Journal on Applied Signal Processing 2003:2, 199–208.
