Tutorial information

Video Indexing, Search, Detection, and Description with focus on TRECVID

This full-day tutorial targets building baseline systems for several TREC Video Retrieval Evaluation (TRECVID) tasks/tracks, such as Ad-hoc Video Search (AVS), Semantic Indexing (SIN), Multimedia Event Detection (MED), Instance Search (INS), and Video to Text (VTT).

By the end of the tutorial, participants are expected to have gained knowledge and practical experience in building the basic pipeline components needed for each of the tasks. All resources will be available for participants to reuse and/or build on when participating in one or more future TRECVID tasks.

Modules of the tutorial (mini-tutorials):

  1. Semantic INdexing (SIN)
  2. Zero-Example (0Ex) Video Search (AVS)
  3. Ad Hoc Video Search (AVS)
  4. Multimedia Event Detection (MED)
  5. Instance Search (INS)
  6. Video to Text (VTT)

1) Semantic INdexing (SIN)

Abstract. The TRECVID Semantic INdexing task [1] aims to evaluate methods and systems for automatic content-based video indexing. The task is defined as follows: given a test collection, a reference shot segmentation, and concept definitions, return for each target concept a list of at most 2,000 shot IDs from the test collection, ranked according to their likelihood of containing the target.
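
As a toy illustration of this output format, the sketch below ranks shot IDs for a single target concept using a stand-in scoring function; the shot IDs and the random scorer are purely hypothetical, not part of the task data.

    # A minimal sketch of the SIN output: for one target concept, rank test shots
    # by classifier score and keep at most 2,000 shot IDs (scores are random stand-ins).
    import random

    def rank_shots_for_concept(shot_ids, score_fn, max_results=2000):
        """Return shot IDs sorted by decreasing likelihood of containing the concept."""
        return sorted(shot_ids, key=score_fn, reverse=True)[:max_results]

    random.seed(0)
    shots = ["shot%d_%d" % (v, s) for v in range(1, 11) for s in range(1, 21)]
    scores = {s: random.random() for s in shots}   # stand-in for a real concept classifier
    print(rank_shots_for_concept(shots, scores.get)[:5])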

This tutorial section will give an overview of the SIN task, followed by a description of two main approaches: a “classical” one based on engineered features, classification, and fusion, and a deep learning-based one [2]. A baseline implementation built by the LIG team and the IRIM group will be introduced and shared.
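
The snippet below is a minimal sketch of the classical recipe (one classifier per feature type, followed by late fusion) on random toy data with scikit-learn; it is an assumed illustration, not the LIG/IRIM baseline itself.

    # Late fusion sketch: one classifier per feature type, average of the scores.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_shots = 200
    y = rng.integers(0, 2, n_shots)                     # toy "contains concept" labels
    features = {
        "engineered": rng.normal(size=(n_shots, 64)),   # e.g. color / SIFT bag-of-words
        "deep":       rng.normal(size=(n_shots, 128)),  # e.g. CNN activations
    }

    scores = []
    for name, X in features.items():
        clf = LinearSVC().fit(X[:150], y[:150])         # train on the first 150 shots
        scores.append(clf.decision_function(X[150:]))   # score the remaining shots
    fused = np.mean(scores, axis=0)                     # unweighted late fusion
    print(np.argsort(-fused)[:10])                      # held-out shots ranked by fused score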

Georges Quénot

Contact: Georges dot Quenot at imag dot fr

2) Zero-Example (0Ex) Video Search (AVS)

Abstract. 0Ex is essentially text-to-video search, where queries are described in text and no visual example is given. This search paradigm depends heavily on the scale and accuracy of concept classifiers in interpreting the semantic content of videos. The general idea is to annotate and index videos with concepts during offline processing, and then to retrieve videos whose concepts match the query description [3,4]. 0Ex dates back to the very beginning of TRECVID in 2003, growing from around twenty concepts (high-level features) to more than ten thousand classifiers today. The queries have also evolved from finding a specific thing (e.g., find shots of an airplane taking off) to detecting a complex and generic event (e.g., a wedding shower) [5], while the dataset size has expanded yearly from less than 200 hours to more than 5,000 hours of video [6].
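
The toy sketch below illustrates this offline/online split with a handful of made-up concepts and random detector scores; the concept names and the simple word-matching rule are assumptions for illustration only.

    # 0Ex sketch: videos indexed offline by concept scores, queries answered online
    # by summing the scores of the concepts mentioned in the query text.
    import numpy as np

    concepts = ["airplane", "takeoff", "dog", "wedding", "shower"]
    rng = np.random.default_rng(1)
    video_concept_scores = rng.random((6, len(concepts)))   # offline: videos x concepts

    def zero_example_search(query, top_k=3):
        """Rank videos by the detector scores of the concepts appearing in the query."""
        selected = [i for i, c in enumerate(concepts) if c in query.lower().split()]
        if not selected:
            return []
        relevance = video_concept_scores[:, selected].sum(axis=1)
        return np.argsort(-relevance)[:top_k]

    print(zero_example_search("find shots of an airplane takeoff"))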

This tutorial section will give an overview of the 0Ex search paradigm, covering (i) development of concept classifiers, (ii) indexing and feature pooling, (iii) query processing and concept selection, and (iv) video recounting. Interesting problems to be discussed include (i) how to determine the number of concepts needed to answer a query, and (ii) how to identify query-relevant fragments for feature pooling and video recounting. A 0Ex baseline system, with a few thousand concept classifiers (from the SIN and ImageNet concept banks) and built on the MED and AVS datasets, will be introduced and shared in the public domain.
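
To make the fragment-pooling problem concrete, the sketch below pools shot-level concept scores over only the highest-scoring fragments of one video; the data and the choice of five fragments are hypothetical assumptions.

    # Query-relevant pooling sketch: average each concept over its best shots only,
    # instead of over the whole video, before combining the concepts.
    import numpy as np

    rng = np.random.default_rng(2)
    shot_scores = rng.random((30, 5))          # 30 shots x 5 query-selected concepts

    def pooled_video_score(shot_scores, top_fragments=5):
        top = np.sort(shot_scores, axis=0)[-top_fragments:]   # best shots per concept
        return top.mean(axis=0).sum()                         # pool shots, then concepts

    print(pooled_video_score(shot_scores))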

Chong-Wah Ngo

Contact: cscwngo at cityu dot edu dot hk

3) Ad Hoc Video Search (AVS)

Abstract. The TRECVID Ad-hoc Video Search task aims to model the end-user search use case, where a user is looking for segments of video containing persons, objects, activities, locations, etc., and combinations thereof. The task is defined as follows: given a test collection, a reference shot segmentation, and a set of Ad-hoc queries, return for each query a list of at most 1,000 shot IDs from the test collection, ranked according to their likelihood of containing the target query.

This tutorial section will give an overview of the AVS task [7], followed by an overview of the methods used by participants in the first (2016) edition. Most of them rely on a battery of visual concept detectors complemented by methods for mapping the queries onto the available concepts.
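
As a rough illustration of query-to-concept mapping, the sketch below matches query words against a tiny, made-up concept vocabulary using plain string similarity; real systems typically rely on word embeddings or ontologies instead, so this is only an assumed stand-in.

    # Query-to-concept mapping sketch with string similarity as a stand-in.
    from difflib import SequenceMatcher

    concept_bank = ["airplane_flying", "person_running", "kitchen", "beach", "dog"]

    def map_query_to_concepts(query, threshold=0.7):
        """Return concepts whose name parts resemble some word of the query."""
        matches = set()
        for word in query.lower().split():
            for concept in concept_bank:
                if any(SequenceMatcher(None, word, part).ratio() >= threshold
                       for part in concept.split("_")):
                    matches.add(concept)
        return sorted(matches)

    print(map_query_to_concepts("Show me a person running on the beach"))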

Georges Quénot

Contact: Georges dot Quenot at imag dot fr

4) Multimedia Event Detection (MED)

Abstract. This tutorial session will highlight recent research on detecting events, such as ‘working on a woodworking project’ and ‘winning a race without a vehicle’, when video examples to learn from are scarce or even completely absent. In the first part of the lecture we consider the scenario where on the order of ten to a hundred examples are available. We provide an overview of supervised classification approaches to event detection, relying on shallow and deep feature encodings, as well as semantic encodings atop convolutional neural networks predicting concepts and attributes [8]. As events become more and more specific, it is unrealistic to assume that ample examples to learn from will be commonly available [9,10]. That is why we turn our attention to retrieval approaches in the second part. The key to event recognition when examples are absent is a lingual video representation: once the video is represented in textual form, standard retrieval metrics can be used. We cover video representation learning algorithms that emphasize concepts, social tags, or semantic embeddings [11,12,13]. We will detail how these representations allow for accurate event retrieval and are also able to translate and summarize events in video content, even in the absence of training examples.
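
The fragment below sketches the retrieval view with a lingual representation: each video is described by the terms of its (here invented) top-scoring concept detectors, and events are retrieved with off-the-shelf text similarity; the videos, terms, and query are toy examples, not data from the tutorial.

    # Zero-example event retrieval sketch: text query vs. textual video descriptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    video_terms = [
        "saw wood workshop hammer table",          # predicted concepts for video 0
        "bride dress gift cake party",             # video 1
        "runner running race track finish line",   # video 2
    ]
    query = "person winning a running race"

    vectors = TfidfVectorizer().fit_transform(video_terms + [query])
    similarity = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
    print(similarity.argsort()[::-1])              # videos ranked for the event query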

Cees Snoek

Contact: cgmsnoek at uva dot nl

5) Instance Search (INS)

Abstract. The TRECVID Instance Search task [14] aims to explore technologies for efficiently and effectively searching and retrieving specific objects from videos given visual examples. The task focuses on finding particular "instances" of an object, person, or location, as opposed to finding objects of specified classes, which is what the SIN task addresses.

This tutorial section will give an overview of the INS task, followed by the standard pipeline, including short-list generation with the bag-of-visual-words technique [15], handling of geometric information and context, and efficiency measures such as an inverted index [16,17]. A baseline implementation built by the NII team will be introduced and shared.
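
The sketch below illustrates the short-list step with an inverted index over visual-word IDs; the shot IDs and visual words are fabricated, and geometric verification is left out.

    # Bag-of-visual-words short list via an inverted index (toy data).
    from collections import defaultdict

    database = {                                   # shot ID -> quantized visual words
        "video1_shot3": [12, 45, 45, 99],
        "video2_shot7": [45, 77],
        "video5_shot1": [12, 12, 13],
    }
    inverted_index = defaultdict(set)
    for shot_id, words in database.items():
        for w in words:
            inverted_index[w].add(shot_id)

    def short_list(query_words):
        """Count shared visual words between the query and each candidate shot."""
        votes = defaultdict(int)
        for w in query_words:
            for shot_id in inverted_index.get(w, ()):
                votes[shot_id] += 1
        return sorted(votes, key=votes.get, reverse=True)

    print(short_list([12, 45, 200]))               # candidates before geometric verification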

Shin'ichi Satoh

Contact: satoh at nii dot ac dot jp

Duy-Dinh Le

Contact: duyld at uit dot edu dot vn

Vinh-Tiep Nguyen

Contact: nvtiep at fit dot hcmus dot edu dot vn

6) Video to Text (VTT)

Abstract. This tutorial session considers the challenge of matching or generating a sentence to a video. The major challenge in video to text matching is that the query and the retrieval set instances belong to different domains, so they are not directly comparable. Videos are represented by audiovisual feature vectors which have a different intrinsic dimensionality, meaning, and distribution than the textual feature vectors used for the sentences. As a solution, many works aim to align the two feature spaces so they become comparable. We discuss solutions based on low-level, mid-level and high-level alignment for video to text matching [18,19,20]. The goal of video to text generation is to automatically assign a caption to a video. We will cover state-of-the-art approaches relying on recurrent neural networks atop a deep convolutional network, and highlight recent innovations inside and outside the network architectures. Examples will be illustrated in the context of the new TRECVID video to text pilot.
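
As a toy illustration of alignment-based matching, the sketch below projects sentence vectors into a video feature space with a randomly initialized linear mapping and ranks candidate sentences by cosine similarity; the dimensions and the mapping are assumptions, and in practice the mapping is learned from paired video-sentence data.

    # Video-to-text matching sketch in a shared space (all features random stand-ins).
    import numpy as np

    rng = np.random.default_rng(3)
    video_feature = rng.normal(size=512)           # e.g. pooled CNN features of a video
    sentence_features = rng.normal(size=(4, 300))  # e.g. averaged word embeddings
    W = rng.normal(size=(300, 512))                # learned in practice, random here

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    projected = sentence_features @ W              # sentences mapped into visual space
    scores = [cosine(video_feature, p) for p in projected]
    print(int(np.argmax(scores)))                  # index of the best-matching sentence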

Cees Snoek

Contact: cgmsnoek at uva dot nl

References

[1] G. Awad, C. G. M. Snoek, A. F. Smeaton, G. Quénot. TRECVID Semantic Indexing of Video: A 6-Year Retrospective. Invited Paper. ITE Transactions on Media Technology and Applications, 2016.
[2] M. Budnik, E. Gutierrez-Gomez, B. Safadi, D. Pellerin, G. Quénot. Learned features versus engineered features for multimedia indexing. Multimedia Tools and Applications, Springer Verlag. Published online, December 2016.
[3] Y.-J. Lu, H. Zhang, M. de Boer, C.-W. Ngo. Event detection with zero-example: Select the right and suppress the wrong concepts. ACM ICMR, 2016.
[4] Y.-J. Lu, P. A. Nguyen, H. Zhang, C.-W. Ngo. Concept-based interactive search system. MMM (Video Browser Showdown), 2017.
[5] H. Zhang, Y.-J. Lu, M. de Boer, F. ter Haar, Z. Qiu, K. Schutte, W. Kraaij, C.-W. Ngo. VIREO-TNO@TRECVID 2015: Multimedia event detection. TRECVID Workshop, 2015.
[6] X.-Y. Wei, Y.-G. Jiang, C.-W. Ngo. Concept-driven multi-modality fusion for video search. IEEE Trans. on Circuits and Systems for Video Technology, 2011.
[7] G. Awad, J. Fiscus, D. Joy, M. Michel, A. F. Smeaton, W. Kraaij, G. Quénot, M. Eskevich, R. Aly, R. Ordelman, G. J. F. Jones, B. Huet, M. Larson. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID 2016, NIST, USA.
[8] A. Habibian, T. Mensink, C. G. M. Snoek. Video2vec embeddings recognize events when examples are scarce. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2017.
[9] M. Mazloom, X. Li, C. G. M. Snoek. Tagbook: A semantic video representation without supervision for event detection. IEEE Trans. on Multimedia, 18(7):1378-1388, 2016.
[10] P. Mettes, D. C. Koelma, C. G. M. Snoek. The ImageNet Shuffle: Reorganized pre-training for video event detection. ACM ICMR, 2016.
[11] A. Habibian, T. Mensink, C. G. M. Snoek. VideoStory: A new multimedia embedding for few-example recognition and translation of events. ACM Multimedia, 2014.
[12] M. Mazloom, E. Gavves, C. G. M. Snoek. Conceptlets: Selective semantics for classifying video events. IEEE Trans. on Multimedia, 16(8):2214-2228, 2014.
[13] A. Habibian, C. G. M. Snoek. Recommendations for recognizing video events by concept vocabularies. Computer Vision and Image Understanding, 124:110-122, 2014.
[14] G. Awad, W. Kraaij, P. Over, S. Satoh. Instance search retrospective with focus on TRECVID. International Journal of Multimedia Information Retrieval, 2017.
[15] C.-Z. Zhu, S. Satoh. Large vocabulary quantization for searching instances from videos. ACM ICMR, 2012.
[16] C.-Z. Zhu, H. Jégou, S. Satoh. Query-adaptive asymmetrical dissimilarities for visual object retrieval. International Conference on Computer Vision (ICCV), 2013.
[17] D.-D. Le, S. Phan, V.-T. Nguyen, C.-Z. Zhu, D. M. Nguyen, T. D. Ngo, S. Kasamwattanarote, P. Sebastien, M.-T. Tran, D. A. Duong, S. Satoh. National Institute of Informatics, Japan at TRECVID 2014. TRECVID Workshop, 2014.
[18] J. Dong, X. Li, C. G. M. Snoek. Word2VisualVec: Image and video to sentence matching by visual feature prediction. arXiv, 2016.
[19] J. Dong, X. Li, W. Lan, Y. Huo, C. G. M. Snoek. Early embedding and late reranking for video captioning. ACM Multimedia, 2016.
[20] A. Habibian, T. Mensink, C. G. M. Snoek. Discovering semantic vocabularies for cross-media retrieval. ACM ICMR, 2015.

Participation in tutorials is included in the regular registration fee and is thus open to all conference participants.
