=================================================================================================================================
YouTube-Objects dataset v2.2
Vicky Kalogeiton, Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, Vittorio Ferrari
=================================================================================================================================

The YouTube-Objects dataset contains videos collected from YouTube for 10 object classes. The videos are weakly annotated,
i.e. we ensure that each video contains at least one object of the corresponding class. This release contains a total of
720,000 frames.

For each class, there is a tar.gz file that contains all frames of all videos for this class. We release individual video
frames after decompression, in order to eliminate possible confusion when decoding the videos and in the frame numbering.
The frames are stored in .jpg format. For each class, all frames from all videos are concatenated and named sequentially
using 8 digits (e.g. 00000001.jpg, 00000002.jpg, etc.).

If you use this release, please cite [2].

=================================================================================================================================
1. Ranges
=================================================================================================================================

This folder contains a ranges.mat file for each class. Each file contains the shot partitioning for all videos, as well as
the video that each shot belongs to. It is a 3xS array, where S is the number of shots. Its structure is the following:
---------------------------------------------------------------------------------------------------------------------------------
row 1: Index of the first frame of the shot
row 2: Index of the last frame of the shot
row 3: Index of the video the shot comes from
---------------------------------------------------------------------------------------------------------------------------------
A usage sketch for this file is given after Section 2 below.

=================================================================================================================================
2. Ground truth bounding-boxes
=================================================================================================================================

This folder contains the ground-truth bounding-boxes of this release for each class. The annotation protocol is the
following: we uniformly sample frames per shot so that the total number of sampled frames is roughly equal to the number of
PASCAL VOC 2007 training samples. Then, we split these frames into training (70%) and test (30%) sets. To avoid any bias
between the training and test sets, frames from the same video belong to only one set. In the training set we annotated one
instance per frame, while in the test set we annotated all instances of the desired object class.

For each class there are two files: bb_gtTraining.mat and bb_gtTest.mat. Each of them contains one structure array with the
ground-truth bounding-boxes for the training and test sets, respectively, with the following fields:

- im:    class name and frame number
- boxes: an Nx4 double array with the coordinates of the N ground-truth bounding-boxes, [x1 y1 x2 y2]

The total number of annotated samples over all classes is 6,975 (obtained from 6,087 frames). YouTube-Objects v1.0 instead
had only 1,407 bounding-box annotations.
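
The following MATLAB sketch shows one way to read the ranges.mat file of Section 1 and to recover the frames of a single
shot. It is a minimal sketch only: the variable name 'ranges' inside the .mat file is an assumption, so check the actual
name with whos('-file', 'ranges.mat') before use.

    % Minimal sketch, assuming ranges.mat stores the 3xS array under the
    % (assumed) variable name 'ranges'.
    data   = load('ranges.mat');
    ranges = data.ranges;                  % 3xS array, one column per shot
    s      = 5;                            % pick a shot index (example value)
    firstFrame = ranges(1, s);             % index of the first frame of the shot
    lastFrame  = ranges(2, s);             % index of the last frame of the shot
    videoIdx   = ranges(3, s);             % index of the video the shot comes from
    fprintf('shot %d: video %d, frames %d-%d\n', s, videoIdx, firstFrame, lastFrame);
    for f = firstFrame:lastFrame           % frames use the 8-digit naming of this release
        frameName = sprintf('%08d.jpg', f);
        % img = imread(frameName);         % load the frame if needed
    end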
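
Similarly, a minimal MATLAB sketch for the ground-truth files of Section 2. The field names im and boxes come from the
description above; the top-level variable name 'bb_gtTest' is an assumption, so verify it with
whos('-file', 'bb_gtTest.mat').

    % Minimal sketch, assuming bb_gtTest.mat stores the structure array under
    % the (assumed) variable name 'bb_gtTest'.
    data = load('bb_gtTest.mat');
    gt   = data.bb_gtTest;                 % structure array with fields im, boxes
    for k = 1:numel(gt)
        boxes = gt(k).boxes;               % Nx4 double, one [x1 y1 x2 y2] per row
        fprintf('%s: %d box(es)\n', gt(k).im, size(boxes, 1));
        % e.g. draw the first box on the current axes:
        % b = boxes(1,:);
        % rectangle('Position', [b(1) b(2) b(3)-b(1) b(4)-b(2)]);
    end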
=================================================================================================================================
3. Optical flow by [3]
=================================================================================================================================

This folder contains the optical flow for each shot of this release for each class. For each shot, the optical flow is
stored in a 1x(N-1) cell array, where N is the number of frames in this shot. Each frame is resized to fit an MxM window,
where M = 400.

=================================================================================================================================
4. Superpixels by [4]
=================================================================================================================================

This folder contains the SLIC superpixels for each shot of this release for each class. For each shot, the SLIC superpixels
are stored in a 1xN cell array, where N is the number of frames in this shot. Each frame is resized to fit an MxM window,
where M = 400.

=================================================================================================================================
5. Relation to v1.0 of the YouTube-Objects dataset
=================================================================================================================================

In v1.0 of the YouTube-Objects dataset there were some decompression problems. For backward-compatibility reasons, we
include two MATLAB files that give the frame-to-frame correspondence between v2.0 and v1.0 of the YouTube-Objects dataset:

a) MappingNewtoOldData.mat
b) MappingOldtoNewData.mat

Their structure is the following:

a) MappingNewtoOldData.mat
---------------------------------------------------------------------------------------------------------
column 1        | column 2        | column 3        | column 4        | column 5        | column 6
---------------------------------------------------------------------------------------------------------
Index of Frame  | Index of Video  | Index of Frame  | Index of Video  | Index of Shot   | Index of Frame
in the          | in the          | in the          | in the          | in the          | in the
YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects
dataset v2.0    | dataset v2.0    | dataset v1.0    | dataset v1.0    | dataset v1.0    | dataset v1.0
                |                 |                 |                 |                 | within the shot
---------------------------------------------------------------------------------------------------------

example: MappingNewtoOldData(10781,:) = [10781, 2, 6787, 3, 23, 6]

Frame "00010781.jpg" of the YouTube-Objects dataset v2.0 belongs to video "2" of the YouTube-Objects dataset v2.0.
Its corresponding frame in v1.0 of the YouTube-Objects dataset is "00006787.jpg", which belongs to video "3",
shot "23", frame "6" within the shot ("frame0006.jpg").

b) MappingOldtoNewData.mat
---------------------------------------------------------------------------------------------------------
column 1        | column 2        | column 3        | column 4        | column 5        | column 6
---------------------------------------------------------------------------------------------------------
Index of Frame  | Index of Video  | Index of Shot   | Index of Frame  | Index of Frame  | Index of Video
in the          | in the          | in the          | in the          | in the          | in the
YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects | YouTube-Objects
dataset v1.0    | dataset v1.0    | dataset v1.0    | dataset v1.0    | dataset v2.0    | dataset v2.0
                |                 |                 | within the shot |                 |
---------------------------------------------------------------------------------------------------------

example: MappingOldtoNewData(217,:) = [217, 1, 2, 16, 229, 1]

Frame "00000217.jpg" of the YouTube-Objects dataset v1.0 belongs to video "1", shot "2", frame "16" within the
shot ("frame0016.jpg") of the YouTube-Objects dataset v1.0.
Its corresponding frame in v2.0 of the YouTube-Objects dataset is "00000229.jpg", which belongs to video "1" of
the YouTube-Objects dataset v2.0.
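
A minimal MATLAB sketch of this lookup, mapping a v1.0 frame to its v2.0 counterpart. The variable name
'MappingOldtoNewData' follows the example above; verify it with whos('-file', 'MappingOldtoNewData.mat').

    % Minimal sketch: map a v1.0 frame to its v2.0 counterpart, assuming the
    % array is stored under the variable name 'MappingOldtoNewData'.
    data    = load('MappingOldtoNewData.mat');
    mapping = data.MappingOldtoNewData;    % one row per v1.0 frame, 6 columns
    row = mapping(217, :);                 % the example row from above
    fprintf('v1.0 frame %08d.jpg -> v2.0 frame %08d.jpg (v2.0 video %d)\n', ...
            row(1), row(5), row(6));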
=================================================================================================================================
6. Spatio-temporal tubes as produced by [1]
=================================================================================================================================

CandidateTubes.mat and SelectedTubes.mat

These files contain the candidate and the selected spatio-temporal tubes as produced by [1]. Their structure is the
following:

The CandidateTubes file is an NxM cell array, where N is the number of frames for this class and M is the number of tubes
(as produced by [1]) for this frame. The index of each row of the array refers to the index of the frame.

The SelectedTubes file is an Nx1 cell array, where N is the number of frames for this class. The index of each row of the
array refers to the index of the frame.

Each cell of these files contains a tube, as produced by [1]. For a given frame, a tube is represented as a bounding-box,
so each cell contains the coordinates [x1 y1 x2 y2] of this bounding-box.

=================================================================================================================================
References
=================================================================================================================================

[1] Learning Object Class Detectors from Weakly Annotated Video
    Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, Vittorio Ferrari
    In Computer Vision and Pattern Recognition (CVPR), 2012.

[2] Analysing Domain Shift Factors between Videos and Images for Object Detection
    Vicky Kalogeiton, Vittorio Ferrari, Cordelia Schmid
    In PAMI, 2016.

[3] Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation
    Thomas Brox, Jitendra Malik
    In PAMI, March 2011.

[4] SLIC Superpixels Compared to State-of-the-art Superpixel Methods
    Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, Sabine Süsstrunk
    In PAMI, May 2012.

=================================================================================================================================
Support
=================================================================================================================================

For any query, suggestion, or complaint, please send us an email:

vicky.kalogeiton@ed.ac.uk (please contact this address first)
vittoferrari@gmail.com
cordelia.Schmid@inria.fr

=================================================================================================================================
Versions history
=================================================================================================================================

2.2
---
- Superpixels for all shots. The method used is [4].

2.1
---
- Updated ground-truth bounding-boxes.
- Optical flow for all shots. The method used is [3].

2.0
---
- More ground-truth bounding-boxes.
- Fixed decompression problems.

1.0
---
- First public release.