================================================================================
Aspects dataset 1.0
Anestis Papazoglou, Luca Del Pero, Vittorio Ferrari
================================================================================

This dataset contains video shots for two object classes: tigers and cars. We
collected the shots from 188 car ads (~1-2 min each) and 14 nature
documentaries about tigers (~40 min each), amounting to roughly 14 h of video.
We automatically partitioned these raw videos into shorter shots and kept only
those showing at least one instance of the class. This produced 806 shots for
the car class and 1880 for the tiger class, typically 1-100 sec in length.

We release individual video frames after decompression, in order to eliminate
possible confusion when decoding the videos and in the frame numbering. The
frames are stored in .jpg format. For each class, all frames from all shots
are concatenated and named sequentially using 8 digits (e.g. 00000001.jpg,
00000002.jpg, etc.).

We also release the ground-truth aspect annotations as well as the automatic
segmentations used in [1]. The annotation files and further metadata about the
videos are stored as .mat files and require MATLAB (or a compatible .mat
parser) to read.

================================================================================
1. Ranges
================================================================================

We provide a ranges.mat file for each class, specifying which frames belong to
which shot. It contains an (N+1)x1 array, where N is the number of shots for
the class. Element i of the array is the index of the first frame of the i-th
shot.

We also provide a positiveRanges.mat file for each class, specifying which
shots contain an instance of the object class in at least one frame. It
contains an Mx1 array, where M is the number of shots containing the object
class. Each element of the array is the index of a shot containing the object
class.
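As an illustration of the ranges format, the Python sketch below maps a shot index to its frame file names. The array values are made up for demonstration, and we additionally assume that the extra (N+1)-th element marks one past the last frame of the last shot; in practice the array would be extracted from ranges.mat, e.g. with scipy.io.loadmat.

```python
import numpy as np

# Illustrative stand-in for the (N+1)x1 array stored in ranges.mat.
# These values are made up: 3 shots starting at frames 1, 101 and 351,
# with the final element assumed to mark one past the last frame.
ranges = np.array([1, 101, 351, 601])

def shot_frame_names(ranges, shot_idx):
    """Return the frame file names of the 1-based shot index shot_idx,
    assuming frames are named with 8-digit indices as described above."""
    start = int(ranges[shot_idx - 1])  # first frame of this shot
    end = int(ranges[shot_idx])        # first frame of the next shot
    return ["%08d.jpg" % i for i in range(start, end)]

frames = shot_frame_names(ranges, 2)
print(frames[0], frames[-1], len(frames))
```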
Note: for space reasons, the archives do not include the shots that do not
contain the object class.

================================================================================
2. Ground-truth
================================================================================

The ground-truth annotations for each class are placed under the folder
"GroundTruth/Aspect". The ground-truth for each annotated frame is stored in a
separate file named annotationXXXXXXXX.mat, where XXXXXXXX is the 8-digit
index of the corresponding frame. Each file contains a structure with the
following fields:

-------------------------------- Tiger class -----------------------------------

objects_num:        The number of class objects in the frame.
faceid:             An index (0 to 6) indicating the orientation of the face.
                    The indices correspond to 'not visible', 'left',
                    'left-front', 'front', 'right-front', 'right' and 'top'
                    respectively.
face:               A string representation of the corresponding faceid.
front_feetid:       A value (0 to 4) indicating the action of the front legs:
                    0 is 'not visible', 1 is 'standing', 2 is 'walking',
                    3 is 'running', 4 is 'lying'.
front_feet:         A string representation of the above value.
back_feetid:        A value (0 to 4) indicating the action of the back legs,
                    with the same encoding as front_feetid.
back_feet:          A string representation of the above value.
sternumid:          A binary value indicating whether the part is visible.
sternum:            A string representation of the above value.
buttocksid:         A binary value indicating whether the part is visible.
buttocks:           A string representation of the above value.
shoulder_bladesid:  An index indicating which shoulder blade is visible:
                    0 corresponds to none, 1 to left and 2 to right.
shoulder_blades:    A string representation of the above value.
thighsid:           A binary value indicating whether the part is visible.
thighs:             A string representation of the above value.
backid:             A binary value indicating whether the part is visible.
back:               A string representation of the above value.
bellyid:            A binary value indicating whether the part is visible.
belly:              A string representation of the above value.
segmentationid:     A value (0 to 2) indicating the quality of the provided
                    segmentation: 0 is a wrong segmentation, 1 is mostly on
                    the object but inaccurate, 2 is perfect.
segmentation:       A string representation of the above value.

--------------------------------- Car class ------------------------------------

objects_num:        The number of class objects in the frame.
roof:               A binary value indicating whether the part is visible.
windshield_front:   A binary value indicating whether the part is visible.
windshield_rear:    A binary value indicating whether the part is visible.
light_front_left:   A binary value indicating whether the part is visible.
light_front_right:  A binary value indicating whether the part is visible.
light_rear_left:    A binary value indicating whether the part is visible.
light_rear_right:   A binary value indicating whether the part is visible.
wheel_front_left:   A binary value indicating whether the part is visible.
wheel_front_right:  A binary value indicating whether the part is visible.
wheel_rear_left:    A binary value indicating whether the part is visible.
wheel_rear_right:   A binary value indicating whether the part is visible.
door_front_left:    A binary value indicating whether the part is visible.
door_front_right:   A binary value indicating whether the part is visible.
segmentationid:     A value (0 to 2) indicating the quality of the provided
                    segmentation: 0 is a wrong segmentation, 1 is mostly on
                    the object but inaccurate, 2 is perfect.
segmentation:       A string representation of the above value.

================================================================================
3. Segmentations
================================================================================

We provide segmentations produced by [2] for all shots that contain at least
one instance of the object class. The segmentation for each shot is stored as
a file named "segmentationShotX.mat", where X is the shot index. Each file
contains a Kx1 cell array, where K is the number of frames in that shot. Each
element of the cell array is the binary segmentation mask (1 = foreground,
0 = background) for the corresponding frame in the shot: the first element
corresponds to the first frame of the shot, and so on.

================================================================================
References
================================================================================

[1] Discovering object aspects from video
    Anestis Papazoglou, Luca Del Pero, Vittorio Ferrari
    Image and Vision Computing, August 2016

[2] Fast object segmentation in unconstrained video
    Anestis Papazoglou, Vittorio Ferrari
    International Conference on Computer Vision (ICCV), 2013

================================================================================
Support
================================================================================

For any query, suggestion or complaint, please send us an email:
papazoglou.anestis@gmail.com