Idiap-ETHZ Faces and Poses Dataset V 1.0
========================================

Luo Jie, Barbara Caputo, Vittorio Ferrari


Introduction
------------

Welcome to this release of the "Faces and Poses" dataset, first used in [1].
It contains 1703 image-caption pairs. A caption typically contains the names
of some of the persons appearing in the corresponding image, as well as verbs
indicating what they are doing.

1600 'training' pairs are to be used for evaluating algorithms that recover
the correspondence between names in the caption and persons in the image (as
this correspondence is not given to the algorithm beforehand). Since such
algorithms typically also train appearance models for the persons, the
remaining 103 'test' pairs can be used to evaluate how well these models
recognize persons in new images. The number of training images is slightly
smaller (by 10 images) than the number reported in the paper [1], because we
removed a few images containing offensive content.

The dataset was collected in December 2008 by querying Google Images using
the image crawler of Florian Schroff [2]. The query keywords were generated
by combining various names (sport stars and politicians) and verbs (from
sports and social interactions). The list of keywords can be found within the
dataset, as we organized the downloaded images into folders named after the
keywords used to retrieve them. The images were filtered using a face
detector [4] to ensure that every image contains at least one person whose
face occupies less than 5% of the image area (so as to leave enough space for
the body pose to be visible). Along with each image, we also stored the
corresponding snippet of text returned by Google Images. External annotators
were then asked to extend or modify the snippet into a full, realistic
caption where necessary.
This resulted in captions of varied, long sentences, each mentioning the
action of at least one person in the image as well as names/verbs not
appearing in the image (acting as 'noise'). During the construction of the
dataset, we tried to keep intervention to a minimum, so as to keep the data
as realistic as possible. As a result, the dataset contains various
challenges for face recognition and pose recognition algorithms, such as
profile faces, low-resolution images, varying illumination, and even some
hand-drawings and cartoons.

Ground-truth annotations for the name-verb pairs mentioned in the captions,
as well as the locations of their associated persons in the images, are also
available as part of the dataset. This ground-truth data is included for
evaluating the output of algorithms that attempt to recover the image-text
correspondences; it should not be given to the algorithm beforehand (please
see [1] for a detailed experimental protocol). In addition, we also include
the name-verb pairs extracted automatically from the captions using the
language parser developed by Koen Deschacht and Marie-Francine Moens [3], as
well as the face and upper-body bounding-boxes detected using our detectors
[4,5]. These are included to facilitate a direct comparison to our results.


Contents
--------

1. data.tar.gz -- dataset of images and captions

   Let <dir> be the directory where this package was uncompressed. The
   resulting sub-directories contain:

   <dir>/train/               - images and captions for training the classifiers
   <dir>/train/jpg/           - images directory
   <dir>/train/jpg/<keyword>/ - images downloaded using each <keyword> as query
   <dir>/train/txt/           - captions directory
   <dir>/train/txt/<keyword>/ - captions; each .txt file corresponds to an
                                image in the image directory
   <dir>/test/                - images and captions for testing the classifiers
   ....
   - the test directory is organized in the same way as the train directory

   An example caption annotation:

   ***********************************************************
   caption: Chinese President Hu Jintao and President of the Republic of
   Korea Lee Myung-bak shake hands before their meeting at the presidential
   palace in Seoul, capital of the Republic of Korea. Hu Jintao arrived in
   Seoul on Monday for a two-day state visit to ROK.

   names: Hu Jintao (182,111), Lee Myung-bak (347,124)
   verbs: shake hands, shake hands
   ***********************************************************

   Under 'names': the ground-truth persons doing the action in the image,
   with the coordinates of the center of their face in parentheses.
   Under 'verbs': the corresponding ground-truth verb associations, in the
   same order as 'names'.

2. captions.mat -- captions and name-verb pairs in MAT-File format

   captions(idx).fdir   - directory of the caption
   captions(idx).fname  - file name of the caption
   captions(idx).text   - the full caption
   captions(idx).manual - the ground-truth name-verb pairs in the caption,
                          a P-by-2 cell matrix:
                          captions(idx).manual{p, 1} is the p-th name;
                          captions(idx).manual{p, 2} is the verb associated
                          with the p-th name, 'NULL' if no verb is found
   captions(idx).auto   - the name-verb pairs extracted automatically using
                          a language parser [3], in the same format as the
                          ground-truth pairs

3. bbx.mat -- detected face and upper-body bounding-boxes in MAT-File format

   bbx format: [ top-left coordinate x, top-left coordinate y, width, height ]

   bbx(idx).fdir       - directory of the image
   bbx(idx).fname      - file name of the image
   bbx(idx).face_bbx   - the detected face(s), a P-by-4 matrix;
                         each row corresponds to a detected face bbx
   bbx(idx).name       - the ground-truth labels of the face bbx(es);
                         'NULL' if it is a false detection or an un-named
                         person
   bbx(idx).upbody_bbx - the detected upper body(ies), a P-by-4 matrix;
                         each row corresponds to a detected upper-body bbx
   bbx(idx).verb       - the ground-truth verb labels of the upper-body
                         bbx(es)

4. dictionary.mat -- a list of frequent names and verbs considered in the
   experiments reported in [1]


References
----------

[1] L. Jie, B. Caputo and V. Ferrari. Who's Doing What: Joint Modeling of
    Names and Verbs for Simultaneous Face and Pose Annotation. In Advances
    in Neural Information Processing Systems 22 (NIPS), 2009.

[2] http://www.robots.ox.ac.uk/~schroff/software/mkdb_thin.zip

[3] K. Deschacht and M.-F. Moens. Semi-supervised Semantic Role Labeling
    Using the Latent Words Language Model. In Proceedings of the Conference
    on Empirical Methods in Natural Language Processing (EMNLP), 2009.

[4] http://torch3vision.idiap.ch

[5] http://www.robots.ox.ac.uk/~vgg/software/UpperBody/


Important Notice
----------------

These images were downloaded from the internet and may be subject to
copyright. We do not own the copyright of the images, and provide them only
for non-commercial research purposes.


Support
-------

For any query, suggestion or complaint, or simply to say that you like/use
the annotations and software, just drop us an email:

jluo AT idiap.ch
ferrari AT vision.ee.ethz.ch
bcaputo AT idiap.ch
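The caption annotations described in the Contents section are plain text with
'caption:', 'names:' and 'verbs:' fields. As a convenience, here is a minimal
parsing sketch in Python (standard library only); it is not part of the
released tools, and the function name and parsing details are our own:

```python
import re

def parse_annotation(text):
    """Parse one caption annotation block (format shown in this README).

    Returns (caption, names, verbs) where names is a list of
    (person name, (x, y) face-center) tuples and verbs is a list of
    verb strings in the same order as the names.
    """
    fields = {"caption": "", "names": "", "verbs": ""}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # A line starting with a known field label opens that field;
        # other lines continue the currently open field (captions may wrap).
        for key in fields:
            if stripped.startswith(key + ":"):
                current = key
                stripped = stripped[len(key) + 1:].strip()
                break
        if current is not None and stripped:
            fields[current] += (" " if fields[current] else "") + stripped
    # 'names' entries look like: Hu Jintao (182,111), Lee Myung-bak (347,124)
    names = [(m.group(1).strip(), (int(m.group(2)), int(m.group(3))))
             for m in re.finditer(r"([^,()]+)\((\d+)\s*,\s*(\d+)\)",
                                  fields["names"])]
    verbs = [v.strip() for v in fields["verbs"].split(",") if v.strip()]
    return fields["caption"], names, verbs

# Usage on the example annotation from this README:
example = (
    "caption: Chinese President Hu Jintao and President of the Republic of "
    "Korea Lee Myung-bak shake hands before their meeting at the presidential "
    "palace in Seoul, capital of the Republic of Korea. Hu Jintao arrived in "
    "Seoul on Monday for a two-day state visit to ROK.\n"
    "names: Hu Jintao (182,111), Lee Myung-bak (347,124)\n"
    "verbs: shake hands, shake hands\n"
)
caption, names, verbs = parse_annotation(example)
```

On the example above, `names` comes out as
`[('Hu Jintao', (182, 111)), ('Lee Myung-bak', (347, 124))]` and `verbs` as
`['shake hands', 'shake hands']`, matching the ground-truth ordering
convention (the p-th verb belongs to the p-th name).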