Idiap-ETHZ Faces and Poses Dataset V 1.0
========================================

Luo Jie, Barbara Caputo, Vittorio Ferrari


Introduction
------------

Welcome to this release of the "Faces and Poses" dataset, first used in [1].
It contains 1703 image-caption pairs. A caption typically contains the names
of some of the persons appearing in the corresponding image, as well as verbs
indicating what they are doing.

1600 'training' pairs are to be used for evaluating algorithms that recover
the correspondence between names in the caption and persons in the image (as
this correspondence is not given to the algorithm beforehand). Since such
algorithms typically also train appearance models for the persons, the
remaining 103 'test' pairs can be used to evaluate how well these models
recognize persons in new images. The number of training images is slightly
smaller (by 10 images) than the number reported in the paper [1], because we
removed a few images containing offensive content.

The dataset was collected in December 2008 by querying Google Images using
the image crawler of Florian Schroff [2]. The query keywords were generated
by combining various names (sport stars and politicians) and verbs (from
sports and social interactions). The list of keywords can be found within the
dataset, as we organized the downloaded images into folders named after the
keywords used to retrieve them. The images were filtered using a face
detector [4] to ensure that every image contains at least one person whose
face occupies less than 5% of the image area (so as to leave enough space for
the body pose to be visible). Along with each image, we also stored the
corresponding snippet of text returned by Google Images. External annotators
were then asked to extend or modify the snippet into a full, realistic
caption where necessary.
This resulted in captions of varied, long sentences, each mentioning the
action of at least one person in the image as well as names/verbs not
appearing in the image (acting as 'noise'). During the construction of the
dataset, we tried to keep intervention to a minimum, so as to keep the data
as realistic as possible. As a result, the dataset contains various
challenges for face recognition and pose recognition algorithms, such as
profile faces, low-resolution images, varying illumination, and even some
hand-drawings and cartoons.

Ground-truth annotations for the name-verb pairs mentioned in the captions,
as well as the locations of their associated persons in the images, are also
available as part of the dataset. This ground-truth data is included for
evaluating the output of algorithms that attempt to recover the image-text
correspondences; it should not be given to the algorithm beforehand (please
see [1] for a detailed experimental protocol). In addition, we also include
the name-verb pairs extracted automatically from the captions using the
language parser developed by Koen Deschacht and Marie-Francine Moens [3], as
well as the face and upper-body bounding-boxes detected using our detectors
[4,5]. These are included to facilitate a direct comparison to our results.


Contents
--------

1. data.tar.gz -- dataset of images and captions

   Let <dir> be the directory where this package was uncompressed. The
   resulting sub-directories contain:

   <dir>/train/               - images and captions for training the classifiers
   <dir>/train/jpg/           - images directory
   <dir>/train/jpg/<keyword>/ - images downloaded using each <keyword> as query
   <dir>/train/txt/           - captions directory
   <dir>/train/txt/<keyword>/ - captions; each .txt file corresponds to an
                                image in the image directory
   <dir>/test/                - images and captions for testing the classifiers
   ....
   - the test directory is organized in the same way as the train directory

   An example caption annotation:

   ***********************************************************
   caption: Chinese President Hu Jintao and President of the Republic of
   Korea Lee Myung-bak shake hands before their meeting at the presidential
   palace in Seoul, capital of the Republic of Korea. Hu Jintao arrived in
   Seoul on Monday for a two-day state visit to ROK.

   names: Hu Jintao (182,111), Lee Myung-bak (347,124)
   verbs: shake hands, shake hands
   ***********************************************************

   Under 'names': the ground-truth persons doing the action in the image,
   with the coordinates of the center of their face in parentheses.
   Under 'verbs': the corresponding ground-truth verb associations, in the
   same order as 'names'.

2. captions.mat -- captions and name-verb pairs in MAT-File format

   captions(idx).fdir   - directory of the caption
   captions(idx).fname  - file name of the caption
   captions(idx).text   - the full caption
   captions(idx).manual - the ground-truth name-verb pairs in the caption,
                          a P-by-2 cell matrix:
                          captions(idx).manual{p, 1} is the p-th name;
                          captions(idx).manual{p, 2} is the verb associated
                          with the p-th name, 'NULL' if no verb is found
   captions(idx).auto   - the name-verb pairs extracted automatically using
                          a language parser [3], in the same format as the
                          ground-truth pairs

3. bbx.mat -- detected face and upper-body bounding-boxes in MAT-File format

   bbx format: [ top-left coordinate x, top-left coordinate y, width, height ]

   bbx(idx).fdir       - directory of the image
   bbx(idx).fname      - file name of the image
   bbx(idx).face_bbx   - the detected face(s), a P-by-4 matrix;
                         each row corresponds to a detected face bbx
   bbx(idx).name       - the ground-truth labels of the face bbx(es);
                         'NULL' if it is a false detection or an un-named
                         person
   bbx(idx).upbody_bbx - the detected upper body(ies), a P-by-4 matrix;
                         each row corresponds to a detected upper-body bbx
   bbx(idx).verb       - the ground-truth verb labels of the upper-body
                         bbx(es)

4. dictionary.mat -- a list of frequent names and verbs considered in the
   experiments reported in [1]


References
----------

[1] L. Jie, B. Caputo and V. Ferrari. Who's Doing What: Joint Modeling of
    Names and Verbs for Simultaneous Face and Pose Annotation. In Advances
    in Neural Information Processing Systems 22 (NIPS), 2009.

[2] http://www.robots.ox.ac.uk/~schroff/software/mkdb_thin.zip

[3] K. Deschacht and M.-F. Moens. Semi-supervised Semantic Role Labeling
    Using the Latent Words Language Model. In Proceedings of the Conference
    on Empirical Methods in Natural Language Processing (EMNLP), 2009.

[4] http://torch3vision.idiap.ch

[5] http://www.robots.ox.ac.uk/~vgg/software/UpperBody/


Important Notice
----------------

These images were downloaded from the internet and may be subject to
copyright. We do not own the copyright of the images, and provide them only
for non-commercial research purposes.


Support
-------

For any query, suggestion or complaint, or simply to say that you like/use
the annotations and software, just drop us an email:

jluo AT idiap.ch
ferrari AT vision.ee.ethz.ch
bcaputo AT idiap.ch
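The caption annotations described in the Contents section are plain text with
'caption:', 'names:' and 'verbs:' fields. As a convenience, here is a minimal
parsing sketch in Python (standard library only); it is not part of the
released tools, and the function name and parsing details are our own:

```python
import re

def parse_annotation(text):
    """Parse one caption annotation block (format shown in this README).

    Returns (caption, names, verbs) where names is a list of
    (person name, (x, y) face-center) tuples and verbs is a list of
    verb strings in the same order as the names.
    """
    fields = {"caption": "", "names": "", "verbs": ""}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # A line starting with a known field label opens that field;
        # other lines continue the currently open field (captions may wrap).
        for key in fields:
            if stripped.startswith(key + ":"):
                current = key
                stripped = stripped[len(key) + 1:].strip()
                break
        if current is not None and stripped:
            fields[current] += (" " if fields[current] else "") + stripped
    # 'names' entries look like: Hu Jintao (182,111), Lee Myung-bak (347,124)
    names = [(m.group(1).strip(), (int(m.group(2)), int(m.group(3))))
             for m in re.finditer(r"([^,()]+)\((\d+)\s*,\s*(\d+)\)",
                                  fields["names"])]
    verbs = [v.strip() for v in fields["verbs"].split(",") if v.strip()]
    return fields["caption"], names, verbs

# Usage on the example annotation from this README:
example = (
    "caption: Chinese President Hu Jintao and President of the Republic of "
    "Korea Lee Myung-bak shake hands before their meeting at the presidential "
    "palace in Seoul, capital of the Republic of Korea. Hu Jintao arrived in "
    "Seoul on Monday for a two-day state visit to ROK.\n"
    "names: Hu Jintao (182,111), Lee Myung-bak (347,124)\n"
    "verbs: shake hands, shake hands\n"
)
caption, names, verbs = parse_annotation(example)
```

On the example above, `names` comes out as
`[('Hu Jintao', (182, 111)), ('Lee Myung-bak', (347, 124))]` and `verbs` as
`['shake hands', 'shake hands']`, matching the ground-truth ordering
convention (the p-th verb belongs to the p-th name).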