# CLEVRER Dataset

The CLEVRER dataset is a diagnostic video question answering dataset for temporal and causal reasoning. The current version of the dataset includes three parts: videos, annotations, and questions. There are 20,000 videos, divided into train (indices 0-9999), validation (indices 10000-14999), and test (indices 15000-19999) splits.

## Videos

CLEVRER contains synthetic videos of moving and colliding objects. Each video is 5 seconds long and contains 128 frames at a resolution of 480 x 320. The videos are stored in MP4 format and are arranged in separate folders of 1,000 files each.

## License

Our dataset is released under CC0.

## Annotations

CLEVRER provides annotations for the train and validation splits for model diagnostics and further benchmarking. Each annotation file corresponds to one video and contains static object properties (i.e. color, material, shape), motion trajectories (e.g. location, orientation, velocity, angular velocity), and collision events. The files are in JSON format and are arranged in separate folders of 1,000 files each. The following example shows the structure of a single annotation file.

```
annotations/train/annotation_00000-01000/annotation_00000.json:
{
    scene_index: 0,
    video_filename: 'video_00000.mp4',
    object_property: [
        {object_id: 0, color: 'blue', material: 'rubber', shape: 'sphere'},
        ...
    ],
    motion_trajectory: [
        {
            frame_id: 0,
            objects: [
                {
                    object_id: 0,
                    location: [...],
                    orientation: [...],
                    velocity: [...],
                    angular_velocity: [...],
                    inside_camera_view: False
                },
                ...
            ],
        },
        ...
    ],
    collision: [
        {object_ids: [0, 1], frame_id: 19, location: [...]},
        ...
    ]
}
```

## Questions

Questions in CLEVRER fall into four types: descriptive, explanatory, predictive, and counterfactual. Descriptive questions are answered with a single word token, while the other three question types are multiple choice. Each question in the training and validation sets comes with program and answer annotations.
The programs are represented as lists of module names in postfix notation. Programs and answers are omitted in the test split. Subtypes of descriptive questions are also included for further diagnostics. Each multiple-choice question may have one, several, or no correct choices. The structures of the training (and validation) and test question files are shown below.

```
questions/train.json:
[
    {
        scene_index: 0,
        video_filename: 'video_00000.mp4',
        questions: [
            // Descriptive question
            {
                question_id: 0,
                question: 'What is the shape of the object to collide with the purple object?',
                question_type: 'descriptive',
                question_subtype: 'query_shape',
                program: [...],
                answer: 'sphere',
            },
            ...
            // Multiple choice question
            {
                question_id: 21,
                question: 'Which event will happen if the cylinder is removed?',
                question_type: 'counterfactual',
                program: [...],
                choices: [
                    {choice_id: 0, choice: 'The blue rubber sphere collides with the cube', program: [...], answer: 'wrong'},
                    ...
                ]
            },
            ...
        ],
    },
    ...
]

questions/test.json:
[
    {
        scene_index: 15000,
        video_filename: 'video_15000.mp4',
        questions: [
            // Descriptive question
            {
                question_id: 0,
                question: 'How many objects are moving?',
                question_type: 'descriptive',
                question_subtype: 'count',
            },
            ...
            // Multiple choice question
            {
                question_id: 11,
                question: "Which of the following is responsible for the rubber cube's colliding with the cylinder?",
                question_type: 'explanatory',
                choices: [
                    {choice_id: 0, choice: 'The presence of the brown sphere'},
                    ...
                ]
            },
            ...
        ],
    },
    ...
]
```
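The split ranges and the 1,000-files-per-folder layout described at the top of this README can be turned into small path helpers. The sketch below is a minimal illustration under stated assumptions. The helper names are hypothetical. The folder pattern `annotation_XXXXX-YYYYY` is extrapolated from the single `annotations/train/...` example above. The `validation` directory name is also a guess.

```python
# Hypothetical helpers for locating CLEVRER files by video index.
# Split boundaries come from the README: train 0-9999,
# validation 10000-14999, test 15000-19999.
SPLITS = [("train", 0, 9999), ("validation", 10000, 14999), ("test", 15000, 19999)]
FILES_PER_FOLDER = 1000


def split_of(index: int) -> str:
    """Return the name of the split that a video index belongs to."""
    for name, first, last in SPLITS:
        if first <= index <= last:
            return name
    raise ValueError(f"video index {index} is outside 0-19999")


def video_filename(index: int) -> str:
    """File name of a video, following the 'video_00000.mp4' pattern."""
    return f"video_{index:05d}.mp4"


def annotation_path(index: int) -> str:
    """Relative path of the annotation JSON for a train or validation video.

    Assumption: every split uses the folder pattern of the README example
    'annotations/train/annotation_00000-01000/annotation_00000.json', and
    the validation directory is literally named 'validation'.
    """
    split = split_of(index)
    if split == "test":
        raise ValueError("annotations are only provided for train and validation")
    start = (index // FILES_PER_FOLDER) * FILES_PER_FOLDER  # first index in the folder
    folder = f"annotation_{start:05d}-{start + FILES_PER_FOLDER:05d}"
    return f"annotations/{split}/{folder}/annotation_{index:05d}.json"
```

For example, `annotation_path(1234)` yields `annotations/train/annotation_01000-02000/annotation_01234.json`. Verify the folder names against an actual download before relying on them.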
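The fields in the annotation file structure shown in the Annotations section can be cross-referenced with a few lines of Python. The sketch below joins `collision` events with `object_property` so each event reads as a phrase. It also lists which objects are flagged `inside_camera_view` in a given frame. The functions are hypothetical. The toy annotation is invented solely to mirror the documented fields, and it is written as standard JSON with lowercase `true`/`false`. For a real file, `json.load` on the opened path would replace `json.loads`.

```python
import json
from typing import Any


def describe_object(props: dict[str, Any]) -> str:
    """Build a phrase such as 'blue rubber sphere' from object_property fields."""
    return f"{props['color']} {props['material']} {props['shape']}"


def collision_summary(annotation: dict[str, Any]) -> list[str]:
    """Render each collision event as 'frame N: <object> collides with <object>'."""
    objects = {p["object_id"]: p for p in annotation["object_property"]}
    lines = []
    for event in annotation["collision"]:
        # Each collision lists exactly two object ids.
        a, b = (describe_object(objects[i]) for i in event["object_ids"])
        lines.append(f"frame {event['frame_id']}: {a} collides with {b}")
    return lines


def visible_objects(annotation: dict[str, Any], frame_id: int) -> list[int]:
    """Ids of objects flagged inside_camera_view in the given frame (empty if absent)."""
    for frame in annotation["motion_trajectory"]:
        if frame["frame_id"] == frame_id:
            return [o["object_id"] for o in frame["objects"] if o["inside_camera_view"]]
    return []


# Toy annotation (made up, not taken from the dataset) mirroring the documented fields.
toy = json.loads("""
{
  "scene_index": 0,
  "video_filename": "video_00000.mp4",
  "object_property": [
    {"object_id": 0, "color": "blue", "material": "rubber", "shape": "sphere"},
    {"object_id": 1, "color": "gray", "material": "metal", "shape": "cube"}
  ],
  "motion_trajectory": [
    {"frame_id": 0, "objects": [
      {"object_id": 0, "inside_camera_view": true},
      {"object_id": 1, "inside_camera_view": false}
    ]}
  ],
  "collision": [{"object_ids": [0, 1], "frame_id": 19}]
}
""")

print(collision_summary(toy))  # ['frame 19: blue rubber sphere collides with gray metal cube']
print(visible_objects(toy, 0))  # [0]
```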
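Descriptive questions carry a single answer token, while a multiple-choice question may have one, several, or no correct choices. One convenient sketch therefore represents multiple-choice ground truth as a set of `choice_id`s that is allowed to be empty. The helpers below are hypothetical. They assume the positive label in a choice's `answer` field is the string `'correct'`; only `'wrong'` appears in the example above. The per-choice accuracy shown is one possible scoring scheme, not necessarily the official metric.

```python
from collections import Counter
from typing import Any, Union

GroundTruth = Union[str, frozenset]


def ground_truth(question: dict[str, Any]) -> GroundTruth:
    """Answer token for descriptive questions, else the set of correct choice ids."""
    if question["question_type"] == "descriptive":
        return question["answer"]
    # Assumption: correct choices are labelled 'correct' (the README only shows 'wrong').
    return frozenset(c["choice_id"] for c in question["choices"] if c["answer"] == "correct")


def count_by_type(video_entries: list[dict[str, Any]]) -> Counter:
    """Count questions per question_type across a question file."""
    return Counter(q["question_type"] for entry in video_entries for q in entry["questions"])


def choice_accuracy(question: dict[str, Any], predicted_ids: set) -> float:
    """Fraction of choices whose predicted correct/wrong status matches ground truth."""
    truth = ground_truth(question)
    hits = sum((c["choice_id"] in predicted_ids) == (c["choice_id"] in truth)
               for c in question["choices"])
    return hits / len(question["choices"])


# Toy question file (made up) with the documented fields.
toy_questions = [{
    "scene_index": 0,
    "video_filename": "video_00000.mp4",
    "questions": [
        {"question_id": 0, "question_type": "descriptive",
         "question_subtype": "query_shape", "answer": "sphere"},
        {"question_id": 21, "question_type": "counterfactual", "choices": [
            {"choice_id": 0, "answer": "wrong"}, {"choice_id": 1, "answer": "correct"}]},
        {"question_id": 22, "question_type": "counterfactual", "choices": [
            {"choice_id": 0, "answer": "wrong"}, {"choice_id": 1, "answer": "wrong"}]},
    ],
}]

qs = toy_questions[0]["questions"]
print(count_by_type(toy_questions))  # Counter({'counterfactual': 2, 'descriptive': 1})
print(ground_truth(qs[2]))           # frozenset(): a question with no correct choice
```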
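Because programs are lists of module names in postfix notation, they can be rebuilt into a tree with a stack, provided the number of inputs each module consumes is known. The sketch below shows that standard postfix-to-tree technique. The module names and the `arity` table are hypothetical placeholders, not CLEVRER's actual module vocabulary. If real programs also interleave arguments such as attribute values, the table would need to account for them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    module: str
    children: list[Node] = field(default_factory=list)


def postfix_to_tree(program: list[str], arity: dict[str, int]) -> Node:
    """Rebuild a postfix program into a tree; each module pops its inputs off a stack."""
    stack: list[Node] = []
    for module in program:
        n = arity[module]
        if len(stack) < n:
            raise ValueError(f"module {module!r} needs {n} inputs, found {len(stack)}")
        args = stack[len(stack) - n:]  # the last n results, in original order
        del stack[len(stack) - n:]
        stack.append(Node(module, args))
    if len(stack) != 1:
        raise ValueError("program does not reduce to a single root")
    return stack[0]


def render(node: Node) -> str:
    """Print a tree in nested-call form, e.g. f(g(x), y)."""
    if not node.children:
        return node.module
    return f"{node.module}({', '.join(render(c) for c in node.children)})"


# Hypothetical module names and arities, for illustration only.
ARITY = {"events": 0, "objects": 0, "filter_purple": 1, "filter_collision": 2, "query_shape": 1}
program = ["events", "objects", "filter_purple", "filter_collision", "query_shape"]
print(render(postfix_to_tree(program, ARITY)))
# query_shape(filter_collision(events, filter_purple(objects)))
```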