
AgentVQA: A Multi-Domain Visual Question Answering Dataset

AgentVQA is a multiple-choice visual question answering dataset for training and evaluating visual agents across five domains: GUI interaction, spatial reasoning, game understanding, robot manipulation, and video perception. Each question is grounded in a screenshot, one or more images, or a video.

Dataset Structure

The dataset is organized into five main domains:

1. Web Agents Domain (GUI Interaction)

  • android-in-the-wild: Android app interactions (1000 samples)
  • mind2web: Web navigation tasks (1000 samples, stratified by category)
  • monday: Mobile OS interactions (1000 samples, stratified by OS)
  • screenspot: GUI element grounding (1000 samples)
  • screenspot-pro: Advanced GUI grounding (1000 samples)

2. Spatial Reasoning Domain

  • embspatial-bench: Object-object and object-scene spatial relationships (1000 samples)
  • space-10: Multi-image spatial reasoning with entity presence tasks (1000 samples)
  • 3dsrbench: 3D spatial reasoning benchmark
  • spatialmm: Spatial reasoning in multi-modal contexts
  • omnispatial: Comprehensive spatial reasoning across multiple task types

3. Game Understanding Domain

  • gameqa: 3D spatial perception and reasoning in games (1000 samples)
  • videogamebunny: Video game understanding across multiple categories
  • but-they-are-cats-tutorial: Game tutorial comprehension tasks
  • marioqa: Video-based event understanding in Super Mario gameplay

4. Robot Manipulation Domain

  • robo2vlm: Robot task success state evaluation (1000 samples)
  • roborefit: Robot grasping and manipulation tasks (1000 samples)
  • erqa: Embodied reasoning for robot question answering (400 samples)
  • manipbench: Robot manipulation trajectory planning (1000 samples)
  • sharerobot: Multi-frame robot task understanding (1000 samples)

5. Video Perception & Egocentric Understanding Domain

  • perception-test: Camera motion and video perception tasks (1000 samples)
  • vsi-bench: Video spatial intelligence with egocentric viewpoint questions (1000 samples)
  • openeqa: Open-ended embodied question answering from video (1000 samples)
  • videgothink: Egocentric video understanding and reasoning (1000 samples)
  • egoplan-bench: Egocentric task planning and goal understanding (1000 samples)

Loading the Dataset

Load a specific dataset:

from datasets import load_dataset

# Load a web agents dataset
ds = load_dataset("advaitgupta/AgentVQA", "android-in-the-wild", split="train")

# Load a spatial reasoning dataset
ds = load_dataset("advaitgupta/AgentVQA", "embspatial-bench", split="train")

# Load a video perception dataset
ds = load_dataset("advaitgupta/AgentVQA", "perception-test", split="train")

# Load an egocentric dataset
ds = load_dataset("advaitgupta/AgentVQA", "egoplan-bench", split="train")

Loading Video Datasets (stored as bytes):

from datasets import load_dataset
from decord import VideoReader
from io import BytesIO

# Load video dataset
ds = load_dataset("advaitgupta/AgentVQA", "videgothink", split="train")
sample = ds[0]

# Access video bytes
video_bytes = sample["Video_Bytes"]
filename = sample["Video_Filename"]

# Option 1: Save to file
with open(f"/tmp/{filename}", "wb") as f:
    f.write(video_bytes)

# Option 2: Use decord for direct loading
vr = VideoReader(BytesIO(video_bytes))
frames = vr.get_batch(range(len(vr))).asnumpy()
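
Decoding every frame with range(len(vr)) can use a lot of memory on long clips. As a minimal sketch of one alternative, assuming evenly spaced sampling is acceptable for the task, the helper below picks the centre frame of equal-length segments. The name uniform_frame_indices is hypothetical and is not part of this dataset or of decord; the list it returns can be passed to get_batch in place of the full range.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(num_frames))
    # Centre of each of `num_samples` equal-width segments, in integer arithmetic.
    return [(2 * i + 1) * num_frames // (2 * num_samples) for i in range(num_samples)]


# With decord, the indices would replace range(len(vr)):
#   vr = VideoReader(BytesIO(video_bytes))
#   frames = vr.get_batch(uniform_frame_indices(len(vr), 8)).asnumpy()
print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

The sample count (8 in the comment) is an arbitrary illustration; a suitable value depends on the clip length and the model's input budget.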

Helper Functions:

from datasets import load_dataset

# Available domains and their datasets
DOMAINS = {
    "web_agents": [
        "android-in-the-wild",
        "mind2web",
        "monday",
        "screenspot",
        "screenspot-pro"
    ],
    "spatial_reasoning": [
        "embspatial-bench",
        "space-10",
        "3dsrbench",
        "spatialmm",
        "omnispatial"
    ],
    "game_understanding": [
        "gameqa",
        "videogamebunny",
        "but-they-are-cats-tutorial",
        "marioqa"
    ],
    "robot_manipulation": [
        "robo2vlm",
        "roborefit",
        "erqa",
        "manipbench",
        "sharerobot"
    ],
    "video_perception": [
        "perception-test",
        "vsi-bench",
        "openeqa",
        "videgothink",
        "egoplan-bench"
    ]
}

def load_domain_dataset(domain, dataset_name):
    """Load a specific dataset from a domain."""
    # Reject names that do not belong to the requested domain.
    if dataset_name not in DOMAINS[domain]:
        raise ValueError(f"{dataset_name!r} is not listed under domain {domain!r}")
    return load_dataset("advaitgupta/AgentVQA", dataset_name, split="train")

def load_all_domain_datasets(domain):
    """Load all datasets from a specific domain."""
    datasets = {}
    for dataset_name in DOMAINS[domain]:
        datasets[dataset_name] = load_dataset("advaitgupta/AgentVQA", dataset_name, split="train")
    return datasets

def load_all_datasets():
    """Load all datasets from all domains."""
    all_datasets = {}
    for domain in DOMAINS:
        all_datasets.update(load_all_domain_datasets(domain))
    return all_datasets

# Example usage:
ds = load_domain_dataset("video_perception", "perception-test")
video_datasets = load_all_domain_datasets("video_perception")
print(video_datasets.keys())

Dataset Fields

Web Agents Domain

android-in-the-wild

  • image: Screenshot of the app
  • step_id: Step number in the episode
  • action_history: Previous actions taken
  • question: The question about next action
  • options: List of 4 possible actions
  • correct_answer: Index of correct option (0-3)
  • episode_goal: Overall goal of the episode
  • source_dataset: Original source dataset name

mind2web

  • image: Screenshot of the webpage
  • question: The question about next action
  • options: List of possible actions
  • answer: Index of correct option
  • task: Description of the task
  • action_history: Previous actions taken
  • category: Task category
  • website: Website name
  • domain: Domain category

monday

  • image: Screenshot of the mobile interface
  • goal: Goal description
  • question: The question about next action
  • options: List of possible actions
  • answer: Index of correct option
  • action_history: Previous actions taken
  • os: Operating system (Android/iOS)

screenspot & screenspot-pro

  • image: Screenshot
  • question: Instruction to complete
  • options: List of tap/click actions
  • answer: Index of correct option

Spatial Reasoning Domain

embspatial-bench

  • Image: Scene image
  • Question: Spatial relationship question
  • Options: List of 4 possible answers
  • Answer: Index of correct option (0-3)
  • Category: High-level category

space-10

  • Images: List of 8 scene images (multi-image reasoning)
  • Question: Entity presence question
  • Options: List of 4 descriptions
  • Answer: Index of correct option (0-3)
  • Category: Task category

3dsrbench, spatialmm, omnispatial

  • Image: Scene image
  • Question: Spatial reasoning question
  • Options: List of possible answers
  • Answer: Index of correct option
  • Category/Task_Type: Task categorization

Game Understanding Domain

gameqa

  • Image: Game screenshot
  • Question: Game state reasoning question
  • Options: List of 4 possible answers
  • Answer: Index of correct option (0-3)
  • Game_Type: Type of game
  • Game_Name: Specific game name

videogamebunny

  • Image: Game screenshot
  • Question: Game understanding question
  • Options: List of 4 possible answers
  • Answer: Index of correct option (0-3)
  • Category: Question category

marioqa

  • Video: Super Mario gameplay video clip
  • Question: Event-based question about gameplay
  • Options: List of 4 possible answers
  • Answer: Correct option letter (A, B, C, or D)
  • Question_Type: Type of question
  • Event: Event type being queried
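
marioqa stores Answer as a letter, whereas most other configs store a 0-based index. When evaluating several configs together it can help to normalise both forms. The sketch below uses a hypothetical helper, answer_to_index, and assumes that the letters follow the order of Options (A is index 0, B is index 1, and so on), which this card does not state explicitly.

```python
def answer_to_index(answer) -> int:
    """Map an answer given as an int index or a single letter to a 0-based index."""
    if isinstance(answer, int):
        return answer
    text = str(answer).strip().upper()
    if len(text) == 1 and "A" <= text <= "Z":
        # Assumption: letters follow option order, so "A" is option 0.
        return ord(text) - ord("A")
    if text.isdigit():
        # Tolerate indices that were serialised as strings.
        return int(text)
    raise ValueError(f"Unrecognised answer value: {answer!r}")


print(answer_to_index("C"))  # 2
print(answer_to_index(1))    # 1
```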

Robot Manipulation Domain

robo2vlm

  • Image: Robot scene image
  • Question: Task success evaluation question
  • Options: List of 4 possible answers
  • Answer: Index of correct option (0-3)
  • Category: Question category

roborefit

  • Image: Robot manipulation scene
  • Question: Grasping instruction
  • Options: List of 4 coordinate pairs
  • Answer: Index of correct option (0-3)

erqa

  • Images: List of robot scene images
  • Question: Embodied reasoning question
  • Options: List of 4 possible answers
  • Answer: Index of correct option (0-3)
  • Category: Question type

manipbench

  • Image: Robot manipulation scene with grid overlay
  • Question: Trajectory planning question
  • Options: List of 4 trajectory options
  • Answer: Index of correct option (0-3)
  • Category: Task category

sharerobot

  • Images: List of robot trajectory frames
  • Question: Task understanding question
  • Options: List of 4 possible task descriptions
  • Answer: Index of correct option (0-3)
  • Category: Task type

Video Perception & Egocentric Domain

perception-test

  • Video_Bytes: Raw video bytes (decode with decord)
  • Video_Filename: Original video filename
  • Question: Perception question (e.g., camera motion)
  • Options: List of possible answers
  • Answer: Index of correct option (0-3)

vsi-bench

  • Video_Bytes: Raw video bytes
  • Video_Filename: Original video filename
  • Question: Spatial intelligence question from egocentric viewpoint
  • Options: List of possible answers (directional/spatial)
  • Answer: Index of correct option (0-3)
  • Category: Question type (e.g., object_rel_direction)

openeqa

  • Video_Bytes: Raw video bytes
  • Video_Filename: Original video filename
  • Question: Open-ended question about the environment
  • Options: List of possible answers
  • Answer: Index of correct option (0-3)
  • Category: Question category (e.g., attribute recognition, world knowledge)

videgothink

  • Video_Bytes: Raw video bytes
  • Video_Filename: Original video filename
  • Question: Understanding question about egocentric actions
  • Options: List of possible answers
  • Answer: Index of correct option (0-3)

egoplan-bench

  • Video_Bytes: Raw video bytes
  • Video_Filename: Original video filename
  • Question: Planning question about task execution
  • Options: List of possible next actions
  • Answer: Index of correct option (0-3)
  • Task_Goal: Overall goal of the task
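
Because every config reduces to picking one option index, results can be scored with plain accuracy, optionally broken down by the Category field where a config provides one. The following is a minimal sketch, not an official evaluation script. The function name accuracy_by_category and the example predictions are hypothetical. For configs without a Category field, a constant label such as the config name could be used instead.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute accuracy per category from (category, gold_index, predicted_index) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, gold, pred in records:
        total[category] += 1
        correct[category] += int(gold == pred)
    return {category: correct[category] / total[category] for category in total}


# Made-up predictions for illustration only; these are not real model results.
records = [
    ("height_higher", 0, 0),
    ("height_higher", 1, 0),
    ("location_above", 1, 1),
]
print(accuracy_by_category(records))  # {'height_higher': 0.5, 'location_above': 1.0}
```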

Citation

If you use this dataset, please cite the original papers for each component dataset.

License

MIT License
