Question	Options	Answer	Category
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "basketball basket", "man on the skateboard", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "dragon statue", "building", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "laptop", "car", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"Exit\" sign", "mirror", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "national flag", "hat", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "laptop", "wine glass", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "airplane", "speedboat", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"Mexican Mobile Eating\" sign", "woman in black suit", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "spoon", "plate", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "horse", "grey car", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"snap\" sign", "tennis ball", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "chandelier", "lamp on the shelf", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "kite", "building", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "surfboard", "brown building", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "cat", "potted plant", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"3141\" number", "infusion pump", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "sponsor flag", "person with skateboard", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "lamp on the right", "TV", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "flower", "yellow umbrella", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "kites", "streetlight", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "umbrella", "banana sign", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"OBAMA\" sign", "building", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "boat", "shed", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "frisbee", "traffic cone", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "street clock", "red flag", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "metal container", "faucet", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "teddy bear", "bushes", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "clock", "street lights", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"No parking\" sign", "motorcycle", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "remote control", "mirror", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "cake", "letter decorations", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "stop sign", "chimney", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "cable cars", "orange flag", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "speakers", "keyboard", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "blue forward sign", "taxi sign", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "skateboard", "information board", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "picture on the wall", "person in dark blue cloth", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "person holding a laptop", "white truck", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "cow", "house", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "road sign", "street light", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "person on snowboard", "tree", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "giraffe", "\"rainforest cafe\" sign", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "flower", "tv", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "tree", "tiered tower", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "fan", "people", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "stop sign", "white sign on wall", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "yellow source bottle", "man in white shirt", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "glass vases", "coffee machine", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "man in grey suit", "traffic light", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "laptop monitor", "orange cat", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "traffic light", "colorful umbrella", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "traffic light", "purple billboard", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "starbucks sign", "white truck", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "airplane", "crane", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "blue barrel", "dog", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "kite", "bushes", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "person in blue", "street lights", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "black fan", "faucet", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "green street lamp", "giraffe", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "fire hydrant", "white suv", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "yellow \"P\" sign", "person with hotdog", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "laptop", "microwave", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "McDonald's Sign", "bus only sign", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "toaster", "fridge", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "potted plant", "bus", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "white sign with a black arrow", "white sign with street name", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "man with red hat", "green building", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"flu shot\" label", "umbrella", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "ship model", "curtain", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "boat", "plane", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "umbrella", "traffic light", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "green machines", "skier holding a flag", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "black cloth", "yellow box", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "red cup", "bottle with green cap", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"No smoking\" sign", "horse with white hair", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "person", "yellow board", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "potted plant", "red building", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "\"McDonald's\" sign", "pedestrian traffic lights", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "round stone seat on the left", "silver seat in the back", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "small vase", "microwave", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "person holding poles", "building", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "door handle", "faucet", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "laptop", "baby", null, null ]	0	height_higher
Consider the real-world 3D locations of the objects. Which object has a higher location?	[ "step stool", "notebook", null, null ]	1	height_higher
Consider the real-world 3D locations of the objects. Is the pillar directly above the boat?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the white desk directly above the white drawers?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the traffic light directly above the "one-way" sign?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the umbrella directly above the chair?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the empty picture directly above the couch?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the trash bin directly above the trash bag?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the man on skateboard directly above the fire hydrant?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the desserts directly above the teapot?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the street sign directly above the stop sign?	[ "yes", "no", null, null ]	0	location_above
Consider the real-world 3D locations of the objects. Is the "ATM here" sign directly above the black car?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the pink umbrella directly above the person with blue shirt?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the airplane directly above the flag?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the umbrella directly above the man on the phone?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the yellow umbrella directly above the lady?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the road sign directly above the white car?	[ "yes", "no", null, null ]	1	location_above
Consider the real-world 3D locations of the objects. Is the tent directly above the man in red jacket?	[ "yes", "no", null, null ]	0	location_above

AgentVQA: A Multi-Domain Visual Question Answering Dataset

AgentVQA is a multiple-choice visual question answering dataset for training and evaluating visual agents across five domains: GUI interaction, spatial reasoning, game understanding, robot manipulation, and video perception. Each question is grounded in a screenshot, an image, a set of images, or a video.

Dataset Structure

The dataset is organized into five main domains:

1. Web Agents Domain (GUI Interaction)

android-in-the-wild: Android app interactions (1000 samples)
mind2web: Web navigation tasks (1000 samples, stratified by category)
monday: Mobile OS interactions (1000 samples, stratified by OS)
screenspot: GUI element grounding (1000 samples)
screenspot-pro: Advanced GUI grounding (1000 samples)

2. Spatial Reasoning Domain

embspatial-bench: Object-object and object-scene spatial relationships (1000 samples)
space-10: Multi-image spatial reasoning with entity presence tasks (1000 samples)
3dsrbench: 3D spatial reasoning benchmark
spatialmm: Spatial reasoning in multi-modal contexts
omnispatial: Comprehensive spatial reasoning across multiple task types

3. Game Understanding Domain

gameqa: 3D spatial perception and reasoning in games (1000 samples)
videogamebunny: Video game understanding across multiple categories
but-they-are-cats-tutorial: Game tutorial comprehension tasks
marioqa: Video-based event understanding in Super Mario gameplay

4. Robot Manipulation Domain

robo2vlm: Robot task success state evaluation (1000 samples)
roborefit: Robot grasping and manipulation tasks (1000 samples)
erqa: Embodied reasoning for robot question answering (400 samples)
manipbench: Robot manipulation trajectory planning (1000 samples)
sharerobot: Multi-frame robot task understanding (1000 samples)

5. Video Perception & Egocentric Understanding Domain

perception-test: Camera motion and video perception tasks (1000 samples)
vsi-bench: Video spatial intelligence with egocentric viewpoint questions (1000 samples)
openeqa: Open-ended embodied question answering from video (1000 samples)
videgothink: Egocentric video understanding and reasoning (1000 samples)
egoplan-bench: Egocentric task planning and goal understanding (1000 samples)

Loading the Dataset

Load a specific dataset:

from datasets import load_dataset

# Load a web agents dataset
ds = load_dataset("advaitgupta/AgentVQA", "android-in-the-wild", split="train")

# Load a spatial reasoning dataset
ds = load_dataset("advaitgupta/AgentVQA", "embspatial-bench", split="train")

# Load a video perception dataset
ds = load_dataset("advaitgupta/AgentVQA", "perception-test", split="train")

# Load an egocentric dataset
ds = load_dataset("advaitgupta/AgentVQA", "egoplan-bench", split="train")
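
Once a sample is loaded, it usually has to be rendered as a multiple-choice prompt for a model. The following is a minimal sketch, not part of the dataset's tooling: `format_mcq_prompt` is a hypothetical helper, and the lettered layout is only one possible prompt format. It skips `None` entries without renumbering, so each letter still corresponds to the option's original index.

```python
import string

def format_mcq_prompt(question, options):
    """Render a question and its options as a lettered multiple-choice prompt."""
    lines = [question]
    for index, option in enumerate(options):
        if option is None:  # skip null padding but keep the original index
            continue
        lines.append(f"{string.ascii_uppercase[index]}. {option}")
    return "\n".join(lines)

# Hand-written stand-in for a loaded sample (hypothetical values).
sample = {
    "Question": "Which object has a higher location?",
    "Options": ["laptop", "car", None, None],
}
prompt = format_mcq_prompt(sample["Question"], sample["Options"])
print(prompt)
```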

Loading Video Datasets (stored as bytes):

from datasets import load_dataset
from decord import VideoReader
from io import BytesIO

# Load video dataset
ds = load_dataset("advaitgupta/AgentVQA", "videgothink", split="train")
sample = ds[0]

# Access video bytes
video_bytes = sample["Video_Bytes"]
filename = sample["Video_Filename"]

# Option 1: Save to file
with open(f"/tmp/{filename}", "wb") as f:
    f.write(video_bytes)

# Option 2: Use decord for direct loading
vr = VideoReader(BytesIO(video_bytes))
frames = vr.get_batch(range(len(vr))).asnumpy()
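
Decoding every frame with `get_batch(range(len(vr)))` can use a lot of memory on long clips. One common alternative is to decode only a fixed number of evenly spaced frames. The sketch below computes such indices in plain Python; `uniform_frame_indices` is a hypothetical helper, and the choice of 8 frames is an arbitrary assumption rather than a recommendation from the dataset authors.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Return up to num_samples evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    # Take the midpoint of each of num_samples equal-width segments.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

indices = uniform_frame_indices(100, 8)
print(indices)

# With decord, the indices can replace range(len(vr)):
# frames = vr.get_batch(uniform_frame_indices(len(vr), 8)).asnumpy()
```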

Helper Functions:

from datasets import load_dataset

# Available domains and their datasets
DOMAINS = {
    "web_agents": [
        "android-in-the-wild",
        "mind2web",
        "monday",
        "screenspot",
        "screenspot-pro"
    ],
    "spatial_reasoning": [
        "embspatial-bench",
        "space-10",
        "3dsrbench",
        "spatialmm",
        "omnispatial"
    ],
    "game_understanding": [
        "gameqa",
        "videogamebunny",
        "but-they-are-cats-tutorial",
        "marioqa"
    ],
    "robot_manipulation": [
        "robo2vlm",
        "roborefit",
        "erqa",
        "manipbench",
        "sharerobot"
    ],
    "video_perception": [
        "perception-test",
        "vsi-bench",
        "openeqa",
        "videgothink",
        "egoplan-bench"
    ]
}

def load_domain_dataset(domain, dataset_name):
    """Load a specific dataset from a domain."""
    # Reject names that do not belong to the requested domain.
    if dataset_name not in DOMAINS[domain]:
        raise ValueError(f"{dataset_name!r} is not in domain {domain!r}")
    return load_dataset("advaitgupta/AgentVQA", dataset_name, split="train")

def load_all_domain_datasets(domain):
    """Load all datasets from a specific domain."""
    datasets = {}
    for dataset_name in DOMAINS[domain]:
        datasets[dataset_name] = load_domain_dataset(domain, dataset_name)
    return datasets

def load_all_datasets():
    """Load all datasets from all domains."""
    all_datasets = {}
    for domain in DOMAINS:
        all_datasets.update(load_all_domain_datasets(domain))
    return all_datasets

# Example usage:
ds = load_domain_dataset("video_perception", "perception-test")
video_datasets = load_all_domain_datasets("video_perception")
print(video_datasets.keys())
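
When iterating over the result of `load_all_datasets`, it can help to know which domain each dataset belongs to, for example to report per-domain scores. A minimal sketch of a reverse lookup follows; `build_dataset_to_domain` is a hypothetical helper, and the mapping here is an abbreviated copy of `DOMAINS` so the snippet runs on its own.

```python
# Abbreviated copy of the DOMAINS mapping, so the sketch is self-contained.
DOMAINS = {
    "web_agents": ["android-in-the-wild", "mind2web"],
    "video_perception": ["perception-test", "egoplan-bench"],
}

def build_dataset_to_domain(domains):
    """Invert a {domain: [dataset, ...]} mapping into {dataset: domain}."""
    return {name: domain for domain, names in domains.items() for name in names}

DATASET_TO_DOMAIN = build_dataset_to_domain(DOMAINS)
print(DATASET_TO_DOMAIN["mind2web"])
```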

Dataset Fields

Web Agents Domain

android-in-the-wild

image: Screenshot of the app
step_id: Step number in the episode
action_history: Previous actions taken
question: The question about next action
options: List of 4 possible actions
correct_answer: Index of correct option (0-3)
episode_goal: Overall goal of the episode
source_dataset: Original source dataset name

mind2web

image: Screenshot of the webpage
question: The question about next action
options: List of possible actions
answer: Index of correct option
task: Description of the task
action_history: Previous actions taken
category: Task category
website: Website name
domain: Domain category

monday

image: Screenshot of the mobile interface
goal: Goal description
question: The question about next action
options: List of possible actions
answer: Index of correct option
action_history: Previous actions taken
os: Operating system (Android/iOS)

screenspot & screenspot-pro

image: Screenshot
question: Instruction to complete
options: List of tap/click actions
answer: Index of correct option
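
Field names are not uniform across configurations: android-in-the-wild stores the label under `correct_answer`, the other web agent configurations use `answer`, and the remaining domains capitalize their field names (`Question`, `Options`, `Answer`). Code that evaluates several configurations together may want a common view. The sketch below is one way to do that; `normalize_sample` is a hypothetical helper, not part of the dataset.

```python
def normalize_sample(sample):
    """Map a sample to lowercase 'question', 'options', 'answer' keys."""
    lowered = {key.lower(): value for key, value in sample.items()}
    # android-in-the-wild stores the label under 'correct_answer'.
    if "answer" not in lowered and "correct_answer" in lowered:
        lowered["answer"] = lowered["correct_answer"]
    return {key: lowered[key] for key in ("question", "options", "answer")}

# Hand-written stand-ins for samples from two configurations.
web_sample = {"question": "q1", "options": ["a", "b"], "correct_answer": 1, "step_id": 3}
spatial_sample = {"Question": "q2", "Options": ["yes", "no"], "Answer": 0, "Category": "c"}
print(normalize_sample(web_sample))
print(normalize_sample(spatial_sample))
```

Fields other than the three shared ones (such as `step_id` or `Category`) are dropped in this sketch; keep the original sample around if you need them.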

Spatial Reasoning Domain

embspatial-bench

Image: Scene image
Question: Spatial relationship question
Options: List of 4 possible answers
Answer: Index of correct option (0-3)
Category: High-level category

space-10

Images: List of 8 scene images (multi-image reasoning)
Question: Entity presence question
Options: List of 4 descriptions
Answer: Index of correct option (0-3)
Category: Task category

3dsrbench, spatialmm, omnispatial

Image: Scene image
Question: Spatial reasoning question
Options: List of possible answers
Answer: Index of correct option
Category/Task_Type: Task categorization
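
The preview rows at the top of this card pad Options to four entries with null when a question has fewer real choices. A quick sanity check before evaluation is to confirm that Answer is an in-range index pointing at a non-null option. The sketch below assumes Answer is an integer index; `is_valid_sample` is a hypothetical helper.

```python
def is_valid_sample(options, answer):
    """Check that answer is an in-range index pointing at a non-null option."""
    return (
        isinstance(answer, int)
        and 0 <= answer < len(options)
        and options[answer] is not None
    )

print(is_valid_sample(["laptop", "car", None, None], 0))
print(is_valid_sample(["laptop", "car", None, None], 2))
```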

Game Understanding Domain

gameqa

Image: Game screenshot
Question: Game state reasoning question
Options: List of 4 possible answers
Answer: Index of correct option (0-3)
Game_Type: Type of game
Game_Name: Specific game name

videogamebunny

Image: Game screenshot
Question: Game understanding question
Options: List of 4 possible answers
Answer: Index of correct option (0-3)
Category: Question category

marioqa

Video: Super Mario gameplay video clip
Question: Event-based question about gameplay
Options: List of 4 possible answers
Answer: Correct option letter (A, B, C, or D)
Question_Type: Type of question
Event: Event type being queried
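
Unlike the other configurations listed here, marioqa stores Answer as a letter rather than a 0-based index. If you evaluate it alongside index-based configurations, converting the letter first keeps the comparison uniform. A minimal sketch, with `answer_to_index` as a hypothetical helper:

```python
def answer_to_index(answer):
    """Convert an answer given as a letter ('A'-'D') or an int to a 0-based index."""
    if isinstance(answer, int):
        return answer  # already an index
    letter = answer.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "D":
        raise ValueError(f"Unexpected answer label: {answer!r}")
    return ord(letter) - ord("A")

print(answer_to_index("C"))
```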

Robot Manipulation Domain

robo2vlm

Image: Robot scene image
Question: Task success evaluation question
Options: List of 4 possible answers
Answer: Index of correct option (0-3)
Category: Question category

roborefit

Image: Robot manipulation scene
Question: Grasping instruction
Options: List of 4 coordinate pairs
Answer: Index of correct option (0-3)

erqa

Images: List of robot scene images
Question: Embodied reasoning question
Options: List of 4 possible answers
Answer: Index of correct option (0-3)
Category: Question type

manipbench

Image: Robot manipulation scene with grid overlay
Question: Trajectory planning question
Options: List of 4 trajectory options
Answer: Index of correct option (0-3)
Category: Task category

sharerobot

Images: List of robot trajectory frames
Question: Task understanding question
Options: List of 4 possible task descriptions
Answer: Index of correct option (0-3)
Category: Task type

Video Perception & Egocentric Domain

perception-test

Video_Bytes: Raw video bytes (decode with decord)
Video_Filename: Original video filename
Question: Perception question (e.g., camera motion)
Options: List of possible answers
Answer: Index of correct option (0-3)

vsi-bench

Video_Bytes: Raw video bytes
Video_Filename: Original video filename
Question: Spatial intelligence question from egocentric viewpoint
Options: List of possible answers (directional/spatial)
Answer: Index of correct option (0-3)
Category: Question type (e.g., object_rel_direction)

openeqa

Video_Bytes: Raw video bytes
Video_Filename: Original video filename
Question: Open-ended question about the environment
Options: List of possible answers
Answer: Index of correct option (0-3)
Category: Question category (e.g., attribute recognition, world knowledge)

videgothink

Video_Bytes: Raw video bytes
Video_Filename: Original video filename
Question: Understanding question about egocentric actions
Options: List of possible answers
Answer: Index of correct option (0-3)

egoplan-bench

Video_Bytes: Raw video bytes
Video_Filename: Original video filename
Question: Planning question about task execution
Options: List of possible next actions
Answer: Index of correct option (0-3)
Task_Goal: Overall goal of the task
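
Because every question has a single correct option, accuracy is a natural way to score predictions, and configurations with a Category field allow a per-category breakdown. The following sketch is not an official evaluation protocol; `accuracy_by_category` is a hypothetical helper, and the records are made-up values for illustration.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Compute accuracy per category from (category, prediction, answer) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, prediction, answer in records:
        total[category] += 1
        correct[category] += int(prediction == answer)
    return {category: correct[category] / total[category] for category in total}

# Made-up predictions, for illustration only.
records = [
    ("height_higher", 0, 0),
    ("height_higher", 1, 0),
    ("location_above", 1, 1),
]
scores = accuracy_by_category(records)
print(scores)
```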

Citation

If you use this dataset, please cite the original papers for each component dataset.

License

MIT License

Size of downloaded dataset files: 10.8 GB
Size of the auto-converted Parquet files: 10.8 GB
Number of rows: 20,452