Datasets:
| Image (width 253-640 px) | Question (string, 87-224 chars) | Options (list of 4) | Answer (int64, 0-3) | Category (string, 12 classes) |
|---|---|---|---|---|
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"basketball basket",
"man on the skateboard",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"dragon statue",
"building",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"laptop",
"car",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"Exit\" sign",
"mirror",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"national flag",
"hat",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"laptop",
"wine glass",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"airplane",
"speedboat",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"Mexican Mobile Eating\" sign",
"woman in black suit",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"spoon",
"plate",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"horse",
"grey car",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"snap\" sign",
"tennis ball",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"chandelier",
"lamp on the shelf",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"kite",
"building",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"surfboard",
"brown building",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"cat",
"potted plant",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"3141\" number",
"infusion pump",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"sponsor flag",
"person with skateboard",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"lamp on the right",
"TV",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"flower",
"yellow umbrella",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"kites",
"streetlight",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"umbrella",
"banana sign",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"OBAMA\" sign",
"building",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"boat",
"shed",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"frisbee",
"traffic cone",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"street clock",
"red flag",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"metal container",
"faucet",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"teddy bear",
"bushes",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"clock",
"street lights",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"No parking\" sign",
"motorcycle",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"remote control",
"mirror",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"cake",
"letter decorations",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"stop sign",
"chimney",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"cable cars",
"orange flag",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"speakers",
"keyboard",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"blue forward sign",
"taxi sign",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"skateboard",
"information board",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"picture on the wall",
"person in dark blue cloth",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"person holding a laptop",
"white truck",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"cow",
"house",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"road sign",
"street light",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"person on snowboard",
"tree",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"giraffe",
"\"rainforest cafe\" sign",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"flower",
"tv",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"tree",
"tiered tower",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"fan",
"people",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"stop sign",
"white sign on wall",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"yellow source bottle",
"man in white shirt",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"glass vases",
"coffee machine",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"man in grey suit",
"traffic light",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"laptop monitor",
"orange cat",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"traffic light",
"colorful umbrella",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"traffic light",
"purple billboard",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"starbucks sign",
"white truck",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"airplane",
"crane",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"blue barrel",
"dog",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"kite",
"bushes",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"person in blue",
"street lights",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"black fan",
"faucet",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"green street lamp",
"giraffe",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"fire hydrant",
"white suv",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"yellow \"P\" sign",
"person with hotdog",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"laptop",
"microwave",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"McDonald's Sign",
"bus only sign",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"toaster",
"fridge",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"potted plant",
"bus",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"white sign with a black arrow",
"white sign with street name",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"man with red hat",
"green building",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"flu shot\" label",
"umbrella",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"ship model",
"curtain",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"boat",
"plane",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"umbrella",
"traffic light",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"green machines",
"skier holding a flag",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"black cloth",
"yellow box",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"red cup",
"bottle with green cap",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"No smoking\" sign",
"horse with white hair",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"person",
"yellow board",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"potted plant",
"red building",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"\"McDonald's\" sign",
"pedestrian traffic lights",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"round stone seat on the left",
"silver seat in the back",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"small vase",
"microwave",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"person holding poles",
"building",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"door handle",
"faucet",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"laptop",
"baby",
null,
null
] | 0 | height_higher | |
Consider the real-world 3D locations of the objects. Which object has a higher location? | [
"step stool",
"notebook",
null,
null
] | 1 | height_higher | |
Consider the real-world 3D locations of the objects. Is the pillar directly above the boat? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the white desk directly above the white drawers? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the traffic light directly above the "one-way" sign? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the umbrella directly above the chair? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the empty picture directly above the couch? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the trash bin directly above the trash bag? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the man on skateboard directly above the fire hydrant? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the desserts directly above the teapot? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the street sign directly above the stop sign? | [
"yes",
"no",
null,
null
] | 0 | location_above | |
Consider the real-world 3D locations of the objects. Is the "ATM here" sign directly above the black car? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the pink umbrella directly above the person with blue shirt? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the airplane directly above the flag? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the umbrella directly above the man on the phone? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the yellow umbrella directly above the lady? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the road sign directly above the white car? | [
"yes",
"no",
null,
null
] | 1 | location_above | |
Consider the real-world 3D locations of the objects. Is the tent directly above the man in red jacket? | [
"yes",
"no",
null,
null
] | 0 | location_above |
AgentVQA: A Multi-Domain Visual Question Answering Dataset
AgentVQA is a comprehensive dataset for training and evaluating visual agents across multiple domains, including GUI interaction, spatial reasoning, game understanding, robot manipulation, and video perception. The dataset contains multiple-choice questions based on screenshots, images, and videos.
Dataset Structure
The dataset is organized into five main domains:
1. Web Agents Domain (GUI Interaction)
- android-in-the-wild: Android app interactions (1000 samples)
- mind2web: Web navigation tasks (1000 samples, stratified by category)
- monday: Mobile OS interactions (1000 samples, stratified by OS)
- screenspot: GUI element grounding (1000 samples)
- screenspot-pro: Advanced GUI grounding (1000 samples)
2. Spatial Reasoning Domain
- embspatial-bench: Object-object and object-scene spatial relationships (1000 samples)
- space-10: Multi-image spatial reasoning with entity presence tasks (1000 samples)
- 3dsrbench: 3D spatial reasoning benchmark
- spatialmm: Spatial reasoning in multi-modal contexts
- omnispatial: Comprehensive spatial reasoning across multiple task types
3. Game Understanding Domain
- gameqa: 3D spatial perception and reasoning in games (1000 samples)
- videogamebunny: Video game understanding across multiple categories
- but-they-are-cats-tutorial: Game tutorial comprehension tasks
- marioqa: Video-based event understanding in Super Mario gameplay
4. Robot Manipulation Domain
- robo2vlm: Robot task success state evaluation (1000 samples)
- roborefit: Robot grasping and manipulation tasks (1000 samples)
- erqa: Embodied reasoning for robot question answering (400 samples)
- manipbench: Robot manipulation trajectory planning (1000 samples)
- sharerobot: Multi-frame robot task understanding (1000 samples)
5. Video Perception & Egocentric Understanding Domain
- perception-test: Camera motion and video perception tasks (1000 samples)
- vsi-bench: Video spatial intelligence with egocentric viewpoint questions (1000 samples)
- openeqa: Open-ended embodied question answering from video (1000 samples)
- videgothink: Egocentric video understanding and reasoning (1000 samples)
- egoplan-bench: Egocentric task planning and goal understanding (1000 samples)
Loading the Dataset
Load a specific dataset:
from datasets import load_dataset
# Load a web agents dataset
ds = load_dataset("advaitgupta/AgentVQA", "android-in-the-wild", split="train")
# Load a spatial reasoning dataset
ds = load_dataset("advaitgupta/AgentVQA", "embspatial-bench", split="train")
# Load a video perception dataset
ds = load_dataset("advaitgupta/AgentVQA", "perception-test", split="train")
# Load an egocentric dataset
ds = load_dataset("advaitgupta/AgentVQA", "egoplan-bench", split="train")
Loading Video Datasets (stored as bytes):
from datasets import load_dataset
from decord import VideoReader
from io import BytesIO
# Load video dataset
ds = load_dataset("advaitgupta/AgentVQA", "videgothink", split="train")
sample = ds[0]
# Access video bytes
video_bytes = sample["Video_Bytes"]
filename = sample["Video_Filename"]
# Option 1: Save to file
with open(f"/tmp/{filename}", "wb") as f:
    f.write(video_bytes)
# Option 2: Use decord for direct loading
vr = VideoReader(BytesIO(video_bytes))
frames = vr.get_batch(range(len(vr))).asnumpy()
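Decoding every frame, as Option 2 does, can use a lot of memory on long clips. A minimal sketch of one alternative is to pick a fixed number of evenly spaced frame indices and decode only those. The helper name `uniform_frame_indices` and the choice of 16 frames are illustrative assumptions, not part of the dataset's tooling.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Return up to num_samples frame indices spread evenly over a clip."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Split the clip into num_samples equal segments and take each midpoint.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical usage with the VideoReader `vr` created in the snippet above:
# indices = uniform_frame_indices(len(vr), 16)
# frames = vr.get_batch(indices).asnumpy()
```

Midpoint sampling avoids always picking the very first frame, which is often a fade-in or title card; whether that matters for these clips is untested here.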
Helper Functions:
from datasets import load_dataset
# Available domains and their datasets
DOMAINS = {
    "web_agents": [
        "android-in-the-wild",
        "mind2web",
        "monday",
        "screenspot",
        "screenspot-pro",
    ],
    "spatial_reasoning": [
        "embspatial-bench",
        "space-10",
        "3dsrbench",
        "spatialmm",
        "omnispatial",
    ],
    "game_understanding": [
        "gameqa",
        "videogamebunny",
        "but-they-are-cats-tutorial",
        "marioqa",
    ],
    "robot_manipulation": [
        "robo2vlm",
        "roborefit",
        "erqa",
        "manipbench",
        "sharerobot",
    ],
    "video_perception": [
        "perception-test",
        "vsi-bench",
        "openeqa",
        "videgothink",
        "egoplan-bench",
    ],
}
def load_domain_dataset(domain, dataset_name):
    """Load a specific dataset from a domain."""
    return load_dataset("advaitgupta/AgentVQA", dataset_name, split="train")
def load_all_domain_datasets(domain):
    """Load all datasets from a specific domain."""
    datasets = {}
    for dataset_name in DOMAINS[domain]:
        datasets[dataset_name] = load_dataset("advaitgupta/AgentVQA", dataset_name, split="train")
    return datasets
def load_all_datasets():
    """Load all datasets from all domains."""
    all_datasets = {}
    for domain in DOMAINS:
        all_datasets.update(load_all_domain_datasets(domain))
    return all_datasets
# Example usage:
ds = load_domain_dataset("video_perception", "perception-test")
video_datasets = load_all_domain_datasets("video_perception")
print(video_datasets.keys())
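The `load_domain_dataset` helper above accepts a `domain` argument but does not check it against the dataset name. The following sketch shows one way such a check and a reverse lookup could look. It repeats the `DOMAINS` mapping so that it runs on its own, and the function names `domain_of` and `check_domain_dataset` are hypothetical additions, not part of the original helpers.

```python
# Copy of the DOMAINS mapping from the helper functions above.
DOMAINS = {
    "web_agents": ["android-in-the-wild", "mind2web", "monday",
                   "screenspot", "screenspot-pro"],
    "spatial_reasoning": ["embspatial-bench", "space-10", "3dsrbench",
                          "spatialmm", "omnispatial"],
    "game_understanding": ["gameqa", "videogamebunny",
                           "but-they-are-cats-tutorial", "marioqa"],
    "robot_manipulation": ["robo2vlm", "roborefit", "erqa",
                           "manipbench", "sharerobot"],
    "video_perception": ["perception-test", "vsi-bench", "openeqa",
                         "videgothink", "egoplan-bench"],
}


def domain_of(dataset_name):
    """Return the domain key that lists dataset_name."""
    for domain, names in DOMAINS.items():
        if dataset_name in names:
            return domain
    raise ValueError(f"Unknown dataset: {dataset_name!r}")


def check_domain_dataset(domain, dataset_name):
    """Raise ValueError unless dataset_name is listed under domain."""
    if domain not in DOMAINS:
        raise ValueError(f"Unknown domain: {domain!r}")
    if dataset_name not in DOMAINS[domain]:
        raise ValueError(f"{dataset_name!r} is not listed under {domain!r}")
    return dataset_name
```

Calling `check_domain_dataset(domain, dataset_name)` before `load_dataset` would surface a typo in either argument as an immediate, readable error instead of a failed download.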
Dataset Fields
Web Agents Domain
android-in-the-wild
- image: Screenshot of the app
- step_id: Step number in the episode
- action_history: Previous actions taken
- question: The question about next action
- options: List of 4 possible actions
- correct_answer: Index of correct option (0-3)
- episode_goal: Overall goal of the episode
- source_dataset: Original source dataset name
mind2web
- image: Screenshot of the webpage
- question: The question about next action
- options: List of possible actions
- answer: Index of correct option
- task: Description of the task
- action_history: Previous actions taken
- category: Task category
- website: Website name
- domain: Domain category
monday
- image: Screenshot of the mobile interface
- goal: Goal description
- question: The question about next action
- options: List of possible actions
- answer: Index of correct option
- action_history: Previous actions taken
- os: Operating system (Android/iOS)
screenspot & screenspot-pro
- image: Screenshot
- question: Instruction to complete
- options: List of tap/click actions
- answer: Index of correct option
Spatial Reasoning Domain
embspatial-bench
- Image: Scene image
- Question: Spatial relationship question
- Options: List of 4 possible answers
- Answer: Index of correct option (0-3)
- Category: High-level category
space-10
- Images: List of 8 scene images (multi-image reasoning)
- Question: Entity presence question
- Options: List of 4 descriptions
- Answer: Index of correct option (0-3)
- Category: Task category
3dsrbench, spatialmm, omnispatial
- Image: Scene image
- Question: Spatial reasoning question
- Options: List of possible answers
- Answer: Index of correct option
- Category/Task_Type: Task categorization
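In the preview rows at the top of this page, two-choice questions carry `null` in the unused Options slots, so the list always has four entries. A minimal sketch of turning such a row into a text prompt, assuming `null` arrives as Python `None` after loading, could drop the padding first. The function name `format_prompt` and the prompt layout are illustrative choices, not something the dataset prescribes.

```python
def format_prompt(question, options):
    """Build a numbered multiple-choice prompt, skipping None padding."""
    # Keep only real choices; padded slots are assumed to load as None.
    choices = [opt for opt in options if opt is not None]
    lines = [question]
    for index, option in enumerate(choices):
        # Zero-based labels line up with the integer Answer field.
        lines.append(f"{index}. {option}")
    return "\n".join(lines)


prompt = format_prompt(
    "Which object has a higher location?",
    ["laptop", "car", None, None],
)
print(prompt)
```

This relies on the padding sitting at the end of the list, as it does in every preview row; if a `None` appeared before a real option, the printed labels would no longer match the stored answer index.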
Game Understanding Domain
gameqa
- Image: Game screenshot
- Question: Game state reasoning question
- Options: List of 4 possible answers
- Answer: Index of correct option (0-3)
- Game_Type: Type of game
- Game_Name: Specific game name
videogamebunny
- Image: Game screenshot
- Question: Game understanding question
- Options: List of 4 possible answers
- Answer: Index of correct option (0-3)
- Category: Question category
marioqa
- Video: Super Mario gameplay video clip
- Question: Event-based question about gameplay
- Options: List of 4 possible answers
- Answer: Correct option letter (A, B, C, or D)
- Question_Type: Type of question
- Event: Event type being queried
Robot Manipulation Domain
robo2vlm
- Image: Robot scene image
- Question: Task success evaluation question
- Options: List of 4 possible answers
- Answer: Index of correct option (0-3)
- Category: Question category
roborefit
- Image: Robot manipulation scene
- Question: Grasping instruction
- Options: List of 4 coordinate pairs
- Answer: Index of correct option (0-3)
erqa
- Images: List of robot scene images
- Question: Embodied reasoning question
- Options: List of 4 possible answers
- Answer: Index of correct option (0-3)
- Category: Question type
manipbench
- Image: Robot manipulation scene with grid overlay
- Question: Trajectory planning question
- Options: List of 4 trajectory options
- Answer: Index of correct option (0-3)
- Category: Task category
sharerobot
- Images: List of robot trajectory frames
- Question: Task understanding question
- Options: List of 4 possible task descriptions
- Answer: Index of correct option (0-3)
- Category: Task type
Video Perception & Egocentric Domain
perception-test
- Video_Bytes: Raw video bytes (decode with decord)
- Video_Filename: Original video filename
- Question: Perception question (e.g., camera motion)
- Options: List of possible answers
- Answer: Index of correct option (0-3)
vsi-bench
- Video_Bytes: Raw video bytes
- Video_Filename: Original video filename
- Question: Spatial intelligence question from egocentric viewpoint
- Options: List of possible answers (directional/spatial)
- Answer: Index of correct option (0-3)
- Category: Question type (e.g., object_rel_direction)
openeqa
- Video_Bytes: Raw video bytes
- Video_Filename: Original video filename
- Question: Open-ended question about the environment
- Options: List of possible answers
- Answer: Index of correct option (0-3)
- Category: Question category (e.g., attribute recognition, world knowledge)
videgothink
- Video_Bytes: Raw video bytes
- Video_Filename: Original video filename
- Question: Understanding question about egocentric actions
- Options: List of possible answers
- Answer: Index of correct option (0-3)
egoplan-bench
- Video_Bytes: Raw video bytes
- Video_Filename: Original video filename
- Question: Planning question about task execution
- Options: List of possible next actions
- Answer: Index of correct option (0-3)
- Task_Goal: Overall goal of the task
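The field lists above store the correct choice under different names (`Answer`, `answer`, `correct_answer`), and marioqa records a letter rather than an index. A small normalizer can make evaluation code uniform across configs. The sketch below assumes that the letters A to D map to indices 0 to 3 in order; that mapping, and the helper names `answer_index` and `accuracy`, are assumptions rather than documented behavior.

```python
# Answer field names as listed in the field descriptions above.
ANSWER_KEYS = ("Answer", "answer", "correct_answer")


def answer_index(sample):
    """Return the correct option as a zero-based integer index."""
    for key in ANSWER_KEYS:
        if key in sample:
            value = sample[key]
            break
    else:
        raise KeyError("sample has no recognised answer field")
    if isinstance(value, str):
        value = value.strip()
        if value.isdigit():
            return int(value)
        # Assumption: letters A-D correspond to option indices 0-3.
        if len(value) == 1 and value.upper() in "ABCD":
            return "ABCD".index(value.upper())
        raise ValueError(f"Unrecognised answer value: {value!r}")
    return int(value)


def accuracy(samples, predictions):
    """Fraction of samples whose predicted index matches the answer."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions differ in length")
    if not samples:
        return 0.0
    correct = sum(answer_index(s) == p for s, p in zip(samples, predictions))
    return correct / len(samples)
```

For example, `accuracy(rows, preds)` accepts any mix of rows from the configs above as plain dicts, provided `preds` holds zero-based option indices.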
Citation
If you use this dataset, please cite the original papers for each component dataset.
License
MIT License