This paper investigates multimodal agents, in particular, OpenAI’s
Computer-User Agent (CUA), trained to control and complete tasks through a
standard computer interface, similar to humans. We evaluated the agent’s
performance on the New York Times Wordle game to elicit model behaviors and
identify shortcomings. Our findings revealed a significant discrepancy in the
model’s ability to recognize colors correctly depending on the context. The
model had a $5.36\%$ success rate over several hundred runs across a week of
Wordle. Despite the immense enthusiasm surrounding AI agents and their
potential to usher in Artificial General Intelligence (AGI), our findings
reinforce the fact that even simple tasks present substantial challenges for
today’s frontier AI models. We conclude with a discussion of the potential
underlying causes, implications for future development, and research directions
to improve these AI systems.
Dieser Artikel untersucht Zeitreisen und deren Auswirkungen.
PDF herunterladen:
2504.15434v1