New AGI test stumps leading AI models

techcrunch.com

The Arc Prize Foundation has introduced ARC-AGI-2, a new test for evaluating artificial general intelligence (AGI). Leading AI models, including those from Anthropic, Google, and DeepSeek, have performed poorly on it. The test measures how well AI can identify visual patterns and solve problems it has never encountered before.

Initial scores show "reasoning" AI models such as OpenAI's o1-pro and DeepSeek's R1 scoring between 1% and 1.3%. Other powerful models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, also landed around 1%. To establish a human baseline, more than 400 people took the test; they averaged a correct response rate of about 60%, far ahead of the AI models.

François Chollet, co-founder of the Arc Prize Foundation and a noted AI researcher, believes the new test measures intelligence more effectively than its predecessor, ARC-AGI-1. ARC-AGI-2 is designed so that AI systems cannot rely on brute-force computing power alone to find solutions: it requires them to interpret new patterns on the fly, and it emphasizes efficiency in problem-solving. Greg Kamradt, another co-founder, highlighted that intelligence involves not just solving tasks but solving them efficiently. OpenAI's o3 model, which scored highly on the older test, managed only 4% on the new one despite using significant computational resources.

ARC-AGI-2 arrives as experts across the tech industry call for fresh benchmarks that assess AI's true capabilities, including traits like creativity. Alongside the test, the Arc Prize Foundation has launched a contest for 2025, challenging developers to achieve 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task.




