Stanford develops tool to evaluate health AI models
Stanford researchers have created a new tool to assess how effectively AI language models handle routine health care tasks, aiming to provide a clearer picture of AI performance in clinical settings.

Concerns have been raised about the reliability of AI in health care because many models have demonstrated success only on knowledge tests. A recent study found that GPT-4 had a 35% error rate when its answers to physician queries were compared against human responses.

The new evaluation tool seeks to address these concerns by focusing on practical applications rather than theoretical knowledge, a shift intended to ensure that AI performs effectively in real-world health care scenarios.