OpenAI has unveiled HealthBench, a new benchmark designed to assess AI models in healthcare using real-world applicability and physician judgment. The benchmark includes 5,000 simulated health conversations, each evaluated against physician-created rubrics to ensure accuracy and relevance.
Developed in collaboration with 262 physicians across 60 countries, HealthBench spans 49 languages and 26 medical specialties. Each model response is graded based on criteria such as expertise-tailored communication, emergency referrals, and response depth, with evaluations conducted using GPT-4.1.
“Our findings show that large language models have improved significantly and already outperform experts in writing responses,” OpenAI stated. However, the company noted room for improvement, particularly in context-seeking and reliability.
HealthBench is now publicly available on GitHub, aligning with OpenAI’s broader efforts to advance AI in high-impact fields like healthcare.
The Larger Trend: Project Stargate Faces Delays
The announcement comes as Project Stargate, a $500 billion AI infrastructure initiative involving OpenAI CEO Sam Altman, Oracle’s Larry Ellison, and SoftBank’s Masayoshi Son, encounters setbacks.
Originally touted as a potential game-changer for healthcare—including AI-driven cancer vaccines—the project is now facing delays due to economic uncertainty and U.S. tariffs on critical tech components. SoftBank’s pledged $100 billion investment has yet to materialize, with financing discussions still pending.
Despite challenges, OpenAI remains focused on AI-driven healthcare innovation, with HealthBench serving as a key step toward ensuring real-world benefits.






Leave a comment