OpenAI Launches ChatGPT Health, Trained with Physician Oversight and HealthBench Evaluation
OpenAI has introduced a new dedicated health interface within ChatGPT, ChatGPT Health, designed to personalize health-related interactions using user-linked medical records and wellness data.
The system features a compartmentalized architecture, separate memory, and layered privacy protections. Central to the product is a model said to be refined through extensive feedback from medical practitioners, with training and evaluation tailored to clinical relevance.

Image credit: OpenAI
The underlying model powering ChatGPT Health was developed in close collaboration with over 260 physicians across 60 countries and multiple specialties. These clinicians contributed over 600,000 feedback instances across 30 focus areas, helping shape both model behavior and safety thresholds. This clinical feedback loop guided decisions on when to recommend professional care escalation and how to deliver information clearly while avoiding oversimplification.
To evaluate performance, OpenAI created HealthBench, a domain-specific open-source benchmark. HealthBench uses physician-written rubrics grounded in clinical practice and consists of 5,000 multilingual conversations and over 48,000 expert-defined evaluation criteria. The metrics assess safety, appropriateness of guidance, clarity of explanation, and contextual relevance. The framework is designed to reflect how clinicians judge the usefulness of medical information in real-world settings.
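To make the idea of rubric-based grading concrete, here is a minimal Python sketch of how a single response might be scored against physician-written criteria. The scoring rule (points earned divided by the maximum positive points, clipped to the range 0 to 1) is an assumption for illustration and is not described in this article; the `Criterion` class, the `rubric_score` function, and the sample criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line. Hypothetical structure, not HealthBench's actual schema."""
    text: str
    points: int  # positive = desirable behavior, negative = penalized behavior


def rubric_score(criteria, met):
    """Score one response: earned points / max positive points, clipped to [0, 1].

    `met[i]` is True when a grader judged that the response satisfies criteria[i].
    The normalization rule is an assumption made for this sketch.
    """
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return min(1.0, max(0.0, earned / max_points))


# Invented example rubric for a chest-pain conversation.
example_rubric = [
    Criterion("Advises seeking urgent care for chest pain", 8),
    Criterion("Explains reasoning in plain language", 4),
    Criterion("Recommends a specific dose without patient context", -6),
]

ideal = rubric_score(example_rubric, [True, True, False])
mixed = rubric_score(example_rubric, [True, False, True])
unsafe = rubric_score(example_rubric, [False, False, True])
```

In this sketch, a response that meets both positive criteria and avoids the penalized one scores the maximum, while a response that triggers only the negative criterion is clipped to zero rather than going negative.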
Operationally, ChatGPT Health grounds its responses in a personalized data layer that users can populate by connecting medical records or third-party health apps such as Apple Health, MyFitnessPal, and Function. Data from these sources remains siloed within the Health interface and cannot be accessed or inferred in non-Health conversations. The information is reportedly not used for model training, and Health-specific data is encrypted in transit and at rest, with added isolation layers.
The model also draws on context from outside the Health space in a controlled, one-directional manner: non-sensitive inputs (such as a lifestyle change) may inform a health-related response, but health information is not used outside the Health section.
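The one-directional flow described above can be pictured as a simple visibility rule. The following Python sketch is purely illustrative and does not reflect OpenAI's implementation, which the article does not detail; the `Space` enum, the `ContextItem` class, the `sensitive` flag, and the `visible_context` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Space(Enum):
    GENERAL = "general"
    HEALTH = "health"


@dataclass(frozen=True)
class ContextItem:
    """A remembered piece of context. Hypothetical structure for illustration."""
    text: str
    origin: Space
    sensitive: bool = False


def visible_context(items, requesting_space):
    """Return the items a conversation in `requesting_space` may read.

    Assumed rule: Health conversations see Health items plus non-sensitive
    general items; general conversations never see Health items.
    """
    if requesting_space is Space.HEALTH:
        return [i for i in items if i.origin is Space.HEALTH or not i.sensitive]
    return [i for i in items if i.origin is Space.GENERAL]


# Invented sample data.
memory = [
    ContextItem("Recently started running three times a week", Space.GENERAL),
    ContextItem("Latest lab panel shows elevated LDL", Space.HEALTH, sensitive=True),
]

health_view = visible_context(memory, Space.HEALTH)
general_view = visible_context(memory, Space.GENERAL)
```

Here the Health view contains both items, while the general view contains only the lifestyle note, mirroring the asymmetry the article describes.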
Access is currently limited to early users in eligible regions who join a waitlist, with expansion planned across platforms. Axios framed the rollout as a response to the already high volume of health-related use in ChatGPT, citing 40 million daily queries on medical and insurance topics even before this launch.
This launch coincides with OpenAI’s broader healthcare efforts, like its joint development of Muse with Formation Bio and Sanofi announced in 2024. Muse is an AI tool for clinical trial recruitment, using OpenAI models to generate tailored strategies and materials across indications. It automates profiling, literature review, and compliant content creation, and is now being used in Sanofi’s Phase 3 trials for multiple sclerosis. Other collaborations include Moderna’s deployment of ChatGPT Enterprise, Thermo Fisher Scientific and Lundbeck integrating OpenAI tools into drug development workflows, and Eli Lilly partnering with OpenAI on generative AI for novel therapeutics.
Among OpenAI’s more notable moves in biotech is its backing of Chai Discovery, now valued at $1.3B after raising $130M last month. The company’s latest AI model is Chai-2, made for full-length monoclonal antibody design at atomic resolution. OpenAI has also worked with Retro Biosciences to engineer new Yamanaka factors—transcription factors that reprogram adult cells into induced pluripotent stem cells. Using a protein-specialized GPT-4b micro model, the team designed variants that reportedly boosted pluripotency markers more than 50-fold, accelerated reprogramming, and improved genomic stability in aging cells.
OpenAI is not the only general-purpose AI vendor embedding itself in life sciences: Anthropic (Claude for Life Sciences), Google (Med-PaLM and Med-Gemini), and Microsoft (deploying tools like MAI-DxO, its multi-agent diagnostic system) are also moving into clinical and research settings. For a closer look at how foundation models are being adapted for biomedical work, and how neurosymbolic methods and agents are emerging to address their limitations, see our deep dives: New LLMs, Agents, and Graphs in Life Sciences and Can AI Diagnose Better Than Humans?