Contents
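1. Introduction
Recently, LLMs have shown improved performance across a range of medical applications. In particular, models such as MedPalm and GPT-4 achieve expert-level performance on benchmarks such as MedQA, MMLU, and the USMLE. However, it remains unclear whether these evaluations hold up in real, complex clinical environments. In particular, the effectiveness of LLMs in handling the varied medical-record tasks that clinicians face needs to be validated.
Against this background, this paper introduces MedAlign, a new benchmark dataset for evaluating the effectiveness of LLMs in medicine. The dataset is designed to evaluate whether LLMs can effectively process EHRs associated with clinical instructions.
2. Background and Related Work
As the volume of medical data grows rapidly, clinicians increasingly need to manage ever more complex information efficiently. In this setting, LLMs could improve the usability of medical interfaces and reduce clinician burden. However, LLM evaluation has largely been confined to simple medical question answering or to specific medical specialties. This study therefore raises the need for a benchmark dataset that covers the diverse types of medical instructions that can arise in real clinical settings.
3. Dataset Creation Process
3.1 Dataset Design
The MedAlign dataset contains a variety of clinical instructions that clinicians could actually use. The dataset pairs each clinical instruction with an EHR suited to that instruction. Information retrieval techniques such as BM25 are used to match instructions to relevant EHRs automatically.
3.2 Data Collection and Processing
Data collection was conducted online, with clinicians from a range of specialties participating. Collected instructions are post-processed and paired with EHRs, and these pairs are then used to generate and evaluate LLM responses.
4. Dataset Description and Evaluation
4.1 Dataset Characteristics
The MedAlign dataset contains 983 clinical instructions, which are used to evaluate LLM performance across diverse clinical situations. Each instruction is linked to a specific EHR, making it possible to evaluate how well LLMs understand and handle real clinical instructions.
4.2 Evaluating LLM Performance
LLM responses are evaluated by clinical experts, which allows model performance to be measured objectively. Evaluation is based on accuracy, relevance, and clinical appropriateness. In addition, automated evaluation metrics are used to standardize the evaluation process and improve its efficiency.
5. Conclusion
The MedAlign dataset helps evaluate whether LLMs can be useful in real healthcare settings. With this dataset, we aimed to objectively evaluate the ability of LLMs to handle clinical instructions.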
Figure 1: In MedAlign, patient EHRs are transformed into XML markup (example provided in Figure S4) and paired with clinician-generated instructions using a retrieval-based (BM25) scoring metric. The resulting set of instruction + EHR pairs is then reviewed by clinicians to write gold responses, which are used to evaluate EHR instruction following in large language models.
Large language models (LLMs) have revolutionized natural language processing in tasks such as reading comprehension, reasoning, and language generation [52], prompting researchers to explore applications in healthcare [36]. Recent LLMs like MedPalm [34] and GPT-4 [24] have demonstrated expert-level performance on medical question-answering benchmarks including MedQA [14], MMLU [12], and the USMLE [16]. However, these benchmarks employ multiple-choice, exam-style evaluations where question stems summarize key information and a single answer choice is best. It is not known whether performance on these tasks will translate when a model is deployed in complex clinical environments.
To be useful, LLMs need to perform well on the specific information-related tasks that clinicians currently complete themselves while caring for patients. These tasks are a significant burden on clinicians, who spend 45% of their day interacting with computers instead of patients [39] and 10 hours a week generating documentation [11], in part contributing to professional burnout [21]. Examples of these tasks include summarizing a patient’s asthma treatment history from different specialists the patient has visited, generating a differential diagnosis based on partially resulted laboratory data, or searching through the clinical notes for mentions of a patient’s family support system in order to create the best plan for the patient’s hospital discharge (see Table 2). Such tasks could be passed as instructions to an LLM in the form of questions or imperatives (e.g., “Write a discharge summary”) grounded in a patient’s Electronic Health Record (EHR, an electronic representation of a patient’s medical history). However, despite the excitement about the potential of LLMs to transform the practice of medicine, evaluations to date have not authentically represented the variety of tasks and idiosyncrasies of EHR data that clinicians face in the real world.
Given the recent emergence of instruction-following capabilities in LLMs [43], there is potential for LLMs to ameliorate such administrative burden. Hand-curated exemplars of instructions and responses have been critical to improve performance of models [6], especially on clinical reasoning and knowledge recall tasks in the healthcare domain [34]. Thus, a high quality dataset of instruction-EHR-response tuples that represents the breadth of clinical tasks is essential not only as a shared benchmark, but potentially to accelerate the training of specialized LLMs for healthcare [32].
However, building such a dataset requires an extraordinary effort from a multidisciplinary collaboration. In particular, generating an instruction-following benchmark dataset with representative EHR-based tasks and expert responses is challenging due to the substantial cost and logistical complexity of clinician review. There is a need for an EHR dataset that (1) contains a diverse set of questions and instructions generated by practicing clinicians; (2) pairs these queries with EHRs from both inpatient and ambulatory care settings; (3) leverages both structured and unstructured data from the longitudinal EHR; and (4) is available to the broader academic community.
In light of these challenges and opportunities, we present three contributions:
The volume of patient care data is growing exponentially, with a compound annual growth rate approaching 36% [7]. Utilizing LLMs to more efficiently interact with patient data holds great potential to help clinicians manage increasingly complicated information needs and circumvent low-usability EHR interfaces [19]. However, evaluation of LLMs to improve meaningful outcomes like clinician burnout or patient health has been inadequately studied, mainly due to benchmark datasets which do not represent true clinician needs [13], narrowly focus on a specific medical specialty or subset of EHR data [17], and/or are overly simplistic due to templated question construction [27, 48]. These works highlight the challenges in collecting high-quality clinician-generated questions and answers; we consider each in turn.
Questions and instructions in an EHR-based benchmark dataset should be paired with relevant patient EHRs. In order to ensure relevancy, prior works have provided clinicians with specific patient EHRs and asked them to generate questions based on those patients’ data [17]. Unfortunately, requiring EHRs as context for question generation limits scalability, as medical institutions restrict access to patient data to preserve patient privacy. Pampari et al. [27] attempted to overcome these scalability issues by generating questions via a template-based approach, but this led to issues with question quality and diversity [48]. Our method of soliciting clinician-generated instructions without a specific patient’s EHR as context overcomes these scaling issues, albeit at the cost of potentially less relevant instruction-to-EHR pairings (we discuss our approach to addressing this problem in the Dataset Curation section).
Beyond generating questions, generating expert answers at scale is also prohibitively difficult. Reviewing an EHR to answer patient-specific queries can take 30+ minutes for a single patient [33]. This excludes any time required to generate a response to the query. Prior works have attempted to overcome the bottleneck of generating responses by extracting answers verbatim from individual clinical notes or discharge summaries [35, 25, 9]. However, many clinical tasks require synthesizing information from both structured data and multiple free-text documents to arrive at an adequate response, an aspect not explored in existing EHR QA datasets. In such cases, answers extracted from a single note in the patient’s record may not be adequate; free-text generation is required. While there is at least one example of an EHR-based question answering dataset in the literature that includes both structured and unstructured data [30], it neither contains free-text responses nor is publicly available. Finally, all of the aforementioned datasets focus on simple question answering (i.e., providing concise, factoid-style answers) rather than general instruction following, which often requires executing a series of complex directives and commands to accomplish tasks. To the best of our knowledge, there does not exist any EHR-based benchmark dataset that incorporates instruction following.
The significant costs of clinician review present barriers not only for de novo dataset generation, but also for reliable evaluation of new methods on existing datasets. Automated metrics for evaluating Natural Language Generation (NLG) systems have shown moderate to high correlation with human judgments on tasks like machine translation [10], but it is unclear whether these findings extend to other domains and tasks. While there is precedent [17] for applying automated metrics like BLEU [28], ROUGE-L [18], METEOR [1], and BERTScore [50] to NLG tasks in the clinical domain, there is comparatively little work assessing correspondence between these metrics and human judgment on clinical NLG tasks. Thus, not only do we have a poor understanding of how LLMs perform on EHR-based instruction-following tasks, but we also do not know whether it is possible to reliably automate such evaluations. Automation could substantially reduce the “barrier to entry” for research teams with limited resources.
Electronic Health Records (EHRs) EHR systems are software for managing patient medical record data. From a clinician’s view, a patient EHR is accessed via a graphical user interface that provides access to data elements associated with medical care, e.g., medication lists and treatment plans. These data are stored as a collection of timestamped structured (tabular) and unstructured (text) events, which when ordered by time form a patient’s longitudinal EHR timeline. Our EHR data is represented using the OMOP CDM [42], a standardized schema for exchanging medical data, translated into a single, XML markup document per record (example provided in Figure S4) to enable simple data exploration via an XML viewer. Figure 1 outlines the workflow for building MedAlign including (1) pairing clinician-generated instructions with patient EHR markup, and (2) evaluating language model responses against gold responses written by clinicians.
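As a rough illustration of the representation described above, the following sketch orders a few timestamped events and serializes them into a single XML document. The element names, attribute names, and event fields are hypothetical and chosen only for this example; they are not the OMOP CDM schema or the markup shown in Figure S4.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical events: field names are illustrative, not the OMOP CDM schema.
events = [
    {"time": "2021-03-02T09:30:00", "kind": "note", "value": "Patient reports wheezing at night."},
    {"time": "2020-11-15T14:00:00", "kind": "condition", "value": "Asthma"},
    {"time": "2021-03-02T10:05:00", "kind": "medication", "value": "Albuterol inhaler"},
]


def ehr_to_xml(patient_id, events):
    """Order structured and unstructured events by time and emit one XML string."""
    root = ET.Element("ehr", {"patient_id": patient_id})
    for ev in sorted(events, key=lambda e: datetime.fromisoformat(e["time"])):
        node = ET.SubElement(root, "event", {"time": ev["time"], "kind": ev["kind"]})
        node.text = ev["value"]
    return ET.tostring(root, encoding="unicode")


xml_doc = ehr_to_xml("p001", events)
print(xml_doc)
```

Keeping one document per record means the whole longitudinal timeline can be opened in any XML viewer, which is the usability benefit the text points to.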
Collection Protocol Reviewing patient medical data requires adhering to strict security protocols to protect patient privacy and prevent protected health information (PHI) leaks. This motivated our 3-stage curation process: (1) online instruction collection from clinicians; (2) instruction-EHR matching; and (3) response generation. Note we deliberately decouple instruction collection from response generation. This enables sampling a larger set of instructions from a more diverse set of clinician specialities while minimizing exposure to patient data. However, this approach requires defining a matching function to pair instructions with relevant patient EHRs, a process which may generate errors due to irrelevant instruction-EHR pairings. We discuss the performance of a retrieval-based matching system below.
Stage 1: Collecting Instructions Clinicians were recruited in our academic medical center via email. Through the use of an online form, clinicians were asked to submit instructions as posed to a hypothetical AI assistant designed to facilitate EHR-based tasks. Participants were instructed to envision a clinical vignette typical of their daily practice and to formulate an instruction that the AI could perform to make their work easier, faster, and less stressful. For each instruction, participants were asked to provide metadata to assist in matching the instruction to a patient, including pertinent clinical characteristics and the clinical context where the instruction could be used, e.g., “when deciding whether to use contrast in a CT scan”. See Appendix C for all collected fields.
Stage 2: Instruction-EHR matching All submitted instructions include metadata information on their intended clinical context and target patient population. We used instructions tagged “applicable to patients generally” to maximize their relevance in EHR matching. We evaluated two methods for matching instructions with EHRs: (1) a simple baseline based on uniform random sampling; and (2) a retrieval-based method using BM25Okapi [41].
For the retrieval approach, we concatenated every instruction with its corresponding patient characteristics and clinical context to construct a search query. We used this query to retrieve the 5 most relevant EHRs within a randomly selected subsample of 77200 patients from our hospital database. This same subsample was used to match patients for our baseline uniform random sample. After matching, the authors conducted a manual review to assess binary relevance of all generated instruction-EHR pairs.
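The retrieval step can be sketched as follows. This is a minimal, self-contained Okapi BM25 scorer written for illustration, not the BM25Okapi implementation cited above: the whitespace tokenizer, the `k1` and `b` defaults, and the `log(1 + ...)` form of the inverse document frequency are assumptions, and the toy "EHR" strings are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=5):
    """Return indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


# The query concatenates the instruction with its metadata fields.
instruction = "Summarize the asthma treatment history"
characteristics = "adult with asthma"
context = "pulmonology clinic visit"
query = " ".join([instruction, characteristics, context])

ehrs = [
    "asthma diagnosed 2015 albuterol inhaler prescribed pulmonology follow up",
    "fracture of left wrist cast applied orthopedics",
    "type 2 diabetes metformin started endocrinology",
]
ranked = top_k(query, ehrs, k=2)
print(ranked)
```

In the study the candidate pool was the patient subsample and the five top-ranked records were kept; the toy pool here only shows the mechanics.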
Stage 3: Instruction Response Generation For this stage, clinicians were tasked with reviewing the instruction and associated EHR data, then writing a response to that instruction. Whenever feasible, instructions were assigned to clinicians within the same specialty as the original submitter but not the original submitter themselves. In cases where this was not possible, the instruction was randomly assigned to a clinician, in any specialty, that did not submit the instruction. Clinicians were asked whether the instruction could be feasibly applied to the patient in the EHR (e.g., not asking about smoking history in an infant) and if the EHR contained all necessary information to answer the instruction. They then manually generated an expert response to the instruction. This response was intended to be brief and clinically relevant, drawing on any information available in the supplied EHR, as well as any appropriate external references. The most recent timestamp in the EHR was designated as the “time anchor”, meaning the response was written as if the instruction had been posed at that point in time.
Table 2: MedAlign instruction categories and example instructions.
Instructions Collected A total of 15 clinicians submitted instructions during the data collection process. These medical practitioners represented 7 distinct specialties, which included Internal Medicine (492 instructions submitted), Neurology (320), Radiology (402), Cardiology (71), Oncology (14), Surgery (12), and Primary Care (3). Clinicians provided a varying number of instructions ranging from 1 to 278 with a mean of 87 instructions per clinician (see Figure S3). From the 1314 instructions collected, 455 were marked as applicable to patients generally and 859 were relevant only to patients with specific clinical characteristics. We removed near-identical instructions (defined by a ROUGE-L similarity above 0.7), yielding 983 instructions of which 407 were marked as applicable to patients generally.
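The near-duplicate filter can be sketched as below. The text specifies only the 0.7 ROUGE-L threshold, so the F-measure variant, the whitespace tokenization, and the greedy keep-first strategy are all assumptions made for this example, and the sample instructions are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(a, b):
    """ROUGE-L F-measure between two strings (assumed variant)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def deduplicate(instructions, threshold=0.7):
    """Greedily keep an instruction only if it is not too similar to any kept one."""
    kept = []
    for text in instructions:
        if all(rouge_l(text, k) <= threshold for k in kept):
            kept.append(text)
    return kept


instructions = [
    "Summarize the patient's asthma treatment history",
    "Summarize this patient's asthma treatment history",
    "List all current medications",
]
unique = deduplicate(instructions)
print(unique)
```

The first two instructions share five of six tokens in order, giving a ROUGE-L score of about 0.83, so the second is dropped.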
Table 3: Human evaluation of LLM responses. Context: The model’s context length, using its native tokenizer. Correct: The percentage of model responses deemed correct by clinicians. WR: Average win rate marginalizing over model pairings. Rank: Empirical mean of human-assigned rankings. †With multi-step refinement the effective context length is infinite, as the model observes the entire EHR albeit in small chunks at a time. ∗For GPT-4 (2k) we used the GPT-4 32k models from OpenAI but restricted its context length using the Vicuña-native tokenizer for direct comparison.
Instruction-EHR Matches Based on evaluation by the authors, for 240 (59%) of the instructions applicable to “patients in general” the first record retrieved by BM25 was relevant. For 303 instructions (74%), at least one of the top 5 EHRs returned by BM25 was relevant. In contrast, only 38% of EHRs retrieved via uniform random sampling were deemed relevant.
Instruction Taxonomy To better understand higher-level themes within the instructions submitted, a practicing clinician developed a taxonomy of instructions. This taxonomy, described in detail in Table S2, includes 6 categories spanning 20 subcategories. We summarize the distribution of instruction categories across the set of all instructions submitted and those that received responses from a clinician in Table 2.
LLM Selection We evaluated six distinct LLMs, chosen to capture both state-of-the-art, closed-source LLM capabilities available to consumers via an API as well as smaller, open-source and user-modifiable LLMs with more lenient commercial licensing (e.g., MosaicML’s MPT-7B-Instruct model). Additionally, we designed our experiments to directly evaluate the impact of model parameters and context length.
For a state-of-the-art LLM, we selected GPT-4 (through Microsoft’s Azure OpenAI HIPAA compliant gpt-4-32k-0301 API) due to its state-of-the-art performance on various medical tasks, its long 32k context length, and its availability to researchers and clinics. However, despite this context length, it proved insufficient for accommodating full EHRs (more than 80% of EHRs in MedAlign contain more than 32k tokens, see Table S5). To address this limitation, we explored a multi-step refinement (MR) approach [38] to maximize effective context length. In this approach, the EHR is divided into “chunks” designed to be as big as possible (30k tokens, without concern for maintaining valid XML structure) while still fitting within the model’s context length. A response to the instruction is generated using the chronologically first/earliest EHR “chunk” as context, then the second “chunk” is given to the model and the model is instructed to update its response if appropriate or maintain the same response otherwise, and so on, until the entire EHR has been fed through the model. We acknowledge the potential effectiveness of other methods, such as Retrieval Augmented Generation (RAG), in answering questions regarding long documents. However, our primary interest was in measuring the LLMs’ abilities to discern and utilize clinically relevant material when answering questions about the EHR. While methods such as RAG would likely be performant in this area, they would not have enabled us to directly assess the LLMs’ innate abilities to ignore irrelevant material and find details pertinent to the instruction.
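The multi-step refinement loop described above can be sketched as follows. The `generate` callable stands in for a model call and is hypothetical, as are the prompt wording and the representation of the EHR as a list of token strings; the actual prompts used in the study are not reproduced here.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks, ignoring XML boundaries."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def multi_step_refine(instruction, ehr_tokens, generate, chunk_size=30_000):
    """Feed the EHR to the model chunk by chunk, earliest first, refining one response."""
    response = None
    for chunk in chunk_tokens(ehr_tokens, chunk_size):
        context = " ".join(chunk)
        if response is None:
            # First chunk: answer from scratch.
            prompt = f"EHR:\n{context}\n\nInstruction: {instruction}"
        else:
            # Later chunks: update the running response or keep it unchanged.
            prompt = (
                f"EHR (continued):\n{context}\n\nInstruction: {instruction}\n"
                f"Previous response: {response}\n"
                "Update the response if appropriate; otherwise repeat it unchanged."
            )
        response = generate(prompt)
    return response


# Hypothetical stand-in for a model call, used only to trace the loop.
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return f"draft {len(calls)}"


final = multi_step_refine("List medications", ["tok"] * 70, fake_generate, chunk_size=30)
print(final, len(calls))
```

With 70 tokens and a chunk size of 30 the model is called three times, once per chunk, and only the last response is returned.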
For smaller, open-source models we evaluated Vicuña-7B and Vicuña-13B [4] as well as MPT-7B-Instruct [20]. These models are widely available and user-modifiable with favorable licensing agreements, but they have considerably smaller context lengths (2048 tokens) compared to GPT-4. To enable more direct comparisons, we assessed GPT-4 under a restricted context length designed to exactly match the context length of the Vicuña model.
Figure 2: (Left) Head-to-head comparison of model performance based on human ranks. The number in row i, column j indicates the proportion of instructions for which the response generated by the model in row i was strictly preferred over the model in column j. (Right) Head-to-head evaluation of model performance using COMET ranks. This panel has the same matrix structure and interpretation as the left panel, but uses rankings derived from COMET, an automated metric, rather than clinician-generated rankings. Model win rates using COMET follow a pattern similar to model win rates using human rankings.
Generating LLM Responses to EHR-based Questions and Instructions Using a standard prompt template (see Figure S9), each model was tasked to fulfill the given instruction grounded on its corresponding EHR pair. Due to current models’ context length restrictions, EHRs needed to be truncated. To calculate the number of tokens of EHR context to include in the prompt, we took each model’s maximum context length (in terms of the number of tokens under that model’s specific tokenizer), reserved 256 tokens for generation, and subtracted any tokens used for the corresponding structured prompt and instruction. This truncation was performed by counting tokens from the end of the record, ensuring that as much recent information as possible was retained.
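The token budget arithmetic and the end-anchored truncation can be sketched as below. The function names and the example token counts for the prompt template and instruction are assumptions for illustration; only the 256 reserved generation tokens and the keep-the-most-recent rule come from the text.

```python
def ehr_token_budget(max_context, prompt_tokens, instruction_tokens,
                     reserved_for_generation=256):
    """Tokens left for EHR context after reserving generation and prompt space."""
    budget = max_context - reserved_for_generation - prompt_tokens - instruction_tokens
    return max(0, budget)


def truncate_from_end(ehr_tokens, budget):
    """Keep the last `budget` tokens so the most recent events are retained."""
    # Guard the zero case: a slice of [-0:] would return the whole list.
    return ehr_tokens[-budget:] if budget > 0 else []


# Example with a 2048-token model and assumed prompt/instruction sizes.
budget = ehr_token_budget(max_context=2048, prompt_tokens=100, instruction_tokens=20)
ehr_tokens = list(range(5000))  # stand-in token ids, oldest first
kept = truncate_from_end(ehr_tokens, budget)
print(budget, kept[0], kept[-1])
```

Here 2048 - 256 - 100 - 20 leaves 1672 tokens, so only the final 1672 tokens of the 5000-token record reach the model.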
Clinician Evaluation of LLM Responses Nine clinicians were asked to evaluate and rank the responses generated by 6 separate LLMs. Clinicians did not evaluate their own responses or responses to instructions that they submitted. When feasible, clinicians evaluated responses to instructions that were written by a clinician in their same specialty. The instructions and EHRs reviewed by the clinicians were exactly the same in structure and content as those provided to the LLMs (albeit the EHRs reviewed by clinicians were never truncated, whereas the EHRs ingested by the LLMs were truncated according to their respective context lengths). Clinicians recorded a binary evaluation of whether the response was correct or incorrect, with “incorrect” defined as meeting at least one of the following criteria:
Responses not marked as “incorrect” were deemed to be “correct”. Clinicians then ranked the quality of the LLM responses based on which provided the most clinically relevant and appropriate response. Ties were permitted. The clinicians were blinded to which LLM generated each output, and the order of LLM output was reshuffled for each instruction. Each clinician reviewed 49 instruction-patient pairs on average, yielding 303 pairs reviewed overall with 50 instruction-EHR pairs being reviewed by three clinicians.
Overall, we found that more than half of the responses generated by the GPT-4 variants we tested were deemed correct by clinicians (65% for GPT-4 (32k + MR), 60.1% for GPT-4 (32k), 51.8% for GPT-4 (2k)). By contrast, only about one in three responses generated by the Vicuña and MPT-7B models were considered correct (35% for Vicuña-13B, 33.3% for Vicuña-7B, 31.7% for MPT-7B-Instruct; see Table 3). In head-to-head comparisons, GPT-4 without context length restriction was preferred over the Vicuña-13B model in 72% of instances, and preferred over MPT-7B-Instruct 81% of the time (see Figure 2). The GPT-4 model with 32k context length and no multi-step refinement had the highest overall average win-rate against all other models (0.676).
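The head-to-head proportions and average win rates can be computed from per-instruction rankings roughly as follows. This sketch assumes that a tie counts as a win for neither model and that the average win rate is the unweighted mean over opponents; the model names and ranks are placeholders, not study data.

```python
from itertools import permutations


def head_to_head(rankings):
    """Proportion of instructions where model a is strictly preferred over model b.

    rankings: list of dicts mapping model name -> rank (1 is best, ties allowed).
    """
    models = sorted(rankings[0])
    wins = {pair: 0 for pair in permutations(models, 2)}
    for ranks in rankings:
        for a, b in wins:
            if ranks[a] < ranks[b]:
                wins[(a, b)] += 1
    return {pair: count / len(rankings) for pair, count in wins.items()}


def average_win_rate(matrix, model):
    """Mean win rate of one model over all of its opponents."""
    rates = [rate for (a, _), rate in matrix.items() if a == model]
    return sum(rates) / len(rates)


# Placeholder rankings for three hypothetical models on four instructions.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 1, "C": 2},  # A and B tied
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
matrix = head_to_head(rankings)
print(matrix[("A", "B")], matrix[("A", "C")], average_win_rate(matrix, "A"))
```

Because ties count for neither side, the entries for a pair (a, b) and (b, a) need not sum to 1.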
Table 4: Correlation (mean Kendall’s Tau) between automated metrics’ rankings and human rankings of LLM outputs. Mean Kendall’s Tau between human reviewers (inter-rater reliability) was 0.44.
With the aim of finding an automated proxy for clinician-in-the-loop evaluation, we analyzed the correlation between a suite of automated metrics and human preference rankings using Kendall’s Rank Correlation (“Kendall’s Tau”) [15]. We also calculated the inter-rater correlation between human rankers, yielding a mean Kendall’s Tau coefficient of 0.44. The average correlations between metrics and human rankings are shown in Table 4. As noted by previous studies [23], the majority of these metrics have shown moderate correlation with human preference and are widely reported in NLG tasks.
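For a single instruction, the rank correlation between a metric-derived ranking and a human ranking can be computed as in the sketch below. Because human rankings permit ties, the example uses the tau-b variant, which is an assumption; the text does not state which variant was used. The example ranks are invented.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two rankings of the same items, allowing ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both denominator terms
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Invented ranks for six model responses to one instruction.
human_rank = [1, 2, 2, 4, 5, 6]   # clinician ranking with one tie
metric_rank = [1, 3, 2, 4, 6, 5]  # ranking induced by an automated metric
tau = kendall_tau_b(human_rank, metric_rank)
print(round(tau, 3))
```

Averaging this coefficient over instructions gives a mean correlation of the kind reported per metric; the same function applied to two clinicians' rankings gives an inter-rater value.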
We evaluated each model output using both source-free (SF) and source-augmented (SA) automated metrics. Source-free metrics compare a model’s output to a gold standard reference answer (in our case generated by a clinician) without the use of any additional context or sources (i.e., without any information from the EHR). We selected BERTScore [50], METEOR [1], chrF++ [29], GoogleBLEU [46], and ROUGE-L [18] due to their availability and wide use. Source-augmented metrics consider source (e.g., the EHR) in addition to the reference answer and the model response. The SA metrics we considered (and the LMs they use) include UniEval (T5-large) [53] and COMET (XLM-RoBERTa) [31]. As these models have limited context length we used the BM25Okapi algorithm to retrieve relevant snippets from within the patient’s EHR using the instruction as a search query.
Overall, COMET [31] exhibited the strongest correlation with clinician preference rankings, approaching the level of human inter-reviewer reliability (0.37 vs. 0.44). As seen in Figure 2, the overall trends of head-to-head comparisons were preserved when using COMET as the source of model output rankings vs. clinician-generated rankings. Specifically, GPT-4 was consistently preferred over the Vicuña and MPT-7B models by both COMET and clinicians, and the Vicuña models were consistently preferred over the MPT-7B model. Within the GPT-4 variants and between the two Vicuña models considered, win-rate preferences were not necessarily preserved, suggesting utility of COMET as a reasonable but perhaps coarse measure of model performance in this setting. The next most correlated metric with human rankings after COMET was BERTScore, a source-free metric, with an average correlation coefficient of 0.34.
Using our best performing automated metrics, COMET and BERTScore, we evaluated four recently released instruction-tuned medical LLMs (all based on Llama 2 [40]): AlpaCare [51], ClinicalCamel [37] and Med42 [5]. Figure 3 shows that, controlling for model size, current medical instruction tuning approaches largely yield worse performance in MedAlign vs. the base Llama 2 Chat model.
Figure 3: Automated evaluation of medical instruction-tuned LLMs vs. general instruction-tuned counterparts using the best-performing metrics (COMET and BERTScore).
Readily available datasets and benchmarks for easy-to-evaluate tasks like closed-form question answering have helped to measure the remarkable progress of LLMs, even in medical domains [16]. However, logistical difficulties and significant labeling costs have hindered progress towards establishing a shared dataset and benchmark for tasks that are amenable to LLMs and that truly represent clinician needs. We share such a benchmark dataset with the research community, which takes a novel approach towards instruction gathering by modularizing and isolating the process of instruction solicitation and EHR pairing. To the best of our knowledge, our dataset is the first to evaluate LLM performance on clinician-generated instructions using comprehensive, longitudinal EHRs. This affords several new insights.
The Importance of Context Length. While GPT-4 with a restricted context length of 2048 tokens achieved a correctness rate of 51.8%, the exact same GPT-4 model given 32000 tokens of context from the EHR achieved a correctness rate of 60.1%. Thus the additional context length yielded an additional 8.3 percentage points in the proportion of correct responses. Given the sheer quantity of tokens and concepts contained within comprehensive EHRs, including in MedAlign (see Appendix N), it is perhaps not surprising that instruction following performance was poor with a limited context length. Indeed, not a single EHR in MedAlign can fit entirely within the Vicuña or MPT-7B’s 2048 context length, and only 19.6% of these records can entirely fit within the 32k context length afforded by GPT-4. This highlights the importance of context length in applying LLMs to EHR-based tasks and motivates efforts to increase context lengths via, e.g., methods that do so implicitly via position interpolation [3] or approaches that explicitly improve the training efficiency of mathematical operations [8].
Misalignment with Current Benchmarks Medical instruction tuning in academic models currently favors shorter contexts, optimizing for tasks like MedQA and MMLU. MedQA, consisting of USMLE-style questions covering diagnosis support and care planning, is a popular choice for assessing the medical skills of an LLM [22, 24, 34, 45, 47]. However, USMLE-style questions only comprise 17% of the instructions submitted by clinicians to MedAlign while 68% of instructions involve retrieving and summarizing data from the EHR. Our results highlight that current medical instruction tuning practices often result in significant performance degradation in longer context tasks, with base Llama-2 models outperforming medical instruction-tuned LLMs in most cases. Given the importance of longer contexts and complex summarization skills in addressing clinician information needs, our work underscores the need to evaluate instruction tuning tasks beyond MedQA and similar narrow benchmarks.
Limitations. Our approach of first soliciting instructions and then pairing these instructions to EHRs can increase the scale and diversity of instructions collected, but at a cost. Despite yielding almost twice as many relevant pairings as simply randomly selecting an EHR for each instruction, our BM25 approach did not yield a relevant match for approximately 30% of instructions. In other words, while an instruction submitted by a clinician was of course relevant to the hypothetical patient they had in mind at the time of submission, it frequently ended up not being relevant to an actual patient EHR. There are potential ways to improve this matching process e.g., by using vector databases powered by BERT-style models which could better capture semantic alignment between queries and EHRs relative to BM25 [44]. Additionally, while we solicited instructions from a large number of clinicians at our academic medical center with diverse specialties and backgrounds, the clinicians who submitted data to MedAlign represent only a small fraction of the overall clinician workforce.
Conclusion. This work establishes, for the first time, the performance of some of the most capable LLMs available (GPT-4, LLaMA, and MPT-7B-Instruct) on EHR-based instruction-following tasks. We find that approximately one-third of the best-performing LLM’s responses are incorrect. The benchmark dataset we share, MedAlign, enables researchers to measure what matters and focus on tasks that are clinically relevant with significant potential positive impact. In addition, our findings establishing significant correlation between human preference and existing automated metrics provide a path for researchers to make technical progress without requiring the organizational infrastructure for clinical labeling. Finally, our novel approach towards soliciting clinician instructions paves the way for even larger-scale data collection efforts, both for training and evaluation purposes.
Security and Compliance. A university institutional review board granted approval for this study (reference number 57916). All authors handling data individually completed institutional HIPAA and data privacy training prior to engagement with the data. All models exposed to data were deployed within HIPAA-compliant compute infrastructure.
Privacy and Data Deidentification All data were de-identified using a “hiding in plain sight” protocol wherein protected health information (PHI) is replaced by coherent synthetic alternatives [2], e.g., tagging all person names and replacing them with a randomly generated name. For the research release of the MedAlign dataset, all documents will undergo human review to minimize risk of inadvertently exposing PHI. The dataset will be hosted in a university-approved, secure data portal and will require user credentialing to access, i.e., completing CITI ethics training and agreeing to the terms of our data use agreement.
Patient Consent Every patient at our medical center has provided their signature on a privacy notice, which explains that their medical records could be utilized for research. This data, once de-identified, is accessible to researchers under a comprehensive IRB protocol of the university.
Societal impact. LLMs could streamline clinician workflows within the EHR by replacing clunky point-and-click interfaces with natural language interactions, improving clinician efficiency. Muhiyaddin et al. [21] found EHR-related documentation tasks to be a leading cause of physician burnout, resulting in low-quality care, costly turnover, and a decline in patient safety. By easing documentation burden, LLMs could thus increase care quality, decrease clinician turnover, and improve patient safety. MedAlign provides a way to assess whether LLMs are safe and ready for the deployments necessary to realize these potential benefits.
Introducing LLMs into the clinic also poses potential risks. Even the best-performing model of those we assessed (GPT-4) produced incorrect responses for more than 33% of the clinician-generated instructions. These errors could decrease patient safety by leading to poor clinical decision making. More insidiously, a recent study by Omiye et al. [26] noted that commercial LLMs propagate harmful race-based stereotypes in medicine. We analyzed LLM performance differences across race in MedAlign (see Appendix) and found minimal disparities, but more work is needed. Additionally, we did not measure the prevalence of specific failure modes like hallucination and leave this for future work.