Electronic Health Records and Augmented Data
Roblin and colleagues' retrospective cohort study used Kaiser Permanente Georgia's EHR system from 2006 to 2015 to identify potential TG patients (n=271).13 The authors describe a 3-step algorithm: an initial EHR search for International Classification of Diseases (ICD)-9 diagnosis codes (Supplemental Table 1) and key text-strings relevant to TG status in supplemental digitized provider notes; validation of TG status, requiring either at least two diagnosis codes or manual review of the text-strings; and determination of each patient's sex assigned at birth after inclusion in the cohort. After a committee internally validated patients by manually reviewing the key text-strings, the authors found that key text-strings only, diagnosis codes only, and both diagnosis codes and key text-strings yielded positive predictive values (PPV) of 45%, 56%, and 100%, respectively. A similar study by Quinn and colleagues used Kaiser Permanente's Georgia and California EHR systems to identify potential TG individuals (n=6456) and build the Study of Transition, Outcomes and Gender (STRONG) cohort.14 The study used the same 3-step algorithm and published its full list of key text-strings. In this study, only 10% of patients were found from diagnosis codes alone, while 61% were found from both diagnosis codes and keywords. The PPVs for key text-strings, diagnosis codes, and both were 26%, 54%, and 98%, respectively.
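The validation rule and the PPV calculation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the `Candidate` record, its field names, and the example counts are hypothetical assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one patient flagged by the initial EHR search."""
    n_diagnosis_codes: int       # TG-related ICD-9 codes found for the patient
    has_key_text_string: bool    # at least one keyword hit in provider notes
    text_review_confirmed: bool  # manual review judged the keyword hits valid


def is_validated(c: Candidate) -> bool:
    """Step 2 as described: two or more diagnosis codes, or keyword hits
    that were confirmed by manual review."""
    confirmed_by_text = c.has_key_text_string and c.text_review_confirmed
    return c.n_diagnosis_codes >= 2 or confirmed_by_text


def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: confirmed patients / all flagged patients."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no patients are flagged")
    return true_positives / flagged


# Illustrative counts only (assumed, not taken from either study).
example_ppv = ppv(true_positives=45, false_positives=55)
```

A rule of this shape explains why combining sources raises PPV: a patient flagged by a keyword alone still needs confirmation, whereas repeated diagnosis codes or reviewed text each provide independent evidence.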
Gerth and colleagues' study used the STRONG cohort to assess agreement between medical records and a self-reported survey.15 The survey contained the recommended self-reported gender identity questions, which ask for sex assigned at birth and current gender identity.5,27 They distributed the survey to a subset of cohort members to confirm TG status (transmasculine or transfeminine) based on gender-affirming treatment (e.g., testosterone or estrogen hormone therapy) and surgery (e.g., chest or genital reconstruction surgery) through Kaiser Permanente. They found high agreement between self-reported gender identity and gender-affirming treatment records, with a sensitivity of 99% and a specificity of 99%.15
Guo et al. built upon Quinn and colleagues' work by applying a CP within the University of Florida Health integrated data repository, which included the Epic EHR system from 2012 to 2019.6,14 They evaluated gender identity information, ICD-9 and ICD-10 diagnosis codes, Current Procedural Terminology (CPT) codes, and key text-strings relevant to TG status in clinician notes as candidate components to find the best-performing CP for their data. The authors validated their CPs through manual chart review of selected samples, then identified subgroups and used natal sex assignment to confirm transmasculine or transfeminine gender identity. Guo and colleagues found 19,600 potential TG patients. Their best-performing CP across structured and unstructured data classified a patient as TG when the patient had a recorded TG gender identity, or had at least one relevant diagnosis code and at least one key text-string relevant to TG status; this CP achieved an F1-score of 0.954.6
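The best-performing rule reported above is a simple Boolean combination, and the F1-score is the harmonic mean of precision and recall. The sketch below expresses both in Python under stated assumptions: the function and parameter names are hypothetical, and the counts passed to `f1_score` in any usage would come from a chart-review sample, which is not reproduced here.

```python
def meets_cp(has_tg_gender_identity: bool,
             n_diagnosis_codes: int,
             n_key_text_strings: int) -> bool:
    """Rule as described: a recorded TG gender identity, OR at least one
    relevant diagnosis code AND at least one relevant key text-string."""
    has_code_and_text = n_diagnosis_codes >= 1 and n_key_text_strings >= 1
    return has_tg_gender_identity or has_code_and_text


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        raise ValueError("F1 is undefined when TP, FP, and FN are all zero")
    return 2 * tp / denominator
```

Because the structured gender identity field alone is sufficient in this rule, its accuracy directly bounds the precision of the CP, which is one reason chart-review validation remains necessary.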
Foer and colleagues' retrospective chart review used Epic data from two primary academic teaching institutions in Boston, MA from 2015 to 2019 to identify 13,424 potential TG patients.20 They used key text-strings with TG-related text in clinician notes, F64 ICD-10 diagnosis codes, and gender identity field entries. Manual chart review of a subset served as the gold standard for validating patient classification. A legal sex field was available for all patients (100%), while sex assigned at birth was available for 48.7% of patients, and 48% had a completed gender identity field. They found 15.7% of TG patients through both diagnosis codes and key text-strings, 89% from key text-strings alone, 14% from a gender identity field, 1.2% from ICD diagnosis codes, and 5.1% from TG status listing. After validation via chart review of a subset of 324 patients, they confirmed 8% of patients as TG. Twenty-four patients identified through gender fields alone were misclassified as TG but were cisgender according to chart review. However, specificity was high: when the algorithm was applied to a random set of patients, none were found to be TG. In this study, key text-strings and diagnosis codes were more sensitive than gender-related fields for identifying TG patients.20
Blosnich and colleagues applied a CP of ICD-9 and ICD-10 diagnosis codes relevant to TG status to identify 7560 TG patients in the US Department of Veterans Affairs Corporate Data Warehouse from 2000 to 2016.16 Their validation method used a search algorithm that scanned clinical text notes for key text-strings related to TG status. The search algorithm reached a sensitivity of 89.30% and a specificity of 99.95%. As in Roblin and colleagues' study, false positives arose from key text-strings in discussions about TG relatives or friends of the patient.13 Key text-strings also revealed false negatives for 1.1% of patients.16
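For readers less familiar with the validation metrics reported above, the following Python sketch shows how sensitivity and specificity are derived from a confusion matrix. The function names and the counts in the example are illustrative assumptions, not figures from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of true TG patients that the algorithm flags: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined without any true cases")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of non-TG patients that the algorithm leaves unflagged: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("specificity is undefined without any true non-cases")
    return tn / (tn + fp)


# Hypothetical validation sample: 1,000 true cases and 10,000 true non-cases.
example_sensitivity = sensitivity(tp=893, fn=107)
example_specificity = specificity(tn=9995, fp=5)
```

Note that a very high specificity can coexist with a modest PPV when the condition is rare, because even a small false-positive rate applied to a large non-TG population can produce many false positives.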
Wolfe and colleagues used EHR data from the Veterans Health Administration (VHA) from 2006 to 2018 to create their cohort of TG veterans (n=10,769).21 Their CP included: 1) one or more gender identity disorder diagnosis codes in outpatient or inpatient data during the study period; 2) a diagnosis code of non-specified endocrine disorder; 3) a change in the sex marker field lasting at least 1 year, to reflect stability; 4) a sex hormone prescription discordant with sex; and 5) exclusion of those with a specific non-diabetes endocrine code, such as adrenal or thyroid disease, or with prostate cancer, along with minimum dosage levels for hormones. They used a hierarchical strategy that prioritized diagnosis codes or hormones, then non-specific endocrine disorder with hormone prescription, then endocrine disorder with change in sex marker, then hormone therapy with change in sex marker, and finally hormone prescriptions only, which is very similar to the approach of Jasuja et al.19 They validated the algorithm by performing a chart review of a random sample of veterans from each of the 5 groups. Wolfe and colleagues found that TG veterans with a gender identity disorder diagnosis code had the highest positive predictive value (83%) compared to non-gender identity disorder coded veterans (2%), and concluded that gender identity disorder diagnosis codes were the most reliable approach for identifying TG patients in the VHA.21
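A hierarchical strategy of this kind assigns each patient to the first group whose criteria they meet, so the groups are mutually exclusive. The Python sketch below is one possible reading of the ordering described above, not the authors' code: it assumes the top group is defined by a gender identity disorder diagnosis code, it omits the exclusion criteria and minimum dosage thresholds, and all parameter names are hypothetical.

```python
from typing import Optional


def assign_group(gid_diagnosis: bool,
                 nonspecific_endocrine_dx: bool,
                 discordant_hormone_rx: bool,
                 stable_sex_marker_change: bool) -> Optional[int]:
    """Return the first (highest-priority) group the patient qualifies for,
    or None if no criterion is met. Exclusions are not modeled here."""
    if gid_diagnosis:                                          # assumed top tier
        return 1
    if nonspecific_endocrine_dx and discordant_hormone_rx:
        return 2
    if nonspecific_endocrine_dx and stable_sex_marker_change:
        return 3
    if discordant_hormone_rx and stable_sex_marker_change:
        return 4
    if discordant_hormone_rx:                                  # hormones only
        return 5
    return None
```

Assigning each patient to a single group is what makes it possible to draw a random chart-review sample from each group and report a separate PPV per group.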
Alpert and colleagues' cross-sectional study used CancerLinQ data from the American Society of Clinical Oncology (ASCO) Learning HealthCare System to identify TG cancer patients (n=557).22 Their CP had three categories: (category 1) a diagnosis related to gender identity (transsexualism or gender identity disorder); (category 2) recorded gender male with at least one diagnosis code indicating cancer of the ovaries, cervix, vulva, vagina, uterus, placenta, or other related organs; and/or (category 3) recorded gender female with at least one diagnosis code indicating cancer of the prostate, testes, penis, or other related organs. A total of 557 individuals matched their inclusion criteria within CancerLinQ data: 42 in category 1, 316 in category 2, and 199 in category 3. Of those with an ICD-9 or ICD-10 diagnosis code relevant to TG status, 76% were confirmed to be TG, while only 2% and 3% of those identified through categories 2 and 3, respectively, were confirmed.22 Categories 2 and 3 had very low specificity, as many of the patients they identified were false positives (i.e., cisgender).
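The three categories above can be expressed as a small rule set. The following Python sketch is illustrative only: it matches on plain-language cancer site labels rather than actual diagnosis codes, the site lists cover only the organs named in the text (not the "other related organs"), and the function and constant names are hypothetical.

```python
from typing import Iterable, Set

# Sites named in the text for each discordance category (incomplete by design).
CATEGORY_2_SITES = frozenset({"ovary", "cervix", "vulva", "vagina", "uterus", "placenta"})
CATEGORY_3_SITES = frozenset({"prostate", "testis", "penis"})


def matched_categories(has_tg_diagnosis: bool,
                       recorded_gender: str,
                       cancer_sites: Iterable[str]) -> Set[int]:
    """Return every category (1, 2, and/or 3) that a patient record matches."""
    sites = set(cancer_sites)
    matched: Set[int] = set()
    if has_tg_diagnosis:
        matched.add(1)
    if recorded_gender == "male" and sites & CATEGORY_2_SITES:
        matched.add(2)
    if recorded_gender == "female" and sites & CATEGORY_3_SITES:
        matched.add(3)
    return matched
```

Categories 2 and 3 rely entirely on a mismatch between a recorded gender field and an anatomical site, so a single data-entry error in either field is enough to flag a cisgender patient, which is consistent with the high false-positive rate reported for those categories.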
Chyten-Brennan and colleagues created a CP to identify TG patients (n=213) among people living with HIV in the Montefiore Health System in New York City from 1997 to 2017.23 Their CP contained: 1) ICD-9 or ICD-10 diagnosis codes; 2) gender-affirming medications; 3) key text-strings; and 4) gender identity variables (e.g., yes/no field for TG). After manual chart review to validate TG status, they confirmed 84% of patients (PPV). Only 13.5% were identified through ICD-9 or ICD-10 diagnosis codes alone, while 60% were found from multiple categories. They were not able to confirm the TG status of 22% of those found only through ICD-9 or ICD-10 diagnosis codes. However, they accurately identified 15% of TG patients through HIV-funding related gender identity data, which is not found in other EHR-based algorithms. Without this data, they would have differentially misclassified a large portion of TG people, which would have led to biased estimates.
EHR data overcame the key validation limitation of claims data by allowing manual chart reviews, as well as access to self-reported gender identity when that data was collected and available. Similar to claims-based CPs, the strongest CPs in EHR data combined diagnosis codes with other information, which for EHR data was key text-strings relevant to TG status. When key text-strings were available, the PPV of the CP had the potential to reach 100%.13 In terms of algorithm components to identify TG patients, Wolfe et al. and Alpert et al. found the highest proportion of TG patients through diagnosis codes alone.21,22 However, Chyten-Brennan and colleagues identified only 13.5% of TG patients through diagnosis codes alone, and Foer and colleagues found that key text-strings identified almost 90% of patients.20,23 Additionally, Chyten-Brennan and colleagues' access to self-reported gender identity data added a large number of TG patients who would otherwise have been classified as cisgender on the basis of their medical records alone.