INTRODUCTION
An estimated 2 million adults who identify as transgender (TG) live in the United States.1 Unjust discrimination and violence have led to consistently reported, disproportionate health burdens among TG populations, including a higher prevalence of mental health distress, substance misuse, and HIV than among cisgender people (i.e., those whose sex assigned at birth aligns with their current gender identity).2,3 While the health literature on TG individuals is growing, this population is largely overlooked in epidemiologic studies because of small sample sizes and inconsistent measures for collecting gender identity data.4 Recruiting a large sample of TG people is labor intensive and costly, leading researchers to turn to real-world data (RWD) sources, such as electronic healthcare databases, to develop efficient methods for identifying these patients.
Computational phenotypes (CPs) have emerged as tools to distinguish groups of patients with shared characteristics within electronic healthcare databases, and they have an important role in TG health-related research.5 In brief, CPs are algorithms that use a combination of diagnostic and procedure codes, medication records, and demographic characteristics to identify patient populations within healthcare utilization data.6 Because data models vary across RWD sources, there is no single standardized method to identify TG patients. A 2016 systematic review assessed the variation in worldwide prevalence estimates of TG people derived from self-reported gender identity information in surveys and from TG-related diagnosis codes in electronic healthcare data. It highlighted the lack of standardization and the significant heterogeneity in ascertainment of TG status across studies as an important barrier to research.4
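To make the idea of a CP concrete, the following is a minimal sketch in Python of a rule-based phenotype that combines diagnosis codes, procedure codes, and medication records. It is an illustration under stated assumptions, not an algorithm taken from any study discussed in this review: the code sets (DX_A, PX_A, RX_HORMONE_A), the PatientRecord structure, and the inclusion rule are all hypothetical placeholders, and demographic criteria are omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder code sets (assumptions for illustration only).
# A real phenotype would use a published, validated code list.
TG_DIAGNOSIS_CODES = {"DX_A", "DX_B"}
TG_PROCEDURE_CODES = {"PX_A"}
TG_MEDICATION_CLASSES = {"RX_HORMONE_A"}


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's utilization data."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    procedure_codes: set = field(default_factory=set)
    medication_classes: set = field(default_factory=set)


def meets_phenotype(record: PatientRecord) -> bool:
    """Apply an illustrative inclusion rule to a single patient record."""
    has_dx = bool(record.diagnosis_codes & TG_DIAGNOSIS_CODES)
    has_px = bool(record.procedure_codes & TG_PROCEDURE_CODES)
    has_rx = bool(record.medication_classes & TG_MEDICATION_CLASSES)
    # Assumed rule: a qualifying diagnosis code alone, or a qualifying
    # procedure code together with a qualifying medication record.
    return has_dx or (has_px and has_rx)


def identify_cohort(records: list) -> list:
    """Return the IDs of patients who satisfy the phenotype, in input order."""
    return [r.patient_id for r in records if meets_phenotype(r)]


records = [
    PatientRecord("p1", diagnosis_codes={"DX_A"}),
    PatientRecord("p2", procedure_codes={"PX_A"}),
    PatientRecord("p3", procedure_codes={"PX_A"},
                  medication_classes={"RX_HORMONE_A"}),
    PatientRecord("p4"),
]
cohort = identify_cohort(records)
```

In this sketch, p1 qualifies on a diagnosis code alone and p3 qualifies on a procedure code combined with a medication record, whereas p2 and p4 do not. Changing the code sets or the inclusion rule changes who is identified, which is one way the heterogeneity in ascertainment described above can arise.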
To date, there has been no review of the published literature on CPs that examines their ability to identify TG people and their health utilization patterns within electronic healthcare data. Similarly, there has been no comprehensive assessment of the validity of such CP algorithms in this setting. In this narrative review, we discuss the existing literature that has used CPs to identify TG people within electronic healthcare data, examine the validity of these CPs, identify potential gaps, and synthesize recommendations for future work based on current knowledge.