Introduction
Legal rules and regulations keep healthcare data secure and prevent violations of patient privacy.1 While necessary, these precautions make it difficult to obtain permission to combine data from multiple sources, increase the time required to conduct straightforward analyses, and make more complex analyses impossible.2-4 As a result, data on population-level drug safety and effectiveness generally come from a smattering of single-database studies that have limited precision and in-sample diversity and that apply differing analytic approaches and statistical analyses.5
As healthcare data were digitized and information technology advanced, an alternative approach was proposed: analyses using distributed data. In 2008, the FDA launched the Sentinel Initiative to explore a system in which database custodians, called “partners,” maintained ownership of their data as separate “nodes” of the network but transformed them into a common data model to be analyzed in a consistent way.6 A similar effort started in Canada with the Canadian Network for Observational Drug Effect Studies (CNODES),7, 8 formally funded in 2011, and the Patient-Centered Outcomes Research Institute (PCORI) began to design its own distributed network of partner organizations, PCORnet (the National Patient-Centered Clinical Research Network), in 2013.9 All three networks focus on generating one “network-wide” effect estimate in some fashion from the node-level data. Other distributed networks include the Data Analysis and Real World Interrogation Network (DARWIN-EU) project in Europe;10 a network that leverages the infrastructure built by the Observational Health Data Sciences and Informatics (OHDSI) community;11 the Asian pharmacoepidemiology network (AsPEN);12 the Vaccine Safety Datalink (VSD);13 and a distributed network created for the purposes of pregnancy research titled ConcePTION.14
Much has already been written about the steps these and other networks take to reduce confounding and information bias15-17, 18 in analyses within the individual nodes; after all, internal validity within nodes is necessary to generate unbiased estimates in nonexperimental research.19 Concepts related to external validity – such as effect measure modification, target populations, generalizability, and transportability – have received comparatively less attention in methodologic work on distributed data. Here, we describe the unique roles external validity and related concepts play in analyses of distributed data networks, especially those that seek to obtain a single “network-wide” effect estimate. We then provide an overview of the structure of Sentinel, CNODES, and PCORnet and describe how each network deals with these concepts.