Main Text
Introduction
Cardiotocography (CTG) is currently the main method of fetal monitoring
in labour. Introduced in the 1960s to detect fetal heart rate (FHR)
patterns thought to indicate hypoxia, its use increased rapidly, before
evidence established either efficacy or safety. Various methods of
classifying FHR abnormalities have been described, but none has shown
adequate levels of both sensitivity and specificity. In the 1990s, the
shortcomings of CTG were highlighted by a study showing its limited
power to accurately predict cerebral palsy
(CP).1 However, since
only around 15% of cases of CP are attributable to intrapartum
hypoxia-ischemia and CP is only one of the possible outcomes of birth
asphyxia, differential rates of CP are not an ideal metric for the
assessment of FHR monitoring.2 Some specific
abnormalities, e.g. absent or minimal baseline variability and late or
prolonged decelerations, can achieve higher power for the prediction of
fetal asphyxia.3,4 Nevertheless, the relative rarity of
important adverse outcomes does hinder the positive predictive value of
even the most specific FHR abnormalities. Meta-analysis has shown that the
use of continuous CTG results in increased rates of caesarean section,
marginal reductions in the incidence of neonatal seizures and no
improvement in other neonatal
outcomes.5 For fetal
monitoring to help improve clinical outcomes, adverse outcomes must be
predicted with acceptable accuracy before hypoxia results in neuronal
injury.
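The effect of outcome rarity on positive predictive value follows directly from Bayes' rule. The sketch below is illustrative only: the sensitivity, specificity and prevalence values are hypothetical and are not results of this study.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Bayes' rule: probability of the outcome given a positive test."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, held fixed while the outcome becomes rarer.
for prevalence in (0.1, 0.01, 0.001):
    ppv = positive_predictive_value(0.70, 0.80, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.4f}")
```

With sensitivity and specificity held constant, the positive predictive value falls as the outcome becomes rarer, because false positives among the many unaffected labours outnumber true positives.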
Most studies of the diagnostic accuracy of intrapartum CTG monitoring
have focused on relatively common outcomes such as low Apgar scores or
umbilical cord blood acidosis. There is a lack of studies which have
tested the power of specific FHR patterns to predict neonatal
encephalopathy. While rare, neonatal encephalopathy is an important
clinical outcome. Therapeutic hypothermia has decreased rates of
mortality, cerebral palsy, and intellectual impairment in
childhood.6,7 However, the risk of complications
remains significant especially among those most severely affected.
Intellectual impairment is now recognised as a possible complication
even in those babies who have been spared other sequelae. Pappas et al.
found an IQ score <70 in 96% of survivors with cerebral
palsy (CP) and 9% of those without CP, and IQ scores <84 in
52% of participants treated with
hypothermia.8 As a
group, babies with mild encephalopathy have lower cognitive test scores
at 5 years than healthy
babies.9
Design and Methodology
Aim
To determine the accuracy of intrapartum fetal heart rate abnormalities
as defined by NICE guidelines for the prediction of moderate to severe
hypoxic-ischemic neonatal encephalopathy.
Study Population
Subjects were identified from a case-control database used to identify
risk factors for neonatal encephalopathy. Eligible subjects were born in
the Rotunda hospital between September 2006 and November 2017 at ≥35+0
weeks’ gestational age and had no major congenital anomalies. Cases were
diagnosed with antenatally-acquired moderate or severe hypoxic ischemic
neonatal encephalopathy by the attending consultant neonatologist. In
all cases, the timing of the injury was thought to be intrapartum.
Controls were the first eligible babies born before and after each case
who were not admitted to the neonatal unit and had Apgar scores ≥5 at
1 minute and ≥7 at 5 minutes. For this study, those women in the
database who were admitted to the delivery suite in labour and had
electronic fetal heart rate monitoring were included.
Data Handling
Maternal and Neonatal Clinical Details
Maternal and neonatal clinical details were collected from documentation
made by the clinical teams and available in the patient records.
Fetal Heart Rate Pattern Analysis
Cardiotocograph (CTG) traces were exported from the hospital’s Athena
archive (K2 Medical Systems Ltd, Plymouth, United Kingdom), stripped of
any identifiers besides the date and time. The hospital’s guidelines
state that CTG monitoring should be recommended to all mothers who have
risks identified during the antenatal or intrapartum period. The
electronic fetal monitors used in the delivery suite were GE Corometric
170 series models. Fetal heart rate pattern features were marked
according to NICE-UK criteria and
definitions.10 The
traces were marked by Dr. Adam Reynolds, blind to all clinical data, in
order of a randomly assigned database number and using a unique user
interface developed within Matlab (MathWorks, Natick, Massachusetts,
USA). With this interface, each recording was displayed in consecutive
15-minute segments with an additional 7.5-minute window visible on
either side. For each segment, the following fetal heart rate features
were manually marked: interpretable (yes or no), baseline, variability,
each acceleration, each early, variable, deep variable, prolonged
variable, late, and prolonged deceleration, as well as the presence of
any bradycardia or sinusoidal pattern. Based on the NICE guideline
criteria the baseline, variability, and deceleration pattern for each
15-minute segment were automatically classed as reassuring,
non-reassuring or abnormal and based on this assessment each segment was
then categorised as normal, suspicious or pathological.
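The automatic categorisation step can be sketched as follows. This is a minimal illustration, not the Matlab interface used in the study; it assumes the commonly cited NICE combination rule (all features reassuring: normal; exactly one non-reassuring feature: suspicious; any abnormal feature or two or more non-reassuring features: pathological), which is not spelled out in the text above. The function and constant names are hypothetical.

```python
VALID_GRADES = {"reassuring", "non-reassuring", "abnormal"}


def categorise_segment(baseline: str, variability: str, decelerations: str) -> str:
    """Combine three feature grades into a segment category.

    Assumed rule (see lead-in): one abnormal or two or more non-reassuring
    features is pathological; exactly one non-reassuring feature is
    suspicious; otherwise the segment is normal.
    """
    grades = [baseline, variability, decelerations]
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unrecognised feature grade(s): {sorted(unknown)}")

    abnormal = grades.count("abnormal")
    non_reassuring = grades.count("non-reassuring")
    if abnormal >= 1 or non_reassuring >= 2:
        return "pathological"
    if non_reassuring == 1:
        return "suspicious"
    return "normal"


print(categorise_segment("reassuring", "non-reassuring", "reassuring"))
```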
Statistical Analysis
All calculations were performed in SPSS v26.0 (IBM Corp., Armonk, New
York, U.S.) unless otherwise stated. Categorical and continuous
variables were compared by Fisher’s exact test or Mann-Whitney U-test,
respectively. For each variable and model, the area under the receiver
operating characteristic curve (AUROCC) and its asymptotic 95%
confidence interval were calculated. From the ROC analysis, the point
with the maximum Youden index was selected as the split
point.11 AUROCCs were
compared using R (R Foundation for Statistical Computing, Vienna,
Austria) and the pROC
package.12 Multivariate
logistic regression was used to estimate odds ratios (ORs) and 95%
confidence intervals (CIs) for moderate or severe encephalopathy.
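For readers unfamiliar with these quantities, the sketch below shows how an AUROCC and a Youden-index split point can be computed from raw scores. It is a standard-library illustration on hypothetical toy data, not the SPSS or pROC procedure used in the study, and it assumes that higher scores indicate a higher risk of the outcome.

```python
def auroc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as one half, which makes this equivalent to the
    Mann-Whitney U statistic divided by the number of pairs.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def youden_split_point(case_scores, control_scores):
    """Return (cutoff, sensitivity, specificity) maximising J = sens + spec - 1.

    A subject is counted as test positive when score >= cutoff.
    """
    best = None
    for cutoff in sorted(set(case_scores) | set(control_scores)):
        sens = sum(s >= cutoff for s in case_scores) / len(case_scores)
        spec = sum(s < cutoff for s in control_scores) / len(control_scores)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1:]


# Hypothetical counts of abnormal segments per labour.
cases = [3, 5, 6, 8]
controls = [0, 1, 1, 2, 4, 5]
print(auroc(cases, controls))
print(youden_split_point(cases, controls))
```

The pairwise definition is quadratic in sample size, which is acceptable for a cohort of this size; rank-based implementations are preferable for large datasets.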
Results
Description of Cohort
The total number of live births over the study period was 99,046.
Eighty-eight cases and one hundred and seventy-six controls were
included in the database. Seventy-one (81%) cases and one hundred and
forty-six (83%) controls were in labour and admitted to the delivery
suite (Chi-squared p=0.649). Of that group, 52 (73%) cases and 121
(83%) controls had intrapartum electronic fetal heart rate monitoring
(Chi-squared p=0.098). Three controls were then excluded because the
duration of monitoring was less than 15 minutes.
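The two p values above can be checked from the reported counts. The sketch assumes an uncorrected Pearson chi-squared test on a 2x2 table (one degree of freedom); the text does not state which variant was used, so this is an assumption, and the function name is hypothetical.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p value). With one degree of freedom the upper-tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# In labour and admitted to the delivery suite: 71 of 88 cases, 146 of 176 controls.
print(chi_squared_2x2(71, 88 - 71, 146, 176 - 146))
# Electronic FHR monitoring among those: 52 of 71 cases, 121 of 146 controls.
print(chi_squared_2x2(52, 71 - 52, 121, 146 - 121))
```

Under this assumption the p values come out at approximately 0.649 and 0.098, in agreement with the figures reported above.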
Selected demographic and
obstetrical characteristics are shown in Table 1.
Thirty-eight and fourteen cases had moderate and severe neonatal
encephalopathy, respectively. In the cases, the median arterial pH was
7.16 (n=45, IQR: 7.02-7.21). The median 5-minute Apgar score was 6
(n=52, IQR=2-7) in the cases and 10 (n=117, IQR=10-10) in the controls.
Forty-four (85%) of the cases had a 5- or 10-minute Apgar score ≤5, or
an umbilical cord or early postnatal blood sample which showed a pH
<7 or a base excess <-12. When the pH and base
excess thresholds were adjusted to <7.1 and <-8, respectively,
fifty-one (98%) of the cases met the criteria. The remaining single
subject had normal cord gases but a history of a significant sentinel
event, was intubated for apnoea and had an initially severely abnormal
amplitude-integrated electroencephalogram (aEEG).
Fetal Heart Rate Analysis
Univariate Analysis
The main results of the univariate
analysis of individual FHR features are presented in Table
2. The largest number of
consecutive segments with the baseline FHR above a threshold was
statistically significant for >160, >150, and
>140 bpm. However, following correction for the total number
of segments, >160 bpm was the only remaining statistically
significant predictor.
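The "largest number of consecutive segments" metric can be sketched as a longest-run count over the per-segment baselines. The handling of uninterpretable segments is not described above; in this hypothetical sketch they are represented as None and are assumed to break a run.

```python
def longest_run_above(baselines, threshold_bpm):
    """Largest number of consecutive segments with baseline above threshold_bpm.

    `baselines` holds one baseline (bpm) per 15-minute segment; None marks an
    uninterpretable segment and is assumed here to end the current run.
    """
    longest = current = 0
    for bpm in baselines:
        if bpm is not None and bpm > threshold_bpm:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical recording of eight segments.
recording = [145, 165, 170, None, 162, 161, 163, 150]
for threshold in (160, 150, 140):
    print(threshold, longest_run_above(recording, threshold))
```

Because longer recordings offer more opportunity for long runs, a metric of this kind is naturally confounded by the total number of segments, which is why the correction described above matters.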
Decelerations were analysed both in terms of total number and in terms
of rate (Supplemental Material Table 1). The total number of variable
decelerations showed a statistically significant difference between
cases and controls, but the frequency of variable decelerations did not
(p=0.076). There were no significant differences between the
deceleration rate AUROCC and the deceleration number AUROCC for any
other type of deceleration. However, logistic regression models which
incorporated the rate of decelerations and the total length of tracing
outperformed the respective univariate models based solely on the number
of decelerations of that type.
The results of the univariate analysis of FHR categories are presented
in Table 3. AUROCC was higher for the number of suspicious segments
compared to whether any single suspicious segment was observed
(p<0.001). The AUROCC for the number of suspicious segments
was higher than for the number of pathological segments but the
difference did not meet the level of statistical significance (p=0.088).
The unadjusted odds ratio for the number of suspicious segments was 1.31
(95% CI: 1.17-1.47) compared to 1.47 (1.18-1.84) for the number of
pathological segments.
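Because these odds ratios are expressed per segment, the logistic model implies that the odds multiply with each additional segment. The sketch below illustrates that arithmetic for four 15-minute segments (one hour); it assumes the log-linear relationship of the model holds across that range and is an illustration, not an additional result.

```python
def odds_multiplier(odds_ratio_per_segment: float, n_segments: int) -> float:
    """Multiplicative change in odds implied by a per-segment odds ratio."""
    return odds_ratio_per_segment ** n_segments


# Point estimates reported above; four segments correspond to one hour.
print(round(odds_multiplier(1.31, 4), 2))  # suspicious segments
print(round(odds_multiplier(1.47, 4), 2))  # pathological segments
```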
Multivariate Analysis
The multivariate logistic regression models are detailed in Supplemental
Material Table 2. The best performing multivariate model incorporated
the total number of fifteen-minute segments, the percentage of segments
classed as suspicious, and the percentage of segments classed as
pathological (AUROCC: 0.782 [95% CI: 0.704-0.861], sensitivity:
69%, specificity: 80%). The AUROCC of this model was superior to that
of the best univariate predictor (number of consecutive segments with
baseline FHR >160 bpm), but the difference did not meet the
threshold for statistical significance (p=0.063). The best logistic
regression model using FHR segment categories was essentially identical
(p=0.9162) in overall performance to the best performing model which
used individual FHR features. Figure 1 shows the ROC curve for the
logistic regression model based on FHR categories along with the 95%
confidence interval for sensitivity at a given specificity.
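The three predictors of the category-based model can be derived from a recording's sequence of segment categories, as sketched below. The function name is hypothetical, and the choice of denominator (all segments rather than only interpretable ones) is an assumption, since it is not specified above.

```python
def segment_predictors(categories):
    """Return (total segments, % suspicious, % pathological) for one recording.

    `categories` is a list with one entry per 15-minute segment, each being
    "normal", "suspicious" or "pathological".
    """
    total = len(categories)
    if total == 0:
        raise ValueError("recording contains no segments")
    pct_suspicious = 100.0 * categories.count("suspicious") / total
    pct_pathological = 100.0 * categories.count("pathological") / total
    return total, pct_suspicious, pct_pathological


# Hypothetical 2.5-hour recording.
example = ["normal"] * 5 + ["suspicious"] * 3 + ["pathological"] * 2
print(segment_predictors(example))
```

Expressing abnormal segments as percentages alongside the total length separates how long the recording was from how much of it was abnormal.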
Fetal Scalp Blood Sampling
Nine cases (17%) and 8 controls (7%) had one fetal blood sample taken
for pH testing. Six cases (12%) and no controls had two or more samples
taken. The overall Fisher’s exact test p value for the number of fetal
scalp blood samples was <0.001. One case had a pH
<7.2, but no subject had a pH <7.1.
Discussion
Main Findings
As expected, no FHR pattern feature or category achieved both high
sensitivity and specificity. Multivariate models performed better than
any single variable, but still did not achieve high accuracy. The best
logistic regression model using FHR segments categorised according to
NICE criteria was essentially identical in performance to the best
performing multivariate model which used individual fetal heart rate
features. This finding suggests that the current categories are
appropriately capturing the predictive value inherent in the underlying
features.
Strengths and Limitations
One of the strengths of this study is the CTG assessment method. The
entire recording from delivery suite admission up to birth was analysed
and each 15-minute interval was classified. Analysis of the entire
length of the CTG rather than a specified period pre-delivery allowed
determination of the duration of abnormalities. Features were identified
blind to clinical outcome and segments were algorithmically categorised.
Another strength is the use of moderate-severe neonatal encephalopathy
as the outcome. Most studies of FHR patterns have focused either on
acidaemia or Apgar scores, neither of which is highly specific or
sensitive, or on cerebral palsy, which is etiologically diverse and not
usually associated with intrapartum anomalies. Not all of the babies
included in this study had abnormal cord gases, but all had features
consistent with peripartum hypoxia-ischemia. This is in keeping with
published data. In a study published in 2012 and based on data from the
Vermont Oxford Network, 54% of the babies diagnosed with neonatal
encephalopathy who had cord blood sampling had a pH
<7.09.13 In
this study, diagnosis was made by the attending neonatologist based on
history, examination and amplitude-integrated electroencephalography
findings. The inter-rater reliability of clinical examination to assess
neonatal encephalopathy is generally
good.14 Nevertheless,
it is possible that knowledge of the FHR patterns could have biased the
attending clinicians towards a diagnosis of neonatal encephalopathy and
therefore increased the observed predictive power.
We did not measure FHR deceleration area. In a cohort study employing
manual assessment of FHR traces in the last two hours before delivery,
Cahill et al. showed that the total deceleration area (AUROCC: 0.76
[95% CI: 0.72-0.80]) was more predictive of umbilical cord
acidaemia than “always ACOG grade 2” (0.61 [0.56-0.65]), “any
ACOG grade 3” (0.62 [0.57-0.66]), or the total number of
decelerations (0.66
[0.62-0.71]).15 However, with regard to a composite measure of neonatal morbidity, the
AUROCC for deceleration area was less (0.66 [0.64-0.68]) and similar
to the values for ACOG FHR categories. A 2014 study showed that, when
used in isolation, manual estimation of the total area of decelerations
in the hour before delivery has an AUROCC of 0.68 (0.56–0.79) for
detection of babies with moderate-severe
encephalopathy.16 The
controls in that study were matched for mode of delivery which may have
resulted in a higher rate of fetal heart rate abnormalities and
therefore lower specificity for a given sensitivity than would be found
in the general population.
We did not employ fully automated analysis. Attempts to show a benefit
to automated interpretation based on replication of existing
classification schemes have been hampered by an unsatisfactory incidence
of false positive
alarms.17 Overall,
existing methods of artificial interpretation of FHR traces have aimed
to reproduce human methods and have therefore inherited the problems of
poor agreement and unproven benefit for the reduction of neonatal
acidaemia.18 Methods
which train convolutional neural networks to predict adverse outcomes
without resorting to existing classification schemes have shown
promising accuracy in early
studies.19 Such methods
require large datasets to train and establish predictive accuracy before
trials of clinical utility can be considered.
Owing to resource limitations, we were not able to analyse changes in
the pattern of abnormalities over time. This may be of interest. Murray
et al. identified three patterns of FHR abnormalities in babies with
neonatal encephalopathy (group 1: abnormal CTG on admission; group 2:
normal CTG on admission with gradual deterioration; group 3: normal CTG
on admission with acute sentinel
events).20 In that
study, babies in group 3 had more severe encephalopathy. Our study only
features women in the latter two groups. Due to sample size limitations
we were not able to establish the relationship between patterns of
abnormality and the severity of encephalopathy.
Interpretation
To our knowledge, this is the first study to assess the accuracy of NICE
criteria for the prediction of encephalopathy. NICE criteria have been
shown to result in more traces classified as either normal or
pathological and fewer classified as suspicious compared to ACOG,
resulting in overall relatively higher sensitivity but lower specificity
for the prediction of umbilical artery pH values
≤7.05.21
Suspicious and pathological segments were found in ½ and ¼ of control
labours, respectively. Despite the higher sensitivity of pathological
segments, the overall predictive power of the number of suspicious
segments was actually higher than that of the number of pathological
segments even if the difference did not quite reach the level of
statistical significance. Furthermore, the unadjusted odds ratio for
the number of pathological segments was only slightly higher than for
the number of suspicious segments. This is a surprising finding, but it
is important to consider the clinical context which is that pathological
traces will usually prompt immediate delivery, potentially partially
uncoupling the relationship between that classification and adverse
outcomes. In contrast, suspicious traces do not usually prompt immediate
delivery and therefore may persist for longer.
While most studies focus on the severity of the FHR abnormality and
therefore the severity of insult, the duration of hypoxia-ischemia is
also important. Frey et al. showed that ACOG category II traces in the
last half hour before delivery are extremely common and do not help to
identify labours associated with neonatal
encephalopathy.22 Our
data shows that it is important to consider not only the presence of
abnormalities but also the duration of those abnormalities. Data from
animal models support this conclusion. In a rabbit model of preterm
acute placental insufficiency, insults of ≥37 minutes produced increased
rates of stillbirth and neuronal injury, but exposures of 30 minutes did
not. In our study, 52% of cases and just 14% of controls had at least
one hour of suspicious FHR traces. FHR pattern features which accurately
predict HIE but require a long duration to produce fetal asphyxia may
offer more opportunity to prevent injury than more severe and acute
abnormalities. In the current NICE guidelines, the duration of
abnormalities is considered as part of the feature assessment e.g. the
presence of late decelerations for 30 minutes is classed as abnormal.
However, there is no mention of how the duration of suspicious FHR
patterns affects assessment or management.
To be clinically useful any intrapartum monitoring technique must not
only predict neonatal encephalopathy but also help to prevent it.
Despite extensive investigation, continuous FHR monitoring has never
been proven to provide such a benefit. (However, it should be noted that
due to a lack of equipoise, trials have only compared different forms of
monitoring rather than featured a control group without any monitoring.)
The apparent failure of FHR monitoring to prevent injury is possibly due
to the fact that the most predictive patterns are often associated with
severe acute insults such as placental abruption which would often be
detected without continuous FHR monitoring and which, even with
emergency delivery, can result in poor outcomes. In short, the risk is
not alterable at the time of detection. The challenge in fetal
monitoring is to recognise impending neurological injury early enough
that it is still preventable without an unacceptable proportion of false
positives. Recent evidence from randomised controlled trials of
intrapartum sildenafil suggests that it reduces the rate of abnormal FHR
traces.23 It is
possible that by reducing the incidence of FHR abnormalities in labours
which would have had normal outcomes anyway, sildenafil could increase
the positive predictive value of persistent abnormalities.
Fetal acidaemia has been shown to have extremely poor sensitivity for
adverse neonatal
outcomes.24 Fetal scalp
blood sampling was uncommon in this population. No fetuses in this study
had an abnormal scalp blood pH. There was a strong relationship between
the number of samples taken and the risk of neonatal encephalopathy.
Obviously, this relationship is dependent on clinical practice and could
vary depending on the setting. Therefore, this result is not necessarily
generalisable. Nevertheless, our data supports the idea that a normal
scalp pH does not ensure a normal outcome and that it is not generally
prudent to use repeated sampling in an attempt to avoid interventions
such as operative delivery.
It is unreasonable to expect any method of FHR interpretation used in
isolation to have both high sensitivity and specificity for the
diagnosis of neonatal encephalopathy. Neonatal encephalopathy is a
heterogeneous condition which is influenced by diverse risk factors, some
of which, e.g. chorioamnionitis, do not usually result in fetal heart rate
changes.25 Models which
incorporate multiple factors such as markers of placental function,
labour progression, and uterine activity, in addition to FHR
abnormalities and their durations may improve our ability to predict and
possibly even to prevent HIE.
Conclusion
While the CTG is a useful aid in intrapartum management, its limitations
need to be appreciated and it is important that it is interpreted while
taking all other fetal, maternal and partogram factors into account. In
addition, the power of fetal heart rate abnormalities to predict HIE is
not fixed or necessarily generalisable. For instance, it depends on the
distribution of the aetiologies of HIE in the study population i.e. it
is likely to have less power to prevent HIE in populations where
sentinel events account for higher proportions of the cases of HIE.
Since the incidence of HIE is alterable, it is also dependent on the
clinical context e.g. the local caesarean delivery rate.