Research Article

Sleep monitoring with the Apple Watch: comparison to a clinically validated actigraph

[version 1; peer review: 2 approved with reservations, 1 not approved]
PUBLISHED 29 May 2019

Abstract

Background: We investigate the feasibility of using an Apple Watch for sleep monitoring by comparing its performance to the clinically validated Philips Actiwatch Spectrum Pro (the gold standard in this study), under free-living conditions.
Methods: We recorded 27 nights of sleep from 14 healthy adults (9 male, 5 female). We extracted activity counts from the Actiwatch and classified 15-second epochs into sleep/wake using the Actiware Software. We extracted triaxial acceleration data (at 50 Hz) from the Apple Watch, calculated Euclidean norm minus one (ENMO) for the same epochs, and classified them using a similar algorithm. We used a range of analyses, including Bland-Altman plots and linear correlation, to visualize and assess the agreement between Actiwatch and Apple Watch.
Results: The Apple Watch had high overall accuracy (97%) and sensitivity (99%) in detecting actigraphy-defined sleep, and adequate specificity (79%) in detecting actigraphy-defined wakefulness. Over the 27 nights, total sleep time was strongly linearly correlated between the two devices (r=0.85). On average, the Apple Watch over-estimated total sleep time by 6.31 minutes and under-estimated Wake After Sleep Onset by 5.74 minutes. The performance of the Apple Watch compares favorably to the clinically validated Actiwatch in a normal environment.
Conclusions: This study suggests that the Apple Watch could be an acceptable alternative to the Philips Actiwatch for sleep monitoring, paving the way for larger-scale sleep studies using Apple’s consumer-grade mobile device and publicly available sleep classification algorithms. Further study is needed to assess longer-term performance in natural conditions, and against polysomnography in clinical settings.

Keywords

Apple Watch, Sleep Study, Actigraphy

Introduction

Good sleep is vital for our health and wellbeing. Without it, our body and mind function poorly, with consequences that include risk of obesity [1], diabetes [2] and cardiovascular disease [3]. Polysomnography (PSG) is currently the gold standard method to monitor sleep. During PSG, subjects spend a night in a dedicated sleep lab, hooked up to a range of devices to measure physiological signals. However, information-rich PSG is expensive, laborious and intrusive. Because subjects are attached to many electrodes, the recording itself disturbs sleep, so PSG is better suited to detecting qualitative abnormalities such as sleep apnea or narcolepsy.

Widespread adoption of smartphones, smartwatches and fitness devices opens up new opportunities for monitoring sleep and physical activity. Many of these consumer devices come with a built-in accelerometer, gyroscope, magnetometer, and in some cases, a heart rate sensor. That is more sensors than wrist-worn actigraphs, which only use an accelerometer to measure movement to infer sleep and wake states. Nevertheless, actigraphs are regarded as useful tools in clinical practice when measuring sleep and wakefulness in normal living environments, especially in the context of assessing regularity and overall diurnal fluctuations [4–7].

The affordability of consumer-grade mobile devices has driven their popularity and the abundance of applications (“apps”) available—including apps for health and sleep. These devices have the potential to be used in sleep studies and to inform clinical diagnosis. However, their performance needs to be properly validated in comparison to accepted methodologies such as actigraphy and PSG. This is difficult because the sleep classification methods (e.g., algorithms and analysis techniques) used in consumer-grade devices are proprietary, so the relationship between the underlying physiological measurements and the sleep state reported by an app is unclear.

We chose to investigate the feasibility of the Apple Watch as a sleep monitoring device because the manufacturer allows software developers to access the device’s triaxial accelerometer data. To the best of our knowledge, this is the first study to compare the Apple Watch against an actigraph.

Methods

Evaluation framework

Figure 1 sets out the framework for validating the Apple Watch against an actigraph. First, raw acceleration measurements are collected from each device and transformed into activity counts for the actigraph, and the “Euclidean Norm Minus One” (ENMO) for the Apple Watch. We then explore statistical relationships between activity counts and ENMO. Next, we determine a threshold to classify each epoch of ENMO values as “sleep” or “wake” and evaluate the sleep outputs of the two devices using Pearson correlation and Bland-Altman plots.

Figure 1. Evaluation framework for wrist-worn accelerometer devices.

Participants

In total, 14 healthy participants (9 males, 5 females) were recruited by word of mouth and direct approach. Data were recorded from April to May 2018. The inclusion criterion was an age of at least 18 years. Exclusion criteria were a previously diagnosed sleep disorder, or any condition that would lead to difficulty or discomfort while wearing the devices. The Queensland University of Technology Human Research Ethics Committee (#1800000242) approved all procedures, and all participants gave their signed consent prior to participating in the study. Participants were asked to wear both the Apple Watch and the Actiwatch on their non-dominant wrist for two consecutive nights at home and to sleep as they normally would. All wearable devices were then returned for data extraction. One participant forgot to charge the Apple Watch, so we lost one night of data and were left with 27 nights from 14 participants.

This study was designed as a proof of concept for whether it is possible to use the Apple Watch for sleep monitoring. Power calculations are appropriate when the distribution of the underlying data is known (and ideally, normal). At that point in time, this was not applicable given that there was no prior study for us to confidently characterize the distribution of the differences between the measurements obtained using different platforms. Sample size was therefore determined pragmatically [8].

Wrist-worn devices

Table 1 shows specifications of the two wrist-worn devices used in this study: the Apple Watch Series 1 (Apple Inc., California, United States) and the Actiwatch Spectrum Pro (Philips, Bend OR). The Apple Watch has limited data storage, so data was downloaded to an iPhone via Bluetooth. We used the Core Motion Framework to develop an app to record triaxial accelerometer data at 50 Hz. The Actiwatch Spectrum Pro is a clinical-grade actigraph used for sleep and activity monitoring. It samples accelerometer data at 32 Hz and we set a 15-second epoch for processing this raw data. Processed data were downloaded using Philips’ Actiware Software (version 6.0.9). Outputs were activity counts and sleep/wake stage at each epoch.

Table 1. Device specifications.

Specification                    Apple Watch                             Actiwatch
Model                            Series 1                                Spectrum Pro
Price                            $AU 359-559                             $AU 3470
Battery life                     1 day                                   50 days
Sensor                           Accelerometer, Gyroscope, Heart rate    Accelerometer, Light sensor
Accelerometer sampling rate      50 Hz                                   32 Hz
Recording Time (Accelerometer)   3 days                                  50 days

Data processing and analysis

Accelerometer. Raw acceleration data from the Apple Watch was downloaded and processed using R statistical software. We calculated ENMO, the Euclidean Norm (magnitude) of the triaxial acceleration vector A = (Ax, Ay, Az) minus 1 gravitational unit. ENMO is used widely in physical activity and sleep monitoring [9–13] and defined as:

\[
\mathrm{ENMO}(A) = \sqrt{A_x^2 + A_y^2 + A_z^2} - 1
\]

We compared the mean ENMO of each 15-second epoch against activity counts from the Actiwatch. We ensured that the clocks of both devices were synchronized and compared recordings of vigorous movement applied simultaneously to both devices to check that the devices' timestamps were in sync.
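The epoch-level ENMO described above can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function names are hypothetical, samples are assumed to be evenly spaced tuples in units of g, and a trailing partial epoch is assumed to be dropped.

```python
import math

def enmo(ax, ay, az):
    """ENMO of one sample in g: Euclidean norm of (ax, ay, az) minus 1."""
    return math.sqrt(ax * ax + ay * ay + az * az) - 1.0

def mean_enmo_per_epoch(samples, rate_hz=50, epoch_s=15):
    """Average ENMO over consecutive fixed-length epochs.

    samples: list of (ax, ay, az) tuples in g, evenly sampled at rate_hz.
    Assumption: a trailing partial epoch is dropped.
    """
    per_epoch = rate_hz * epoch_s  # 750 samples at 50 Hz and 15 s
    means = []
    for start in range(0, len(samples) - per_epoch + 1, per_epoch):
        chunk = samples[start:start + per_epoch]
        means.append(sum(enmo(*s) for s in chunk) / per_epoch)
    return means
```

A device lying still reads about 1 g in total, so its ENMO is close to zero; movement pushes the value above zero.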

Sleep algorithm. Philips’ Actiware software computes the total activity counts at epoch e using a weighted sum:

\[
\mathrm{Total\_counts}(e) = 0.04 \sum_{i=-8}^{-5} a_{e+i} + 0.2 \sum_{i=-4}^{-1} a_{e+i} + 4 a_e + 0.04 \sum_{i=1}^{4} a_{e+i} + 0.2 \sum_{i=5}^{8} a_{e+i}
\]

where a_e is the activity count of epoch e. We used the shortest (15-second) epoch so that data could be converted to longer epochs if required. Each epoch was classified as sleep if its total activity counts were less than or equal to a threshold; epochs with counts above the threshold were classified as wake. Our study used a low and medium threshold (20 and 40, respectively) derived from Actiware 6.0.9 software.
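As a rough sketch of the weighted sum and threshold rule, the code below copies the weights from the formula as printed in this section and passes them as a parameter so that a different weighting can be substituted. The names `WEIGHTS`, `total_counts` and `classify` are hypothetical, and treating epochs outside the recording as zero is an assumption, since the text does not say how edges are handled.

```python
# Weights keyed by epoch offset i, copied from the formula as printed above.
WEIGHTS = {i: 0.04 for i in range(-8, -4)}
WEIGHTS.update({i: 0.2 for i in range(-4, 0)})
WEIGHTS[0] = 4.0
WEIGHTS.update({i: 0.04 for i in range(1, 5)})
WEIGHTS.update({i: 0.2 for i in range(5, 9)})

def total_counts(counts, e, weights=WEIGHTS):
    """Weighted sum of activity counts around epoch e.

    Assumption: offsets that fall outside the recording contribute zero.
    """
    total = 0.0
    for offset, weight in weights.items():
        j = e + offset
        if 0 <= j < len(counts):
            total += weight * counts[j]
    return total

def classify(counts, threshold=40, weights=WEIGHTS):
    """Label each epoch 'sleep' if its weighted total is <= threshold, else 'wake'."""
    return [
        "sleep" if total_counts(counts, e, weights) <= threshold else "wake"
        for e in range(len(counts))
    ]
```

With the medium threshold of 40, an isolated epoch needs a raw count above 10 to be scored as wake, because the current epoch is weighted by 4.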

Statistical analysis. The R program (version 3.3.2) was used for statistical analysis and visualization. To measure agreement between Apple Watch and Actiwatch, we used a range of statistical methods including:

  • i Calculating the Pearson Correlation between total activity counts and ENMO.

  • ii Using a receiver operating characteristic (ROC) analysis, taking the Actiwatch as ground truth.

  • iii Using Bland-Altman plots to measure the agreement between two measurements by quantifying the mean bias and constructing an agreement interval.

  • iv Computing the overall accuracy and performance of the devices’ sleep/wake classifications using a confusion matrix, again taking the Actiwatch as ground truth.

The confusion matrix of sleep-wake classifications has four outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN), with sleep positive, and wake negative. Using these outcomes, we calculated accuracy, sensitivity, specificity, and F1 score as defined in Table 2 [14].

Table 2. Classification performance statistics.

Measure                 Formula
Accuracy                (TP + TN) / (TP + TN + FN + FP)
Sensitivity or Recall   TP / (TP + FN)
Specificity             TN / (TN + FP)
Precision               TP / (TP + FP)
F1 Score                2 * Precision * Recall / (Precision + Recall)
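The formulas in Table 2 translate directly into code. The sketch below is illustrative only: the function names are hypothetical, labels are assumed to be the strings "sleep" and "wake" with sleep as the positive class, and every denominator is assumed to be non-zero.

```python
def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN with 'sleep' as positive and 'wake' as negative."""
    tp = tn = fp = fn = 0
    for t, p in zip(truth, predicted):
        if t == "sleep" and p == "sleep":
            tp += 1
        elif t == "wake" and p == "wake":
            tn += 1
        elif t == "wake" and p == "sleep":
            fp += 1
        else:  # truth is sleep, prediction is wake
            fn += 1
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    """Table 2 measures; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

Because sleep epochs greatly outnumber wake epochs overnight, accuracy and F1 are dominated by the sleep class, which is why specificity is worth reporting separately.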

We also calculated measures of sleep quality of interest in sleep studies: total sleep time (TST), wake after sleep onset (WASO), and number of awakenings. TST is the total duration of epochs classified as sleep; WASO is the total duration of wake epochs occurring after sleep onset. Number of awakenings is the number of wake events of at least 30 seconds duration [15].
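A minimal sketch of these three summaries from a sequence of epoch labels follows. Two points are assumptions rather than details given in the text: sleep onset is taken to be the first epoch labelled sleep, and wake epochs at the end of the recording are counted in WASO. The function name `sleep_summary` is hypothetical.

```python
def sleep_summary(labels, epoch_s=15, min_awakening_s=30):
    """Return TST and WASO in minutes plus the number of awakenings.

    labels: list of 'sleep' / 'wake' strings, one per epoch.
    Assumption: sleep onset is the first epoch labelled 'sleep'.
    """
    tst_min = labels.count("sleep") * epoch_s / 60
    if "sleep" not in labels:
        return {"tst_min": 0.0, "waso_min": 0.0, "awakenings": 0}

    after_onset = labels[labels.index("sleep"):]
    waso_min = after_onset.count("wake") * epoch_s / 60

    # Count runs of consecutive wake epochs lasting at least min_awakening_s.
    awakenings = 0
    run = 0
    for label in after_onset + ["sleep"]:  # sentinel closes a trailing run
        if label == "wake":
            run += 1
        else:
            if run * epoch_s >= min_awakening_s:
                awakenings += 1
            run = 0
    return {"tst_min": tst_min, "waso_min": waso_min, "awakenings": awakenings}
```

With 15-second epochs, a single isolated wake epoch contributes to WASO but is too short to count as an awakening.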

Results

Sleep-wake agreement

Figure 2 displays measurements over the first night of a randomly selected participant: the overall patterns of Actiwatch and Apple Watch are very similar, with clearly aligned periods of movement. For example, around 22:10, both activity counts and ENMO were quite active with high peaks, then both signals gradually declined. During a sleep period from 02:00–02:30, both features remained steady with no obvious movements. Raw measurements are available as Underlying data [16].

Figure 2. One night of measurements at 15 second epochs.

ENMO is shown in turquoise (left) and Activity Counts in pink (right).

Pearson correlation was computed to assess the relationship between activity counts and ENMO shown in Figure 3. Overall, there was a strong, positive correlation between activity counts and ENMO (r = 0.85, nights = 27, p<0.001). There were similar patterns across 15-second, 30-second, and 60-second epochs. We used 15-second epochs in the remainder of this analysis.

Figure 3. Apple Watch mean ENMO versus Actiwatch activity counts for all 15s-epochs over 27 nights of data.

The diagonal line is the standardized major axis fit [17] of activity counts and mean ENMO, constrained to pass through the origin.

Optimal wake threshold

Unlike the Actiwatch, there is no pre-established wake threshold for the Apple Watch ENMO data. To identify an optimal threshold, we isolated 11 nights without any missing data, and created all the possible training sets containing k nights, where k varies from 1 to 10. For instance, for k = 1 or k = 10 there are 11 such training sets, while for k = 5 or k = 6 there are 462 sets. In total, we considered 2046 distinct training sets. For each of these sets, we identified the threshold that maximized the F1 score on the nights not used for training. Figure 4 shows how this threshold converges to 0.0608 when comparing against the medium Actiwatch threshold. This ‘optimal’ threshold was used for the rest of our analysis. A similar pattern was observed for the low Actiwatch threshold, with the Apple Watch threshold converging to 0.0523. We also measured the impact of the threshold choice through a ROC curve for the medium Actiwatch setting (Figure 5).
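The counts quoted above follow from binomial coefficients, and the threshold search can be sketched as a scan over candidate values. This is an illustration under assumptions: the candidate grid, the function names, and the rule that an epoch is scored as sleep when its ENMO value is at or below the threshold (mirroring the Actiwatch rule) are not taken from the text.

```python
from math import comb

def count_training_sets(n_nights=11, k_min=1, k_max=10):
    """Number of distinct training sets containing k_min..k_max nights."""
    return sum(comb(n_nights, k) for k in range(k_min, k_max + 1))

def f1_for_threshold(enmo_values, truth, threshold):
    """F1 with sleep as the positive class.

    Assumption: an epoch is scored as sleep when its value is <= threshold.
    """
    tp = fp = fn = 0
    for value, label in zip(enmo_values, truth):
        predicted_sleep = value <= threshold
        if predicted_sleep and label == "sleep":
            tp += 1
        elif predicted_sleep:
            fp += 1
        elif label == "sleep":
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(nights, candidates):
    """Pick the candidate with the highest F1 over the pooled epochs.

    nights: list of (enmo_values, truth_labels) pairs, one per night.
    candidates: hypothetical grid of thresholds to try.
    """
    enmo_values = [v for values, _ in nights for v in values]
    truth = [t for _, labels in nights for t in labels]
    return max(candidates, key=lambda c: f1_for_threshold(enmo_values, truth, c))
```

Summing the binomial coefficients for k = 1 to 10 gives 2^11 - 2 = 2046, which matches the number of training sets reported.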

Figure 4. Box plot of the distributions of 'optimal' thresholds based on 1–10 nights of training data.

As the number of training nights increases, the distributions converge around a value of 0.0608.

Figure 5. ROC curve for a varying Apple Watch threshold, taking the Actiwatch at a 15-second-epoch medium threshold as truth.

Overall, there is good agreement between the Apple Watch and Actiwatch for both the medium and low threshold settings. Notably, the ability to detect sleep (sensitivity) is higher than 98% for both settings. However, the ability to detect wake (specificity) ranges from about 64% at the low threshold to 79% at the medium threshold. The overall accuracy and F1 score were consistent for both settings. The results are summarized in Table 3.

Table 3. Overall performance in accuracy, sensitivity, specificity and F1 in comparison with the Actiwatch in medium and low setting mode.

Variable      Medium threshold   Low threshold
Accuracy      97.11% ± 0.53%     93.53% ± 0.86%
Sensitivity   99.28% ± 0.21%     98.97% ± 0.35%
Specificity   78.94% ± 1.95%     63.50% ± 2.40%
F1 Score      98.35% ± 0.32%     96.16% ± 0.57%

Bland-Altman plots

Figure 6 plots the difference versus the mean of TST, WASO and number of awakenings for Apple Watch and Actiwatch. We used a 15-second epoch and medium threshold for our main analysis. For TST, most nights were within the limits of agreement and close to perfect agreement (the black line, Figure 6a); three nights of TST were in perfect agreement. Only one night fell outside the limits of agreement, with the Apple Watch overestimating TST by 30 minutes. The overall bias of TST was 6.31 minutes.

Figure 6. Bland-Altman plots showing the agreement between Apple Watch and Actiwatch at the medium threshold.

The red dashed lines represent the upper and lower agreement limits (95% confidence intervals). (a) Total sleep time (minutes). (b) Wake after sleep onset (minutes). (c) Number of awakenings.

For WASO, differences fall mostly within the limits of agreement (Figure 6b); two nights were in perfect agreement. One night fell outside the limits, with WASO underestimated by 30 minutes. This was the same night that fell outside the limits for TST. The overall bias in estimating WASO was -5.74 minutes.

For the total number of awakenings, the overall bias was -4.56. Only one night was in perfect agreement but all nights lie within the upper and lower agreement levels (Figure 6c).
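The bias and agreement limits behind a Bland-Altman plot can be computed as below. This is a sketch under assumptions: differences are taken as Apple Watch minus Actiwatch (so a positive bias means the Apple Watch reads higher, consistent with the signs reported here), and the limits are taken as the bias plus or minus 1.96 sample standard deviations, which is the conventional choice but is not spelled out in the text. The function name is hypothetical.

```python
from statistics import mean, stdev

def bland_altman(apple_watch, actiwatch):
    """Return (bias, lower_limit, upper_limit) for paired nightly values.

    Assumptions: difference = Apple Watch - Actiwatch; limits are
    bias +/- 1.96 * sample standard deviation of the differences.
    """
    diffs = [a - b for a, b in zip(apple_watch, actiwatch)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Each point in the plot is one night, placed at the mean of the two devices' values on the horizontal axis and their difference on the vertical axis.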

Discussion

This study compares measurements of sleep and wake obtained from Apple Watch, a consumer-grade device, and Philips Actiwatch Spectrum Pro, a clinically validated actigraphy device, in a set of healthy adults. We found that ENMO and activity counts were highly correlated. By using a combination of ten-night training data and ROC plot, we identified an optimal threshold for Apple Watch ENMO data in comparison to the Actiwatch for both medium and low settings (Table 3).

In the medium threshold setting, the sleep parameters of TST, WASO, and number of awakenings were comparable to those of the Actiwatch with no significant differences. The discrepancy between the two measurements appeared to be clinically acceptable, as the differences in TST and WASO did not exceed 30 minutes [6,18]. The Apple Watch performs best in comparison to the Actiwatch at medium threshold, consistent with Quante’s recommendation [19].

To the best of our knowledge, this is the first study to evaluate the Apple Watch, a popular consumer-grade device, against a clinical-grade actigraph device for sleep monitoring. We have compared the two devices at high resolution (i.e., 15-second epochs) and on low-resolution sleep parameters (i.e., TST, WASO, and number of awakenings). A similar study compared a consumer fitness tracking device (Fitbit charge HR) against Philips’ Actiwatch 2: the accuracy of sleep parameters was good [20]. However, Montgomery-Downs et al. suggested that both Fitbit and Actiwatch tended to have limited specificity [21]. Therefore, care must be taken in validation. Furthermore, both studies compared only low-resolution sleep parameters because of limited information about the type of feature used and the proprietary sleep algorithm [22]. Our study provides additional support for high-resolution data (i.e., ENMO), which could be further used in a sleep-wake algorithm validated against gold standard PSG.

We note that ENMO and activity counts are dominated by very small values (i.e., points near or at (0, 0)), as shown in Figure 3. This is a consequence of doing a study in which participants are moving very little for relatively long periods. While this is reasonable in assessing whether the Apple Watch produces comparable results to the Actiwatch for sleep monitoring, this study cannot draw conclusions about whether the Apple Watch is comparable to the Actiwatch across a broader range of activity levels (e.g., during exercise). The main benefit of assessing sleep using a well-known consumer-grade device lies in increasing the opportunity for longitudinal studies in a wider population. Over 30 million Apple Watches have been sold [23]. The availability of advanced sensor technology in smart watches (e.g., heart-rate sensors) opens up possibilities for improved sleep monitoring with consumer wearable devices.

We faced some practical challenges in implementing sleep monitoring on the Apple Watch, including constraints of power, memory management, data transfer, and sensor capabilities. In total, 16 nights had missing data from the Apple Watch. In each of these nights, data was missing in one contiguous block of 20–60 minutes in duration. We needed to impute data based on the average of previous and next available data. At this stage, the cause of the missing data is still to be determined.

Recording was limited to two consecutive nights per participant due to memory and data transfer constraints. With these limitations in mind, future studies that record more nights will need to monitor and handle data loss carefully. Lastly, our study assessed only one sleep-wake algorithm, based on the Cole-Kripke algorithm [17]; future studies could assess other sleep-wake algorithms that combine a weighted sum of activity with a wake threshold to classify sleep or wake stages [15].

Conclusion

Our study lays down a foundation for using accelerometer data of consumer-grade devices for sleep monitoring. Our experiments show that Apple Watch provides sleep measures comparable to the Philips Actiwatch, a clinical gold standard, with greatest similarity at the medium threshold of activity counts. These findings increase our confidence in using the consumer grade Apple Watch for sleep monitoring and open up possibilities for much larger-scale sleep studies. We also hope this work will serve as a basis for sleep clinicians in the use of data extracted from this device in their patients.

To further our research, we now intend to compare the Apple Watch against PSG and consider incorporating other types of physiological sensors from consumer-grade devices (e.g., heart-rate sensors). These findings add to a growing body of literature on the use of consumer-grade accelerometers for sleep studies and will assist other researchers in establishing validation parameters for the use of other types of consumer devices.

Data availability

QUT Research Data Finder: Sleep Data. https://doi.org/10.25912/5cc28f62e81ad [16].

Underlying data are contained within SleepDataset.zip. There are 27 csv files in this archive, with each file corresponding to a single night of data. The files have four columns: timestamp, Actiwatch activity counts, Actiware classification (1 for wake, 0 for sleep), and ENMO value calculated from the Apple Watch data. Each row corresponds to the data for a 15-second epoch. Dates have been modified to preserve privacy, but times are unchanged.
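A file with the four columns described above could be read along the following lines. This is only a sketch: it assumes the columns appear in the order listed and that there is no header row, neither of which is confirmed here, and the timestamp is kept as a plain string because its format is not specified. The function name `read_night` and the dictionary keys are hypothetical.

```python
import csv

def read_night(fileobj):
    """Parse one night of data from an open text file or file-like object.

    Assumptions: no header row; column order is timestamp, activity counts,
    Actiware classification (1 = wake, 0 = sleep), ENMO.
    """
    rows = []
    for record in csv.reader(fileobj):
        timestamp, counts, wake_flag, enmo = record[:4]
        rows.append({
            "timestamp": timestamp,  # kept as text; format not specified
            "counts": float(counts),
            "label": "wake" if int(wake_flag) == 1 else "sleep",
            "enmo": float(enmo),
        })
    return rows
```

If the files do carry a header row, the first record would need to be skipped before parsing.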

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Roomkham S, Hittle M, Cheung J et al. Sleep monitoring with the Apple Watch: comparison to a clinically validated actigraph [version 1; peer review: 2 approved with reservations, 1 not approved] F1000Research 2019, 8:754 (https://doi.org/10.12688/f1000research.19020.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 27 Aug 2019
Vincenzo Natale, Department of Psychology, University of Bologna, Bologna, Italy 
Approved with Reservations
The aim of this work was to compare the measurements of sleep and wake obtained from Apple Watch and Philips Actiwatch Spectrum. To this aim, 14 participants (9 males) were asked to wear on non-dominant wrist both equipments for two ... Continue reading
Natale V. Reviewer Report For: Sleep monitoring with the Apple Watch: comparison to a clinically validated actigraph [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2019, 8:754 (https://doi.org/10.5256/f1000research.20845.r49255)
Reviewer Report 19 Aug 2019
Lillian Skeiky, Sleep and Performance Research Center (SPRC), Elson S. Floyd College of Medicine, Washington State University (WSU), Spokane, WA, USA 
Devon A. Hansen, Sleep and Performance Research Center (SPRC), Elson S. Floyd College of Medicine, Washington State University (WSU), Spokane, WA, USA 
Not Approved
Over the last 10 years, a large number of commercially available wearable activity trackers have been offered to the public due to the growing request for individualized health monitoring. While these commercially available devices certainly have a number of advantages ... Continue reading
Skeiky L and Hansen DA. Reviewer Report For: Sleep monitoring with the Apple Watch: comparison to a clinically validated actigraph [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2019, 8:754 (https://doi.org/10.5256/f1000research.20845.r52113)
Reviewer Report 10 Jul 2019
Sean P. A. Drummond, Turner Institute for Brain and Behaviour, School of Psychological Sciences, Monash University, Clayton, Vic, Australia 
Approved with Reservations
In this this paper, the authors attempt to optimize an algorithm to convert accelerometer data from the Apple Watch to match that provided by a research grade actigraph, the Actiwatch Spectrum Pro. This is a proof of concept study, and ... Continue reading
Drummond SPA. Reviewer Report For: Sleep monitoring with the Apple Watch: comparison to a clinically validated actigraph [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2019, 8:754 (https://doi.org/10.5256/f1000research.20845.r50110)