Deep Learning on Hester Davis Scores for
Inpatient Fall Prediction
Abstract
Fall risk prediction among hospitalized patients is a critical aspect of patient safety in clinical settings, and accurate models can help prevent adverse events. The Hester Davis Score (HDS) is commonly used to assess fall risk, with current clinical practice relying on a threshold-based approach. In this method, a patient is classified as high-risk when their HDS exceeds a predefined threshold. However, this approach may fail to capture dynamic patterns in fall risk over time. In this study, we model the threshold-based approach and propose two machine learning approaches for enhanced fall prediction: One-step ahead fall prediction and sequence-to-point fall prediction. The one-step ahead model uses the HDS at the current timestamp to predict the risk at the next timestamp, while the sequence-to-point model leverages all preceding HDS values to predict fall risk using deep learning. We compare these approaches to assess their accuracy in fall risk prediction, demonstrating that deep learning can outperform the traditional threshold-based method by capturing temporal patterns and improving prediction reliability. These findings highlight the potential for data-driven approaches to enhance patient safety through more reliable fall prevention strategies.
Index Terms:
Fall risk, fall prediction, Hester Davis score, machine learning.
I Introduction
Fall risk assessment is a critical process in healthcare, aimed at identifying hospitalized patients who are at higher risk of falling during their stay [1]. This assessment typically involves evaluating a combination of factors such as age, mobility, mental status, use of certain medications, and previous fall history. Tools like the Hester Davis Score (HDS) are widely employed to quantify fall risk based on these factors, allowing healthcare providers to classify patients into low, moderate, or high-risk categories [2, 3]. By accurately identifying high-risk patients, hospitals can implement preventive measures, such as increasing monitoring, modifying the patient’s environment, or providing assistive devices to prevent falls. Effective fall risk assessment is key to improving patient safety, reducing fall-related injuries, and minimizing healthcare costs associated with prolonged hospital stays and related complications [1, 4].
The HDS is a widely used tool in clinical settings for fall risk evaluation, offering a structured and standardized scoring system that assesses key factors such as age, mental status, mobility, medication usage, continence, recent fall history, and behavioral tendencies. Each factor is assigned a weighted score based on its contribution to fall risk, producing a cumulative score that categorizes patients into risk levels. The HDS allows for near real-time reassessment, as it incorporates both static and dynamic characteristics of the patient. This facilitates timely interventions, such as bed alarms or increased supervision, to mitigate fall risks in hospitalized patients [5].
Despite the utility of the HDS and other threshold-based models, these approaches often fail to capture the evolving risk patterns over time. They rely on instantaneous values to trigger preventive measures, which may not reflect the subtle, progressive changes in a patient’s condition. To address this limitation, machine learning models can offer a more dynamic and data-driven approach to fall risk prediction by incorporating the sequential pattern in the data. Similar approaches have been proposed for other challenges in healthcare, such as early warning systems [6, 7], hypertension detection [8], and human activity recognition [9, 10].
Machine learning, and particularly deep learning, has demonstrated superior performance in a variety of healthcare applications, from COVID-19 lung prognosis detection using chest computed tomography (CT) scans [11], to cervical spine fracture detection [12], and in-hospital mortality prediction among diabetic intensive care unit (ICU) patients [13]. In fall prediction, these models can be used to analyze complex, non-linear interactions between clinical variables, offering enhanced predictive power.
In this paper, we model the traditional threshold-based fall risk assessment approach using HDS and propose two machine learning-based alternatives: One-step ahead fall prediction and sequence-to-point fall prediction using deep learning. The former uses the HDS at a given time to predict fall risk at the next timestamp, while the latter leverages all preceding samples in a time series to forecast fall events. Sequence-to-point prediction is particularly important, as it captures the entire sequence of events leading up to a fall, allowing the model to identify temporal patterns and trends that threshold-based methods might overlook. For example, a gradual increase in HDS values over time may signify rising fall risk, even if the individual scores do not exceed predefined thresholds. This approach enables more accurate and timely predictions, enhancing the ability to intervene before a fall occurs. In particular, recurrent neural networks (RNNs) [14, 15], long short-term memory (LSTM) [16] networks, and gated recurrent unit (GRU) [17] networks are proposed to learn from the temporal patterns in the HDSs. We compare the performance of these methods to evaluate their potential in improving the accuracy and timeliness of fall risk predictions in clinical settings.
The source code and data used in this project can be made available upon reasonable request and approval by the corresponding authorities; please contact the corresponding author.


II Fall Prediction Models
In this section, fall prediction using HDS is modeled in two schemes, as illustrated in Figure 1. Let $\mathbf{x}^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_T)$ represent the HDSs of an individual $i$ from admission to discharge, where $t$ denotes a prediction timestamp, $T$ is the total number of retrospective samples, and $i \in \{1, \ldots, N\}$, where $N$ is the total number of individuals. The HDS of each patient is recalculated at fixed intervals of several hours during the stay. In the one-step ahead fall prediction scheme, in order to make a prediction for an individual at the future timestamp $t+1$, only the last HDS $x^{(i)}_t$ is used. In the sequence-to-point fall prediction scheme, the entire sequence of HDS samples since admission, $(x^{(i)}_1, \ldots, x^{(i)}_t)$, is used to make a prediction for the individual at the future timestamp $t+1$.
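As a concrete illustration of the two input formats, the following sketch (in Python, with purely hypothetical HDS values and helper names) shapes one patient's score sequence into the single-sample pairs used by the one-step ahead models and into the full admission-to-discharge sequence used by the sequence-to-point models; the labeling convention shown is an illustrative assumption, not the study's exact annotation procedure.

```python
import numpy as np

def one_step_pairs(hds, fall_at_end):
    """Pair each score x_t with a label for timestamp t+1 (one-step ahead scheme).

    `hds` is one patient's HDS sequence and `fall_at_end` marks whether the stay
    ended with a documented fall; both names are illustrative placeholders.
    """
    x = np.asarray(hds[:-1], dtype=np.float32).reshape(-1, 1)  # x_1, ..., x_{T-1}
    y = np.zeros(len(x), dtype=np.int64)                       # outcome at t+1
    y[-1] = int(fall_at_end)                                   # only the last transition is labeled here
    return x, y

def full_sequence(hds):
    """Return the admission-to-discharge sequence used by sequence-to-point models."""
    return np.asarray(hds, dtype=np.float32).reshape(-1, 1)    # shape (T, 1)

scores = [6, 8, 9, 12, 15]                                     # hypothetical HDS trajectory
x_pairs, y_pairs = one_step_pairs(scores, fall_at_end=True)
print(x_pairs.shape, y_pairs, full_sequence(scores).shape)
```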
II-A One-Step Ahead Fall Prediction
The current clinical practice approach involves comparing the HDS to a predefined threshold. In this section, we mathematically model this approach and then propose machine learning models to learn from the HDS at the current timestamp $t$ in order to predict the outcome at the subsequent timestamp $t+1$.
II-A1 Threshold-based Method
Most clinical providers use an absolute threshold to determine whether a patient is at a high risk of fall and needs extra care and increased monitoring. In this approach, if at any timestamp $t$ the HDS value $x^{(i)}_t$ for patient $i$ exceeds the threshold $\gamma$, the patient is classified as high-risk for falls, defined as

$$\hat{y}^{(i)}_{t+1} = \begin{cases} 1, & \text{if } x^{(i)}_t > \gamma \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $\hat{y}^{(i)}_{t+1} = 1$ means the patient is at high risk of fall at the future timestamp $t+1$ and $\hat{y}^{(i)}_{t+1} = 0$ means otherwise.
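A minimal sketch of this rule, assuming the score sequence is available as a plain array and using an arbitrary cut-off value for illustration:

```python
import numpy as np

def threshold_predict(hds_sequence, gamma):
    """Eq. (1): flag high fall risk at t+1 whenever the current HDS exceeds gamma."""
    return (np.asarray(hds_sequence) > gamma).astype(int)

# Hypothetical trajectory; a cut-off of 20 is one of the operating points compared later.
print(threshold_predict([6, 12, 21, 18], gamma=20))  # -> [0 0 1 0]
```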
II-A2 Machine Learning Methods
It is possible to build a binary fall prediction model for the fall outcome at a future timestamp $t+1$ based solely on the currently available HDS sample $x^{(i)}_t$. The task is to predict a binary label $y^{(i)}_{t+1}$ at time $t+1$, using only the value of $x^{(i)}_t$, the sample immediately preceding $t+1$. For each sample $x^{(i)}_t$ in the retrospective dataset, we pair it with a corresponding label $y^{(i)}_{t+1}$, which represents the outcome at the next time step. The prediction model is built using a binary classifier $f(\cdot)$, which maps each $x^{(i)}_t$ to a binary outcome as

$$\hat{y}^{(i)}_{t+1} = f\big(x^{(i)}_t\big), \qquad (2)$$

where the training set consists of pairs $\{(x^{(i)}_t, y^{(i)}_{t+1})\}$ and each $x^{(i)}_t$ serves as a feature to predict the binary outcome $y^{(i)}_{t+1}$. The classifier is trained to minimize the prediction error by adjusting its parameters to best capture the relationship between the single time series sample and the next time step's binary fall label. Various machine learning models such as k-nearest neighbors (KNN) [18], support vector machine (SVM) [19], random forest (RF) [20], and extreme gradient boosting (XGB) [21] are evaluated for this aim in the experiments section.
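The sketch below, using scikit-learn on toy $(x_t, y_{t+1})$ pairs, shows how such single-sample classifiers could be fitted and queried; the data and hyperparameters are illustrative assumptions rather than the study's actual configuration, and XGB is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical training pairs: each row is one HDS value x_t, each label is the
# fall outcome recorded at the following timestamp t+1.
X_train = np.array([[6.0], [9.0], [14.0], [21.0], [8.0], [17.0]])
y_train = np.array([0, 0, 0, 1, 0, 1])

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                     # learn f: x_t -> y_{t+1}, as in Eq. (2)
    print(name, clf.predict(np.array([[19.0]])))  # one-step ahead prediction
```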
II-B Sequence-to-Point Fall Prediction
Sequence-to-point fall prediction, which utilizes all preceding samples in a time series to predict a fall event, holds significant importance in clinical settings. Unlike traditional threshold-based methods that rely on single, instantaneous values, sequence-to-point prediction leverages the entire sequence of data leading up to the moment for which the fall event is predicted. This approach captures temporal patterns and trends that may be missed when using isolated samples.
II-B1 Recurrent Neural Networks
RNNs are a popular method for modeling sequential dependencies within a time series, making them suitable for tasks where the prediction at the final time step depends on the prior inputs. These networks leverage their hidden state to capture the temporal dynamics of sequential inputs, enabling the model to predict the risk of a fall at the final time step.
In order to model the sequence-to-point binary classification task for inpatient fall risk prediction using RNNs [15], let $x_t$ represent the HDS at time $t$ for an individual, dropping the index $i$ without loss of generality. The RNN processes the time series up to time step $t$ to predict whether a fall occurs at time $t+1$. The hidden state at each time step is computed as

$$h_t = \phi\big(W_{xh} x_t + W_{hh} h_{t-1} + b_h\big), \qquad (3)$$

where $\phi(\cdot)$ is the activation function, $h_{t-1}$ is the hidden state at time step $t-1$, $W_{xh}$ is the input weight matrix, $W_{hh}$ is the recurrent weight matrix, and $b_h$ is the bias vector. The hidden state $h_t$ at time step $t$ captures the information of the HDSs up to that point.

At time step $t$, the hidden state $h_t$ is used to predict the occurrence of a fall at time step $t+1$. The hidden state is passed through a fully connected layer followed by a Softmax activation to produce the output logits as

$$\mathbf{z} = W_{hy} h_t + b_y, \qquad (4)$$

where $W_{hy}$ is the output weight matrix and $b_y$ is the bias term. The Softmax activation function is applied to the logits to obtain the probability distribution over the two classes (fall or no fall) as

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{2} e^{z_j}}, \quad k \in \{1, 2\}, \qquad (5)$$

where $z_k$ is the logit corresponding to class $k$ (fall or no fall), and the predicted outcome is

$$\hat{y}_{t+1} = \arg\max_{k} \, \hat{p}_k. \qquad (6)$$

For simplicity, denote by $\hat{y}^{(i)}$ the predicted probability of the fall outcome class for individual $i$. The network is trained using backpropagation and the cross-entropy loss function

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y^{(i)}\log \hat{y}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{y}^{(i)}\big)\Big], \qquad (7)$$

where $y^{(i)}$ is the true label and $\hat{y}^{(i)}$ is the predicted probability of the fall outcome class of individual $i$.
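A compact PyTorch sketch of this sequence-to-point formulation is given below; the architecture sizes, learning rate, and toy trajectories are assumptions for illustration, and the softmax of Eq. (5) is folded into PyTorch's cross-entropy loss rather than applied explicitly.

```python
import torch
import torch.nn as nn

class RNNFallClassifier(nn.Module):
    """Read HDS samples x_1..x_t and emit logits for fall vs. no fall at t+1."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)      # Eq. (4): logits from the last hidden state

    def forward(self, x):                        # x: (batch, seq_len, 1)
        _, h_last = self.rnn(x)                  # Eq. (3) applied across the sequence
        return self.fc(h_last.squeeze(0))        # logits over {no fall, fall}

# Toy batch of two hypothetical HDS trajectories and their next-step outcomes.
x = torch.tensor([[[6.0], [8.0], [12.0], [15.0]],
                  [[5.0], [5.0], [6.0], [7.0]]])
y = torch.tensor([1, 0])

model = RNNFallClassifier()
criterion = nn.CrossEntropyLoss()                # cross-entropy loss of Eq. (7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5):                               # a few illustrative gradient steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)                # softmax of Eq. (5) is folded into the loss
    loss.backward()
    optimizer.step()
print(loss.item())
```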
II-B2 Long Short-Term Memory Networks
To address the issue of vanishing gradients commonly faced by standard RNNs, we implemented an LSTM network, which introduces gates to control information flow and maintain long-range dependencies across time steps [15]. It maintains an internal memory state $c_t$ along with the hidden state $h_t$. At each time step $t$, the LSTM computes the input gate as

$$i_t = \sigma\big(W_{xi} x_t + W_{hi} h_{t-1} + b_i\big), \qquad (8)$$

where $W_{xi}$ is the weight matrix from the input layer to the input gate, $W_{hi}$ is the weight matrix from the hidden state to the input gate, and $b_i$ is the bias of the input gate. The forget gate $f_t$ is defined as

$$f_t = \sigma\big(W_{xf} x_t + W_{hf} h_{t-1} + b_f\big), \qquad (9)$$

where $W_{xf}$ is the weight matrix from the input layer to the forget gate, $W_{hf}$ is the weight matrix from the hidden state to the forget gate, and $b_f$ is the bias of the forget gate. The cell gate $g_t$ is computed as

$$g_t = \tanh\big(W_{xg} x_t + W_{hg} h_{t-1} + b_g\big), \qquad (10)$$

where $W_{xg}$ is the weight matrix from the input layer to the cell gate, $W_{hg}$ is the weight matrix from the hidden state to the cell gate, and $b_g$ is the bias of the cell gate. The output gate $o_t$ is

$$o_t = \sigma\big(W_{xo} x_t + W_{ho} h_{t-1} + b_o\big), \qquad (11)$$

where $W_{xo}$ is the weight matrix from the input layer to the output gate, $W_{ho}$ is the weight matrix from the hidden state to the output gate, and $b_o$ is the bias of the output gate. The memory state and hidden state are then updated as

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad (12)$$

$$h_t = o_t \odot \tanh(c_t), \qquad (13)$$

where $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication.
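For readers who want to see the gating arithmetic of Eqs. (8)-(13) directly, the following sketch writes out a single LSTM step in PyTorch with illustrative weight shapes; in practice a library implementation such as torch.nn.LSTM would typically be used instead of this hand-rolled cell.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step written out gate by gate, mirroring Eqs. (8)-(13)."""
    i_t = torch.sigmoid(x_t @ p["xi"] + h_prev @ p["hi"] + p["bi"])   # input gate, Eq. (8)
    f_t = torch.sigmoid(x_t @ p["xf"] + h_prev @ p["hf"] + p["bf"])   # forget gate, Eq. (9)
    g_t = torch.tanh(x_t @ p["xg"] + h_prev @ p["hg"] + p["bg"])      # cell gate, Eq. (10)
    o_t = torch.sigmoid(x_t @ p["xo"] + h_prev @ p["ho"] + p["bo"])   # output gate, Eq. (11)
    c_t = f_t * c_prev + i_t * g_t                                    # memory update, Eq. (12)
    h_t = o_t * torch.tanh(c_t)                                       # hidden state, Eq. (13)
    return h_t, c_t

H = 4  # illustrative hidden size; the HDS input is a single scalar per time step
p = {k: 0.1 * torch.randn(1 if k[0] == "x" else H, H)
     for k in ["xi", "hi", "xf", "hf", "xg", "hg", "xo", "ho"]}
p.update({k: torch.zeros(H) for k in ["bi", "bf", "bg", "bo"]})

h, c = lstm_step(torch.tensor([[12.0]]), torch.zeros(1, H), torch.zeros(1, H), p)
print(h.shape, c.shape)  # both (1, H)
```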
II-B3 Gated Recurrent Unit
The GRU is a simplified variant of the LSTM that reduces the number of gates while retaining the ability to manage long-range dependencies [15]. GRUs simplify the gating mechanism by combining the forget and input gates into a single update gate. At each time step $t$, the GRU computes an update gate as

$$z_t = \sigma\big(W_{xz} x_t + W_{hz} h_{t-1} + b_z\big), \qquad (14)$$

the reset gate as

$$r_t = \sigma\big(W_{xr} x_t + W_{hr} h_{t-1} + b_r\big), \qquad (15)$$

and the candidate hidden state as

$$\tilde{h}_t = \tanh\big(W_{xh} x_t + W_{hh}\,(r_t \odot h_{t-1}) + b_h\big). \qquad (16)$$

The hidden state is then updated as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. \qquad (17)$$

At time step $t$, the hidden state $h_t$ is used to predict the fall event at timestamp $t+1$, similar to Eqs. (4) and (7) in training the RNN.
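In code, swapping the GRU into the same sequence-to-point head used for the RNN sketch above is a small change; the snippet below (with hypothetical sizes again) relies on torch.nn.GRU, which applies the update and reset gating of Eqs. (14)-(17) internally.

```python
import torch
import torch.nn as nn

class GRUFallClassifier(nn.Module):
    """Same sequence-to-point head as the earlier RNN sketch, with the recurrence replaced by a GRU."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, x):             # x: (batch, seq_len, 1) of HDS values
        _, h_last = self.gru(x)       # Eqs. (14)-(17) are computed inside nn.GRU
        return self.fc(h_last.squeeze(0))

logits = GRUFallClassifier()(torch.rand(2, 5, 1))  # toy batch of two 5-step trajectories
print(logits.shape)                                # (2, 2): no-fall / fall logits
```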
III Experiments
III-A Data
Our Institutional Review Board approved the study protocol. The dataset consisted of hospitalized patients, including who experienced a fall (median age ; male) and who did not (median age ; male). Retrospective data was collected from consecutive patients admitted between January 1, 2018, and May 23, 2023, for various medical and surgical conditions across 4 academic and 13 community hospitals in Arizona, Florida, Minnesota, and Wisconsin in the United States. Patients were identified using electronic medical records. Adults aged 18 years and older who had been hospitalized for at least one day were included in the study, while those admitted to critical care units, hospice, or psychiatric units were excluded.
Table I: One-step ahead fall prediction performance (Avg. ± Std.)

| Model | Accuracy | F1 Score | Specificity | Sensitivity | PPV | AUC |
|---|---|---|---|---|---|---|
| HDS 7 | 0.57±0.01 | 0.57±0.01 | 0.52±0.01 | 0.62±0.01 | 0.54±0.01 | 0.57±0.01 |
| HDS 20 | 0.60±0.01 | 0.56±0.00 | 0.92±0.00 | 0.29±0.01 | 0.65±0.01 | 0.60±0.01 |
| KNN | 0.52±0.00 | 0.39±0.01 | 0.99±0.01 | 0.05±0.01 | 0.05±0.01 | 0.54±0.01 |
| SVM | 0.63±0.01 | 0.62±0.01 | 0.82±0.01 | 0.44±0.03 | 0.57±0.01 | 0.66±0.01 |
| RF | 0.63±0.01 | 0.62±0.01 | 0.80±0.01 | 0.46±0.01 | 0.62±0.01 | 0.70±0.01 |
| XGB | 0.63±0.01 | 0.62±0.01 | 0.81±0.01 | 0.46±0.01 | 0.62±0.01 | 0.70±0.01 |
Table II: Sequence-to-point fall prediction performance (Avg. ± Std.)

| Model | Accuracy | F1 Score | Specificity | Sensitivity | PPV | AUC |
|---|---|---|---|---|---|---|
| RNN | 0.69±0.13 | 0.64±0.18 | 0.69±0.24 | 0.69±0.37 | 0.69±0.19 | 0.66±0.12 |
| LSTM | 0.70±0.12 | 0.66±0.18 | 0.64±0.22 | 0.44±0.27 | 0.76±0.12 | 0.70±0.10 |
| GRU | 0.74±0.20 | 0.67±0.28 | 0.94±0.07 | 0.53±0.44 | 0.53±0.18 | 0.77±0.09 |
III-B Evaluation Setup
A 10-fold cross-validation was performed, with the average (Avg.) and standard deviation (Std.) of each performance metric recorded. In each independent run, the models were trained from scratch on a randomly selected training dataset and evaluated on a randomly selected balanced test dataset. For each cross-validation fold, a balanced test set was created by randomly selecting 10% of the data from the fall event class and 10% from the no fall event class. This left the remaining dataset imbalanced. To address this, a balanced training dataset was constructed for each fold by including all remaining encounters from the fall event class and randomly selecting an equal number of encounters from the no fall event class. The combined samples were shuffled prior to each training iteration.
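The following sketch shows one way the per-fold balanced split described above could be constructed from lists of encounter identifiers; the sampling-size convention (taking a fall-class-sized 10% slice from each class so that the test set stays balanced) is our interpretation of the setup, and all names and counts are placeholders.

```python
import numpy as np

def balanced_fold(fall_ids, no_fall_ids, test_frac=0.10, seed=0):
    """Build one fold: balanced test set, then balanced training set from the remainder."""
    rng = np.random.default_rng(seed)
    fall_ids = rng.permutation(fall_ids)
    no_fall_ids = rng.permutation(no_fall_ids)

    n_test = int(test_frac * len(fall_ids))          # 10% sized by the (smaller) fall class
    test = np.concatenate([fall_ids[:n_test], no_fall_ids[:n_test]])

    train_fall = fall_ids[n_test:]                                 # all remaining fall encounters
    train_no_fall = no_fall_ids[n_test:n_test + len(train_fall)]   # equal-sized random sample
    train = rng.permutation(np.concatenate([train_fall, train_no_fall]))
    return train, test

train_ids, test_ids = balanced_fold(np.arange(100), np.arange(100, 1100))
print(len(train_ids), len(test_ids))  # 180 balanced training, 20 balanced test encounters
```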
The machine learning models were evaluated using several metrics. Accuracy is defined as

$$\text{Accuracy} = \frac{TP + TN}{P + N}, \qquad (18)$$

where $TP$ is the number of true positives, $TN$ is the number of true negatives, $P$ is the number of true fall encounters, and $N$ is the number of true encounters without a fall. With a balanced test dataset, accuracy equals balanced accuracy. The F1 Score is given by

$$\text{F1} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad (19)$$

where $FP$ represents false positives (encounters incorrectly predicted as fall events) and $FN$ denotes false negatives (encounters incorrectly predicted as no fall). Specificity, or true negative rate, is defined as

$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad (20)$$

and sensitivity, or true positive rate, is calculated as

$$\text{Sensitivity} = \frac{TP}{TP + FN}. \qquad (21)$$

The Positive Predictive Value (PPV) is defined as the proportion of $TP$s out of the total number of positive results, calculated as

$$\text{PPV} = \frac{TP}{TP + FP}. \qquad (22)$$
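These five metrics reduce to simple functions of the confusion-matrix counts; a small helper, with illustrative counts chosen freely rather than taken from the study, might look like:

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (18)-(22) from confusion-matrix counts."""
    p, n = tp + fn, tn + fp                      # true fall / true no-fall encounter counts
    return {
        "accuracy": (tp + tn) / (p + n),         # Eq. (18)
        "f1": 2 * tp / (2 * tp + fp + fn),       # Eq. (19)
        "specificity": tn / (tn + fp),           # Eq. (20)
        "sensitivity": tp / (tp + fn),           # Eq. (21)
        "ppv": tp / (tp + fp),                   # Eq. (22)
    }

print(classification_metrics(tp=46, tn=80, fp=20, fn=54))  # illustrative counts only
```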
III-C Training Setup
Hyperparameter tuning was conducted via random search, using a held-out portion of the training data as the validation dataset, kept separate from the test dataset. All models were implemented in Python and PyTorch [22] and trained on two NVIDIA A6000 GPUs.
The SVM [23] model was built with a radial basis function (RBF) kernel, with the regularization parameter selected by grid search. The KNN model was evaluated for various numbers of nearest neighbors, and the 1-nearest-neighbor configuration was selected. The XGB model was trained with tuned numbers of estimators and parallel trees and a tuned regularization coefficient. The number of trees in the RF was selected by grid search, and the maximum depth of the trees was limited to prevent overfitting.
Hyperparameter tuning for the RNNs, LSTMs, and GRUs involved selecting optimal values for each parameter to maximize model performance and efficiency. The number of hidden units was searched with a single recurrent layer. An exponentially decaying learning rate with the Adam optimizer was used, and early stopping was applied during training. Dropout [24] was applied for regularization. In the LSTMs, initializing the forget gate bias appropriately helped the model retain long-term dependencies, and in the GRUs, adjusting the update gate bias similarly enhanced performance. The rectified linear unit (ReLU) [25] activation function was used due to its non-linearity and faster convergence.
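The optimization choices above (Adam with an exponentially decaying learning rate, dropout, and early stopping on a validation set) can be wired together as in the sketch below; the specific rates, decay factor, patience, and placeholder model are assumptions, since the tuned values are not reproduced here.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the recurrent classifiers; the dropout rate is illustrative.
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                  # assumed starting rate
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # exponential decay

best_val, patience, bad_epochs = float("inf"), 1, 0
for epoch in range(100):
    # ... one pass over the balanced training set would go here ...
    val_loss = float(torch.rand(1))        # stand-in for the real validation loss
    scheduler.step()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs > patience:          # early stopping once validation stops improving
            break
```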
III-D Performance Results Analysis
Table I presents the performance results for various models in one-step ahead fall prediction, with metrics normalized to a scale of 1 and averaged over 10-fold cross-validation. The models evaluated include two threshold-based methods, HDS 7 and HDS 20, and several machine learning algorithms including KNN, SVM, RF, and XGB. The HDS 7 threshold-based method demonstrates moderate and consistent performance across all metrics, while HDS 20 achieves high specificity at 0.92 but low sensitivity at 0.29, indicating its strength in identifying non-fall events at the expense of detecting actual falls. KNN exhibits the highest specificity at 0.99 but suffers from extremely low sensitivity at 0.05, highlighting its poor performance in fall detection. In contrast, the machine learning models SVM, RF, and XGB show comparable performance, with accuracy around 0.63, balanced F1 scores, and area under the curve (AUC) values ranging from 0.66 to 0.70. These results indicate that these models are more effective in balancing sensitivity and specificity compared to the threshold-based methods, with RF and XGB providing the best overall discriminative power, as evidenced by their higher AUC scores at 0.70.
Table II presents the performance results for various models in sequence-to-point fall prediction. Among the LSTM, GRU, and RNN models, distinct differences in effectiveness are observed. The GRU model achieves the highest accuracy at 0.74, demonstrating its strong capability to classify instances correctly. It also shows a commendable F1 score of 0.67, reflecting a balance between precision and recall, alongside impressive specificity at 0.94 and moderate sensitivity at 0.53. Conversely, the RNN model, with a slightly lower accuracy of 0.69, demonstrates higher sensitivity at 0.69, suggesting better performance in identifying positive instances. The LSTM model exhibits competitive performance with an accuracy of 0.70 and a favorable F1 score of 0.66, though its specificity and sensitivity indicate a trade-off in accurately identifying true negatives and positives.
The superior performance of the GRU compared to the RNN and LSTM can be attributed to its streamlined architecture, which employs fewer parameters while effectively capturing long-range dependencies. By combining the forget and input gates into a single update gate, the GRU simplifies the model, enhancing its learning efficiency. This design helps mitigate the vanishing gradient problem that often plagues traditional RNNs, allowing the GRU to converge faster during training. Additionally, the GRU typically requires less computational resources, making it an attractive option in scenarios where model efficiency is critical.

Figure 2 displays the receiver operating characteristic (ROC) curves of the models. Overall, the GRU model stands out for its accuracy and specificity, making it a preferable choice for applications prioritizing precision. However, if maximizing the identification of positive cases is the primary goal, the RNN model may be more suitable due to its higher sensitivity. Therefore, the selection of the model should be guided by the specific requirements of the application, whether focusing on maximizing correct classifications or optimizing for sensitivity.
IV Conclusion
In conclusion, effective fall risk assessment is crucial for enhancing patient safety in healthcare settings, particularly for hospitalized individuals. Traditional methods, such as the threshold-based HDS, provide a structured approach to evaluating fall risk but often fall short in capturing the dynamic nature of patient conditions. This study highlights the limitations of threshold-based models, which may overlook subtle changes in risk factors over time. In contrast, machine learning approaches, including one-step ahead and sequence-to-point fall prediction methods, offer a more sophisticated framework for predicting fall risk by analyzing temporal patterns and interactions among clinical variables. The comparative analysis demonstrates that machine learning models, particularly the GRU, outperformed traditional methods and provided a more balanced sensitivity and specificity. By utilizing advanced algorithms, healthcare providers can achieve more accurate predictions, leading to timely interventions that can significantly reduce the incidence of falls and associated complications. Future research should continue to explore and refine these machine learning techniques to further enhance fall risk assessment strategies in clinical practice.
References
- [1] Karen L Perell, Audrey Nelson, Ronald L Goldman, Stephen L Luther, Nicole Prieto-Lewis, and Laurence Z Rubenstein. Fall risk assessment measures: an analytic review. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 56(12):M761–M766, 2001.
- [2] Gideon Moseti Nyakundi. Use of the Hester Davis Falls Risk Assessment Scale in Medical-Surgical Patients. PhD thesis, Walden University, 2022.
- [3] Amelia Payne. Impact of the Hester Davis Fall Risk Scale on Inpatient Falls. University of Missouri-Saint Louis, 2020.
- [4] Veronica Strini, Roberta Schiavolin, and Angela Prendin. Fall risk assessment scales: A systematic literature review. Nursing Reports, 11(2):430–443, 2021.
- [5] Amy L Hester and Dees M Davis. Validation of the hester davis scale for fall risk assessment in a neurosciences population. Journal of Neuroscience Nursing, 45(5):298–305, 2013.
- [6] Hojjat Salehinejad, Anne M. Meehan, Parvez A. Rahman, Marcia A. Core, Bijan J. Borah, and Pedro J. Caraballo. Novel machine learning model to improve performance of an early warning system in hospitalized patients: a retrospective multisite cross-validation study. eClinicalMedicine, 66:102312, 2023.
- [7] Hojjat Salehinejad, Anne M. Meehan, Pedro J. Caraballo, and Bijan J. Borah. Contrastive transfer learning for prediction of adverse events in hospitalized patients. IEEE Journal of Translational Engineering in Health and Medicine, 12:215–224, 2024.
- [8] Navid Hasanzadeh, Shahrokh Valaee, and Hojjat Salehinejad. Hypertension detection from high-dimensional representation of photoplethysmogram signals. In 2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), pages 1–4, 2023.
- [9] Hojjat Salehinejad and Shahrokh Valaee. Litehar: Lightweight human activity recognition from wifi signals with random convolution kernels. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4068–4072, 2022.
- [10] Hojjat Salehinejad, Radomir Djogo, Navid Hasanzadeh, and Shahrokh Valaee. Smctl: Subcarrier masking contrastive transfer learning for human gesture recognition with passive wi-fi sensing. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–5, 2024.
- [11] Edward H Lee, Jimmy Zheng, Errol Colak, Maryam Mohammadzadeh, Golnaz Houshmand, Nicholas Bevins, Felipe Kitamura, Emre Altinmakas, Eduardo Pontes Reis, Jae-Kwang Kim, et al. Deep covid detect: an international experience on covid-19 lung detection and prognosis using chest ct. NPJ digital medicine, 4(1):11, 2021.
- [12] Hojjat Salehinejad, Edward Ho, Hui-Ming Lin, Priscila Crivellaro, Oleksandra Samorodova, Monica Tafur Arciniegas, Zamir Merali, Suradech Suthiphosuwan, Aditya Bharatha, Kristen Yeom, et al. Deep sequential learning for cervical spine fracture detection on computed tomography imaging. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1911–1914. IEEE, 2021.
- [13] Julian Theis, William L Galanter, Andrew D Boyd, and Houshang Darabi. Improving the in-hospital mortality prediction of diabetes icu patients using a process mining/deep learning architecture. IEEE Journal of Biomedical and Health Informatics, 26(1):388–399, 2021.
- [14] Simon Haykin. Recurrent neural networks for. Digital Signal Processing Systems: Implementation Techniques: Advances in Theory and Applications, page 89, 1995.
- [15] Hojjat Salehinejad, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.
- [16] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Interspeech, volume 2012, pages 194–197, 2012.
- [17] Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pages 1597–1600. IEEE, 2017.
- [18] Jorma Laaksonen and Erkki Oja. Classification with learning k-nearest neighbors. In Proceedings of international conference on neural networks (ICNN’96), volume 3, pages 1480–1483. IEEE, 1996.
- [19] Alex J Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and computing, 14(3):199–222, 2004.
- [20] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
- [21] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- [23] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
- [24] A. Labach, H. Salehinejad, and S. Valaee. Survey of dropout methods for deep neural networks. arXiv preprint arXiv:1904.13310, 2019.
- [25] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Shun Chen. Lstm fully convolutional networks for time series classification. IEEE access, 6:1662–1669, 2017.