Deep learning for the radiographic diagnosis of proximal femur fractures: Limitations and programming issues.

  • Guy Sylvain
  • Jacquet Christophe
  • Tsenkoff Damien
  • Argenson Jean-Noël
  • Ollivier Matthieu

  • Humans
  • Prospective Studies
  • Radiography
  • Traumatology
  • Femur
  • Artificial intelligence
  • Deep Learning
  • Proximal femur fracture


INTRODUCTION: Radiology is one of the domains where artificial intelligence (AI) yields encouraging results, with diagnostic accuracy approaching that of experienced radiologists and physicians. Diagnostic errors in traumatology are rare but can have serious functional consequences, so using AI as a radiological diagnostic aid may be beneficial in the emergency room. An effective, low-cost software tool that helps with radiographic diagnosis would therefore be relevant to current clinical practice, although this concept has rarely been evaluated in orthopedics for proximal femur fractures (PFF). This led us to conduct a prospective study with the goals of: 1) programming deep learning software to help make the diagnosis of PFF on radiographs and 2) evaluating its performance.

HYPOTHESIS: It is possible to program effective deep learning software to help make the diagnosis of PFF based on a limited number of radiographs.

METHODS: Our database consisted of 1309 radiographs: 963 showed a PFF, while 346 did not. The sample size was increased 8-fold (resulting in 10,472 radiographs) using a validated data augmentation technique. Each radiograph was annotated by an orthopedic surgeon using RectLabel™ software, differentiating between healthy and fractured zones. Fractures were classified according to the AO system. The deep learning algorithm was programmed with TensorFlow™ software (Google Brain, Santa Clara, CA, USA). In all, 9425 annotated radiographs (90%) were used for the training phase and 1047 (10%) for the test phase.

RESULTS: The sensitivity of the algorithm was 61% for femoral neck fractures and 67% for trochanteric fractures. The specificity was 67% and 69%, the positive predictive value was 55% and 56%, while the negative predictive value was 74% and 78%, respectively.

CONCLUSION: Our results are not good enough for our algorithm to be used in current clinical practice.
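The sensitivity, specificity, and predictive values reported in the Results all derive from a standard 2×2 confusion matrix. A minimal sketch of that computation (the counts below are illustrative, not the study's data):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)  # fraction of true fractures correctly flagged
    specificity = tn / (tn + fp)  # fraction of healthy hips correctly cleared
    ppv = tp / (tp + fp)          # probability a positive call is a true fracture
    npv = tn / (tn + fn)          # probability a negative call is truly healthy
    return sensitivity, specificity, ppv, npv

# Illustrative counts only (hypothetical, not taken from the study):
sens, spec, ppv, npv = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
```

Note that, unlike sensitivity and specificity, PPV and NPV depend on the fracture prevalence in the test set.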
Programming deep learning software with sufficient diagnostic accuracy would require several tens of thousands of radiographs, or the use of transfer learning.

LEVEL OF EVIDENCE: III; diagnostic study of nonconsecutive patients, without a consistently applied reference ("gold") standard.
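The transfer-learning route suggested in the conclusion can be sketched with Keras, the high-level API of the TensorFlow software used in the study: freeze a backbone pretrained on a large generic dataset and train only a small classification head on the radiographs. The backbone choice (ResNet50/ImageNet), input size, and hyperparameters below are illustrative assumptions, not the study's configuration.

```python
import tensorflow as tf

def build_transfer_model(num_classes: int = 2,
                         weights: str = "imagenet") -> tf.keras.Model:
    # Pretrained ResNet50 backbone (an assumed choice, not the study's model).
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False  # freeze pretrained features; only the head trains

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = tf.keras.applications.resnet50.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because the pretrained layers already encode generic image features, only the final dense layer's weights must be learned, which is why this approach can work with far fewer radiographs than training from scratch.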