Bits and Bugs: A Scientific and Historical Review of Software Failures in Computational Science

Thomas Huckle and Tobias Neckel
Publisher: SIAM
Publication Date: 2019
Number of Pages: 251
Format: Paperback
Price: 44.00
ISBN: 978-1-611975-55-0
Category: Monograph
BLL Rating: 

The Basic Library List Committee suggests that undergraduate mathematics libraries consider this book for acquisition.

[Reviewed by Bill Satzer, on 06/28/2019]
This book takes a serious look at the bugs that arise in software for scientific computation. It is not a treatise on how to improve software design, testing, or quality assurance. Instead, the authors focus on specific examples of software failures, some of which are catastrophic. Each failure is traced to a bug, defined as an existing fault in a software system that leads to an erroneous state of the system and results in an observable failure.
 
The authors see their book as useful in two primary aspects. One is to provide illustrative examples of numerical principles such as computer representation of numbers, rounding errors, condition numbers, and algorithmic complexity. The other is to introduce elements of numerical methods to non-experts.
 
The book is divided into chapters, each of which focuses on a certain class of bug. The chapters are largely independent of each other. What mainly distinguishes this book from similar material in the literature is that all the examples concern real systems for which a good deal of information is available about what happened and why. Furthermore, the authors have examined each situation very thoroughly and carefully extracted the critical elements that led to failure. Each example is accompanied by its own extensive bibliography.
 
The categories of bug range from number representation and rounding error to bugs arising from faulty assumptions or interpretations of data, from synchronization issues, from software-hardware interactions, and from the high complexity of large projects. The authors note that a single failure may arise from multiple bugs. Failures are often very complex, and sometimes a single bug produces a cascade of bad effects.
 
The real meat of the book is in the examples. Each example begins with a short overview that describes the failure. Introductory material follows and fills in the background of the application area. Next comes a timeline showing the sequence of events leading to the accident. Each example typically concludes with a detailed description of the error and an analysis of the root causes. To make the book more accessible to non-specialists, the authors have added “excursions” that explain important concepts where necessary. These include topics like machine number formats, rounding, the finite element method, condition numbers and a good deal more.
 
Examples range from aerospace to finance, structural mechanics, control systems, medical imaging, and airport luggage handling. For illustration, here are two summaries of examples that caught my attention:
 
The European Space Agency’s Ariane 5 launcher was designed to place four satellites in near-earth orbit. Because of serious flight instability, it was destroyed shortly after launch. The proximate cause was an overflow when a 64-bit floating-point number was converted to a 16-bit integer. The cost was $500 million.
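 
To make the overflow concrete, here is a minimal Python sketch of what can go wrong when a 64-bit floating-point value is forced into a signed 16-bit integer. It is an illustration only, not the flight software: the function names and the sample value are hypothetical, and silent wraparound is just one possible outcome of an unchecked conversion, assumed here for the sake of the example.

```python
import ctypes

# Range of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Truncate toward zero and keep only the low 16 bits (silent wraparound).

    ctypes integer types perform no overflow checking, so out-of-range
    inputs come back as unrelated values. (Hypothetical helper.)
    """
    return ctypes.c_int16(int(value)).value


def checked_to_int16(value: float) -> int:
    """Convert only if the truncated value fits; otherwise raise.

    (Hypothetical helper showing a range-checked alternative.)
    """
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return truncated


if __name__ == "__main__":
    reading = 40000.0  # made-up value, larger than INT16_MAX
    print(unchecked_to_int16(reading))  # wraps to a negative number
    try:
        checked_to_int16(reading)
    except OverflowError as err:
        print("rejected:", err)
```

The checked version turns a silently wrong number into an explicit error, but that only helps if the caller decides in advance what to do when the error occurs.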
 
The concrete gas rig Sleipner A, an offshore drilling platform in the North Sea, failed during a controlled ballast test after experiencing critical hydrostatic pressures. It went underwater and sank quickly. The primary cause was a discretization error leading to a serious design flaw. The designer, using finite element software, chose finite elements with too high an aspect ratio and underestimated the boundary shear forces by more than 40%. The cost here was also substantial, but since failure occurred during testing, no lives were lost.
 
This would be an excellent adjunct for courses in numerical analysis or computational science. The detailed examples and thorough analysis of the origin and consequences of software failures are particularly valuable.

Bill Satzer (bsatzer@gmail.com) was a senior intellectual property scientist at 3M Company. His training is in dynamical systems and particularly celestial mechanics; his current interests are broadly in applied mathematics and the teaching of mathematics.