
"If you can’t test it, don’t build it. If you built it and can’t test it, rip it out."
Since Boris Beizer first wrote those words in the 1980s, they have become part of the DNA of software development, reflected in the adoption of best practices that integrate testing, such as Test-Driven Development.
Software testing has four basic responsibilities:
1. To ensure we are building the right software
2. To ensure that software functions correctly and to specification
3. To ensure the software meets the required quality standards
4. To identify and evaluate risks associated with the software
The mere idea of deploying untested software into an operational environment is regarded as reckless by the software industry as a whole. And yet we are doing exactly this with machine learning and other AI applications on a regular basis, with predictably disappointing results. For example: the facial recognition system deployed for field testing by the New York Port Authority had a zero percent success rate; the IBM Watson machine learning system for cancer treatment offered advice that in some cases would have killed patients; and the machine learning tool used by Amazon in its hiring process was scrapped because it was biased against women.
Machine learning applications are radically different from traditional software applications, both architecturally and operationally, and they follow a very different development process, one that involves activities such as building and training models that have no counterpart in traditional software development. Because of these differences, software testing does not currently have the right tools to meet its four basic responsibilities for machine learning applications.
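To make the difference concrete, here is a minimal sketch, not drawn from any particular project, contrasting a traditional unit test with a test of a trained model. It assumes scikit-learn; the dataset, split, and accuracy threshold are illustrative choices:

```python
# A minimal sketch contrasting traditional and model testing.
# Assumes scikit-learn; dataset, split, and threshold are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_traditional_function():
    # Traditional unit test: the output is fully determined by the input.
    assert sorted([3, 1, 2]) == [1, 2, 3]

def test_trained_model_accuracy():
    # Model test: no single input/output pair is guaranteed, so we assert
    # an aggregate statistical property over a held-out set instead.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    assert model.score(X_test, y_test) >= 0.9  # threshold is a project decision
```

The first test has a single correct answer; the second can only assert a statistical property of the model's behavior, and deciding which property to assert, and at what threshold, is precisely where traditional testing tools offer little guidance.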
However, the history of software testing has been one of adaptation to the testing challenges presented by innovations in software development. Historically, this has often taken the form of borrowing and adapting testing techniques from other domains, such as engineering, medicine, and social research.
Whole domains of testing already focus on models and the other core artifacts that make up machine learning applications, and the acceptance testing, quality analysis, and risk analysis methods in use in other fields can be readily adapted to the challenge of testing machine learning applications. The testing problem is not insoluble.
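As one concrete illustration, metamorphic testing, a technique developed for programs whose correct outputs cannot be known in advance, adapts naturally to models: even without expected outputs, we can assert that certain input transformations should not change a prediction. The sketch below is a hypothetical example assuming scikit-learn; the perturbation scale and pass threshold are assumptions made for illustration:

```python
# A hypothetical metamorphic (invariance) test for a trained model.
# Assumes scikit-learn; the noise scale and threshold are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def test_prediction_invariant_to_tiny_perturbation():
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    noise = np.random.default_rng(0).normal(scale=1e-6, size=X.shape)
    # Metamorphic relation: noise far below the feature scale should
    # leave the predicted classes unchanged for (nearly) all examples.
    original = model.predict(X)
    perturbed = model.predict(X + noise)
    assert (original == perturbed).mean() >= 0.99
```

The value of such a relation is that it requires no labeled ground truth, which is exactly the situation model testing usually faces.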
The ultimate success of machine learning applications in the marketplace will depend critically on the software testing community catching up with the changes that machine learning has introduced to application development, and on its developing the tools needed to ensure that Beizer's recommendation, "If you can’t test it, don’t build it," can be applied to machine learning applications.