13 August 2011

Why does software fail?

In my previous post I described a project that was about to fail. In this post I will try to discuss what it means for software development to be successful, and what the reasons are (both the immediate and the less apparent ones) behind some spectacular software failures.

To start with, let's ask a question: "what does it mean that development was successful?" Firstly, we have to consider purely economic criteria, i.e. cost in terms of time and money. Simply speaking, we need the software to be completed within a prescribed timeframe and within an accepted budget. Both are equally important, as our customers do not want to pay too much and they also do not want to wait for years.

Secondly, the software delivered has to meet the needs of its users. This one seems obvious - if the software does not do what it needs to do, then it is useless. However, in many cases it is hard to tell what the software has to do (we will be discussing this issue soon). For now, just assume that the software has to perform certain functions required by the users (so it has the correct scope) and it has to execute them correctly (so the quality is right).

Let's set up a suitable background for further discussion by presenting a few spectacular examples of software failures. We will start with one that is brought up on many occasions - the Ariane 5 explosion (or self-destruction, to be precise) on 4 June 1996. The immediate cause of the crash was a software error. The failure report (Ariane 5 Flight 501 Failure, Report by the Inquiry Board, available as a PDF) states: "A data conversion from 64-bit floating point value to 16-bit signed integer value to be stored in a variable representing horizontal bias caused a processor trap (operand error)". The conversion caused an exception simply because the value being converted was larger than a 16-bit signed integer can hold. As the software was originally written for Ariane 4, the values were not checked or protected prior to the conversion, on the assumption that they were "physically limited or that there was a large margin of error". The check was left out in order to meet a requirement of sufficiently low load on the main flight computer. The funny thing is that the piece of code that caused the explosion was not needed by Ariane 5 at all - it was there because a whole Ariane 4 subsystem had been reused...
The result was a spectacular fireworks show worth $350 million, funded by the French Arianespace.
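
To make the conversion problem concrete, here is a minimal Python sketch. It is not the flight software (that was written in Ada). The function names, the OperandError class and the sample values are all made up for illustration. It only contrasts a conversion that fails on an out-of-range value with one that is protected by clamping to the 16-bit range.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


class OperandError(Exception):
    """Hypothetical stand-in for the processor trap described in the report."""


def convert_unprotected(value: float) -> int:
    """Convert to a 16-bit signed integer; fail if the value does not fit."""
    result = int(value)  # truncates towards zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in a 16-bit signed integer")
    return result


def convert_protected(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    small, large = 1234.7, 50000.0  # illustrative values only
    print(convert_unprotected(small))  # fits, so no problem
    print(convert_protected(large))    # clamped to INT16_MAX
    try:
        convert_unprotected(large)
    except OperandError as err:
        print("trap:", err)
```

Clamping is only one possible protection. Whether a saturated value would have been acceptable to the rest of the system is a separate design question that this sketch does not answer.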

Less funny, however, was a similar failure of the MIM-104 Patriot system in Dhahran on 25 February 1991, when an Iraqi Scud missile killed 28 soldiers. Even though the approaching Scud was detected, the Patriot failed to intercept it due to a software error. Put simply, the radar detected the missile, the software predicted where it would be next... and it was not there. Why? The prediction procedure was OK, but it relied on an accurate calculation of time from the system's clock - and this part was buggy: an accumulating round-off error caused clock drift (i.e. the clock was not running at the correct speed). As the MIM-104 station had been up for over 100 hours, the clock was off by about 1/3 of a second - enough for the Scud to travel over 600 meters. As the subsequent radar scans did not detect the incoming missile in the predicted areas, the system assumed it was a false alarm... while the Scud had already hit the barracks. Sadly, the problem had been identified a few days earlier, with a temporary workaround suggested (i.e. rebooting the station every few hours), and the patch was delivered by the manufacturer the day after the people died.
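
How such a tiny round-off error grows into a third of a second can be sketched in a few lines of Python. This is only an illustration under assumptions that go beyond what I wrote above: commonly cited analyses of the incident describe a clock that counts tenths of a second and converts them to seconds using 1/10 stored in a 24-bit fixed-point register. Here I assume 23 fractional bits, and all the names are made up.

```python
import math
from fractions import Fraction

FRACTION_BITS = 23         # assumption: fractional bits kept for the constant 1/10
TICK = Fraction(1, 10)     # the clock counts tenths of a second


def truncated_tick(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to the given number of binary digits."""
    scale = 2 ** bits
    return Fraction(math.floor(TICK * scale), scale)


def clock_drift_seconds(hours_up: int) -> float:
    """Accumulated error of the computed time after hours_up hours of uptime."""
    ticks = hours_up * 3600 * 10                # number of 0.1 s ticks counted
    error_per_tick = TICK - truncated_tick()    # tiny, but always the same sign
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    for hours in (8, 20, 100):
        print(hours, "h ->", round(clock_drift_seconds(hours), 4), "s")
    # after 100 hours the drift is about 0.34 s
```

The error per tick is below one ten-millionth of a second, but because it always has the same sign it never cancels out. This is also why rebooting the station every few hours worked as a temporary fix.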

Another example of failure (but with no explosion this time) is the FBI Virtual Case File. As the bureau was having trouble sharing data among its agents and divisions (and was strongly criticised for that), it was decided to develop a system where all the case files would be stored and where all agents with a suitable access level would be able to find them quickly. The project started in 2001, just before 9/11, and was supposed to be finished in 2003. After several delays it was finally completed in 2004, but... it had only partial functionality, which did not match the requirements that emerged after the terrorist attacks. $170 million was spent on a project that was eventually scrapped.

Next example - Sainsbury's supply chain management. Prior to 2000, Sainsbury's had centralised, mainframe-based warehouse management. The new CEO claimed that the server had low utilisation and ran too many applications, and that outsourcing the IT department to Accenture would lead to a new open, scalable, high-performance architecture with strong security at low cost. In 2004 the contract with Accenture was renewed until 2010 as "the relationship with Accenture worked so well"; however, 2 months later the CEO resigned because of poor financial performance. Soon after, it was discovered that the warehouse management system under development was not even able to track the stock correctly. The project was cancelled and the in-house IT department restored. $1.8 billion wasted; Accenture blames Sainsbury's and vice versa.

I could probably give many more examples - but there are books full of them already (and a really good article by Robert Charette titled "Why Software Fails", published in IEEE Spectrum a few years ago). It is much more interesting to look at the reasons for these failures. Before we do that, let's consider them in relation to the criteria we set at the beginning:

Project                       Delivered on time?   Within budget?   Meets requirements?
Ariane 5                      Yes                  Probably         No
MIM-104 Patriot               Probably             Probably         No
FBI Virtual Case File         No                   No               No
Sainsbury's warehouse mgmt.   No                   No               No

So, as we can see, the failures are not always the same - even if the final user gets the software on time and for the money he intended to pay, it does not mean the project was successful (although in general your PC will not blow up when programs fail, so a small number of bugs is usually acceptable given the very high cost of identifying them prior to release).

So what are the reasons these failures happen? In the case of Ariane, we might say the reason was insufficient testing. However, if we take a deeper look we will see that it is not the only problem. Firstly, we can point to an insufficient/erroneous specification of requirements - the software did something completely unnecessary (as it was developed for a different spacecraft, with a completely different launch procedure); moreover, the requirements were contradictory or impossible to meet (low load vs. the need for error detection and correction). Secondly, the blind software reuse was most probably caused by an urge to meet deadlines - this software was proven and tested (but on a different spacecraft!), so it was easier to copy and paste it (regardless of its suitability) than to write it from scratch. Somebody, however, forgot to ask whether it could be reused directly without modification (or asked, but had no time to re-test it properly).

Similarly, we might say insufficient testing was the reason for the Patriot failure. But the bug was already known at the time, and the immediate problem was a lack of information on how to deal with it - effectively, insufficient training of the soldiers operating the station - which was a chain-of-command/management issue.

In the case of the FBI Case File the reasons are quite obvious:
  • poorly defined, changing requirements, 
  • overcommitment to an ambitious project that was (in fact) meant to be a quick fix for a broader management/communication problem within the FBI itself,
  • poor quality of work done by contractors without sufficient expertise.
Moreover there were some 'political reasons' contributing to the failure:
  • neglecting the existence of off-the-shelf solutions,
  • 14 (sic!) project managers over the 3 years of the project's lifetime (so the manager changed roughly every 2.5 months),
  • hardware already set up and waiting for the software (hence pressure for fast delivery to unrealistic deadlines).
Similar problems can be found in the case of Sainsbury's:
  • weak outsourcing strategy,
  • 'big-bang' approach,
  • politics, 
  • software meant to be a fix for poor business management
In addition, the loss of staff with knowledge of the existing solution contributed to some of the problems (Accenture claimed the software failed due to faulty Sainsbury's subsystems that were not outsourced to them).

I will also quickly list the problems with the project I wrote about last time. As I said, the first problem was a lack of understanding of the customer's needs - a poor specification of requirements. Secondly, the project was approached in a 'big-bang' fashion and without any suitable management policy. Well, there was no management at all. At the time I took over, the pressure to deliver fast was already growing, and the urge for a scalable and efficient solution resulted in insufficient testing and poor overall quality. Thankfully we managed to sort it out - and the lesson was learnt.

We can generally say that the failure or success of a project depends critically on the following factors:
  • organisational - the culture and workflow within the organisation the software is developed for. In general, software should not be a fix for problems of a completely different nature (such as a lack of suitable procedures, poor communication or poor business management);
  • project management - overcommitment and pressure of any kind do not help in managing a project efficiently. Moreover, a lack of suitable management, and of a broad overview and understanding of the project and its scope, also contributes to failure;
  • conduct of the project - errors in each phase of the project: the initial stages (e.g. underestimating complexity, being lured by a particular solution without a sensible reason), specification and design (e.g. poor requirements engineering, poor design, poor consultation), development (e.g. poor quality of contractors, lack of suitable knowledge, lack of communication) and implementation (e.g. insufficient testing and user training) all contribute to the final result of the project.
From now on I will try to write more 'practical' and less theoretical posts that address the various issues presented here. I will try not to stick to the 'natural' waterfall model (i.e. presenting each of the phases one by one), but rather write about stuff I encounter every day, either in my company or at the university (yes, we also have to manage our research projects there). Hence, if I encounter an interesting issue, it is likely to be presented in an upcoming post (with some additional examples).

Next time I will be talking about requirements engineering, or "what the customer wants and why he does not really need it" :)

