Software Bug crashes European Rocket, Ariane 5. 

[big snip]

Quote:
>     In the earlier versions of the Ada programming language, Ada '83,
> where the variable's range had to be specified, it was difficult
> to allow for an exception or to protect the variable in case it exceeded
> the range. In this case because of the range specified, the compiler
> thought it was safe to convert the piece of data from a 64 bit to
> a 16 bit value, causing the overflow error. In Ada '95 it is easier
> and more efficient for programmers to protect variables for
> exceeding limit errors, and allow for graceful recovery.

[big snip]

In what otherwise looked to be a reasonable discussion of the Ariane 5
crash, I stumbled onto this. Does anyone know what the "difficulty" was in
Ada83 with putting an exception handler in the right place to catch this
error, or for that matter how Ada95 makes it easier? Certainly, Ada95 makes
it easier to annotate the exception, but that wouldn't have made a
difference here, would it?

The real issue, IMHO, is that a deliberate decision was made not to use an
Ada feature, supposedly to improve processing throughput. The decision
worked on Ariane 4 (at least, up to now!) and didn't on Ariane 5.

--
LMTAS - The Fighter Enterprise - "Our Brand Means Quality"
For job listings, other info: http://www.*-*-*.com/ or
http://www.*-*-*.com/



Mon, 19 Jul 1999 03:00:00 GMT  

                                  4:45:33am     Saturday January 18, 1997.

Subject:  Re: Software Bug crashes European Rocket, Ariane 5.


Command-Control-Communications-Computers-Intelligence
Surveillance Reconnaissance (C4ISR) forum:

Quote:
>Just saw an incredible article from a December edition of the New York
>Times.

>Essentially, a very common bug (speaking as a software engineer)
>literally crashed (as in fire and explosion) a European rocket which had
>cost $7 billion to produce.  On one hand, unimaginable that a simulator
>wasn't able to create the scenario which led to the crash; and on the
>other hand, only too easy to imagine that it is quite possible.

>Be warned that the NYT Web site requests "registration" data -- the URL for
>this article  is:
>" http://www.*-*-*.com/ {*filter*}/week/1201gleick....".

    What Carrie Swaby is talking about is that on 4 June 1996 the Ariane 5
prototype European space launcher veered off course and was destroyed
40 seconds after blast-off. Its payload, four expensive and uninsured
scientific satellites, was lost in the explosion.

    The article referenced was a New York Times, {*filter*} Times,
Fast Forward piece by James Gleick, "1 {*filter*}sy Little Bug, 1 Humongous Crash",
01-Dec-96. See also New York Times, "Costly Failure: Space launch is
aborted", 05-Jun-96, page D1.

    I was surprised that the Ariane 5 explosion had not been covered by
the C4ISR forum before. It was extensively covered on
Peter G. Neumann's ACM Risks forum during June and July.

    I disagree with the posting and with the emphasis of Mr. Gleick's article,
"that this very common software bug caused the crash".
The problem involved much more negligence, and was much more complex.

    Please see the "Findings of the Ariane 501 inquiry board", which was
available over the summer at
http://www.*-*-*.com/
Aviation Week and Space Technology reprints and examines the
findings in the following issues: 29-Jul-96 page 33, 09-Sep-96 page 79,
16-Sep-96 page 55. See also Nature, "Software Testing blamed for Failure",
01-Aug-96 page 386.

    About 37 seconds into flight, both the primary and the backup unit of
the Sextant Avionique guidance system failed. The inertial reference
system (SRI) software had crashed and started to transmit diagnostic data
to the main computer, which interpreted the diagnostic data as flight data.
The On-Board Computer (OBC) tried to correct for this erroneous course data
by making abrupt course corrections that were not needed.

    What happened was that the horizontal velocity exceeded the range
of the software variable holding it and caused an exception. Unlike
other variables in the inertial reference system software, this one
was not protected against an excessive value.
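    To make the failure mode concrete, here is a minimal sketch in
Python (not the SRI's Ada) of converting a 64-bit float to a 16-bit
signed integer. The names OperandError, to_int16_checked and
to_int16_saturating are made up for illustration and do not come from
the flight software or the report. The first function raises when the
value does not fit, the second clamps it to the limit.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Hypothetical error: the value does not fit the target integer type."""


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it is out of range."""
    result = int(value)  # Python truncates toward zero here
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping at the limits."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1234.7))      # 1234
print(to_int16_saturating(40000.0))  # 32767
```

    Whether raising or clamping is the right response depends on what
the consumer of the value does with it. The sketch only shows that the
choice has to be made explicitly.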

    The reason for this was that the software was written for the
Ariane 4 rocket, which could not exceed that velocity; the Ariane 5
is a faster rocket. Common and legacy software was used. In fact,
the software that crashed was a special feature for the Ariane 4 rocket,
to calibrate and align the system during the first 40 seconds of
flight in the case of a brief countdown hold. It was not needed and
served no purpose for the Ariane 5. The software was maintained for
commonality reasons, presumably based on the view that it was not wise
to make any changes in the software that worked so well in the previous
system.

    In the earlier versions of the Ada programming language, Ada '83,
where the variable's range had to be specified, it was difficult
to allow for an exception or to protect the variable in case it exceeded
the range. In this case because of the range specified, the compiler
thought it was safe to convert the piece of data from a 64 bit to
a 16 bit value, causing the overflow error. In Ada '95 it is easier
and more efficient for programmers to protect variables for
exceeding limit errors, and allow for graceful recovery.

    Yet this was not the "bug" that caused the crash. This
SRI software worked perfectly well for the system it was designed
for, the Ariane 4, over 23 successful flights. Not all data
conversions were protected because a maximum workload target
of 80% had been set for the SRI computer.

    Also, complete testing against the Ariane 5 trajectory was not done.
The reasons seem to be that the SRI had already been flight tested on
the Ariane 4, and budgetary concerns.

    The report essentially assesses the causes of the failure as:

1 -    More extensive realistic testing and simulations should have
       been performed using Ariane 5 trajectory data.

2 -    All implicit assumptions made in the code should have been
       identified, reviewed, and analyzed in terms of using the
       SRI Ariane 4 software for the Ariane 5.
       A specific qualification review for the software should
       have been held in terms of Ariane 5 specifications.
       No software function should run during flight unless it is needed!
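    As a rough illustration of points 1 and 2, an implicit range
assumption can be written down and re-checked against the new
vehicle's trajectory data before the code is reused. The sketch below
is Python, the function name violations is hypothetical, and the
profile numbers are invented for the example. They are not Ariane 4
or Ariane 5 data.

```python
def violations(samples, low, high):
    """Return (index, value) pairs where a sample falls outside [low, high]."""
    return [(i, v) for i, v in enumerate(samples) if not low <= v <= high]


# Invented numbers, for illustration only.
old_profile = [100.0, 9000.0, 21000.0]
new_profile = [100.0, 15000.0, 41000.0]

# The assumption "this value always fits in 16 signed bits" holds for the
# old profile but not for the new one.
print(violations(old_profile, -32768, 32767))  # []
print(violations(new_profile, -32768, 32767))  # [(2, 41000.0)]
```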

    In assessing this fiasco in terms of lessons learned for C4ISR
software engineering, when upgrading a system, going to a new block
upgrade, or basing a new project on an earlier project's software:

    *    All legacy software or "common modules" should be analyzed
         and reviewed in terms of the new specification. If the
         software contains code that serves no purpose for the new
         system, or is less efficient for the new system, then that
         software should be redesigned and re-written.

         Many managers object to this, because it essentially
         removes any cost savings obtained by software re-use,
         or the use of object-oriented class libraries.

         Such a review is easier with up-to-date documentation,
         specifically detailed justification and assumption documents,
         kept with each piece of software.
         While this would increase the initial cost of the
         software, it would make it easier to review modules or
         objects in terms of re-use.

    *    Complete testing and simulations should be done on the new version
         using the new specifications. Everything should be re-tested from
         the ground up. Functions should not be exempt from testing
         because of previous performance on the old system.

-----------------

Robert J. Perillo, CCP
Staff Computer Scientist
Richmond, Va.



Mon, 19 Jul 1999 03:00:00 GMT  

I do not understand the following paragraph at all!

"    In the earlier versions of the Ada programming language, Ada '83,
where the variable's range had to be specified, it was difficult
to allow for an exception or to protect the variable in case it exceeded
the range. In this case because of the range specified, the compiler
thought it was safe to convert the piece of data from a 64 bit to
a 16 bit value, causing the overflow error. In Ada '95 it is easier
and more efficient for programmers to protect variables for
exceeding limit errors, and allow for graceful recovery."

I see no relevant changes in Ada 95!!!



Mon, 19 Jul 1999 03:00:00 GMT  



Quote:
> The real issue, IMHO, is that a deliberate decision was made not to use
> an Ada feature, supposedly to improve processing throughput. The decision
> worked on Ariane 4 (at least, up to now!) and didn't on Ariane 5

According to another part of the first report it went deeper than that.
In some senses they had assumed that a hardware failure would be the
most likely cause of a systems failure, and so the policy was to
shut down the errant processor immediately in the event of an exception.

Someone else chipped in when I last asked about this and said that in
multi-CPU and voting mission critical systems this isn't so unusual.
Unfortunately identical code on identical hardware is a common mode failure.
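To illustrate the common mode point, here is a small Python sketch,
with made-up names (convert, Unit, read_with_failover) that are not
taken from the report. Two identical units run the same unprotected
conversion, and the policy is to shut a unit down on any exception.
An input that trips one unit trips the other too, so the backup gives
no protection.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def convert(value: float) -> int:
    """Unprotected conversion: raises if the value does not fit in 16 bits."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return result


class Unit:
    """One of two identical units; any exception is treated as a hardware fault."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True

    def step(self, value: float):
        if not self.running:
            return None
        try:
            return convert(value)
        except OverflowError:
            self.running = False  # policy: shut the processor down at once
            return None


def read_with_failover(units, value: float):
    """Return the first answer from a running unit, or None if all are down."""
    for unit in units:
        answer = unit.step(value)
        if answer is not None:
            return answer
    return None


units = [Unit("primary"), Unit("backup")]
print(read_with_failover(units, 1200.0))   # 1200
print(read_with_failover(units, 40000.0))  # None: both units shut down
print([u.running for u in units])          # [False, False]
```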

It was very sad that the Ariane 4 code reused in Ariane 5 was not required
to handle (and be tested on) the new launch trajectory.

Regards,
--

Scientific Software Consultancy             /^,,)__/



Tue, 20 Jul 1999 03:00:00 GMT  

Quote:

> Someone else chipped in when I last asked about this and said that in
> multi-CPU and voting mission critical systems this isn't so unusual.
> Unfortunately identical code on identical hardware is a common mode failure.

In my experience with multi-CPU voting safety-critical systems (12 years
and counting), we _never_ shut down a system with software. With
independent hardware fault logic, yes, but never software, precisely
because the same software is in all CPUs. We always attempt to restart the
system, filter out the failure, do _something_, if the software detects an
internal failure. We also do our best to make sure the fault isn't there
in the first place. In fact, AFISC SSH 1-1 makes it explicit that
safety-critical systems should never attempt to halt.
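A minimal sketch of "filter out the failure" as opposed to halting,
in Python with a hypothetical Channel class. Holding the last good
value is only one possible degraded mode, chosen here to keep the
example short. It is not a description of any real flight system.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class Channel:
    """Keeps producing an output even when a conversion would fail."""

    def __init__(self, initial: int = 0) -> None:
        self.last_good = initial
        self.fault_count = 0

    def update(self, raw: float) -> int:
        value = int(raw)
        if INT16_MIN <= value <= INT16_MAX:
            self.last_good = value
        else:
            self.fault_count += 1  # record the fault and keep running
        return self.last_good


channel = Channel()
print(channel.update(1200.0))   # 1200
print(channel.update(40000.0))  # 1200 (held), fault recorded
print(channel.fault_count)      # 1
```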

Quote:
> It was very sad that it was not a requirement for the Ariane 4 code reused
> in Ariane 5 to be able to handle (and be tested) on the new launch trajectory.

A triumph of cost-consciousness over common sense.

--
LMTAS - The Fighter Enterprise - "Our Brand Means Quality"
For job listings, other info: http://www.lmtas.com or
http://www.lmco.com


Wed, 21 Jul 1999 03:00:00 GMT  

I've had a read of the report and one thing puzzles me. The report
essentially said that there were three problems:

  The reuse of legacy software without re-verification of all the
  assumptions made based on the Ariane 4 Functional specs.

  The lack of full system testing based on the assumed reliability of
  the software module.

  The running of software functionality during launch that was not
  required.

Isn't there a fourth problem? The On-Board Computer received as
valid telemetry data the diagnostic output of the Inertial Guidance
System and acted on this data. Shouldn't there be some sort of checking
on a data source such as this to validate it?

I'm new to this area so maybe I've missed something here?

--
brian wallis...                 TUSC Computer Systems Pty.Ltd

Phone: +61 3 9840 2222          Doncaster, Victoria
Fax: +61 3 9840 2277            Australia 3108



Fri, 23 Jul 1999 03:00:00 GMT  



Quote:

> I've had a read of the report and one thing puzzles me. The report
> essentially said that there were three problems:

[snip]

Quote:
> Isn't there a fourth problem? The On-Board Computer received as
> valid telemetry data the diagnostic output of the Inertial Guidance
> System and acted on this data. Shouldn't there be some sort of checking
> on a data source such as this to validate it?

> I'm new to this area so maybe I've missed something here?

I think that, although this was a problem, it was a bad design decision
made consciously, on the grounds that if the IGS was utterly wrecked the
diagnostic information still needed to be sent to the telemetry stream.
Why they decided to allow the computer to try to act on this data is
beyond me. It would have been safer to ignore inputs from failed sensor
units, and any data packets tagged as diagnostic data.
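As a sketch of what I mean, in Python: the packet layout below (source,
kind, attitude) is entirely hypothetical, since the real bus format is
not described in this thread. The consumer simply skips anything tagged
as diagnostic or coming from a unit already marked as failed, and falls
back to a supplied value when nothing usable is left.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FLIGHT = "flight"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class Packet:
    source: str      # hypothetical unit name, e.g. "SRI-1"
    kind: Kind       # tag saying what the payload means
    attitude: float  # payload; only meaningful for FLIGHT packets


def select_attitude(packets, failed_units, fallback):
    """Return the first usable flight value, ignoring diagnostics and failed units."""
    for packet in packets:
        if packet.kind is Kind.FLIGHT and packet.source not in failed_units:
            return packet.attitude
    return fallback


packets = [
    Packet("SRI-2", Kind.DIAGNOSTIC, 9999.0),
    Packet("SRI-1", Kind.FLIGHT, 1.5),
]
print(select_attitude(packets, failed_units={"SRI-2"}, fallback=0.0))  # 1.5
```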

It is always easier to be wise after the event though.

Regards,
--

Scientific Software Consultancy             /^,,)__/



Fri, 23 Jul 1999 03:00:00 GMT  

I have a direct link from my web site http://www.relisoft.com to the actual
report by the Inquiry Board into the failure of the Ariane 5 flight. It's
interesting reading. Highly recommended! (Yeah, it's also a thinly veiled
plug for my web site).
Bartosz


