Ariane Crash (Was: Adriane crash) 
 Ariane Crash (Was: Adriane crash)


        >JOINT ESA/CNES PRESS RELEASE No. 33-96  -  Paris, 23 July 1996

        >Ariane 501 - Presentation of Inquiry Board report

        >-------------------------------------------------------------------

        >Hope this is useful. So basically it _was_ a software fault

---Is this a euphemism for a programming error?  because that's
what it was -- a programming error.

   The error was in assuming that a value would not overflow.
The specific error was that a conversion of a double-precision
floating-point value (~58 significant bits) to 15 significant
bits caused fixed-point overflow.  The conversion was not
checked for overflow.  It should have been.  This is, after all,
a real-time system.  It's a fundamental check that a programmer
experienced in real-time systems should have carried out.
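
A minimal sketch of the kind of explicit range check being argued for,
written in Python purely for illustration (the flight code was Ada). The
helper name to_int16, the saturate option, and the choice of a 16-bit
signed target are assumptions made for this example, not details taken
from the SRI software:

```python
import math

INT16_MIN = -(2 ** 15)      # -32768
INT16_MAX = 2 ** 15 - 1     #  32767


def to_int16(value: float, saturate: bool = False) -> int:
    """Truncate a float toward zero into a 16-bit signed integer.

    Out-of-range input raises OverflowError unless saturate is True, in
    which case the nearest representable limit is returned instead.
    """
    if math.isnan(value):
        raise ValueError("NaN has no integer representation")
    # Truncation toward zero fits exactly when
    # INT16_MIN - 1 < value < INT16_MAX + 1.
    if INT16_MIN - 1 < value < INT16_MAX + 1:
        return int(value)
    if saturate:
        return INT16_MAX if value > 0 else INT16_MIN
    raise OverflowError(f"{value!r} does not fit in 16 signed bits")
```

Whether an out-of-range value should raise or saturate is a design
decision; the sketch only shows that the check itself is cheap to write.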

   Control was then passed to the interrupt handler, which
shut down the system.

   The question is, basically, why was Ada used for this work?
PL/I has specific facilities for real-time programming,
and especially for simulating exactly this (and other)
exceptions -- as if the exceptions had actually occurred.
The SIGNAL statement is designed for this purpose.  The
programmer would have discovered this problem the FIRST time
he used it!  And he could have included an exception handler
for this and other similar kinds of trivial errors.  These
exception handlers would have returned control to the code.
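
The idea of raising a condition on demand to exercise its handler can be
sketched in Python. This is only an analogue of what is described above,
not PL/I's SIGNAL statement; run_cycle, always_overflow and
recovering_handler are hypothetical names invented for the example:

```python
from typing import Callable


def run_cycle(convert: Callable[[float], int],
              raw_value: float,
              on_overflow: Callable[[OverflowError], int]) -> int:
    """Run one conversion and route any OverflowError to the given handler."""
    try:
        return convert(raw_value)
    except OverflowError as exc:
        # The handler's return value becomes the result, so control
        # returns to the caller instead of stopping the program.
        return on_overflow(exc)


def always_overflow(_raw_value: float) -> int:
    """Test double: raises the condition regardless of the input."""
    raise OverflowError("injected for test")


def recovering_handler(_exc: OverflowError) -> int:
    """Hypothetical handler that substitutes a limit value."""
    return 32767


# Normal path: the handler is never called.
normal = run_cycle(int, 12.7, recovering_handler)
# Injected fault: the handler runs even though the input is harmless.
injected = run_cycle(always_overflow, 12.7, recovering_handler)
```

Swapping in a handler that halts instead of recovering would show, in the
same test, what a shutdown policy does to the caller.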

   A PL/I programmer and/or a real-time systems programmer
would have OBJECTED to the stupid requirement of shutting
down the system when a trivial error occurred.

        >What I want to know is, who wrote that software, and if there was an
        >ESA representative responsible for it, who was he!
        >Not that I want to apportion blame of course, just interested!




Tue, 12 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)


Quote:

> ---Is this a euphemism for a programming error?  because that's
> what it was -- a programming error.

>    The error was in assuming that a value would not overflow.

The error was assuming that the Ariane 4 design would be adequate
for the Ariane 5 system.

Quote:
> The specific error was that a conversion of a double-precision
> floating-point value (~58 significant bits) to 15 significant
> bits caused fixed-point overflow.  The conversion was not
> checked for overflow.  It should have been.

It was checked, hence the exception and an exception handler to
take corrective action.  Unfortunately the corrective action was
to assume that the SRI had failed and to shut it down.  The
software performed exactly as designed.

Quote:
>  This is, after all,
> a real-time system.  It's a fundamental check that a programmer
> experienced in real-time systems should have carried out.

>    Control was then passed to the interrupt handler, which
> shut down the system.

Exactly as designed.

Quote:
>    The question is, basically, why was Ada used for this work?

The failure is not a language issue; that is not the question.

-Bob



Fri, 15 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)


Quote:

>    >JOINT ESA/CNES PRESS RELEASE No. 33-96  -  Paris, 23 July 1996
>    >Ariane 501 - Presentation of Inquiry Board report
>    >-------------------------------------------------------------------
>    >Hope this is useful. So basically it _was_ a software fault
>---Is this a euphemism for a programming error?  because that's
>what it was -- a programming error.

Having read the report, I don't consider it to be a programming error;
it was a design and management error. It sounds like whoever designed
the system didn't pay enough attention to the requirements, and
whoever was managing it didn't pay enough attention to its conformance
to the requirements.

I don't think the overflow was due to a programming oversight: the
analyses had been done and a decision not to check that variable had
been made (*see additional note below). And since that variable should
not have been in use at that point, I don't think you can blame
whoever wrote that code.

Quote:
>   The error was in assuming that a value would not overflow.
>The specific error was that a conversion of a double-precision
>floating-point value (~58 significant bits) to 15 significant
>bits caused fixed-point overflow.  The conversion was not
>checked for overflow.  It should have been.  This is, after all,
>a real-time system.  It's a fundamental check that a programmer
>experienced in real-time systems should have carried out.
>   Control was then passed to the interrupt handler, which
>shut down the system.
>   The question is, basically, why was Ada used for this work?

ESA Ada preference/mandate(?).

<..snip..>

*Note: I hope this makes ESA look a bit closer at why they want to
limit processor loading and how the margin should be reduced through
the design and development phases. My own project has an ESA-enforced
limit of 70%, which is quite ridiculous given the equipment we're using
(GPS MA31750 10MHz MIL-STD-1750 processor). We cannot meet that limit
and have requested a waiver; I believe that's much better than
compromising the safety of the mission.

ESA's loading margins are really supposed to take account of a
requirement for future modifications to software once it has been
delivered. There's no way this should have been enforced for Ariane 5.

From the sound of the report, I think a pretty poor job has been done,
not by the programmers who wrote the code and performed the analysis
of what variables could safely be left unchecked, but by
whoever performed the requirement analysis and all levels of
management / reviewers above that, who have been completely negligent.

Best Regards



Fri, 15 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)



        >>
        >> ---Is this a euphemism for a programming error?  because that's
        >> what it was -- a programming error.
        >>
        >>    The error was in assuming that a value would not overflow.

        >The error was assuming that the Ariane 4 design would be adequate
        >for the Ariane 5 system.

        >> The specific error was that a conversion of a double-precision
        >> floating-point value (~58 significant bits) to 15 significant
        >> bits caused fixed-point overflow.  The conversion was not
        >> checked for overflow.  It should have been.

        >It was checked, hence the exception and an exception handler to
        >take corrective action.

---The SRI computer (& its backup) had an exception
handler, to be sure, but it did not have an exception
handler to take corrective action.  The exception handler
shut the computer down.

        > Unfortunately the corrective action was
        >to assume that the SRI had failed and to shut it down.  The
        >software performed exactly as designed.

---The software did not perform as designed.  It was
intended to shut down the computer only in the event of
a hardware error.  The software shut down the computer
because of a programming error.  The software performed
only as written!

        >>  This is, after all,
        >> a real-time system.  It's a fundamental check that a programmer
        >> experienced in real-time systems should have carried out.
        >>
        >>    Control was then passed to the interrupt handler, which
        >> shut down the system.

        >Exactly as designed.

---Again, not as designed.  It was designed to shut down only
in the event that the SRI computer failed.  Then the backup
would be used.
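
A small Python sketch of why switching to the backup does not help when
both units run the same code and see the same data. The function names
and the 16-bit limit are assumptions for illustration only:

```python
from typing import Callable, Optional, Sequence


def sri_convert(value: float) -> Optional[int]:
    """One unit's conversion; None models a unit that has shut itself down."""
    if -32768 <= value <= 32767:
        return int(value)
    return None


def read_with_failover(value: float,
                       units: Sequence[Callable[[float], Optional[int]]]
                       ) -> Optional[int]:
    """Try each unit in order and return the first usable result."""
    for unit in units:
        result = unit(value)
        if result is not None:
            return result
    return None  # every unit failed


# Primary and backup run identical software.
units = [sri_convert, sri_convert]
in_range = read_with_failover(1200.0, units)
out_of_range = read_with_failover(1.0e6, units)
```

Redundancy of this kind protects against one unit's hardware failing,
not against an input that every copy of the software rejects.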



Sat, 16 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)


Quote:

>    >The error was assuming that the Ariane 4 design would be adequate
>    >for the Ariane 5 system.

>    >> The specific error was that a conversion of a double-precision
>    >> floating-point value (~58 significant bits) to 15 significant
>    >> bits caused fixed-point overflow.  The conversion was not
>    >> checked for overflow.  It should have been.

>    >It was checked, hence the exception and an exception handler to
>    >take corrective action.

> ---The SRI computer (& its backup) had an exception
> handler, to be sure, but it did not have an exception
> handler to take corrective action.  The exception handler
> shut the computer down.

Which was the specified corrective action.

Quote:
>    > Unfortunately the corrective action was
>    >to assume that the SRI had failed and to shut it down.  The
>    >software performed exactly as designed.

> ---The software did not perform as designed.  It was
> intended to shut down the computer only in the event of
> a hardware error.

The out-of-bounds data was considered to be indicative of a random hardware
fault, at least for the Ariane 4.  Perhaps this was not a valid method
of determining a hardware fault, but it was the design decision.

Quote:
>  The software shut down the computer
> because of a programming error.  The software performed
> only as written!

>    >>  This is, after all,
>    >> a real-time system.  It's a fundamental check that a programmer
>    >> experienced in real-time systems should have carried out.

>    >>    Control was then passed to the interrupt handler, which
>    >> shut down the system.

>    >Exactly as designed.

> ---Again, not as designed.  It was designed to shut down only
> in the event that the SRI computer failed.  Then the backup
> would be used.

Again, the (wrongly assumed) SRI failure was determined by the detection
of out of bounds data.  It was a requirements oversight, not a programming
oversight, and most certainly not influenced by the programming language used.

To quote the report:

  Although the source of the Operand Error has been identified, this in
  itself did not cause the mission to fail. The specification of the
  exception-handling mechanism also contributed to the failure. In the
  event of any kind of exception, the system specification stated that:
  the failure should be indicated on the databus, the failure context
  should be stored in an EEPROM memory (which was recovered and read out
  for Ariane 501), and finally, the SRI processor should be shut down.

The last sentence of the above is what the requirements stated, and
exactly what the software did, exactly as designed.
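
The three steps in the quoted specification can be sketched as follows.
This is an illustrative Python model, not the SRI code; the Sri class,
its databus and eeprom lists, and the step method are all hypothetical
stand-ins:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sri:
    """Toy model of one inertial reference unit."""
    name: str
    running: bool = True
    databus: List[Tuple[str, str]] = field(default_factory=list)
    eeprom: List[str] = field(default_factory=list)

    def handle_exception(self, exc: Exception) -> None:
        """Apply the specified response to any kind of exception."""
        self.databus.append(("FAILURE", self.name))  # 1. indicate on databus
        self.eeprom.append(repr(exc))                # 2. store failure context
        self.running = False                         # 3. shut the processor down

    def step(self, value: float) -> Optional[int]:
        """Convert one reading; return None once the unit is shut down."""
        if not self.running:
            return None
        try:
            if not -32768 <= value <= 32767:
                raise OverflowError(f"{value!r} out of 16-bit range")
            return int(value)
        except Exception as exc:  # the spec makes no distinction by cause
            self.handle_exception(exc)
            return None
```

Nothing in the handler looks at why the exception was raised, which is
how a software range error ends up treated the same as a hardware fault.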

-Bob



Sun, 17 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)

One point I would like to emphasize is that the out-of-bounds error
occurred in a portion of the software that was not useful once launch
commenced. This has several implications:

1. Utilization of this software should have ceased as soon after
launch as possible, freeing computational resources as soon as
possible.

2. The effect of exception handling on processor utilization for this
portion of the software should have been important only during the
prelaunch phase, when I suspect processor utilization would have
been minimal.

3. The proper action to take in the event of an exception in this
portion of the software should be based on what the proper action
should be before launch.  I would not be surprised to discover that
the proper action would be to shut down the processor at that stage.

--

William B. Clodius              Phone: (505)-665-9370

Los Alamos, NM 87545



Sun, 17 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)




        >>
        >>        >The error was assuming that the Ariane 4 design would be adequate
        >>        >for the Ariane 5 system.
        >>
        >>        >> The specific error was that a conversion of a double-precision
        >>        >> floating-point value (~58 significant bits) to 15 significant
        >>        >> bits caused fixed-point overflow.  The conversion was not
        >>        >> checked for overflow.  It should have been.
        >>
        >>        >It was checked, hence the exception and an exception handler to
        >>        >take corrective action.
        >>
        >> ---The SRI computer (& its backup) had an exception
        >> handler, to be sure, but it did not have an exception
        >> handler to take corrective action.  The exception handler
        >> shut the computer down.

        >Which was the specified corrective action.

---Calling it "corrective" action is stretching the English
language a bit.  In no way, shape, or form was the
action "corrective".

        >>        > Unfortunately the corrective action was
        >>        >to assume that the SRI had failed and to shut it down.  The
        >>        >software performed exactly as designed.
        >>
        >> ---The software did not perform as designed.  It was
        >> intended to shut down the computer only in the event of
        >> a hardware error.

        >The out-of-bounds data was considered to be indicative of a random hardware
        >fault, at least for the Ariane 4.  Perhaps this was not a valid method
        >of determining a hardware fault, but it was the design decision.

---Please read what I wrote.  The overflow was not a hardware
fault.  It was a programming error that should not have occurred,
bearing in mind the "sudden death" nature of the shutdown in the
event of any kind of interrupt.

        >>  The software shut down the computer
        >> because of a programming error.  The software performed
        >> only as written!
        >>
        >>        >>  This is, after all,
        >>        >> a real-time system.  It's a fundamental check that a programmer
        >>        >> experienced in real-time systems should have carried out.
        >>        >>
        >>        >>    Control was then passed to the interrupt handler, which
        >>        >> shut down the system.
        >>
        >>        >Exactly as designed.
        >>
        >> ---Again, not as designed.  It was designed to shut down only
        >> in the event that the SRI computer failed.  Then the backup
        >> would be used.

        >Again, the (wrongly assumed) SRI failure was determined by the detection
        >of out of bounds data.  It was a requirements oversight, not a programming
        >oversight, and most certainly not influenced by the programming language used.

---If you make an assumption about the range of data,
and you are wrong, it is a programming error.

        >To quote the report:

        >  Although the source of the Operand Error has been identified, this in
        >  itself did not cause the mission to fail. The specification of the
        >  exception-handling mechanism also contributed to the failure. In the
        >  event of any kind of exception, the system specification stated that:
        >  the failure should be indicated on the databus, the failure context
        >  should be stored in an EEPROM memory (which was recovered and read out
        >  for Ariane 501), and finally, the SRI processor should be shut down.

        >The last sentence of the above is what the requirements stated, and
        >exactly what the software did, exactly as designed.

---Again, the interrupt for fixed-point overflow was
not expected to happen.  The software DID NOT OPERATE
AS DESIGNED.  It failed.  You're placing too literal an
interpretation on the first sentence.



Mon, 18 Jan 1999 03:00:00 GMT  
 Ariane Crash (Was: Adriane crash)



<snip>

Quote:

>---Please read what I wrote.  The overflow was not a hardware
>fault.  It was a programming error that should not have occurred,
>bearing in mind the "sudden death" nature of the shutdown in the
>event of any kind of interrupt.

 ++robin, please read what the poster wrote ... he was describing a
 situation where, by spec, the event was deemed to indicate a hardware
 fault. We can all see clearly that it was not a hardware fault in this
 case; however, that does not relieve the s/w of its requirement to
 treat the event as indicative of a hardware fault.

 btw: A 'spec' is when a customer tells you what he thinks he wants.
      You may or may not agree with his interpretation of what he wants,
      but if you want the work, you promise to deliver what he SAYS! he
      wants - even if it is wrong - unless you can convince him to fix
      his wrong 'spec'. The embedded systems world uses 'spec' to
      define a 'design'; then customer gets to{*filter*}in the design as well.

<snip>

Quote:
>---If you make an assumption about the range of data,
>and you are wrong, it is a programming error.

 Unless the 'spec'/'design' require you to make that assumption ...

<snip>

Quote:
>---Again, the interrupt for fixed-point overflow was
>not expected to happen.  The software DID NOT OPERATE
>AS DESIGNED.  It failed.  You're placing too literal an
>interpretation on the first sentence.

 I believe the report clearly indicates that software operated per design.
 The fault lies with adapting existing software to a new mission, without
 doing sufficient system engineering to see where the old design needed
 to be beefed up to meet the new mission!

 Re: your favorite language & embedded systems ... is that all a troll,
     or what ?

                                           regards



Tue, 19 Jan 1999 03:00:00 GMT  
 