Ariane Crash (Was: Adriane crash)
Author |
Message |
++ rob #1 / 8
|
>JOINT ESA/CNES PRESS RELEASE N 33-96 - Paris, 23 July 1996 >Ariane 501 - Presentation of Inquiry Board report >------------------------------------------------------------------- >Hope this is useful. So basically it _was_ a software fault ---Is this a euphemism for a programming error? Because that's what it was -- a programming error. The error was in assuming that a value would not overflow. The specific error was that a conversion of a double-precision floating-point value (~58 significant bits) to 15 significant bits caused fixed-point overflow. The conversion was not checked for overflow. It should have been. This is, after all, a real-time system. It's a fundamental check that a programmer experienced in real-time systems should have carried out. Control was then passed to the interrupt handler, which shut down the system. The question is, basically, why was Ada used for this work? PL/I has specific facilities for real-time programming, and especially for simulating exactly this (and other) exceptions -- as if the exceptions had actually occurred. The SIGNAL statement is designed for this purpose. The programmer would have discovered this problem the FIRST time he used it! And he could have included an exception handler for this and other similar kinds of trivial errors. These exception handlers would have returned control to the code. A PL/I programmer and/or a real-time systems programmer would have OBJECTED to the stupid requirement of shutting down the system when a trivial error occurred. >What I want to know is, who wrote that software, and if there was an >ESA representative responsible for it, who was he! >Not that I want to apportion blame of course, just interested!
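To make the overflow argument concrete, here is a minimal sketch of a range-checked conversion. It is written in Python for illustration only and is not the flight code. It assumes the "15 significant bits" target is a 16-bit signed integer; the names `to_int16_checked`, `to_int16_saturating` and `ConversionOverflow` are hypothetical.

```python
import math

INT16_MIN = -32768   # smallest 16-bit signed value
INT16_MAX = 32767    # largest: 15 magnitude bits plus a sign bit


class ConversionOverflow(Exception):
    """Raised when a float cannot be represented in 16 signed bits."""


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, raising instead of silently overflowing."""
    if not math.isfinite(value):
        raise ConversionOverflow(f"non-finite value: {value!r}")
    truncated = math.trunc(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise ConversionOverflow(f"{value!r} does not fit in 16 bits")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp to the nearest representable value."""
    if math.isnan(value):
        raise ConversionOverflow("NaN has no integer value")
    if value >= INT16_MAX:
        return INT16_MAX
    if value <= INT16_MIN:
        return INT16_MIN
    return math.trunc(value)
```

Whether to raise or to saturate is a design decision that depends on how the caller uses the result. The sketch only shows that both policies are a few lines of code once the range is made explicit.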
|
Tue, 12 Jan 1999 03:00:00 GMT |
|
 |
Bob Gilbe #2 / 8
|
Quote: > ---Is this a euphemism for a programming error? because that's > what it was -- a programming error. > The error was in assuming that a value would not overflow.
The error was assuming that the Ariane 4 design would be adequate for the Ariane 5 system. Quote: > The specific error was that a conversion of a double-precision > floating-point value (~58 significant bits) to 15 significant > bits caused fixed-point overflow. The conversion was not > checked for overflow. It should have been.
It was checked, hence the exception and an exception handler to take corrective action. Unfortunately the corrective action was to assume that the SRI had failed and to shut it down. The software performed exactly as designed. Quote: > This is, after all, > a real-time system. It's a fundamental check that a programmer > experienced in real-time systems should have carried out. > Control was then passed to the interrupt handler, which > shut down the system.
Exactly as designed. Quote: > The question is, basically, why was Ada used for this work?
The failure is not a language issue, this is not the question. -Bob
|
Fri, 15 Jan 1999 03:00:00 GMT |
|
 |
John McCa #3 / 8
|
Quote:
> >JOINT ESA/CNES PRESS RELEASE N 33-96 - Paris, 23 July 1996 > >Ariane 501 - Presentation of Inquiry Board report > >------------------------------------------------------------------- > >Hope this is useful. So basically it _was_ a software fault >---Is this a euphemism for a programming error? because that's >what it was -- a programming error.
Having read the report, I don't consider it to be a programming error, it was a design and management error. It sounds like whoever designed the system didn't pay enough attention to the requirements, and whoever was managing it didn't pay enough attention to its conformance to the requirements. I think the fact that the overflow occurred was not due to a programming oversight, after all the analyses had been done and a decision to not check that variable had been made (*see additional note below), but seeing as that variable should not have been in use at that point, I don't think you can blame whoever wrote that code. Quote: > The error was in assuming that a value would not overflow. >The specific error was that a conversion of a double-precision >floating-point value (~58 significant bits) to 15 significant >bits caused fixed-point overflow. The conversion was not >checked for overflow. It should have been. This is, after all, >a real-time system. It's a fundamental check that a programmer >experienced in real-time systems should have carried out. > Control was then passed to the interrupt handler, which >shut down the system. > The question is, basically, why was Ada used for this work?
ESA Ada preference/mandate(?). <..snip..> *Note: I hope this makes ESA look a bit closer at why they want to limit processor loading and how the margin should be reduced through the design and development phases. My own project has an ESA-enforced limit of 70%, which is quite ridiculous given the equipment we're using (GPS MA31750 10MHz MIL-STD-1750 processor). We cannot meet that but have requested a waiver on that - I believe that's much better than compromising the safety of the mission. ESA's loading margins are really supposed to take account of a requirement for future modifications to software once it has been delivered. There's no way this should have been enforced for Ariane 5. From the sound of the report, I think a pretty poor job has been done, not by the programmers who wrote the code and performed the analysis of what variables could safely be left unchecked; instead I think whoever performed the requirement analysis and all levels of management / reviewers above that have been completely negligent. Best Regards
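As a rough illustration of what a processor loading limit means in practice, the sketch below sums the utilization of a set of periodic tasks and compares it with a 70% ceiling. Only the 70% figure comes from the post above; the task sets, the function names and the simple sum-of-ratios model are assumptions made for the example.

```python
def utilization(tasks):
    """Total CPU utilization of periodic tasks.

    tasks: iterable of (worst_case_exec_ms, period_ms) pairs.
    """
    return sum(exec_ms / period_ms for exec_ms, period_ms in tasks)


def within_margin(tasks, limit=0.70):
    """True if the task set stays at or below the loading limit."""
    return utilization(tasks) <= limit


# Hypothetical task sets, not taken from any real project.
light_load = [(2, 10), (5, 50), (20, 100)]   # 0.2 + 0.1 + 0.2
heavy_load = [(4, 10), (20, 50)]             # 0.4 + 0.4

print(within_margin(light_load))   # fits under the 70% limit
print(within_margin(heavy_load))   # exceeds it, so a check must go or a waiver is needed
```

Every run-time check adds to the execution time of its task, which is how a loading limit can end up pushing a team to leave some conversions unprotected.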
|
Fri, 15 Jan 1999 03:00:00 GMT |
|
 |
++ rob #4 / 8
|
>> >> ---Is this a euphemism for a programming error? because that's >> what it was -- a programming error. >> >> The error was in assuming that a value would not overflow. >The error was assuming that the Ariane 4 design would be adequate >for the Ariane 5 system. >> The specific error was that a conversion of a double-precision >> floating-point value (~58 significant bits) to 15 significant >> bits caused fixed-point overflow. The conversion was not >> checked for overflow. It should have been. >It was checked, hence the exception and an exception handler to >take corrective action. ---The SRI computer (& its backup) had an exception handler, to be sure, but it did not have an exception handler to take corrective action. The exception handler shut the computer down. > Unfortunately the corrective action was >to assume that the SRI had failed and to shut it down. The >software performed exactly as designed. ---The software did not perform as designed. It was intended to shut down the computer only in the event of a hardware error. The software shut down the computer because of a programming error. The software performed only as written! >> This is, after all, >> a real-time system. It's a fundamental check that a programmer >> experienced in real-time systems should have carried out. >> >> Control was then passed to the interrupt handler, which >> shut down the system. >Exactly as designed. ---Again, not as designed. It was designed to shut down only in the event that the SRI computer failed. Then the backup would be used.
|
Sat, 16 Jan 1999 03:00:00 GMT |
|
 |
Bob Gilbe #5 / 8
|
Quote:
> >The error was assuming that the Ariane 4 design would be adequate > >for the Ariane 5 system. > >> The specific error was that a conversion of a double-precision > >> floating-point value (~58 significant bits) to 15 significant > >> bits caused fixed-point overflow. The conversion was not > >> checked for overflow. It should have been. > >It was checked, hence the exception and an exception handler to > >take corrective action. > ---The SRI computer (& its backup) had an exception > handler, to be sure, but it did not have an exception > handler to take corrective action. The exception handler > shut the computer down.
Which was the specified corrective action. Quote: > > Unfortunately the corrective action was > >to assume that the SRI had failed and to shut it down. The > >software performed exactly as designed. > ---The software did not perform as designed. It was > intended to shut down the computer only in the event of > a hardware error.
The out-of-bounds data was considered to be indicative of a random hardware fault, at least for the Ariane 4. Perhaps this was not a valid method of determining a hardware fault, but it was the design decision. Quote: > The software shut down the computer > because of a programming error. The software performed > only as written! > >> This is, after all, > >> a real-time system. It's a fundamental check that a programmer > >> experienced in real-time systems should have carried out. > >> Control was then passed to the interrupt handler, which > >> shut down the system. > >Exactly as designed. > ---Again, not as designed. It was designed to shut down only > in the event that the SRI computer failed. Then the backup > would be used.
Again, the (wrongly assumed) SRI failure was determined by the detection of out of bounds data. It was a requirements oversight, not a programming oversight, and most certainly not influenced by the programming language used. To quote the report: Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure. In the event of any kind of exception, the system specification stated that: the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane 501), and finally, the SRI processor should be shut down. The last sentence of the above is what the requirements stated, and exactly what the software did, exactly as designed. -Bob
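The three-step exception policy quoted from the report can be sketched as follows. This is an illustrative Python model, not the SRI software: the `SRI` class, its attribute names and the 16-bit range check are all assumptions. It shows why a backup unit running identical software offers no protection against an input that is out of range by design rather than by hardware fault.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class SRI:
    """Toy model of one inertial reference unit (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.databus = []   # messages visible to the rest of the system
        self.eeprom = []    # stored failure contexts

    def convert(self, value):
        # Range-checked conversion; NaN and infinities also fail the test.
        if not INT16_MIN <= value <= INT16_MAX:
            raise OverflowError(f"{value!r} does not fit in 16 bits")
        return int(value)

    def step(self, value):
        if not self.running:
            return None
        try:
            return self.convert(value)
        except Exception as exc:   # the spec covers "any kind of exception"
            self.databus.append("FAILURE")                    # 1. indicate failure
            self.eeprom.append({"unit": self.name,
                                "error": type(exc).__name__,
                                "input": value})              # 2. store context
            self.running = False                              # 3. shut down
            return None


primary, backup = SRI("primary"), SRI("backup")
for unit in (primary, backup):
    unit.step(1.0e6)   # same out-of-range input reaches both units

print(primary.running, backup.running)   # both units have shut themselves down
```

The handler does exactly what the quoted specification asks for. The open question in the thread is whether that specification was the right one.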
|
Sun, 17 Jan 1999 03:00:00 GMT |
|
 |
William Clodi #6 / 8
|
One point I would like to emphasize is that the out-of-bounds error occurred in a portion of the software that was not useful once launch commenced. This has several implications:
1. Utilization of this software should have ceased as soon after launch as possible, freeing computational resources as soon as possible.
2. The effect of exception handling on processor utilization for this portion of the software should have been important only during the prelaunch phase, when I suspect processor utilization would have been minimal.
3. The proper action to take in the event of an exception in this portion of the software should be based on what the proper action should be before launch. I would not be surprised to discover that the proper action would be to shut down the processor at that stage.
-- William B. Clodius Phone: (505)-665-9370
Los Alamos, NM 87545
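The first implication above, switching the prelaunch-only function off once it is no longer useful, can be sketched as a phase-gated task. This is a hypothetical Python model; the `Phase` enum, the `PrelaunchOnlyTask` class and the 16-bit range are assumptions for illustration, not details from the report.

```python
from enum import Enum


class Phase(Enum):
    PRELAUNCH = "prelaunch"
    FLIGHT = "flight"


class PrelaunchOnlyTask:
    """A computation that is only meaningful before liftoff."""

    def __init__(self):
        self.enabled = True
        self.calls = 0

    def on_phase(self, phase):
        # Disable the task as soon as the flight phase begins.
        if phase is Phase.FLIGHT:
            self.enabled = False

    def run(self, value):
        if not self.enabled:
            return None   # no work done, so no exception can be raised
        self.calls += 1
        if not -32768 <= value <= 32767:
            raise OverflowError(f"{value!r} does not fit in 16 bits")
        return int(value)


task = PrelaunchOnlyTask()
task.run(10.0)                 # normal prelaunch operation
task.on_phase(Phase.FLIGHT)
task.run(1.0e6)                # ignored: the task is no longer active
```

With the gate in place, an out-of-range value after liftoff never reaches the conversion, so the shutdown policy for this function only has to be right for the prelaunch case.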
|
Sun, 17 Jan 1999 03:00:00 GMT |
|
 |
++ rob #7 / 8
|
>> >> >The error was assuming that the Ariane 4 design would be adequate >> >for the Ariane 5 system. >> >> >> The specific error was that a conversion of a double-precision >> >> floating-point value (~58 significant bits) to 15 significant >> >> bits caused fixed-point overflow. The conversion was not >> >> checked for overflow. It should have been. >> >> >It was checked, hence the exception and an exception handler to >> >take corrective action. >> >> ---The SRI computer (& its backup) had an exception >> handler, to be sure, but it did not have an exception >> handler to take corrective action. The exception handler >> shut the computer down. >Which was the specified corrective action. ---Calling it "corrective" action is stretching the English Language a bit. In no way, shape or form was the action "corrective". >> > Unfortunately the corrective action was >> >to assume that the SRI had failed and to shut it down. The >> >software performed exactly as designed. >> >> ---The software did not perform as designed. It was >> intended to shut down the computer only in the event of >> a hardware error. >The out of bounds data was considered to be indicative of a random hardware >fault, at least for the Ariane 4. Perhaps this was not a valid method >of determining a hardware fault, but it was the design decision. ---Please read what I wrote. The overflow was not a hardware fault. It was a programming error that should not have occurred, bearing in mind the "sudden death" nature of the shutdown in the event of any kind of interrupt.. >> The software shut down the computer >> because of a programming error. The software performed >> only as written! >> >> >> This is, after all, >> >> a real-time system. It's a fundamental check that a programmer >> >> experienced in real-time systems should have carried out. >> >> >> >> Control was then passed to the interrupt handler, which >> >> shut down the system. >> >> >Exactly as designed. >> >> ---Again, not as designed.
It was designed to shut down only >> in the event that the SRI computer failed. Then the backup >> would be used. >Again, the (wrongly assumed) SRI failure was determined by the detection >of out of bounds data. It was a requirements oversight, not a programming >oversight, and most certainly not influenced by the programming language used. ---If you make an assumption about the range of data, and you are wrong, it is a programming error. >To quote the report: > Although the source of the Operand Error has been identified, this in > itself did not cause the mission to fail. The specification of the > exception-handling mechanism also contributed to the failure. In the > event of any kind of exception, the system specification stated that: > the failure should be indicated on the databus, the failure context > should be stored in an EEPROM memory (which was recovered and read out > for Ariane 501), and finally, the SRI processor should be shut down. >The last sentence of the above is what the requirements stated, and >exactly what the software did, exactly as designed. ---Again, the interrupt for fixed-point overflow was not expected to happen. The software DID NOT OPERATE AS DESIGNED. It failed. You're placing too literal an interpretation on the first sentence.
|
Mon, 18 Jan 1999 03:00:00 GMT |
|
 |
roo #8 / 8
|
<snip> Quote: >---Please read what I wrote. The overflow was not a hardware >fault. It was a programming error that should not have occurred, >bearing in mind the "sudden death" nature of the shutdown in the >event of any kind of interrupt..
++robin, please read what the poster wrote ... he was describing a situation where, by spec, the event was deemed to indicate a hardware fault. We can all see clearly that it was not a hardware fault in this case; however, that does not relieve the s/w of its requirement to treat the event as indicative of a hardware fault. btw: A 'spec' is when a customer tells you what he thinks he wants. You may or may not agree with his interpretation of what he wants, but if you want the work, you promise to deliver what he SAYS! he wants - even if it is wrong - unless you can convince him to fix his wrong 'spec'. The embedded systems world uses 'spec' to define a 'design'; then the customer gets to{*filter*}in the design as well. <snip> Quote: >---If you make an assumption about the range of data, >and you are wrong, it is a programming error.
Unless the 'spec'/'design' require you to make that assumption ... <snip> Quote: >---Again, the interrupt for fixed-point overflow was >not expected to happen. The software DID NOT OPERATE >AS DESIGNED. It failed. You're placing too literal an >interpretation on the first sentence.
I believe the report clearly indicates that software operated per design. The fault lies with adapting existing software to a new mission, without doing sufficient system engineering to see where the old design needed to be beefed up to meet the new mission! Re: your favorite language & embedded systems ... is that all a troll, or what ? regards
|
Tue, 19 Jan 1999 03:00:00 GMT |
|