Weird problem, advice requested 

I'm working on an application with a VisualWorks 2.51 front end and a
GemStone 5.1 backend.  The frontend runs on Win32 systems, the backend
on HPUX.

Our testing PCs are mostly relatively new Pentium 2s.  The problem has
definitely occurred on recent P2s (I think 300s, though I won't
swear to that).

Our application used custom marshalling code, written in Smalltalk.
I've recently replaced the GemStone Smalltalk marshalling code with
user actions in a C shared library. I also upgraded the VisualWorks
client (general cleanup, also I made a few modifications to the
marshalling format).

We've tested the frontend on Windows NT, 95a, 95b and 98.
On NT, our primary development platform, after some bug fixes,
the app ran perfectly on the new library.

On Win9x, our primary deployment platform, we've had very mixed
results.  On some machines, the application fails while opening a UI,
reporting a primitive failure.  The failure is somewhat intermittent;
we've seen cases in which an image that was failing begins to work.

The app, rebuilt from Envy with the old all-Smalltalk marshalling
code, works on all platforms.  Remember, though, the only new C
code is an HPUX shared library that's called from GemStone; there's
no new non-Smalltalk client code.

Although the actual error reported is a primitive failure, I don't
think that's the real problem.  Here's why:

The most common failure (the one we use as our typical test case)
reports a primitive failure while displaying a border around a text
widget. It's trying to draw a ridiculously large X extent.

If you then look at the calculated display box, it's clearly wrong.
This particular widget is described using a float representing the
percentage of its surrounding window to be used for the X extent
of the widget (0.40....).  If you inspect that Float, the bytes appear
correct, but the printString method generates a string with extra
leading zeroes.  If you reinspect it a time or two, you get a walkback
indicating a problem normalizing a number.  After that, the number
returns to normal.
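
One way to confirm that the bytes really are correct, independently of
printString, is to dump the raw bit pattern and compare it with the
expected IEEE 754 encoding.  The following is only a sketch in C, and it
assumes the Float in question is a 32-bit IEEE 754 single (an assumption
about the VisualWorks Float representation, not something verified
here).  float_bits and float_exponent are hypothetical helper names.
For a value such as 5.0 the expected pattern is 0x40A00000, with an
unbiased exponent of 2.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a 32-bit float.
   memcpy is used so the reinterpretation is well defined in C. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: extract the unbiased binary exponent of a
   normalized IEEE 754 single (bits 23..30, bias 127). */
static int float_exponent(float f)
{
    return (int)((float_bits(f) >> 23) & 0xFFu) - 127;
}
```

If the pattern read out of the failing image matches the expected one,
the stored object is intact and the fault lies in the arithmetic done
on it rather than in the object itself.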

If you recompute the display box, it generates the correct result,
and the window can be opened.

As a check on this display box calculation, I had the developer
change the widget to use a pixel-based layout.  I expected the
application to fail when computing the next widget that used a
Float percentage for extent.  It didn't; the app returned to normal.
(The window did have another widget, later in the opening sequence,
whose width was described by a Float).

That image later failed on other machines, though, so this was not a
general fix.

We've also seen the version number of our application, 5.0, stored in
a Float, misprinted in our error log.  Again the problem was a large
number of leading zeroes.  Again the inspector behaved in the same
peculiar way if we inspected the application version in an image that
had experienced the primitive failure.

I've also seen a failure in the development environment browsing
for a method that has a Float literal (the literal is used in the hash
method for methods).

The peculiar nature of these failures leads me to suspect an
occasional fault in the Float multiplication by an Integer primitive.
It's the common operation in calculating the display box, printString
on Float, Float normalization, and hash of methods with a Float
literal.

Even in an image that has experienced the primitive failure, though,
floating point calculations do not generally fail.

These problems suggest a memory overwrite of some sort, though only
on Win9x, and never on WinNT.  We only use 2 simple user DLLs on the
client, though, along with the standard GemStone DLLs.  The user DLLs
were not modified for this application upgrade and have been running
in their current configuration for over a year.

Just in case, though, I removed the client user DLLs from the
application, but that did not prevent the same failure from recurring.

The application has no other means of directly manipulating memory.
Everything else is standard Smalltalk and GemStone.

Also: the application has successfully retrieved some data from
GemStone via the new marshalling layer under Win9x before failing.

It would be quite difficult to modify the application to eliminate
any data transfer before opening the window that causes the failure.

Cincom support is still thinking matters over.

Any suggestions on debugging technique?

One possibility that comes to mind is getting Cincom to create a
VisualWorks 2.51 engine that does extra checking in the Float multiply
primitives, so that we can at least catch the actual error as it
occurs.
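
To make the idea concrete, here is a rough sketch in C of what such a
check could look like.  It is not the VisualWorks engine's code:
checked_float_times_int is a hypothetical name, the Float is assumed to
be a 32-bit IEEE 754 single, and the tolerance is an arbitrary choice.
The single-precision product is cross-checked against the same product
computed in double precision, and anything non-finite or far from the
reference is reported.

```c
#include <float.h>

/* Absolute value without pulling in the math library. */
static double abs_double(double x)
{
    return x < 0.0 ? -x : x;
}

/* Hypothetical checked multiply.  Stores a * n in *out and returns 1
   if the result looks plausible, 0 if it is NaN, infinite, or differs
   from a double-precision reference by more than the tolerance.
   Note that a genuine overflow is also reported as 0. */
static int checked_float_times_int(float a, int n, float *out)
{
    /* volatile forces the product to be rounded to single precision
       even if the compiler evaluates in a wider format. */
    volatile float product = a * (float)n;
    double reference = (double)a * (double)n;
    /* Assumed tolerance: about ten single-precision ulps, plus a small
       floor so results near zero are not rejected. */
    double tolerance = abs_double(reference) * 1e-6 + (double)FLT_MIN;
    float result = product;

    *out = result;
    if (result != result || result > FLT_MAX || result < -FLT_MAX)
        return 0;   /* NaN or infinity */
    return abs_double((double)result - reference) <= tolerance;
}
```

A failing check could then log both operands and the bad result at the
moment of the fault, instead of the error surfacing later as a
primitive failure in the display code.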

Has anyone seen this sort of peculiar floating point error on recent
P2s?

Has anyone used a low-level machine debugger to look for memory
overwrites in a VisualWorks application?



Wed, 18 Jun 1902 08:00:00 GMT  
 