Converting unsigned char to signed char invokes UB? 

"C Unleashed" Chapter 2, page 67 says:

Here's an illustration of how you can ensure that your code is robust:

void MakeStringUpperCase(char *s)
{
    while(*s)
    {
        *s = toupper((unsigned char)*s);
        ++s;
    }
}
}

Apart from the extra closing brace, which is just a typo (and NOT on the
Errata list), I have a question about the code itself.

In the following I will assume unqualified `char` is signed.

As C99 7.4 says of the character handling functions, "In all cases the
argument is an int, the value of which shall be representable as an unsigned
char or shall equal the value of the macro EOF. If the argument has any
other value, the behavior is undefined."
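The 7.4 requirement can be written down as a predicate. This is only a minimal sketch; `is_valid_ctype_arg` is a name made up for illustration and is not a standard function.

```c
#include <limits.h>
#include <stdio.h>   /* for EOF */

/* Hypothetical helper: returns 1 if c may legally be passed to the
   <ctype.h> functions under C99 7.4, and 0 if passing it would be
   undefined behaviour. */
int is_valid_ctype_arg(int c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}
```

A plain char holding a negative value other than EOF fails this test, which is why the cast to unsigned char is there.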

I have no problem with the cast to unsigned char for the reasons stated in
the text.

As C99 6.3.1.3 says: "Otherwise, if the new type is unsigned, the value is
converted by repeatedly adding or subtracting one more than the maximum
value that can be represented in the new type until the value is in the
range of the new type."

Therefore all the char values between SCHAR_MIN and -1 are remapped by the cast
into the range SCHAR_MAX+1 to UCHAR_MAX.
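The remapping can be spelled out arithmetically. The sketch below assumes that int can hold UCHAR_MAX + 1, which is true on ordinary hosted systems but not guaranteed; `as_unsigned_char_value` is an illustrative name.

```c
#include <limits.h>

/* Value a signed char has after conversion to unsigned char, computed
   as C99 6.3.1.3 describes: add UCHAR_MAX + 1 until the value is in
   range. One addition is enough for any signed char.
   Assumption: int can represent UCHAR_MAX + 1. */
int as_unsigned_char_value(signed char c)
{
    int v = c;
    if (v < 0)
        v += UCHAR_MAX + 1;
    return v;
}
```

With 8-bit chars, -5 maps to 251 and -1 maps to 255; non-negative values are left alone.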

But C99 7.4.2.2 says, "If the argument is a character for which islower is
true and there are one or more corresponding characters, as specified by the
current locale, for which isupper is true, the toupper function returns one
of the corresponding characters (always the same one for any given locale);
otherwise, the argument is returned unchanged."

Therefore in many cases, the return value from toupper, which has type int,
will be unchanged, in the range SCHAR_MAX+1 to UCHAR_MAX, which will not fit
into the (signed) char object which `s` is pointing to.

As C99 6.3.1.3 goes on, "Otherwise, the new type is signed and the value
cannot be represented in it; either the result is implementation-defined or
an implementation-defined signal is raised."

... And I hope you're still with me when I therefore state the conclusion:
Richard's code is flawed.

My questions are: Did anyone follow that? Did it make sense? Have I made one
or more terrible mistakes? Is this code actually flawed?

--
Simon.



Sat, 19 Jun 2004 04:37:00 GMT  

Quote:

> "C Unleashed" Chapter 2, page 67 says:

> Here's an illustration of how you can ensure that your code is robust:

> void MakeStringUpperCase(char *s)
> {
>     while(*s)
>     {
>         *s = toupper((unsigned char)*s);
>         ++s;
>     }
> }
> }

> Apart from the extra closing brace, which is just a typo (and NOT on the
> Errata list),

Damn. Nice spot. (Writes down a mental note to self...)

Quote:
> I have a question about the code itself.

> In the following I will assume unqualified `char` is signed.

> As C99 7.4 says of the character handling functions, "In all cases the
> argument is an int, the value of which shall be representable as an unsigned
> char or shall equal the value of the macro EOF. If the argument has any
> other value, the behavior is undefined."

> I have no problem with the cast to unsigned char for the reasons stated in
> the text.

Okay so far, then...

Quote:

> As C99 6.3.1.3 says: "Otherwise, if the new type is unsigned, the value is
> converted by repeatedly adding or subtracting one more than the maximum
> value that can be represented in the new type until the value is in the
> range of the new type."

> Therefore all the chars which are between SCHAR_MIN and -1 are remapped from
> SCHAR_MAX+1 to UCHAR_MAX by the cast.

Yes...

Quote:

> But C99 7.4.2.2 says, "If the argument is a character for which islower is
> true and there are one or more corresponding characters, as specified by the
> current locale, for which isupper is true, the toupper function returns one
> of the corresponding characters (always the same one for any given locale);
> otherwise, the argument is returned unchanged."

> Therefore in many cases, the return value from toupper, which has type int,
> will be unchanged, in the range SCHAR_MAX+1 to UCHAR_MAX, which will not fit
> into the (signed) char object which `s` is pointing to.

> As C99 6.3.1.3 goes on, "Otherwise, the new type is signed and the value
> cannot be represented in it; either the result is implementation-defined or
> an implementation-defined signal is raised."

> ... And I hope you're still with me when I therefore state the conclusion:
> Richard's code is flawed.

Well, it's not impossible; but if that is the case, then the clc
consensual view is also flawed.

My own opinion (FWIW) is that, if the "C" locale is in effect, the
behaviour is well-defined because the only characters which can be
converted by toupper() are all representable as positive chars.

Quote:
> My questions are: Did anyone follow that? Did it make sense? Have I made one
> or more terrible mistakes? Is this code actually flawed?

If it /is/ flawed, then it is only flawed in that it does not take into
account the possibility of a different locale than "C" being in effect.
Since I've never, ever, ever been required to use setlocale() in any C
program I've ever written, I wouldn't be at all surprised if that
particular part of my C knowledge is full of holes. I therefore await
followup discussions with interest (and trepidation!).

--

"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton



Sat, 19 Jun 2004 07:04:29 GMT  

Quote:


> > "C Unleashed" Chapter 2, page 67 says:

> > Here's an illustration of how you can ensure that your code
> > is robust:

> > void MakeStringUpperCase(char *s)
> > {
> >     while(*s)
> >     {
> >         *s = toupper((unsigned char)*s);
> >         ++s;
> >     }
> > }
[snip]
> > But C99 7.4.2.2 says, "If the argument is a character for which
> > islower is true and there are one or more corresponding
> > characters, as specified by the current locale, for which isupper
> > is true, the toupper function returns one of the corresponding
> > characters (always the same one for any given locale); otherwise,
> > the argument is returned unchanged."

> > Therefore in many cases, the return value from toupper, which has
> > type int, will be unchanged, in the range SCHAR_MAX+1 to UCHAR_MAX,
> > which will not fit into the (signed) char object which `s` is
> > pointing to.

> > As C99 6.3.1.3 goes on, "Otherwise, the new type is signed and the
> > value cannot be represented in it; either the result is
> > implementation-defined or an implementation-defined signal is
> > raised."

> > ... And I hope you're still with me when I therefore state the
> > conclusion: Richard's code is flawed.

> Well, it's not impossible; but if that is the case, then the clc
> consensual view is also flawed.

Until and unless I'm presented with evidence to the contrary, my point is
that the clc consensual view may be flawed.

Quote:
> My own opinion (FWIW) is that, if the "C" locale is in effect, the
> behaviour is well-defined because the only characters which can be
> converted by toupper() are all representable as positive chars.

Ah, but that's where you didn't get it. The very cases I'm talking about are
those where the character is not converted by toupper(). As you say above on
that page, the user might have used ALT-keypad to enter in negative chars,
which will be positive above the signed char range when returned unchanged
by toupper(), and therefore cause implementation-defined behaviour when
converted back to char.

Quote:
> If it /is/ flawed, then it is only flawed in that it does not
> take into account the possibility of a different locale than
> "C" being in effect. Since I've never, ever, ever been required
> to use setlocale() in any C program I've ever written, I wouldn't
> be at all surprised if that particular part of my C knowledge is
> full of holes. I therefore await followup discussions with
> interest (and trepidation!).

I don't think it's necessary to change the locale to observe this effect.

/* test toupper() */

#include <ctype.h>
#include <stdio.h>
#include <limits.h>

int main(void)
{
    char a = -5;
    int b = toupper((unsigned char)a);
    printf("SCHAR_MAX = %d, UCHAR_MAX = %d\n", SCHAR_MAX, UCHAR_MAX);
    printf("toupper((unsigned char)-5) = %d\n", b);
    printf("or, converted to char, %d\n", (char)b);
    return 0;
}

On my system I get the output:

SCHAR_MAX = 127, UCHAR_MAX = 255
toupper((unsigned char)-5) = 251
or, converted to char, -5

But my point is simply that it might not convert correctly back to -5, since
251 is above 127.

--
Simon.



Sat, 19 Jun 2004 08:25:04 GMT  
Kinda silly, when theories aren't applied correctly.
Programming is logical, so let's take this a step at a time.

signed char    -5
signed char bits 11111010
unsigned char 250 (Not 251 as you posted)
unsigned char bits 11111010

a logical operator of not

The NOT operator
-5 = 5

Bitwise
11111010 = 00000101

Even on conversion to integer

(int)x=(char)-5
(1111111111111010) = (11111010)

And this is why there is no flaw =)

Quote:

> Until and unless I'm presented with evidence to the contrary, my point is
> that the clc consensual view may be flawed.

> > My own opinion (FWIW) is that, if the "C" locale is in effect, the
> > behaviour is well-defined because the only characters which can be
> > converted by toupper() are all representable as positive chars.

> Ah, but that's where you didn't get it. The very cases I'm talking about are
> those where the character is not converted by toupper(). As you say above on
> that page, the user might have used ALT-keypad to enter in negative chars,
> which will be positive above the signed char range when returned unchanged
> by toupper(), and therefore cause implementation-defined behaviour when
> converted back to char.

> > If it /is/ flawed, then it is only flawed in that it does not
> > take into account the possibility of a different locale than
> > "C" being in effect. Since I've never, ever, ever been required
> > to use setlocale() in any C program I've ever written, I wouldn't
> > be at all surprised if that particular part of my C knowledge is
> > full of holes. I therefore await followup discussions with
> > interest (and trepidation!).

> I don't think it's necessary to change the locale to observe this effect.

> /* test toupper() */

> #include <ctype.h>
> #include <stdio.h>
> #include <limits.h>

> int main(void)
> {
>     char a = -5;
>     int b = toupper((unsigned char)a);
>     printf("SCHAR_MAX = %d, UCHAR_MAX = %d\n", SCHAR_MAX, UCHAR_MAX);
>     printf("toupper((unsigned char)-5) = %d\n", b);
>     printf("or, converted to char, %d\n", (char)b);
>     return 0;
> }

> On my system I get the output:

> SCHAR_MAX = 127, UCHAR_MAX = 255
> toupper((unsigned char)-5) = 251
> or, converted to char, -5

> But my point is simply that it might not convert correctly back to -5, since
> 251 is above 127.



Sat, 19 Jun 2004 14:33:27 GMT  

Quote:

> Kinda silly, when theories aren't applied correctly.
> Programming is logical, so lets take this a step at a time

> signed char    -5
> signed char bits 11111010

That's one's complement notation, where (~5 == -5).
In two's complement notation, (~5 + 1 == -5), which gives 11111011, or 251.
I've worked with machines that use two's complement, but
I've never worked with any that used one's complement.
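The bit patterns can be checked without owning a one's complement machine. A minimal sketch, using unsigned arithmetic so the result does not depend on the host; the function names are made up for illustration and an 8-bit byte is assumed.

```c
/* 8-bit patterns that store -n (1 <= n <= 127) under the three signed
   representations C99 permits. Everything is done in unsigned
   arithmetic and masked to 8 bits, so the host's own representation
   does not matter. */
unsigned twos_complement_bits(unsigned n) { return (~n + 1u) & 0xFFu; }
unsigned ones_complement_bits(unsigned n) { return ~n & 0xFFu; }
unsigned sign_magnitude_bits(unsigned n)  { return 0x80u | n; }
```

For n = 5 these give 11111011 (251), 11111010 (250) and 10000101 (133) respectively.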

Quote:
> unsigned char 250 (Not 251 as you posted)
> unsigned char bits 11111010

> a logical operator of not

> The NOT operator
> -5 = 5

logical not !
bitwise not ~
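A short sketch of the difference, again with an unsigned operand masked to 8 bits so that the answer is the same everywhere. The function names are illustrative only.

```c
/* Logical NOT yields 0 or 1 depending on whether the operand is zero.
   Bitwise NOT flips every bit of the operand. */
int logical_not(int x)            { return !x; }
unsigned bitwise_not8(unsigned x) { return ~x & 0xFFu; }
```

Here logical_not(5) is 0, while bitwise_not8(5) is 11111010, that is 250. Neither operator negates a number.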

--
 pete



Sat, 19 Jun 2004 20:17:10 GMT  


Quote:

> > My own opinion (FWIW) is that, if the "C" locale is in effect, the
> > behaviour is well-defined because the only characters which can be
> > converted by toupper() are all representable as positive chars.

> Ah, but that's where you didn't get it. The very cases I'm talking about are
> those where the character is not converted by toupper(). As you say above on
> that page, the user might have used ALT-keypad to enter in negative chars,
> which will be positive above the signed char range when returned unchanged
> by toupper(), and therefore cause implementation-defined behaviour when
> converted back to char.

Surely the point is that 251 does fit in a char, albeit a negative value. So
the unchanged value (-5 or whatever) is just that, unchanged.

John



Sat, 19 Jun 2004 22:54:10 GMT  


Quote:




>> > My own opinion (FWIW) is that, if the "C" locale is in effect, the
>> > behaviour is well-defined because the only characters which can be
>> > converted by toupper() are all representable as positive chars.

>> Ah, but that's where you didn't get it. The very cases I'm talking about are
>> those where the character is not converted by toupper(). As you say above on
>> that page, the user might have used ALT-keypad to enter in negative chars,
>> which will be positive above the signed char range when returned unchanged
>> by toupper(), and therefore cause implementation-defined behaviour when
>> converted back to char.

>Surely the point is that 251 does fit in a char, albeit a negative value. So
>the unchanged value (-5 or whatever) is just that, unchanged.

Well, it doesn't fit in a char if 251 is > SCHAR_MAX.  C99 seems to
say that when this happens (when converting to a new type that is
signed), it's up to the implementation as to how to proceed: the new
value is -5 (or whatever) or an implementation-defined signal (yuck!)
is raised (per 6.3.1.3).

How about this?

#include <ctype.h>
#include <limits.h>

void MakeStringUpperCase(char *s)
{
   int tmp;
   while(*s)
   {
      *s = (tmp=toupper((unsigned char)*s)) > SCHAR_MAX ?
             SCHAR_MAX : tmp;
      ++s;
   }
}

I don't like it, but at least it avoids the (apparent)
implementation-defined behavior.

Russ



Sun, 20 Jun 2004 02:22:25 GMT  

Quote:

> Surely the point is that 251 does fit in a char, albeit a
> negative value. So the unchanged value (-5 or whatever)
> is just that, unchanged.


Quote:
> Well, it doesn't fit in a char if 251 is > SCHAR_MAX.  C99
> seems to say that when this happens (when converting to a
> new type that is signed), it's up to the implementation as
> to how to proceed: the new value is -5 (or whatever) or an
> implementation-defined signal (yuck!) is raised (per 6.3.1.3).

That's my understanding.

[Snip Russ's code that sets to SCHAR_MAX when above that value]

Quote:
> I don't like it, but at least it avoids the (apparent)
> implementation-defined behavior.

I don't like it either.

What about:

#include <ctype.h>
#include <limits.h>

char ConvertToChar(int a)
{
   while(a<SCHAR_MIN) a += UCHAR_MAX+1;
   while(a>SCHAR_MAX) a -= UCHAR_MAX+1;
   return (char)a;
}

void MakeStringUpperCase(char *s)
{
   while(*s)
   {
      *s = ConvertToChar(toupper((unsigned char)*s));
      ++s;
   }
}

--
Simon.


Sun, 20 Jun 2004 03:14:25 GMT  


Quote:

> >Surely the point is that 251 does fit in a char, albeit a negative value. So
> >the unchanged value (-5 or whatever) is just that, unchanged.

> Well, it doesn't fit in a char if 251 is > SCHAR_MAX.  C99 seems to
> say that when this happens (when converting to a new type that is
> signed), it's up to the implementation as to how to proceed: the new
> value is -5 (or whatever) or an implementation-defined signal (yuck!)
> is raised (per 6.3.1.3).

But 251 == -5 for a signed char, so is still within a signed char's range.
No?

I'm not sure quite how a compiler would deal with that statement from the
C99 standard.

toupper will just return -5 so then the compiler (implementation-defined)
will ordinarily set the signed char to -5, except that a perverse compiler
;) will raise a signal, right?

As you say, yuck!

John



Sun, 20 Jun 2004 05:16:28 GMT  

Quote:

> But 251 == -5 for a signed char, so is still within a signed
> char's range. No?

251 is not equal to -5. 251 cannot be represented in a signed char.

Let's give an example using one's complement.

The signed char value -5 is represented as 11111010 in 8-bit one's
complement.

When converting to unsigned char, C99 6.3.1.3 says: "Otherwise, if the new
type is unsigned, the value is converted by repeatedly adding or subtracting
one more than the maximum value that can be represented in the new type
until the value is in the range of the new type."

So, we add 256 to -5, getting 251, which has representation 11111011 in
8-bit unsigned. Note that the representation changes when we use one's
complement. It does not change when two's complement is used.

Now, 251 is converted to an int, which changes nothing since it fits in an
int (on our hypothetical platform, at least; it's not guaranteed by the
standard [*]). The int 251 is passed to toupper, which finds it's not a
convertible character, and returns it unchanged.

The value returned by toupper is still the int 251.

We now convert the int 251 to signed char. 251 does not fit within signed
char, so it may be converted in an implementation-defined manner, or a
signal may be raised.

As far as I can see, there may be two common ways the implementation might
convert 251 to signed char.

1. It could do as it did in the signed to unsigned conversion, so subtract
256, to get -5, which has representation 11111010 in one's complement. This
maintains the correct value.

2. It could just leave the representation unchanged, that is, 11111011 which
is -4 in one's complement. This gives a 'wrong' value.

Notice that case 2 means the least work for the compiler, as there is no
'conversion' necessary. It's also perfectly valid standard-wise. But, it
means that our character value has changed! The character has been mangled
in the conversion to unsigned and back again. This is a potential case where
my "YUCK" may occur.
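The two cases can be simulated with plain arithmetic. The sketch below models a hypothetical 8-bit one's complement machine; all function names are invented for illustration, and nothing in it depends on how the host itself represents negative numbers.

```c
/* Bit pattern that stores value v (-127..127) in 8-bit one's complement. */
unsigned oc_encode(int v)
{
    return v >= 0 ? (unsigned)v : (~(unsigned)(-v)) & 0xFFu;
}

/* Value that a one's complement machine reads from bit pattern b. */
int oc_decode(unsigned b)
{
    return (b & 0x80u) ? -(int)(~b & 0xFFu) : (int)b;
}

/* signed char -> unsigned char: C99 6.3.1.3 requires adding 256. */
unsigned to_unsigned(int v)
{
    return v < 0 ? (unsigned)(v + 256) : (unsigned)v;
}

/* Case 1: undo the arithmetic, subtracting 256 when out of range. */
int back_by_value(unsigned u)
{
    return u > 127u ? (int)u - 256 : (int)u;
}

/* Case 2: keep the bits and reinterpret them as one's complement. */
int back_by_bits(unsigned u)
{
    return oc_decode(u);
}
```

Starting from -5 (stored as 11111010), to_unsigned gives 251 (11111011). back_by_value(251) returns -5, while back_by_bits(251) returns -4, which is the mangled character described above.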

Quote:
> I'm not sure quite how a compiler would deal with that statement from the
> C99 standard.

It's not likely to be a problem on two's complement machines, since the
representation can be left alone both ways.

But on a one's complement machine, the standard guarantees that the
representation must change when converted from signed to unsigned. There is
a natural choice whether to change it back or leave it alone when converted
back to signed.

And on a signed magnitude implementation, the possibilities are even more
open. A complete representation change is necessary when converting to
unsigned char (inverting most of the bits). But, who knows what happens on
the way back?

Quote:
> toupper will just return -5 so then the compiler (implementation-defined)
> will ordinarily set the signed char to -5, except that a perverse compiler
> ;) will raise a signal, right?

toupper will not return -5. It will return the argument unchanged, that is
251 (as an int). That int is then converted to a signed char in the
compiler's implementation-defined way, or a signal is raised.

Quote:
> As you say, yuck!

Yes.

[*] Of course, 251 must fit into an int since INT_MAX >= 32767. But on a
platform where sizeof(int)==1, negative char values converted to unsigned
char can exceed INT_MAX; they will not fit into an int, and so are invalid
inputs to the toupper function.

--
Simon.



Sun, 20 Jun 2004 10:29:10 GMT  


Quote:




>> > My own opinion (FWIW) is that, if the "C" locale is in effect, the
>> > behaviour is well-defined because the only characters which can be
>> > converted by toupper() are all representable as positive chars.

>> Ah, but that's where you didn't get it. The very cases I'm talking about are
>> those where the character is not converted by toupper(). As you say above on
>> that page, the user might have used ALT-keypad to enter in negative chars,
>> which will be positive above the signed char range when returned unchanged
>> by toupper(), and therefore cause implementation-defined behaviour when
>> converted back to char.

>Surely the point is that 251 does fit in a char, albeit a negative value. So
>the unchanged value (-5 or whatever) is just that, unchanged.

251 and -5 are different values. If you change 251 to -5 then you've done
just that: changed the value. If CHAR_MAX happens to be 127 on a particular
implementation then char simply cannot represent the value 251 on that
implementation. It will be able to represent the value -5 but -5 is not 251.

C has a problem on 1's complement and sign-magnitude in that a conversion
of a negative value from a signed integer type to the corresponding
unsigned type must change the representation in the byte. The question is
what the implementation does for the reverse conversion. The simplest thing
to do would be to preserve the representation but in that case the
conversion sequence  signed char-->unsigned char-->signed char may produce
a result that is different to the original value. Such implementations
would have to consider performing the reverse change of representation.
Even this fails for "negative zero" although that isn't really usable for
representing a separate character.




Sun, 20 Jun 2004 00:51:44 GMT  

Quote:


> > But 251 == -5 for a signed char, so is still within a signed
> > char's range. No?

> 251 is not equal to -5. 251 cannot be represented in a signed char.

(on a system with 8-bit chars)


Sun, 20 Jun 2004 14:05:18 GMT  


Quote:


> C has a problem on 1's complement and sign-magnitude in that a conversion
> of a negative value from a signed integer type to the corresponding
> unsigned type must change the representation in the byte. The question is
> what the implementation does for the reverse conversion. The simplest thing
> to do would be to preserve the representation but in that case the
> conversion sequence  signed char-->unsigned char-->signed char may produce
> a result that is different to the original value. Such implementations
> would have to consider performing the reverse change of representation.
> Even this fails for "negative zero" although that isn't really usable for
> representing a separate character.

Thanks. My brain does not work in 1's complement mode, yet! :)

John



Sun, 20 Jun 2004 15:03:17 GMT  


Thanks for the explanation. So the original code posted in this thread will
not be safe then.

John



Sun, 20 Jun 2004 15:05:04 GMT  

Quote:

> "C Unleashed" Chapter 2, page 67 says:

> Here's an illustration of how you can ensure that your code is robust:

> void MakeStringUpperCase(char *s)
> {
>     while(*s)
>     {
>         *s = toupper((unsigned char)*s);
>         ++s;
>     }
> }
> }

> Apart from the extra closing brace, which is just a typo (and NOT on the
> Errata list), I have a question about the code itself.
> As C99 6.3.1.3 goes on, "Otherwise, the new type is signed and the value
> cannot be represented in it; either the result is implementation-defined or
> an implementation-defined signal is raised."

> ... And I hope you're still with me when I therefore state the conclusion:
> Richard's code is flawed.

Yeessss... on the other hand, it's probably the least flawed way of
writing this.
Passing a plain, possibly signed, char to toupper() is wrong to begin
with, and rather more likely to bomb than the code as stands - consider
the very common implementation of toupper() using an array.
Casting the return value of toupper() to char buys you, of course,
exactly nuffink, as do similar methods; you still end up with a possibly
out-of-range unsigned char to be converted to a possibly signed char.
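To see why the array implementation matters, here is a rough sketch of a table-driven toupper. It is hypothetical: real libraries differ, the names are invented, ASCII letter codes are assumed, and the extra slot usually reserved for EOF is left out.

```c
#include <limits.h>
#include <stddef.h>

/* One slot per unsigned char value. Hypothetical, "C" locale only. */
static unsigned char upper_table[UCHAR_MAX + 1];

void init_upper_table(void)
{
    size_t i;
    for (i = 0; i <= UCHAR_MAX; ++i)
        upper_table[i] = (unsigned char)i;       /* default: unchanged */
    for (i = 'a'; i <= 'z'; ++i)   /* contiguous in ASCII, not in EBCDIC */
        upper_table[i] = (unsigned char)(i - 'a' + 'A');
}

/* Valid only for 0 <= c <= UCHAR_MAX. A negative c, such as a plain
   char holding -5, would index before the start of the array. */
int table_toupper(int c)
{
    return upper_table[c];
}
```

After init_upper_table(), table_toupper(251) returns 251 unchanged, whereas table_toupper(-5) would read outside the table.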

If anyone knows of a way to do this that doesn't invoke UB anywhere,
I'll be happy to try it, but I'm not sure it can be done.
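One candidate, offered as a sketch rather than a settled answer, is to read and write the string through an unsigned char pointer, so that no out-of-range value is ever converted to a signed type. The name `MakeStringUpperCaseU` is invented here. On a two's complement implementation this sees the same values as the cast in the original; on one's complement or sign-magnitude it works on the stored representation instead, so it may classify negative characters differently. It also does nothing for platforms where unsigned char values can exceed INT_MAX.

```c
#include <ctype.h>

/* Accesses each byte as unsigned char. The int returned by toupper is
   converted to unsigned char, which is always well-defined, and no
   conversion to a signed type takes place. */
void MakeStringUpperCaseU(char *s)
{
    unsigned char *p = (unsigned char *)s;
    while (*p)
    {
        *p = (unsigned char)toupper(*p);
        ++p;
    }
}
```

Accessing a char object through an unsigned char lvalue is permitted, since character types may be used to access any object.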

Richard



Sun, 20 Jun 2004 16:41:58 GMT  
 