UNICODE file handling - possible bug 

I am using the following program to read the first line of a text file
created with Notepad on Windows 2000 SP3. The compiler is VC7.

Input file unicode.txt is:
1234 line 1
line 2
line 3

If the file is saved as ANSI, the output is correct as:
1234 line 1

If the file is saved as UNICODE, the output is:
<garbage character>1

If the file is saved as UTF-8, the output is:
<2 garbage characters>1234 line 1

If the file is saved as UNICODE big-endian, the output is:
<garbage character>

This was coded in a 'Console' project and is a UNICODE project (defined in
project properties).

int _tmain(int argc, _TCHAR* argv[])
{
    FILE *stream;
    TCHAR line[100];

    if( (stream = _tfopen( _T("unicode.txt"), _T("r") )) != NULL )
    {
        if( _fgetts( line, 100, stream ) == NULL)
            _tprintf(_T("fgets error\n"));
        else
            _tprintf(_T("_fgetts: %s\n"),line);
        fclose( stream );
    }

    Sleep(5000); // so we can see the output when running in debug mode
    return 0;
}



Sat, 08 Oct 2005 18:56:45 GMT  
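A Unicode (UTF-16) text file saved by Notepad starts with the bytes FF FE
(little-endian) or FE FF (big-endian). This is the byte order mark (BOM),
U+FEFF: it defines the endianness of the file and also helps tell the world
'hey, this file is Unicode!' U+FEFF is set aside for this purpose, and its
byte-swapped form U+FFFE is a noncharacter, so the two byte orders cannot be
mistaken for each other. Your program is reading those bytes as if they were
text.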
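
To make the byte patterns concrete, here is a minimal sketch in portable C that looks at the first bytes of a file and reports which BOM is present. The names bom_kind and detect_bom are made up for this illustration and are not part of the CRT. It assumes the bytes were read from a stream opened in binary mode ("rb"), and it does not try to recognise UTF-32.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch: the encodings Notepad can mark with a BOM. */
typedef enum {
    BOM_NONE,      /* no BOM found: ANSI or unknown */
    BOM_UTF8,      /* EF BB BF */
    BOM_UTF16_LE,  /* FF FE ("Unicode" in Notepad) */
    BOM_UTF16_BE   /* FE FF ("Unicode big endian" in Notepad) */
} bom_kind;

/* Inspect the first len bytes of a file and report which BOM, if any, is there.
   *skip receives the number of BOM bytes to step over before the text starts. */
bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *skip)
{
    assert(buf != NULL || len == 0);
    assert(skip != NULL);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *skip = 2;
        return BOM_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *skip = 2;
        return BOM_UTF16_BE;
    }

    *skip = 0;
    return BOM_NONE;
}
```

A caller would read the first three bytes with fread, pass them to detect_bom, and then decode the rest of the file according to the result instead of printing the mark as text.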



Sat, 08 Oct 2005 19:04:53 GMT  

Quote:

> Input file unicode.txt is:
> 1234 line 1
> line 2
> line 3

> If the file is saved as ANSI, the output is correct as:
> 1234 line 1

> If the file is saved as UNICODE, the output is:
> <garbage character>1

> If the file is saved as UTF-8, the output is:
> <2 garbage characters>1234 line 1

Yes, the BOM (byte order mark) is the answer: the garbage characters are the
mark that Notepad writes at the start of the file.
http://www.unicode.org/faq/utf_bom.html#22
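
One possible way to read the first line of the UTF-16 file without the mark showing up is to open the stream in binary mode ("rb") and decode the 16-bit code units by hand. The sketch below is portable C rather than the TCHAR code from the question, and read_first_utf16_line is a hypothetical helper written for this illustration. It assumes little-endian data when no BOM is present and does not handle UTF-8 or ANSI files.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read the first line of UTF-16 text from a stream
   opened in binary mode and positioned at the start of the file.
   A leading BOM is consumed and used to pick the byte order (little-endian
   is assumed when there is none). At most cap-1 code units are stored in
   out, followed by a 0 terminator; CR and LF are dropped and an overlong
   line is truncated. Returns the number of code units stored, or -1 if the
   stream held no complete code unit. */
long read_first_utf16_line(FILE *fp, uint16_t *out, size_t cap)
{
    unsigned char pair[2];
    int big_endian = 0;
    int got_any = 0;
    size_t n = 0;

    assert(fp != NULL && out != NULL && cap > 0);

    while (fread(pair, 1, 2, fp) == 2) {
        uint16_t unit = big_endian
            ? (uint16_t)((pair[0] << 8) | pair[1])
            : (uint16_t)((pair[1] << 8) | pair[0]);

        if (!got_any) {
            got_any = 1;
            if (unit == 0xFEFF)      /* BOM in the assumed order: little-endian */
                continue;
            if (unit == 0xFFFE) {    /* byte-swapped BOM: file is big-endian */
                big_endian = 1;
                continue;
            }
        }
        if (unit == 0x000A)          /* LF ends the line */
            break;
        if (unit == 0x000D)          /* skip the CR of a CR/LF pair */
            continue;
        if (n + 1 < cap)
            out[n++] = unit;
    }

    out[n] = 0;
    return got_any ? (long)n : -1;
}
```

On Windows, where wchar_t is 16 bits wide, the decoded units could be copied into a wchar_t buffer and printed with wprintf; that step is left out here so the sketch stays portable.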

--
Greetings
  Jochen




Sat, 08 Oct 2005 19:07:07 GMT  
Well spotted. Thanks!


Quote:
> A unicode text file is going to start with 0xFFFE or 0xFEFF to define
> the endian-ness of the file, as well as helping tell the world 'hey this
> file is unicode!'   Both combinations of this value are reserved out of
> the unicode character set for this purpose.





Sat, 08 Oct 2005 19:41:14 GMT  
 