Converting file encoding 

Hi,

I have tons of files stored in ASCII encoding and would like to convert
them to UTF-8 encoding. Can someone help me with this?

Things I think I need to do are:
1. check the file's current encoding -> how would I do this?
2. convert only those files in ASCII encoding to UTF-8  -> how would I go
about this?

Would really appreciate your help on this.

Thanks!
-Yasutaka



Sun, 22 May 2005 14:46:37 GMT  
Quote:
> 1. check the file's current encoding -> how would I do this?

    StreamReader asciiReader = new StreamReader(inputFilePath,
        Encoding.Default, false);
    In fact, you do not have to search for the file encoding. You just
    require the reader to use the appropriate one.

Quote:
> 2. convert only those files in ASCII encoding to UTF-8  -> how would I go
> about this?

    StreamWriter utf8Writer = new StreamWriter(outputFilePath, false,
        Encoding.UTF8);
    I'm not sure of the syntax for the UTF8 encoding...

Regards
Mehdi



Sun, 22 May 2005 16:43:08 GMT  
Hi,

Thanks for the kind reply.

Quote:
>In fact, you do not have to search for the file encoding.
>You just require the reader to use the appropriate one.

You know the Character Encoding option when you do the Save As on
Notepad, right? It automatically detects which encoding a specific file
was last saved with. Will I be able to achieve the same using your
suggestion? I need to be able to detect this so that I don't have to
re-save a file that is already saved in UTF-8.

Quote:
>StreamWriter utf8Writer = new StreamWriter(outputFilePath, false,
>Encoding.UTF8);
>I'm not sure of the syntax for the UTF8 encoding...

Got it. I think this one will work for writing in UTF8 format.

Thanks,
-Yasutaka




Sun, 22 May 2005 17:35:05 GMT  
Quote:
> You know the Character Encoding option when you do the Save As on
> Notepad, right? It automatically detects which encoding a specific file
> was last saved with. Will I be able to achieve the same
> using your suggestion?

    Yes! In fact, the Encoding.Default parameter will force the reader to get
    the right encoding (the one used by the file), but you won't know which
    one it is unless you ask the reader.

Quote:
> I need to be able to detect this so that I don't have to
> re-save a file that is already saved in UTF-8.

    So you must open the reader as I told you, and then get the reader's
    encoding (there is a property for that). Then test this encoding to see
    whether it is UTF8, UTF16, ASCII, ...

Mehdi



Sun, 22 May 2005 18:04:28 GMT  

Quote:

> > 1. check the file's current encoding -> how would I do this?
>     StreamReader asciiReader = new StreamReader(inputFilePath,
> Encoding.Default, false);
>     In fact, you do not have to search for the file encoding. You just
> require the reader to use the appropriate one.

That naming is entirely inappropriate - the default encoding may well
*not* be ASCII (and is in fact unlikely to be) so calling the reader
asciiReader is misleading.

When you say "you just require the reader to use the appropriate one"
what exactly do you mean?

Quote:
> > 2. convert only those files in ASCII encoding to UTF-8  -> how would I go
> about this?
>     StreamWriter utf8Writer = new StreamWriter(outputFilePath, false,
> Encoding.UTF8);
>     I'm not sure of the syntax for the UTF8 encoding...

Fortunately you don't need to know what UTF8 encoding really looks like,
you just use the appropriate writer and it'll do it all for you :)

--

http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too



Sun, 22 May 2005 18:35:04 GMT  
Quote:
> That naming is entirely inappropriate - the default encoding may well
> *not* be ASCII (and is in fact unlikely to be) so calling the reader
> asciiReader is misleading.

    yes, you're right. Sorry...

Quote:
> When you say "you just require the reader to use the appropriate one"
> what exactly do you mean?

    I mean that the reader constructor will detect the encoding of the text
    file, and will use it to read the characters. Otherwise, the reader will
    use the Windows standard encoding, which is UTF8 for Windows 2000.

Quote:
> Fortunately you don't need to know what UTF8 encoding really looks like,
> you just use the appropriate writer and it'll do it all for you :)

    right again, UTF8 is the default encoding !

Mehdi



Sun, 22 May 2005 18:52:56 GMT  

Quote:

> > You know the Character Encoding option when you do the Save As on
> > Notepad, right? It automatically detects which encoding a
> > specific file was last saved with. Will I be able to achieve the same
> > using your suggestion?
>     Yes! In fact, the Encoding.Default parameter will force the reader to
> get the right encoding (the one used by the file), but you won't know which
> one it is unless you ask the reader.

Again, I don't believe this is actually the case at all, as it's
fundamentally impossible to tell the difference between many different
encodings.

--

http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too



Sun, 22 May 2005 19:23:57 GMT  

Quote:

> > When you say "you just require the reader to use the appropriate one"
> > what exactly do you mean?
>     I mean that the reader constructor will detect the encoding of the text
> file, and will use it to read the characters.

How do you expect it to do that? For instance, *any* file can be an
ISO-8859-1 file. Do you have any documentation to support this?

Quote:
> Otherwise, the reader will use
> the Windows standard encoding,  which is UTF8 for Windows 2000.

Are you sure about that? I believe it's actually a Windows codepage,
dependent on regional settings. For instance, running:

using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        // Print the names of the system default encoding and of UTF-8
        Console.WriteLine(Encoding.Default.EncodingName);
        Console.WriteLine(Encoding.UTF8.EncodingName);
    }
}

on my XP box gives the output of:

Western European (Windows)
Unicode (UTF-8)

which shows that there's a difference between UTF8 and the default
encoding. Again, do you have any documentation to support your claim
that the default encoding under W2k is UTF8?

--

http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too



Sun, 22 May 2005 19:21:24 GMT  
Unicode files should have a Byte Order Mark or BOM, which is a signature at
the head of the file to indicate the encoding used:

UTF-8 - EF BB BF
UTF-16 Big-Endian  - FE FF
UTF-16 Little-Endian - FF FE
UCS-4 Big-Endian - 00 00 FE FF
UCS-4 Little-Endian - FF FE 00 00

These sequences are highly unlikely to occur in other real world character
encodings.
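
As a rough illustration of the table above, a BOM check in C# might look like
the sketch below. The names BomSniffer, DetectBom and DetectBomInFile are made
up for this example and are not part of the .NET libraries. A result of null
only means that no BOM was found; it says nothing about which encoding the
file actually uses.

```csharp
using System;
using System.IO;

public static class BomSniffer
{
    // Returns a label for the BOM at the start of the data, or null if none matches.
    public static string DetectBom(byte[] head)
    {
        // Test the 4-byte marks first: FF FE 00 00 also begins with the
        // UTF-16 Little-Endian mark FF FE. (A UTF-16 LE file whose first
        // character is U+0000 would be misreported here; that is rare.)
        if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UCS-4 Big-Endian";
        if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UCS-4 Little-Endian";
        if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(head, 0xFE, 0xFF)) return "UTF-16 Big-Endian";
        if (StartsWith(head, 0xFF, 0xFE)) return "UTF-16 Little-Endian";
        return null;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Reads at most the first four bytes of the file and checks them for a BOM.
    public static string DetectBomInFile(string path)
    {
        byte[] head = new byte[4];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, read);
        }
        return DetectBom(head);
    }

    public static void Main(string[] args)
    {
        foreach (string path in args)
        {
            Console.WriteLine(path + ": " + (DetectBomInFile(path) ?? "no BOM found"));
        }
    }
}
```

Passing file paths on the command line prints one line per file. Only the
first four bytes are read, so the check is cheap even for large files.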

7 and 8 bit character sets are a bit of a confusing mess, but I do recall
seeing a probabilistic algorithm for identifying a character set used in a
particular file. Apparently it was usually correct for non trivial files,
and often when it was wrong, it didn't much matter anyway (there is a lot of
overlap in many of the encodings).

--
Nick Holmes
Coyote Software, GmbH.





Sun, 22 May 2005 19:42:21 GMT  

Quote:

> Unicode files should have a Byte Order Mark or BOM, which is a signature at
> the head of the file to indicate the encoding used:

> UTF-8 - EF BB BF
> UTF-16 Big-Endian  - FE FF
> UTF-16 Little-Endian - FF FE
> UCS-4 Big-Endian - 00 00 FE FF
> UCS-4 Little-Endian - FF FE 00 00

> These sequences are highly unlikely to occur in other real world character
> encodings.

I was aware of the above *apart* from UTF-8 - I didn't realise it had
any marking. Cheers.

Quote:
> 7 and 8 bit character sets are a bit of a confusing mess, but I do recall
> seeing a probabilistic algorithm for identifying a character set used in a
> particular file. Apparently it was usually correct for non trivial files,
> and often when it was wrong, it didn't much matter anyway (there is a lot of
> overlap in many of the encodings).

I suspect any 7/8 bit file which only contains ASCII characters will be
fine for almost all encodings (EBCDIC aside). It's when you get above
that that it becomes incorrect. However, I'd really *hope* that if the
default encoding did anything like that, the docs would say so. Bear in
mind that an encoding can't necessarily "read ahead" before the results
start being needed.

--

http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too



Sun, 22 May 2005 20:55:13 GMT  
Quote:
> >     I mean that the reader constructor will detect the encoding of the
> > text file, and will use it to read the characters.
> How do you expect it to do that? For instance, *any* file can be an
> ISO-8859-1 file. Do you have any documentation to support this?

    well ... I don't know, but I think Nick already answered this.

Quote:
> > Otherwise, the reader will use
> > the Windows standard encoding,  which is UTF8 for Windows 2000.
> Are you sure about that?

    I can't find it again, but I found this information in the MSDN doc.

Quote:
>         Console.WriteLine (Encoding.Default.EncodingName);
>         Console.WriteLine (Encoding.UTF8.EncodingName);
> on my XP box gives the output of:

>     Western European (Windows)
>     Unicode (UTF-8)

    Yes, but that's not what I said. Well, you're probably right in saying that
    Encoding.Default returns the system setting. But creating a StreamReader
    with that parameter does the right job! I have tested it on Swedish,
    Russian, French, ... files, and the reader is initialized perfectly. Maybe
    I am wrong, but you should test it. I think it's worth it, no?

Mehdi



Sun, 22 May 2005 21:43:57 GMT  

Quote:

> > >     I mean that the reader constructor will detect the encoding of the
> > > text file, and will use it to read the characters.
> > How do you expect it to do that? For instance, *any* file can be an
> > ISO-8859-1 file. Do you have any documentation to support this?
>     well ... I don't know, but I think Nick already answered this.

He answered how heuristically you can get it right *most* of the time -
*if* you get enough data to make the decision. I don't believe a
StreamReader will always have enough data, as it needs to be able to
read from a stream, which may only have the first few bytes available by
the time the first result is required.

Quote:
> > > Otherwise, the reader will use
> > > the Windows standard encoding,  which is UTF8 for Windows 2000.
> > Are you sure about that?
>     I can't find it again, but I found this information in the MSDN doc.
> >         Console.WriteLine (Encoding.Default.EncodingName);
> >         Console.WriteLine (Encoding.UTF8.EncodingName);
> > on my XP box gives the output of:

> >     Western European (Windows)
> >     Unicode (UTF-8)
>     yes, but that's not what I said.

Well, you said it would use a UTF8 encoding by default on Windows 2000,
which I don't believe it will.

Quote:
> Well, you're probably right in saying that
> Encoding.Default returns the system setting. But creating a StreamReader
> with that parameter does the right job! I have tested it on Swedish, Russian,
> French, ... files, and the reader is initialized perfectly. Maybe I am
> wrong, but you should test it. I think it's worth it, no?

I think it's worth you being *very* clear exactly what you mean. When
you said you tested it on Swedish, Russian and French files - what
*exactly* was in those files? What was the encoding? How did you verify
that encoding?

--

http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too



Sun, 22 May 2005 22:28:47 GMT  
If your files are truly ASCII encoded, then you don't need to convert them
to UTF-8. Or at least, if you did, you wouldn't be able to tell the
difference between input and output (with the possible exception of the
optional BOM at the top of the file).
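
To make that concrete, here is a small sketch that encodes the same string
with both encodings and compares the resulting bytes. The class name
AsciiVsUtf8 and the method SameBytes are invented for this example; the
encoding calls are the standard System.Text ones.

```csharp
using System;
using System.Text;

public static class AsciiVsUtf8
{
    // True when the text produces exactly the same bytes under ASCII and UTF-8.
    // GetBytes never emits a BOM, so only the character data is compared.
    public static bool SameBytes(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = new UTF8Encoding(false).GetBytes(text);
        if (ascii.Length != utf8.Length) return false;
        for (int i = 0; i < ascii.Length; i++)
        {
            if (ascii[i] != utf8[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(SameBytes("plain ASCII text"));        // True
        Console.WriteLine(SameBytes("caf\u00e9"));               // False: U+00E9 is not ASCII
        Console.WriteLine(Encoding.UTF8.GetPreamble().Length);   // 3: the optional UTF-8 BOM
    }
}
```

The second call returns false because the last character lies outside ASCII,
which is exactly the case where a real conversion would matter.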

--
Nick Holmes
Coyote Software, GmbH.






Sun, 22 May 2005 22:55:32 GMT  
Don't trust me, I don't care. That was just to help.

Last answer.
Mehdi



Sun, 22 May 2005 23:38:10 GMT  
[Just to clarify, as this thread has gone awry...]


Quote:
> I have tons of files stored in ASCII encoding and would like to convert
> them to UTF-8 encoding. Can someone help me with this?

> Things I think I need to do are:
> 1. check the file's current encoding -> how would I do this?

There's no guaranteed way of doing it. If the files don't contain any
bytes > 127, then it's a *reasonable* guess that they're ASCII, but it's
not certain.
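
A minimal sketch of that byte check, assuming it is acceptable to scan each
file once. AsciiCheck and IsAllAscii are hypothetical names chosen for this
example. A result of true means only that the file is consistent with ASCII,
not that it was written as ASCII.

```csharp
using System;
using System.IO;

public static class AsciiCheck
{
    // Returns true if no byte in the stream is greater than 127.
    public static bool IsAllAscii(Stream input)
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] > 127) return false;
            }
        }
        return true;
    }

    // Convenience overload that opens the file at the given path.
    public static bool IsAllAscii(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return IsAllAscii(fs);
        }
    }

    public static void Main(string[] args)
    {
        foreach (string path in args)
        {
            Console.WriteLine(path + ": " + (IsAllAscii(path) ? "only bytes <= 127" : "contains bytes > 127"));
        }
    }
}
```

A file that already starts with a UTF-8 BOM fails this check, since EF BB BF
are all above 127, so such files would be skipped rather than re-saved.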

Quote:
> 2. convert only those files in ASCII encoding to UTF-8  -> how would I go
> about this?

Create a StreamReader using Encoding.ASCII, and a StreamWriter using
Encoding.UTF8. Read characters (in blocks, for speed) from the
StreamReader and write them to the StreamWriter.
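
The approach above could be sketched roughly as follows. AsciiToUtf8 and
ConvertFile are names invented for this example, and the 4096-character block
size is an arbitrary choice. The sketch assumes the input and output paths
are different files.

```csharp
using System;
using System.IO;
using System.Text;

public static class AsciiToUtf8
{
    // Copies inputPath to outputPath, decoding as ASCII and encoding as UTF-8.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        char[] buffer = new char[4096];
        // 'false' on the reader turns off BOM detection so ASCII is always used;
        // 'false' on the writer means overwrite rather than append.
        using (StreamReader reader = new StreamReader(inputPath, Encoding.ASCII, false))
        using (StreamWriter writer = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("usage: AsciiToUtf8 <input> <output>");
            return;
        }
        ConvertFile(args[0], args[1]);
    }
}
```

Encoding.UTF8 makes the writer emit the three-byte BOM at the start of the
output; passing new UTF8Encoding(false) instead would leave it out. Any input
byte above 127 is not valid ASCII and comes out as a replacement character,
so it is worth checking the input before converting.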

--
Jon Skeet

If replying to the group, please do not mail me at the same time



Mon, 23 May 2005 00:50:00 GMT  
 