Newbie needs unicode guidance 

I'm trying to write a script to handle the output of regedit (the
text .reg files) on Windows systems. To make sure my script could
deal with every possible kind of data, I did a complete dump of the
registry on an NT system and another on a W2K system. Working with
the smaller NT file first, I was able, piece by piece, to get it to
recognise all the different kinds of registry entry. Then I switched to
the W2K file, and I get failures for every line in the file. With the NT
file everything happens as I expect it to.

I'm reading the file in like this:


'real' text. When I print these lines to a log file individual characters
are interspersed with spaces or some non-printable character and there
seem to be several control codes at the end of each line. I don't know
what these are, they could just be multiple newlines. When I open the
file in notepad it looks like a 'real' text file. Opening the W2K file
in Word I noticed it asked me what kind of encoding I wanted to treat
the file as. It suggested Unicode.

My question is where do I go from here? I've been able to do some amazing
things with Perl but I'm still only a relative newcomer to it. This actually
started out as a project for learning to call functions by reference. I'd
always assumed that if there was some kind of encoding on text files there'd
be something sitting between Perl and the disk that would take care of this for
me. What do I need to do to turn this 'text' file into a *text* file? :) Any
help greatly appreciated.

BTW, I tried saving it out again in Word by telling it to save it under a
different filename as 'plain text', 'msdos text', 'text only', etc. These
all shaved a reasonable chunk off the size of the file (drastically
altering my impression of the size of my registry). All of these altered
the file in some way but changed nothing with respect to the success of
my script.

Cheers,
Chris.

--
Chris     |
Russell's |
Five      | If the past was so good why did it end?
Line      |
Sig       |



Thu, 12 Aug 2004 22:40:46 GMT  


Quote:
> I'm trying to write a script to handle the output of regedit (the
> text .reg files) on Windows systems. To ensure my script was able
...
> I'm reading the file in like this:



Unless you actually need all the lines in memory at once (and you may),
you may as well process a line at a time, e.g.:

while (<FILE>) {
   # the line is now in $_
}

Quote:
> 'real' text. When I print these lines to a log file individual characters
> are interspersed with spaces or some non-printable character and there
> seem to be several control codes at the end of each line.

AIUI all the text in the registry is Unicode, so those things that look
like spaces are really the high byte of each two-byte character (a zero
byte for plain ASCII text) - the low byte is the character you can
see[1]. You may also have a byte-order mark at the start of the file.
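
To make the byte layout concrete, here is a minimal sketch that assumes the W2K file is UTF-16LE (little-endian, which is what Windows normally writes). It uses the core Encode module, which ships with Perl 5.8 and later, to show what the string "dword" looks like on disk and why a pattern written for one-byte characters does not match it. The variable names are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a short ASCII string the way a UTF-16LE file stores it.
my $text  = 'dword';
my $bytes = encode('UTF-16LE', $text);

# Show each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
print "$hex\n";    # 64 00 77 00 6f 00 72 00 64 00

# The raw bytes have a zero byte after every letter, so a pattern
# written for one-byte characters does not match them.
my $matched = ($bytes =~ /^dword/i) ? 1 : 0;
print "raw bytes match /^dword/i: $matched\n";    # 0
```

The zero bytes are the "spaces" seen in the log file, and the line ending becomes four bytes (0d 00 0a 00) instead of two.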

Quote:
> file in notepad it looks like a 'real' text file. Opening the W2K file
> in Word I noticed it asked me what kind of encoding I wanted to treat
> the file as. It suggested Unicode.

Well it was probably right to do that as this sounds like utf-16 (or is
it UCS2?) text.

Quote:
> My question is where do I go from here? I've been able to do some amazing
> things with Perl but I'm still only a relative newcomer to it. This actually
> started out as a project for learning to call functions by reference. I'd
> always assumed that if there was some kind of encoding on text files there'd
> be something sitting between Perl and the disk that would take care of this
> for

Perl is quite happy handling binary data so I guess that in your case it
just read in some binary data, and printed that binary data to a file -
Perl doesn't care what that binary data really is (if you're handling
binary data to/from a file you should set binmode on the filehandles on
a Windows system).

So, it currently sounds like perl is simply taking the dump data from
the file and (after maybe some other operations) logging the data to a
log file. I guess you want to take a unicode string and approximate it
with a utf-8, CP1252 or ASCII string?

Also, I thought the registry contained true binary data as well as text -
I don't know what if anything you need to do about such entries.

Quote:
> me. What do I need to do to turn this 'text' file into a *text* file? :) Any
> help greatly appreciated.

That's the thing - it _is_ a text file. It's just not encoded with ASCII
or CP1252, but with Unicode. Word understands the encoding. The latest
version of perl has experimental unicode support - see the perlunicode
manpage - but that may be orthogonal to your requirements.

Have a look at:
http://search.cpan.org/doc/MSCHWARTZ/Unicode-Map-0.110/Map.pm
that seems to do what I think you need to do - namely change 2-byte
stuff to single-byte.
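
As an alternative to Unicode::Map (not what the link above describes), a rough sketch using the core Encode module from Perl 5.8 onwards might look like the following. It assumes the data is UTF-16LE and that the bytes were read in binary mode. The sample line is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for bytes slurped in binary mode from a UTF-16LE .reg file
# (built here so the example is self-contained).
my $text = qq{dword:0000000a\r\n};
my $raw  = encode('UTF-16LE', $text);

# Turn the two-byte units into Perl characters.
my $line = decode('UTF-16LE', $raw);
$line =~ s/\r?\n\z//;    # drop the line ending

# Once decoded, an ordinary pattern matches again.
my $is_dword = ($line =~ /^dword:/i) ? 1 : 0;
print "decoded: $line\n";
print "matches: $is_dword\n";
```

Decoding whole chunks matters here: splitting the raw bytes on a single newline byte would leave half of the two-byte line ending at the start of the next line.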

P

[1] ok maybe it's not exactly like that but you know what I'm getting at.

--
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply



Fri, 13 Aug 2004 01:26:09 GMT  


: Also, I thought the registry contained true binary data as well as text -
: I don't know what if anything you need to do about such entries.

It does, but regedit translates this to text (hex values) and prefixes the entry
with a code to tell you what kind of data it is. I'm able to process the file
format OK if it comes from another OS.
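
For readers who haven't seen the format, here is a small sketch of how those type prefixes could be recognised. The helper name classify_reg_data is hypothetical and is not the code from my script. It only covers the dword:, hex: and hex(N): prefixes plus quoted strings, and it ignores the backslash line continuations that regedit uses for long hex values.

```perl
use strict;
use warnings;

# Hypothetical helper: split the data part of a .reg line into a type
# label and a decoded value, based on the prefix regedit writes.
sub classify_reg_data {
    my ($data) = @_;
    if ($data =~ /^dword:([0-9a-f]{1,8})$/i) {
        return ('dword', hex $1);                # 32-bit number
    }
    if ($data =~ /^hex(?:\(([0-9a-f]+)\))?:(.*)$/i) {
        my $type  = defined $1 ? "hex($1)" : 'hex';
        my @bytes = map { hex $_ } grep { length $_ } split /,/, $2;
        return ($type, \@bytes);                 # list of byte values
    }
    if ($data =~ /^"(.*)"$/) {
        return ('string', $1);                   # escapes left untouched
    }
    return ('unknown', $data);
}

my ($type, $value) = classify_reg_data('dword:0000000a');
print "$type => $value\n";    # dword => 10
```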

: > me. What do I need to do to turn this 'text' file into a *text* file? :) Any
: > help greatly appreciated.

: That's the thing - it _is_ a text file. It's just not encoded with ASCII
: or CP1252, but with Unicode. Word understands the encoding. The latest
: version of perl has experimental unicode support - see the perlunicode
: manpage - but that may be orthogonal to your requirements.

I don't understand though why the logic I'm using in my program to determine the
kind of data in the file isn't working. Here's a snippet:

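        # Found a DWORD value. No /g here: in a condition it leaves
        # pos($data) set, which can make a later m//g on $data fail.
        if ($data =~ m/^dword/i) {
            # Keep only the part after the first colon
            my (undef, $ddata) = split /:/, $data, 2;

            # Create a new hash reference if needed
            if (!exists $Registry{$keyname}{$value}) {
                $Registry{$keyname}{$value} = {};
            }

            # Store the converted value and mark the entry as a DWORD
            $data = ReturnDwordValue($ddata);
            $Registry{$keyname}{$value}{'Type'} = $DWORD;
            $Registry{$keyname}{$value}{'Data'} = $data;
        }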

$keyname and $value are figured out a little earlier in the code.

With the file from the NT system the regex matches all the dword
values in the file and runs the relevant bits of code to store the
data in the hash and mark it up as a dword correctly. With the text from
the W2K system it doesn't, I'm assuming because the things that look like
spaces really are present in $data, between the letters. At least, $data
contains something that the encoding of the file never meant to be seen as
separate characters.

What I want is for Perl to understand that the file uses two bytes to represent
a single character and act accordingly. If that were happening then I'm sure
the regex would match. I don't want to translate the file as such, though if it
needs to be that way then it needs to be that way. I want Perl to understand it
for what it is, because I ultimately intend to use this to read in regedit-like
output and punt it off to a couple of hundred machines in one go. Hopefully the
module you suggested will let me use the regex above. I'll check it out, thanks.
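
One way this might be done without translating the file first is an encoding layer on the filehandle, available in Perl 5.8 and later. The sketch below assumes the W2K export is UTF-16 with a byte-order mark at the start. It builds a tiny stand-in file so it runs on its own; the "Start" value and the temporary file are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a tiny UTF-16LE file with a BOM to stand in for a W2K export.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16LE)';
print {$tmp_fh} "\x{FEFF}Windows Registry Editor Version 5.00\r\n",
                qq{"Start"=dword:00000002\r\n};
close $tmp_fh or die "close: $!";

# ":encoding(UTF-16)" reads the BOM to pick the byte order and then
# hands back characters, so ordinary regexes work on each line.
open my $fh, '<:raw:encoding(UTF-16)', $path or die "open $path: $!";
my @dwords;
while (my $line = <$fh>) {
    $line =~ s/\r?\n\z//;    # strip CRLF or LF
    push @dwords, $1 if $line =~ /=dword:([0-9a-f]{8})$/i;
}
close $fh;
print "found dword: $_\n" for @dwords;
```

The ":raw" in front keeps any CRLF translation away from the undecoded bytes, which is why the loop strips the carriage return itself.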

: Have a look at:
: http://search.cpan.org/doc/MSCHWARTZ/Unicode-Map-0.110/Map.pm
: that seems to do what I think you need to do - namely change 2-byte
: stuff to single-byte.

Cheers,
Chris.

--
Chris     |
Russell's |
Five      | Caned and unable.
Line      |
Sig       |



Fri, 13 Aug 2004 03:12:04 GMT  
 