Sunday, May 4, 2014

UTF-8 output from Perl in cmd.exe: Layers of goodness

This is a followup to my previous post: UTF-8 output from Perl and C programs in cmd.exe on Windows 8.

The goal is for a Perl script to output UTF-8 encoded text in a cmd.exe window already set to code page 65001.

First, if you already have a UTF-8 encoded string, achieving correct output in such a window seems to be impossible:

C:\> chcp
Active code page: 65001

C:\> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz!\n}"
αβγxyz!
!

unless you give up CRLF translation:

C:\> perl -e "binmode(STDOUT, ':unix'); print qq{\xce\xb1\xce\xb2\xce\xb3xyz!\n}"
αβγxyz!

It seems logical to me that CRLF translation should be the last transformation applied to output, which leads to my current workaround of using binmode(STDOUT, ":unix:encoding(utf8):crlf"):

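use utf8;  # the script contains a literal αβγ string
use YAML;  # assumed source of Dump; any YAML module exporting Dump should do

binmode(STDOUT, ":unix:encoding(utf8):crlf");
# Dump each layer's name, arguments, and flags (flags shown as hex)
print Dump [
    map {
        my $x = defined($_) ? $_ : '';
        $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
        $x;
    } PerlIO::get_layers(STDOUT, details => 1)
];
print "αβγxyz!\n";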

outputs:

---
- unix
- ''
- 0x01205200
- crlf
- ''
- 0x00c85200
- unix
- ''
- 0x01201200
- encoding
- utf8
- 0x00c89200
- crlf
- ''
- 0x00c8d200
αβγxyz!

That is two layers short of seven-layer goodness.

The flag values are defined in perliol.h:

/* Flag values */
#define PERLIO_F_EOF            0x00000100
#define PERLIO_F_CANWRITE       0x00000200
#define PERLIO_F_CANREAD        0x00000400
#define PERLIO_F_ERROR          0x00000800
#define PERLIO_F_TRUNCATE       0x00001000
#define PERLIO_F_APPEND         0x00002000
#define PERLIO_F_CRLF           0x00004000
#define PERLIO_F_UTF8           0x00008000
#define PERLIO_F_UNBUF          0x00010000
#define PERLIO_F_WRBUF          0x00020000
#define PERLIO_F_RDBUF          0x00040000
#define PERLIO_F_LINEBUF        0x00080000
#define PERLIO_F_TEMP           0x00100000
#define PERLIO_F_OPEN           0x00200000
#define PERLIO_F_FASTGETS       0x00400000
#define PERLIO_F_TTY            0x00800000
#define PERLIO_F_NOTREG         0x01000000   
#define PERLIO_F_CLEARED        0x02000000 /* layer cleared but not freed */

The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know enough about the PerlIO internals to say.

However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.

The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I push after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.
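The flag arithmetic above is easy to get wrong by hand, so here is a minimal sketch that decodes a flag word mechanically. It assumes only the table quoted from perliol.h; @PERLIO_FLAGS and decode_flags are names made up for this illustration and are not part of Perl.

```perl
use strict;
use warnings;

# Flag names and bit values copied from the perliol.h excerpt quoted above.
my @PERLIO_FLAGS = (
    [ EOF      => 0x00000100 ],
    [ CANWRITE => 0x00000200 ],
    [ CANREAD  => 0x00000400 ],
    [ ERROR    => 0x00000800 ],
    [ TRUNCATE => 0x00001000 ],
    [ APPEND   => 0x00002000 ],
    [ CRLF     => 0x00004000 ],
    [ UTF8     => 0x00008000 ],
    [ UNBUF    => 0x00010000 ],
    [ WRBUF    => 0x00020000 ],
    [ RDBUF    => 0x00040000 ],
    [ LINEBUF  => 0x00080000 ],
    [ TEMP     => 0x00100000 ],
    [ OPEN     => 0x00200000 ],
    [ FASTGETS => 0x00400000 ],
    [ TTY      => 0x00800000 ],
    [ NOTREG   => 0x01000000 ],
    [ CLEARED  => 0x02000000 ],
);

# Hypothetical helper: return the names of all bits set in $flags,
# lowest bit first, joined with ' | '.
sub decode_flags {
    my ($flags) = @_;
    return join ' | ',
        map  { $_->[0] }
        grep { $flags & $_->[1] } @PERLIO_FLAGS;
}

# The STDOUT layer flags reported earlier in this post.
printf "0x%08x = %s\n", $_, decode_flags($_)
    for (0x01205200, 0x01201200, 0x00c85200, 0x00c8d200);
```

For 0x01205200 this prints CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG, and for 0x01201200 the same list without CRLF, matching the breakdown above.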

Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:

---
- unix
- ''
- 0x00201200
- crlf
- ''
- 0x00405200
- encoding
- utf8
- 0x00409200

As expected, the unix layer does not set the CRLF flag.
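For anyone who wants to repeat the file-handle experiment without a YAML module, here is a rough sketch using only core Perl. It is not the exact script I ran: it writes to a temporary file instead of 'ttt', and layer_details is a helper name invented for this example. The layers and flag values it reports will depend on the platform and Perl build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: group the flat list returned by
# PerlIO::get_layers(..., details => 1) into [name, args, flags] triples.
sub layer_details {
    my ($fh) = @_;
    my @flat = PerlIO::get_layers($fh, details => 1);
    my @triples;
    while (my ($name, $args, $flags) = splice(@flat, 0, 3)) {
        push @triples, [ $name, defined $args ? $args : '', $flags ];
    }
    return @triples;
}

# A temporary file stands in for 'ttt' (an assumption of this sketch).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $fh, '>:encoding(utf8)', $path
    or die "Cannot open $path: $!";

# One line per layer: name, arguments, flags in hex.
my @file_layers = layer_details($fh);
printf "%-10s %-6s 0x%08x\n", @$_ for @file_layers;

close $fh;
```

On Windows the list should include a crlf layer between unix and encoding, as in the dump above; on other systems the default layers differ.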
