Thursday, May 1, 2014

UTF-8 output from Perl and C programs in cmd.exe on Windows 8

This all happened because I decided to put together a cute post involving objects representing functions for perltricks.com. One thing led to another, and I found myself completely unable to understand what was going on. So, let's start at the end:

#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}

Let's see what happens when we compile and run that program in cmd.exe on my system:

When I switch to the UTF-8 code page, I get:

If this were all there was to it, there would be no need for a blog post.

Let's see some Perl:

use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);

binmode STDOUT, ':utf8';

print 'αβγ', "\n";

And, here is the output:

Copying and pasting the output in the cmd.exe window, we have:

C:\Users\sinan\src\poly> pttt.pl
αβγ
�

xxd does not help at all:

C:\Users\sinan\src\poly> pttt.pl | xxd
0000000: ceb1 ceb2 ceb3 0d0a                      ........

C:\Users\sinan\src\poly> cttt.exe | xxd
0000000: ceb1 ceb2 ceb3 0d0a                      ........

So, the Perl program seems to output the exact same byte sequence as the C program, but when I run the Perl program, I get an extra line with a mystery character.
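As a sanity check on what those octets are, here is a minimal sketch of decoding one 2-byte UTF-8 sequence by hand. The helper name `decode_utf8_pair` is made up for this illustration; it is not part of either program above, and it only handles the 2-byte form that α, β, and γ use.

```c
#include <stdint.h>

/* Hypothetical helper for illustration: decode one 2-byte UTF-8 sequence
 * of the form 110xxxxx 10xxxxxx. Returns the code point, or U+FFFD if the
 * two bytes do not form a valid 2-byte sequence. */
uint32_t decode_utf8_pair(unsigned char lead, unsigned char trail)
{
    /* The lead byte must start with bits 110, the trail byte with bits 10. */
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80)
        return 0xFFFD;

    /* Five payload bits from the lead byte, six from the trail byte. */
    uint32_t cp = ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(trail & 0x3F);

    /* Values below U+0080 must use the 1-byte form (overlong otherwise). */
    return cp < 0x80 ? 0xFFFD : cp;
}
```

Feeding it the pairs from the dump, `decode_utf8_pair(0xce, 0xb1)`, `(0xce, 0xb2)`, and `(0xce, 0xb3)` give U+03B1, U+03B2, and U+03B3, that is, α, β, and γ. So both programs really do emit well-formed UTF-8 followed by CR LF.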

OK, let's reduce the Perl program to just output the octets like the C program does:

C:\Users\sinan\src\poly> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�

Copying and pasting that text and examining the octets in a hex editor gives me U+FFFD, which reveals nothing about the original byte. However, changing the print statement in the script to:

print 'αβγ1', "\n";

gives me the following output:

C:\Users\sinan\src\poly> pttt.pl
αβγ1
1

C:\Users\sinan\src\poly> pttt.pl | xxd
0000000: ceb1 ceb2 ceb3 310d 0a                   ......1..

When directed to a pipe, we don't get the extra line with the extra digit. However, when directed to the cmd.exe window where the code page is set to 65001, there is an extra line with the digit one.

This leads me to believe that the last octet somehow gets written again on a separate line when output is not redirected. On its own, the octet 0xb3 is a UTF-8 continuation byte and not a valid encoding of any character, so somewhere along the way it gets replaced with U+FFFD.
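To make that reasoning concrete, here is a small sketch that classifies a single octet according to the UTF-8 bit patterns. The enum and the function name `classify_utf8_byte` are my own inventions for this illustration, and the sketch says nothing about what the console actually does internally.

```c
/* Hypothetical classification of a single octet under UTF-8 rules. */
enum utf8_byte_kind {
    UTF8_ASCII,         /* 0xxxxxxx: a complete character by itself     */
    UTF8_CONTINUATION,  /* 10xxxxxx: only valid after a lead byte       */
    UTF8_LEAD,          /* 0xC2..0xF4: starts a multi-byte sequence     */
    UTF8_INVALID        /* 0xC0, 0xC1, 0xF5..0xFF: never valid in UTF-8 */
};

enum utf8_byte_kind classify_utf8_byte(unsigned char b)
{
    if (b < 0x80)
        return UTF8_ASCII;
    if ((b & 0xC0) == 0x80)
        return UTF8_CONTINUATION;
    if (b >= 0xC2 && b <= 0xF4)
        return UTF8_LEAD;
    return UTF8_INVALID;
}
```

Under these rules 0x31 (the digit 1) is `UTF8_ASCII`, so a stray repeat of it still displays as 1, whereas 0xb3 is `UTF8_CONTINUATION`: repeated by itself it has no lead byte in front of it, which is consistent with a decoder substituting U+FFFD.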

I have tried this on Windows 8.1 Pro (64-bit) and Windows Vista Home (32-bit), both with a self-compiled perl 5.18.2 and ActiveState's 5.16.3. The problem does not appear in mintty with Cygwin's perl 5.14.4, nor when I run Cygwin's perl from a cmd.exe window set to code page 65001:

C:\Users\sinan\src\poly> c:\opt\cygwin64\bin\perl.exe pttt.pl
αβγ

What should I try next?

Update

For some reason, I had forgotten about ConEmu. I installed the 64-bit version, and everything works, presumably because it is capturing the output from the Perl script.

My question on Stack Overflow.

3 comments:

  1. Hey Sinan,

    I tried piping the output to a text file and that stops the issue. I wonder if cmd.exe is appending a newline and/or other characters when printing to the screen?

    David Farrell

    1. I do note the repeated octet is not there when the output is piped/redirected.

      However, note that the C program and the Perl one-liner output the exact same octets. The former does not exhibit the problem whereas the latter does.

  2. This happens because the Windows console uses an internal buffer that can mangle output. I'm posting code that should give the desired result; for a thorough explanation of why, please see
    Printing Unicode on the Windows Console and the importance of I/O layers

    use strict;
    use warnings;
    use utf8;
    use Win32::API;

    binmode(STDOUT, ":unix:utf8");

    # Must set the console code page to UTF-8
    my $SetConsoleOutputCP = Win32::API->new('kernel32.dll', 'SetConsoleOutputCP', 'N', 'N')
        or die "Cannot import SetConsoleOutputCP: $^E";
    $SetConsoleOutputCP->Call(65001);

    my $unicode_string  = "αβγ\n";
    my $unicode_string1 = "αβγ1";

    print "THIS IS THE CORRECT EXAMPLE OUTPUT IN PURE PERL: \n";
    print $unicode_string;
    print $unicode_string1;

