More on UTF-8 output from perl in cmd.exe

Tony Cook provided some insight about what might be going on regarding extraneous trailing output when cmd.exe is set to code page 65001.

It turns out there is a bug in Windows’ WriteFile function: when the console is set to code page 65001, it reports the number of characters written instead of the number of bytes. So, for example:

perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3123}"

sends 9 bytes to output. However, WriteFile reports that it wrote only 6 bytes (the number of characters in the string "αβγ123"), and therefore perl goes back and outputs the last three bytes again:

C:\> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3123}"
αβγ123123
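
To make that mechanism concrete, here is a minimal, portable sketch of it. Nothing in it is perl's or Windows' actual code: `struct sink`, `utf8_chars`, `fake_write` and `send_all` are hypothetical names, and `fake_write` merely imitates a write call that stores every byte but reports the number of characters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "console" that collects everything written to it. */
struct sink {
    char   data[64];
    size_t used;
};

/* Count UTF-8 characters: every byte that is not a continuation
   byte (10xxxxxx) starts a new character. */
static size_t utf8_chars(const char *buf, size_t n)
{
    size_t chars = 0;
    for (size_t i = 0; i < n; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Stand-in for the miscounting write (an assumption, not the real API):
   stores all n bytes but reports how many characters they contain. */
static size_t fake_write(struct sink *s, const char *buf, size_t n)
{
    if (n > sizeof s->data - s->used)
        n = sizeof s->data - s->used;   /* clip to the space left */
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    return utf8_chars(buf, n);
}

/* A typical "keep writing until everything is sent" loop that
   trusts the reported count. */
static void send_all(struct sink *s, const char *buf, size_t n)
{
    while (n > 0) {
        size_t done = fake_write(s, buf, n);
        if (done == 0)
            break;                      /* nothing accepted; do not spin */
        buf += done;
        n   -= done;
    }
}
```

Calling `send_all` on the nine bytes of "αβγ123" makes the first `fake_write` store all nine bytes and report 6, so the loop advances six bytes and sends the remaining three, "123", a second time. The sink ends up holding the twelve bytes of "αβγ123123", matching the console output above.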

The relevant code is in win32io.c:

SSize_t
PerlIOWin32_write(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
{
 PerlIOWin32 *s = PerlIOSelf(f,PerlIOWin32);
 DWORD len;
 if (WriteFile(s->h,vbuf,count,&len,NULL))
  {
   return len;
  }
 else
  {
   PerlIOBase(f)->flags |= PERLIO_F_ERROR;
   return -1;
  }
}

Now, looking at the documentation for WriteFile, we have:

The WriteFile function returns when one of the following conditions occur:

  • The number of bytes requested is written.
  • A read operation releases buffer space on the read end of the pipe (if the write was blocked). For more information, see the Pipes section.
  • An asynchronous handle is being used and the write is occurring asynchronously.
  • An error occurs.

Further reading suggests to me that if WriteFile is being used for synchronous IO, as PerlIO seems to be doing, then either count bytes will be successfully written, or the function will return with an error (broken pipes also result in error returns).

So, might the issue be fixed simply by ensuring PerlIOWin32_write always returns count, disregarding what WriteFile returns?

To investigate this, I recompiled my brand new perl 5.20.0 after changing PerlIOWin32_write to:

SSize_t
PerlIOWin32_write(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
{
 PerlIOWin32 *s = PerlIOSelf(f,PerlIOWin32);
 DWORD len;
 if (WriteFile(s->h,vbuf,count,&len,NULL))
  {
   return count;
  }
 else
  {
   PerlIOBase(f)->flags |= PERLIO_F_ERROR;
   return -1;
  }
}

Well, long story short, it doesn’t work. In fact, the erroneous output above was generated with that binary.
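
That result is surprising if one assumes this function is the only code sitting between the retry logic and the miscounting write. Under that assumption, a toy model of the patch does get rid of the duplicate. The sketch below is portable C with hypothetical names throughout (`struct sink`, `miscounting_write`, `patched_write`, `send_all`); it models the idea of the change, not perl's real I/O stack.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory console, standing in for the real handle. */
struct sink {
    char   data[64];
    size_t used;
};

/* Models the miscounting call: stores all n bytes, reports the number
   of characters (non-continuation bytes). Returns 1 on success, 0 on
   failure, loosely like a BOOL. */
static int miscounting_write(struct sink *s, const char *buf, size_t n,
                             size_t *reported)
{
    size_t chars = 0;
    if (n > sizeof s->data - s->used)
        return 0;                       /* would overflow the sink */
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    for (size_t i = 0; i < n; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    *reported = chars;
    return 1;
}

/* Models the patched wrapper: on success return count and ignore
   the reported length. */
static long patched_write(struct sink *s, const char *buf, size_t count)
{
    size_t len;
    if (miscounting_write(s, buf, count, &len))
        return (long)count;
    return -1;
}

/* Caller that keeps writing until every byte is accounted for. */
static int send_all(struct sink *s, const char *buf, size_t n)
{
    while (n > 0) {
        long done = patched_write(s, buf, n);
        if (done < 0)
            return -1;
        buf += done;
        n   -= (size_t)done;
    }
    return 0;
}
```

In this model, sending the nine bytes of "αβγ123" leaves exactly nine bytes in the sink, with no repeated "123". Since the real binary still repeats them, the model's central assumption presumably does not hold for the actual perl.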

Incidentally, for some reason the CRLF translation flag is still being set for the bottom-most unix layer on STDOUT:

C:\> perl -MYAML::XS -MPerlIO::Layers=get_layers -e "print Dump get_layers(\*STDOUT)"
---
- unix
- ~
- - OPEN
  - TRUNCATE
  - CRLF
  - CANWRITE
---
- crlf
- ~
- - FASTGETS
  - TRUNCATE
  - LINEBUF
  - CRLF
  - CANWRITE

Given the WriteFile bug, this seems to be a red herring, but it still bugs me.

Coming back to the topic at hand, note the following simple C program:

C:\> type tt.c
#include <stdio.h>

int main(void) {
    char x[] = {
        0xce, 0xb1, /* α */
        0xce, 0xb2, /* β */
        0xce, 0xb3, /* γ */
        49, 50, 51, 0 /* 123 */
    };
    printf("\n%d\n", printf("%s", x));
    return 0;
}

C:\> chcp
Active code page: 65001

C:\> tt
αβγ123
9

Compare that to the following version:

#define WIN32_LEAN_AND_MEAN
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(void) {
    DWORD n;
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    char x[] = {
        0xce, 0xb1, /* α */
        0xce, 0xb2, /* β */
        0xce, 0xb3, /* γ */
        49, 50, 51, 0 /* 123 */
    };
    /* WriteFile takes a DWORD length, and DWORD is an unsigned long */
    WriteFile(out, x, (DWORD) strlen(x), &n, NULL);
    printf("\n%lu\n", n);
    return 0;
}

Output:

C:\> tx
αβγ123
6

So, while WriteFile cannot count, it is not responsible for the repeated output. Why, then, is ignoring the number of bytes reported in PerlIOWin32_write not solving the problem?