Trying to handle character encoding on Windows in multi-platform programs is a nightmare. In C++ you can almost always get away with treating C strings as UTF-8 for input/output; you only need special consideration for the encoding when you want to do language-based tasks like converting to lowercase or measuring the "display width" of a string. Not on Windows. Whether or not you define the magical UNICODE macro, Windows will fail to open UTF-8-encoded filenames using standard C library functions. You have to use non-standard wchar_t overloads or the Windows API. That is to say, there is no standards-conformant, internationalization-friendly way to open a file by name on Windows in C or C++. I really wish Microsoft would at least support UTF-8, even if they want to stick with UTF-16 internally.
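To make the complaint concrete, here is a minimal sketch of the workaround the thread is describing: a fopen-style helper that accepts UTF-8 names on both platforms. `fopen_utf8` is an illustrative name, not a standard function; the Win32 calls (`MultiByteToWideChar`, `_wfopen`) are the documented wide-character route.

```cpp
// Sketch: opening a file whose name is UTF-8, portably. On Windows the
// "A" functions interpret char* names in the legacy ANSI code page, so
// we convert the UTF-8 name to UTF-16 and use the wide-character
// overload instead. On POSIX systems, filenames are opaque bytes and
// UTF-8 just works.
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

static std::FILE* fopen_utf8(const char* name, const char* mode) {
    // Measure, then convert, the UTF-8 name to UTF-16.
    int n = MultiByteToWideChar(CP_UTF8, 0, name, -1, nullptr, 0);
    std::wstring wname(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, name, -1, &wname[0], n);

    // Mode strings are ASCII, so widening char-by-char is safe.
    std::wstring wmode(mode, mode + std::char_traits<char>::length(mode));
    return _wfopen(wname.c_str(), wmode.c_str());
}
#else
static std::FILE* fopen_utf8(const char* name, const char* mode) {
    return std::fopen(name, mode);
}
#endif
```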
For a company that claims to be so supportive of "developers, developers, developers", Microsoft's stubborn and developer-hostile approach to internationalization and their dogged loyalty to the awful UTF-16 encoding is ironic.
The Right Thing To Do at this point is to make UTF-8 a multi-byte code page in Windows and build a UTF-8 implementation into the MSVC C runtime. The milquetoast excuse I hear from Microsoft people is that some Win32 APIs can't handle MBCS encodings with more than 3 bytes per character. Which sort of sounds like a problem for developers to fix; perhaps Microsoft could hire some?
Before "developers, developers, developers" comes "backward compatibility, backward compatibility, backward compatibility". Windows was perhaps the first commercial platform to commit to Unicode; they made that commitment when UTF-8 was still some notes scribbled on Ken Thompson's placemat. And all future Win32 implementations must be 100% binary compatible with previous ones. That creates inertia for UTF-16 (or UCS-2), true, but the backwards-compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime. The decision to stick with 16-bit Unicode is an engineering tradeoff.
but the backwards compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime
The Linux user-space ABI is extremely stable. The one thing that would frustrate the development of long-service-lifetime software on Linux is library availability across distros, but the actual operating environment the kernel and the image loader present to user-space software on Linux is very, very stable.
And all future Win32 implementations must be 100% binary compatible with previous ones
And no one is asking Microsoft to break the Windows user-space ABI. Adding a new, sufficiently tested MBCS code page would have no impact at all on existing Windows software. None whatsoever. Other than to make localization a whole lot easier. And compared to the cost of building in a whole new csrss-level subsystem like the one Microsoft just built into Windows last month (Linux), it's probably a lot safer and easier to test.
Working in this world myself, I find Windows developers become philosophical when discussing localization: "Someday, we'll turn on UNICODE," "We really should be using TCHAR" (as if that silly thing would fix anything at all), "Shouldn't we really be using wstring?"
OS X, Linux, and mobile developers just do it. It's mostly a solved problem on those platforms.
The decision to stick with 16-bit Unicode is an engineering tradeoff.
It's a cost tradeoff - Microsoft doesn't want to spend the developer and testing time adding a UTF-8 codepage.
Currently we do "support" CP65001 in the console, but things break if you enable it. One of the problems, for example, is that .NET sees 65001 and starts outputting the UTF-8 BOM everywhere, breaking applications that don't even care about the character encoding. I suspect that's going to be difficult to fix without breaking compatibility.
Having said that, I think it's apparent that we are investing heavily in the console for the first time in a long while, so I'm more hopeful than ever that we can get this fixed.
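The .NET-sprays-BOMs-everywhere problem above can at least be worked around defensively on the consuming side. A minimal sketch, assuming nothing beyond the BOM's well-known byte sequence; `strip_utf8_bom` is an illustrative helper, not part of any API:

```cpp
// Sketch: defensively stripping a UTF-8 BOM (the byte sequence
// EF BB BF) from the front of a buffer. Tools that prepend the BOM
// break consumers that treat the stream as plain bytes, so robust
// readers skip it if present.
#include <string>

static std::string strip_utf8_bom(const std::string& s) {
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF)
        return s.substr(3);
    return s;
}
```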
The BOM is an aBOMination. If you simply assume that everything is UTF-8 until proven otherwise, you can get pretty far - the legacy code pages produce text that's not usually valid UTF-8 unless they stick to ASCII.
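The "assume UTF-8 until proven otherwise" heuristic can be sketched as a structural validity check. This is a simplified version: it verifies the byte-sequence structure but deliberately skips the stricter rules (rejecting overlong encodings and surrogate code points) that a production validator would also enforce.

```cpp
// Sketch: a minimal UTF-8 structural check. Text in legacy single-byte
// code pages (e.g. CP-1252) almost never forms valid UTF-8 sequences
// unless it is pure ASCII, so "is it valid UTF-8?" is a cheap heuristic
// for "treat it as UTF-8".
#include <string>

static bool looks_like_utf8(const std::string& s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        size_t len;
        if (c < 0x80) len = 1;                 // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 2-byte lead
        else if ((c & 0xF0) == 0xE0) len = 3;  // 3-byte lead
        else if ((c & 0xF8) == 0xF0) len = 4;  // 4-byte lead
        else return false;                     // stray continuation byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;                  // not a continuation byte
        i += len;
    }
    return true;
}
```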
Yes, you can assume that, but existing applications, some written in the 1990s or even the 1980s, won't. And millions of computers, some in important industrial companies, are still running them.
Well, if I could use UTF-8 with CreateFileA() et al. and never have to use wchar_t again, that would just be like Christmas. I don't know whether this survey is about blanket Win32 UTF-8 compatibility or just about making the admittedly hugely improved cmd.exe work with UTF-8 command parsing, but anything that gives us better UTF-8 support is a step in the right direction.
the actual operating environment presented by the kernel and the image loader for user-space software on linux is very, very stable.
The guarantees that Windows makes about backwards-compatibility go far beyond just the image-loader and syscalls. For comparison, the glibc backwards-compatibility story is... messier. https://www.kernel.org/pub/software/libs/glibc/hjl/compat/
Linux had the luxury of waiting until UTF-8 was available before settling on a Unicode strategy. Given that UTF-8 was extremely compatible both forwards and backwards, it was a fortuitous choice. I weep a little whenever I think about how Microsoft messed up by being an early adopter of Unicode - being an early adopter should have made life better, not worse.
True story: Office maintained its API to stay compatible with a plugin whose developer has long since gone bankrupt, and whose source code is lost.
Because millions of US Government computers are still using them.
Especially hilarious is how SQL Server's nchar is widechar, so almost all string data is twice as large as it needs to be. And there isn't a UTF-8 option to enable! This might have been addressed in a later version, but I wouldn't be surprised if it were limited to Enterprise Edition or other crap.
Bloody hell... I looked this up as I was certain it couldn't still be the case. But yes, it is!
This was logged as an issue on Connect in 2008 [1], and Microsoft's response was:
"Thanks for your suggestion. We are considering adding support for UTF8 in the next version of SQL Server. It is not clear at this point if it will be a new type or integrate it with existing types. We understand the pain in terms of integrating with UTF8 data and we are looking at ways to effectively resolve it."
It gets even more fun when you have no choice but to use a poorly maintained closed-source third-party library (par for the course on Windows). If the library authors didn't use this trick themselves you can get stuck in situations where you have no way to make the library open the file without resorting to the DOS 8.3 name. And if the user disabled 8.3 names system-wide... then the only consolation is that something even more important on the system will probably break first. A sad state of affairs in 2016.
The third party library situation is indeed trickier. I've never found a library that I couldn't get to open Unicode paths eventually though. Sometimes you have to open a file handle yourself and pass it to the library.
Edit: By the way the worst handling of Unicode paths on Windows I have found is by Ruby, which is still partially broken. (last time I checked)
What is broken? Have you reported it? I've reported a related bug this week (File.truncate called CreateFileA() with a UTF-8 string). I even had a working patch but forgot to attach it. Anyway it was almost immediately fixed including various additional tests for other File class methods (which handled Unicode without any problems).
The section titled "How to do text on Windows" on http://utf8everywhere.org/#windows covers the insanity in more detail.