Trying to handle character encoding on Windows in multi-platform programs is a nightmare. In C++ you can almost always get away with treating C strings as UTF-8 for input/output; you only need special consideration for the encoding when you want to do language-based tasks like converting to lowercase or measuring the "display width" of a string. Not on Windows. Whether or not you define the magical UNICODE macro, Windows will fail to open UTF-8-encoded filenames using standard C library functions. You have to use non-standard wchar_t overloads or the Windows API. That is to say, there is no standards-conformant, internationalization-friendly way to open a file by name on Windows in C or C++. I really wish Microsoft would at least support UTF-8, even if they want to stick with UTF-16 internally.
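To make the complaint concrete, here is a minimal sketch of the workaround the thread is describing: a fopen-style helper that accepts UTF-8 names on both platforms. `fopen_utf8` is an illustrative name, not a standard function; the Win32 calls (`MultiByteToWideChar`, `_wfopen`) are the documented wide-character route.

```cpp
// Sketch: opening a file whose name is UTF-8, portably. On Windows the
// "A" functions interpret char* names in the legacy ANSI code page, so
// we convert the UTF-8 name to UTF-16 and use the wide-character
// overload instead. On POSIX systems, filenames are opaque bytes and
// UTF-8 just works.
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

static std::FILE* fopen_utf8(const char* name, const char* mode) {
    // Measure, then convert, the UTF-8 name to UTF-16.
    int n = MultiByteToWideChar(CP_UTF8, 0, name, -1, nullptr, 0);
    std::wstring wname(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, name, -1, &wname[0], n);

    // Mode strings are ASCII, so widening char-by-char is safe.
    std::wstring wmode(mode, mode + std::char_traits<char>::length(mode));
    return _wfopen(wname.c_str(), wmode.c_str());
}
#else
static std::FILE* fopen_utf8(const char* name, const char* mode) {
    return std::fopen(name, mode);
}
#endif
```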
For a company that claims to be so supportive of "developers, developers, developers", Microsoft's stubborn and developer-hostile approach to internationalization and their dogged loyalty to the awful UTF-16 encoding is ironic.
The Right Thing To Do at this point is to make UTF-8 a multi-byte code page in Windows and build a UTF-8 implementation into the MSVC C runtime. The milquetoast excuse I hear from Microsoft people is that some Win32 APIs can't handle MBCS encodings with more than 3 bytes per character. Which sort of sounds like a problem for developers to fix; perhaps Microsoft could hire some?
Before "developers, developers, developers" comes "backward compatibility, backward compatibility, backward compatibility". Windows was perhaps the first commercial platform to commit to Unicode; they made that commitment when UTF-8 was still some notes scribbled on Ken Thompson's placemat. And all future Win32 implementations must be 100% binary compatible with previous ones. That creates inertia for UTF-16 (or UCS-2), true, but the backwards-compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime. The decision to stick with 16-bit Unicode is an engineering tradeoff.
but the backwards compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime
The Linux user-space ABI is extremely stable. The one thing that would frustrate the development of long-service-lifetime software on Linux is library availability across distros, but the actual operating environment the kernel and the image loader present to user-space software on Linux is very, very stable.
And all future Win32 implementations must be 100% binary compatible with previous ones
And no one is asking Microsoft to break the Windows user-space ABI. Adding a new, sufficiently tested MBCS code page would have no impact at all on existing Windows software. None whatsoever. Other than to make localization a whole lot easier. And compared to the cost of building in a whole new csrss-level subsystem like the one Microsoft just built into Windows last month (Linux), it's probably a lot safer and easier to test.
Working in this world myself, I find Windows developers become philosophical when discussing localization: "Someday, we'll turn on UNICODE," "We really should be using TCHAR" (as if that silly thing would fix anything at all), "Shouldn't we really be using wstring?"
OS X, Linux, and mobile developers just do it. It's mostly a solved problem on those platforms.
The decision to stick with 16-bit Unicode is an engineering tradeoff.
It's a cost tradeoff - Microsoft doesn't want to spend the developer and testing time adding a UTF-8 codepage.
Currently we do "support" CP65001 in the console, but things break if you enable it. One of the problems, for example, is that .NET sees 65001 and starts outputting the UTF-8 BOM everywhere, breaking applications that don't even care about the character encoding. I suspect that's going to be difficult to fix without breaking compatibility.
Having said that, I think it's apparent that we are investing heavily in the console for the first time in a long while, so I'm more hopeful than ever that we can get this fixed.
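The .NET-sprays-BOMs-everywhere problem above can at least be worked around defensively on the consuming side. A minimal sketch, assuming nothing beyond the BOM's well-known byte sequence; `strip_utf8_bom` is an illustrative helper, not part of any API:

```cpp
// Sketch: defensively stripping a UTF-8 BOM (the byte sequence
// EF BB BF) from the front of a buffer. Tools that prepend the BOM
// break consumers that treat the stream as plain bytes, so robust
// readers skip it if present.
#include <string>

static std::string strip_utf8_bom(const std::string& s) {
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF)
        return s.substr(3);
    return s;
}
```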
The BOM is an aBOMination. If you simply assume that everything is UTF-8 until proven otherwise, you can get pretty far - the legacy code pages produce text that's not usually valid UTF-8 unless they stick to ASCII.
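The "assume UTF-8 until proven otherwise" heuristic can be sketched as a structural validity check. This is a simplified version: it verifies the byte-sequence structure but deliberately skips the stricter rules (rejecting overlong encodings and surrogate code points) that a production validator would also enforce.

```cpp
// Sketch: a minimal UTF-8 structural check. Text in legacy single-byte
// code pages (e.g. CP-1252) almost never forms valid UTF-8 sequences
// unless it is pure ASCII, so "is it valid UTF-8?" is a cheap heuristic
// for "treat it as UTF-8".
#include <string>

static bool looks_like_utf8(const std::string& s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        size_t len;
        if (c < 0x80) len = 1;                 // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 2-byte lead
        else if ((c & 0xF0) == 0xE0) len = 3;  // 3-byte lead
        else if ((c & 0xF8) == 0xF0) len = 4;  // 4-byte lead
        else return false;                     // stray continuation byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;                  // not a continuation byte
        i += len;
    }
    return true;
}
```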
Yes, you can assume that, but existing applications, some written in the 1990s or even the 1980s, won't. And millions of computers, some in important industrial companies, are still running them.
Well, if I could use UTF-8 with CreateFileA() et al. and never have to use wchar_t again, that would just be like Christmas. I don't know whether this survey is about blanket Win32 UTF-8 compatibility or just about making the admittedly hugely improved cmd.exe work with UTF-8 command parsing, but anything that gives us better UTF-8 support is a step in the right direction.
the actual operating environment presented by the kernel and the image loader for user-space software on linux is very, very stable.
The guarantees that Windows makes about backwards-compatibility go far beyond just the image-loader and syscalls. For comparison, the glibc backwards-compatibility story is... messier. https://www.kernel.org/pub/software/libs/glibc/hjl/compat/
Linux had the luxury of waiting until UTF-8 was available before settling on a Unicode strategy. Given that UTF-8 was extremely compatible both forwards and backwards, it was a fortuitous choice. I weep a little whenever I think about how Microsoft messed up by being an early adopter of Unicode - being an early adopter should have made life better, not worse.
True story: Office maintained its API to stay compatible with a plugin whose developer has long since gone bankrupt, and whose source code is lost.
Because millions of US Government computers are still using them.
Especially hilarious is how SQL Server's nchar is widechar, so almost all string data is twice as large as it needs to be. And there isn't a UTF-8 option to enable! This might have been addressed in a later version, but I wouldn't be surprised if it were limited to Enterprise Edition or other crap.
Bloody hell... I looked this up as I was certain it couldn't still be the case. But yes, it is!
This was logged as an issue on Connect in 2008 [1], and Microsoft's response was:
"Thanks for your suggestion. We are considering adding support for UTF8 in the next version of SQL Server. It is not clear at this point if it will be a new type or integrate it with existing types. We understand the pain in terms of integrating with UTF8 data and we are looking at ways to effectively resolve it."
It gets even more fun when you have no choice but to use a poorly maintained closed-source third-party library (par for the course on Windows). If the library authors didn't use this trick themselves you can get stuck in situations where you have no way to make the library open the file without resorting to the DOS 8.3 name. And if the user disabled 8.3 names system-wide... then the only consolation is that something even more important on the system will probably break first. A sad state of affairs in 2016.
The third party library situation is indeed trickier. I've never found a library that I couldn't get to open Unicode paths eventually though. Sometimes you have to open a file handle yourself and pass it to the library.
Edit: By the way the worst handling of Unicode paths on Windows I have found is by Ruby, which is still partially broken. (last time I checked)
What is broken? Have you reported it? I've reported a related bug this week (File.truncate called CreateFileA() with a UTF-8 string). I even had a working patch but forgot to attach it. Anyway it was almost immediately fixed including various additional tests for other File class methods (which handled Unicode without any problems).
The section titled "How to do text on Windows" on http://utf8everywhere.org/#windows covers the insanity in more detail.