Latin text gives people a misleading idea as to how simple text is. For Latin, each character is generally an independent unit with independent metrics isolated from its environment, with a small set of exceptions to this rule (ligatures).
East Asian ideographs bring up interesting questions about what constitutes a character, with Unicode "solving" the problem by saying "every distinct rendering is a distinct character," necessitating somewhere in the region of 80,000 characters or so once all of them get added. Even more difficult are scripts like Korean Hangul or Egyptian and Mayan glyphs, which are composed from a relatively small set of independent units but laid out in blocks and sub-blocks themselves composed in linear text. Unicode has both precomposed Hangul characters and the individual Jamo radicals, but it has currently punted on being able to accurately represent Egyptian or Mayan text in the first place (although they do appear to be revisiting that decision).
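To make the precomposed-versus-Jamo point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample syllable (한, U+D55C) is my own choice for illustration, not something from the discussion above.

```python
import unicodedata

# One precomposed Hangul syllable (sample chosen for illustration).
syllable = "\ud55c"  # 한

# Canonical decomposition (NFD) splits it into its conjoining Jamo.
jamo = unicodedata.normalize("NFD", syllable)
for ch in jamo:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Canonical composition (NFC) puts the Jamo back into a single code point.
recomposed = unicodedata.normalize("NFC", jamo)
print(len(syllable), len(jamo), recomposed == syllable)
```

Both spellings are supposed to render as the same block, so any code that equates "number of code points" with "number of things on screen" is already wrong here.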
Scripts like Arabic differ from Latin in that glyphs changing shape according to their surrounding context is the norm rather than the exception. However, in Arabic, you largely break this down into an initial/medial/final form, with some characters inducing a shift from medial back to initial. Indic scripts go far beyond this by needing to treat entire consonant clusters as single rendered glyphs.
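As a rough illustration of the isolated/initial/medial/final idea: Unicode carries legacy "presentation form" code points for Arabic letters, and their compatibility decompositions name the form. This sketch only inspects those code points for the letter beh (my choice of example). Real text uses the base letter, and the font and shaping engine pick the form at render time.

```python
import unicodedata

# Arabic Presentation Forms-B entries for the letter beh (U+0628).
BEH_FORMS = range(0xFE8F, 0xFE93)

forms = {}
for cp in BEH_FORMS:
    ch = chr(cp)
    # decomposition() returns e.g. "<initial> 0628"
    tag, base = unicodedata.decomposition(ch).split()
    forms[tag] = base
    print(f"U+{cp:04X}", unicodedata.name(ch), tag, base)

# All four positional forms fold back to the same base letter under NFKC.
folded = {unicodedata.normalize("NFKC", chr(cp)) for cp in BEH_FORMS}
```

One stored letter, four possible shapes, and which one you get depends entirely on the neighbours.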
The end result is that it's very easy for invariants that one expects to hold, if one is used to Latin text, to be violated in other languages. Some concepts of text metrics might not even exist in the first place. As speculated elsewhere, it's probable that the text is crashing because someone insufficiently versed in scripts is asserting an invariant that doesn't actually exist. It does not appear to be a rendering issue, but rather a slightly higher level operation on top of that instead.
Finding these sorts of issues pretty much requires extensive fuzzing with known problematic scenarios. The fault is caused by "I didn't know this exists" which is both a very reasonable situation (few UI experts are well-versed in the complexities of foreign scripts) and very hard to solve from a process perspective.
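A very rough sketch of what "fuzzing with known problematic scenarios" could look like. Everything here is hypothetical: the pool of code points is just a handful of base letters, joiners, viramas and combining marks that I picked, and `target` stands in for whatever text-handling function you want to exercise. A real native crash would take the whole process down rather than raise an exception, so a serious harness would run the target in a subprocess.

```python
import random

# Hypothetical seed pool: a few base letters plus characters that tend to
# interact with their neighbours (joiners, viramas, vowel signs, accents).
BASES = ["\u09b0", "\u0b30", "\u0628", "a"]      # Bengali RA, Oriya RA, Arabic BEH, Latin a
MODIFIERS = ["\u200c", "\u200d",                 # ZWNJ, ZWJ
             "\u09cd", "\u0b4d",                 # Bengali and Oriya virama
             "\u09c1", "\u0b41", "\u0301"]       # vowel signs U, combining acute

def random_cluster(rng, max_extra=6):
    """Build a short string: one base letter followed by random units."""
    parts = [rng.choice(BASES)]
    for _ in range(rng.randrange(max_extra)):
        parts.append(rng.choice(BASES + MODIFIERS))
    return "".join(parts)

def fuzz(target, iterations=1000, seed=0):
    """Call target on random clusters and collect inputs that raise."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(iterations):
        text = random_cluster(rng)
        try:
            target(text)
        except Exception as exc:
            failures.append((text, exc))
    return failures
```

Usage would be something like `fuzz(my_layout_function)`, then printing the code points of each failing input to look for a common pattern.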
> Indic scripts go far beyond this by needing to treat entire consonant clusters as single rendered glyphs.
FWIW, Arabic does this too, just that most default Arabic fonts don't. https://www.google.com/get/noto/#nastaliq-aran contains a bunch of specialized ligatures (and is overall a very complicated font).
I always say: There's a reason a lot of the folks working on font shaping are Persian/Arabic speakers :)
Yeah, the common ones are lam-alif showing لا , and alif-lam-lam-heh showing الله .
But for example خ/ح/چ at the end of a word often form cool ligatures, and the dots on ب/پ/ت often go to interesting places, and سے forms a cool ligature where it uses the other form of the bari yeh and the س forms little teeth marks up top. (Some of these are Urdu-specific; I can read regular Arabic but I have more experience trying to read Urdu calligraphy)
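For the lam-alif case specifically, Unicode has a legacy compatibility code point for the ligature, which makes the "two letters, one shape" relationship easy to see from Python. This is only an illustration: normal text stores the two letters, and the font produces the ligature.

```python
import unicodedata

ligature = "\ufefb"  # legacy presentation-form code point for lam-alif
print(unicodedata.name(ligature))

# Compatibility normalization expands it to the two underlying letters.
letters = unicodedata.normalize("NFKC", ligature)
print([unicodedata.name(c) for c in letters])
```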
One great way of dealing with this at a software engineering level is to stop operating on arrays of characters, and instead add functionality on a case-by-case basis.
For example, you have a wrapper that only lets you iterate forwards. Once you hit languages where you need to "move back", you have to explicitly code in the functionality. And in theory you can do it in a safe way.
Sure, some might argue you'll end up back at arrays. But I believe that if you encode the traversals in a specific way, you'll at least end up at bounds-safe arrays.
For example, if you use an array as a queue, but some other part of the code doesn't, you're gonna have problems. But if you wrap your thing as a queue, no other part of the code will be able to pierce the veil.
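Here is a minimal sketch of the wrapper idea, in Python rather than C. The class names are made up for illustration. Python only hides the underlying storage by convention (the leading underscore), so this shows the shape of the API rather than real enforcement.

```python
class ForwardCursor:
    """Forward-only view over a sequence of text units (hypothetical name)."""

    def __init__(self, units):
        self._units = tuple(units)  # private copy; callers never index it
        self._pos = 0

    def peek(self):
        """Return the current unit without consuming it, or None at the end."""
        if self._pos < len(self._units):
            return self._units[self._pos]
        return None

    def advance(self):
        """Consume and return the current unit, or None at the end."""
        unit = self.peek()
        if unit is not None:
            self._pos += 1
        return unit


class RewindableCursor(ForwardCursor):
    """Adds the 'move back' operation explicitly, with its own bounds check."""

    def back(self):
        """Step back one unit and return it, or None if already at the start."""
        if self._pos == 0:
            return None
        self._pos -= 1
        return self._units[self._pos]
```

Code that is handed a `ForwardCursor` has no way to step backwards or index past the end. Code that needs to move back has to ask for a `RewindableCursor`, and the only way to do so is through a method that checks bounds.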
Though I don't know how well C abstractions let you do this.
So far, if I understand correctly, nobody knows of a sequence of characters you could write down in any language that would trigger a crash when encoded in Unicode in a straightforward way. That suggests that the invariant being assumed may come, ironically, from a deep understanding of the scripts rather than ignorance.
In Bengali and Oriya specifically, a ZWNJ can be used to force a different vowel form when used before a vowel (e.g. রু vs রু), however this bug seems to apply to vowels for which there is only one form
This seems to say that the ZWNJ has a meaning before vowels that have different forms, but the crash happens with vowels that only have one form, where the ZWNJ has no effect. Maybe I am misreading?
I'm saying that this crash _also_ applies to vowels with one form.
রু was the original Bengali crash, and that has two forms. I'm saying it's less likely to be related to the zwnj-vowel interaction because it also occurs for vowels where such interaction doesn't exist.
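Since ZWNJ is invisible, it helps to dump code points when comparing sequences like these. A small sketch follows; the two strings are built from escapes and are my assumption of what "with and without ZWNJ before the vowel sign" looks like, not the exact crash sequence.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]

# Assumed illustrative sequences: RA + vowel sign U, with and without ZWNJ.
plain = "\u09b0\u09c1"
with_zwnj = "\u09b0\u200c\u09c1"

for label, text in (("plain", plain), ("with ZWNJ", with_zwnj)):
    print(label, describe(text))
```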