Latin text gives people a misleading idea as to how simple text is. For Latin, e...

Manishearth · on Feb 15, 2018

> Indic scripts go far beyond this by needing to treat entire consonant clusters as single rendered glyphs.

FWIW, Arabic does this too, just that most default Arabic fonts don't. https://www.google.com/get/noto/#nastaliq-aran contains a bunch of specialized ligatures (and is overall a very complicated font).

I always say: There's a reason a lot of the folks working on font shaping are Persian/Arabic speakers :)

jcranmer · on Feb 15, 2018

I didn't know about Arabic having more ligatures than a few common ones (e.g., Allah).

That said, there's a reason that I want to learn Arabic.

Manishearth · on Feb 15, 2018

Yeah, the common ones are lam-alif, which shows as لا, and alif-lam-lam-heh, which shows as الله.

But for example خ/ح/چ at the end of a word often form cool ligatures, and the dots on ب/پ/ت often go to interesting places, and سے forms a cool ligature where it uses the other form of the bari yeh and the س forms little teeth marks up top. (Some of these are Urdu-specific; I can read regular Arabic but I have more experience trying to read Urdu calligraphy)

Arabic calligraphy can be pretty involved.

rtpg · on Feb 16, 2018

One great way of dealing with this at a software engineering level is to stop operating on arrays of characters, and instead add functionality on a case-by-case basis.

For example, you have a wrapper that only lets you iterate forwards. Once you hit languages where you need to "move back", you have to explicitly code in the functionality. And in theory you can do it in a safe way.

Sure, some might argue you'll end up back at arrays. But I believe that if you encode the traversals in a specific way, you'll at least end up at bounds-safe arrays.

For example, if you use an array as a queue, but some other part of the code doesn't, you're gonna have problems. But if you wrap your thing as a queue, no other part of the code will be able to pierce the veil.

Though I don't know how well C abstractions let you do this.
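To make the wrapper idea above concrete, here is a minimal sketch in Python. The names `ForwardCursor`, `BacktrackingCursor`, and `step_back` are hypothetical and do not come from any real text library. Python only enforces privacy by convention (the leading underscore), so this illustrates the shape of the API rather than a hard guarantee.

```python
from typing import Optional, Sequence


class ForwardCursor:
    """Forward-only view over a sequence of characters (hypothetical name)."""

    def __init__(self, items: Sequence[str]) -> None:
        self._items = items  # private by convention: callers never index it
        self._pos = 0

    def peek(self) -> Optional[str]:
        """Return the current item without consuming it, or None at the end."""
        if self._pos < len(self._items):
            return self._items[self._pos]
        return None

    def advance(self) -> Optional[str]:
        """Consume and return the current item, or None at the end."""
        item = self.peek()
        if item is not None:
            self._pos += 1
        return item


class BacktrackingCursor(ForwardCursor):
    """Opt-in extension for code that really needs to "move back"."""

    def step_back(self) -> None:
        # The bounds check lives in one place instead of at every call site.
        if self._pos == 0:
            raise IndexError("already at the start of the text")
        self._pos -= 1
```

Code handed a `ForwardCursor` has no way to move backwards, and code that needs backtracking has to ask for a `BacktrackingCursor` explicitly, which keeps the bounds check in a single method.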

Manishearth · on Feb 20, 2018

I think this oversimplifies the problem.

OpenType basically requires you to operate on a glyph buffer.

What you can do is try to make all glyph buffer operations throw an error that sends you to a fallback case.
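A rough sketch of that fallback pattern, in Python for brevity. Everything here (`GlyphBuffer`, `GlyphBufferError`, `merge`, `shape_with_fallback`) is a hypothetical name invented for illustration and is not the API of any real shaping engine. It assumes the fallback is simply "return the unshaped glyphs".

```python
class GlyphBufferError(Exception):
    """Raised when a buffer operation would touch glyphs that do not exist."""


class GlyphBuffer:
    """Toy glyph buffer whose operations are all bounds-checked (hypothetical)."""

    def __init__(self, glyphs):
        self._glyphs = list(glyphs)

    def get(self, index):
        if not 0 <= index < len(self._glyphs):
            raise GlyphBufferError(f"index {index} out of range")
        return self._glyphs[index]

    def merge(self, start, count, ligature):
        """Replace `count` glyphs starting at `start` with one ligature glyph."""
        if count < 1 or start < 0 or start + count > len(self._glyphs):
            raise GlyphBufferError(f"cannot merge {count} glyphs at {start}")
        self._glyphs[start:start + count] = [ligature]

    def glyphs(self):
        return list(self._glyphs)


def shape_with_fallback(glyphs, shape):
    """Run `shape` on a checked buffer; on any buffer error, return the input unshaped."""
    buffer = GlyphBuffer(glyphs)
    try:
        shape(buffer)
    except GlyphBufferError:
        return list(glyphs)  # fallback case: no ligatures, but no crash either
    return buffer.glyphs()
```

A shaping rule that stays in bounds gets its ligature; one that reaches past the end of the buffer degrades to the unshaped glyphs instead of reading invalid memory.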

dkarl · on Feb 15, 2018

So far, if I understand correctly, nobody knows of a sequence of characters you could write down in any language that would trigger a crash when encoded in Unicode in a straightforward way. That suggests that the invariant being assumed may come, ironically, from a deep understanding of the scripts rather than ignorance.

Manishearth · on Feb 16, 2018

No, the crashy Bengali sequences are reasonable to have; like I mentioned, ZWNJ has semantic meaning with Bengali vowels.

dkarl · on Feb 16, 2018

I am judging by this statement in the blog post:

> In Bengali and Oriya specifically, a ZWNJ can be used to force a different vowel form when used before a vowel (e.g. রু vs র‌ু), however this bug seems to apply to vowels for which there is only one form

This seems to say that the ZWNJ has a meaning before vowels that have different forms, but the crash happens with vowels that only have one form, where the ZWNJ has no effect. Maybe I am misreading?
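For readers who cannot see the difference between the two example sequences in the quote, the sketch below spells them out as code points. It assumes the second form is RA, then ZWNJ (U+200C), then the vowel sign U, which matches the quote's description of a ZWNJ placed before the vowel. The helper name `describe` is made up for this example.

```python
import unicodedata

# Assumed encodings of the two forms mentioned in the quoted statement.
plain = "\u09B0\u09C1"            # RA + VOWEL SIGN U
with_zwnj = "\u09B0\u200C\u09C1"  # RA + ZWNJ + VOWEL SIGN U


def describe(text):
    """List each code point in `text` with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("plain", plain), ("with ZWNJ", with_zwnj)):
    print(label, describe(text))
```

The only difference between the two strings is the single invisible ZERO WIDTH NON-JOINER in the middle.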

Manishearth · on Feb 16, 2018

Yeah, you are.

I'm saying that this crash _also_ applies to vowels with one form.

র‌ু was the original Bengali crash, and that has two forms. I'm saying the crash is less likely to be related to the ZWNJ-vowel interaction, because it also occurs for vowels where no such interaction exists.

umanwizard · on Feb 16, 2018

What do you mean by "straightforward" ?