This is kind of just a measurement of how representative a language is in the di...

make3 · 2026-01-12T05:17:31 1768195051

If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.

xigoi · 2026-01-12T06:49:57 1768200597

I imagine that having to write

  for (int index = 0; index < size; ++index)

instead of

  for index in 0...size

eats up a lot of tokens, especially in C where you also need this construct for iterating over arrays.
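A rough way to see the gap is to count lexical tokens in the two loop headers. This is only a sketch: the regex below is a made-up, simplified lexer and not an LLM tokenizer, so the numbers are a proxy for how much more syntax the C form carries, not actual BPE token counts.

```python
import re

# Hypothetical, simplified lexer: identifiers, integers, "++", "...",
# and any other single punctuation character each count as one lexeme.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|\.\.\.|[^\w\s]")

def count_lexemes(source: str) -> int:
    """Count lexical tokens in a source snippet (a proxy, not LLM tokens)."""
    return len(LEXEME.findall(source))

c_style = "for (int index = 0; index < size; ++index)"
range_style = "for index in 0...size"

print(count_lexemes(c_style), count_lexemes(range_style))  # prints: 14 6
```

A real tokenizer will merge or split these differently, but the loop variable appearing three times instead of once is visible either way.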

cryptonector · 2026-01-12T07:12:33 1768201953

Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it: fewer rounds of editing and debugging, ergo fewer tokens. C is a nightmare.

moelf · 2026-01-12T05:00:33 1768194033

the most efficient languages are pretty unpopular, so this argument makes them even more efficient in reality?...

muyuu · 2026-01-12T04:09:29 1768190969

You could, but you wouldn't when those keywords can all change in equivalent contexts.

janalsncm · 2026-01-12T20:22:20 1768249340

A WordPiece-style tokenizer will greedily take the longest valid token prefix (BPE applies its learned merges in priority order instead, which often ends up similar). So if your text starts with “public static void main” it will try to find the longest token which matches that prefix. Even if “public” is a token, it will prefer to tokenize “public static” together if that is also in the vocabulary.
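Here is a minimal sketch of that longest-match-first behaviour, in the WordPiece style. The vocabulary is invented for illustration, and real tokenizers differ: BPE applies learned merges rather than a longest-prefix scan, and many tokenizers pre-split on whitespace, so a token spanning a space like "public static" is an assumption here.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match-first tokenizer (WordPiece-style, not real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocab matches here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary containing "pub", "public" and "public static".
VOCAB = {"pub", "public", "public static", " static", " void", " main"}

print(greedy_tokenize("public static void main", VOCAB))
# ['public static', ' void', ' main']
print(greedy_tokenize("public void", VOCAB))
# ['public', ' void']
```

Having "pub" and "public" in the vocabulary does not stop "public static" from being chosen; the shorter entries are only used where the longer one does not match.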

muyuu · 2026-01-14T01:17:54 1768353474

yes, but then you have both alternatives as tokens, which nullifies GP's argument

eru · 2026-01-12T04:17:45 1768191465

What do you mean?

`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.

muyuu · 2026-01-14T01:17:02 1768353422

I meant that it wouldn't be efficient to agglomerate tokens in that way, and that's why the system won't do it.