Hacker News

This is kind of just a measurement of how well represented a language is in the tokenizer's training data. You could have a single token equal to “public static void main”.
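As a rough illustration of that point, here is a toy byte-pair-encoding trainer. It is only a sketch under stated assumptions: `train_bpe`, `merge_pair`, and the tiny corpus are made up for this example, and it does no whitespace pre-splitting, which many production tokenizers do, so they would not normally merge across spaces like this.

```python
from collections import Counter

def merge_pair(seq, a, b):
    """Replace every adjacent (a, b) in seq with the merged symbol a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(lines, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    seqs = [list(line) for line in lines]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every line is already a single token
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        seqs = [merge_pair(seq, a, b) for seq in seqs]
    return merges, seqs

# Hypothetical corpus: one phrase heavily over-represented.
corpus = ["public static void main"] * 100 + ["fn main"] * 5
merges, seqs = train_bpe(corpus, num_merges=50)
print(seqs[0])  # the over-represented phrase, now a single symbol
```

With enough merges, a phrase that dominates the training data collapses into one symbol, which is the sense in which token counts track representation.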


If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder whether it's just that curly brackets take a lot of tokens.


I imagine that having to write

  for (int index = 0; index < size; ++index)

instead of

  for index in 0...size

eats up a lot of tokens, especially in C where you also need this construct for iterating over arrays.


Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it: fewer rounds of editing and debugging, ergo fewer tokens. C is a nightmare.


The most efficient languages are pretty unpopular, so doesn't this argument make them even more efficient in reality?


You could, but you wouldn't when each of those keywords can vary independently in equivalent contexts.


A WordPiece tokenizer greedily takes the longest valid token prefix, and BPE often ends up in the same place by applying its learned merges. So if your text starts with “public static void main”, it will try to find the longest token that matches that prefix. Even if “public” is a token, it will prefer to tokenize “public static” together when that longer token exists.
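A minimal sketch of the longest-match behavior described here, assuming a hand-written vocabulary. `greedy_tokenize` and `vocab` are hypothetical names for illustration. Note that this is the WordPiece-style strategy; BPE encoders typically replay learned merges in priority order instead, which usually but not always yields the same split.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match-first segmentation (WordPiece-style)."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Nothing in the vocabulary matches: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary containing both "public" and "public static".
vocab = {"pub", "public", "public static", " static", " void", " main"}
print(greedy_tokenize("public static void main", vocab))
# ['public static', ' void', ' main']
```

Both "public" and "public static" sit in the vocabulary, and the longer one wins whenever the text allows it, while "public" remains available for other contexts.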


Yes, but then you have both alternatives as tokens, which nullifies GP's argument.


What do you mean?

`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.


I meant that it wouldn't be efficient to agglomerate tokens in that way, and that's why the system won't do it.



