> The weird part: different duplication patterns create different cognitive "modes" from the same weights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling (13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing.
As far as I can see that's not implied by the original post.
But that's beside the point: quoting the bit where the poster says "here's what I'm building on top of" and using that to imply they haven't done anything new is a bit pointless, no?
You're right that my quote was misleading; I overlooked "the weird part" in the post because it didn't seem new to me either.
Here's the section in the original post that covers it: https://dnhkng.github.io/posts/rys/#the-brain-scanner All heatmaps are split by task and show an optimal point for each. The routing he chose is a trade-off between the two tasks; there isn't much else to do unless you intend to train a router anyway.
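To make the "same weights, different routing" idea concrete, here is a minimal sketch of how such layer orders could be written down. The helper names (`repeat_block`, `interleave_doubling`) and the 18-layer toy size are my own assumptions for illustration, not something from either post, and I'm assuming a multi-pass simply re-runs one contiguous block of layers.

```python
def repeat_block(n_layers, start, end, passes=2):
    """Execution order that runs layers [start, end) `passes` times in a row.

    Hypothetical helper: the whole block is traversed, then traversed again.
    """
    if not 0 <= start < end <= n_layers:
        raise ValueError("need 0 <= start < end <= n_layers")
    block = list(range(start, end))
    # Layers before and including the block, then the extra passes, then the rest.
    return list(range(0, end)) + block * (passes - 1) + list(range(end, n_layers))


def interleave_doubling(n_layers, start, end):
    """Execution order that runs each layer in [start, end) twice back to back."""
    if not 0 <= start < end <= n_layers:
        raise ValueError("need 0 <= start < end <= n_layers")
    order = []
    for layer in range(n_layers):
        order.append(layer)
        if start <= layer < end:
            order.append(layer)  # immediate second pass through the same layer
    return order


# Toy example: the same 18 layers, two different routings.
double_pass = repeat_block(18, 13, 16)
interleaved = interleave_doubling(18, 13, 16)
print(double_pass)
print(interleaved)
```

Both orders reuse the same layer indices, so no new weights are involved; only the sequence in which the existing layers are applied changes.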
> So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.
This is stated in the original post as well, under the "The Beginning of LLM Neuroanatomy?" section:
> From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.
> So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.
> This is a much more specific claim than “middle layers do reasoning.” It’s saying the reasoning cortex is organised into functional circuits: coherent multi-layer units that perform complete cognitive operations. Each circuit is an indivisible processing unit, and the sweeps seen in the heatmap is essentially discovering the boundaries of these circuits.
So the reason the comment appears weirdly disconnected from the content of the article is that it was generated independently from the content of the article.
It's just setting the font-family in the style attribute of a <span>. (As you can see by inspecting the text/html content of your clipboard, e.g. with `xclip -selection clipboard -o -t text/html`)
Well, the weights are accumulated in full precision and are multiplied by a full-precision scale factor after quantization, and the activations and backward pass are computed in full precision as well, so it's not quite true 4-bit precision training. The resulting model can be stored with just slightly more than 4 bits per parameter, though.
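A rough sketch of what I mean, assuming plain symmetric per-tensor quantization (this is a generic illustration, not the paper's actual scheme; the function names and numbers are made up):

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # full-precision scale factor
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover the approximate float weights used in the forward pass."""
    return [c * scale for c in codes]


# Full-precision "master" weights that training keeps accumulating into.
master = [0.31, -0.12, 0.07, -0.50]

# Forward pass sees the quantized-then-rescaled weights.
codes, scale = quantize_4bit(master)
used_in_forward = dequantize(codes, scale)

# The gradient step (hypothetical gradients) updates the master copy, not the codes.
grads = [0.01, -0.02, 0.03, 0.0]
learning_rate = 0.1
master = [w - learning_rate * g for w, g in zip(master, grads)]
```

Only `codes` and `scale` need to be stored afterwards. If, say, one 16-bit scale were shared by a group of 64 weights (a made-up group size), that would come to 4 + 16/64 = 4.25 bits per parameter.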
Nice work! The paper feels verbose at times and could use some editing to slim it down (also, equation 6 is just equation 5 in a box) but I enjoyed it a lot nonetheless.
There are a few pictures of truncated icosahedra in the article, alongside several other shapes that are not icosahedra. The point is that they have icosahedral symmetry. The L is important.
I was going to comment pedantically that soccer balls were dodecahedrons not icosahedrons, but in reading the article, I came to realize that truncated icosahedrons are the same as truncated dodecahedrons.
This was such a delightful realization I felt the need to comment anyway.
Hmm. I'm sorry, but truncated dodecahedra are different from truncated icosahedra.
Truncated dodecahedra are made from twelve decagonal and twenty triangular faces. Truncated icosahedra are made from twenty hexagonal and twelve pentagonal faces.
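If you want to sanity-check those face counts, here is a small sketch that derives vertices and edges from the faces and tests Euler's formula V - E + F = 2. The function name is made up for illustration, and it assumes three faces meet at every vertex, which holds for both of these solids.

```python
def polyhedron_counts(faces, faces_per_vertex=3):
    """Return (V, E, F) given a mapping of polygon side count -> number of faces."""
    face_count = sum(faces.values())
    side_total = sum(sides * count for sides, count in faces.items())
    edge_count = side_total // 2  # every edge is shared by exactly two faces
    # Each face has as many corners as sides; each vertex collects
    # `faces_per_vertex` of those corners.
    vertex_count = side_total // faces_per_vertex
    return vertex_count, edge_count, face_count


truncated_dodecahedron = {10: 12, 3: 20}  # twelve decagons, twenty triangles
truncated_icosahedron = {6: 20, 5: 12}    # twenty hexagons, twelve pentagons

for name, faces in [("truncated dodecahedron", truncated_dodecahedron),
                    ("truncated icosahedron", truncated_icosahedron)]:
    v, e, f = polyhedron_counts(faces)
    print(name, "V, E, F =", (v, e, f), "Euler:", v - e + f)
```

Both solids end up with the same totals for vertices, edges, and faces, which may be part of why they are easy to mix up even though the faces themselves are different polygons.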
Flowering plants (angiosperms) appeared during the Cretaceous before dinosaurs got wiped out, and there is fossil evidence of insects pollinating non-flowering plants (gymnosperms) like ferns and conifers even earlier than that: https://repository.si.edu/server/api/core/bitstreams/152b12d...
It makes it easier to compare with other papers. If two different papers apply different methods to different models and get different results, how do you know which method is better?
Once you have identified the best method and want to productize it, it would of course make sense to apply it on top of the best model, but if you're just doing research, you can skip that expensive last step.