Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix thus far, instead of only the single token in the training example. Some back-of-the-envelope math shows that modeling the full distribution yields many more bits of information per token to learn from.
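Concretely, it's the difference between a one-hot cross-entropy target and matching the teacher's full next-token distribution. A minimal PyTorch-style sketch (illustrative only, not the actual Gemma training code; `temperature` is the usual knob from the Hinton et al. distillation paper):

    import torch.nn.functional as F

    def hard_target_loss(student_logits, target_ids):
        # Standard next-token loss: the only signal per position is
        # the single observed token from the training example.
        return F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            target_ids.view(-1),
        )

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Distillation loss: match the teacher's full distribution over
        # the vocabulary at every position, a much richer signal.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        # KL(teacher || student), scaled by t^2 as in the original
        # distillation formulation so gradient magnitudes stay comparable.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (t * t)

The hard target hands the student one sampled token per position; the soft target hands it the teacher's relative probabilities across the whole vocabulary, which is where the extra bits come from.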
Did you read that in the linked article? I couldn't find it. But perhaps it follows from the better efficiency: given the ~5x performance boost and the ability to now use 27 trillion parameters versus 1.7 trillion, one could presumably finish the same amount of work in 1/25th of the time, and bam, a reduction in power consumption. As you say, I'm skeptical the max power draw itself is 25x lower.
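To make the power-vs-energy distinction concrete (purely illustrative numbers; only the 25x speedup comes from the claim above):

    # Back-of-the-envelope: energy = power x time.
    power_kw = 100.0        # assumed sustained draw of the cluster
    baseline_hours = 25.0   # assumed job time on the old hardware
    speedup = 25.0          # the 1/25th-the-time claim above

    energy_before = power_kw * baseline_hours             # 2500 kWh
    energy_after = power_kw * baseline_hours / speedup    # 100 kWh
    print(energy_before / energy_after)                   # 25.0

So a 25x drop in energy per job is entirely consistent with the peak power draw not dropping at all; that's the distinction being glossed over.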
And he got very bored and unhappy with big-company issues. He also has the perspective, from his time at Tesla, to know how things only get worse for creativity at that stage.
It's not a good thing if true. Tech and creative folk have to find ways to stick around, or the financial folk fill the leadership and decision-making space.
It's a hard thing to manage. Tech orgs of ~20 people are just more fun than tech orgs of 200 people, which are more fun than tech orgs of 20,000 people, which... you get the picture.
You can create and encourage small teams, but then they need to coordinate somehow, and coordination and communication overhead grows much faster than headcount: with n people there are n(n-1)/2 potential communication paths (quick sketch below). Then you get all the "no silos" guys, and then it's all over...
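That's just the pairwise-channels model from The Mythical Man-Month applied to the org sizes above, nothing specific to any particular company:

    # Pairwise communication channels among n people: n * (n - 1) / 2.
    def channels(n: int) -> int:
        return n * (n - 1) // 2

    for n in (20, 200, 20_000):
        print(f"{n:>6} people -> {channels(n):>11,} channels")
    # 20 -> 190, 200 -> 19,900, 20,000 -> 199,990,000

Headcount goes up 1000x, potential channels go up about a million-fold; hence the pressure to re-silo.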
I usually agree, but I honestly believe he was set for life even before OpenAI, and he'll now care more about how exciting the work is and how much it aligns with his interests/values.