Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix thus far, instead of only the single token in the training example. Some back-of-the-envelope math shows that modeling the full distribution yields many more bits of information per token to learn from.
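Concretely, it's the difference between a one-hot cross-entropy target and matching the teacher's full next-token distribution. A minimal PyTorch-style sketch (illustrative only, not the actual Gemma training code; `temperature` is the usual knob from the Hinton et al. distillation paper):

    import torch.nn.functional as F

    def hard_target_loss(student_logits, target_ids):
        # Standard next-token loss: the only signal per position is
        # the single observed token from the training example.
        return F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            target_ids.view(-1),
        )

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Distillation loss: match the teacher's full distribution over
        # the vocabulary at every position, a much richer signal.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        # KL(teacher || student), scaled by t^2 as in the original
        # distillation formulation so gradient magnitudes stay comparable.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (t * t)

The hard target hands the student one sampled token per position; the soft target hands it the teacher's relative probabilities across the whole vocabulary, which is where the extra bits come from.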
Did you read that in the linked article? I couldn't find it. But perhaps it follows from the better efficiency: given the ~5x performance boost and the ability to now use 27 trillion parameters versus 1.7 trillion, one could presumably finish the same amount of work in 1/25th of the time, and bam, a reduction in power consumption. As you say, I'm skeptical the max power draw itself is 25x lower.
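To make the power-vs-energy distinction concrete (purely illustrative numbers; only the 25x speedup comes from the claim above):

    # Back-of-the-envelope: energy = power x time.
    power_kw = 100.0        # assumed sustained draw of the cluster
    baseline_hours = 25.0   # assumed job time on the old hardware
    speedup = 25.0          # the 1/25th-the-time claim above

    energy_before = power_kw * baseline_hours             # 2500 kWh
    energy_after = power_kw * baseline_hours / speedup    # 100 kWh
    print(energy_before / energy_after)                   # 25.0

So a 25x drop in energy per job is entirely consistent with the peak power draw not dropping at all; that's the distinction being glossed over.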
And he got very bored and unhappy with big-company issues. He also has the perspective, from his time at Tesla, to know how things only get worse for creativity at that stage.
It's not a good thing if true. Tech and creative folk have to find ways to stick around, or the financial folk fill the leadership and decision-making space.
It's a hard thing to manage. Tech orgs of ~20 people are just more fun than tech orgs of 200 people, which are more fun than tech orgs of 20,000 people, which... you get the picture.
You can create and encourage small teams, but then they need to coordinate somehow, and coordination and communication overhead grows much faster than headcount: with n people there are n(n-1)/2 potential communication paths (quick sketch below). Then you get all the "no silos" guys, and then it's all over...
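That's just the pairwise-channels model from The Mythical Man-Month applied to the org sizes above, nothing specific to any particular company:

    # Pairwise communication channels among n people: n * (n - 1) / 2.
    def channels(n: int) -> int:
        return n * (n - 1) // 2

    for n in (20, 200, 20_000):
        print(f"{n:>6} people -> {channels(n):>11,} channels")
    # 20 -> 190, 200 -> 19,900, 20,000 -> 199,990,000

Headcount goes up 1000x, potential channels go up about a million-fold; hence the pressure to re-silo.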
I usually agree, but I honestly believe he was set for life even before OpenAI, and he'll now care more about how exciting the work is and how much it aligns with his interests/values.