I don't believe that the model was trained on so few GPUs, personally, but it al...

I don't believe that the model was trained on so few GPUs, personally, but it also doesn't matter IMO. I don't think SOTA models are moats, they seem to be more like guiding lights that others can quickly follow. The volume of research on different approaches says we're still in the early days, and it is highly likely we continue to get surprises with models and systems that make sudden, giant leaps.

Many "haters" seem to be predicting that there will be model collapse as we run out of data that isn't "slop," but I think they've got it backwards. We're in the flywheel phase now, each SOTA model makes future models better, and others catch up faster.