Unfortunately, floating-point math on GPUs (and even in CPU SIMD units) is riddled with cross-platform determinism issues. Hardware manufacturers intentionally trade determinism away to get faster math operations in general: although floating-point behavior is defined by IEEE 754, every operation still rounds its result, so you pick up a rounding error at each step.
Compiler optimizations (and remember, each GPU driver uses its own compiler behind the scenes to translate to the actual hardware architecture) can change the rounding error of each operation, and parallel execution, which differs from hardware to hardware, also changes the order in which those errors accumulate.
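To make the ordering point concrete, here is a minimal Python sketch. It is plain CPU code, not GPU code, and the input values and both helper functions are made up for illustration. It sums the same four numbers once left to right and once in a tree shape, roughly the way a parallel reduction might group them.

```python
import math

def sequential_sum(values):
    """Left-to-right accumulation, as a single thread would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Tree-style reduction, similar in shape to a parallel reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# Illustrative inputs: near 1e17 adjacent doubles are 16 apart,
# so adding 1.0 to 1e17 is rounded away entirely.
values = [1e17, 1.0, -1e17, 1.0]

seq = sequential_sum(values)   # ((1e17 + 1) - 1e17) + 1
tree = pairwise_sum(values)    # (1e17 + 1) + (-1e17 + 1)
exact = math.fsum(values)      # correctly rounded reference sum

print(seq, tree, exact)        # prints: 1.0 0.0 2.0
```

Same inputs, same IEEE 754 additions, three different answers depending only on grouping. Real workloads rarely diverge this dramatically, but the last few bits can differ for the same reason.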
Some APIs (CUDA?) let you disable all optimizations, and there are ways to get cross-platform determinism, but in general it's much, much slower if you want bit-for-bit equality across different hardware.
SPIR-V/Vulkan, for example[0], only defines an error bound in ULPs for some operations, not bit-for-bit equality.
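For anyone unfamiliar with ULP-based tolerances, here is a rough Python sketch of what "equal within N ULPs" means as opposed to bit-for-bit equality. The function names are hypothetical, it uses float64 for convenience (GPU shaders mostly deal in float32), it ignores NaN and infinity, and it measures the distance between two computed results rather than the error against the exact mathematical result that a spec bound refers to.

```python
import struct

def ordered_bits(x: float) -> int:
    """Map a float64 to an integer so that adjacent floats differ by 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set; mirror them below zero so the
    # integer ordering matches the floating-point ordering.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))

def within_ulps(a: float, b: float, max_ulps: int) -> bool:
    """True if a and b are at most max_ulps representable values apart."""
    return ulp_distance(a, b) <= max_ulps

a = 0.1 + 0.2
b = 0.3

print(a == b)                 # False: not bit-for-bit equal
print(ulp_distance(a, b))     # 1: adjacent representable values
print(within_ulps(a, b, 2))   # True: fine under a 2-ULP tolerance
```

Two implementations can both satisfy a tolerance like this and still return different bits, which is exactly why a ULP bound does not give you reproducibility.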
Reproducible results across GPUs are theoretically possible, but they would take some effort to implement and would be slower. At least the IEEE 754-compliant primitive operations (addition, multiplication, etc.) are there: https://docs.nvidia.com/cuda/floating-point/index.html
One way to get deterministic output is to use integer/fixed-point math. Quantised models already do that for matrix multiplication, but things like softmax may still be implemented with some floating-point math. It's possible to replace that too; it just takes a bit of extra work and is probably slower than using the GPU's native float ops.
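As a small illustration of why integer math sidesteps the problem, here is a Python sketch of a fixed-point dot product. The Q-format (16 fractional bits), the helper names, and the random inputs are all assumptions made for the example, not how any particular quantised model does it. Integer addition is associative, so the accumulation order cannot change the result.

```python
import random

FRAC_BITS = 16            # hypothetical format: 16 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Quantise a float to a fixed-point integer (round to nearest)."""
    return round(x * SCALE)

def fixed_dot(xs, ws):
    """Dot product on fixed-point integers; every step is exact."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w      # each product carries 2 * FRAC_BITS fractional bits
    return acc

rng = random.Random(0)
n = 1000
xs = [to_fixed(rng.uniform(-1.0, 1.0)) for _ in range(n)]
ws = [to_fixed(rng.uniform(-1.0, 1.0)) for _ in range(n)]

forward = fixed_dot(xs, ws)
backward = fixed_dot(xs[::-1], ws[::-1])

# Accumulate the same products in a random order.
order = list(range(n))
rng.shuffle(order)
shuffled = fixed_dot([xs[i] for i in order], [ws[i] for i in order])

print(forward == backward == shuffled)   # True, regardless of order
print(forward / (SCALE * SCALE))         # convert back to a float for display
```

Python integers never overflow, so this hides one real-world concern: on actual hardware the accumulator has to be wide enough (for example, wider than the quantised inputs) or overflow behavior becomes its own source of differences.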
This seems like a massive issue for actual use. Are there really no workarounds to get deterministic output?