Stacking Up AMD MI200 versus Nvidia A100 Compute Engines (nextplatform.com)
77 points by rbanffy on Dec 8, 2021 | 38 comments


I find it interesting they’ve focused so much silicon on improving FP64 - in my mind that means they’re targeting physics simulations and other more traditional HPC workloads more than deep learning. I think that’s a smart thing on their part, because Nvidia really has a chokehold on the deep learning field right now with the A100 and CUDA/CUDNN software stack. I also find it interesting (and a good sign) that their recent supercomputer deals include 100 million for software development for ROCm/HIP.

However, considering how robust CUDA is compared to ROCm, I feel like all Nvidia would need to do to take the HPC market back completely would be to get close to AMD’s current FP64 performance. I don’t think anyone would buy AMD if prices were comparable and FP64 performance was anywhere in the ballpark. It’ll be very interesting to see Nvidia’s new cards, hopefully next year.


> I find it interesting they’ve focused so much silicon on improving FP64 - in my mind that means they’re targeting physics simulations and other more traditional HPC workloads more than deep learning.

I develop scientific simulation software, and I can't tell you how happy I am about it. Because while doing high precision work, GPUs fall flat fast.

I also work at an HPC center, and there are mountains of FP64-dependent applications running on CPUs. Moving them to GPUs will bring a lot of improvements in a lot of disciplines. It's not uncommon to let things run for a week on multiple nodes for meaningful results.


If I could in any way afford it I'd buy one of these just for doing (hobby) n-body simulations and fractal rendering. I bought a Radeon VII specifically for this and it died a month or so after I got it, with no replacement possible because they were EOL :( So badly want a replacement, but with GPU prices as they are now, there's just no way to justify buying a 2nd hand one.

Right now I think the best you can do for FP64/$ is AVX-512 CPUs and Radeon GPUs.


Not sure I really trust AMD not to trip over their own feet when it comes to software unfortunately.

Hats off to them if they manage to really land a blow on NV here.

Also potentially keep an eye on Intel a few years down the line? Since they can actually do software in my experience (and, get this AMD, document the software!)


The supercomputer market is still there. Intel Sapphire Rapids owes its existence to the next US nuke design (to be made on the Aurora supercomputer).


I don't think so. The new ISA that is in the SPR is mostly about deep learning: it supports int8 and bfloat16 (https://fuse.wikichip.org/news/3600/the-x86-advanced-matrix-...). You can emulate higher precision using bfloat16 (https://arxiv.org/abs/1904.06376), but I have not seen this used in the wild.


You know about surrogate models being run in simulations (physics-informed neural networks and all that stuff)? The US is investing a lot in this. Maybe that explains why bfloat16 exists in SPR.


Nvidia can’t improve FP64 performance to near parity without sacrificing a lot in the process, which would make them more vulnerable on the deep learning side of things.


?? Build two separate product lines.


With which engineers, managers, etc? Intel has long faced the same issue: they want their best R&D on x86 CPUs, so they had major problems expanding into other product lines.


4.9x faster FP64, but only 1.2x faster FP16. FP64 is totally irrelevant for machine learning. A100 has been available on AWS for over a year already, while MI200 won't be available at all until next year. Nvidia's A100 successor should be out next year too. And Nvidia's software stack puts AMD's to shame. AMD needs to do a lot better if they want to start being competitive in ML.


Machine learning isn't the only useful thing to run on "GPU"s (these can't do graphics workloads anymore; they're basically just vector processors).

AMD isn't attempting to compete on f16 perf. They're competing where Nvidia's perf is abysmal: f64.

Having programmed AMDGPUs at a lower level than HIP/ROCm, they are actually much better than Nvidia (and in fact I'm able to do cool things like PCIe large BAR/p2p even on RDNA1 GPUs) in terms of flexibility. The HIP/CUDA API (never mind OpenCL) doesn't do them justice, CUDA's enormous ecosystem and vendor lock-in notwithstanding.


Just curious, what lower level options are there? Inline assembly in OpenCL?


I never touched it myself, but AMD once exposed the "HSA" interface, which is the AMD GPU execution engine.

ROCm / HIP and OpenCL are built on top of that HSA level. I don't think any docs exist for it, but you can see a ton of references to HSA stuff if you browse the ROCm source code.

--------

It sounds like the parent post discusses details about how AMD GPUs "pick" the next kernel to run. I've been told that the AMD GPU is very advanced at this, but no adequate interface has ever been exposed to the programmer.


Correct; ROCm/HIP/OpenCL are all built on HSA APIs, with a few AMD-specific extensions.

AMDGPU's command processor is exposed to you (the CP is what invokes "kernels" GPU side): create signals (essentially just an atomic u64 in host memory, with a few extra bells and whistles to support interrupts) and use those in one of the barrier packet types in an HSA device queue. With these (with one sad caveat that work packets can't have deps themselves :( ) you can enqueue most computation graphs and the CP will handle waiting for signals without any CPU involvement. Plus, GPU kernels can also concurrently write to this queue (though you can't create signals GPU side...)
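To make the signal/barrier idea concrete, here is a toy model in plain Python. It is not the HSA API: `Signal`, `kernel_packet`, `barrier_packet` and `run` are made-up names, and the round-robin loop only stands in for whatever the command processor really does. The one property it tries to show is that a barrier packet holds its queue until every dependency signal reads 0, while other queues keep making progress.

```python
from collections import deque

class Signal:
    """Toy stand-in for a signal: a counter that a finished kernel decrements."""
    def __init__(self, value=1):
        self.value = value

def kernel_packet(name, completion):
    # Hypothetical packet layout, for illustration only.
    return {"kind": "kernel", "name": name, "completion": completion}

def barrier_packet(deps):
    return {"kind": "barrier", "deps": deps}

def run(queues):
    """Round-robin over the queues, playing the role of the command processor."""
    order = []
    pending = [deque(q) for q in queues]
    while any(pending):
        progressed = False
        for q in pending:
            if not q:
                continue
            head = q[0]
            if head["kind"] == "barrier":
                # A barrier only retires once every dependency signal is 0.
                if any(s.value != 0 for s in head["deps"]):
                    continue
                q.popleft()
            else:
                q.popleft()
                order.append(head["name"])
                head["completion"].value -= 1  # kernel finished: signal it
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: a barrier waits on a signal nobody completes")
    return order

# C may only start after both A and B have completed.
a_done, b_done, c_done = Signal(), Signal(), Signal()
queue0 = [barrier_packet([a_done, b_done]), kernel_packet("C", c_done)]
queue1 = [kernel_packet("A", a_done), kernel_packet("B", b_done)]
order = run([queue0, queue1])
print(order)
```

Note that the dependency lives in a separate barrier packet placed in front of C, which mirrors the caveat above that work packets cannot carry dependencies themselves.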

AMDGPUs are like shared memory machines, which I think is really cool.


> you can enqueue most computation graphs

Can those command graphs loop?

I know this interface is undocumented... but I've had an idea for a GPU-language akin to Java or Lisp memory-management. The gist is that kernel_X() can execute, but may fail in any new() or malloc() command.

In such a case, I'd want the compute-graph to loop: while (kernel_X fails due to out-of-memory){ garbage_collect(); try kernel_X() again}.
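Roughly what I have in mind, as a host-side Python sketch. Everything here is hypothetical: `Heap`, `OutOfMemory` and `run_with_gc` are made-up names, the "kernel" is an ordinary function, and the collector just subtracts a byte count instead of tracing anything.

```python
class OutOfMemory(Exception):
    """Raised by the toy heap when an allocation does not fit."""

class Heap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.garbage = 0  # bytes still allocated but no longer reachable

    def alloc(self, nbytes):
        if self.used + nbytes > self.capacity:
            raise OutOfMemory
        self.used += nbytes

    def garbage_collect(self):
        freed = self.garbage
        self.used -= freed
        self.garbage = 0
        return freed

def run_with_gc(kernel, heap):
    """Retry `kernel` after each collection; give up if a collection frees nothing."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return kernel(heap), attempts
        except OutOfMemory:
            if heap.garbage_collect() == 0:
                raise  # collecting did not help, so retrying would loop forever

def kernel_x(heap):
    heap.alloc(40)
    return "ok"

heap = Heap(capacity=100)
heap.used, heap.garbage = 90, 60  # mostly dead objects left by earlier kernels
result, attempts = run_with_gc(kernel_x, heap)
print(result, attempts, heap.used)
```

This sketch restarts the kernel from scratch, which only works because `kernel_x` has no side effects before its allocation. A kernel that fails halfway through would need some way to roll back or resume.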

-------

Not that I have the time to experiment with something like this, but I guess I've been curious to know if that sort of thing can even work.


> Can those command graphs loop?

Not directly (barrier packets wait for 0 only, plus the queue packets aren't preserved), but kernels can write to any dispatch queue themselves, so you can get the same effect at the end of your "loop body".

> I know this interface is undocumented... but I've had an idea for a GPU-language akin to Java or Lisp memory-management. The gist is that kernel_X() can execute, but may fail in any new() or malloc() command.

Memory allocation isn't special and allocators can be layered: you can allocate memory ahead of time and then just run the allocation algorithms GPU side. I wrote a Rust framework which cross compiles code/MIR on demand; you can in theory have a Rust allocator and use it to allocate GPU/CPU memory from either GPU/CPU. The only part the GPU can't do (directly) is invoke syscalls, which you can probably guess is the part needed to allocate virtual memory from the OS.

But as long as your allocator has enough spare virtual memory, it shouldn't need to do a syscall. And if you /really/ needed the ability GPU side, technically with signals you can actually just ask the CPU to allocate the virtual memory on the GPU's behalf and have the GPU spin until the allocation is "complete". Or with compiler support: automatically make the workgroup/kernel async and resume execution by enqueuing another kernel, but that sort of thing is kinda hard :).
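A minimal sketch of the "allocate ahead of time, then run the allocator yourself" idea, written in Python rather than anything GPU-specific. `Arena` is a made-up name and the `bytearray` merely stands in for memory reserved up front; on a real GPU the bump of `top` would have to be an atomic add, which this single-threaded sketch does not model.

```python
class Arena:
    """Bump allocator over a buffer that is reserved once, up front."""
    def __init__(self, size):
        self.buffer = bytearray(size)  # stand-in for memory reserved ahead of time
        self.top = 0

    def alloc(self, nbytes, align=8):
        start = (self.top + align - 1) // align * align  # round up to the alignment
        end = start + nbytes
        if end > len(self.buffer):
            # Out of reserved space: this is the point where a syscall
            # (and therefore a round trip to the host) would be needed.
            return None
        self.top = end
        return memoryview(self.buffer)[start:end]

arena = Arena(64)
a = arena.alloc(10)   # occupies bytes 0..10
b = arena.alloc(16)   # aligned up to byte 16, occupies 16..32
c = arena.alloc(40)   # would end at byte 72, so it fails
print(len(a), len(b), arena.top, c)
```

No allocation in this sketch touches the OS after construction, which is the property that matters for running it on a device that cannot make syscalls.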

Btw, the Rust framework is here: https://github.com/geobacter-rs/geobacter. I mostly work on it in my spare time, a scarce resource these days, so I admit it has some scuff.

> In such a case, I'd want the compute-graph to loop: while (kernel_X fails due to out-of-memory){ garbage_collect(); try kernel_X() again}.

Pretty much. Even garbage collection can (theoretically lol) happen on the GPU.


> Pretty much. Even garbage collection can (theoretically lol) happen on the GPU.

Oh, that's the plan. Semispace collection is very clearly a problem that can be solved in parallel: https://en.wikipedia.org/wiki/Cheney%27s_algorithm

That's just a breadth-first traversal over the fromspace. That's like... GPU programming 101 level material there. It's obviously parallel.

That's why Java/Lisp is the model I'm using, because they use semispace malloc / semispace garbage collection. 100% GPU-side malloc / garbage collection.

Nothing that I'm working on for real, mostly just theory-craft. But fun to think about in my spare time. I would expect that semispace garbage collection and allocation of memory would be very efficient on GPUs, and serve as the basis of some higher-level abstraction.
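For reference, a sequential Python sketch of Cheney-style copying. The heap layout is an assumption made for illustration: objects are lists of fields, and a field is either a plain value or a `Ref` (a made-up wrapper around an index). The to-space doubles as the breadth-first queue, with `scan` chasing the end of the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    index: int  # position of an object within a space

def cheney_collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    Returns (to_space, new_roots); unreachable objects are simply not copied.
    """
    to_space = []
    forwarding = {}  # from-space index -> to-space index

    def forward(ref):
        if ref.index not in forwarding:
            forwarding[ref.index] = len(to_space)
            # Shallow copy; the fields still point into from-space for now.
            to_space.append(list(from_space[ref.index]))
        return Ref(forwarding[ref.index])

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):  # objects from `scan` to the end are the BFS queue
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = forward(field)  # copy the target if needed, then repoint
        scan += 1
    return to_space, new_roots

from_space = [
    ["a", Ref(2)],     # 0: root
    ["dead", Ref(0)],  # 1: unreachable
    ["b", Ref(0)],     # 2: points back at the root (a cycle)
    ["dead2"],         # 3: unreachable
]
to_space, new_roots = cheney_collect(from_space, [Ref(0)])
print(to_space, new_roots)
```

A parallel version would presumably process the whole frontier between `scan` and the end of to-space at once, with the forwarding step done atomically. That part is not shown here and is where the real difficulty would be.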

-------

The "while loop" would need a custom compiler to emit the trampoline / continuation, so that the kernel knows how to "restart" itself in cases where the malloc() fails and garbage collection was run.

Kernels exiting serves as the innate synchronization point, the "synchronized stop" in the stop-the-world garbage collection schemes.

If I could write a routine that saves off every "malloc" as a possible "continuation" point (possibly saving that information in a queue-data structure or stack-data structure of some kind), then it probably would work.


I use LLVM directly. The LLVM AMDGPU target machine is supported by AMD and they use it internally in HIP/OpenCL.

I don't think OpenCL should be used going forward; it's not really platform independent. And SPIR-V... kinda sucks tbh. Plus, where's my single source stuff (a la CUDA)?


That's a smart move given GPUs may now lose a lot of market as the "AI" thing is rapidly losing steam, and SLIDE is getting better and better.


More on SLIDE for those curious: https://twimlai.com/slide-the-algorithmic-alternative-that-o...

It's a CPU-based deep learning training algorithm that beats GPUs for some tasks.


I've not seen anything convincing that SLIDE can approximate convolutional layers at all without almost doing a full convolution, at which point it is not competitive.

Can you link me to a paper about recent (2021?) work or improvements with SLIDE?


The current work

https://proceedings.mlsys.org/paper/2021/hash/3636638817772e...

seems mostly about tuning the original idea instead of expanding its scope. But it's still a neat idea. I guess it could be possible to adapt many of the approximations used in the SLIDE idea to GPUs too though...


I thought SLIDE was optimized for classification with mind-bogglingly sparse data. I.e. "choose the best label for this image out of those 1 million choices".


They also have a company called ThirdAI to commercialize SLIDE


FP64 is very relevant for the kind of physics simulations that supercomputers are often built for. However the software stack is as well, and there I agree - it’s AMD’s main weak point.


It’s not irrelevant for machine learning in general, e.g. Stan (Bayesian/MCMC) is fp64 only, so its OpenCL offloading is more effective where fp64 is better supported.

It’s less relevant for the kinds of deep neural net work enabled by a vendor whose software stack is designed around 32 and lower bit types, perhaps in an attempt to ensure their flops numbers are highest?


If you mean integrating the hardware stack into existing open-source software stacks, then yes, Nvidia has the clear advantage. What Nvidia has accomplished is not trivial but it’s not an impossible feat.


Isn't the training done at FP32 and quantized afterwards?

Not that this would make this GPU more appealing. The FP32 performance isn't great either.

The memory bandwidth seems pretty good though.


There is a way to train with mixed fp16/fp32 precision.
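A small Python illustration of why such schemes usually keep a higher-precision copy of the weights. It is a sketch under assumptions, not any framework's implementation: `to_fp16` rounds through the standard library's half-precision `struct` format, `train` is a made-up loop with a constant update, and a Python float (fp64) stands in for the fp32 master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def train(steps, update, keep_master):
    master = 1.0             # stand-in for a higher-precision master weight
    weight16 = to_fp16(1.0)  # the copy used for the fast fp16 math
    for _ in range(steps):
        if keep_master:
            master += update              # accumulate in high precision
            weight16 = to_fp16(master)    # refresh the fp16 copy
        else:
            weight16 = to_fp16(weight16 + update)  # accumulate in fp16 only
    return weight16

fp16_only = train(1000, 1e-4, keep_master=False)
with_master = train(1000, 1e-4, keep_master=True)
print(fp16_only, with_master)
```

Near 1.0 the spacing between fp16 values is about 0.001, so an update of 0.0001 rounds away every single step in the fp16-only run and the weight never moves, while the run with a master copy ends up close to 1.1.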


I wonder how they manage to keep the FP64 units busy. Seems this is an HPC product, but many HPC apps are memory bound. So to improve FP64 perf by 4x one might need to improve DRAM bandwidth by 8-16x. Otherwise the units would only be stalled waiting for memory.

But it seems they did not improve bandwidth by much?


I don't know anything about the details here, but with usual linear algebra stuff, bandwidth depends on the size of the kernel that fits into the local memory inside whatever IC you use for your floating-point computation.

E.g. matrix multiplication of n×n square matrices has a computational cost of n³ but a bandwidth cost of n². Usually a big m×m matrix is split into many blocks of n×n matrices (with m = k×n). If an n×n matrix fits into the local store of your CPU (cache or registers), then the bandwidth cost for the m×m matrix product is k³×n×n = m×m×m/n, so the bigger the block size 'n' that you can process inside the CPU, the less bandwidth you need.
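The counting argument can be checked with a small pure-Python sketch. `blocked_matmul` is a made-up helper, and the traffic model is an assumption: every block product is charged for loading one n×n block of each input, so the count comes out as 2×m³/n, the same scaling as above with a constant factor of 2.

```python
def blocked_matmul(a, b, n):
    """Multiply two m x m matrices in n x n blocks and count the words
    that would be loaded from main memory under a simple traffic model."""
    m = len(a)
    assert m % n == 0, "block size must divide the matrix size"
    c = [[0.0] * m for _ in range(m)]
    words_loaded = 0
    for i0 in range(0, m, n):
        for j0 in range(0, m, n):
            for k0 in range(0, m, n):
                # One block product: bring an n x n block of A and one of B
                # into the local store (assumed traffic model).
                words_loaded += 2 * n * n
                for i in range(i0, i0 + n):
                    for j in range(j0, j0 + n):
                        acc = c[i][j]
                        for k in range(k0, k0 + n):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c, words_loaded

m = 8
a = [[i * m + j for j in range(m)] for i in range(m)]
identity = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
for n in (1, 2, 4, 8):
    _, words = blocked_matmul(a, identity, n)
    print(n, words)  # traffic halves each time the block size doubles
```

The arithmetic is the same m³ multiply-adds in every case; only the number of words fetched changes with the block size.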

edit: formatting


> I wonder how they manage to keep the FP64 units busy

They don’t. See https://www.amd.com/en/graphics/server-accelerators-benchmar....

The MI250X, despite being dual big dies, doesn’t do especially well.


I disagree. The website you linked to shows speed-ups on MI250X between 1.6x and 3x higher than A100. The theoretical memory bandwidth speed-up between MI250X and A100 is only 1.6X (3.2 TB/s vs 2.0 TB/s). Thus, I'd say they are seeing the advantage of higher FP64 compute in those applications.
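A back-of-the-envelope roofline sketch of that argument, in Python. The only inputs are the two ratios quoted in this thread (4.9x FP64 compute, 1.6x memory bandwidth); the baseline is normalised to peak = 1 and bandwidth = 1, so `intensity` is measured relative to the baseline's ridge point. `attainable` and `speedup` are made-up helper names, and the model ignores caches, the dual-die layout and everything else that makes real applications messy.

```python
def attainable(peak, bandwidth, intensity):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak, bandwidth * intensity)

def speedup(intensity, compute_ratio=4.9, bandwidth_ratio=1.6):
    """Speed-up over a baseline normalised to peak = 1 and bandwidth = 1."""
    new = attainable(compute_ratio, bandwidth_ratio, intensity)
    old = attainable(1.0, 1.0, intensity)
    return new / old

for intensity in (0.5, 1.0, 2.0, 10.0):
    print(intensity, round(speedup(intensity), 2))
```

Under these assumptions a kernel that is memory bound on both parts gains only the 1.6x bandwidth ratio, a fully compute-bound one gains the full 4.9x, and anything in between lands somewhere between the two, which is at least consistent with measured speed-ups in the 1.6x to 3x range.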


Makes sense. Comparing nodes with 2x or 4x MI250X vs 4x or 8x A100-80 it doesn't really seem that there is any speed up at all for memory bound apps.


> Seems this is an HPC product, but many HPC apps are memory bound.

The point of a supercomputer is to throw so much compute at a problem, that everything else is the bottleneck.

If an HPC app is memory-bound, then the GPU / Supercomputer was successful at its job. So many HPC apps are memory bound because... well... turns out our machines are actually quite good.

In any case, MI200 has 1.6x the bandwidth of the A100. So if you have a massively-parallel use-case that is memory bound, the MI200 line should have an advantage.

-------

The main issue IMO is that the MI200's 1.6x bandwidth is really 80% bandwidth applied over two dies, connected with an incredible amount of "infinity fabric" links to share the data. I have to imagine that the A100's larger design wins in some cases over the MI200's chiplet design.


I agree with you, which is why I don't really understand what the point of improving FP64 perf by 4x is, if that is not the bottleneck for many apps.

Per node, a 4x MI250X node has more or less the same BW as a DGX-A100 (8x A100). It has 2x more FP64 compute, but for most science and engineering apps, which are memory bound, 2x more FP64 compute does not make these apps any faster.


Anyone can make microkernels perform well on this but real applications need libraries. The roc libraries are a pain in the ass at the moment, the docs are terrible and look like they were written as an afterthought. This stuff really matters.



