There are definitely several ways to ask the question, but IMHO the best answer to an unqualified question is to frame it in terms of what state needs to be stored in a thread context. To that end, the registers boil down to:
* 16 general-purpose registers.
* 16 or 32 vector registers.
* 7 vector mask registers
* 8 x87 floating-point registers, with 8 MMX registers aliased. (To separate or not separate x87/mmx is definitely a challenging question)
* Relevant MPX registers (I don't know this ISA extension very well, so I can't count these registers accurately)
These are the registers that I would expect to be able to poke at in a debugger or inspect/modify via something like ucontext_t, and they're going to be found in whatever kernel abstraction you use to save not-currently-running thread information.
That is indeed a perfectly reasonable approach to the answer. Registers do hold state, and saving and restoring them is the burden of program code (user, compiler, lib, OS).
Contrast this with a hypothetical cpu that has only one register "base" and allows to address 32 words after the address at that base register. That is, things like arithmetic instructions would have 5 bits to address operands, which would be interpreted as base+8*n. To make things even more interesting, this architecture's instruction pointer lives at base+0.
Such an architecture would have one register under your metric (as only one register needs to be saved/restored to context switch an entire "register file").
However, implementations (microarchitecture) could actually shadow that memory range into hardware registers, and page in/out the whole register bank upon writes of the base register (effectively performing a hardware assisted context switch; hello TSS).
However, since each instruction in this hypothetical ISA must have enough space in the encoding to address these operands, for all intents and purposes this architecture would have 32 registers.
Decoding instructions, addressing operands, dealing with the consequences of code density (icache misses), ... are all far more frequent events than context switches.
Hence I do agree with TFA that operand encoding should be the default metric to count registers. And this also includes sub/overlapping registers, if they are independently addressed.
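To make the thought experiment concrete, here is a minimal sketch of that hypothetical machine in Python. The class and method names are invented for illustration, and the 64-bit word size is an assumption inferred from the base+8*n addressing. It shows both counts at once: 32 operand slots visible to the encoding, but only one value to swap on a context switch.

```python
# Sketch of the hypothetical one-register machine described above.
# Assumptions (not from any real ISA): 64-bit words, little-endian storage.
WORD_BYTES = 8      # hence the base + 8*n addressing
OPERAND_BITS = 5    # a 5-bit operand field gives 32 addressable slots


class BaseRegisterMachine:
    def __init__(self, memory_bytes=4096):
        self.memory = bytearray(memory_bytes)
        self.base = 0  # the only state a context switch has to swap

    def slot_address(self, n):
        """Memory address of operand slot n (slot 0 is the instruction pointer)."""
        if not 0 <= n < (1 << OPERAND_BITS):
            raise ValueError("operand field is only 5 bits wide")
        return self.base + WORD_BYTES * n

    def read(self, n):
        addr = self.slot_address(n)
        return int.from_bytes(self.memory[addr:addr + WORD_BYTES], "little")

    def write(self, n, value):
        addr = self.slot_address(n)
        data = (value % (1 << 64)).to_bytes(WORD_BYTES, "little")
        self.memory[addr:addr + WORD_BYTES] = data

    def context_switch(self, new_base):
        """Swap the whole 'register file' by changing one register."""
        old_base, self.base = self.base, new_base
        return old_base
```

Writing slot 3, switching to another base, and switching back leaves slot 3 intact, which is the "one register of state, 32 registers of encoding" distinction in miniature.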
> Contrast this with a hypothetical cpu that has only one register "base" and allows to address 32 words after the address at that base register.
This, er, wasn't really hypothetical. The TMS9900, the CPU used in the TI-99/4A, had three hardware registers: a program counter, a status word, and what was called a Workspace Pointer (WP). General purpose "registers" lived in RAM, and were referenced by an offset off the value in the WP. Subroutine calls were initiated by saving the PC and changing the WP to a fresh new register context before branching.
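A rough model of the workspace-pointer scheme, as a sketch only. The 16 workspace registers of 16 bits each follow the TMS9900 design; the detail that the old WP, PC and status word land in R13, R14 and R15 of the new workspace is my recollection of the BLWP instruction and should be treated as an assumption. Real BLWP also fetches the new WP and PC from a two-word vector in memory, which is simplified here to plain arguments.

```python
# Sketch of TMS9900-style workspace registers living in RAM.
# Assumption: BLWP saves old WP/PC/ST into R13/R14/R15 of the new workspace.
REG_BYTES = 2  # each workspace register is a 16-bit word


def register_address(wp, n):
    """RAM address of workspace register Rn for a given workspace pointer."""
    if not 0 <= n <= 15:
        raise ValueError("only R0..R15 exist")
    return wp + REG_BYTES * n


class Tms9900Sketch:
    def __init__(self):
        self.ram = {}   # sparse RAM: byte address -> 16-bit word
        self.pc = 0     # the three real hardware registers
        self.st = 0
        self.wp = 0

    def get(self, n):
        return self.ram.get(register_address(self.wp, n), 0)

    def set(self, n, value):
        self.ram[register_address(self.wp, n)] = value & 0xFFFF

    def blwp(self, new_wp, new_pc):
        """Call with a fresh register context (simplified BLWP)."""
        old_wp, old_pc, old_st = self.wp, self.pc, self.st
        self.wp, self.pc = new_wp, new_pc
        self.set(13, old_wp)
        self.set(14, old_pc)
        self.set(15, old_st)

    def rtwp(self):
        """Return to the caller's workspace (simplified RTWP)."""
        old_wp, old_pc, old_st = self.get(13), self.get(14), self.get(15)
        self.wp, self.pc, self.st = old_wp, old_pc, old_st
```

The caller's R0 to R15 are never copied anywhere; they simply stay where they were in RAM while WP points elsewhere.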
Weird architectures definitely make the question a lot more difficult to answer. My gut reaction would be to say that your hypothetical ISA has 33 registers, and that the register file is memory-mapped to a specified region of virtual memory. That's partially because of the way that you're going to worry about how cache coherence will work out, but also because I suspect the mcontext_t or OS-equivalent interface will also define its structure layout as such.
The broader point, though, is in deciding whether or not to include registers like CR0 and DR0. The principle I'm using here is that registers that are not expected to be saved/restored on task switches should be excluded. Registers that are per-process (i.e., page tables in general, or segment descriptors on x86) or per-CPU (most MSRs) are thus excluded by this criterion.
FSBASE/GSBASE are extremely borderline--I wouldn't complain if they were or if they weren't excluded from a list of registers. These act as a mixture of user-visible registers (even if accessible only via syscalls until very recently) and segment descriptor information. They're not in Linux's userspace-visible mcontext_t struct, but they are in the kernel's equivalent to mcontext_t.
The goal of the hypothetical ISA in my example was merely to tease apart two distinct aspects of the register file cardinality. Both aspects (state and encoding) exist in real world architectures but apparently it's easy to confuse them.
> I won’t count microarchitectural implementation details, like shadow registers.
I really think this article would have been more interesting if he had discussed the microarchitectural registers. Those are important to understand for optimization even if they're not directly visible.
Also discussing MSRs but not special information tables like the page table and VMCS is a somewhat odd distinction. While they're probably stored differently, they are somewhat similar in how they are used.
Also, isn't the TLB like a kind of set of registers? The TLB entries are very frequently accessed. How about store buffers and the like?
Yep, what really happens under the hood is dynamic register allocation at runtime. I wonder how important is static register allocation by the compiler in that scenario. In theory, even if the compiler uses the same register over and over in sequential instructions, the renamer should be smart enough to detect the false data dependencies and allocate registers from individual instructions to different locations in the register file.
Are there any x86 profiling tools which give any metrics about the real utilisation of the register file?
The number of architectural registers is still relevant for register allocation because, of course, overlapping and independent code sequences cannot share the same architectural register name. This is not very important for integer workloads, but it is still relevant for FP, where optimal scheduling requires having multiple computations in flight at the same time. In some cases 16 FP registers are not enough, and Intel had to add 16 more FP registers with AVX512.
Oh: to clarify, you mean that the compiler could just use stack slots for everything, but some instructions are only allowed to operate on architectural registers, right? If you have to execute a lot of those instructions, the number of architectural registers can be the bottleneck in performance?
Yes, consider vector floating point fused multiply-add (FMA). On a typical AVX implementation like Skylake, this instruction has a latency of 4 cycles and a throughput of 2 instructions per cycle. To avoid stalls, you'd need 4 * 2 = 8 instructions to run independently, and 8 architectural registers to simply store the results. You could store the results onto the stack and reuse the same architectural registers, but usually you want to use the values immediately in the next loop iteration (eg. matrix multiply) so this would be expensive. You probably want a few more architectural registers (at the very least 2, up to 16) to hold the inputs as well.
The reason it is less relevant for integer computations is that integer ops have normally lower latency and tend to have shorter loop carried dependency chains.
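The FMA arithmetic above is just latency times throughput. A small sketch, with function names made up for illustration and the 4-cycle, 2-per-cycle figures taken from the comment rather than measured:

```python
# Worked version of the FMA register-pressure estimate from the comment above.
def independent_ops_needed(latency_cycles, ops_per_cycle):
    """Independent operations that must be in flight to keep the units busy."""
    return latency_cycles * ops_per_cycle


def register_budget(latency_cycles, ops_per_cycle, input_registers):
    """Accumulator registers plus whatever is needed to hold the inputs."""
    return independent_ops_needed(latency_cycles, ops_per_cycle) + input_registers


accumulators = independent_ops_needed(4, 2)   # 4 cycles * 2 per cycle
lean_budget = register_budget(4, 2, 2)        # "at the very least 2" inputs
wide_budget = register_budget(4, 2, 16)       # "up to 16" inputs
print(accumulators, lean_budget, wide_budget)
```

With 16 input registers the total is 24, which no longer fits in 16 architectural vector registers. That is the situation where the extra 16 registers of AVX-512 help.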
I don't know of any profiling tools, but I've always relied on llvm-mca[0] -register-file-stats to show me what the expected register file usage is on an ISA
Not just in theory. The way modern register renaming works is that every single write to a register name always allocates a new physical register. It's not even possible for there to be false dependencies because of reusing the same name over and over.
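A toy renamer illustrates the point. This is a sketch under simplifying assumptions: physical registers are never freed (real hardware reclaims them at retirement), and the class and its methods are invented for illustration rather than modeled on any particular core.

```python
# Toy register renamer: every write to an architectural name gets a fresh
# physical register, so reusing a name creates no false dependency.
class Renamer:
    def __init__(self, num_physical=16):
        self.free = list(range(num_physical))
        self.mapping = {}  # architectural name -> current physical register

    def _allocate(self):
        if not self.free:
            raise RuntimeError("no free physical registers (a real core would stall)")
        return self.free.pop(0)

    def _lookup(self, name):
        # A source that was never written gets a register holding its initial value.
        if name not in self.mapping:
            self.mapping[name] = self._allocate()
        return self.mapping[name]

    def rename(self, dest, sources=()):
        """Return (physical dest, physical sources) for one instruction."""
        source_regs = [self._lookup(s) for s in sources]  # read old mappings first
        dest_reg = self._allocate()                       # then allocate a new dest
        self.mapping[dest] = dest_reg
        return dest_reg, source_regs
```

Two back-to-back writes to `rax` land in different physical registers, and a reader placed between them still sees the first one.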
From the programmer's point of view, only the working registers are really important. Other registers are just OS-related (MSRs); some are really important, but their internal representation may be totally different.
Like a mode switch bit in a CR register. So MSRs are just the interface. And MSR access can be "slow", so no synchronization or optimization is required.
But the idea that more registers make a better architecture is a totally bad assumption. See the dead body of Itanium (128 general-purpose 64-bit integer registers, 128 floating-point registers, etc.).
With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time.
There are cases when you are better using just the GPRs, rather than the SIMD registers. (Linux kernel does not use FPU or SIMD registers)
Also, SIMD usage may slow down the clock, like AVX in x86_64. So you may trust your compiler for vectorization, but it may do more harm than good.
IMHO what killed itanium wasn't too many registers, and not even compiler difficulties — it was an attempt to have a working x86 emulation.
So, instead of a weird but very fast CPU, it ended up being not very fast both in x86 and native modes, while still being weird. (The makers of the Cell CPU did not compromise, went full weird, and had a winner of sorts.)
More GPRs visible in the ISA also means more bits needed to encode instructions. If instruction length and encoding were not an issue, I bet we would have seen memory-to-memory ISAs where no GPRs exist, only instructions referencing memory locations. The dynamic register file would then be just a level below L1 cache, or even completely removed.
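The encoding cost is easy to quantify: each register operand needs enough bits to name any register. A minimal sketch, where the helper names are made up and the three-operand format is an assumption chosen for illustration:

```python
# Bits of instruction encoding spent on naming registers.
def operand_field_bits(num_registers):
    """Smallest field width that can name every register."""
    return (num_registers - 1).bit_length()


def register_bits_per_instruction(num_registers, operands=3):
    """Encoding bits used by register fields in an instruction with `operands` operands."""
    return operands * operand_field_bits(num_registers)


for count in (16, 32, 128):
    print(count, operand_field_bits(count), register_bits_per_instruction(count))
```

Going from 16 to 128 registers takes a three-operand instruction from 12 to 21 bits of register fields, before any opcode or immediate bits are counted.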
> With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time.
Sparc chips got around that by having sliding windows of registers: instead of having to push all the registers to the stack you just moved the window.
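A sketch of how overlapping windows behave. It assumes the classic SPARC layout of 8 in, 8 local and 8 out registers per window, with the caller's outs becoming the callee's ins. Window overflow and underflow traps, which spill to the stack when the windows run out, are deliberately left out, and the class is illustrative only.

```python
# Sketch of SPARC-style sliding register windows (no overflow/underflow traps).
class RegisterWindows:
    def __init__(self, num_windows=8):
        self.num_windows = num_windows
        self.file = [0] * (16 * num_windows)  # circular physical register file
        self.cwp = 0                          # current window pointer

    def _index(self, kind, n):
        if not 0 <= n < 8:
            raise ValueError("each group has 8 registers")
        offset = {"out": 0, "local": 8, "in": 16}[kind]
        return (self.cwp * 16 + offset + n) % len(self.file)

    def get(self, kind, n):
        return self.file[self._index(kind, n)]

    def set(self, kind, n, value):
        self.file[self._index(kind, n)] = value

    def save(self):
        """Enter a callee: slide the window instead of pushing registers."""
        self.cwp = (self.cwp - 1) % self.num_windows

    def restore(self):
        """Return to the caller's window."""
        self.cwp = (self.cwp + 1) % self.num_windows
```

An argument written to `out 0` by the caller is readable as `in 0` after `save()`, with no memory traffic at all.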
They share the underlying storage (i.e. they are aliased) but they are independently addressed, and thus they consume instruction encoding space.
Not saying that it's not interesting to know how much actual storage the register file offers; just highlighting that TFA focuses on the instruction encoding angle of the question, which is also important.
CPU architectures are masterpieces of tradeoffs.
Put in too many registers and your instruction stream is not dense enough, and you cannot keep your CPU busy due to stalls in the fetch phase. Also, context switches become expensive (there are solutions to that, though).
Put in too few registers and you have to spill registers to memory too often, and thus also consume precious instruction stream space.
> I will count sub-registers (e.g., EAX for RAX) as distinct registers. My justification: they have different instruction encodings, and both Intel and AMD optimize/pessimize particular sub-register use patterns in their microcode.
I disagree strongly with that characterisation. Just no.
I think it's pointless to debate a methodology without a purpose. "How many registers does an x86-64 CPU have?" is interesting (to 58 voters so far) but too general to be useful for any particular purpose. Consider a couple alternate questions brought up in this thread:
* How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
* How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)
Even these one might argue aren't directly useful; when considering context switching, one could dig down further into how much of the context switching time is attributable to saving the registers, validate that with experiments across architectures, etc.
> * How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
This is something that has been bothering me for some time now - actually since the mid-80's: why not implement multiple contexts as an index into a large register file? This way, a context switch would take the time it takes to write to the `task-id` register. It will impact latencies, but would the impact of having, say, 8 contexts not be smaller than having to hit L1 or L2 for the same data?
True, but at two per core (4 or 8 in more enlightened architectures), it's very meh.
I would assume that, instead, a modern CPU tags decoded instructions in the reorder buffer with the virtual core number (and register set) it should be applied to. This way, the parallelism would be much easier to exploit.
The emulator has to handle, and the testing infrastructure has to have tests for, each subregister and how it affects its parent register. You have to prevent and detect design-level oversights such as "AH is not the LSB of AX/EAX/RAX and my design only accounts for LS-whatever subregisters". Your point that they can't be distinct holds, but if they were distinct they would have less impact on the engineering than they will have as subregisters.
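For a sense of what an emulator has to get right, here is a minimal sketch of the RAX family. The class is invented for illustration; the behavior it models is the architectural rule that 8-bit and 16-bit writes preserve the remaining bits of RAX while 32-bit writes zero the upper half.

```python
# Sketch of x86-64 sub-register writes and their effect on the parent register.
MASK64 = (1 << 64) - 1


class Rax:
    def __init__(self, value=0):
        self.rax = value & MASK64

    def write_eax(self, value):
        # 32-bit writes zero-extend into the full 64-bit register.
        self.rax = value & 0xFFFFFFFF

    def write_ax(self, value):
        # 16-bit writes leave bits 16..63 untouched.
        self.rax = (self.rax & ~0xFFFF & MASK64) | (value & 0xFFFF)

    def write_al(self, value):
        # Low-byte writes leave bits 8..63 untouched.
        self.rax = (self.rax & ~0xFF & MASK64) | (value & 0xFF)

    def write_ah(self, value):
        # AH is bits 8..15, not the least significant byte.
        self.rax = (self.rax & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)

    def read_ah(self):
        return (self.rax >> 8) & 0xFF
```

Each of the four writes needs its own test case, which is the sense in which sub-registers count as separate work items even though they add no new storage.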
Does that mean you would count AL, AH, but not AX (as it can be split into AL/AH and doesn't "bring new bits"), and then again would count EAX and RAX?
However if I were doing emulation, I would have to count it because it is a register case that has to be addressed. Just like the D register has to be dealt with on a 6809.
My own personal way to resolve this has always been to determine whether or not a given register specification that can be addressed somehow brings new information to the table, information not contained in any other register specification.
Fact is, CPU designers do all kinds of crazy things with registers. They overlap, they may be indirect, like not directly addressable, but still there as consideration for the programmer.
Think about something like the REP instruction found on some CPUs. There's a little circuit that keeps track of a count, and some rules, and that count may be a register that may or may not be directly addressable any other way.
My general take on this article is, "wow, that's a lot of registers!"
We can all quibble about what the quantity of a lot is, and it's all good fun, and I don't think it means anything really.
I'd argue that "translation" is an implementation detail of emulation. Your translated app still thinks it is x86-64 (witnessed by running "uname -a" in a Rosetta terminal.)
That's just the difference between a JIT and not - something like QEMU in the other direction doesn't "run" ARM code either, but in the end it really doesn't matter and is only a minor pedantry.
It makes sense if you're counting register _encodings_ (that is, how much ISA encoding space the registers use). But yeah, a more useful count would consider the sub-registers as part of the main register (and the same for other ISAs like 64-bit ARM, which does have a 32-bit view of its 64-bit general purpose registers), and would not consider registers outside each core (like the MTRR registers).
It depends on if you’re looking at this from an implementation in hardware vs software I’d say. The article specifically mentions Rosetta 2 as the context so I’m guessing the enumeration is more important if you intend to understand all the things Rosetta 2 has to implement.
Author here: I don't think this is a good necessary condition for what makes something a register. For consideration:
* Both x86 and other ISAs have registers that can't be stored to at all, like `k0` for the constant opmask and a whole bunch of read-only MSRs. But they can be read from wholly independently and as discrete registers.
* There are lots of cases where registers can be programmed to clobber other registers, particularly in the performance counter and PAT MSRs.
Sub-registers are definitely a stretch from the above, since they explicitly share bits in the x86 model. But then again, even the x86 model exposed to assembly programmers is a lie: the underlying microcode dynamically renames a large arena of anonymous registers at runtime, and subregisters like AL and AH have been separated in the microcode (to avoid some cases of partial register stalls) for over a decade.
MSRs are called "registers" sure but you can't actually perform any operations on them other than a load or a store, and you can't use them to stash random data outside of main memory because they affect the operation of the processor. `k0` (or e.g. MIPS's `r0`) are even worse, since you can't even write to them.
That's not what most people care about when they talk about how many registers a processor has.
> That's not what most people care about when they talk about how many registers a processor has.
I think most people do consider the instruction pointer and status word to be registers, despite also violating the constraints you specified.
Most people probably don't think about MSRs at all, and so maybe just aren't interested in a count of them. But I'm interested in counting the different pieces of on-CPU state that would be necessary to faithfully model an entire x86 core, and both Intel and AMD refer to those bits of state as "registers."
This analysis does have some limited value for characterizing the complexity required in the processor's front end to decode instructions. Register renaming means it has somewhat less relevance to the difficulty a compiler faces with register allocation, and basically no relevance to the actual size of a processor's physical register file(s).
They are the same architectural register, but due to register renaming, they may not reside in the same physical transistors. That’s probably why the author supersets them.
Funny to think that once x86 was the platform that had too few general-purpose registers, so people sacrificed the frame pointer register in their highly-optimized assembly routines...
I still remember benchmarking the various optimization options in GCC and the only one that consistently and significantly improved performance on real code was -fomit-frame-pointer.
TIL about an entire Intel microprocessor subsystem, MPX, that was added and then deemed useless before I learned about it. It is both less secure and slower than software solutions. What a failure in processor design.
How about removing obsolete stuff from x86 CPUs to make the platform perform better? If someone needs to execute old programs/OSes, they can use emulators for that...
> As of November 2020, all supercomputers on TOP500 are 64-bit, mostly based on CPUs using the x86-64 instruction set architecture (of which 459 are Intel EMT64-based and 22 are AMD AMD64-based. The few exceptions are all based on RISC architectures). Thirteen supercomputers, including the no 2. and no. 3 are based on the Power ISA used by IBM POWER microprocessors, three on Fujitsu-designed SPARC64 chips. One computer uses another non-US design, the Japanese PEZY-SC (based on the British ARM[8]) as an accelerator paired with Intel's Xeon.
There are non-x86 architectures in the TOP500, including ones which have less cruft than x86, but the x86 chips keep on being used in some of the fastest machines on the planet. My hypothesis is that x86 cruft doesn't really matter, and you'd need to go to a cruft level that was orders-of-magnitude worse for the ISA choice to dominate performance.
It really doesn't matter. The biggest benefit of ARM is the fixed-length instructions, and only Apple is actually taking advantage of this by decoding 8 instructions at once. The big question is whether decoding that many instructions is actually a benefit. It's entirely possible that branch prediction and other factors are greater bottlenecks that have to be tackled first to take advantage of the faster instruction decoding.
Intel's Pentium processors (the original ones) were doing pretty badly because they made the pipeline too deep at the expense of other things.
I think the original Pentium had the pretty much canonical 5 stage pipeline. It did pretty well. Its successor, the Pentium Pro, an OoO design, was deeper but also did amazingly well.
You are probably thinking of the Pentium 4 which was designed as a speed demon with a very deep pipeline and failed to reach its target frequency.
Anandtech's article indicates they have an out-of-order buffer of around 630 entries (Zen 3 is only 256 entries). The M1 has 7 integer math ports and 4 FP/SIMD math ports, plus several load/store/branch ports. It seems like they could completely saturate those decoders given the right code.
Fair points, but I think we can be 100% certain that Apple has modelled this and made their architectural decisions based on this modelling - especially as they are now on the 10th or so iteration of their designs.
That and that their CPUs are designed to run one OS and apps are developed against one set of libraries. This frees them to tune the hardware to the needs of the software much more than any other manufacturer can do (a PC needs to run Word and Autocad equally well)
They have certainly optimised against some key aspects of their software (eg Rosetta and reference counting) but that is not at the expense of other software. The M1 Arm CPUs are just very fast general purpose CPUs.
It's probably that x86 has a better cost/performance ratio than the others. If a job that would require, say, 200 POWERs or 250 SPARC64s can be done with 300 Xeons that cost half as much per socket as a POWER, x86 will still be the better choice. This could be for many reasons - from the intrinsic performance of the CPU to the quality of the code generated by the compiler and/or architectural fitness for the task at hand.
Also, take into account the CPUs are not always the more expensive part of the compute node - GPUs, HBM, lots of DDR4, and fast networking gear are also pretty expensive and will be more or less constant as you change CPU architectures.
And while counting architectural registers is something, in reality modern out-of-order processors do something called "register renaming" so that more registers can be used on-the-fly as it dynamically creates a data flow dependency graph. Yes, inside each processor.
There have been attempts to move the register renaming out of the processor and into the program through VLIW architectures such as the Itanium, but it failed because (1) they require “sufficiently smart compilers” that weren’t available in the day, and (2) putting it in the silicon allows new architecture revisions to benefit older programs (putting it in code means it can’t benefit from newer register renaming algorithms without a recompile).
Also, if it’s in the silicon and there’s a bug in the algorithm, a microcode update can fix it for everyone. If the bug was in the compiler, you’d need to recompile everything to fix it.
The only successful VLIW architectures run JIT compilers on an existing ISA. Transmeta did this for x86, but Intel took them to court, which gave Intel time to release superior chips (through superior manufacturing). The other example is Nvidia's Project Denver.
A very debatable level of "successful" on that. It shipped in a product, yes, but it wasn't very good either. It ran ARM code, but not very well, and was a horrible nightmare to work on. The JIT'd aspect completely breaks profilers. ie, what do you mean this simple field access took 20ms?? Oh, because the CPU wasn't running my code, it just silently went out to lunch to JIT some random shit, cool, thanks. There's then also the questionable security design of a globally read/write/executable chunk of memory where security is "enforced" by the JIT which is a complex bit of microcode, totally nothing can go wrong there...
Its only noteworthy "feature" was that it was able to ship ARMv8 support before ARM had a proper ARMv8 CPU design. Time to market for a new ISA was fast, but that's about it.
Project Denver lives on in NVIDIA's Carmel cores shipping in Tegra 194 (Xavier) chips today. Though they seem to be giving up on it, as they'll be using Cortex "Hercules" A78 cores in the successor named Orin.
Essentially every device that does not use a x86 cpu did that. There's no point in waiting an entire development cycle to get a new smaller x86-alike when you could solder an ARM chip or similar to the board today.
The vast majority of CPUs shipped are not x86 compatible.
Statista claims 23.5 billion microcontrollers are shipped annually.
I know Microchip (the PIC people) made press releases roughly annually as they shipped another billion flash microcontrollers. Google found the one from 2011 when they shipped their ten billionth PIC chip.
I find it difficult to get x86 sales figures. Intel gross revenue is high because they have their fingers in everything. AMD financial statements claim about $2B/quarter total revenue, so if you figure the average shipped price of a AMD cpu is $200 and they made all their revenue off CPUs, that would be 40 million CPUs shipped per year, which seems both ridiculously high AND about a 25th the quantity of microchip PICs shipped.
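The back-of-the-envelope numbers above check out. A short sketch using only the figures quoted in the comment (the $2B per quarter revenue, the assumed $200 average price, and roughly a billion PICs per year); the function name is made up:

```python
# Reproduces the rough shipment estimate from the comment above.
def units_per_year(revenue_per_quarter, average_price):
    """Upper bound on units shipped if all revenue came from one product."""
    return revenue_per_quarter * 4 // average_price


amd_cpus = units_per_year(2_000_000_000, 200)   # dollars per quarter, dollars per CPU
pic_chips_per_year = 1_000_000_000              # "another billion" roughly annually
ratio = pic_chips_per_year // amd_cpus
print(amd_cpus, ratio)
```

That gives 40 million CPUs per year as a generous upper bound, about one twenty-fifth of the PIC volume.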
One way to look at the number of ARM CPUs shipped is the licensing / holding company has made enough licensing fees to pay for about "a hundred and fifty billion" ARM chips in its lifetime.
You mean remove backwards compatibility and ruin the whole reason people use x86? Not to mention that when AMD64 (“long mode”) was introduced, a big amount of cruft was disabled in the new mode (but is still available in 16-bit (“real mode”) and 32-bit (“protected mode”) for compatibility).
There is also "32 bit realmode", which is not mentioned in the official documentation but simply a combination of existing states. Ditto for real mode paging and the like --- which finds more applications as emulator acid-tests than anything else.
I thought the 80286 had 32-bit protected mode, but it was badly implemented (the only way to get back to real mode was a reboot), so they fixed it with the 80386. Unless, are you referring to “unreal mode”?
No, most people forget about it (I had to be reminded by the above comment), but the 80286 did have a 16-bit protected mode. Quoting Wikipedia (https://en.wikipedia.org/wiki/Protected_mode): "[...] Acceptance was additionally hampered by the fact that the 286 only allowed memory access in 16 bit segments via each of four segment registers, meaning only 4*2^16 bytes, equivalent to 256 kilobytes, could be accessed at a time. [...]"
I am still upset after all this time, at what AMD recklessly did --- out of what might be the same misguided notion as the GP comment, they made instructions like LAHF invalid and removed much of segmentation, only to be forced to put much of it back later because people were actually using them. I suspect Intel was actually working on a more consistent extension to 64 bits too, but AMD beat them to it.
If you don’t need backwards compatibility (such as on servers), there’s not much reason to go x86. For that reason, ARM servers exist. IIRC, AWS and Azure have some.
What obsolete stuff could you remove? If you want to actually meaningfully cut out a large amount of space on the processor, you'll have to cut into the actual instruction space and remove instructions that take up space in your execution units.
Removing the 16-bit and 32-bit modes doesn't actually remove any instructions from the platform (save the binary-coded decimal instructions)--you're largely saving only a few bits of decoder table entries at best. Furthermore, processors reset into 16-bit mode on startup for compatibility reasons, so killing 16-bit and 32-bit mode would introduce major compatibility headaches.
ISA extensions can be more easily removed since there's already a CPUID bit that tells operating systems and applications whether or not they are used. The MPX extension for bounds checking is now regarded as a mistake, and Intel has already confirmed that they are removing it from future processor generations. The TSX extension for transactional memory is apparently on the hit list because of Spectre, and was removed from some processor generations.
The only significant processor execution unit space that is truly obsolete that I can think of is the x87 floating-point execution unit logic, with the concomitant MMX execution unit logic--SSE is just strictly better for everything here, except if you're trying to actually get the 80-bit precision. But the existence of 80-bit floating point in the 64-bit ABI (i.e., long double) means you'd have a hard ABI break that would potentially break software even written today, and the pain of breaking that ABI is probably not worth whatever savings you get out of it.
> Furthermore, processors reset into 16-bit mode on startup for compatibility reasons, so killing 16-bit and 32-bit mode would introduce major compatibility headaches.
Didn't the switch to UEFI effectively reduce the scope of this problem to only apply to motherboard firmware? Operating systems no longer need 16-bit code to boot.
There is a lot of x86 software out there. Even games switched to x64 binaries relatively recently. That alone means that you would want to emulate with the performance of, let's say, an i5-2500 if you want a decent framerate, which would be quite challenging given that modern CPUs are 80% faster at best.
Once you remove a single opcode, it's not really x86 any more, and you need emulation at OS level. But then, once you've done that, why not remove more instructions? Why not remove all of them and start again on a much more power-efficient platform? Why not remove the memory model?
One of the brilliant ideas in the M1 is a flag for whether the current process insists on the slower but more comprehensible x86 memory ordering.
Several CPU features require other ones--for example, AVX requires the XSAVE feature. Additionally, x86-64 implicitly requires several features (most notably SSE2), and the glibc folks have been working on a proposed ABI for x86-64 that groups the feature sets into levels--roughly base (≤SSE2), ≤SSE4.2, ≤AVX2, current skylake-server (a clutch of AVX-512 features), with the BMI and FMA features scattered in there somewhere.
That said, the MMX instructions in particular are so problematic to use (and SSE ubiquitous and strictly better) that I suspect you could introduce a processor that lacks MMX support and break almost nobody, certainly far fewer people than removing x87. I don't know if there is an implicit or explicit actual dependency on MMX anywhere.
We can't really demand compilers to create code that's compatible with all the ancient variations of a currently popular architecture. It's still good manners to include a function that exits with a clear error message about a required architecture feature that's missing.
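One way such a startup check could look, as a Linux-only sketch: it parses the `flags` line of `/proc/cpuinfo` and exits with a readable message if something is missing. The function names are invented for illustration; native programs would more likely query CPUID directly or use compiler builtins.

```python
# Sketch of a "fail early with a clear message" CPU feature check (Linux, x86).
import sys


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags listed on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text):
    """Sorted list of required features that the CPU does not advertise."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


def require_features(required, cpuinfo_path="/proc/cpuinfo"):
    """Exit with an explanatory message instead of dying on an illegal instruction."""
    with open(cpuinfo_path) as handle:
        missing = missing_features(required, handle.read())
    if missing:
        sys.exit("error: this program requires CPU features that are not present: "
                 + ", ".join(missing))
```

Calling `require_features(["sse2", "avx2"])` at the top of `main` would turn a cryptic crash into a one-line explanation.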
I had similar issues with PPC software that just blindly assumed I had Altivec on my G3. It wasn't fun.
It doesn't matter. The only meaningful change would be to go with fixed-length instructions and maybe get rid of the memory ordering guarantees. If you do that, you might as well switch to any other ISA that has these properties. But since Intel has an inferior manufacturing process, the new ISA would still suffer from inferior CPUs. However, AMD has shown that x86_64 is still viable, so the benefits of switching are minuscule anyway.
AMD64 has 32 registers: 16 scalar ones, and 16 vector ones. Maybe 34 if you count RIP and flags.
From programmer’s perspective, the rest of them are either different ways to access these 32-34, or very exotic and rarely used.
P.S. Modern compilers don’t normally emit x87 nor MMX instructions, because they are often slower compared to SSE (all AMD64 processors are required to support at least SSE1 and SSE2). For instance, FSQRT on Zen2 has 22 cycles both latency and throughput, VSQRTPD has 20 cycles latency and 8.5 cycles throughput (lower is better), despite taking 4 square roots in one shot. I think it’s safe to assume x87 and MMX instructions are only left for backward compatibility with old 32-bit binaries, when writing new code they can be ignored.
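Normalizing those figures per square root makes the gap clearer. A small sketch that uses only the cycle counts quoted above (which I have not re-measured); the helper name is made up:

```python
# Cycles per square root, using the Zen 2 figures quoted in the comment above.
def cycles_per_result(reciprocal_throughput_cycles, results_per_instruction):
    """Sustained cost of one result when instructions are issued back to back."""
    return reciprocal_throughput_cycles / results_per_instruction


x87_cost = cycles_per_result(22, 1)    # FSQRT: one root per instruction
avx_cost = cycles_per_result(8.5, 4)   # VSQRTPD: four doubles per instruction
speedup = x87_cost / avx_cost
print(x87_cost, avx_cost, round(speedup, 1))
```

On these numbers the vector instruction sustains one root roughly every 2.1 cycles, about ten times the x87 rate.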
* 16 general-purpose registers.
* 16 or 32 vector registers.
* 7 vector mask registers
* 8 x87 floating-point registers, with 8 MMX registers aliased. (To separate or not separate x87/mmx is definitely a challenging question)
* 3 normal status registers: RIP, RFLAGS, MXCSR
* 6 x87 status registers: FSW, FCW, FTW, FDP, FIP, FOP
* 6 segment registers
* 6 debug registers
* Relevant MPX registers (I don't know this ISA extension very well, so I can't count these registers accurately)
These are the registers that I would expect to be able to poke at in a debugger or inspect/modify via something like ucontext_t, and they're going to be found in whatever kernel abstraction you use to save not-currently-running thread information.