Good one. I was blessed to have the opportunity to watch that one live, on stream. It's always stuck with me and, now that I think about it, is the best resource I know of that puts those ideas into words/writing.
For a tracing profiler, you want to know which thread a function call or return was made by. LTTng has kernel modules which it can use to trace context switches, and then a per-CPU trace buffer is fine, provided that you get cheap atomic writes, which rseq can be used for.
Funtrace, on the other hand, does support ftrace for tracing context switches (https://yosefk.com/blog/profiling-in-production-with-functio...), but it doesn't require ftrace for tracing function calls made by your threads. (The problem with both ftrace and LTTng's kernel modules being, of course, permissions; which shouldn't be a problem in any reasonable situation by my standard of "reasonable", but many find themselves in unreasonable situations permissions-wise, sadly.) So I don't think funtrace can use rseq, though I might be missing something.
Presumably you could store the TID in every event, or otherwise check whether the TID has changed since the last time it was logged and push a (timestamp, TID) pair if so. Reading TID should be cheap.
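The "push a (timestamp, TID) pair only on change" idea could be sketched like this. Everything here is invented for illustration (the struct and field names are not from funtrace or LTTng), and `syscall(SYS_gettid)` is used as the simple, portable TID read; a cheaper per-thread value could be substituted if syscall cost matters:

```cpp
#include <cstdint>
#include <vector>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical sketch: an event buffer that emits a (timestamp, TID)
// marker record only when the writing thread differs from the last
// writer, instead of storing the TID in every event.
struct TraceBuffer {
    struct Record { uint64_t timestamp; uint64_t value; bool is_tid_marker; };
    std::vector<Record> records;
    uint64_t last_tid = 0;  // 0 is never a valid Linux TID

    void log_event(uint64_t timestamp, uint64_t payload) {
        uint64_t tid = (uint64_t)syscall(SYS_gettid);
        if (tid != last_tid) {
            records.push_back({timestamp, tid, true});  // (timestamp, TID) pair
            last_tid = tid;
        }
        records.push_back({timestamp, payload, false});
    }
};
```

With a per-CPU buffer, the marker would be written whenever a different thread lands on that CPU, which is exactly the context-switch information a decoder needs.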
In what sense would reading the TID be cheap? You would need either a syscall (not cheap) or thread-local storage (the subject of TFA), so reading the TID isn't really a way to avoid TLS.
It looks like the TID is stored directly in the pthread struct pointed to by %fs itself, at a fixed offset which you can somewhat-hackily compile into your code. [0]
In the process of investigating this, I also realized that there's a ton of other unique-per-thread pointers accessible from that structure, most notably including the value of %fs itself (which is unfortunately unobservable afaict), the address of the TCB or TLS structures, the stack guard value, etc. Since the goal is just to have a quickly-readable unique-per-thread value, any of those should work.
Windows looks similar, but I haven't investigated as deeply.
Hi, love the article. You mention in the article that a hardware mechanism for tracing should exist; have you investigated the intel_pt (Processor Trace) extension? I believe it uses hardware buffers and supports timestamping & cycle counters (at somewhat coarser than per-instruction granularity, sadly, although it might issue forced timestamps on branches, not sure).
You can also use the PTWRITE instruction to attach metadata to the stream which seems very powerful.
Intel PT is indeed useful (although very, very slow compared to regular sampling profiling), but there are hardly any CPUs that actually implement PTWRITE. (IIRC there's some obscure Xeon or something?)
Typically you get a cycle count every six branches, give or take.
Regarding the slowdown: magic-trace reports 2-10% slowdowns, which IMO is actually fine even for production (unless this adds up to a huge dollar cost, and for most people it won't), since in return you become able to debug the rare slowdowns which are the worst part of your user experience.
However, the hardware feature that I propose (https://yosefk.com/blog/profiling-in-production-with-functio...) would likely have lower overhead, since it relies on software issuing tracing instructions, e.g. at each function entry & exit (rather than on every control-flow change), and it could be variously selective (e.g. exclude short functions without loops, and/or you could configure the hardware to ignore short calls. BTW maybe you can do that with Intel Processor Trace, too, I'm just not really familiar with it.)
Like I said there, I'm frankly shocked that all CPUs haven't raced to implement similar features, that magic-trace, which is built on top of Intel Processor Trace, isn't used more widely, and that developers aren't insisting on running under magic-trace in production and on deploying on Intel servers for that purpose.
The extension I propose is much simpler, and seems similar to what PTWRITE would do if it were the only feature in Intel Processor Trace. I have a lot of experience in chip architecture, and I believe that every CPU maker can support this easily, much more easily than full feature parity with Intel Processor Trace. I hope they will!
I wonder if this is a general issue relating to memory ordering or out-of-order execution, or whether this can be implemented more efficiently in a different extension.
Thank you for the linked article! Agreed on the huge potential for using these tools in production. The community could definitely benefit (even indirectly) by pushing for this kind of instruction set more widely.
This isn't really about constructors/destructors. Expressions like the arguments of a function call with multiple arguments have always been "unsequenced" with respect to each other; the order is left for the compiler to decide. It's always been like that, going back to C (and probably other languages). If you call f(x++, x++), the result is unspecified at best: in C (and in C++ before C++17) it's actually undefined behavior, since x is modified twice without sequencing; since C++17 the arguments are indeterminately sequenced, so it's merely unspecified which values get passed to f.
I suppose the destruction of whatever the expressions constructed still happens in reverse order of construction.
But either way I might not even care. I'm aware that at the level of a single statement the evaluation order isn't fully defined, so I rarely put more than one mutating thing per expression, or I otherwise make sure that I don't care about the order; I could live with any sequence as well as with totally parallel execution.
Example: buf[i++] = 5 has two side-effecting sub-expressions, but I know it's not messing anything up. I don't care whether i gets incremented before 5 gets assigned or the other way around.
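A small way to observe this in practice: record the order in which two side-effecting arguments are evaluated. The names here (`probe`, `argument_evaluation_order`) are invented for the demo; the point is that correct code must accept either outcome:

```cpp
#include <string>

// Records the order in which f's arguments are evaluated; the language
// permits either order, so code must be correct under both.
std::string order;

int probe(char c) { order.push_back(c); return 0; }
void f(int, int) {}

std::string argument_evaluation_order() {
    order.clear();
    f(probe('a'), probe('b'));
    return order;  // "ab" or "ba", depending on the compiler
}
```

Unlike f(x++, x++), this version mutates two different objects (the string and nothing else twice), so it's well-defined; only the order is up to the compiler.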
Say I wanted to rank my own personal collection of songs by retention/engagement— are there any open source libraries or crisp descriptions of algorithms/statistical models that one could use?
Is this a 1:1 comparison? If the ARM compiler is emitting ARM binaries, there might be less backend work/optimization since it's a newer architecture. It seems like a test where two variables changed. It would be interesting to see them both cross-compile to their respective opposite archs.
Maybe not, but A) it's close, since most of the work of compiling is not microarchitecture-level optimization or code emission, and B) if you're a developer, even if some of the advantage comes from being on an architecture that's easier to emit code for... that's still a benefit you realize.
It's worth noting that cross-compiling is definitely harder in many ways: you can't always evaluate constant expressions at compile time the same way your runtime code would, and you have to jump through hoops.
If everything else is the same, that seems like a solid reason to prefer the ARM architecture even setting aside 1:1 comparisons. Isn't faster compilation and execution the whole point of a faster processor?
Could you describe what makes the Google Fibers so nice?
I'm also really curious why they require modifications to the Linux kernel. My first guess would be stronger integration with the IO model at the syscall boundary (similar to io_uring).
You know how the first time you learned about TCP sockets you made a server that spawned a new thread to handle each incoming connection? (Or maybe not; people learn differently nowadays.)
With the fibers implementation you can just do that. It doesn't kill your performance, and you don't need to go to a painful async model just for performance reasons.
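The thread-per-connection shape being described looks roughly like this. This is a generic sketch using plain `std::thread` and POSIX sockets (nothing here is from Google's fiber library; a fiber runtime would keep the same shape but make the "threads" much cheaper):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <thread>

// Create a loopback listener on a kernel-chosen port; report the port.
int make_listener(uint16_t *port_out) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    bind(fd, (sockaddr*)&addr, sizeof addr);
    listen(fd, 16);
    socklen_t len = sizeof addr;
    getsockname(fd, (sockaddr*)&addr, &len);
    *port_out = ntohs(addr.sin_port);
    return fd;
}

// Classic thread-per-connection echo loop: one thread per accepted socket,
// written as straight-line blocking code with no callbacks or async state.
void serve(int listen_fd) {
    for (;;) {
        int conn = accept(listen_fd, nullptr, nullptr);
        if (conn < 0) return;
        std::thread([conn] {
            char buf[512];
            ssize_t n;
            while ((n = read(conn, buf, sizeof buf)) > 0)
                write(conn, buf, n);  // echo back
            close(conn);
        }).detach();
    }
}
```

The appeal of fibers is precisely that this code can stay written this way while scaling to far more connections than OS threads comfortably allow.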
Pretty much. You get to pretend inside your fibers that you're actually running threads. IIRC (it's been a while) you also get a proper stack trace when something barfs, the importance of which cannot be overstated.
What are they though? Is this a library for an existing language? A runtime scheduler like the one that does goroutines in Go? If it were open sourced, how would I use it?
It’s just a library that allows easier development of C++ servers in the synchronous, thread-per-request style, similar to working in Go but a bajillion times better because it’s not in Go.
(I work at OpenAI, but on the infra side of things not on models)