
Have you tried the OpenAI deep research in the past week or so? It's been updated to use 5.2 https://x.com/OpenAI/status/2021299935678026168

(I work at OpenAI, but on the infra side of things not on models)


Yes: https://m.youtube.com/watch?v=xt1KNDmOYqA

Title: “Casey Muratori | Smart-Pointers, RAII, ZII? Becoming an N+2 programmer”


Good one. I was blessed to have the opportunity to watch that one live, on stream. It's always stuck with me and, now that I think about it, is the best resource I know of that puts those ideas into words/writing.


Known users:

- LTTng tracer

- tcmalloc

Curious if there are other prominent users of rseq.


For a tracing profiler, you want to know which thread a function call or return was made by. LTTng has kernel modules which it can use to trace context switches, and then a per-CPU trace buffer is fine, provided that you get cheap atomic writes which rseq can be used for.
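
For flavor, here's a minimal sketch of that fast path -- not LTTng's actual code -- assuming Linux >= 4.18 and glibc >= 2.35, which registers rseq for every thread and exports __rseq_offset:

  // Pick a per-CPU buffer by reading the current CPU from the
  // glibc-registered rseq area; no syscall involved.
  #include <cstdio>
  #include <sys/rseq.h>  // struct rseq, __rseq_offset, __rseq_size

  static rseq *rseq_area() {
      // glibc keeps each thread's struct rseq at a fixed offset from the
      // thread pointer (the %fs base on x86-64).
      return reinterpret_cast<rseq *>(
          static_cast<char *>(__builtin_thread_pointer()) + __rseq_offset);
  }

  int main() {
      if (__rseq_size == 0)
          return 1;  // kernel or libc too old; rseq not registered
      // The kernel updates cpu_id on every migration. The buffer commit
      // itself must still be an rseq critical section to be preemption-safe.
      std::printf("on cpu %u\n", rseq_area()->cpu_id);
  }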

Funtrace, on the other hand, does support ftrace for tracing context switches (https://yosefk.com/blog/profiling-in-production-with-functio...), but it doesn't require ftrace for tracing function calls made by your threads. (The problem with ftrace, as well as with LTTng's kernel modules, is of course permissions; that shouldn't be an issue in any reasonable situation by my standard of "reasonable", but many people sadly find themselves in unreasonable situations permissions-wise.) So I don't think funtrace can use rseq, though I might be missing something.


Presumably you could store the TID in every event, or otherwise check whether the TID has changed since the last time it was logged and push a (timestamp, TID) pair if so. Reading the TID should be cheap.
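
Something like this, say (a sketch with hypothetical names, not funtrace's actual API):

  #include <cstdint>
  #include <vector>

  struct Event { uint64_t timestamp; uint64_t payload; };

  class PerCpuBuffer {
      std::vector<Event> events_;
      uint32_t last_tid_ = 0;
  public:
      void log(uint64_t ts, uint32_t tid, uint64_t payload) {
          if (tid != last_tid_) {            // thread on this CPU changed:
              events_.push_back({ts, tid});  // push a (timestamp, TID) pair
              last_tid_ = tid;
          }
          events_.push_back({ts, payload});
      }
  };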


In what sense should reading the TID be cheap? You would need either a syscall (not cheap) or thread-local storage (the subject of TFA), so avoiding TLS by reading the TID is circular; it can't really work.


It looks like the TID is stored directly in the pthread struct pointed to by %fs itself, at a fixed offset which you can somewhat-hackily compile into your code. [0]

In the process of investigating this, I also realized that there's a ton of other unique-per-thread pointers accessible from that structure, most notably the value of %fs itself (not directly readable without the FSGSBASE extension, though glibc stores a self-pointer at %fs:0), the address of the TCB or TLS structures, the stack guard value, etc. Since the goal is just to have a quickly-readable unique-per-thread value, any of those should work.
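
Two illustrative reads, x86-64 glibc only (the TID offset below is a made-up placeholder -- it's private ABI, so check your own glibc's struct pthread layout before relying on anything like this):

  #include <cstdint>
  #include <cstdio>

  // The TCB stores a self-pointer at %fs:0 in the x86-64 glibc ABI, so a
  // unique-per-thread value is a single load away:
  static void *thread_unique() {
      void *p;
      asm("movq %%fs:0, %0" : "=r"(p));
      return p;
  }

  // Reading the TID itself at a fixed offset; 0x2d0 is a placeholder and
  // can move between glibc versions.
  static uint32_t fast_tid() {
      uint32_t tid;
      asm("movl %%fs:%c1, %0" : "=r"(tid) : "i"(0x2d0));
      return tid;
  }

  int main() { std::printf("%p %u\n", thread_unique(), fast_tid()); }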

Windows looks similar, but I haven't investigated as deeply.

[0] https://github.com/andikleen/glibc/blob/b0399147730d478ae451...

[1] https://github.com/andikleen/glibc/blob/b0399147730d478ae451...


Hi, love the article. You mention in the article that a hardware mechanism for tracing should exist -- have you investigated the intel_pt (Processor Trace) extension? I believe this uses hardware buffers and supports timestamping & cycle counters (at somewhat coarser than per-instruction granularity, sadly, although it might issue forced timestamps on branches, not sure).

You can also use the PTWRITE instruction to attach metadata to the stream which seems very powerful.
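
For instance, a sketch (assumes a toolchain invoked with -mptwrite and one of the rare CPUs implementing the instruction):

  #include <cstdint>
  #include <immintrin.h>

  // _ptwrite64 emits a PTW packet carrying an 8-byte payload into the
  // Intel PT stream, where a decoder (perf, magic-trace) can correlate it
  // with the surrounding control-flow and timing packets.
  static void trace_point(uint64_t user_data) {
      _ptwrite64(user_data);
  }

  int main() {
      trace_point(42);  // e.g. a request ID or function ID
  }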

Hope we can see such an extension on AMD as well.


Intel PT is indeed useful (although very, very slow compared to regular sampling profiling), but there are hardly any CPUs that actually implement PTWRITE. (IIRC there's some obscure Xeon or something?)

Typically you get a cycle count every six branches, give or take.


Sampling profilers are indeed very low-overhead; however, they can't help debug tail latency, for which tracing profilers are indispensable:

https://yosefk.com/blog/profiling-in-production-with-functio...

https://danluu.com/perf-tracing/

Regarding the slowdown: magic-trace reports 2-10% slowdowns, which IMO is actually fine even for production (unless this adds up to a huge dollar cost, which for most people it won't), since in return you become able to debug the rare slowdowns which are the worst part of your user experience.

However, the hardware feature that I propose (https://yosefk.com/blog/profiling-in-production-with-functio...) would likely have lower overhead, since it relies on software issuing tracing instructions, eg at each function entry & exit (rather than tracing every control flow change), and it could be variously selective (eg exclude short functions without loops; and/or you could configure the hardware to ignore short calls. BTW maybe you can with Intel Processor Trace, too, I'm just not really familiar with it.)
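
A software approximation of this exists today, for what it's worth: GCC and Clang's -finstrument-functions inserts hooks at every function entry and exit -- roughly the event stream the proposed instruction would produce, just at much higher cost:

  #include <cstdio>

  // Build with: g++ -finstrument-functions trace.cc
  // The hooks must be exempt from instrumentation to avoid recursion.
  extern "C" __attribute__((no_instrument_function))
  void __cyg_profile_func_enter(void *fn, void *) {
      std::printf("enter %p\n", fn);
  }

  extern "C" __attribute__((no_instrument_function))
  void __cyg_profile_func_exit(void *fn, void *) {
      std::printf("exit  %p\n", fn);
  }

  static int work(int x) { return x * 2; }

  int main() { return work(21) == 42 ? 0 : 1; }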


I discuss Intel Processor Trace in the writeup where I propose my much simpler hardware support for tracing: https://yosefk.com/blog/profiling-in-production-with-functio...

Like I said there, I'm frankly shocked that all CPUs haven't raced to implement similar features, that magic-trace, which is built on top of Intel Processor Trace, isn't used more widely, and that developers aren't insisting on running under magic-trace in production and demanding deployment on Intel servers for that purpose.

The extension I propose is much simpler, and seems similar to what PTWRITE would do if it were the only feature in Intel Processor Trace. I have a lot of experience in chip architecture, and I believe that every CPU maker can support this easily - much more easily than full feature parity with Intel Processor Trace. I hope they will!


One concern with PTWRITE is that it is somewhat "slow," at least according to this: https://community.intel.com/t5/Processors/Intel-Processor-Tr...

I wonder if this is a general issue relating to memory ordering or out-of-order execution, or whether this can be implemented more efficiently in a different extension.

Thank you for the linked article! Agreed on the huge potential for using these tools in production. The community could definitely benefit (even indirectly) by pushing for this kind of instruction set more widely.


There's some ambiguity in argument destruction order, for example: https://stackoverflow.com/a/36992250

Similarly, the construction/destruction order for std::tuple elements is not well defined.

Granted, that's unspecified (rather than implementation-defined) behavior, which in practice is deterministic on a single compiler.
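
A small repro of the argument case (either output order is conforming, and GCC and Clang have historically disagreed):

  #include <cstdio>

  struct Probe {
      int id;
      explicit Probe(int i) : id(i) { std::printf("ctor %d\n", id); }
      ~Probe() { std::printf("dtor %d\n", id); }
  };

  static void take(Probe, Probe) {}

  int main() {
      // The construction order of the two arguments is unspecified;
      // destruction typically runs in reverse of whichever order was chosen.
      take(Probe{1}, Probe{2});
  }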


This isn't really about constructors/destructors. Expressions like function calls with multiple arguments have always been "unsequenced" with respect to each other. In other words, the order is left for the compiler to decide. It's always been like that, going back to C (and probably other languages). If you call f(x++, x++), what values get passed to f is unspecified (and before C++17, that particular call was outright undefined).
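
Concretely, a toy you can run (either output is conforming):

  #include <cstdio>

  static int f(int a, int b) { return a * 10 + b; }

  int main() {
      int x = 0;
      // C++17 made the two x++ evaluations indeterminately sequenced
      // (before C++17 this call was undefined behavior outright), so it
      // prints 1 or 10 depending on which argument is evaluated first.
      std::printf("%d\n", f(x++, x++));
  }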

I suppose the destruction of whatever the expressions constructed still happens in reverse order of construction.

But either way I might not even care. I'm aware that within a single statement evaluation order is largely unspecified, so I rarely put more than one mutating thing per expression, or otherwise I make sure that I don't care about the order -- I could live with any sequence as well as totally parallel execution.

Example: buf[i++] = 5 has 2 mutating sub-expressions, but I know it's not messing up anything. I don't care whether i gets incremented before 5 gets assigned or the other way around.


Say I wanted to rank my own personal collection of songs by retention/engagement -- are there any open source libraries or crisp descriptions of algorithms/statistical models that one could use?


The runtime is quadratic in the context size, although it seems like there is some progress on this front: https://gwern.net/note/attention


Is `(x - 1)` not a runtime cost if `x` is a runtime variable?


Not on many instruction set architectures - addressing modes often support adding/subtracting a constant for free.
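
E.g. on x86-64, the subtraction folds into the load's displacement, so it costs nothing extra:

  // Compiles to a single load, something like: mov eax, [rdi + rsi*4 - 4]
  int element_before(const int *a, long x) {
      return a[x - 1];
  }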


Is this a 1-1 comparison? If the ARM machine is compiling to ARM binaries, then there might be less work to do (fewer optimizations implemented), since it is a newer architecture. Seems like a test where two variables changed. Would be interesting to see them both cross-compile to their respective opposite archs.


Maybe not, but A) it's close-- most of the work of compiling is not microarchitecture-level optimizations or emitting code, and B) if you're a developer, even if some of the advantage is being on an architecture that it's easier to emit code for... that's still a benefit you realize.

It's worth noting that cross-compiling is definitely harder in many ways, because you can't always evaluate constant expressions at compile time the same way your runtime code would on the target, etc., and you have to jump through hoops.


As someone who knows relatively little about this, I'm very curious why this is downvoted. It seems like a rebuttal would be enlightening.


Hm, my experience was that compiling C on ARM was always super fast compared to x86, because the latter had much more work to do.


This doesn't align with my experience. Clang is about the same, but GCC often seems much slower emitting cross-ARM code.

  jar% time x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   0.97s user 0.02s system 99% cpu 0.992 total
  jar% time x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   0.93s user 0.03s system 99% cpu 0.965 total
  jar% time x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   0.94s user 0.01s system 99% cpu 0.947 total
  jar% time x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  x86_64-linux-gnu-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   0.92s user 0.04s system 99% cpu 0.955 total

  jar% time arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   1.43s user 0.03s system 99% cpu 1.458 total
  jar% time arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   1.46s user 0.03s system 99% cpu 1.486 total
  jar% time arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   1.55s user 0.04s system 99% cpu 1.587 total
  jar% time arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I ../../shared/api
  arm-linux-gnueabihf-gcc --std=c99 -O3 -c insgps14state.c -I inc -I   1.44s user 0.03s system 99% cpu 1.471 total


That’s interesting. I was not cross-compiling, so maybe the ARM system I was using was just faster.


So cross-compiling for RISC-V, POWER, or something else would be a fair comparison?


Apple has been optimizing the compiler for a decade for iOS.


If everything else is the same, that seems like a solid reason to prefer the ARM architecture even setting aside 1:1 comparisons. Isn't faster compilation and execution the whole point of a faster processor?


The assertion is that compilation might be faster because fewer optimizations run, and that the resulting code would therefore be slower.


Could you describe what makes Google's fibers so nice?

I'm also really curious why they require modifications to the Linux kernel. My first guess would be stronger integration with the IO model at the syscall boundary (similar to io_uring).

Edit: is this the talk you're referring to? https://www.youtube.com/watch?v=KXuZi9aeGTw


You know how, the first time you learned about TCP sockets, you made a server that spawned a new thread to handle each incoming connection (or maybe not; people learn differently nowadays).

With the fibers implementation you can just do that. It doesn't kill your performance, and you don't need to go to a painful async model just for performance reasons.
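
The style in question, sketched with plain std::thread (a fiber runtime keeps the same blocking code and swaps in cheap user-level threads underneath):

  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <thread>
  #include <unistd.h>

  static void handle(int fd) {
      char buf[512];
      ssize_t n;
      while ((n = read(fd, buf, sizeof buf)) > 0)
          if (write(fd, buf, n) < 0)  // trivial echo: blocking, no callbacks
              break;
      close(fd);
  }

  int main() {
      int srv = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr = {};  // INADDR_ANY
      addr.sin_family = AF_INET;
      addr.sin_port = htons(7777);
      if (bind(srv, reinterpret_cast<sockaddr *>(&addr), sizeof addr) != 0 ||
          listen(srv, 128) != 0)
          return 1;
      for (;;) {
          int fd = accept(srv, nullptr, nullptr);
          if (fd >= 0)
              std::thread(handle, fd).detach();  // one "thread" per connection
      }
  }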


Pretty much. You get to pretend inside your fibers that you're actually running threads. IIRC (it's been a while) you also get a proper stack trace when something barfs, the importance of which cannot be overstated.


What are they though? Is this a library for an existing language? A runtime scheduler like the one that does goroutines in Go? If it were open sourced, how would I use it?


It’s just a library that allows easier development of C++ servers in the synchronous, thread-per-request style, similar to working in Go but a bajillion times better because it’s not in Go.


All of the above, and more -- kernel enhancements. See the linked paper; they detail what they do on the kernel side, at least.


> or maybe not, people learn differently nowadays

If by “nowadays” you mean ~2000 when I first learned socket programming (using select!)? ;-)


Yes, that's the one. Unfortunately it doesn't show any of the API details that a developer would be exposed to.

