Embed is in C23 (thephd.dev)
483 points by aw1621107 on July 23, 2022 | hide | past | favorite | 338 comments


I represent Sweden in ISO WG14, and I voted for the inclusion of Embed into C23. It's a good feature, but it's not a necessary feature, and I think JeanHeyd is wrong in his criticism of the pace of WG14's work. I have found everyone in WG14 to be very hardworking and serious about their work.

C's main strength is its portability and simplicity. Therefore we should be very conservative and not add anything quickly. There are plenty of languages to choose from if you want a "modern" language with lots of conveniences. If you want a truly portable language, there is really only C. And when I say truly, I mean for platforms without file systems or operating systems, where bytes aren't 8 bits, that don't use ASCII or Unicode, where NULL isn't at address 0, and so on.

We are the stewards of this, and the work we put in, while large, is tiny compared to the impact we have. Any change we make needs to be addressed by every compiler maintainer. There are millions of lines of code that depend on every part of the standard. A 1% performance loss is millions of tons of CO2 released, and billions in added hardware and energy costs.

In this privileged position, we have to be very mindful of the concerns of our users, and take the time to look at every corner case in detail before adding any new features. If we add something, people will depend on its behavior, no matter how bad, and we will therefore have great difficulty fixing it in the future without breaking our users' work, so we have to get it right the first time.


> for platforms without file systems or operating systems, where bytes aren't 8 bits, that don't use ASCII or Unicode, where NULL isn't at address 0, and so on.

This seems totally misconceived to me as a basis for standardizing a language in 2022. You are optimizing for the few at the expense of the many.

I get that these strange architectures need a language. Why does it have to be C or C++? They can use a nonstandardized variant of C, but why hobble the language that is 99% used on normal hardware with misfeatures justified by truly obscure platforms?


It doesn't have to be C, but as of today there is no other option. No one is coming up with new languages with these kinds of features, so C it is. People should, but language designers today are more interested in memory safety and clever syntax than in portability.

I would like to caution you against thinking that these weird platforms are old machines from the 60s that only run in museums. For instance, many DSPs have 32-bit bytes (the smallest memory unit that can be individually addressed), so if you have a pair of new fancy noise-canceling headphones, it's not unlikely you are wearing a platform like that on your head every day.


Unusual platforms like DSPs usually have specific (usually proprietary) toolchains. Why can't those platforms implement extensions to support 32-bit bytes? Why must everyone else support them? In practice ~no C code is portable to machines with 32-bit bytes. That's okay! You don't choose a DSP to run general purpose code. You choose it to run DSP code, usually written for a specific purpose, often in assembly.


"Weird" platforms often do have their own toolchains, but they also have the ability to leverage LLVM, MISRA, and the array of common tools and analyses that exist for C. One of the reasons we got new platforms like RISC-V is that today it's possible to use existing OSS software to build a platform with a working OS, a development environment, and common basic libraries, because all this software is written in C and can be retargeted to a new platform.


doesn't LLVM only cover a smidge of the weird platforms, if that?

Like, it doesn't even cover ESP32 chips, despite attempts to get support into LLVM since 2019.

point being, they can't leverage LLVM


What is the relevance of RISC-V here? It's not weird at all. I feel like you skipped part of the argument.


Ok here’s a concrete one, ARM is experimenting with a brand new CHERI implementation that throws out a lot of “obvious” things that are supposed to be true for pointers and integers. The only way this has been able to work is that the standard is flexible enough to let C(++) work here. Rust is getting breaking changes to support the platform.


Unsurprisingly, Rust can already target Morello, but it comes at a performance penalty. Whereas C has myriad integer types for special purposes, Rust has only one pointer-sized pair: a signed and an unsigned integer the same size as a pointer (isize and usize). Since on CHERI a pointer is a different size from an address, Rust pays the price of using a 128-bit integer (big enough to store a pointer) as its array index type.

The potential change would make these types address-sized, so you can't fit a whole pointer in them on Morello where the pointer isn't just an address. On other platforms this change makes no practical difference.


The point is that new exploration of the design space only works when there’s a familiar environment to build on. The old days of each architecture being its own hermetic environment are gone.


Because C already does this, and has from the beginning. C was designed to be portable in an era where there were significant differences in fundamental CPU design decisions between platforms. C is widely used to write software for all kinds of weird platforms. Changing that would be far more work than just making a new language.


They also tend to be non-standard for a variety of reasons anyway! C bends backwards to support odd architectures but is often also insufficient (or at least the vendors cannot justify the effort to achieve full compliance and their customers don't care significantly anyway).


Perhaps Carbon is the first in a series of new low level languages that free us from the impossible tensions of C/C++ having to be all things to all (low level) programmers.

I would love a new language for implementing high level languages. I've worked on several of these projects and we use mostly unstandardized dialects of C++ and it's really not fit for purpose.


While at it, I should mention Zig.


Does zig have as a selling point that it has _more_ UB than C?


AFAICT, Zig is a halfway step between C and C++ for desktop and mobile developers.

I'm thrilled that the folks steering C are keeping the needs of embedded developers at the forefront.


> AFAICT, Zig is a halfway step between C and C++ for desktop and mobile developers.

What does this even mean? Zig very much has embedded use as a target as well, with freestanding use being a first-class citizen. The majority of the standard library works without an OS (and if you wish, you can provide your own OS-like interfaces quite easily, and use the OS-dependent parts in a freestanding environment). I've written a UEFI bootloader in Zig, and right now I'm using it on an RP2040 as well, cross compiling from an M1 Mac without the need to install any additional cross compilers.

I'd argue that it might eventually be even better for embedded than C/C++, as allocation is much more tightly controlled: by convention, only functions (or data structures) that take an allocator as an argument will allocate memory on the heap. Future versions may even restrict recursion in certain contexts to guarantee maximum stack depths.


Cool, you've targeted ARM on a common SoC and a desktop PC. Now do sh2, sh4, m68k or even xtensa with Zig with a standard build.

They chose to build Zig on top of the worst compiler backend toolchain for embedded support.


Man, it hasn’t even fucking reached 1.0 yet. The fact that the default install can target multiple architectures (including C dependencies) is nothing short of impressive.

Does GCC target any of those in its standard build? Oh. Wait. It doesn’t. It builds for only one architecture at a time and if you want anything else you’ll need to recompile the compiler.

LLVM is even intended to be optional in future, and the Zig team is well on their way to making that a reality with the work on stage2.

What backend would you suggest for embedded support?


Zig is 6 years old, and years-old issues exist in their task tracker requesting support for some of these targets.

You _cannot_ simply recompile LLVM to support the targets I listed without using unsupported, risky patches from community sources, if any such patches even exist. The multiple-architecture support in one binary is a pointless feature when installing multiple GCC builds takes only a tiny, insignificant fraction of my development machine's disk space. Never in my life have I thought "Wow, I'd really speed up development if GCC was one enormous binary holding all possible targets."

Zig's best path forward is to support targeting C as a first tier target; which they seem to be interested in doing.


> It doesn't have to be C, but as of today there is no other option

Isn’t C99 an option? Why can’t more advanced things go into newer C and people who genuinely need something more basic can use C99.


We can! Many of us still use C89. (C99 has problems, like variable-length arrays.)

The reality, however, is that you can't escape newer versions entirely. Not all code you interact with was written in the subset you want, so when your favorite OS or library starts using header files with newer features, you need to run that version of the language too.

Another less appreciated detail is that a lot of WG14 work is not about adding new features but about clarifying how existing features are meant to work. When the text is clarified, this gets back-ported to all previous versions of C in major compilers. An example of this is "provenance". This is a concept that has implicitly been part of the standard since the first ISO standard, but is only now being formalized. This means that even if you want to adhere to the C89 standard, you will find a lot of clarifications about how things should work in the C23 standard.


VLAs are optional since C11. There is no reason why a vendor can't support a modern language.


VLAs are not the only thing added to C since 1999.


They are one of the few things added that require the target platform, rather than the compiler, to do something different. (And the other big things like that, like atomics, are similarly optional.)


If it were to focus on stability, it would probably be LLVM IR. That said, there's plenty of C++ being written for these applications. And Ada.

> so if you have a pair of new fancy noise canceling headphones, then its not unlikely you are wearing a platform like that on your head everyday.

Chip shortage aside, the likelihood of these devices using obscure hardware like discrete DSPs is going down as cheaper low power architectures are becoming commoditized.


LLVM IR isn’t stable or even portable. It’s just a compiler IR, not a language.


Hence the qualifier, if it focused on stability. And IRs are languages. They look and quack like them and people treat them as such.


And pay the price, there are tons of places that are stuck with some ossified “LLVM 3.2” toolchain or similar because they build their stuff on an unstable IR.


Yea and again, I said "if". We live on the else branch of that if statement.

My point is that a lot of this sentiment is treating C as the portable backend of a compiler because there is no portable front end. It holds us back in a lot of ways from iterating on systems languages in ways that are interesting and valuable.

Ideally there would be an IR with stable textual and binary format that can be compiled into the machine code for various ISAs and support the extensions necessary by silicon manufacturers for the exotic bits. I've used exotic tool chains for weird ISAs where the exotic bits are sugar added to a GCC front end, and it always feels wrong and limiting.


> This seems totally misconceived to me as a basis for standardizing a language in 2022. You are optimizing for the few at the expense of the many.

Sure, but it's the same line of reasoning that made C relevant in the first place, and keeps it relevant today - some library your dad wrote for a PDP-whatever is still usable today on your laptop running Windows 10.

Because it's antiquated, it's also extremely easy to support, and to port to new and/or exotic platforms.


The library my dad wrote (lol) for the PDP-11 is probably full of undefined behaviour and won't work now that optimizers are using any gap in the standard to miscompile code.


> using any gap in the standard to miscompile code

For code to be miscompiled, there has to be a definition of what correctly compiling it would mean, and if there were, it would not be undefined behavior.


Instead of "miscompiled" you can read "Doesn't do what it did on the PDP-11 with the compilers of the time".


Yeah, but if that definition is constantly shifting, you cannot expect it to work with existing codebases.


Well yeah — therein lies the problem with a language with pervasive undefined behavior.


The standard doesn't do that often, but it does sometimes. E.g. realloc to null which was previously defined, and is now UB :(


We are talking about code written before the standard, so every bit of UB in the standard is in play here.

E.g., the fact that overflowing a signed int can cause the compiler to run amok would certainly be a surprise to the person who wrote code for the PDP-11.


What a useless and jaded assumption that code written in the past is bad.


The assumption being made here is "any useful C program relies on undefined behavior" which is pretty much true.


Yes and I'm sure it's doubly true of code that was written before the C standards were written.


That sounds like a strong take; do you have examples?

Compiling with -Wall -Werror is pretty much standard these days.


-Wall -Werror is mostly designed to catch dangerous but totally well-defined idioms, not UB. It doesn't warn on every signed arithmetic operation or unchecked array access, for example.


See "useful"; it may not be quite as strong as you're thinking. It may be possible to write a minimal C program without UB, but I'm thinking of larger programs, more than a few hundred lines. Common UB includes: array access out of bounds, dereferencing a null pointer, use after free, use of uninitialized variables. -Wall -Werror can catch some instances of some types of UB, and runtime libraries like UBSan can catch more. But they're not exhaustive.


I certainly didn't say it was bad. Just that it went outside the boundaries of a standard that was written 25 years later.


> PDP-whatever is still usable today on your laptop running Windows 10

No, it isn't. Go on. Go ahead and try

See it break in a million weird ways. (Or, for a start, it will have the K&R C format, which is a pain to maintain)

"If your computer doesn't have 8-bit bytes" at this day and age? It belongs in a dumpster, sorry.

(I think the only "modern" arch that does this is PIC, and even only for program data - where you're not running anything "officially" C89 or later)


When I first learned C, it was K&R, pre-ANSI with old style function parameters. It is trivial to convert to ANSI C. The truth is C has barely changed in decades.


You should take a look at plan9port - a bunch of userspace tools from Plan 9, carefully ported to Linux with few changes. Maybe that's not PDP-whatever, but it is Sun-whatever. Either way, it's code that was written decades ago for a dead architecture.


C is pretty much the only language in common use for programming microcontrollers. Microcontrollers seldom have filesystems. To break the language on systems without filesystems or terminals means to break the software of pretty much every electronics manufacturer out there.


Come to think of it, JavaScript is a language that mainly targets the browser, which also doesn't have a filesystem.


Web requests are its filesystem.


Sure, but you don’t typically run the compiler on the microcontroller! It’s the host that needs a filesystem, not the target.


Of course; I can't see a reason why #embed wouldn't be useful for microcontrollers. In fact, I imagine they're a key target market for a feature like this: resource managers are complex, and tools like bin2c have always felt like a terrible hack.

I was solely replying to the commenter who said that all reasonable modern systems have filesystems so I put one in for the embedded software developers.


It may have no filesystem, but it's extremely likely it has 8-bit bytes.


CHAR_BIT != 8 on a lot of systems: https://stackoverflow.com/q/2098149/7509065


Yeah, if the number of platforms using an obscure corner of the spec were exactly zero, we wouldn't be talking about whether the niche size is worth the hassle.

My point was that this niche doesn't cover the entirety of the embedded / IoT space.


But you don't run the compiler on a computer without a file system. How would #include work otherwise?


This feature is precisely most useful when you don't have a filesystem to read data from at runtime. Embedding the data into the binary you flash is so much simpler with this.


As the GP post comments, if you want those features there are plenty of other languages to choose from.

I don’t even like programming in C but I respect what the committee is trying to do, and yes I do sometimes write C code.


I'll flip that around: if you want to serve on a language standards committee, there are a lot of other languages to choose from. Why be on the C standards committee with the express purpose of blocking progress?


Because it's one of the very few -- almost only -- languages with this objective.


Whose objective? The people that use it or the people that sit on the standards committee?


I would say that one should be pretty cautious when baking assumptions about such a fleeting thing as hardware into such a lasting thing as a language.

C itself carries a lot of assumptions about computer architecture from the PDP-9 / PDP-11 era, and this does hold current hardware back a bit: see how well the cool nonstandard and fast Cell CPU fared.

A language standard should assume as little about the hardware as possible, while also, ideally, allowing to describe properties of the hardware somehow. C tries hard, but the problem is not easy at all.


Can you explain what aspect of C from PDP-11 was problematic for Cell?


All memory is uniform, for instance. There is one scalar data-processing unit that finishes a previous operation and then issues the next: no way to naturally describe SIMD, for instance. No way to speak about the asynchronous things that happen on a Cell CPU all the time, as far as I can judge. (I never programmed it, but I remember that people who did said they had to use assembly extensively.)

OTOH you can write stuff like `*dst++ = *src++`, and it would neatly compile into something like `movb (R1)+, (R2)+`, a single opcode on a PDP-11.


It's worse: almost all of them already use a nonstandard variant of C. The committee is bending over backwards to accommodate them, but they literally _do not care what the standard says_, so this doesn't even benefit them. Most will keep using a busted C89 toolchain with a haphazard mix of extensions no matter what the standard does.


Even if they fork the language, standard still provides the common baseline for all, which is useful.


This is purely a compiler-side feature, and those esoteric hosts usually aren't running the compiler anyway; they're cross-compilation targets, aren't they?


Well and studiously not talking to the few about their actual needs.


This reasoning has always rung mostly hollow for compiler features (#embed, typeof) rather than true language features (VLAs, closures).

Modern toolchains must exist for marginal systems. It's understandable to want to write code for a machine from 1975, or a bespoke MCU, on a modern Thinkpad. It is not necessary to support a modern compiler running on the machine from 1975 / bespoke MCU. You might as well argue against readable diagnostic messages because some system out there might not be able to print them!


I could also see this, though perhaps it's a step too far for C, applying to Unicode encoding of source files.

The 1970s mainframe this program will run on has no idea that Unicode exists. Fine. But, the compiler I'm using, which must have been written in the future after this was standardised, definitely does know that Unicode exists. So let's just agree that the program's source code is always UTF-8 and have done with it.

Jason Turner has a talk where the big reveal is, the reason the slides were all retro-looking was that they were rendered in real time on a Commodore 64. The program to do that was written in modern C++ and obviously can't be compiled on a Commodore 64 but it doesn't need to be, the C64 just needs to run the program.


This seems a step too far for me. Compatibility with existing source files which may not be trivial to migrate does also matter. (Well, except for `auto`, C23 was right to fuck with that.) At the very least you'll need flags that mean "do whatever you did before".


Sure, I don't seriously expect C to embrace that; even though I think it'd be worth the effort, I'm sure plenty of their users don't.

For auto I think the argument is that if you poke around in real software the storage specifier was basically never used because it's redundant. That's the rationale WG21 had to abolish its earlier meaning in C++ before adding type deducing auto.

As I read it, N2368 (which I think is what they took?) gives C something more similar to the type inference found in many languages today (which gives you a diagnostic if it can't infer a unique type from available information) whereas C++ got deduction which will choose a type when ambiguous, increasing the chance that a maintenance programmer misunderstands the type of the auto variable.

However it got inference from return, which I think is a misfeature (although I think I can see why they took it, to make generics nicer). With inference from return, to figure out what foo(bar)'s type is, I need to read the implementation of foo because I have to find out what the return statements look like. It's more common today to decide we should know from the function's signature.

This is somewhat mitigated by the fact that N2368 says auto won't work in extern context, so we can't just blithely say "This object file totally has a function which returns something and you should figure out what type that is" because that's clearly nonsense. You will have the source code with the return statements in it.


We took N3007 which does not have inference on return etc. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3007.htm


Ah, great. I don't write very much C any more, but the auto described in N3007 (well, the skim of N3007 I just did) feels very much like what I'd want from this feature in C and perhaps more importantly, what I'd assume auto does if I see it in a snippet of somebody else's code I'm trying to understand.


Or how to make everyone's life worse for the few weirdos that don't use an LSP/IDE. The type of an auto function return can be, and is, automatically shown, e.g. https://resources.jetbrains.com/help/img/rider/2022.1/inlay_... With this kind of non-consideration of developer comfort, it is clear C is an obsolete language.


C++ got auto because C++11 introduced types that could not be named, and this had to be inferred using auto.


> And when I say truly, I mean for platforms without file systems

Are we really talking about compiling on such platforms? And if that's the case, how would #include work but not #embed?


No, I'm mainly talking about targeting. My point is not so much about embed, but rather that almost anything you assume you know about how computers work isn't necessarily true, because C targets such a wide group of platforms. Almost always, when someone raises a question along the lines of "No platform has ever done that, right?", someone knows of a platform that has done exactly that, and it turns out to have very good reasons for doing so.

For this reason, everything is much more complicated than you first think. For me, joining WG14 has been an amazing opportunity to learn the depths of the language. C is not big, but it is incredibly deep. The answer to "Why does C not just do X?" is almost always far more complicated and thought-through than one thinks.

Everyone in WG14 who has been around for a while knows this, and therefore assumes that even the simplest addition will cause problems, even if they can't come up with a reason why.


Yeah, but then I have to side with the author: how could a compile-time-only feature which doesn't even introduce new language semantics possibly be affected by the multitude of build targets?

Unless "it's more complicated than you think" is the catchall answer to any and all proposals for new language features. In which case, how to make progress at all?

Also, I find the point about the language being "truly portable" a bit ironic, considering the whole rationale of #embed was that the use case of "embed large chunks of binary data in the executable" was completely non-portable and required adding significant complexity to the build scripts if you were targeting multiple platforms.

It's easy to make a language portable on paper if you simply declare the non-portable parts to not be your responsibility.

> Everyone in WG14 who has been around for a while knows this, and therefore assumes that even the simplest addition will cause problems, even if they can't come up with a reason why.

That's not something to be proud of.


> That's not something to be proud of.

It's learning from old mistakes.

Look at embed as an example. Look how complex it is: dealing with empty files, different ways of opening files, files without lengths, null termination... the list goes on. This is typical of a proposal for C: it starts out simple ("why can't I just embed a file into my code?") and then it gets complicated, because the world is complicated.

I worry a lot about people loading in text files and forgetting to add null termination to embeds. I would not be surprised if in a few years that produces a big headline on Hacker News about how that shot someone in the foot and how C isn't to be trusted. The details matter.


> I worry a lot about people loading in text files and forgetting to add null termination to embeds.

If you really worry about that, why did you vote in favour of this feature (as you stated earlier)?


He’s just trying to justify why it took 5 years to approve a feature that is so similar to #include.


Null termination is not to be added by embed. Embed produces a const-sized buffer of unsigned bytes, not strings; files are not strings, and files can contain \0.

And I still don't get why embed is so much better than xxd-included buffers. It's more convenient, sure, but 10x faster?


#embed is faster because it skips the parsing step: xxd generates source-code tokens, which the parser must then decode back into bytes, while #embed just reads the bytes directly.
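For illustration, here is the shape of both approaches. The first array compiles today (the five example bytes are made up for the sketch); the `#embed` form requires a C23 compiler, so it is shown in a comment:

```c
/* What `xxd -i data.bin` emits: every byte of the file becomes source
   tokens ("0x48,") that the lexer and parser must turn back into bytes.
   For a large file that is tens of millions of tokens to process. */
static const unsigned char data_bin[] = {
    0x48, 0x65, 0x6c, 0x6c, 0x6f   /* "Hello" */
};
static const unsigned int data_bin_len = sizeof data_bin;

/* The C23 form lets the implementation hand the file's bytes to the
   compiler directly, skipping tokenization of the initializer entirely:

       static const unsigned char data_bin[] = {
       #embed "data.bin"
       };
*/
```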


Multiple big-O steps faster.


That's not how big-O works.


> I worry a lot about people loading in text files and forgetting to add null termination to embeds. I would not be surprised if in a few years that provides a big headline on Hacker news, about how that shot someone in the foot and how C isn't to be trusted. The details matter.

The compiler should insert the null terminator if it's not in the embedded file.


This is another issue here. If loads of compilers start doing this, then programs start relying on it, and then it becomes a de-facto undocumented feature. That means if you move compilers/platforms you get new issues. A lot of what the C standard does is mopping up these kinds of issues.


Then require compilers to implement it in the standard. I think it's really backwards to ignore the toolchain and its ability to prevent bugs from entering software.

It's stuff like this that leaves us writing C that relies on implementation-defined behavior. Under-specification that leaves easy holes to fill means they will be filled by the compiler, and we will come to rely on them. Just like type punning.


This is the problem: things get complicated fast. If we mandate null termination, then it's impossible to have multiple embeds in a row to concatenate files, or we somehow need rules for when to add null termination and when not. These rules in turn are not going to be read by all users, so some people will just assume that embed always adds a null terminator when it doesn't, and then we are back to square one. The more we add, the more corner cases there are.


I don't see how that would prevent sequential embeds, unless you have defined the semantics of embed to be a textual include; if that's the case, then the mistake is the semantics, not the constraints on compilers or embedded files.

Edit: why is it desirable to concatenate files with #embed? Does that not seem out of scope, if not contrived?


Why assume the data should be null-terminated? It's an array with a known compile-time size. Binary data often needs to include 0 / NULL.


This is literally the premise of this comment chain. Can you see how the standard can get dragged around by various competing needs? People want to embed strings, with null termination, some just want to embed binary data. Neither is particularly “wrong”. And in this thread everyone is presenting “obvious” solutions that are one or the other.


No, at least in this case I think the feature chooses the correct path. If you want to embed a null-terminated string, then just save the string into the file as a null-terminated string, or copy it from the array into a proper string type. It looks like you may even be able to use this with a struct, which might let you add a null terminator after the array.

Though yes, I agree lots of features get bogged down trying to handle everything. But in this case, not adding it creates even more complexity. You can store strings natively in C. You can't include binary blobs in C in a platform-independent way, so you end up with hacks that explode compile times for everyone using that software.

The ability to add vendor specific attributes also allows for those use cases to evolve naturally while still solving the core problem of embedding binary data.


I, FWIW, agree with the implementation. I'm just saying that this thread has someone saying "the compiler should add null termination" in it so the choice to make is not obvious.


If you need 0 termination simply write

    const char foo[] = {
    #embed <file.txt>
        , 0
    };


I don't think adding a null terminator is useful for binary files which are not null-terminated strings, and may even have embedded 0 bytes in the middle.


sure, but if it's a string that requires it to be null terminated, there's no reason the compiler can't solve that problem


How does the compiler know if an array is a string or not?


It doesn't have to, you just add a zero byte at the end of the embedded byte sequence in the object file. It's up to the programmer to make the choice how to interpret that.

That said for binary embeds you almost always need the length embedded as well, which has been the case for every tool I've used to embed files in object code. You usually get something like

    const size_t MY_FILE_LENGTH = ... ;
    const uint8_t MY_FILE[] = { ... };


> It doesn't have to, you just add a zero byte at the end of the embedded byte sequence in the object file. It's up to the programmer to make the choice how to interpret that.

Like, are you proposing that the ideal semantics for `#embed "foo"` if foo contained 0x10 0x20 0x30 0x40 to be to expand to `16, 32, 48, 64, 0`? That seems more annoying than the opposite, given that:

> That said for binary embeds you almost always need the length embedded as well, which has been the case for every tool I've used to embed files in object code. You usually get something like

TFA demonstrated it by relying on sizeof() of arrays doing the right thing:

    static_assert((sizeof(sound_signature) / sizeof(*sound_signature)) >= 4,
        "There should be at least 4 elements in this array.");
You'd need to change this to subtract one every time, which sounds more annoying when embedding binary resources than adding the zero to strings would be.


What? No, it shouldn't. Use pointer+size if you want strings


I was on X3J11, the ANSI committee that created the original C standard and my experience was similar. It was a great opportunity to learn C at depth and get an understanding of many of the subtle details. We rejected a great many suggestions because our mandate was to standardize existing practice, address some problem areas, and not get too creative. (We occasionally did get too creative. The less said about noalias the better.)


We are still fixing bugs in restrict...


Maybe you can answer a question I have: what companies are still supporting C compilers for sign-magnitude and 1s-complement machines today? I've been programming for almost 40 years now, and I have never come across any machine that is sign-magnitude or 1s-complement (I have encountered real analog computers---a decent sized one too---about 9' (3m) long, 6' (2m) high, and about 3' (1m) deep, requiring hundreds of patch cables to program).


"""Codify existing practice to address evident deficiencies. Only those concepts that have some prior art should be accepted. (Prior art may come from implementations of languages other than C.) Unless some proposed new feature addresses an evident deficiency that is actually felt by more than a few C programmers, no new inventions should be entertained."""

Source: Rationale for International Standard — Programming Languages — C https://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.1...

I don't know if this rationale is still followed, but I think it applies here. We need to be cautious when adding new features to C.


well, basic string support would be fine, wouldn't it? the C standard still having no proper string library for decades didn't harm its popularity, but still.

you cannot find non-normalized substrings (strings are Unicode nowadays), utf-8 is unsupported. coreutils and almost all tools don't have proper string (=Unicode) support.


> where NULL isn't on address 0

Isn't there literally a single GPU for which it is true?

Asking because everytime this surfaces, someone inevitably asks for an example, and the only example I've seen over the years was of one specific (Nvidia?) GPU that uses NULL of 0xFFFFFFFA (or something similar).

That is, do you know how common it is for NULL to not be 0?


There’s a lot of platforms where you might want to do this. If you’re programming baremetal, “address 0” might be a physical address that you expect stuff to exist at, so it might be relevant to use the bit pattern 0xffffffff instead. If you’re targeting a blockchain or WASM VM you may also not have memory protection to work with, just a linear array of memory. And some machines don’t even have bit patterns for pointers, like say a Lisp machine.


There are many MCUs where flash is mapped to 0x0 but NULL is still 0.

It’s not really a problem in practice unless you want to dump the whole flash and some != NULL check somewhere


For an older CPU example, x86 (in 16 bit mode) maps the interrupt table at the physical address 0. So to tell the CPU what handler to use for the 0th interrupt, you have to do:

    *(u16 *)0 = segment; // dereference of null!
    *(u16 *)2 = offset;
Granted, most 16 bit OSs were written in assembly, not C, but if you were to write one in C, you’d have this problem.

IIRC, the M68k (which was a popular C target with official Linux support) did the same thing.

For a more recent example, AVR (popularized by Arduino) maps its registers into RAM starting at address 0. So, if you wanted to write to r0, you could write to NULL instead. Although one would be using assembly for this, not C.


It's true (in some memory spaces) in AMD GPU too:

https://llvm.org/docs/AMDGPUUsage.html#memory-spaces


That's the one!


Here is an answer that includes a few examples systems from comp.lang.c

https://c-faq.com/null/machexamp.html


People who call C simple have some weird definition of simple. How many C programs contain UB or are pure UB? Probably over 95%. The language's not simple at all.


A straight razor is simple and that's why it's the easiest to cut yourself with. An electric razor is much safer precisely because much engineering went into its creation.


Thank you for your post!

Thank you especially for reminding everybody that programming is much more than web programming and information systems.


Thank you,

It's also worth remembering that a lot of higher-level languages have runtimes / VMs implemented in C. Web applications rely heavily on databases, JavaScript VMs, network stacks, system calls and operating system features, all of which are implemented in C.

If you are a software developer and want to do something about climate change, consider becoming a compiler engineer. If you manage to get a couple of tenths of a percent performance increase in one of the big compilers during your career, you will have materially impacted global warming. Compiler engineers are the unsung heroes of software engineering.


> If you manage to get a couple of tenths of a percent performance increase in one of the big compilers during your career, you will have materially impacted global warming.

I've heard this kind of claim a number of times and I think it's more complicated than the crude statistical measurement makes it sound. Personally, I think that most programs are not run frequently enough to matter from an emissions perspective. For programs that are, like ML training programs, users will just train more data if the algorithms are faster so most energy efficiencies will get wiped out by the increased usage.

Even if that theory is wrong, what if there is a language that is 10% better than C for 95% of common C use cases? Wouldn't it be better for compiler engineers to focus on developing that language than micro-optimizing C?


No JavaScript VM is implemented in C. They are all written in a language that's a bit like C++ but has no exceptions and relies on lots of compiler behaviour that is not defined by the C++ standard.


Hm? I can think of two pure C JS engines off the top of my head: Duktape and Elk. I believe Samsung or another vendor also has their own; they’re all somewhat common in the embedded space.


Fair. I'm not familiar with the tiny JS VMs. But really the main point stands: It's not possible to build a decent GC without violating strict aliasing so C and C++ as standardized are not suitable for this task.


I guess this doesn't exist then: https://bellard.org/quickjs/


This is not written in C if it doesn’t pass UBSan/ASan/Frama-C and co. It’s written in a language that just happens to look like C.


This is incredibly pedantic, even for Hacker News. I suspect you do not wish to be responded to like this in general, so I struggle to see why it is appropriate here.


The point is that if you are programming in a C dialect or a C++ dialect you will be told this again and again.

Ask a question on Stackoverflow about code that requires nonstandard flags or uses UB. You will be told your program is not really in C/C++ and that your question makes no sense. You will be lectured on nasal demons for the nth time.

File a bug against a compiler asking the optimizers to back off using UB to subvert the intentions of the programmer and you will be told there's no way to even know the intentions of the programmer given that the standards don't apply.

So we can all move to a nomenclature where C is used loosely to indicate a family of languages, one of which happens to be standardized. I'd be happy to do that.

But let's not bait and switch, using standards pedantry to dismiss feature requests and bug reports, but then turning around and saying C/C++ is suitable for implementing runtimes of other languages when nothing can really be achieved in that space without going beyond the language spec.

And really in 2022 runtimes, JITs, GCs should be the primary use of C/C++. Many other uses (systems software, compilers, desktop apps, phone apps) are not suited for C/C++ due to security, stability, ease of development and newer languages that make more sense for these domains.


Well…the answer is that the response you get will depend. If you are familiar with the standard and how it is implemented in compilers, you will get a sense for which UBs are implicitly blessed by compilers and which you can file bugs against. "I double freed a pointer and expected something reasonable to happen" is never going to get you a serious response. But if you ask something like "I tagged this pointer's bit pattern and untagged it and the pointer I got back isn't valid" you will generally be heard out. FWIW I would recommend picking a different language to write your language runtime in 2022; using C to support your new language is often a good way to invite memory unsafety bugs in your code.


It’s a valuable distinction! This is a thread about the C standard. It’s misleading to talk about projects that use a specific C compiler on a specific platform but plainly need implementation details past what’s in the standard, or even straight up violating them.


Well, yes and no. I do actually agree with your claim that Linux uses a nonstandard C because it builds with special flags and without them it would not work. But for code that just happens to have unintentional bugs that result in UB–I definitely consider that C. Code that has a gentleman's agreement with the compiler on certain UBs is straddling the line but I would still usually call it C, unless someone tried to claim that because GCC accepted something it must be part of the standard. My point really isn't that you're wrong or that this isn't useful in the right context–that's why I called it pedantic–but the difference between "we build our code with special flags and this doesn't work with any other code that is ostensibly in this language because of that" and "this works on major compilers but is not standards compliant" is kind of large and I don't think you really needed to go where you did. Sort of like if you made a typo in a comment about the English language I don't go "you must be using a language other than English, because clearly this is not valid" I say you have a typo, but if you're one of the people who makes everything lowercase and doesn't use punctuation I might reasonably say "it's mostly English but I'm not sure I can really call it that without qualification".


Close enough. Will you claim the Linux kernel isn't C because it's compiled with -fno-strict-overflow and -fno-strict-aliasing ?


Yes, that’s why it only supports specific C compilers.

Anything that includes its own memory allocator (that doesn’t call malloc()) is probably not implemented in standardized C.


It’s still “C”, even if it’s a specific dialect. Vendor specific C extensions have existed forever.


How would such a platform without file systems handle #include?

Reading further, I don't think this was ever addressed when someone else brought it up. I cannot for the life of me imagine a system where #include works but #embed doesn't. Again, it's fine if some systems have non-standard subsets of the C standard....why hobble the actual standard for code which can be compiled on systems where you have a filesystem (that will handle #include by the way) for the systems without filesystems?


> How would such a platform without file systems handle #include?

I don't think it would, you'd cross-compile for it on a platform with a file system. I think the parent poster's point was that C is the only option for some ultra low resources platforms and that a conservative approach should be taken to add new features in general. I don't think they were saying that specifically that not having a filesystem is problematic for this particular inclusion.


include is with regards to the source platform, not the target platform; you (generally) need a filesystem to compile, but you don't need a filesystem to run what you compiled


Congratulations! #embed is a very useful feature.

If I may gripe about C for a bit though. I do truly appreciate C's portability. It's possible to target a very diverse set of architectures and operating systems with a single source. Still, I do wish it would actually embrace each architecture, rather than try to mediate between them. A lot of my gripes with C are due to undefined behaviour which is left as such because of platform differences. I've never seen my program become faster if I remove `-fwrapv -fno-strict-aliasing`, but it has resulted in bugs due to compiler optimisations. I really wish by default "undefined behaviour" would become "platform-specific behaviour", with an officially blessed way to tell the compiler it can perform further optimisations based on data guarantees.

C occupies a very pleasant niche where it lets you write software for the actual hardware, rather than for a VM, while still being high level enough to allow for expressiveness in algorithms and program organisation. I just wish by default every syntactically valid program would also be a well-defined program, because the alternative we have now makes it really hard to reason about and prove program correctness (i.e. that it does what you think it does).


Thanks for your work on the C standard. Any changes that are made will remain forever, so I'm glad the committee takes this seriously.


Seems like a nice addition. Much better than futzing around with xxd and suchlike.


I'm curious what you think of UB from a standard perspective --- were things left undefined and not just implementation-defined because there was simply so much diversity in existing and possibly future implementations that specifying any requirements would be unnecessarily constraining? I can hardly believe that it was done to encourage compiler writers to do crazy nonsensical things without regard for behaving "in a documented manner characteristic of the environment" which seems like the original intent, yet that's what seems to have actually happened.


>I'm curious what you think of UB from a standard perspective

I think a lot about that! I'm a member of the UB study group and the lead author of a Technical Report we hope to release on UB.

In short, "Undefined behavior" is poorly named. It should have been called "Things compilers can assume the program won't do". With what we call "assumed absence of UB", compilers can and do do a lot of clever things.

Until we get the official TR out, you may find I made a video on the subject interesting:

https://www.youtube.com/watch?v=w3_e9vZj7D8


> And when I say truly, I mean for platforms without file systems, or operating systems or where bytes aren't 8 bits, that doesn't use ASCI or Unicode, where NULL isn't on address 0 and so on.

Genuine question: why do we want these platforms to live, rather than to be forced to die? They sound awful.

I understand retrocomputing, legacy mainframes, etc; but 99% of that work is done in non-portable assembler and/or some flavor of BASIC; not in C.


Many of these platforms are microcontrollers, DSPs or other programmable hardware that are in every device nowadays, so it's not retro, it's very much current technology.


Once again — I can understand wanting to program this hardware, but who's programming it in C, rather than writing directly to the metal in order to squeeze every cycle out of these?


Arduino programmers don't need to squeeze every cycle out of their hobby projects.


Because they weren't necessarily awful. Because in the future we may discover we need to do "weird" things again, for performance or other reasons.


Ha I suggested this on the C++ proposals mailing list 7 years ago:

https://groups.google.com/a/isocpp.org/g/std-proposals/c/b6n...

Enjoy the naysayers if you like! I'm glad someone spent the time and effort to push past them. Bit too late for me - I have moved on to Rust which had support for this from version 1.0.0.

> There's also the standard *nix/BSD utility "xxd".

> Seems like the niche is filled. Or, at least, if you want to claim that

> (A) XPM

> (B) incbin

> (C) "xxd -i"

> (D) various ad-hoc scripts given in http://stackoverflow.com/questions/8707183/script-tool-to-co...

>...do NOT completely fill this evolutionary niche

> This ultimately would encourage a weird sort of resource management philosophy that I think might be damaging in the long run.

> Speaking from experience, it is a tremendously bad idea to bake any resource into a binary.

> I'll point out that this is a non-issue for Qt applications that can simply use Qt's resources for this sort of business.

(Though credit to Matthew Woehlke, he did point out a solution which is basically identical to #embed)

> I find this useless specially in embedded environments since there should be some processing of the binary data anyway, either before building the application

In fairness there was a decent amount of support. But given the insane amount of negativity around an obviously useful feature I gave up.

I wonder if there was a similar response to the proposal to include `string::starts_with()`...


> > Speaking from experience, it is a tremendously bad idea to bake any resource into a binary.

What a pompous douche whoever wrote that was.

> > This ultimately would encourage a weird sort of resource management philosophy that I think might be damaging in the long run.

So, this might be a valid point, although not enough to reject the feature for. It's true that it's a feature that could potentially see over-use and ab-use. But then, so did templates :-P


> What a pompous douche whoever wrote that was.

And clearly someone who had never once written code for a system without a filesystem.


Are they writing code on a system without a filesystem? #embed is a preprocessor feature


what is the Rust equivalent for #embed?





> told me this form was non-ideal and it was worth voting against (and that they’d want the pure, beautiful C++ version only[1])

I heard about #embed, but I didn't hear about std::embed before. After looking at the proposal, to me it does look a lot better than #embed, because reading binary data and converting it to text, only to then convert it to binary again seems needlessly complex and wasteful. I also don't like that it extends the preprocessor, when IMHO the preprocessor should at worst be left as is, and at best be slowly deprecated in favour of features which compose well with C proper.

Going beyond the gut reaction and moving on to hard data, as you can expect from this design, std::embed of course is faster during compilation than #embed for bigger files (comparable for moderately-sized files, and a bit slower for tiny files).

I'm not a huge fan of C++, but the fact that C++ removed trigraphs in C++17 and that it's generally adding features replacing the preprocessor scores a point with me.

[1]: <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p10...>


Compilers follow the "as if" principle, they don't have to literally follow the formal rules given by the standard. They could implement #embed by doing as you say, pretty printing out numbers and then parsing them back in again. But that would be an extremely roundabout way to do it, so I doubt anyone will actually do it that way. Unless you're running the compiler in some kind of debugging mode like GCC's -E.


I don’t think the implication is that the C compiler must encode the binary file as a comma-separated integer list and then re-parse it, only act as if it did so.


How would that work? It would need to depend on the grammar of surrounding C code. This directive isn't limited to variable initialisers. You can use it anywhere. So e.g. you can use it inside structure declaration, or between "int main()" and "{". etc. etc. Those will generate errors in subsequent phases, but during preprocessing the compiler doesn't know about it. Then there is also just that:

  int main () {
      return
  #embed "file.bin"
      ;
  }
There are plenty of cases, where it will all behave differently. And if you're going to pretend even more that the preprocessor understands C syntax, then why not just give this job to compiler proper, which actually understands it?


Preprocessing produces a series of tokens, so you would implement it as a new type of token. If you're using something like `-E` you would just pretty-print it as a comma-delimited list of integers. If you're moving on to translation phase 7, you'd have some sort of rules in your parser about where those kinds of tokens can appear. Just like you can't have a return token in an initializer, you wouldn't be allowed to have an embed token outside of one (or whatever the rules are). And you can directly instantiate some kind of node that contains the binary data.


> you would implement it as a new type of token

That's a good point. I consider myself debunked.


the preprocessor is a great tool to reduce duplication and boilerplate.

People that don't like it generally just don't know how to use it.


People don't dislike it because they are unaware how helpful it can be. They dislike it because they are aware how hacky, fragile and error-prone it is. They want something more robust than text substitution.


People that don't like it generally have used macros that are more sophisticated than just blindly copy pasting text into your source files and have become aware of how absurd that is.


Or perhaps those people can think of better ways to get those benefits that also don't allow things like

    #ifndef asdf
    }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
    #endif
which obliterate tooling such as IDEs. Of course, this is a contrived example, but the preprocessor is just one big footgun, which offers no benefits over other ways of solving the problems you mentioned, such as constexpr and perhaps additional, currently unimplemented solutions.


It being possible to misuse a tool does not mean the tool is not very useful.


A tool being very useful doesn’t mean that it is a very good tool.

There are better tools for the functionality the C preprocessor attempts to provide. Other languages have module inclusion systems and very powerful macros that don’t have the enormous footguns of the C preprocessor.

Édit: to be clear, I think #embed is a fine idea; I’d use it and it would make my sourcebase cleaner in some places.


My carpenter has a lot of tools that can be dangerous if misused. Of course better tools can be devised, but useful things have been done with them (and he still has all his fingers)


Yes, but we’re in a thread about ways to improve the language, not about how to make the best with what’s there. This type of argument holds back improvement.


This serves the same use as Rust's `include_bytes!` macro, right? Presumably most people just use this feature as a way to avoid having to stuff binary data into a massive array literal, but in our case it's essential because we're actually using it to stuff binaries from earlier in our build step into a binary built later in the build step. Not something you often need, but very handy when you do.


This has different affordances than std::include_bytes! but I agree that if you were writing Rust and had this problem you'd reach for std::include_bytes! and probably not instead think "We should have an equivalent of #embed".

include_bytes! gives you a &'static [u8; N] which for non-Rust programmers means we're making a fixed size array (the size of your file) full of unsigned 8-bit integers (ie bytes) which lives for the life of the program, and we get an immutable reference to it. Rust's arrays know how big they are (so we can ask, now or later) but cannot grow.

#embed gets you a bunch of integers. The as-if rule means your compiler is likely to notice if what you're actually doing is putting those integers into an array of unsigned 8-bit integers and just stick all the file bytes in the array, short cutting what you wrote, but you could reasonably do other things, especially with smaller files.


for both Rust and C, these features "just" make something you could otherwise do with the build system and generated code easier, I think.


As the article quotes, in C the lack of standardisation makes this tricky when you want to support more than one compiler, or even when you want to support just one compiler (cf email about the hacks to make it work on GCC with PIE).


> Even among people who control all the cards, they are in many respects fundamentally incapable of imagining a better world or seizing on that opportunity to try and create one, let alone doing so in a timely fashion.

That does sound soul-crushing. Congrats on this achievement!


This is simply wrong. We (the ISO wg14) don't hold the cards; compilers are free to implement whatever they want, users are free to use whatever tools or languages they want.

We exist only as long as we are trusted to be good stewards, and only go forward with the consensus of the wider community.


You're both right.

It's amazing that you and the ISO team are good stewards of the C standard. Thank you for being part of that.

And it can also be true that it was "hell" and "hardly worth it" for the OP to get a new feature added to the language. I believe it was a miserable experience that has him questioning how he spends his time.

Both can be true. Thank you for your efforts. And thank the OP for his efforts too.


> > Even among people who control all the cards, they are in many respects fundamentally incapable of imagining a better world or seizing on that opportunity to try and create one, let alone doing so in a timely fashion.

> This is simply wrong. We (the ISO wg14) don't hold the cards, compilers are free to implement what ever they want, users are free to use what ever tools or languages they want.

This is an incredibly oblivious realization of JeanHeyd's point.


> (the ISO wg14) don't hold the cards

That "standard" card seem to be a pretty huge one though.


Yeah, but a document doesn't compile c code so best to stay humble. :-)


I think in our reality the prerequisite for holding all the cards is the lack of competence in knowing how to improve the world. We've gotten where we are now through sheer force of will of those that are empty handed.


The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.

George Bernard Shaw


This reminds me, I'd argue that the explosion of JS frameworks can be mainly blamed on one thing: the lack of an <include src="somemodule.html"> tag. If you have that you basically have 80% of vue.js already natively supported. No clue why this was never added in any fashion. Change my mind.


HTML imports were part of the original concept of Web Components, and I think they were supported in Chrome. If you look up examples of things built with Polymer 1.x, it was used extensively.

It was actually pretty neat, because you could have an HTML file with a template, style, and script section.

Safari rejected the proposal, so it had to get dropped.

But ESM makes it a bit redundant anyway. The end-goal is to allow you to import any kind of asset, not just JS. There have been demos and examples of tools supporting this going back over half a decade at this point.


Firefox refused the proposal as well. ESM requires javascript though. :/


My admiration for Mozilla and Apple grows every day /s


Same, but without the /s


It's funny I read that and I remember Apache's virtual-include facility:

    <!--#include virtual="/cgi-bin/example.cgi?argument=value" --> 
I used that, back in the day, as an alternative to PHP.


Wouldn't the include still need some templating functionality? Or are people using vue that heavily for just importing static html?


Not the parent comment, but my personal use case is for rendering a selectable list. The server side would render a static list with fragment links (ex. `#item-10`) and include elements with corresponding IDs, and a `:target` css rule to unhide the element. This would hopefully be paired with lazy loading the include elements.

edit:

My goal is to avoid reloading the page for each selection and rendering all items eagerly. JS frameworks are the only ones that really allow this behavior.


It was a feature in Chrome 36-79 and there were working polyfills to make it work on other browsers.

It was actually a great feature and I used it extensively on an old project back then.

CanIUse: https://caniuse.com/imports

(Now obsolete) tutorial: https://www.sitepoint.com/introduction-html-imports-tutorial...


I wonder why there's never been a

    <calc> /* ... compute and return dom element */ </calc>
Basically what php does but with structure and objects instead of a bytestream

in HTML. Or maybe it's been discussed but got left out


It's always been possible:

<script>document.write(`<p>foo</p>`)</script>


Good point, but it's very different in terms of thinking. Same as structural versus string macros in programming languages.


You can do it with custom elements, e.g.

      <dom-calc>
        const p = document.createElement('button');
        p.innerText = 'hi';
        p.onclick = () => alert('hi');
        return p;
      </dom-calc>

For something like:

  class CalcElement extends HTMLElement {
    connectedCallback() {
      setTimeout(() => {
        const fn = new Function(this.innerText);
        this.replaceWith(fn());
      }, 0);
    }
  }

  customElements.define('dom-calc', CalcElement);


    <?php require("somemodule.html"); ?>


How would <include> be useful for dynamically updating the DOM based on data, which is the main point of Vue?


Is <script type="module" /> not sufficient for your needs? If not then what is missing?


Seems to be arguing for modular layout/templating, which is what virtual includes did (the cgi in the example would hypothetically output html)


Honestly I'm usually very wary of additions to C, as one of its greatest strengths (to me) is how rather straightforward it is as a language in terms of conceptual simplicity. There just aren't that many big concepts to understand in the language. (On the other hand there's _many_ footguns but that's another issue.)

That said, to me this seems like a great addition to the language. It's very single-purpose in its usage (so it doesn't seem to add much conceptual complexity to the language) and it replaces something genuinely painful (arcane linker hacks). I'm very much looking forward to using this as I often make single-executable programs in C. The only thing that's unfortunate is I'm sure it'll take decades before proprietary embedded toolchains add support for this.


C23 and C26 are basically heading into C++ without classes.


There is way too much in C already.

The first commandment of C is: 'writing a naive C compiler should be "reasonable" for a small team or even one individual'. That's getting harder and harder, longer and longer.

I did move from C being "the best compromise" to "the less worse compromise".

I wish we had a "C-like" language, a kind of high-level assembler which: has no integer promotion or implicit casts, has compile-time/runtime casts (without the horrible C++ syntax), has sized primitive types (u64/s64, f32/f64, etc.) at its core, has sized literals (42b, 12w, 123dw, 2qw, etc.), has none of typedef/generic/volatile/restrict and that sort of horrible thing, has compile-time and runtime "const"s, and I am forgetting a lot.

Among the main issues: the kernel gcc C dialect (roughly speaking, each Linux release uses more gcc extensions), and aggressive optimizations that can break some code (while programming some hardware, for instance).

Maybe I should write assembly, expect RISC-V to be a success, and forget about all of this.


I wish we had something like typed Lua without Lua’s weird quirks (e.g. indexing from 1), designed with performance and safety in mind, and with the features you mention.

But like Lua, the base compiler is really small and simple and can be embedded. And it’s “pseudo-interpreted”: ultimately it’s an ahead-of-time language to support things like function declarations after references and proper type checking, but compiling unoptimized is practically instant and you can load new sources at runtime, start a REPL, and do everything else you can with an interpreted language. Now having a simple compiler with all these features may be impossible, so worse-case there is just a simple interpreter, a separate type-checker, and a separate performance-optimized JIT compiler (like Lua and LuaJIT).

Also like Lua and high-level assembly, debugging unoptimized is also really simple and direct. By default, there aren’t optimizations which elide variables, move instructions around, and otherwise clobber the data so the debugger loses information, not even tail-call optimization. Execution is so simple someone will create a reliable record-replay, time-travel debugger which is fast enough you could run it in production, and we can have true in-depth debugging.

Now that I’ve written all that, I realize this is basically ML. But OCaml still has weird quirks (the object system), SML too honestly, and I doubt their compilers are small and simple enough to be embedded. So maybe a modern ML dialect with a few new features and none of the more confusing things in Standard ML.


Check out Nim! It does much of what you describe, and it's great. The core language is fairly small (not quite Lua-simple, but probably ML-comparable). It compiles fast enough that a Nim repl like `inim` is usable to check features and for basic maths; it requires a C compiler, but TCC [4] works perfectly. Essentially Nim + TCC is pretty close to your description, IMHO. Though I'm not sure TCC supports non-x86 targets.

I've never used it, but Nim does support some hot reloading as well [3]. It also has a real VM if you want to run user scripts, and has a nice library for it [1]. It's not quite as flexible as Lua, but for a generally compiled language it's impressive.

Recently I made a wrapper to embed access to the Nim compiler's macros at runtime [2]. It took 3-4 hours probably, and still compiles in tens of seconds despite building in a fair bit of the compiler! It was useful for making a code generator for a serializer format. I'm not sure it's small enough to live on even beefy M4/M7 microcontrollers, though I'm tempted to try.

1: https://github.com/beef331/nimscripter 2: https://github.com/elcritch/cdecl/blob/main/src/cdecl/compil... 3: https://nim-lang.org/docs/hcr.html 4: https://bellard.org/tcc/


Nim is great, shame it isn't more popular; it's my go-to for what would previously have been Go/Rust/Python.


> I wish we had a "C-like" language, which would... How about https://ziglang.org/ ?


GCC or Clang with all warnings turned on will give you almost what you want. -Wconversion -Wdouble-promotion and 100s of others. A good way to learn about warning flags (apart from reading the docs) is Clang -Weverything, which will give you many, many warnings.


Not an exact match, but a close one: https://odin-lang.org/


I agree (with a lot of caveats), but a key value of C is that we do not break people's code, and that means we can't easily remove things. If we do, we create a lot of problems. This makes it very difficult to keep the language as easy to implement as we would like. As a member of WG14, I intend to propose that we do make this our prime priority going forward.


> I wish we had a "C-like" language, which would kind of be a high-level assembler which: has no integer promotion or implicit casts, has compile-time/runtime casts (without the horrible c++ syntax), has sized primitive types (u64/s64,f32/f64,etc) at its core, has sized literals (42b,12w,123dw,2qw,etc), has no typedef/generic/volatile/restrict/etc well that sort of horrible things, has compile-time and runtime "const"s, and I am forgetting a lot.

Unsafe Rust code I think fits this model better than C does: it relies on sized primitive types, it has support for both wrapping and non-wrapping arithmetic rather than C's quite frankly odd rules here, it has no automatic implicit casts, it has no strict aliasing rules.


The first commandment of C is: 'writing a naive C compiler should be "reasonable" for a small team or even one individual'. That's getting harder and harder, longer and longer.

100% agreed. I've always viewed C as a "bootstrappable" language, in which it is relatively straightforward to write a working compiler (in a lower level language, likely Asm) which can then be used to bring up the rest of an environment. The preprocessor is actually a little more difficult in some respects to get completely correct, and arguably #embed belongs there, so it's debatable whether this feature is actually adding complexity to the core language.

Your wish for a "C-like" language sounds very much like B.


Time for a B+ language?

There is so much more to remove: one loop statement is enough (loop {}); enum should go away, along with the likes of typeof, etc.

I wonder if all that makes writing a naive "B+" compiler easier (time/complexity/size) than a plain C compiler. I stay humble since I know removing does not mean easier and faster all the time, the real complexity may be hidden somewhere else.


Are you a programmer? Embed is the easiest feature to implement that I have ever heard of


I think the blog post provides some insight into the challenges of implementing this.


Implementing it without embed. With embed the author says

> It feels like I wasted a lot of my life achieving something so ridiculously basic that it’s almost laughable

Which makes me think I should never get involved with an ISO committee, not something I want done fast at least


The question here is why this did not already exist as an extension in some compilers. Getting something standardized that exists already in compilers and is used is far easier.


Because of the big circle of life.

The people on the C standards committee are gold bricks that say things like we won't add anything to the standard that isn't implemented in two or more compilers.

The compiler writers program in C++. And say things like if you want that feature that's a good reason to use C++ and stop using C. But if the standards committee adds it of course we will.

Also, embed is exactly the kind of feature that end users would find very useful and compiler writers would not care about at all.


I’m really amazed at how divisive this one is, and the number of comments here questioning what seems to me a really useful and well-thought-out feature, something I’d have loved to have used many many times over the years.

I guess the heated arguments here help me understand how it could have taken so long to get this standardised, though, so that’s something!

Congratulations and thank you to the OP for doing this, and thanks also for this really interesting (if depressing) view of the process.


This is a really, really good feature and I am so glad it is finally getting standardized. C23 is shaping up to be a very good revision to the C standard. I’m hoping the proposal to allow redeclaration of identical structs gets in as well as you would finally be able to write code using common types without having to coordinate which would allow interoperability between independently written libraries.


Congratulations to the author. Things like this are why I hope Carbon exists. Evolving C++ seems like a dumpster fire, despite whatever compelling arguments about compatibility you are going to drop on me.


The issue is that a lot of people just think about languages in a wrong way, which is the whole reason for pointless things like C++ expansions, Carbon, Rust, and stuff like this.

One of the fundamental ideas that people run with in language creation/expansion is "programmer is stupid and/or make mistakes" -> "lets add language features that intercept and control his stupidity/mistakes".

And there is a very valid reason for this - it allows programmers of lesser skill and knowledge to pick up codebases and develop safe software, which has economic advantages: being able to hire less experienced devs to write software at lower salary points and spend next to no time fixing segfault issues due to complex memory management. The whole reason Java got so popular over C++ was its GC - both C++ and Java supported fairly strong typing with classes, but C++ still had a lot of semantics around memory management that had to be taken care of, whereas with Java you simply don't do anything.

However, people are applying this idea towards lower level languages, because they want the high performance of a compiled language with a whole bunch of features that make writing code as mistake free as possible. And my challenge to that is this - why not spend the time making just smarter compilers/tooling?

Think about a hypothetical case where Rust gets all the features added to it that people want, and is widely used as the main language over all others. Looking at all the code bases, there will be a lot of common use patterns, a lot of the safety code duplicated over and over in predictable patterns, etc. And you will see these common things added to Rust. Just like with Java, a lot of the predictable use patterns got abstracted into widely used libraries like Lombok, Spring, etc., where you don't have to worry about correctness in lieu of using a library. And you essentially start to move towards more and more stuff being handled for you automagically, which is all part of the compiler/toolchain.

In the same way, #embed can be solved by a smart compiler. Have a static string that opens a file, and read its contents into a buffer that doesn't change? Auto-include that file in the binary if you want to target performance rather than executable size. No need for a special directive; just be smart about how you handle an open call, and leave the fine tuning of this to specific compiler options.

And from the economic perspective of ease of use from above, you would have a language like Python which is super easy to pick up and program in, except instead of the interpreter, you would have a compiler that spits out binaries. Python is already widely adopted primarily because of how easy it is to set up and use. Now imagine if you had the option to run a super smart compiler that highlights any potential issues that come with dynamic typing because it understands what you are trying to do, fixes any that it can, and once everything is addressed, spits out an optimized memory-safe executable. With Rust, you code, compile, see you made a mistake somewhere with a reference, fix it, repeat. With this, you would code, compile, fix the mistake that the compiler warns you about, repeat. No difference.

Focusing on the toolchain also lets you think about integrating features from languages like Coq with provability, where you can focus not only on correctness processing/memory wise, but also "is the output actually correct". I.e, any piece of code for all given input can be specified to have guaranteed bounded output set, which you can integrate into IDE tools to provide you real time feedback on this for you to design the code in a way that avoids things like URL parsing mistakes, which all the languages safety features of Rust won't catch.

As for C, you leave it a version that has a stable, robust ABI, and then anything that you need to support will be delegated to custom tools. That way, in the future where compute will likely be full of specialized ML chips, instead of worrying about writing the frontend to support every feature, you quickly get a notional tool chain made and are able to run existing C code.


Re: #embed </dev/urandom>

Just a random thought, but I'd expect a compiler to do exactly what's described if I tell it:

    static char foo[123] = {
        #embed </dev/urandom>
    };
This would address the most common case with infinity files, and then just let the compiler error out if the array size is not specified.


I would expect that to produce an error, though. If I had a regular file that was not infinite in size, and I specified the wrong length for the array, I would find it more useful to have the compiler inform me as to the discrepancy rather than truncate my file.


A warning, not an error. Both under and over-population can be valid use cases.


Reproducible builds crowd wailing in agony


The context is that of infinite files, not of the urandom specifically. Give the linked post a read for details.


They just need a reproducible urandom.


Looks like you forgot to add your sponsor disclaimer "message sponsored by NSA". ;)


Well, just don’t do networking in the VM your builds run in and you’re good, maybe.


Remove /dev/urandom and replace it with /dev/zero in your sandbox ;)


The preprocessor needs to run before the compiler, though, and isn't complex enough to understand the context of the code that it's in. That would be a substantially complex thing to implement.


This will indeed require delaying population of the array to the compilation stage. However it's worth the convenience and the succinctness of the syntax, and it's not that substantially complex to implement.


Interesting. I look forward to this. What I've been doing now to embed a source.png file is something like this, where I generate source code from a file's data:

in embed_dump.cpp:

  #include <fstream>
  #include <iostream>
  int main(){
   std::ifstream f;
   f.open("./source.png", std::ios::binary);
   std::cout
    << "//automatically generated by embed_dump from project files:" << std::endl
    << "const char embedded_tex[] = {";
   char a;
   while(f.read(&a,1)){
    std::cout << int(a) << ",";
   }
   std::cout << "};" << std::endl;
   f.close();
  }
Then I set up my makefile like this (main_stuff.cpp #includes embedded_files.h):

  main_stuff: main_stuff.cpp embedded_files.h
   c++ main_stuff.cpp -o main_stuff
  embedded_files.h: embed_dump source.png
   ./embed_dump > embedded_files.h
  embed_dump: embed_dump.cpp
   c++ embed_dump.cpp -o embed_dump


What about creating object files from raw binary files and then linking against them? That's what I (and of course many others) do for linking textures and shaders into the program. It's a bit ugly though that with this approach you can't generate custom symbol names, at least with the GNU linker.

This #embed feature might be a nice alternative for small files. For large files you usually don't even want to store them inside the binary, so the compilation overhead should be minuscule, since the files are, by intention, small.

When I read the introduction of the article - about allowing us to cram anything we want into the binary - I was hoping to see a standard way to disable optimizations (When the compiler deletes your code and you don't even notice).


One reason against this is mentioned in the letter that is quoted in the article


It depends on your definition of small files. A few hundred kB to a few megabytes will make compilation speed and memory usage explode if you embed it as text, see section 3.2 in https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3017.htm#i...


You reminded me of Bethesda Softworks games, which always seem to have 1GB+ executables for some reason. I hope it isn't all code. Maybe they embed the most important assets that will always need to be loaded.


My guess is that the files are not truly embedded, as that would require loading the entire file into memory before running the application, which seems wasteful.

More likely, the actual executable is only a small part of the file which accesses the rest of the file as an archive, like a self-extracting zip. There may also be some DRM trickery going on.


vmprotect and other DRM schemes will also bloat those sizes.

Although I've only seen 1GB+ PDBs; executables are always in the hundreds of megabytes.


Off the top of my head, I think there's some niche use in embedding shaders so that they don't need to be stored as strings (no IDE support) or read at runtime (slower performance).


There are a lot of use cases for baking binary data directly into the program, especially in embedded applications. For instance, if you are writing a bootloader for a device that has some kind of display you might want to include a splash screen, or a font to be able to show error messages before a filesystem or an external storage medium is initialized. Similarly, on a microcontroller with no external storage at all you need to embed all your assets into the binary; the current way to do that is to either use whatever non-standard tools the manufacturer's proprietary toolchain provides, or to use xxd to (inefficiently) generate a huge C source file from the contents of the binary file. Both require custom build steps and neither is ideal.


Another typical use is embedding a public-key in an application or firmware.


You can get some IDE support with a simple preprocessor macro[1].

It's a crutch, but at least you don't need to stuff the shader into multiple "strings" or have string continuations (\) at the end of every line. Plus you get some syntax highlighting from the embedding language. I.e. the shader is highlighted as C code, which for the most part seems to be close enough.

[1] https://github.com/phoboslab/pl_mpeg/blob/master/pl_mpeg_pla...


That's pretty clever

Thanks, I'll remember to use that in my future shaders


Nice for binary shaders too, e.g. SPIR-V bytecode generated by glslc.


Other stuff:

https://twitter.com/rcs/status/1550526425211584512

nullptr! auto! constexpr!


Not sure about the value of nullptr! Also not sure about auto! In C.


nullptr, since we have type detection now, and NULL isn't required to be a pointer. auto, because otherwise everybody would create their own hacky auto using the new typeof.


If you want to start playing with this now, my C preprocessor Cedro (https://news.ycombinator.com/item?id=28166125, https://news.ycombinator.com/item?id=31193138) will insert the byte literals for the compiler:

https://sentido-labs.com/en/library/cedro/202106171400/#bina...

That’s the same as what `xxd` does, for instance, and it works with C89/C99/C11.

The advantage is that since it uses the same syntax as C23, it is easier to switch to using the compiler: just remove the Cedro pragma `#pragma Cedro 1.0 #embed` and it will compile as C23.

The source code (Apache 2.0) and GitHub link are at: https://sentido-labs.com/en/library/


Looks great. I've been writing cmake hacks to include assets in executables for too long.


Can’t you just add binary data into a custom section of your ELF executable?


Yes, but it's linker-specific and non-portable. It can also come with some annoying limitations, like having to separately provide the data size of each symbol. In some cases this might be introspectable, but again comes at the expense of portability.

ELF-based variants of the IAR toolchain, for example, provide a means of directly embedding a file as an ELF symbol, but without the size information being directly accessible.

GNU ld and LLVM lld do not provide any embedding functionality at all (as far as I can see). You would have to generate a custom object file with some generated C or ASM encoding the binary content.

MSVC link.exe doesn't support this either, but there is the "resource compiler" to embed binary bits and link them in so they can be retrieved at runtime.

Having a universal and portable mechanism which works everywhere will be a great benefit. I'll be using it for compiled or text shaders, compiled or text lua scripts, small graphics, fonts and all sorts.


This article[1] shows how you can use GCC toolchain along with objcopy to create an object file from a binary blob, link it, and use the data within in your own code.

[1] https://balau82.wordpress.com/2012/02/19/linking-a-binary-bl...


Great for Linux. Less so for Mac OS-X and Solaris (all three of which I use at work).


The article address this directly. If you're only targeting one platform then this is reasonably easy (albeit still not as easy as #embed), but if you need to be portable then it becomes a nightmare of multiple proprietary methods.


You probably don't realize that not every system is using ELF binaries....


Sure, but to add binary data to any executable on any platform is more involved.

As an example, see [1]. That will turn any file into a C file with a C array, and I use it to embed a math library ([2]) into the executable so that the executable does not have to depend on an external file.

[1]: https://git.yzena.com/gavin/bc/src/branch/master/gen/strgen....

[2]: https://git.yzena.com/gavin/bc/src/branch/master/gen/lib.bc


Then you don’t get regular optimizations like deduping identical declarations.

Or source line location debug info, though nobody tries to show that for data at the moment.


The less you have to mess with linker scripts, the better.


5 years ago I wrote a small python script [1] to help me solve "the same problem". It reads files in a folder and generates a header file containing the files' data and filenames. It's very simple and was written to help me on a job. It has limitations, don't be too hard on me :)

[1] https://github.com/daxliar/pyker


This will simplify a lot of build pipelines for sure.

One thing that isn't clear from skimming the article, how do you refer to the embedded data again?


> The directive is well-specified, currently, in all cases to generate a comma-delimited list of integers

I.e. you most likely use it to initialize a static variable, and then refer to that variable.


Ah so this basically?

    static const char d[]={
    #embed ...
    };
EDIT: ah it was showing up more like a comment which made it hard to spot.


Yes. There is an example in the article


Everlasting glory to JeanHeyd Meneide, @thephantomderp, for getting this feature into C.

I am wondering, though - where does this stand in C++?


This is a cool feature and I'll likely be using it in the years to come. However, the posix standard command xxd and its -i option can achieve this capability portably today.

It will be useful to achieve it directly in the preprocessor however. I wonder how quickly can it be added to cpp?


I've always used xxd -i for embedding, doesn't have the mentioned problems and works everywhere, as it simply outputs a header file with byte array.


Well, congratulations, you now have a build dependency on Vim. (xxd is not a standard tool, it ships with Vim.)

It's also only suitable for tiny files: compile time and RAM requirements will blow up once you go beyond a couple of megabytes.


> It's also only suitable for tiny files: compile time and RAM requirements will blow up once you go beyond a couple of megabytes.

Do you know what makes it so? Is there a technical argument why the compiler could do better, except maybe for xxd not being specifically optimized for this use case?


The compiler has an as-if rule on its side.

It's allowed to do whatever it wants so long as the results are as-if it did what the standard says. So even though the standard says this is making a big list of integers like your xxd command, the compiler won't do that, because (as a C compiler) it knows perfectly well it would just parse those integers into bytes again, just like the ones it got out of the binary file. It knows the integers would all be valid (it made them) and fit in a byte (duh) and so it can skip the entire back-and-forth.


If I understand correctly what you're saying, it's not a problem with xxd itself, but with the format of the data, which in xxd's case is a pair made of an array of bytes and its length, while the compiler is free to include the binary verbatim and just provide access to it as if it were an array/length pair. Am I right about that?


Yeah, these are reasonable arguments against it


The one in the original article is that it performs badly for large files.


The article spends a fair bit of time discussing the build speed and memory use problems with that approach. Like, the benchmark results [0] linked to from this post literally have xxd as one of the rows. It's not a viable option for embedding megabytes of data.

[0] https://thephd.dev/embed-the-details#results


And even if the data is small enough, not every C programmer uses Unix or knows their way around it.


But you have to have build system stuff for that and it's obviously non portable.


True, I don't personally ever have a problem with this because I always compile from a Unix system anyway. (Even for Windows.)


It is something that I had wanted in C too, for a while, so I am glad that they added this #embed command.


How do you read unsigned data? Is there a standardized parameter, or does this require a vendor extension?


You just make your array type uint8_t or whatever you need as long as it supports integer literals. See section 4 in https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3017.htm#i...


Scary, it's as if the preprocessor has become type-aware. I guess I better don't imagine the result of the preprocessing to look similar to and following the same rules as something I would have written by hand. This might make manual inspection of the preprocessed file a bit painful.


It's not really a preprocessor stage. Probably better to think of it more like a pointer cast to some binary blob. Though it'd be interesting to see what `gcc -E` would produce.


What situations is this useful for?


One particular scenario that people have highlighted is developing for an embedded system that doesn't have any storage except flash memory, and no filesystem. In this kind of system, embedding static resources in the executable is the only reasonable option you have.


> “Touch grass”, some people liked to tell me. “Go outside”, they said (like I would in the middle of Yet Another Gotdang COVID-19 Spike). My dude, I’ve literally gotten company snail mail, and it wasn’t a legal notice or some shenanigans like that! Holy cow, real paper in a real envelope, shipped through German Speed Mail!! This letter alone probably increases my Boomer Cred™ by at least 50; who needs Outside anymore after something like this?

Touch grass indeed. Sure, #embed is a nice feature, but this self-indulgent writing style I can’t stand.


Maidenless commenter.


This is wrong.

It belongs in the linker and you can pull the symbol it creates in with extern. I’ve been doing this for about 25 years.


...and now your solution is non-portable and as a cross-platform developer you need to implement N different build scripts. This is far more elegant.


No C toolchain is portable!

If that's a problem, use Go or another higher level language.


I guess the programs that I write which work on Mac / Windows / Linux / Wasm / i686 / amd64 / armv7 / aarch64 are just a bad dream I had


Only because you know the platform specific nuances...


Isn't one of the supposed advantages of C that it works on many platforms?


The last 40 odd years have proven that it does work but what you write isn’t portable.


That’s not the case, if you take a bit of care. Look at STB for example (https://github.com/nothings/stb) -- I’ve successfully used STB functions on a bunch of different platforms. In both C and C++, even!


Ah yes the “if you take a bit of care” argument. Welcome to the 1% of C programmers club :)


C89 is where C should've stayed at. If you need to convert a file to a buffer and stick that somewhere in your translation unit, use a build system. Don't fuck with C.


Nothing stops you from sticking to C89 if that's what you want. Many projects do, and the -std=c89 option will not disappear anytime soon.


Did you read the snail mail letter from someone who does just that?


> "Did you read the snail mail letter from someone who does just that?"

I did. The author struggled embedding files into their executables with makefiles. We don't know anything else beyond that. So what?

People also struggle with memory management in C, an arguably much more difficult and widespread problem. Should we introduce a garbage collector into the C spec? How about we just pull in libsodium into the C standard library because people struggle with getting cryptography right?

OP mentions #embed was a multi-year long uphill battle, with a lot of convincing needed at every turn. That in itself is enough proof that people aren't in clear agreement over there being a single "right" solution. Hence, leave this task to bespoke build systems and be done with it. Let different build systems offer different solutions. Allow for different syntaxes, etc. Leave the core language lean.


It takes literally 5 minutes to write a python script that does this.

It took a long time to get this adopted because people are most likely busy with things that cannot be already solved trivially.
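For what it's worth, a sketch of such a generator (hypothetical script, functionally similar to `xxd -i`, and subject to the scaling problems for large files discussed elsewhere in the thread):

```python
import sys

def embed(path, name):
    """Render a file's bytes as a C array definition plus a length constant."""
    data = open(path, "rb").read()
    body = ",".join(str(b) for b in data)
    return (f"const unsigned char {name}[] = {{{body}}};\n"
            f"const unsigned long {name}_len = {len(data)};\n")

if __name__ == "__main__" and len(sys.argv) > 2:
    # usage: python embed.py <input-file> <symbol-name>
    sys.stdout.write(embed(sys.argv[1], sys.argv[2]))
```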


I think it's nice that this will soon be possible to do without adding Python as a dependency in your build system


> trivially

The article covers quite a few reasons why the way things are done without #embed are not quite as trivial as they seem.


I've been doing it for 20 years without any single issue, on fairly large files.

This proposal doesn't even allow to compress or encrypt the data.


And yet the issues in the article are real and non-trivial. Perhaps you just never hit them, or you have a high tolerance for non-trivial solutions.


Don't the people making the standards have other things to do, like integrating useful features, instead of duplicating incbin.h [0] years after that feature worked?

https://github.com/graphitemaster/incbin/blob/main/incbin.h


That doesn't work on MSVC without an external source-generating tool.


> The directive is well-specified, currently, in all cases to generate a comma-delimited list of integers.

While a noble act, this is nearly as inefficient as using a code generator tool to convert binary data into intermediate C source. Other routes to embed binary data don't force the compiler to churn through text bloat.

It would be much better if a new keyword were introduced that could let the backend fill in the data at link time.


You should read or re-read the article and references. There are multiple benchmarks showing this not to be the case. Actually, half the article is a (well-deserved) rant about how wrong compiler devs were in thinking that parsing intermediate C sources could ever match the new directive. The compiler's internal representation of an array of integers also doesn't require a big pile of integer ASTs.

According to the benchmarking data this extension is even 2x faster than using the linker `objcopy` to insert a binary at link time as you suggest.


C++ keeps kicking ass!

Feel sorry for crab people.


The article definitely isn’t a glowing praise of C/C++. In fact, including this simple, useful feature that rust has had for a decade now has taken an immense amount of effort and received so much pushback from various parties, in part due to the strangled mess of various compiler limitations and in part because of design-by-committee stupidity.

C/C++ seems to be kicking its own ass.


Not to mention that it didn't even get into C++


The article doesn't even mention C++/Java.


Second paragraph of the title article:

> Surprisingly, despite this journey starting with C++ and WG21, the C Committee is the one that managed to get there first

Later it mentions presenting their first formal attempt at this at the Belfast 2019 meeting. That's a C++ meeting; it was too late for this to make C++20 at that point, but it easily could have been in C++23 (it is not).


That's not really the right conclusion to draw from this article


"crab people" means Rust people?


Officially, Rust's "R in a gear" logo is the symbol of the language. It is a registered trademark of the Rust Foundation.

But it's a bit boring. Unofficially, Rust has a mascot, in the form of a crab named "Ferris". The crab mascot appears in lots of places, and the Unicode crab emoji U+1F980 is often used by Rust programmers to indicate Rust in text. Unlike the trademarked logo, you can have a bit of fun with such an unofficial symbol, for example Jon Gjengset's "Rust for Rustaceans" book cover has a stylised crab wearing glasses with a laptop apparently addressing a large number of other crabs.


Yeah, I know that, that's why I asked if that's what parent meant, because I can't see why Rust was mentioned there


> vendor extensions ... were now a legal part of the syntax. If your implementation does not support a thing, it can issue a diagnostic for the parameters it does not understand. This was great stuff.

I can’t be the only one who thinks magic comment is already an ugly escape hatch, adding a mini DSL to it that can mean anything to anyone just makes it ten times worse. It’s neither beautiful nor great.

> do extensions on #embed to support different file modes, potentially reading from the network (with a timeout), and other shenanigans.

(Emphasis mine.) My god.


To be completely honest, I find the fact that this was raised by the committee to be really obtuse and unnecessary. The same "complaint" could be raised about #include as well.

If you want to include data from a continuous stream from a device node, then you could just as easily have the data piped into a temporary file of defined size and then #embed that. No need to have the compiler cater for a problem of your own making.

As for the custom data types. It's a byte array. Why not leave any structure you wish to impose on the byte array up to the user. They can cast it to whatever they like. Not sure why that's anything to do with the #embed functionality.

Both these things seem to be massive overthinking on the part of the committee members. I'm glad I'm not participating, and I really do thank the author for their efforts there. We've needed this for decades, and I'm glad it's got in even if those ridiculous extensions were the compromise needed to get it there.


I guess you don't know C.

"#" is not the symbol for a comment line but the marker for a preprocessor directive, like #include <stdlib.h>.

In C/C++ you use // and /* */ for comments.


I’ve known C for close to two decades, thank you. I’m using the not at all well defined term “magic comment” to loosely refer to everything that’s not strictly speaking code but has special meaning, which include pre-processor directives.

cpp is definitely a well-hated part of C.


> I’ve known C for close to two decades, thank you. I’m using the not at all well defined term “magic comment”

Please forgive those of us who've been using C since the '80s, or earlier, for assuming you don't know C when you invent your own terminology for preprocessor directives.


This is not a preprocessor directive though, from reading the post I don’t think cpp is expanding the #embed into an array initializer, otherwise there’s no performance benefit at all.


The deliberate wording expands #embed to C's initializer list.

The major performance benefit is from the as-if rule. The compiler is entitled to do whatever it wants so long as the result is as-if it worked the way the standard describes.

So a decent compiler is going to notice that you're #embed-ing this in a byte array, and conclude it should just shovel the whole file into the array here. If it actually made the bytes into integers, and then parsed them, they would of course still fit in exactly one byte, because they're bytes, so it needn't worry about actually doing that which is expensive.

Does it work if you try to #embed a small file as parameters to a printf() ? Yeah, probably, go try it on Godbolt (this is an option there) but for the small file where that's viable we don't care about the performance benefit. It's just a nice trick.


The problem with the preprocessor often is that it is a language inside another language and the preprocessor is designed to be almost completely agnostic of the language it is embedded in. So there might be subtle ways to use the preprocessor so that implementing as-if becomes very unintuitive. I don't have a good intuition about this case, if this is 100% designed in a way that it can never provoke such subtle side-effects. Basically what might end up happening is that the preprocessor has to learn some part of the C language to decide if such an as-if transformation is possible and then branch to either do it or don't.


Thanks for pointing out the difference, I stand corrected.


It's a preprocessor directive. The compiler injects the array based on your instruction.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3017.htm#a...


> (Emphasis mine.) My god.

yes, C is finally catching up with what languages such as F# have been able to do for years with great success https://docs.microsoft.com/en-us/dotnet/fsharp/tutorials/typ... ; wild, isn't it, to step into the 2010 era of programming?


I suppose you’re of the opinion that every feature of every language should be added to C, or maybe even assembly.


This isn't adding a new feature. It's replacing a feature that everyone already implemented independently in every other project (some with xxd, some with special embedders such as rcc or windres, some through CMake directly, like this: https://gist.github.com/sivachandran/3a0de157dccef822a230 for instance) with a standard and more performant way. Instead of being paid per project, this cost will now be paid per compiler implementation, which is unambiguously good, as there are far fewer compilers than C projects.


You replied to my quote about pulling network resources and “other shenanigans”, which certainly isn’t what “everyone already implemented independently”. Plus that’s a potential vendor extension, i.e., some may implement it independently, some may not, implementations will likely differ.


> You replied to my quote about pulling network resources and “other shenanigans”, which certainly isn’t what “everyone already implemented independently”.

?? Pulling network resources works, today, with what people are using. There's zero difference between "cat foo.txt | xxd -i" and "curl https://my.api/foo.json | xxd -i".


Hell, I #include files mounted over FUSE regularly.



Personally I feel the C committee should've disbanded after the first standard (the C++ one, after the 2003 technical corrigendum). I didn't mind C99 much, but it looks like C(++)reeping featuritis is a nasty habit.

These gratuitous standards prompt newbies to use the new features (it's "modern") and puzzled veterans to keep up and reinternalize understanding of new variants of languages they've been using for decades. There's no real improvement, just churn. Possibly it's one of the instruments of ageism. More incompatibility with existing software and existing programmers.


Having to learn new things throughout your career isn't ageism.


The question is: which other things, and why?

One problem of this new millennium is that the field has developed a tendency to do old things with new languages instead of doing new things with old languages.

Reinventing another polygonal wheel approximation while constantly tweaking the theory used serves to segregate the experienced (who may have trouble accepting such tweaks or their necessity - they already know about the existence of wheels anyway) from the newbies (who have no previous intuitions and no taste for legitimate and spurious novelty).

Newbies are cheap, and new ideas are hard. Let's do some mental rent seeking.


This is a cool feature, but the author doesn't do himself any favors with a style of writing that greatly overestimates the importance of his own feature in the grand scheme of things. Remarks like "an extension that should’ve existed 40-50 years ago" make me wonder whether we should really have bothered all compiler vendors with implementing this 40-50 years ago. After all, you can already (a) directly put your binary data in the source file, as shown after the preprocessor step, and (b) read a file at runtime. I'm not saying this isn't useful, but it's a niche performance improvement rather than a core language feature.



