The problems were not specific to Java; they also exist in most modern programming environments, including FORTRAN and ANSI C. Java just happened to be the hottest language in the news in the late 90s, so the author decided to target Java specifically - "hurts everyone everywhere" was a play on Java's "write once, run anywhere" advertising slogan.
The main author of this report, W. Kahan, was also the original author of IEEE-754. He was (and is) strongly dissatisfied with the state of floating-point in practical systems, and this was not the first time he had complained about these problems. One can find his most recent critique from a few years ago.
My understanding is that Kahan intended IEEE-754 to work as follows:
1. The use of double and extended precision should be encouraged, to safeguard non-experts from floating-point errors. Everything should be at least double precision by default, with extended precision as an additional safeguard. In fact, when IEEE-754 was being drafted, Kahan believed 128-bit floating point should also be supported as computers became more powerful. The computational cost was too high at the time, so he settled on the 80-bit format seen in the 8087. He criticized Java for not supporting extended precision.
2. Floating-point exceptions should be available and turned on everywhere, to safeguard programmers from mistakes and possibly to let programs handle them as special cases at runtime. He criticized Java for not supporting them.
3. Unsafe optimizations should not be performed, such as automatically using FMA or applying algebraic identities during compiler optimization. He criticized Java for allowing this in some cases.
Unfortunately, as far as I can see, these ideals of IEEE-754 have largely disappeared from real-world applications since then, for various practical reasons.
The industry did not move to 128-bit floating point because its performance overhead is too large; the original assessment made during IEEE-754's drafting in the 1980s was too optimistic. Similarly, the industry never accepted 80-bit extended precision as a standard, seeing it instead as an oddity of the Intel 8087. Even Intel has abandoned it: x86_64's SSE and AVX dropped all support for it, making it effectively an exclusive feature of obsolete i386/i686 machines. My impression is that anything above double precision is no longer used in the industry (IBM POWER does have native quad-precision support, to their credit).
A secondary overhead of higher-precision floats is the memory wall, arguably a more serious problem today. Memory bandwidth has become the most serious bottleneck for many numerical programs. Modern computers have a machine balance on the order of 100:1, meaning you need to perform as many as 100 floating-point operations per value loaded from RAM to reach the machine's peak performance. This is incompatible with many algorithms that have inherently low arithmetic intensity, including important physics simulations. The use of FP80 or FP128 would make them unacceptably slow. As a result, today's trend is moving from FP64 to FP32, and even to FP16 or a custom 16-bit format where possible, not the other way around.
The use of floating-point exceptions similarly became unpopular because of performance problems. I'm not an expert on this, but my impression is that signaling a floating-point exception was so expensive at both the hardware and operating-system level that it was never seriously used in most practical programs. So, contrary to IEEE-754, these exceptions never became an integral part of the programming environment. Beyond the cost of an expensive Unix signal, potential exceptions also inhibit efficient pipelining on modern out-of-order, superscalar CPUs.
Finally, the use of unsafe optimizations is prevalent in many applications when rigor is sacrificed for speed.
So overall, the original spirit of IEEE-754 was long gone - for better or worse, unfortunately.
> Everything should be at least double precision by default, and extended precision serves as an additional safeguard
Honestly, I don't understand how this would constitute a "solution" or even a "safeguard". Using any kind of FP arithmetic without being acutely aware of its quirks is going to cause headaches no matter the precision. Conversely, when you do know how to deal with FP, FP32 can suit many scenarios just fine.
I recently tried to implement a 16-bit 64k-point FFT in C using FP32. I thought it would work easily. It turned out to be very difficult: what worked fine in double gave huge inaccuracies in float. It was due to the large/small issue: if you subtract a small FP32 value from a large one, the difference cannot be represented, so it's as if the subtraction never happened. These kinds of errors accumulated and affected the result. I found a workaround for my particular algorithm, but it was eye-opening that FP32 was not enough to just work in this case.
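The absorption effect described here is easy to reproduce. A minimal sketch (using Python's `struct` to round values through IEEE 754 binary32, since plain Python floats are binary64):

```python
import struct

def f32(x):
    """Round a Python float (binary64) through IEEE 754 binary32."""
    return struct.unpack('f', struct.pack('f', x))[0]

large = f32(2.0 ** 25)  # 33554432.0; the binary32 ulp here is 4.0
small = f32(1.0)

# In binary32 the small addend falls below the last representable bit
# of the large value, so it is absorbed entirely:
print(f32(large + small) == large)  # True
# The same sum in binary64 is exact:
print(large + small == large)       # False
```

Compensated (Kahan) summation, or doing the accumulation in double, are the usual workarounds for exactly this failure mode.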
Catastrophic loss of precision is, as the name implies, catastrophic in terms of the calculation context. For scientific/engineering codes, and things requiring a preservation of resolution for proper functioning, FP32 is rarely sufficient. FP64 is usually better. For ML/AI apps, resolution isn't nearly as important.
I don't think the exception point is accurate; I think it didn't take off because C didn't make exception recovery idiomatic (signal handlers are not idiomatic) and C++ followed C's approach, so the major C++ runtime libraries disabled exceptions.
Floating point exceptions worked well enough in Delphi. The most common exception in practice is divide by zero and it's usually a programming error.
> In fact, when IEEE-754 was being drafted, Kahan believed 128-bit floating point should also be supported as computers
Crazy. Gosper and I spent several dinners trying to come up with a use for the range or precision of 128. We figured there had to be some classified application up at LLL because even at quantum & cosmological scales it didn't make sense.
It also seems like a huge overkill for avoiding accidental over/underflow problems by naive programmers.
IMHO the biggest problem that developers run into is base 2's inability to represent decimal digits exactly. If floating-point numbers were stored (or behaved as if stored) in base 10 then the other pitfalls would seem intuitive and predictable.
Here's one well known language: C23! (_Decimal32, _Decimal64, _Decimal128 [1].) Already supported by GCC [2]. Another language with support is .NET C# (System.Decimal) [3].
As for processors, according to Wikipedia [4], "IBM POWER6 and newer POWER processors include DFP in hardware, as does the IBM System z9 (and later zSeries machines). ... Fujitsu also has 64-bit Sparc processors with DFP in hardware."
It was probably hard to predict at the time, but floats larger than 64 bits never really took off. Turns out that hardware vendors didn't want to spend that many transistors, the x87 stack machine was a disaster, and software settled on the lowest common denominator, which was double. It's ironic that Intel was at the forefront of the floating point specification of the time, as x87 royally screwed up floating point for generations. It was like pulling teeth to get the thing to do 64-bit math. Thank god x86 gradually moved to SSE, though it took 25+ years to correct that fiasco.
> Unsafe optimizations should not be done, such as automatically using FMA,
I agree with most of what you said, but automatic use of FMA is not an unsafe optimization. If FMA breaks your code, your code was always broken in the first place.
No. Changing the rounding behavior of floating point code is not generally OK. Numerical algorithms are often extremely carefully written and it is totally unpredictable what adding more precision can do. For example, adding more precision can actually make some algorithms diverge, or take many more iterations to converge, because when designing an algorithm with higher-level knowledge, the rounding behavior can be exploited on purpose.
Automatic FMA is absolutely an unsafe optimization. Don't do it unless software requests it.
FMA can cause some guarantees about vector arithmetic to break (you end up computing different precisions for different vector components), so it's not 100% a safe optimization.
It is unsafe if the code was written by a numerical expert who understood the ISO/IEC 60559:2020 standard. Not many of those, so in all probability your garbage code continues to be garbage (but faster) with FMA.
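To make the disagreement concrete: a fused multiply-add rounds once where the plain expression rounds twice. A small sketch, simulating the fused result with exact rational arithmetic rather than a real hardware FMA:

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -30   # exactly representable in binary64
b = 1.0 - 2.0 ** -30   # a * b == 1 - 2**-60 exactly

# Unfused: the product a*b rounds to 1.0 first, so the subtraction
# loses the interesting low-order term entirely.
unfused = a * b - 1.0

# A fused multiply-add computes round(a*b - 1.0) with one rounding;
# simulate that by doing the whole expression in exact rationals and
# rounding only at the end.
fused = float(Fraction(a) * Fraction(b) - 1)

print(unfused)  # 0.0
print(fused)    # -2**-60, about -8.7e-19
```

Whether this difference is an improvement or a bug depends entirely on whether the surrounding algorithm expected two roundings, which is the crux of the argument above.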
Only partly; Jerome Coonen and Harold Stone are also responsible. The resulting K-C-S draft eventually became IEEE 754 [1] and contained the Gradual Underflow from the very beginning. That said, the eventual decision was collaborative---DEC, a major opponent, even failed to convince its own commissioner.
This is from March 1998, and predates JVM 1.2, which introduced strictfp (which is important because the JVM was otherwise numerically IEEE 754 compliant before strictfp). I believe the core point of this talk has not been "fixed" [1] because it completely missed the problem with IEEE 754 global states. Like, the talk says:
> A flag is a type of global variable raised as a side-effect of exceptional floating-point operations. Also it can be sensed, saved, restored and lowered by a program. When raised it may, in some systems, serve an extra-linguistic diagnostic function by pointing to the first or last operation that raised it.
Any modern programmer knows that side effects, while inevitable, are hard to tame and require some discipline. Pure functional programming, mutable XOR shared, software transactional memory, you name it. This part of the talk completely handwaves the difficulty of side effects and would handcuff every language with those global states and side effects. No good.
Thanks for mentioning Joe Darcy. He continues to work on the JDK at Oracle and occasionally works on numerics issues. Notable for this discussion, he recently updated the Java specifications for JDK 17 to remove non-strict FP semantics (which also entailed removal of the `strictfp` modifier, since the default now is always strict). See JEP 306. [1]
So half the complaint here is that Java doesn't let you use the 80-bit x87 floating-point type. Well, actually it does: if you don't add `strictfp` to your method declarations, Java is allowed to compute intermediate results of float and double computations in higher precision. This was retrofitted because it's actually somewhat difficult to get the x87 floating-point unit to do single- or double-precision (as opposed to extended-precision) arithmetic. But you don't actually want to use the 80-bit type; it's slower in practice, and it has enough extra wonkiness compared to the other IEEE 754 types that it's really not worth it. And on 64-bit x86, everyone just uses SSE for floating point, so you don't need to worry about x87 unless you explicitly opt into its insanity.
As a numerical analyst, Kahan is pretty obsessed with use-as-much-precision-as-you-can. But there's a useful rule of thumb: you need about twice the amount of working precision as your final result. Since double precision has a 53-bit mantissa (~16 decimal digits), that means if you need only 8 or fewer decimal digits, you're completely fine with double precision. And furthermore, the experiences I've had with many programmers suggest that getting bit-equivalent results from different machines is a higher priority than squeezing the best possible numerics out of your hardware. HPC does tend to care about the latter a lot more, but that's also an area where the solution is almost always to just use your system's advanced math libraries (e.g., MKL for dense linear algebra).
Ironically, using your system’s math libraries will probably make replicating floating point results harder. Especially on macOS. Accelerate is a curse.
I disagree with your gloss of Kahan’s philosophy. His approach is more along the lines of “do not waste precision”. But this philosophy is not the complete truth; as close as I can state it briefly, my modification would be “do not waste precision that may be needed later”.
Attempting to get bit-exact reproducible results across different hardware is a fool's errand (if you care in the least about performance).
The nature of the beast is that as soon as you change the order of arithmetic you're going to get a different result. Optimized code is going to give you different results on different hardware because you need to optimize things differently. Threading, memory alignment and/or different versions of the library software are likely to lead to different results even on the same machine, unless the authors of the library go out of their way to promise repeatability.
(If you want to get the same answer, run on a single thread, page align everything you feed in, and never upgrade your system; alternatively write a scalar loop in C, compile with -O0 and pray the compiler doesn't change the order of things on its next upgrade).
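The order-dependence is easy to demonstrate even without threads; reassociating a single pair of additions changes the answer:

```python
a, b, c = 2.0 ** 53, 1.0, -2.0 ** 53

# (a + b) first: 2**53 + 1 is not representable in binary64 (the ulp
# at that magnitude is 2), so the sum rounds back down to 2**53.
left = (a + b) + c

# (b + c) first: 1 - 2**53 IS exactly representable, so nothing is lost.
right = a + (b + c)

print(left)   # 0.0
print(right)  # 1.0
```

Any optimization or scheduling change that reassociates a reduction is doing this, just at scale.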
> Attempting to get bit-exact reproducible results across different hardware is a fool's errand (if you care in the least about performance).
We did it for Wasm, which follows IEEE-754 semantics exactly for 32-bit and 64-bit floats. (The only nondeterminism is the exact bit pattern you get for NaNs in some circumstances.) Rounding is 100% well-specified. And CPUs have done that for decades. Even vector ISAs have learned that non-IEEE results are not what software wants; all vector ISAs are converging on IEEE-754.
> Optimized code is going to give you different results on different hardware due to the fact that you need to optimize things differently.
This is due to C/C++ (and to some extent Fortran) semantics. It is not hardware.
What do threads have to do with floating point precision?
Oh, it's entirely possible to get bit-reproducible results. Just not in a performance portable fashion.
Different microarchitectures (e.g. how many vector instructions of what size need to be in flight for full occupancy), different numbers of cores (see threading discussion below) and often even differently aligned memory (does it need repacked or not for best performance?) will all require different order of operations to obtain maximum throughput, which means different (but equally valid) results.
For threading in particular, if you want to get the same bit-exact answer, you end up constraining yourself to a particular ordering on reduction operations. This in turn either outright prevents techniques such as work-stealing or forces a very prescriptive reduction tree that itself constrains parallelism.
This is entirely driven by hardware and its impacts on performance of algorithms, and applies regardless of the language you're writing in if you want to obtain the best possible performance from a given chip.
I’m not the parent but I imagine they’re referring to e.g., some FFTs use different partitioning strategies in different threading environments, which breaks bit-perfect replication.
There’s also the weirdness that in C++ the floating point environment is thread-local, which can cause all sorts of chaos.
The only reason you would want bit-reproducibility is because you haven't done the numerical analysis and have no clue how many digits of your "answer" to trust.
As far as I know, two sectors claim they need it: finance and climate.
"Do you want a better answer?"
"No, I want the same wrong answer that I got last Tuesday."
> The only reason you would want bit-reproducibility is because you haven't done the numerical analysis and have no clue how many digits of your "answer" to trust.
I can confidently say that this is not the only good reason. Other reasons include:
- You want to compare different runs by hashing outputs (e.g. to find the first computation step where they diverged). Very useful for debugging, and also useful to determine whether you accurately reproduced a result (e.g. a customer problem).
- If your program has a single floating point comparison, there is no such thing as "enough significant digits": with reasonable assumptions about the distribution of unreproducibility, your logic is now divergent (and your output will jump between different values) with a certain probability. At that point we're no longer talking numerical analysis; it's straight up "divergent results".
There's also "cover your ass". At least I've heard tales of major aerospace companies keeping warehouses of old Sun hardware in case they need to demonstrate that the simulations they ran back in the 90s were not fabricated...
Did you mean -fbroken-and-not-necessarily-fast-math? [1]
[1] But really, if -ffast-math does turn -funsafe-math-optimizations on, it should have been named similarly. There is a possibility of much safer -ffast-math with almost zero breakage (by assuming a subset of IEEE 754, like the fixed rounding mode). The current -ffast-math is so reckless [2].
I find that -ffast-math is not so bad, so long as I develop and test with it from the beginning. It's much like any of the other more aggressive optimizations in that sense.
Plus, comparing against strict math as I go tends to highlight where I might have been about to do something dodgy anyway.
I was unaware that Java finally got rid of strictfp (though happy that it finally did). It was added in Java 1.2, perhaps in response to this paper, though I don't know the entire timeline accurately.
Where did I imply that? The fault was with the JVM.
“The impetus for changing the default floating-point semantics of the platform in the late 1990's stemmed from a bad interaction between the original Java language and JVM semantics and some unfortunate peculiarities of the x87 floating-point co-processor instruction set of the popular x86 architecture.”
> But there's a useful rule of thumb: you need about twice the amount of working precision as your final result. Since double precision has a 53-bit mantissa (~16 decimal digits), that means if you need only 8 or fewer decimal digits, you're completely fine with double precision.
This doesn't seem right, or at least it's not very general. The more operations you do, the more rounding errors you have, and each operation has the potential to magnify earlier errors. In solving ill-conditioned problems (which are not uncommon) the errors can easily be magnified so much that they're bigger than your signal, even with relatively small and simple situations.
> precision has a 53-bit mantissa (~16 decimal digits), that means if you need only 8 or fewer decimal digits, you're completely fine with double precision
Have you ever played Kerbal Space Program? Have you met the Kraken?
The game simulates a spacecraft you build piece by piece. So you are flying around in space, a thruster fires, and maybe some part of the ship is poorly attached and wobbling. Everything is fine, but then you clip the upper atmosphere of Mars and everything goes to shit - even though it should pose no threat, the spacecraft spontaneously shakes itself to pieces.
That game is plagued by massive problems with floating point errors - they kill your crew, they ruin your missions, your speed of rotation becomes a NaN.
OpenSpace uses 32-bit floating point numbers and can scale from the Planck length to the whole diameter of the observable universe with practically no loss of precision. The problem with this approach is that KSP would need to rewrite the entire rendering engine and physics engine from scratch to support that kind of precision.
Simulating human-scale physics in a solar-system-scale world needs far more than 8 decimal digits. Neptune's orbit is 12 orders of magnitude larger than your spacecraft, no wonder your floats are acting up. You're going to need at least 50-bit precision to make your simulations accurate to the millimeter.
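A quick back-of-the-envelope check of that bit count (the orbital radius here is a rough assumed figure):

```python
import math

neptune_orbit_radius_m = 4.5e12   # ~30 AU, rough figure
target_resolution_m = 1e-3        # one millimetre

# Bits of significand needed to resolve a millimetre at that range:
bits_needed = math.log2(neptune_orbit_radius_m / target_resolution_m)
print(round(bits_needed))  # ~52 bits: right at the edge of binary64's
                           # 53-bit significand
```

This is why such games typically use floating origins or split coordinates rather than a single global float frame.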
If anything, the fact that KSP is pretty much the only game with serious float issues shows that doubles are just fine for most applications.
Wow, that document is difficult to read. I was expecting a summary of the main points at the top, followed by a set of sections - each with a cogent introduction and a progressive series of arguments to prove the point - followed by a concise conclusion.
Instead, this is a rambling, mostly unstructured document where it's nearly impossible to follow the many scattered threads of thought, or even catch which side of some of the arguments he's actually on.
Why doesn't Java (or other languages) support a fixed length decimal by default? Granted Java has BigDecimal, but why isn't there a basic type for this?
For business and accounting systems, this seems like an obvious choice.
> To win, Java has to surpass Microsoft's J++ in attractiveness to software developers. This means better design, better thought through, less prone to error, easier to debug, ... and many other things.
I'm always baffled by how many software engineers with CS degrees don't understand IEEE 754 floats. I had a coworker who was bugging the shit out of me claiming that Macs had a "bug" in several meetings because he didn't understand floats. Recently I heard someone else make the same claim about Ruby. WTF are they teaching people in CS programs?
They're teaching them that floats and doubles are numbers with decimal places, and that's where most stop. There's a reason articles like "Myths programmers believe about floating point" are written and usually get highly upvoted.
Today BigDecimal (arbitrary precision) should be the easy to use default and float the annoying to invoke exception. But it’s understandable why the designers wouldn’t have done that in the 90s.
Most of the time it doesn’t matter. Double (or float) by default is premature optimization. When you start dealing with millions or billions of values, sure reach for the annoying, counterintuitive, footgun laden numeric types. But that’s not most code.
Well, (tens of) thousands of doubles vs. BigDecimals (think of marketing data), plus any operations over them, is a massive difference.
The charge of premature optimization is beyond uncalled for. Learning to use floating point types is something most developers should do. No need for weasel words, either ("annoying", "counterintuitive").
I can't think of any widely-used language that does it right nowadays (except Rust, if you change some flags, and not by default).
For sure, 20 years ago there were some dying languages that behaved differently, and yes, the Java position there did hurt everybody, but it's the same one everybody else took.
IEEE 754 defines decimal floats in addition to binary, but you’ve never used them because no x86 processor has provided hardware implementations of them.
For general purpose computing (approximating real quantities) floats with base-2 exponents are just right on base-2 computers. For dealing with money (and other quantities that are fundamentally integer), they might be the wrong choice, but that's well understood and sort of obvious (use integers to handle integers).
There are many non-professional programmers who get baffled by various "computerisms" such as
0.1 + 0.2 != 0.3
Excel tries to hide this but ends up doing even stranger things if you push it hard enough. I think a lot of people who could use computers to put their skills on wheels just give up because of this "lack of empathy" that manifests here and in other places. If you do
0.1 + 0.2 - 0.3
on a pocket calculator you get the right answer and you should get the same right answer in a Jupyter notebook. The only person who should be exposed to the base 2 arithmetic of the computer is a professional programmer who knows assembly language.
There are numerous social consequences of this that are harmful such as the perception that computer programmers are "grinds" and "nerds" and the idea that "idea people" are more worthy than the people that execute, etc.
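For what it's worth, the pocket-calculator behaviour asked for above is exactly what a decimal type provides. A sketch with Python's stdlib `decimal` module:

```python
from decimal import Decimal

# Binary floating point: the decimal literals are rounded to binary
# before the arithmetic ever happens.
print(0.1 + 0.2)        # 0.30000000000000004
print(0.1 + 0.2 - 0.3)  # 5.551115123125783e-17

# Decimal floating point: the same sum behaves like the calculator.
print(Decimal('0.1') + Decimal('0.2') - Decimal('0.3'))  # 0.0
```

The calculator gets this "right" for the same reason `Decimal` does: both work in base 10, so the literals are represented exactly.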
If somebody does not understand the implications of a computation such as a floating-point 0.1 + 0.2, they should not be using floating-point. It's not the job of the language to protect them from that. Decimal floating-point will be prone to very similar malfeasances, if that is what you are proposing; somebody unaware of the issues of binary floating point will not be helped by decimal floating point.
I do think it would be a good idea for programming languages to expose more exact number representations (for example, rational numbers, which are supported by some programming languages). But numbers are complicated, and you will never be able to imbue a computer with a number type that always behaves exactly the way that someone who knows nothing of computers would naively expect. Even Mathematica does not quite manage it, and Mathematica takes some somewhat extreme measures which would not be considered acceptable in many other general-purpose languages.
> If somebody does not understand the implications of a computation such as a floating-point 0.1 + 0.2, they should not be using floating-point. It's not the job of the language to protect them from that.
Then what is the job of the language? Programming languages should nudge their users in the right direction; they should be safe and maintainable by default. Sometimes sharp-edged parts are necessary (e.g. for performance) and they should be made available to those who need them, but they shouldn't be front and center. E.g. floating point literals should be a high/arbitrary precision decimal type by default, similar to what Python does with integer literals.
> Decimal floating-point will be prone to very similar malfeasances, if that is what you are proposing; somebody unaware of the issues of binary floating point will not be helped by decimal floating point.
Disagree. Decimal calculations getting rounded off is a problem that is a lot easier to see and understand.
> Decimal calculations getting rounded off is a problem that is a lot easier to see and understand.
If you carefully, tediously work by hand some calculations such as addition, multiplication, or division (but not, say, square roots, exponentials, or sines, since there are no generally known methods for computing those by hand), you may be able to replicate some specific results of the computer without needing to learn anything new. I don't really see what the point of doing that would be. Oh, and you must use banker's rounding, which is also not generally done. Otherwise, the results will appear to manifest as just some inaccuracy incurred after every operation, just the same as with binary floating point. The full complexity of error analysis remains.
(Note also in particular that addition, multiplication, and division are closed over the rationals, which can be readily represented exactly by computer—I did mention I think rationals are a good thing.)
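The claim that decimal merely relocates the problem is easy to demonstrate: any value without a terminating decimal expansion still gets rounded, just in base 10.

```python
from decimal import Decimal

third = Decimal(1) / Decimal(3)   # rounds at 28 digits by default
print(third)            # 0.3333333333333333333333333333
print(third * 3 == 1)   # False: the same class of error, in base 10
```

A `Fraction`, by contrast, represents 1/3 exactly, which is the point about closure over the rationals.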
> If you carefully, tediously work by hand some calculations such as addition, multiplication, or division (but not, say, square roots, exponentials, or sines, since there are no generally known methods for computing those by hand), you may be able to replicate some specific results of the computer without needing to learn anything new.
Adding 0.1 and 0.2 to get 0.3 is hardly advanced mathematics.
> Otherwise, the results will appear to manifest as just some inaccuracy incurred after every operation, just the same as with binary floating point.
Perhaps, but it will be a comprehensible inaccuracy. Running out of decimal places is a well-known and understood phenomenon, and something already experienced with regular calculators.
True, but it does have the BigDecimal class, which is what you should use for money - unless you are doing derivatives pricing, where the person writing the code must understand floating point.
You are entirely right that user-facing programs should strive to hide IEEE 754 artifacts (see, for example, the Android calculator app that uses exact real arithmetic).
But if you are programming, there are tons of quote-unquote computerisms that we have for a good reason but are not really intuitive for newcomers. The whole concept of variables, pointers and general indirections, zero-based indexing, (pseudo)randomness, Unicode, time complexity, concurrency and parallelism and so on. Many (but not all, I admit) professional programmers take them as granted but they are just as arcane as the concept of base-2 floating points.
I just think of floats as fuzzy analog values. The only precise operation you can do is copying. Everything else, you assume it's noisy.
It's not real analog noise, but reasoning about rounding is hard if you're not good at math, it's something you can explain easily, and it prevents you from doing clever stuff that will confuse the next person.
It's the opposite of being an idea person. I just accept them as a practical approximation, not a tool for doing real math, and move on. If you come from electronics it makes perfect sense. Voltages are pretty much always noisy even if it's picovolts.
If I need precision I use ints or dedicated libraries.
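That ints-for-exactness approach, sketched for money (the line items and the 8% rate are made-up figures):

```python
# Keep money in integer cents; every operation below is exact.
subtotal_cents = 1999 + 2001            # two line items
tax_cents = subtotal_cents * 8 // 100   # 8% tax, truncated to a cent
total_cents = subtotal_cents + tax_cents

print(total_cents)  # 4320, i.e. $43.20
# Only format as a decimal string at the display boundary:
print(f"${total_cents // 100}.{total_cents % 100:02d}")  # $43.20
```

The rounding rule (truncation here) becomes an explicit, auditable choice instead of an artifact of the number format.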
That isn't a computerism. It is a feature of floating point arithmetic. Computers are perfectly able to calculate perfectly.
>should get the same right answer in a Jupyter notebook. The only person who should be exposed to the base 2
"Why does my notebook take hours to do a simple task"
>There are numerous social consequences of this that are harmful such as the perception that computer programmers are "grinds" and "nerds" and the idea that "idea people" are more worthy than the people that execute, etc.
The benefits of floating point arithmetic easily outweigh people having to learn it.
"Why does my notebook take hours to do a simple task"
“Because the chip it is running on doesn’t have native base-10 math”?
See also “why my notebook [is fast but] yields incorrect results”.
> The benefits of floating point arithmetic easily outweigh people having to learn it.
GP talks about saving non-low-level programmers from base 2 FP, not about removing it. CPUs could use an additional block (or mode) of base 10 exponent FP.
This and other geeky issues make programmers programmers instead of making everyone a programmer. The consequence of this is much heavier than any benefits of base 2 FP.
>“Because the chip it is running on doesn’t have native base-10 math”?
Base 10 fixes exactly zero of the problems with floating point.
Decimal floating point exists as a standard already. It is even part of the upcoming C standard.
But again, decimal arithmetic is just as weird as binary floating point arithmetic. The change of base is irrelevant, except for a few niche applications.
>The consequence of this is much heavier than any benefits of base 2 FP.
Totally false. Decimals don't fix floats; they are just as weird. Changing the base is irrelevant to the inherent properties of floats, and using base 10 instead of base 2 most likely gets you something even worse.
If you do not understand the basics of floating point arithmetic you should not be programming software. Tough world out there for people who refuse to learn, I know.
Your views are too extreme and elitist, exactly as mentioned before. I don’t think that this position may be considered an argument, as for this thread’s subject it is a lost cause.
Floating point numbers are the most useful approximation to reals we have on a computer.
That isn't an "elitist" view. I don't get what you are on about, seems ridicolous. People need to learn to use the systems they are working with any highschooler can understand what floating point numbers are and why they have flaws. It is really simple.
The problem with binary floating point is that it is input and output as decimal floating point.
Every floating point number is really a fraction of the form M / b^E.
When you write
x = 0.1
you are really asking for 1/10. If b=2, however, you can only get denominators like 2, 4, 8, and so on. 1/10 just doesn't exist in that number system, but there is a number A that you get when you ask for 0.1 that round-trips back to 0.1, and the same is true for 0.2 (B) and 0.3 (C). The trouble is those substitute numbers aren't the real numbers, and
A != B + C
This has nothing to do with numeric precision, it's always going to be off a bit even if you are using 1024-bit floats or 1048576-bit floats.
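A quick check in Python makes those substitute numbers concrete (`Fraction` recovers the exact dyadic rational a float actually stores):

```python
from fractions import Fraction

# 0.1 and 0.2 each round to the nearest dyadic rational M / 2**E;
# summing those substitutes and rounding again lands one ulp away
# from the double nearest to 0.3.
print(0.1 + 0.2 == 0.3)              # False
print(Fraction(0.1))                 # 3602879701896397/36028797018963968

# More precision cannot help: no M / 2**E equals 1/10 exactly.
print(Fraction(1, 10) == Fraction(0.1))   # False
```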
The problem punches above its weight because it is targeted right at two fault lines of the mind: (1) people flinch at inconsistencies; 0.1 + 0.2 = 0.3 is an identity, and when basic identities are wrong people feel uncomfortable and don't want to proceed. If you are working with an accountant, for instance, and they see something that is not consistent in a way they've been trained has to be consistent, they will just stop until it is consistent. (2) A certain kind of laziness leads people to not get to the root of a problem like this and instead waste a huge amount of time and energy on non-solutions (rounding!) that are just like pushing a bubble around under a rug.
Note that decimal floating point does not require that you use BCD. That is, the mantissa and the exponent of a floating point number are just integers, and unlike the floats, base 7 and base 192 integers are the same numbers.
An obvious idea is to use base 2 for the mantissa and exponent, just have the exponent be base 10. One difficulty is the cost of sliding two numbers so they have the same exponent before you add them, for instance to add
43
723
---
766
you have to express the 723 and 43 with the same exponent before adding them (multiplying or dividing by ten), and multiplication by powers of ten is a lot harder with base 2 math than with base 10 math. You can also represent decimal floating point numbers with a decimal mantissa and face the tradeoff of BCD numbers being wasteful of bits but the factors of 10 being easy to deal with. There is an efficiency gap between hardware binary floating point and hardware decimal floating point, but it's not as bad as you might think.
> people flinch at inconsistencies, 0.1 + 0.2 = 0.3 is an identity and when basic identities are wrong people feel uncomfortable and don't want to proceed; if you are working with an accountant, for instance, and they see something that is not consistent in a way they've been trained has to be consistent, they will just stop until it is consistent.
This is an unsolvable problem. If your floats are base-10, "0.1 + 0.2 = 0.3" might work out, but it isn't going to fix "(1/3)*3 = 1". And it gets even worse if you do anything involving π.
It is mathematically impossible for a computer to handle all reals correctly, so something has to give. Binary floats are a reasonable approximation in practice, and anyone wanting something else is free to use arbitrary-precision libraries.
The guy you replied to was talking about base 10 floats. As you can very easily see his example has to work if the arithmetic has the "best possible" property and the rounding mode is "to nearest".
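Python's `decimal` module (a correctly rounding base-10 floating point implementation) demonstrates both halves of this exchange: the decimal identity holds under round-to-nearest, while the 1/3 identity fails instead:

```python
from decimal import Decimal

# Base-10 floating point: 0.1, 0.2 and 0.3 are exactly representable,
# so with round-to-nearest this particular identity must hold...
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True

# ...but it is still floating point: 1/3 rounds in base 10, and the
# corresponding identity fails there instead.
third = Decimal(1) / Decimal(3)
print(third * 3 == Decimal(1))                             # False
```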
Python is unusable for numerics without numpy, which makes it a tiny bit less unusable.
The problems with floats are independent of the basis.
>1/10 just doesn't exist in that number system
And 1/3 doesn't exist for b=10.
I really have no idea what you are on about. The number system you want does not exist; that is a mathematical theorem. Floats are the best approximation of real numbers we have, and choosing b=10 is dumb outside of specific applications.
Decimal floating point fixes nothing; it just moves the issues around so that they affect base-10 numbers less. It is exactly as broken as base 2. You still violate basic identities in b=10, just different ones.
Again, the number system you want doesn't exist and it can't: you can never approximate the real numbers with a fixed number of bits, division, and arithmetic consistency all at once.
The trouble is that we are using base 10 literals together with base 2 numbers. If you had base 35 literals and base 35 numbers or whatever it would be OK. All I'm asking for is literals that match the numbers I am using. If my float literals were like
431*2^-9
where 2^-9 is 1/512 there would be no semantic gap here.
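Something close to this already exists: hex float literals (C99's `%a`, Python's `float.hex`/`float.fromhex`), whose digits line up exactly with the stored base-2 value, so there is no semantic gap and no need for shortest-decimal algorithms:

```python
# 431 * 2**-9 written as a hex float literal: 431 = 0x1AF, so the
# value is 0x1.AF scaled by 2**-1.
x = float.fromhex("0x1.afp-1")
print(x == 431 / 512)               # True
print(x)                            # 0.841796875

# Round-tripping any float through .hex() is lossless by construction:
print((0.1).hex())                  # 0x1.999999999999ap-4
print(float.fromhex((0.1).hex()) == 0.1)   # True
```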
The fact that you find it so hard to get what I am talking about isn't reflective of your intelligence, it is that this is something that strikes people where they are absolutely weakest, where the gap between the map and the territory can leave you absolutely lost. That's what is so bad about it.
Of course that is why computer professionals suffer with the "grind" and "nerd" slurs, because we tolerate things like the language that puts the C in Cthulhu.
Personally I think Cantor's phony numbers suck and it is an insult that we call them "real" numbers. It's practically forgotten that Turing discovered the theory of computation not by investigating computation but by confronting the problem that there are two kinds of "real" numbers: the ones that are really real because they have names (e.g. 4, sqrt(2), π) and the vastly larger set of numbers that can never be picked out individually (e.g. any description of how to compute a number has to be finite in length but the phony numbers are uncountable.)
I wish Steve Wolfram would grow some balls and reject the axiom of choice.
I am starting to come around to your argument. Having the internal representation in different base than the written representation does produce problems (for example, printing a base-2 float in its shortest base-10 representation is not a trivial problem, with solutions only appearing in the late 90's, see Dragon4 and Grisu3 [1]).
Like with random number generators - computers are so powerful now that it makes sense to make the default PRNG cryptographically secure, to avoid misuse, and leave it up to the experts to swap in a fast (not cryptographically secure) PRNG in the rare cases where the performance is needed (and unpredictability isn't), for example Monte Carlo simulations.
One could argue that for many use cases (most spreadsheets handling money amounts, for example) computers are powerful enough now that "non-integer numbers" should default to some base-10 floating point, so that internal and user representation coincide. Experts that handle applications with "real" numbers can then explicitly switch to "weird float". It is worth a thought.
Decimal floats are exactly as weird as binary floats.
The only difference is that different numbers are representable. You still have:
- They do not obey the mathematical laws of real numbers
- Equality comparison is meaningless
- Multiple operations lead to unboundedly large errors
>Experts that handle applications with "real" numbers can then explicitly switch to "weird float". It is worth a thought.
The weirdness does not go away in decimal floats. It is exactly as weird.
>One could argue that for many use cases (most spreadsheets handling money amounts, for example) computers are powerful enough now that "non-integer numbers" should default to some base-10 floating point
Base 10 floats do not fix money amounts. (1/3)*3 is not equal to 1 in base 10 floats. You can not correctly handle money with floats, as money is not arbitrarily divisible. Changing to base 10 does not fix that.
The core problem with money is that the division operation is not well defined, as for example $3.333... is not an amount of money that can exist. Even the mathematically correct operations are wrong; you can not fix that with imperfect approximations.
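The standard workaround is to keep money as an integer count of the smallest unit and make every rounding decision explicit. A minimal sketch (the `split_evenly` helper is hypothetical, not a standard API):

```python
def split_evenly(total_cents: int, n: int) -> list[int]:
    """Split an integer-cent amount into n shares that sum exactly
    to the original; the leftover cents go to the first shares."""
    base, rem = divmod(total_cents, n)
    return [base + 1 if i < rem else base for i in range(n)]

# $10.00 three ways: no $3.333... ever appears, and no cent is lost.
shares = split_evenly(1000, 3)
print(shares)        # [334, 333, 333]
print(sum(shares))   # 1000
```

The point is that "divide" is replaced by a business rule (who absorbs the remainder), which is exactly the decision floats of any base silently make for you.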
>The trouble is that we are using base 10 literals together with base 2 numbers.
Now that is a really dumb ask. Surely designing computers around the number of fingers we have, for no reason, is insane. Again, b=10 DOES NOT FIX FLOATS. It has the exact same problems.
>Personally I think Cantor's phony numbers suck
Personally I think they are the greatest description of the continuum.
>I wish Steve Wolfram would grow some balls and reject the axiom of choice.
Yes. Really looking forward to such hits as "you can't split the continuum" and "points do not exist", contradicting 100% of human intuition.
I don't really want base 2 literals and base 2 numbers, I want base 10.
The point is that when the literals don't match your numbers you get particularly strange problems that I think scare people away from computers. We just lose them.
The other problems with floating point math create much less cognitive dissonance than that does.
If you do not understand base 2 and why computers use it I am glad that you are unable to effectively program a computer.
I actually hope that "we" lose people who do not put in that tiny amount of effort to learn something so simple. They have absolutely no business developing software.
If you are unwilling to learn such absolutely basic concepts as what base 2 is, you really should be excluded.
Floats are probably the best approximation of irrational numbers near zero. Rational numbers don’t need to be approximated, unless you’re doing something like tight SIMD loops where the hardware itself limits precision, or letting an end user dictate how much precision he wants to see at the moment.
There are corners we don’t need to cut anymore, because a half century of Moore’s Law has already paid for better tools if only we would claim them.
oh, i thought you meant base 2 as opposed to base 16 (like the 360)
there is a significant performance penalty for bcd arithmetic, so bcd floating point has never been attractive to the customers for floating point: cray buyers, fortran programmers, gamers, analog circuit designers, climatologists
those people don't really care about the beginner issues you mention
but a lot of us are using cpython in jupyter to prototype algorithms we want to run as fast as possible, so we want it to behave like the floating-point hardware does
Floats are such an embarrassment to have to explain to new developers. Then you have to explain that there is no intention to fix it. And then there are all these developers who make the field look like a cult by endlessly chanting how it works the way it should work. Who seem not to know the difference between correct and incorrect, the difference between right and wrong.
Floats cannot be "fixed" because they are not broken. Floats are an extremely limited subset of the general rationals, and the general rationals are an extremely limited subset of the general reals.
It's theoretically impossible to compute with the general reals. Full stop.
It's theoretically possible to compute with the general rationals (Common Lisp for example lets you do so) but it's often impractical because you can easily end up running out of memory and/or time for the computation to complete.
It's practical to compute with floats because you can guarantee hard limits on memory usage and time. You just have to live with the fact that you can't represent the vast majority of real numbers and you lose associative arithmetic.
There are various other, weirder, computational number systems out there but the tradeoffs don't go away; they just move around.
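The rational-arithmetic tradeoff mentioned above is easy to see with Python's `fractions` module: every step is exact, but iterating even a simple map makes the representation grow without bound:

```python
from fractions import Fraction

# Logistic map x -> r*x*(1-x) in exact rational arithmetic.  Each
# step is exact, but the denominator roughly squares per iteration.
x = Fraction(1, 3)
r = Fraction(7, 2)
for _ in range(6):
    x = r * x * (1 - x)
print(len(str(x.denominator)))   # tens of digits after only 6 steps
```

A float version of the same loop stays 8 bytes forever; the price is that each step rounds.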
> It's theoretically impossible to compute with the general reals. Full stop.
You could limit yourself to the computable numbers. Although you do have an issue determining equality on general computable numbers...
But yeah, I would agree that floating-point is generally the best compromise out there for a general-purpose computational approximation to real numbers.
There are infinitely many real numbers (uncountably even), but your computer can only handle finitely many things. Something has to give.
While many people complain about it, IEEE 754 is a very thoughtful and thorough standard, and the reason it is still around is that nothing unambiguously better has been proposed yet.
Big-number libraries break down if you want to do anything other than be a 4-function calculator (and honestly, even division can be questionable, since some big-number libraries use fixed base-10 and thus even ⅓ isn't accurately representable). Want to throw in a call to sin or exp? Your only realistic option is some sort of floating-point, and that quickly boils down to "do we use IEEE 754 single or double precision? Or are we in the rare case when we need quad [1]?"
[1] I believe the biggest use of quad precision is evaluating the accuracy of double precision arithmetic.
If you are doing signal processing, ML, simulations, gaming or the like, you can know going in that you will need to use primitive types from a performance standpoint which sometimes means fixed point but often means floating point.
Occasionally yes. But the niche where you need something other than decimal bignums is narrow, and the niche where you need that but don't need to get into the weeds of custom float formats etc. is narrower still. IEEE 754 shouldn't be front-and-center in general-purpose programming languages.
I've never even heard of people using custom float formats outside of an fpga. And bignum seems useful for things that are not cpu-bound. If you need simd, or vectorization, including gpu acceleration, you use the format the hardware and compiler/optimizer support to meet your throughput needs. And on x86_64, aarch64 or spirv, that's IEEE floating point for better or worse.
> I've never even heard of people using custom float formats outside of an fpga.
It's normal and expected for ML these days (not completely custom, but smaller than the IEEE floats). And I think for gaming graphics too.
> If you need simd, or vectorization, including gpu acceleration, you use the format the hardware and compiler/optimizer support to meet your throughput needs. And on x86_64, aarch64 or spirv, that's IEEE floating point for better or worse.
If you need to do it on the CPU, sure. (Even then, you probably don't want to use full IEEE754, you will likely disable subnormals for performance). If you're using the GPU then you'll likely be using something else. The niche that IEEE754 fits is pretty narrow - it works for non-HPC physics simulations because it was designed for that, but that's about it.
Many simulations and games are better off using fixed point. The range available with a 64-bit integer is staggering, and precision varying with distance from an arbitrary origin is a bug, not a feature, for those cases.
The range available with 54 bit fixed point is also staggering.
And almost any code that would work with 54 bit fixed point works even better with 64 bit floating point, and the floating point version is much easier to code.
So sure, floating point isn't optional when you start off assuming the same number of bits. But if you treat it as a small overhead to make code more robust to large numbers, easier to code, and often faster to run, then it looks a lot better.
Yes, the respective sins of both floating- and fixed-point math are significantly reduced when you have more bits (for a demo of this try using both at 16 bits; you will have to code very carefully and be very aware of the faults of each).
For many situations though, I find the graceful degradation of floats to cause more subtle bugs, which can be a problem.
Range warnings are going to be calculation specific. Adding a non-zero fixed point number "y" N times to a fixed-point number "x" will result in either an overflow or x+N*y.
If integer overflows trap (curse you Intel, C89), then repeated additions (important in many simulations) will either work as expected or crash.
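The trade is easy to demonstrate: a repeated fixed-point addition is either exact or overflows loudly (if overflow traps), while the float version just silently drifts:

```python
# Summing 0.1 ten thousand times: the integer-cent total is exact,
# the float total accumulates rounding error.
f = 0.0
cents = 0
for _ in range(10_000):
    f += 0.1
    cents += 10                  # 0.1 dollars as an exact count of cents
print(cents == 100_000)          # True: exact (or it would have trapped)
print(f == 1000.0)               # False
print(f - 1000.0)                # small but nonzero accumulated drift
```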
Floating point operations are (for practical purposes) highly privileged because of extremely mature hardware implementations. Hardware implementations that make other forms of calculations more tractable are possible (and have existed in the past) and should be considered when evaluating FPs fitness for purpose, otherwise we will be stuck in the IEEE local maximum forever.
Hilarious. Fixed point is awful; it was thrown out the moment people could use floats on their computers. Fixed point numbers cause an enormous programmer overhead and do not fix the problems.
> Fixed point numbers cause an enormous programmer overhead and do not fix the problems.
How so? Is that because of inherent problems with the outcomes of fixed-point arithmetic, or are they just clumsy to use and no-one's written a decent library that makes dealing with fixed-point numbers straightforward?
Because you need to be extremely careful about overflows/underflows. All operations you perform suddenly become difficult and require careful analysis. With fixed point numbers you need to ensure that every intermediate result of your operation gives an in-range result.
You can not remove that tedium with a library, since every potential application has different requirements, and now you need to start out your program by defining those requirements for your library. All operations you perform need to be analysed based on your initial requirements.
Also, fixed point arithmetic does not fix floats. You have an uncountable number of real numbers and you try to fit them into 32 or so bits. It seems simple to understand that any such projection, whatever you try, has enormous drawbacks.
That floats do not behave like real numbers is the consequence of their design requirements. Fixed point just means that instead of being able to pretend that floating point operations are sometimes-inaccurate real number operations, you have to deal with constant, domain-specific renormalization and have to be extremely careful about choosing scaling.
> you need to be extremely careful about overflows/underflows. [...] you need to ensure that every intermediate result of your operation gives an in range result.
How is that any different from ordinary integer operations/arithmetic with ints/longs/etc...?
> Also, fixed point arithmetic does not fix floats.
> How is that any different from ordinary integer operations/arithmetic with ints/longs/etc...?
One of the key advantages of floating-point is that it is scale-independent--it doesn't matter if you're doing your calculations in meters, feet, miles, kilometers, parsecs, AUs, nanometers, Planck lengths--you'll get the same relative accuracy. If you're using fixed point (or integers), you instead have to take care to make sure that you scale things such that your units are not too large or too small.
Floating-point is essentially binary scientific notation, and it should be no surprise that it's a good format if you're in a field that already uses scientific notation all the time.
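That scale independence is measurable: `math.ulp` gives the spacing between adjacent doubles at a given magnitude, and the spacing relative to the value is flat across hundreds of orders of magnitude:

```python
import math

# The relative spacing of doubles (one ulp divided by the value) is
# essentially constant, whatever units you happen to be working in:
for x in (1e-9, 1.0, 1e9, 1e300):
    print(math.ulp(x) / x)       # all between 2**-53 and 2**-52
```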
> it doesn't matter if you're doing your calculations in meters, feet, miles, kilometers, parsecs, AUs, nanometers, Planck lengths--you'll get the same relative accuracy.
Yes, but if I'm working with lengths where I know micrometers are good enough, but I also want micrometer (or, at least, better-than-millimeter) precision, but I want to actually deal in meters and I know I'm not going to be dealing with lengths longer than 100km, then fixed point would be ideal for that.
Similarly, if I want to deal in dollars, but need cent precision (or tenth-of-a-cent precision), and need it to be precise, then fixed point would be ideal for that too.
I think the disconnect might be that I wasn't considering fixed-point arithmetic as a general replacement for floating-point arithmetic, I'm considering it as a complement to floating-point, to be used in the cases where it makes sense to do so.
The impression I got from the commenter I was originally replying to was "fixed point is absolutely terrible and there is never a reason to prefer it over floats". If that wasn't an intended implication, I guess I'm kind of arguing into the void.
>I know I'm not going to be dealing with lengths longer than 100km, then fixed point would be ideal for that.
Huge fallacy. No, you can not use fixed point numbers like that, it can not work. It is irrelevant what the actual maximum/minimum scale you are dealing with is. The thing which matters is the largest/smallest intermediate value you need. You need to consider every step in all your algorithms.
Imagine calculating the distance between two objects 50km apart in x and 50km apart in y. Even though the input and output values fit within your range, the result is nonsensical if you use naive fixed point arithmetic. Floating point allows you to write down the mathematical formula exactly; with fixed point arithmetic you can not do that.
Looking at the maximum and minimum resolution you need is a huge fallacy when working with fixed point arithmetic and one big reason why everyone avoids using it. You need to carefully analyze your entire algorithm to use it.
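A sketch of that 50 km example, simulating 32:32 fixed point with Python integers (which do not overflow, so we check the 64-bit limit explicitly):

```python
import math

SCALE = 2 ** 32                  # 32:32 fixed point
I64_MAX = 2 ** 63 - 1

def fx(meters: float) -> int:
    """Encode a length in meters as a 32:32 fixed-point integer."""
    return round(meters * SCALE)

x = fx(50_000.0)                 # 50 km: comfortably inside the range
# Naive distance sqrt(x*x + y*y): the intermediate x*x alone needs
# ~95 bits, so 64-bit fixed-point hardware would silently wrap.
print(x * x > I64_MAX)           # True

# The float version just works; the exponent rescales automatically.
print(math.hypot(50_000.0, 50_000.0))   # ~70710.678
```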
>The impression I got from the commenter I was originally replying to was "fixed point is absolutely terrible and there is never a reason to prefer it over floats".
My position is that sometimes there might be a situation where fixed point arithmetic could be useful. If you are willing to put in a significant amount of time and effort into analyzing your system and dataflow, it can be implemented successfully.
It is a far more complex and error-prone system, and if you aren't careful it will bite you. In all cases floating point should be the default option, only deviated from if there are very good reasons.
Everyone here knows that floating point isn't perfectly representing the real numbers. What is the point of this comment? What does it have to do with what I posted? Why give another example when another was talked about already?
The other examples involve things like running out of memory and time, the impossibility of real numbers, precision of giant numbers. Meanwhile, the actual problem begins very close to 1+1.
Forget everything else, I need to pay my bills, this involves money.
I didn't say anything else because it is hard not to make sarcastic jokes and devolve into a rant.
You break down a problem to its smallest parts, then you solve those small problems. It seems very basic.
The smallest problem is c = a + b NOT a + b
One can do c = 2.03, there is no problem storing the number.
One can also do 1.03 + 1.00 like c = ((a*100) + (b*100))/100
This looks even more hideous than c = asfdsADD(a,b) but at least you don't need to hurl around a big num lib to do 1+1.
If things get ever so slightly more complicated than 1+1, what should be a bunch of nice looking formulas looks horrible. Heaven forbid one wants a percentage of something. I would have to think how to accomplish that without the lib.
I have a very slow brain with very little memory and many threads, I'd much rather spend cpu cycles.
>How is that any different from ordinary integer operations/arithmetic with ints/longs/etc...?
It is different, because you never want to take the square root of an integer.
Floats allow you to pretend you are working with sometimes inaccurate real numbers. That is the magic behind it and why fixed point arithmetic was abandoned almost immediately when floating point became fast and easily available.
There are plenty of libraries which makes dealing with fixed-point relatively easy.
I think GP is alluding to issues you get when you have a mix of scales. Like floating-point, fixed-point will have a limited precision range around the origin. In contrast to floating-point, the range for fixed-point is smaller but evenly spaced.
Say you have 32:32 fixed-point. You can then represent numbers in multiple of ~0.2e-9. So if you need to calculate distances in nanometers, perhaps due to a short timestep in a simulation, you have hardly any precision to work with.
The obvious way around this is to pull out a scale factor, say 1e-9 since you know you'll be in the nano-range. Now the numbers you're working with are back on the order of 1 with lots of precision, however you need to apply the scale factor when you actually want to use the number, and now you have to be careful not to overflow when multiplying two or more scaled numbers. This is part of the programmer overhead GP alluded to.
> The obvious way around this is to pull out a scale factor, say 1e-9 since you know you'll be in the nano-range. Now the numbers you're working with are back on the order of 1 with lots of precision, however you need to apply the scale factor when you actually want to use the number, and now you have to be careful not to overflow when multiplying two or more scaled numbers.
Do that rescaling automatically and have a system that just keeps track of the scale and you've just implemented floats. (:
Consider a simple newtonian physics simulation. What you say is probably true for both force and acceleration. It may be true for velocity, but is almost certainly not true for position.
Consider an object with a very small velocity traveling away from the origin with no forces acting upon it. With floating-point arithmetic, it will eventually stop at some distance away from the origin. In general objects with small velocity will only move if they are close to the origin.
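That stalling is easy to reproduce: once the per-step displacement falls below half an ulp of the position, the addition rounds back to the old position and the object stops:

```python
import math

pos = 1.0e7                  # position, metres from the origin
v = 1.0e-10                  # displacement per timestep
print(math.ulp(pos))         # ~1.86e-9: spacing of doubles at this magnitude
print(pos + v == pos)        # True: the update is absorbed, motion stops
```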
Sure, but most PIC methods are centered at their element or local domain. (You can likely even do the same with ewald/nbody). I definitely agree with you though if you need high accuracy on position when everything is on the same huge grid.
A simulation that computes bulk properties by simulating particles directly is very rare, at least for me. Observables like temperature, electric/magnetic field strength, stress, velocity, etc. are much more common outputs. These often vary over many orders of magnitude across a domain.
Working with continuum mechanics, performing large inverse solves on matrices in fixed precision, or heck, even an FFT sounds terrifying to guard against overflow.
People like you are the reason why, even though computers have become incredibly fast, almost all software is complete shit and getting worse faster than hardware can keep up with.
If your software isn't manipulating Gigabytes of data or solving a hard math problem everything should be instantaneous.
Just because you might not understand basic CS knowledge is no reason to deprive the world of the best approximation to real numbers we have.
By the way: big num libraries absolutely do not solve the problems of floats. They, just as floats, can suffer from unboundedly large errors. And 0.3 isn't representable in base 2 regardless of how long you make the mantissa and exponent.
> People like you are the reason why, even though computers have become incredibly fast, almost all software is complete shit and geting worse faster than hardware can keep up with.
I appreciate the sentiment but in this case I do not want to hurl around a bignum lib to add up 2 tiny numbers.
> If your software isn't manipulating Gigabytes of data or solving a hard math problem everything should be instantaneous.
I agree, NO amount of bandwidth should be spent on this; built-in functionality should do the small number arithmetic.
> Just because you might not understand basic CS knowledge is no reason to deprive the world of the best approximation to real numbers we have.
Meanwhile, back on the construction site I'm hammering screws.
No, I don't want to take your nails away from you.
Many people approach them as "oh, they're decimals" when they absolutely are not... but there are no convenient alternatives in most languages. So people use the convenient one that they see 99% of other code using, rather than something that is more likely to fit their expectations.
That's the problem. People are surrounded by screws and hammers and they understandably choose to hammer in the screws rather than 1) knowing a screwdriver exists, and 2) hunting down the few three-handed screwdrivers in existence and trying to use them.
I'm sorry, but what's so hard about "all floating point arithmetic operations can contain errors"? It's such a simple statement, one that every developer learns in CS 101, that you don't even need to understand the reason to remember it.
All of the flaws of floating point numbers are the mathematical results of its requirements. You have a fixed size datatype which almost always behaves like the real numbers.
Floats can be fixed just as much as free energy can be invented.
A secondary overhead of higher-precision floats is the memory wall, arguably a more serious problem today. Memory bandwidth has become the most serious bottleneck for many numerical programs. Modern computers have a machine balance of around 100:1, meaning you need to do as many as 100 floating-point operations after loading a single value from RAM to reach the machine's peak performance. But this is incompatible with many algorithms that have an inherently low arithmetic intensity, including important physics simulations. The use of FP80 or FP128 would make them unacceptably slow. As a result, today's trend is moving from FP64 to FP32, and even to FP16 or a custom 16-bit format if possible, not vice versa.
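The 100:1 figure follows from simple arithmetic (round numbers for a hypothetical machine, not any specific product):

```python
# Illustrative machine-balance estimate: peak FLOP rate vs. how many
# FP64 values the memory system can deliver per second.
peak_flops = 2.0e12              # 2 TFLOP/s of FP64 (hypothetical)
bandwidth = 1.6e11               # 160 GB/s from DRAM (hypothetical)
doubles_per_sec = bandwidth / 8  # 8 bytes per FP64 value
print(peak_flops / doubles_per_sec)   # 100.0 FLOPs needed per load

# A stream-like kernel such as y[i] += a*x[i] does 2 FLOPs per
# 16 bytes moved: two orders of magnitude below that balance, so it
# is bandwidth-bound at any precision, and wider floats only add
# more traffic.
```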
The use of floating-point exceptions similarly became unpopular because of performance problems. I'm not an expert on this, but my impression is that signaling a floating-point exception was so expensive at both the hardware and operating-system level that it was never seriously used in most practical programs. So, contrary to IEEE-754's intent, these exceptions never became an integral part of the programming environment. In addition to requiring an expensive Unix signal, potential exceptions also inhibit the efficient pipelining of modern out-of-order, superscalar CPUs.
Finally, the use of unsafe optimizations is prevalent in many applications where rigor is sacrificed for speed.
So overall, the original spirit of IEEE-754 is long gone - for better or worse, unfortunately.