Curiously enough, today, such a hack is completely pointless (at least on x86): ...

jacquesm · on Oct 22, 2009

Yes, but what a beautiful little hack it is. The fact that modern processors implement it in hardware is actually a tribute to this tricks ingenuity and elegance.

After all, to have major chip manufacturers implement your software hack in hardware is quite a compliment, assuming they use the same mechanism.

andreyf · on Oct 22, 2009

This particular hack, yes, but the general point stands. Sometimes hacks make for great code. The point is that this is good code:

    float h = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    return x*(1.5f - h*x*x);

That's 3 local variables, 5 short statements... makes me wonder how quickly one could search/test all combinations of sensible 5-statement functions to find other such "hacks"...

DarkShikari · on Oct 22, 2009

When you're willing to make an approximation you can come up with all sorts of very short functions for common operations.

Here's one from my project, which takes an integer x and calculates a (float) log2(x) in an absurdly low number of operations.

    static inline float x264_log2( uint32_t x )
    {
        int lz = __builtin_clz( x );
        return x264_log2_lut[(x<<lz>>24)&0x7f] + x264_log2_lz_lut[lz];
    }

    const float x264_log2_lut[128] = {
        0.00000, 0.01123, 0.02237, 0.03342, 0.04439, 0.05528, 0.06609, 0.07682,
        0.08746, 0.09803, 0.10852, 0.11894, 0.12928, 0.13955, 0.14975, 0.15987,
        0.16993, 0.17991, 0.18982, 0.19967, 0.20945, 0.21917, 0.22882, 0.23840,
        0.24793, 0.25739, 0.26679, 0.27612, 0.28540, 0.29462, 0.30378, 0.31288,
        0.32193, 0.33092, 0.33985, 0.34873, 0.35755, 0.36632, 0.37504, 0.38370,
        0.39232, 0.40088, 0.40939, 0.41785, 0.42626, 0.43463, 0.44294, 0.45121,
        0.45943, 0.46761, 0.47573, 0.48382, 0.49185, 0.49985, 0.50779, 0.51570,
        0.52356, 0.53138, 0.53916, 0.54689, 0.55459, 0.56224, 0.56986, 0.57743,
        0.58496, 0.59246, 0.59991, 0.60733, 0.61471, 0.62205, 0.62936, 0.63662,
        0.64386, 0.65105, 0.65821, 0.66534, 0.67243, 0.67948, 0.68650, 0.69349,
        0.70044, 0.70736, 0.71425, 0.72110, 0.72792, 0.73471, 0.74147, 0.74819,
        0.75489, 0.76155, 0.76818, 0.77479, 0.78136, 0.78790, 0.79442, 0.80090,
        0.80735, 0.81378, 0.82018, 0.82655, 0.83289, 0.83920, 0.84549, 0.85175,
        0.85798, 0.86419, 0.87036, 0.87652, 0.88264, 0.88874, 0.89482, 0.90087,
        0.90689, 0.91289, 0.91886, 0.92481, 0.93074, 0.93664, 0.94251, 0.94837,
        0.95420, 0.96000, 0.96578, 0.97154, 0.97728, 0.98299, 0.98868, 0.99435,
    };

    /* Avoid an int/float conversion. */
    const float x264_log2_lz_lut[32] = {
        31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
    };

The resulting asm, just 7 instructions:

      bsr     ecx, eax
      xor     ecx, 0x1f
      movss  xmm0, [ecx*4+x264_log2_lz_lut]
      shl     eax, cl
      shr     eax, 0x18
      and     eax, 0x7f
      addss  xmm0, [eax*4+x264_log2_lut]

axod · on Oct 22, 2009

Could duplicate the x264_log2_lut table, then you wouldn't need the 'and eax, 0x7f'.... 6 instructions.

DarkShikari · on Oct 22, 2009

Yeah, though at that point I'd be start to worry about whether that's a fair L1 cache tradeoff, given that it only saves a single (integer!) instruction and requires 512 bytes of memory.

axod · on Oct 22, 2009

Sure, always a trade off when you get to that sort of point, and not much payoff.

eru · on Oct 22, 2009

What kind of project is it?

DarkShikari · on Oct 22, 2009

I was hoping the names would make it slightly obvious, but if not, it's the x264 video encoder: http://www.videolan.org/developers/x264.html

Bietchiebatchie · on Oct 22, 2009

Actually it is not completely pointless today: Some months ago I used it to speed up an force-directed algorithm which spent 80% of its time in calculating normalized vectors between two points - and the fast inverse sqrt actually helped to speed up the program by a factor of 2.5 to 3.0. The reason why i used it is quite simple: In (unsafe) .NET code you can execute all the necessary arithmetic operations (pointer casts, bit shifts) but you are not allowed to execute arbitrary x86 operations.