Assembly Floating Point

Next: Assembly SIMD Instructions, Previous: Assembly Functional Units, Up: Assembly Coding

15.8.6 Floating Point

Floating point arithmetic is used in GMP for multiplications on CPUs with poor integer multipliers. It's mostly useful for mpn_mul_1, mpn_addmul_1 and mpn_submul_1 on 64-bit machines, and mpn_mul_basecase on both 32-bit and 64-bit machines.

With IEEE 53-bit double precision floats, integer multiplications producing up to 53 bits will give exact results. Breaking a 64x64 multiplication into eight 16x32->48 bit pieces is convenient. With some care though six 21x32->53 bit products can be used, if one of the lower two 21-bit pieces also uses the sign bit.

For the mpn_mul_1 family of functions on a 64-bit machine, the invariant single limb is split at the start, into 3 or 4 pieces. Inside the loop, the bignum operand is split into 32-bit pieces. Fast conversion of these unsigned 32-bit pieces to floating point is highly machine-dependent. In some cases, reading the data into the integer unit, zero-extending to 64-bits, then transferring to the floating point unit back via memory is the only option.

Converting partial products back to 64-bit limbs is usually best done as a signed conversion. Since all values are smaller than 2^53, signed and unsigned are the same, but most processors lack unsigned conversions.

Here is a diagram showing 16x32 bit products for an mpn_mul_1 or mpn_addmul_1 with a 64-bit limb. The single limb operand V is split into four 16-bit parts. The multi-limb operand U is split in the loop into two 32-bit parts.

                     +---+---+---+---+
                     |v48|v32|v16|v00|    V operand
                     +---+---+---+---+
     
                     +-------+---+---+
                 x   |  u32  |  u00  |    U operand (one limb)
                     +---------------+
     
     ---------------------------------
     
                         +-----------+
                         | u00 x v00 |    p00    48-bit products
                         +-----------+
                     +-----------+
                     | u00 x v16 |        p16
                     +-----------+
                 +-----------+
                 | u00 x v32 |            p32
                 +-----------+
             +-----------+
             | u00 x v48 |                p48
             +-----------+
                 +-----------+
                 | u32 x v00 |            r32
                 +-----------+
             +-----------+
             | u32 x v16 |                r48
             +-----------+
         +-----------+
         | u32 x v32 |                    r64
         +-----------+
     +-----------+
     | u32 x v48 |                        r80
     +-----------+

p32 and r32 can be summed using floating-point addition, and likewise p48 and r48. p00 and p16 can be summed with r64 and r80 from the previous iteration.

For each loop then, four 49-bit quantities are transferred to the integer unit, aligned as follows,

     |-----64bits----|-----64bits----|
                        +------------+
                        | p00 + r64' |    i00
                        +------------+
                    +------------+
                    | p16 + r80' |        i16
                    +------------+
                +------------+
                | p32 + r32  |            i32
                +------------+
            +------------+
            | p48 + r48  |                i48
            +------------+

The challenge then is to sum these efficiently and add in a carry limb, generating a low 64-bit result limb and a high 33-bit carry limb (i48 extends 33 bits into the high half).