Tuesday, March 30, 2021

Floating point number madness

 We'd like to represent floating point numbers in binary format.


The IEEE 754 format uses 32 bits to encode a single-precision floating point number, as follows.


Bit:     31  |  30 29 28 27 26 25 24 23  |  22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Field:  Sign |  Exponent (8 bits)        |  Fraction (23 bits)
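To see the layout in practice (sign in bit 31, exponent in bits 30 to 23, fraction in bits 22 to 0), here is a minimal Python sketch that splits a number into the three fields. The helper name float_to_fields is made up for this post; it relies only on the standard struct module.

```python
import struct

def float_to_fields(x):
    """Return (sign, exponent, fraction) of x encoded as a 32-bit IEEE 754 float.

    float_to_fields is a hypothetical helper name used for illustration.
    """
    # Pack as a big-endian single-precision float, then reinterpret
    # the same 4 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23 (stored, biased exponent)
    fraction = bits & 0x7FFFFF       # bits 22..0
    return sign, exponent, fraction

print(float_to_fields(1.0))    # (0, 127, 0)
print(float_to_fields(-2.5))   # (1, 128, 2097152)
```

For -2.5 = -1.25 x 2^1 the stored exponent is 1 + 127 = 128 and the fraction is 0.25 x 2^23 = 2097152.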



Normal: Exponent between 1 and 254. There is an implicit one to the left of the fraction, and the stored exponent e is biased by 127.

                     (-1)^S x 1.f x 2^(e-127)
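As a rough check of the normal-number formula, the sketch below rebuilds the value from the three fields, reading 1.f as 1 + fraction / 2^23. The function name decode_normal is an assumption made for this example.

```python
def decode_normal(sign, exponent, fraction):
    """Value of a normal single-precision number from its raw fields.

    decode_normal is a hypothetical helper name used for illustration.
    """
    if not 1 <= exponent <= 254:
        raise ValueError("normal numbers have a stored exponent in 1..254")
    # Implicit leading one, followed by the 23 fraction bits.
    significand = 1 + fraction / 2**23
    # Remove the bias of 127 from the stored exponent.
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

print(decode_normal(0, 127, 0))          # 1.0
print(decode_normal(1, 128, 0x200000))   # -2.5
```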

Signed Zero: Exponent=0, Fraction=0

                     (-1)^S x 0

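Subnormal/Denormal: Exponent=0, Fraction!=0. There is a zero (not a one) to the left of the fraction, and the exponent is treated as if it were 1, so the scale is fixed at 2^-126. This covers the values between 0 and the smallest normal number.

                     (-1)^S x 0.f x 2^(-126)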
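A small sketch of the subnormal formula, reading 0.f as fraction / 2^23. The name decode_subnormal is made up for this post; the last lines cross-check the result against Python's own decoding of the raw bit pattern 0x00000001 through the struct module.

```python
import struct

def decode_subnormal(sign, fraction):
    """Value of a subnormal single-precision number (stored exponent = 0).

    decode_subnormal is a hypothetical helper name used for illustration.
    """
    if not 1 <= fraction <= 0x7FFFFF:
        raise ValueError("subnormals have a non-zero 23-bit fraction")
    # No implicit leading one; the scale is fixed at 2^-126.
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126

smallest = decode_subnormal(0, 1)          # 2^-149
largest = decode_subnormal(0, 0x7FFFFF)    # just below 2^-126

# Cross-check: decode the bit pattern 0x00000001 as a 32-bit float.
(from_bits,) = struct.unpack(">f", struct.pack(">I", 0x00000001))
print(smallest == from_bits)               # True
```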

Infinity (positive/negative): Exponent=255, Fraction=0

                     (-1)^S x infinity

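NaN: Exponent=255, Fraction!=0

                     if b22=1: qNaN (quiet NaN)

                     if b22=0: sNaN (signaling NaN)

b22 is bit 22, the most significant bit of the fraction. The sign bit carries no meaning for a NaN. This is the convention recommended by IEEE 754-2008 and used by most current hardware; a few older architectures (legacy MIPS, PA-RISC) use the opposite meaning for b22.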
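Putting the cases together, here is a minimal sketch of a classifier that works on the raw 32-bit pattern. The name classify is hypothetical, and the NaN branch assumes the IEEE 754-2008 convention in which b22 set means a quiet NaN and b22 clear means a signaling NaN; architectures that use the opposite convention would need the two labels swapped.

```python
def classify(bits):
    """Classify a raw 32-bit IEEE 754 single-precision bit pattern.

    classify is a hypothetical helper name used for illustration.
    """
    exponent = (bits >> 23) & 0xFF   # bits 30..23
    fraction = bits & 0x7FFFFF       # bits 22..0
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == 255:
        if fraction == 0:
            return "infinity"
        # Assumption: IEEE 754-2008 convention, b22 = 1 means quiet.
        return "qNaN" if fraction & (1 << 22) else "sNaN"
    return "normal"

for pattern in (0x80000000, 0x00000001, 0x3F800000,
                0xFF800000, 0x7FC00000, 0x7F800001):
    print(f"{pattern:#010x} -> {classify(pattern)}")
```

The classifier takes bit patterns rather than Python floats on purpose: converting a signaling NaN through a float can quiet it on some platforms, which would hide the sNaN case.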