# Floating point data type: IEEE 754 format

The floating point data types can represent real numbers in a C++ program. An integer type cannot represent a number with a fractional part, but a floating point type can, because it uses a different internal format (known as the IEEE 754 floating point format). For an int type, the binary number is converted to an integer value using the base-2 numeral system. In this system, to convert a binary number such as 101011 into an integer value, each bit is assigned a position number starting from the rightmost bit. The rightmost bit is assigned 0 and the number increases by one for each step to the left, so the six bits of 101011 have the positions 5, 4, 3, 2, 1 and 0 from left to right.

In the next step, 2 is raised to the power of each position number and multiplied by the binary digit at that position. The resulting values are added, which gives

$$1 \cdot 2^5 + 0 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 43$$

Converting a binary number into an int value is simple and straightforward. However, this method cannot represent a real number (a number with a fractional part). To make computers support real numbers, a new method was introduced and standardized, known as the IEEE 754 format. The floating point data types use this method to represent real numbers. How the method works is shown below.

#### The workings of floating point data type IEEE 754 format

To understand the internal format of the floating point type, let’s consider a 32-bit binary number, say 0 01010111 11100000000000011101011. To interpret this binary number as a real number using the IEEE 754 method, the 32 bits are divided into three parts:

i) The first part consists of only one bit, the leftmost bit (bit 31). This bit is known as the sign bit (s for short). It represents the +ve or -ve sign: if this bit is 1 the value is -ve, and if it is 0 the value is +ve.

ii) The second part consists of the 8 bits next to the leftmost bit, i.e. from the 30th to the 23rd bit. These 8 bits are known as the exponent (e for short).

iii) The remaining 23 bits, from the 22nd to the 0th bit, are known as the mantissa (m for short).

So we have,

s = 0, e = 01010111, m = 11100000000000011101011

After dividing the bits into three parts, we use the formula given below to obtain the real number value.

$$(-1)^s \cdot 2^{e-127} \cdot \left(1 + \frac{m}{2^{23}}\right)$$

I won’t discuss here how this formula originated; if you want more information you can Google it. Now let’s apply the formula to convert the 32 bits 0 01010111 11100000000000011101011 into a real number (floating point value).

i) The sign part: s = 0, so $(-1)^0 = 1$ and the value is +ve.

ii) The exponent part: the 8 bits are 01010111. We can use the base-2 numeral system to obtain the value of e, so e = 87.

iii) The mantissa part: the bits are 11100000000000011101011, so m = 7340267 (using the base-2 numeral system).

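Substituting the values of s, e and m in the formula:

$$(-1)^s \cdot 2^{e-127} \cdot \left(1 + \frac{m}{2^{23}}\right)$$

$$= 1 \cdot 2^{87-127} \cdot \left(1 + \frac{7340267}{2^{23}}\right)$$

= 9.094947017729282379150390625e-13 * (1 + 0.87502801418304443359375)

= 1.7053280445752938554448974173283e-12  (here e-12 stands for $\times 10^{-12}$)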

The value obtained is very small. To get a larger value, the exponent e must be made larger. In the next section we will see the maximum and minimum values obtainable from the formula. We will also see how to convert a real number into binary digits.

#### The values 0, infinity, NaN, largest value, smallest value and denormalized numbers

The values 0 (zero), infinity and NaN (Not a Number, also known as an undefined number) are obtained only when the binary digits match certain special patterns. The required conditions are discussed below.

The value 0:

The floating point value is 0 if the exponent (e) and the mantissa (m) are both 0. If s = 1 it is -0 (known as -ve zero), and if s = 0 it is +0 (known as +ve zero). If e and m are zero, the binary digits have the form

s 00000000 00000000000000000000000

Fig. Binary format of 0 in the floating point system.

Note: the formula is not used to evaluate this value; instead, the value is deduced directly by looking at the binary digits.

Infinity:

If all the exponent bits are 1 and the mantissa value is 0, the value is infinity. When the sign bit is 1 it is -∞, and when it is 0 it is +∞.

s 11111111 00000000000000000000000

Fig. Infinity floating point value.

NaN (Not a Number):

You will usually come across this value in mathematical programs, for example when you divide 0 by 0 in floating point arithmetic (dividing a non-zero number by 0 gives infinity, not NaN). The binary format of this value has all the exponent bits as 1 and any non-zero value as the mantissa.

Fig. NaN floating point value.

#### Smallest value float can represent

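If we look at the formula

$$(-1)^s \cdot 2^{e-127} \cdot \left(1 + \frac{m}{2^{23}}\right)$$

the magnitude becomes small when e and m are made small. The smallest value e and m can take is 0, but if e and m are both 0 the value is considered to be 0 (zero), and e = 0 with a non-zero m is the denormalized case. So for e the smallest value next to 0 is taken, which is 1, and m is taken as 0. Substituting them in the formula we get

$$(-1)^s \cdot 2^{1-127} \cdot \left(1 + \frac{0}{2^{23}}\right)$$

$$= (-1)^s \cdot 2^{-126}$$

So we get two values,
$-1 \cdot 2^{-126}$ ≈ -1.17549435e-38 and
$2^{-126}$ ≈ 1.17549435e-38.

These are the normalized values closest to zero on the negative and positive side.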

#### Largest value float can represent

To obtain a very large value, the exponent and mantissa should be made as large as possible. The largest possible value of the exponent is $2^8 - 1 = 255$, which is obtained when all its bits are 1, and the largest possible value of the mantissa is $2^{23} - 1$, when all its bits are 1. But if all the bits of the exponent are 1, the value is deduced as infinity or NaN, as described earlier. So the exponent value is taken as $(2^8 - 1) - 1 = 254$ (the largest value next to 255) and the mantissa value as $2^{23} - 1$. Substituting them we get,

$$(-1)^s \cdot 2^{254-127} \cdot \left(1 + \frac{2^{23} - 1}{2^{23}}\right)$$

$$= (-1)^s \cdot 2^{127} \cdot \left(2 - 2^{-23}\right)$$

We can obtain two values with the same magnitude but with +ve and -ve signs.

$2^{127} \cdot (2 - 2^{-23})$ ≈ 3.40282347e+38 and
$-1 \cdot 2^{127} \cdot (2 - 2^{-23})$ ≈ -3.40282347e+38.

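So the maximum value the float type can represent is 3.40282347e+38. A result greater than this is known as overflow.

The most negative value float can represent is -3.40282347e+38. A result below this also overflows, towards -∞. Underflow is the opposite situation: a non-zero result whose magnitude is smaller than 1.17549435e-38, so it is too close to zero to be stored as a normalized number.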

The range of values the float type can represent, together with the numbers 0, infinity and NaN, is shown in the number line below.

Fig. Float values on the number line.

#### Denormalized number

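If the exponent is 0 and the mantissa has a non-zero value (so that the fraction $m/2^{23}$ lies between 0 and 1), the value is considered a denormalized number. The leading 1 of the normal formula is dropped and the exponent is fixed at 1 - 127. The formula for such a number is,

$$(-1)^s \cdot 2^{1-127} \cdot \left(\frac{m}{2^{23}}\right)$$

or

$$(-1)^s \cdot 2^{-126} \cdot \left(\frac{m}{2^{23}}\right)$$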

#### Double type

In the case of double, whose size is 64 bits, the sign bit is the leftmost bit (bit 63), the same as in the float type. The exponent consists of 11 bits (62 to 52) while the mantissa has 52 bits (51 to 0). So double can represent a double precision value, and hence the name double.

#### Converting floating point value to binary format

The binary format of a floating point value can be obtained by reversing the steps we followed to convert the binary number into a floating point value. Let’s apply the reverse steps and convert 125.125 to binary format.

First, convert the integer part to binary digits using the base-2 system.
125 = 1111101

Second, convert the fractional part to binary digits ($0.125 = 2^{-3}$).
.125 = .001

1111101 + .001 = 1111101.001

Turn the resulting binary digits into exponential form, so
1111101.001 = 1.111101001 x $2^6$

This can be written as

$(-1)^0 \cdot 2^6 \cdot 1.111101001$

Compare this expression with the formula,
$(-1)^s \cdot 2^{e-127} \cdot (1 + m/2^{23})$.

Which gives,
s = 0,

e - 127 = 6
=> e = 6 + 127
=> e = 133, which is 10000101 in binary

$1 + m/2^{23}$ = 1.111101001
or
$m/2^{23}$ = .11110100100000000000000
or
$m/2^{23}$ = 11110100100000000000000. * $2^{-23}$

=> m = 11110100100000000000000

So, arranging s, e and m according to the binary format, we get 125.125 as 0 10000101 11110100100000000000000

#### Error originating from floating point type

The floating point format can give rise to some unexpected errors in our programs. Some of these errors are described below.

i) Accuracy error due to rounding off.

Let’s look at the program below.

    #include <iostream>

    using namespace std;

    int main( )
    {
        // None of these four values fits exactly in a float
        float f1 = 123456789 , f2 = 123456788 ,
              f3 = 123456797 , f4 = 123456796 ,
              f ;

        f = f1 - f2 ;
        cout << f << endl ;

        f = f3 - f4 ;
        cout << f << endl ;

        cin.get() ;
        return 0 ;
    }

The output is
8
0

It should be 1 and 1, but it isn’t. Why? I’ll show here why 123456789 - 123456788 gives 8; the same reasoning applies to 123456797 - 123456796 = 0. To understand why, we need to convert the values assigned to f1 and f2 into their floating point binary format. So, first of all, we convert the values into binary using the base-2 numeral system.

123456789= 111 0101 1011 1100 1101 0001 0101
123456788= 111 0101 1011 1100 1101 0001 0100

We can write the binary digits in exponential form with base 2, so

123456789 = 1.11 0101 1011 1100 1101 0001 0101 x $2^{26}$ and

123456788 = 1.11 0101 1011 1100 1101 0001 0100 x $2^{26}$.

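If we compare them with the formula given above, for 123456789,

$m/2^{23}$ = .11 0101 1011 1100 1101 0001 0101
=> $m/2^{23}$ = 11 0101 1011 1100 1101 0001 0.101 * $2^{-23}$

Since the mantissa can have only 23 bits, the remaining .101 is rounded off using the round to nearest method. It is more than half, so the mantissa is rounded up:

m = 11 0101 1011 1100 1101 0001 1    (note the change in the last digit from 0 to 1 after rounding off)

Similarly, for 123456788 the remaining part is .100, which is exactly half. In such a tie, round to nearest picks the result whose last bit is 0, so the mantissa stays as it is:

m = 11 0101 1011 1100 1101 0001 0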

So the floating point binary formats (with e = 26 + 127 = 153 = 10011001) are

123456789= 0 10011001 11010110111100110100011 and
123456788= 0 10011001 11010110111100110100010

Now let’s subtract 123456788 from 123456789. To subtract them, first write their binary formats in exponential form.

123456789 = 1.11010110111100110100011 * $2^{26}$
123456788 = 1.11010110111100110100010 * $2^{26}$,

 1.11010110111100110100011 * $2^{26}$
-1.11010110111100110100010 * $2^{26}$
-------------------------------------
 0.00000000000000000000001 * $2^{26}$

Now 0.00000000000000000000001 can be written as 1.00 * $2^{-23}$. So the value becomes

1.00 * $2^{-23}$ * $2^{26}$

= 1.00 * $2^{26-23}$

= 1.00 * $2^{3}$

= 1000 (in binary)

Converting it into base 10 (decimal) form we get 8, hence the output. To solve this problem we can use the double type. Look at the code below.

    double d1 = 123456789 , d2 = 123456788 ,
           d3 = 123456797 , d4 = 123456796 ,
           d ;

    d = d1 - d2 ;
    cout << d << endl ;

    d = d3 - d4 ;
    cout << d << endl ;

The output is,

1
1

A double has a 52-bit mantissa, so none of the 27 bits of these values has to be rounded off. All the bits are preserved, and hence we get the correct value as the output.

ii) Using different types in a calculation.

To understand this error, consider the program below.

    #include <iostream>

    using namespace std;

    int main( )
    {
        float a ;
        double b , c , d ;

        a = 223167182.333457 ; // float: rounded to the nearest float
        b = 223167182.333457 ; // double
        c = 223167182.333457 ; // double

        d = a - b ;            // a is converted to double before subtracting
        cout << d << endl ;

        d = c - b ;
        cout << d << endl ;

        cin.get() ;
        return 0 ;
    }

The output is,

1.66654
0

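When the variables are all of double type the output is 0, but when one of the variables is of float type we get 1.66654 as the output. The error arises from the difference in precision between float and double: floats near this value are 16 apart, so a actually holds 223167184, while b holds a value very close to 223167182.333457, and their difference is about 1.66654. To avoid such errors, use the same type for every value in a mathematical calculation.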

#### Conclusion

The IEEE 754 floating point format still has some drawbacks, such as the rounding error. Such errors are bound to occur with any format that stores real numbers in a fixed number of bits. There is little choice here, because IEEE 754 is the format that practically all hardware implements today, and solving this problem would require a new innovation. Who knows, the world might just be waiting for your innovation!