IEEE Standard 754 Floating Point Numbers

Last Updated : 16 Mar, 2020

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point computation, established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations of the time, which made them difficult to use reliably and reduced their portability. IEEE 754 floating point is today the most common representation for real numbers on computers, including Intel-based PCs, Macs, and most Unix platforms.

There are several ways to represent floating-point numbers, but IEEE 754 is the most widely used. An IEEE 754 number has 3 basic components:

  1. The Sign of Mantissa –
    This is as simple as the name suggests: 0 represents a positive number, while 1 represents a negative number.
  2. The Biased exponent –
    The exponent field needs to represent both positive and negative exponents. A bias is added to the actual exponent to get the stored exponent, which is therefore never negative.
  3. The Normalised Mantissa –
    The mantissa is the part of a number in scientific notation, or of a floating-point number, that holds its significant digits. In binary there are only 2 digits, i.e. 0 and 1, so a normalised mantissa is one with exactly one 1 to the left of the binary point. Because that leading 1 is always present, it is not stored; only the bits after the binary point are.

IEEE 754 numbers come in two basic formats built from the above three components: single precision (32 bits) and double precision (64 bits).




TYPES             SIGN         BIASED EXPONENT   NORMALISED MANTISSA   BIAS
Single precision  1 (bit 31)   8 (bits 30-23)    23 (bits 22-0)        127
Double precision  1 (bit 63)   11 (bits 62-52)   52 (bits 51-0)        1023
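
The field layout in the table can be sketched in code. The snippet below is a minimal illustration, not a reference implementation: the helper names float32_fields and float64_fields are made up for this example, and it relies on Python's standard struct module to obtain the raw bits.

```python
import struct

def float32_fields(value):
    """Split a number (rounded to single precision) into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                  # bit 31
    exponent = (bits >> 23) & 0xFF     # bits 30-23, still biased by 127
    mantissa = bits & 0x7FFFFF         # bits 22-0, hidden leading 1 not included
    return sign, exponent, mantissa

def float64_fields(value):
    """Split a double-precision number into its three fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = bits >> 63                  # bit 63
    exponent = (bits >> 52) & 0x7FF    # bits 62-52, still biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # bits 51-0
    return sign, exponent, mantissa

# -0.75 is -1.1 (binary) x 2^-1, so the stored exponent is 127 + (-1) = 126.
sign, exponent, mantissa = float32_fields(-0.75)
print(sign, exponent, exponent - 127, format(mantissa, "023b"))
```

Subtracting the bias from the stored exponent recovers the actual exponent, which is why the same value has different exponent fields in the two formats.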

Example –

85.125
85 = 1010101
0.125 = 0.001
85.125 = 1010101.001
       = 1.010101001 x 2^6
sign = 0

1. Single precision:
biased exponent 127+6=133
133 = 10000101
Normalised mantissa = 010101001
we will append 0's to complete the 23 bits

The IEEE 754 Single precision is:
= 0 10000101 01010100100000000000000
This can be written in hexadecimal form 42AA4000

2. Double precision:
biased exponent 1023+6=1029
1029 = 10000000101
Normalised mantissa = 010101001
we will append 0's to complete the 52 bits

The IEEE 754 Double precision is:
= 0 10000000101 0101010010000000000000000000000000000000000000000000
This can be written in hexadecimal form 4055480000000000 
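
The worked example can be cross-checked in code. This is a small sketch that assumes Python's standard struct module; the helper names ieee754_hex and ieee754_bits are hypothetical and exist only for this illustration.

```python
import struct

def ieee754_hex(value, precision="single"):
    """Return the big-endian IEEE 754 encoding of value as a hex string."""
    fmt = ">f" if precision == "single" else ">d"
    return struct.pack(fmt, value).hex().upper()

def ieee754_bits(value, precision="single"):
    """Return the encoding as 'sign exponent mantissa' bit groups."""
    exp_bits, width = (8, 32) if precision == "single" else (11, 64)
    raw = int(ieee754_hex(value, precision), 16)
    bits = format(raw, "0{}b".format(width))
    return " ".join([bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]])

print(ieee754_bits(85.125, "single"))  # 0 10000101 01010100100000000000000
print(ieee754_hex(85.125, "single"))   # 42AA4000
print(ieee754_bits(85.125, "double"))
print(ieee754_hex(85.125, "double"))   # 4055480000000000
```

Because 85.125 is exactly representable in both formats, no rounding occurs and the outputs match the hand-derived bit patterns.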

Special Values: IEEE 754 reserves some bit patterns for values that could otherwise cause ambiguity.

  • Zero –
    Zero is a special value denoted with an exponent and mantissa of 0. -0 and +0 are distinct bit patterns, differing only in the sign bit, though they compare as equal.

  • Denormalised –
    If the exponent is all zeros but the mantissa is not, the value is a denormalized number. Such a number does not have an assumed leading one before the binary point.

  • Infinity –
    The values +infinity and -infinity are denoted with an exponent of all ones and a mantissa of all zeros. The sign bit distinguishes between negative infinity and positive infinity. Operations with infinite values are well defined in IEEE 754.

  • Not A Number (NAN) –
    The value NaN is used to represent a result that is not a valid number, such as the outcome of an invalid operation. It is represented by an exponent field of all ones together with a mantissa that is not zero; the sign bit may be either 0 or 1. This special value might also be used to denote a variable that doesn’t yet hold a value.

EXPONENT   MANTISSA   VALUE
0          0          exact 0
255        0          Infinity
0          not 0      denormalised
255        not 0      Not a number (NaN)
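
The single-precision table above translates directly into a small classifier. The sketch below is illustrative only; classify_float32 is a hypothetical helper that takes the 32-bit pattern as an integer.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF       # 23-bit mantissa
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalised"
    if exponent == 255:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalised"

patterns = (0x00000000, 0x80000000, 0x00000001,
            0x7F800000, 0xFF800000, 0x7FC00000, 0x42AA4000)
for pattern in patterns:
    print(format(pattern, "08X"), classify_float32(pattern))
```

Note that the sign bit plays no part in the classification: 0x80000000 is still a zero and 0xFF800000 is still an infinity.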

The same holds for double precision, with 255 replaced by 2047 (the all-ones value of the 11-bit exponent field).

Ranges of floating-point numbers:

                  Denormalized                         Normalized                          Approximate Decimal
Single Precision  ± 2^-149 to (1 – 2^-23) × 2^-126     ± 2^-126 to (2 – 2^-23) × 2^127     ± approximately 10^-44.85 to approximately 10^38.53
Double Precision  ± 2^-1074 to (1 – 2^-52) × 2^-1022   ± 2^-1022 to (2 – 2^-52) × 2^1023   ± approximately 10^-323.3 to approximately 10^308.3

The range of positive floating-point numbers can be split into normalized numbers and denormalized numbers, which use only a portion of the fraction’s precision. Since every floating-point number has a corresponding negated value, the ranges above are symmetric around zero.
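
As a rough check of the single-precision limits, the boundary bit patterns can be decoded and compared with the formulas in the table. This is a sketch assuming Python's standard struct module; float32_from_bits is a hypothetical helper name.

```python
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

smallest_denormal = float32_from_bits(0x00000001)  # exponent 0, mantissa 1
largest_denormal = float32_from_bits(0x007FFFFF)   # exponent 0, mantissa all ones
smallest_normal = float32_from_bits(0x00800000)    # exponent 1, mantissa 0
largest_normal = float32_from_bits(0x7F7FFFFF)     # exponent 254, mantissa all ones

print(smallest_denormal == 2.0 ** -149)
print(largest_denormal == (1 - 2.0 ** -23) * 2.0 ** -126)
print(smallest_normal == 2.0 ** -126)
print(largest_normal == (2 - 2.0 ** -23) * 2.0 ** 127)

# Setting the sign bit gives the mirrored negative value.
print(float32_from_bits(0x80000001) == -smallest_denormal)
```

All five comparisons print True, since each of these values is exactly representable in the double-precision arithmetic Python uses for the right-hand sides.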

There are five distinct numerical ranges that single-precision normalized and denormalized numbers are not able to represent:

  1. Negative numbers less than –(2 – 2^-23) × 2^127 (negative overflow)
  2. Negative numbers greater than –2^-149 (negative underflow)
  3. Zero (representable only through the special all-zeros encoding described above)
  4. Positive numbers less than 2^-149 (positive underflow)
  5. Positive numbers greater than (2 – 2^-23) × 2^127 (positive overflow)

Overflow generally means that values have grown too large to be represented. Underflow is a less serious problem because it just denotes a loss of precision: the result is guaranteed to be closely approximated by zero.
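
Overflow and underflow can be observed directly. The sketch below uses double precision, on the assumption that Python's float is an IEEE 754 double (true on common platforms); the variable names are illustrative.

```python
import math
import sys

largest = sys.float_info.max         # (2 - 2^-52) x 2^1023
smallest_normal = sys.float_info.min # 2^-1022
smallest_denormal = 5e-324           # 2^-1074

overflowed = largest * 2             # too large: becomes +infinity
gradual = smallest_normal / 4        # denormalized: small but still nonzero
underflowed = smallest_denormal / 2  # too small: rounds to zero

print(math.isinf(overflowed))        # True
print(gradual > 0.0)                 # True
print(underflowed == 0.0)            # True
```

The middle case shows why denormalized numbers exist: results just below the normalized range lose precision gradually instead of jumping straight to zero.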

The total effective range of finite IEEE floating-point numbers is shown below:

         Binary                   Decimal
Single   ± (2 – 2^-23) × 2^127    approximately ± 10^38.53
Double   ± (2 – 2^-52) × 2^1023   approximately ± 10^308.25
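
The decimal exponents in this table follow from taking the base-10 logarithm of the binary limits. A quick sketch of that arithmetic, with illustrative variable names:

```python
import math

single_max = (2 - 2.0 ** -23) * 2.0 ** 127
double_max = (2 - 2.0 ** -52) * 2.0 ** 1023

print(round(math.log10(single_max), 2))  # 38.53
print(round(math.log10(double_max), 2))  # 308.25
```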

Special Operations –

Operation                                        Result
n ÷ ±Infinity                                    0
±Infinity × ±Infinity                            ±Infinity
±nonZero ÷ ±0                                    ±Infinity
±nonZero finite × ±Infinity                      ±Infinity
Infinity + Infinity, Infinity – (-Infinity)      +Infinity
-Infinity – Infinity, -Infinity + (-Infinity)    -Infinity
±0 ÷ ±0                                          NaN
±Infinity ÷ ±Infinity                            NaN
±Infinity × 0                                    NaN
NaN == NaN                                       False
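
Most rows of this table can be tried out with ordinary floats. The sketch below assumes Python, whose float follows IEEE 754 on common platforms. One caveat: Python raises ZeroDivisionError for float division by zero rather than returning ±Infinity or NaN, so the two division-by-zero rows are left out.

```python
import math

inf = math.inf
nan = math.nan

results = {
    "1 / inf": 1.0 / inf,
    "inf * inf": inf * inf,
    "-inf * inf": -inf * inf,
    "2 * -inf": 2.0 * -inf,
    "inf + inf": inf + inf,
    "inf - (-inf)": inf - (-inf),
    "-inf - inf": -inf - inf,
    "inf / inf": inf / inf,
    "inf * 0": inf * 0.0,
    "nan == nan": nan == nan,
}

for expression, value in results.items():
    print(expression, "->", value)
```

Because NaN compares unequal even to itself, math.isnan is the reliable way to test for it.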

