Number storage formats

Design factors

Words and bits

Suppose $b$ bits are available to store a certain type of number: this is usually expressed in terms of multiples of words (collections of $w$ bits). Because binary logic is used for addressing and processing, $w$ is generally a power of 2. In the old days, word sizes as small as $w = 2^2$ were used; now 64-bit words are common.

Simplicity of computation logic

Some representation formats require simpler and more efficient circuits for performing basic arithmetic operations than others. This is an important factor in choosing the representation format.

Special numbers

One may want to reserve space in the representation set for storing special numbers like $+\infty$, $-\infty$, NaN (for storing results of illegal operations).

Integers

Suppose $b$ bits are available to store a number.

Unsigned

Any $x \in \{0\} \cup \mathbb{N}$ can be stored in the natural binary representation. So, in $b$ bits, $2^b$ unsigned numbers $x \in 0 : 2^b - 1$ can be stored.
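As a quick illustration, here is a minimal Python sketch of natural binary storage; the helper names are illustrative, not part of any standard library.

```python
def to_unsigned_bits(x: int, b: int) -> str:
    """Encode x in 0 : 2**b - 1 as a b-bit natural binary string."""
    assert 0 <= x < 2**b, "x outside the representable range"
    return format(x, f"0{b}b")

def from_unsigned_bits(bits: str) -> int:
    """Decode a natural binary string back to the integer it stores."""
    return int(bits, 2)

assert to_unsigned_bits(5, 4) == "0101"
assert from_unsigned_bits("0101") == 5
```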

Signed

While storing negative numbers along with positive numbers, one has to distinguish the two; this requires a sign bit.

Use sign bit + absolute value

A straightforward way to store $x \in \mathbb{Z}$ is to set the sign bit to 1 if $x$ is negative and store $|x| \in \{0\} \cup \mathbb{N}$ in the remaining $b-1$ bits. In this case, $x \in -(2^{b-1}-1) : 2^{b-1}-1$ can be stored. Note that there is redundancy in the possible representations of 0: the sign bit may be 1 or 0 (one of which can then be taken to mean NaN). Thus, a total of $2^b - 1$ numbers can be stored.
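A sketch of this scheme in Python (the helper name is illustrative), showing the redundant zero:

```python
def to_sign_magnitude(x: int, b: int) -> str:
    """Top bit stores the sign; the remaining b-1 bits store |x|."""
    assert abs(x) <= 2**(b - 1) - 1, "|x| needs at most b-1 bits"
    sign = "1" if x < 0 else "0"
    return sign + format(abs(x), f"0{b-1}b")

assert to_sign_magnitude(5, 4) == "0101"
assert to_sign_magnitude(-5, 4) == "1101"
# Redundancy: "0000" and "1000" both denote 0.
assert to_sign_magnitude(0, 4) == "0000"
```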

Offset/biased storage

One can take a large natural number $k$, called the bias. Then, one can store $2^b$ (as against $2^b - 1$ in the previous representation) numbers $x \in -k : 2^b - k - 1$ by simply storing $x + k \in \{0\} \cup \mathbb{N}$.
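A sketch of biased storage (hypothetical helper; the $b = 3$, $k = 3$ values are only for illustration):

```python
def to_biased(x: int, b: int, k: int) -> str:
    """Store x in -k : 2**b - k - 1 by encoding x + k in natural binary."""
    assert -k <= x <= 2**b - k - 1, "x outside the biased range"
    return format(x + k, f"0{b}b")

# With b = 3 and bias k = 3, the range -3 : 4 maps onto 000 : 111.
assert to_biased(-3, 3, 3) == "000"
assert to_biased(0, 3, 3) == "011"
assert to_biased(4, 3, 3) == "111"
```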

1’s complement storage

Suppose we use the bias $2^b - 1$ to store $x \in \mathbb{Z}^-$. This amounts to inverting the bits in the binary representation of $|x| \in \mathbb{N}$ (since $x + 2^b - 1 = (2^b - 1) - |x|$, a subtraction from the all-ones word), and is called the (one's) complement representation of $x \in \mathbb{Z}^-$.
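The equivalence between this bias and bit inversion can be checked directly; a small sketch (names illustrative):

```python
def ones_complement(x: int, b: int) -> str:
    """Store negative x with bias 2**b - 1; equals bit inversion of |x|."""
    assert -(2**b - 1) < x < 0
    stored = x + (2**b - 1)            # biased storage with bias 2**b - 1
    inverted = (2**b - 1) ^ abs(x)     # the same value via bit inversion
    assert stored == inverted
    return format(stored, f"0{b}b")

assert ones_complement(-1, 3) == "110"   # invert 001 -> 110
```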

2’s complement bias

Here again, the biased representation is used only to store negative numbers: for $x \in \mathbb{Z}^-$, we store $2^b + x = 2^b - |x|$; this is the 2's complement of $x$, which we discuss below.

If $b = 3$, the numbers $-4 : -1$ have representations $100 : 111$.

Note that this representation of $x$ effectively constitutes the use of the most significant ($b$-th) bit as a sign bit, distinguishing $x \in \mathbb{Z}^-$ from $x \in \{0\} \cup \mathbb{N}$. Thus, we can store $x \in -2^{b-1} : 2^{b-1} - 1$: a total of $2^b$ numbers.

Addition of negative and positive numbers (i.e., subtraction) becomes slightly easier: the circuit used to add two unsigned numbers works fine, with any carry out of the $b$-th bit simply discarded (addition modulo $2^b$).
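A minimal sketch of why the unsigned adder suffices, simulating a $b$-bit adder with modular arithmetic in Python:

```python
def twos_complement_add(x: int, y: int, b: int) -> int:
    """Add signed numbers using only unsigned (mod 2**b) addition."""
    ux = x % 2**b                 # encode: negative x becomes 2**b - |x|
    uy = y % 2**b
    s = (ux + uy) % 2**b          # plain unsigned addition, carry dropped
    return s - 2**b if s >= 2**(b - 1) else s   # decode via the sign bit

assert twos_complement_add(3, -1, 3) == 2    # 011 + 111 = (1)010 -> 2
assert twos_complement_add(-2, -1, 3) == -3  # 110 + 111 = (1)101 -> -3
```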

IEEE floating point

Division of bits

The bits provided for storage are divided into the following components: 1 sign ($\pm$) bit, $M$ bits to store part of the mantissa, and $E$ bits to store a function of the exponent. These components are described below.

Number stored

This imitates scientific notation for numbers, e.g. $1.2332 \times 10^9$. The number stored is $\pm(1 + f/2^M)\, 2^{e - k}$. Here $\pm(1 + f/2^M)$ := mantissa, $f$ := significand or precision, $e$ := exponent stored in the biased representation, and $k = 2^{E-1} - 1$ := exponent bias.

Note that rather than use a sign bit in the exponent, the biased representation is used.

As scientific (normalized) notation is used, the leading 1 in $(1 + f/2^M)$ is assumed and need not be stored, so the number of bits needed is effectively reduced by 1 bit!

IEEE standards

\textbf{Single precision}: $M = 23$ bits, $E = 8$ bits.

\textbf{Double precision}: $M = 52$ bits, $E = 11$ bits.
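These fields can be inspected directly. A Python sketch for double precision (assuming the IEEE 754 binary64 layout: 1/11/52 bits, bias $k = 1023$):

```python
import struct

def fields(x: float):
    """Split a double into its sign, biased exponent, and fraction bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF          # E = 11 biased exponent bits
    f = bits & ((1 << 52) - 1)        # M = 52 fraction bits
    return sign, e, f

sign, e, f = fields(1.5)              # 1.5 = +(1 + 2**51/2**52) * 2**0
assert (sign, e - 1023, f) == (0, 0, 1 << 51)
```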

Reserved numbers

0 is stored as $\pm 1.0 \cdot 2^{-k}$, which amounts to setting all non-sign bits to 0: note that $-0$ and $+0$ are distinct (to indicate different underflow conditions while performing arithmetic).

$\pm\infty = \pm 1.0 \cdot 2^{(2^E - 1) - k}$, which amounts to setting $f = 0$, with the exponent bits being $111\ldots$

NaN $= \pm 1.f \cdot 2^{(2^E - 1) - k}$: identical to $\pm\infty$, except $f \neq 0$.
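These reserved patterns can be verified with the same bit-splitting idea (Python sketch, binary64 assumed):

```python
import math
import struct

def fields(x: float):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

assert fields(0.0) == (0, 0, 0)            # +0: all non-sign bits 0
assert fields(-0.0) == (1, 0, 0)           # -0: distinct from +0
assert fields(math.inf) == (0, 0x7FF, 0)   # +inf: e all ones, f = 0
assert fields(-math.inf) == (1, 0x7FF, 0)
_, e, f = fields(math.nan)
assert e == 0x7FF and f != 0               # NaN: e all ones, f != 0
```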

Range

Allowing for the reserved numbers and considering the ranges of the $M$- and $E$-bit fields, we can work out the representable range.

Smallest non-zero: $\pm 1.0\ldots01 \cdot 2^{-k}$. Largest: $\pm 1.1\ldots1 \cdot 2^{(2^E - 2) - k} \approx \pm 2^{2^{E-1}}$.

So underflow or overflow is rare: in double precision, representable magnitudes reach about $2^{1024} \approx 1.8 \times 10^{308}$.
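For double precision the extremes can be checked via Python's sys.float_info (note that its min is the smallest *normal* number):

```python
import sys

# Largest double: all fraction bits set, largest non-reserved exponent.
assert sys.float_info.max == (2 - 2**-52) * 2.0**1023   # ~1.798e308
# Smallest normal double: 1.0 with the smallest non-reserved exponent.
assert sys.float_info.min == 2.0**-1022                 # ~2.225e-308
```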

Increasing gaps in different ranges

In $[1, 2]$, the gap between representable numbers is $2^{-M}$; in $[2, 4]$ it is $2^{-M+1}$; in $[2^j, 2^{j+1}]$ it is $2^{-M+j}$. The relative gap is thus only $2^{-M}$ throughout; hence the name 'floating point'. Matlab's eps, defined as (the number next to 1) $-$ 1, equals $2^{-M} = 2^{-52} \approx 2.2204 \times 10^{-16}$.
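This can be seen with math.ulp (Python 3.9+), which returns the gap at a given number:

```python
import math

assert math.ulp(1.0) == 2**-52                  # Matlab eps ~ 2.2204e-16
assert math.ulp(2.0) == 2**-51                  # gap doubles in [2, 4]
assert math.ulp(2.0**10) == 2**(10 - 52)        # gap in [2**10, 2**11]
assert math.ulp(2.0**10) / 2.0**10 == math.ulp(1.0)   # relative gap fixed
```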

Representation Accuracy

Let $\mathrm{fl} : \mathbb{R} \to \mathbb{Q}$ be a function which maps any number to its floating point representation.

Machine epsilon

In the case of floating point representations, we can guarantee that $\forall x \in \mathbb{R}$, given that $x$ is in the range of the floating point number system: $\left| \frac{\mathrm{fl}(x) - x}{x} \right| \leq \epsilon$.

In the case of a floating point number system with $M$ mantissa bits, $\epsilon_M = 2^{-M-1} = 2^{-53}$ (double precision). This is because, in storing the number $1.f \cdot 2^t$: 1] we assume that sufficient bits are available to store the exponent; 2] $M$ bits are available to store $f$, so rounding $f$ introduces an error of at most half a gap, $2^{-M}/2 = 2^{-M-1}$.

Yet, note that $\mathrm{fl}(\epsilon_M) = \epsilon_M$: being a power of 2, it is itself exactly representable.

Error guarantee view!

Then, the roundoff $\mathrm{fl}(a + \epsilon) = a$ is guaranteed whenever $|\epsilon| < \epsilon_M |a|$. So, $\mathrm{fl}(1 + 10^{-16}) = 1$.
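This is easy to check at the Python prompt:

```python
assert 1.0 + 1e-16 == 1.0     # 1e-16 < 2**-53, so fl(1 + 1e-16) = 1
assert 1.0 + 2**-53 == 1.0    # exactly halfway: ties round to even
assert 1.0 + 2**-52 > 1.0     # one full gap: the next representable number
```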

Accuracy of arithmetic operations

IEEE ensures: $\mathrm{fl}(x \circ y) = (x \circ y)(1 + \epsilon)$, $|\epsilon| \leq \epsilon_M$, if $\circ = +, -, \times$. See this by expanding $x(1 + \epsilon_1) \circ y(1 + \epsilon_2)$. $\mathrm{fl}(x \circ y)$ is written as $x \circledast y$.

For complex numbers and division, a somewhat larger bound must be used: $\epsilon_M = 2^{-M+2}$.
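The bound can be checked empirically with exact rational arithmetic via Python's fractions module; the inputs here are arbitrary, and real IEEE operations (including division) are correctly rounded, so the $2^{-53}$ bound holds throughout:

```python
from fractions import Fraction

eps_M = Fraction(1, 2**53)     # 2**-(M+1) for M = 52
x, y = 0.1, 0.3
for exact, computed in [
    (Fraction(x) + Fraction(y), x + y),
    (Fraction(x) * Fraction(y), x * y),
    (Fraction(x) / Fraction(y), x / y),
]:
    rel = abs((Fraction(computed) - exact) / exact)
    assert rel <= eps_M        # relative error within machine epsilon
```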