2025-02-SystemProgramming/notes/2.md
2025-10-11 08:39:17 +09:00


Floating Point

Fractional Binary Number

representation:

  • for a w = i + j + 1 bit pattern b_i b_{i-1} \dots b_0 . b_{-1} \dots b_{-j}, the value is
\sum_{k = -j}^{i} b_k \times 2^k

for example:

  • 5+3/4 = 23/4 = 101.11_2
  • 1 7/16 = 23/16 = 1.0111_2
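
The sum above can be evaluated directly from a string of binary digits. The following is a minimal sketch; the function name `frac_bin_value` is made up for illustration, and it assumes the input contains only '0', '1', and at most one '.'.

```c
#include <stddef.h>

/* Value of a fractional binary string such as "101.11".
   Bits left of the point weigh 2^0, 2^1, ...; bits right of it
   weigh 2^-1, 2^-2, ... (the sum of b_k * 2^k).
   Hypothetical helper; no input validation. */
double frac_bin_value(const char *s) {
  double value = 0.0;
  double weight = 0.5; /* weight of the next fractional bit */
  size_t i = 0;
  /* integer part: shift left by one bit, then add the new bit */
  for (; s[i] != '\0' && s[i] != '.'; i++)
    value = value * 2.0 + (s[i] - '0');
  if (s[i] == '.')
    i++;
  /* fractional part: add bit * 2^-1, bit * 2^-2, ... */
  for (; s[i] != '\0'; i++) {
    value += (s[i] - '0') * weight;
    weight /= 2.0;
  }
  return value;
}
```

With the two examples above, `frac_bin_value("101.11")` gives 5.75 and `frac_bin_value("1.0111")` gives 1.4375.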

Limitations

  • Can only exactly represent numbers of the form x/2^k; other values such as 1/10 have a repeating binary expansion
  • There is just one position for the binary point within the w bits, so the range is limited: very small and very large values cannot both be represented
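
The first limitation also applies to C's `double`, which stores binary fractions. A small sketch, assuming IEEE 754 doubles evaluated in double precision; `exact_sum` is a hypothetical helper name.

```c
/* Returns 1 when a + b compares equal to expected, 0 otherwise.
   Hypothetical helper for illustration. */
int exact_sum(double a, double b, double expected) {
  return a + b == expected;
}
```

Under those assumptions, `exact_sum(0.25, 0.5, 0.75)` is 1 because all three values have the form x/2^k, while `exact_sum(0.1, 0.2, 0.3)` is 0 because none of 0.1, 0.2, 0.3 can be stored exactly.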

IEEE Floating Point Definition

IEEE Standard 754

Driven by numerical concerns:

  • Nice standards for rounding, overflow, underflow
  • But hard to make fast in hardware
    • Numerical analysts predominated over hardware designers in defining the standard

Representation

Form

(-1)^s M 2^E
  • s: sign bit
  • M: mantissa (significand), normally a fractional value in [1.0, 2.0)
  • E: exponent

Encoding

---
title: "Single Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"
---
title: "Double Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"

There are three kinds of floats: normalized, denormalized, and special

normalized

E = \exp - B
B = 2^{k-1} - 1 (the bias), where k is the number of exp bits

  • single: 127
  • double: 1023

M = 1.xxxxx (the x bits are frac; the leading 1 is implied)

  • minimum when \text{frac} = 0000... \quad (M = 1.0)
  • maximum when \text{frac} = 1111... \quad (M = 2.0 - \epsilon)
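
To make the normalized encoding concrete, the fields of a single-precision value can be pulled apart with shifts and masks. This is a sketch under the assumption that `float` is an IEEE 754 single; `struct fields`, `float_fields`, `normalized_E`, and `normalized_M` are names invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Single precision layout: 1 sign bit, 8 exp bits, 23 frac bits. */
struct fields {
  unsigned s;
  unsigned exp;
  uint32_t frac;
};

/* Split a float into its three bit fields (hypothetical helper). */
struct fields float_fields(float f) {
  uint32_t bits;
  struct fields r;
  memcpy(&bits, &f, sizeof bits); /* copy the raw bit pattern */
  r.s = bits >> 31;
  r.exp = (bits >> 23) & 0xFF;
  r.frac = bits & 0x7FFFFF;
  return r;
}

/* E = exp - Bias, with Bias = 127 for single precision. */
int normalized_E(struct fields x) { return (int)x.exp - 127; }

/* M = 1.frac, i.e. 1 + frac / 2^23. */
double normalized_M(struct fields x) { return 1.0 + x.frac / 8388608.0; }
```

For example, -6.5 = -1.101_2 × 2^2, so `float_fields(-6.5f)` yields s = 1, exp = 129, frac = 0x500000, and the helpers return E = 2 and M = 1.625.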

denormalized

when exp=000...0

E = 1 - Bias

M = 0.xxxxx
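
The denormalized rule can be checked against the hardware by decoding a bit pattern by hand. A minimal sketch for single precision (Bias = 127), assuming IEEE 754 floats; `denorm_value` and `bits_to_float` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Value of a single-precision pattern whose exp field is all zeros:
   (-1)^s * 0.frac * 2^(1 - 127). Hypothetical helper; the exp bits
   of the argument are ignored. */
double denorm_value(uint32_t bits) {
  uint32_t frac = bits & 0x7FFFFF;
  double M = frac / 8388608.0; /* M = 0.frac, 8388608 = 2^23 */
  double v = M * 0x1p-126;     /* E = 1 - Bias = -126 */
  return (bits >> 31) ? -v : v;
}

/* Reinterpret a 32-bit pattern as a float. */
float bits_to_float(uint32_t bits) {
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}
```

The smallest positive pattern, frac = 000...01, decodes to 2^-23 × 2^-126 = 2^-149, and `denorm_value(1)` should agree with `bits_to_float(1)`.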

special

when exp = 111...1

  • case exp = 111...1, frac = 000...0

    • represents \infty (the sign bit selects +\infty or -\infty)
    • result of an operation that overflows
  • case exp = 111...1, frac ≠ 000...0

    • represents NaN (Not a Number)
    • used when no numeric value can be determined
      • e.g., sqrt(-1), inf - inf, inf * 0
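#include <stdio.h>
#include <string.h>

/* Binary literals with ' separators need C23 (e.g. gcc -std=c2x). */
int main(void) {
  unsigned x_a = 0b0'11111111'00000000000000000000000; /* exp all ones, frac zero: +inf */
  unsigned x_b = 0b0'11111111'00000000000000000000001; /* exp all ones, frac nonzero: NaN */
  unsigned x_c = 0b0'01111111'00000000000000000000000; /* exp = Bias, frac zero: 1.0 */
  float a, b, c;
  /* memcpy reinterprets the bits without violating strict aliasing */
  memcpy(&a, &x_a, sizeof a);
  memcpy(&b, &x_b, sizeof b);
  memcpy(&c, &x_c, sizeof c);
  double cx = c; /* same value, double precision encoding */
  unsigned long long x_cx;
  memcpy(&x_cx, &cx, sizeof x_cx);
  printf("%08x: %f\n", x_a, a);
  printf("%08x: %f\n", x_b, b);
  printf("%08x: %f\n", x_c, c);
  printf("%016llx: %f\n", x_cx, cx);
  return 0;
}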
while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out

Properties

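  • FP zero has the same bit pattern as integer zero (all bits 0)

  • Can (almost) use unsigned integer comparison on the bit patterns; the exceptions are the sign bit (must be compared first, and -0 = +0) and NaN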
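
The comparison property can be tried out by comparing raw bit patterns. A sketch assuming IEEE 754 single precision; `float_bits` and `less_as_unsigned` are hypothetical names, and the shortcut is only meaningful for non-negative, non-NaN inputs.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float (hypothetical helper). */
uint32_t float_bits(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  return bits;
}

/* Compare two floats by their bit patterns as unsigned integers.
   Matches a < b only when both are non-negative and not NaN. */
int less_as_unsigned(float a, float b) {
  return float_bits(a) < float_bits(b);
}
```

`float_bits(0.0f)` is 0, and `less_as_unsigned(1.0f, 2.0f)` agrees with `1.0f < 2.0f`. The "almost" shows up with signs: `less_as_unsigned(-1.0f, 1.0f)` is 0 because the set sign bit makes the negative pattern larger, and `float_bits(-0.0f)` is 0x80000000 even though -0 equals +0.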

Arithmetic

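x +_f y = \text{Round}(x + y)
x \times_f y = \text{Round}(x \times y)

where +_f and \times_f are the floating point operations and the right-hand sides are computed exactly before rounding.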

Idea:

  1. Compute the exact result
  2. Make it fit into the desired precision
    • overflow if the exponent is too large
    • round to fit into frac
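
The two steps can be imitated for single-precision multiplication, because the product of two 24-bit significands fits in the 53-bit significand of a double. This is a sketch assuming IEEE 754 types, the default rounding mode, and float expressions evaluated in single precision; `mul_via_exact` and `times_two` are hypothetical names.

```c
#include <float.h>

/* Step 1: the double product of two floats is exact.
   Step 2: the cast rounds it to fit float precision.
   Only meant for products within float range. */
float mul_via_exact(float x, float y) {
  double exact = (double)x * (double)y;
  return (float)exact;
}

/* Doubling the largest finite float overflows to +infinity. */
float times_two(float x) { return x * 2.0f; }
```

Under those assumptions `mul_via_exact(x, y)` should equal `x * y`, and `times_two(FLT_MAX)` compares greater than `FLT_MAX`, which only infinity does.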

Rounding

  • Towards zero
  • Round down (towards -\infty)
  • Round up (towards +\infty)
  • Nearest Even (default)

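Nearest Even is the default rounding mode. The other rounding modes are hard to select without dropping into assembly, although C99 supports rounding mode management through <fenv.h> (fegetround / fesetround).

Nearest Even is used because it reduces statistical bias: halfway cases are rounded up about half of the time and down the other half.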

For binary fractional numbers:

  • "Even" when the least significant bit is 0
  • "Half way" when the bits to the right of the rounding position are 100..._2

So, for example, when rounding to the nearest 1/4 (2 bits to the right of the binary point):

Binary Value   Rounded   Action
10.00011       10.00     (<1/2) Down
10.00110       10.01     (>1/2) Up
10.11100       11.00     (=1/2) Up
10.10100       10.10     (=1/2) Down
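
The same halfway behavior can be observed when an int that needs more than 24 significant bits is converted to float. This sketch assumes IEEE 754 single precision with the default Nearest Even mode (the C standard itself leaves the rounding direction of this conversion to the implementation); `to_float` is a hypothetical name.

```c
/* Convert an int to float; values above 2^24 may need rounding. */
float to_float(int n) { return (float)n; }
```

Between 2^24 and 2^25 only even integers are representable, so 16777217 and 16777219 are exact halfway cases. Under the stated assumptions `to_float(16777217)` gives 16777216 (rounded down) and `to_float(16777219)` gives 16777220 (rounded up), each landing on the neighbor whose frac ends in 0.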

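BBGRXXX

  • B: remaining bits kept in the result
  • G: Guard bit: LSB of the result
  • R: Round bit: first bit removed
  • X: Sticky bits: the remaining removed bits; S is their OR (001 -> S = 1, 000 -> S = 0)

Round up conditions

  1. R = 1, S = 1 -> more than halfway (> 0.5), round up
  2. G = 1, R = 1, S = 0 -> exactly halfway with an odd LSB, round up to even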
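
The two round-up conditions translate almost directly into integer code. A minimal sketch on unsigned fixed-point values; `round_even` is a hypothetical name, and it assumes 1 <= drop < the bit width of unsigned.

```c
/* Drop the low `drop` bits of v, rounding to nearest even.
   Hypothetical helper; requires drop >= 1. */
unsigned round_even(unsigned v, unsigned drop) {
  unsigned kept = v >> drop;
  unsigned G = kept & 1u;               /* guard: LSB of the result */
  unsigned R = (v >> (drop - 1)) & 1u;  /* round: first removed bit */
  unsigned S = (v & ((1u << (drop - 1)) - 1u)) != 0; /* sticky: OR of the rest */
  /* condition 1: R = 1, S = 1; condition 2: G = 1, R = 1, S = 0 */
  if (R && (S || G))
    kept++;
  return kept;
}
```

Scaling the earlier table by 2^5 turns 10.00011_2 into 67, 10.00110_2 into 70, 10.11100_2 into 92, and 10.10100_2 into 84. With drop = 3 the function returns 8, 9, 12, and 10, which are 10.00_2, 10.01_2, 11.00_2, and 10.10_2 in units of 1/4.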
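#include <stdio.h>
#include <string.h>

/* Binary literals with ' separators need C23 (e.g. gcc -std=c2x). */
int main(void) {
  unsigned long long
           tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
  unsigned xb =    0b0'10000001'01000000000000000000011; /* spare float pattern, not used below */
  (void)xb;
  double t;
  memcpy(&t, &tb, sizeof t); /* reinterpret the bits without violating strict aliasing */
  float x = t;               /* drops the low 29 frac bits, rounding to nearest even */
  unsigned bits;
  memcpy(&bits, &x, sizeof bits);
  /* print the float as s/exp/frac */
  for (int i = 31; i >= 0; i--) {
    if (i == 31 - 1 || i == 31 - 1 - 8) {
      printf("/");
    }
    printf("%u", (bits >> i) & 1u); /* unsigned shift avoids the undefined 1<<31 */
  }
  printf("\n");
  printf("%f\n", x);
  return 0;
}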
while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out

Float Quiz