yenru0/2025-02-SystemProgramming

yenru0 4c2c0363b3 complement many

2025-10-11 08:39:17 +09:00

Floating Point

Fractional Binary Number

representation:

for a bit string b = b_i \dots b_1 b_0 . b_{-1} \dots b_{-j} of w = i + j + 1 bits, the value is

\sum_{k = -j}^{i}b_k\times 2^k

for example:

5+3/4 = 23/4 = 101.11_2
1 7/16 = 23/16 = 1.0111_2
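
As a rough illustration of the sum above, the following sketch evaluates a fractional binary string digit by digit. The helper names `frac_bin_value` and `frac_bin_demo` are made up for this note, and the input is assumed to contain only `0`, `1`, and at most one `.`.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: value of a fractional binary string such as "101.11".
 * Assumes the string holds only '0', '1', and at most one '.'. */
double frac_bin_value(const char *s) {
  double value = 0.0;
  double weight = 0.5; /* weight of the next fractional bit, starting at 2^-1 */
  int seen_point = 0;
  for (; *s != '\0'; s++) {
    if (*s == '.') {
      seen_point = 1;
    } else if (!seen_point) {
      value = value * 2.0 + (*s - '0'); /* integer part: shift left, add bit */
    } else {
      value += (*s - '0') * weight; /* fractional part: add b_k * 2^k */
      weight /= 2.0;
    }
  }
  return value;
}

/* Checks the two examples from the notes and prints their values. */
void frac_bin_demo(void) {
  assert(frac_bin_value("101.11") == 23.0 / 4.0);
  assert(frac_bin_value("1.0111") == 23.0 / 16.0);
  printf("%g %g\n", frac_bin_value("101.11"), frac_bin_value("1.0111"));
}
```

Calling `frac_bin_demo()` from a `main` function runs both checks.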

Limitations

Can only exactly represent numbers of the form x/2^k; other rationals such as 1/3 or 1/10 have repeating binary expansions
There is only one position for the binary point within the w bits, so the range is limited: very small and very large values cannot both be represented
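
The first limitation can be checked mechanically: a rational p/q has a finite binary expansion only if its reduced denominator is a power of two. A small sketch, with hypothetical helper names `gcd_u`, `is_binary_fraction`, and `binary_fraction_demo`, assuming q > 0:

```c
#include <assert.h>

/* Greatest common divisor (Euclid), used to reduce p/q to lowest terms. */
static unsigned gcd_u(unsigned a, unsigned b) {
  while (b != 0) {
    unsigned t = a % b;
    a = b;
    b = t;
  }
  return a;
}

/* Hypothetical helper: returns 1 if p/q (q > 0) can be written as x / 2^k. */
int is_binary_fraction(unsigned p, unsigned q) {
  q /= gcd_u(p, q);          /* reduced denominator */
  return (q & (q - 1)) == 0; /* a power of two has exactly one bit set */
}

void binary_fraction_demo(void) {
  assert(is_binary_fraction(23, 4));  /* 101.11_2 */
  assert(is_binary_fraction(3, 12));  /* reduces to 1/4 */
  assert(!is_binary_fraction(1, 10)); /* 0.1 repeats in binary */
  assert(!is_binary_fraction(1, 3));
}
```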

IEEE Floating Point Definition

IEEE Standard 754

Driven by numerical concerns:

Nice standards for rounding, overflow, underflow
But hard to make fast in hardware
- Numerical analysts predominated over hardware designers in defining the standard

Representation

Form

(-1)^s M 2^E

s: sign bit
M: significand (mantissa), normally a fractional value in [1.0, 2.0)
E: exponent

Encoding

---
title: "Single Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"

---
title: "Double Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"
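
The single-precision layout can be pulled apart with shifts and masks. This is a minimal sketch that assumes `float` is a 32-bit IEEE 754 value on the target platform; the type `f32_fields` and the functions `f32_split` and `f32_split_demo` are names invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a single-precision value. */
typedef struct {
  uint32_t s;    /* 1 bit   */
  uint32_t exp;  /* 8 bits  */
  uint32_t frac; /* 23 bits */
} f32_fields;

/* Split a float into s / exp / frac (assumes float is IEEE 754 binary32). */
f32_fields f32_split(float f) {
  uint32_t bits;
  f32_fields r;
  memcpy(&bits, &f, sizeof bits); /* copy the bits without aliasing issues */
  r.s = bits >> 31;
  r.exp = (bits >> 23) & 0xFFu;
  r.frac = bits & 0x7FFFFFu;
  return r;
}

void f32_split_demo(void) {
  f32_fields one = f32_split(1.0f);
  f32_fields neg = f32_split(-2.5f); /* -1.25 * 2^1 */
  assert(one.s == 0 && one.exp == 127 && one.frac == 0);
  assert(neg.s == 1 && neg.exp == 128 && neg.frac == 0x200000u);
}
```

The double-precision case works the same way with a `uint64_t`, an 11-bit mask for exp, and a 52-bit mask for frac.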

There are three kinds of floating-point values: normalized, denormalized, and special

normalized

E = \exp - B

B = 2^{k-1}-1, where k is the number of exp bits

single: 127
double: 1023

M = 1.xxxxx_2, with an implied leading 1 (the x bits are frac)

minimum when \text{frac} = 000\ldots0 \quad (M = 1.0)
maximum when \text{frac} = 111\ldots1 \quad (M = 2.0 - \epsilon)
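
To make the normalized formulas concrete, here is a sketch that rebuilds the value of a single-precision bit pattern from its fields. It assumes the pattern is normalized (exp neither all 0s nor all 1s); `pow2i` and `f32_decode_normalized` are hypothetical helper names.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* 2^e by repeated multiplication; exact in double for the range used here. */
static double pow2i(int e) {
  double r = 1.0;
  for (; e > 0; e--) r *= 2.0;
  for (; e < 0; e++) r /= 2.0;
  return r;
}

/* Hypothetical helper: value of a normalized single-precision bit pattern. */
double f32_decode_normalized(uint32_t bits) {
  const int k = 8;                     /* number of exp bits */
  const int bias = (1 << (k - 1)) - 1; /* B = 2^(k-1) - 1 = 127 */
  uint32_t s = bits >> 31;
  int exp = (int)((bits >> 23) & 0xFFu);
  uint32_t frac = bits & 0x7FFFFFu;
  double m, v;
  assert(exp != 0 && exp != 0xFF);      /* normalized encodings only */
  m = 1.0 + (double)frac / 8388608.0;   /* M = 1.frac, 8388608 = 2^23 */
  v = m * pow2i(exp - bias);            /* M * 2^E with E = exp - B */
  return s ? -v : v;
}
```

For example, `f32_decode_normalized(0x3F800000)` gives 1.0 (exp equals the bias, frac is 0), and `0x7F7FFFFF` gives the largest finite value, which matches `FLT_MAX` from `<float.h>`.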

denormalized

when exp=000...0

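E = 1 - \text{Bias} (not 0 - \text{Bias})

M = 0.xxxxx_2 (no implied leading 1)

frac = 000...0 represents zero (+0 or -0, depending on s)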
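
A sketch of the denormalized case for single precision, where E = 1 - 127 = -126 and M = frac / 2^23, so the value is frac × 2^-149. The name `f32_decode_denorm` is invented for this note; the hexadecimal floating constant `0x1p-149` is standard C99 syntax for 2^-149.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Hypothetical helper: value of a single-precision pattern with exp == 0. */
double f32_decode_denorm(uint32_t bits) {
  uint32_t s = bits >> 31;
  uint32_t exp = (bits >> 23) & 0xFFu;
  uint32_t frac = bits & 0x7FFFFFu;
  double v;
  assert(exp == 0);              /* denormalized encodings only */
  v = (double)frac * 0x1p-149;   /* (frac / 2^23) * 2^(1 - 127) */
  return s ? -v : v;
}
```

Using E = 1 - Bias makes the largest denormalized value, `f32_decode_denorm(0x007FFFFF)`, sit exactly one step of 2^-149 below the smallest normalized value `FLT_MIN` = 2^-126.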

special

when exp = 111...1

case exp = 111...1, frac = 000...0
- represents \infty (+\infty or -\infty, depending on s)
- result of an operation that overflows
case exp = 111...1, frac \neq 000...0
- represents NaN (Not a Number)
- used when no numeric value can be determined
  - e.g., sqrt(-1), inf - inf, inf * 0
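
The three kinds can be told apart from the exp and frac fields alone. A minimal sketch for single precision; the enum `f32_class` and the functions `f32_classify` and `f32_classify_demo` are hypothetical names, and zero is reported as denormalized because its exp field is all 0s.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
  F32_NORMALIZED,
  F32_DENORMALIZED, /* exp all 0s, includes +0 and -0 */
  F32_INFINITE,     /* exp all 1s, frac == 0 */
  F32_NAN           /* exp all 1s, frac != 0 */
} f32_class;

/* Hypothetical helper: classify a single-precision bit pattern. */
f32_class f32_classify(uint32_t bits) {
  uint32_t exp = (bits >> 23) & 0xFFu;
  uint32_t frac = bits & 0x7FFFFFu;
  if (exp == 0) return F32_DENORMALIZED;
  if (exp == 0xFFu) return frac == 0 ? F32_INFINITE : F32_NAN;
  return F32_NORMALIZED;
}

void f32_classify_demo(void) {
  assert(f32_classify(0x7F800000u) == F32_INFINITE);
  assert(f32_classify(0x7F800001u) == F32_NAN);
  assert(f32_classify(0x3F800000u) == F32_NORMALIZED);
  assert(f32_classify(0x00000000u) == F32_DENORMALIZED);
}
```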

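#include <stdio.h>
#include <string.h>

/* Binary literals with ' digit separators are C23 features. */
int main(void) {
  unsigned x_a = 0b0'11111111'00000000000000000000000; /* +infinity */
  unsigned x_b = 0b0'11111111'00000000000000000000001; /* NaN (frac != 0) */
  unsigned x_c = 0b0'01111111'00000000000000000000000; /* 1.0 (exp == bias) */
  float a, b, c;
  /* memcpy reinterprets the bits without violating strict aliasing. */
  memcpy(&a, &x_a, sizeof a);
  memcpy(&b, &x_b, sizeof b);
  memcpy(&c, &x_c, sizeof c);
  double cx = c; /* widening to double changes the encoding, not the value */
  unsigned long long cx_bits;
  memcpy(&cx_bits, &cx, sizeof cx_bits);
  printf("%08x: %f\n", x_a, a);
  printf("%08x: %f\n", x_b, b);
  printf("%08x: %f\n", x_c, c);
  printf("%016llx: %f\n", cx_bits, cx);
  return 0;
}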

while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out

Properties

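Floating-point +0 has the same bit pattern as integer 0 (all bits are 0)
Can (almost) use unsigned integer comparison on the bit patterns: the sign bit must be considered first, -0 = +0 needs special handling, and NaN is problematic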
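
One way to see the "almost" is to map each bit pattern to an unsigned key whose order matches the numeric order. The sketch below assumes 32-bit IEEE 754 `float` and ignores NaN; `f32_bits`, `f32_order_key`, and `f32_order_demo` are hypothetical helpers. Note that the key places -0 just below +0, although the two compare equal as floats.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float (assumes IEEE 754 binary32). */
uint32_t f32_bits(float f) {
  uint32_t b;
  memcpy(&b, &f, sizeof b);
  return b;
}

/* Hypothetical helper: key whose unsigned order matches float order.
 * Negative values: flip all bits (larger magnitude becomes a smaller key).
 * Non-negative values: set the sign bit so they sort above all negatives. */
uint32_t f32_order_key(float f) {
  uint32_t b = f32_bits(f);
  return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

void f32_order_demo(void) {
  assert(f32_bits(0.0f) == 0u); /* FP +0 has the same bits as integer 0 */
  assert(f32_bits(0.5f) < f32_bits(1.0f)); /* non-negative: raw bits order */
  assert(f32_order_key(-1.0f) < f32_order_key(-0.5f));
  assert(f32_order_key(-0.5f) < f32_order_key(0.0f));
  assert(f32_order_key(1.0f) < f32_order_key(FLT_MAX));
}
```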

Arithmetic

x +_f y = \text{Round}(x+y)

x \times_f y = \text{Round}(x\times y)

Idea:

Compute the exact result
Make it fit into the desired precision
- possibly overflow if the exponent is too large
- possibly round to fit into frac
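
The effect of "compute exactly, then fit" shows up as soon as the exact sum needs more than 24 significant bits. A small sketch, assuming IEEE 754 single precision with the default rounding mode; `f32_add` and `f32_add_demo` are hypothetical names, and the `volatile` store is only there to force the result to be rounded to `float`.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: x +_f y, rounded to single precision on the store. */
float f32_add(float x, float y) {
  volatile float r = x + y;
  return r;
}

void f32_add_demo(void) {
  /* 2^24 + 1 is not representable; the tie rounds to the even neighbor. */
  assert(f32_add(16777216.0f, 1.0f) == 16777216.0f);
  assert(f32_add(16777216.0f, 3.0f) == 16777220.0f);
  /* Rounding makes addition non-associative. */
  assert(f32_add(f32_add(3.14f, 1e10f), -1e10f) == 0.0f);
  assert(f32_add(3.14f, f32_add(1e10f, -1e10f)) == 3.14f);
  /* Too large for the exponent range: the result overflows to infinity. */
  assert(f32_add(FLT_MAX, FLT_MAX) > FLT_MAX);
}
```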

Rounding

Towards zero
Round down (toward -\infty)
Round up (toward +\infty)
Nearest even* (default)

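Nearest even is the default rounding mode. Any other rounding mode is hard to get without dropping into assembly, although C99 provides rounding-mode management (fesetround in <fenv.h>).

This mode is the default because it reduces statistical bias: ties go up about half of the time and down the other half, whereas the other modes consistently push results in one direction.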
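
The four modes can be compared on a few values by rounding to whole numbers. This is a plain software simulation, not a use of the hardware rounding mode; the function names are invented for this note and the input is assumed to fit in a `long`.

```c
#include <assert.h>

/* C's conversion from double to long truncates, i.e. rounds toward zero. */
long round_toward_zero(double x) { return (long)x; }

/* Toward -infinity: step down when truncation moved a negative value up. */
long round_down(double x) {
  long t = (long)x;
  return (x < (double)t) ? t - 1 : t;
}

/* Toward +infinity: step up when truncation moved a positive value down. */
long round_up(double x) {
  long t = (long)x;
  return (x > (double)t) ? t + 1 : t;
}

/* Nearest, with exact ties going to the even neighbor. */
long round_nearest_even(double x) {
  long lo = round_down(x);
  double diff = x - (double)lo; /* in [0, 1) */
  if (diff > 0.5) return lo + 1;
  if (diff < 0.5) return lo;
  return (lo % 2 == 0) ? lo : lo + 1;
}

void rounding_modes_demo(void) {
  assert(round_toward_zero(-1.5) == -1);
  assert(round_down(-1.5) == -2);
  assert(round_up(-1.5) == -1);
  assert(round_nearest_even(1.5) == 2);
  assert(round_nearest_even(2.5) == 2); /* tie goes to the even value */
  assert(round_nearest_even(-1.5) == -2);
}
```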

For binary fractional numbers:

"Even" when the least significant bit is 0
"Halfway" when the bits to the right of the rounding position are 100..._2

For example, rounding to the nearest 1/4 (2 bits to the right of the binary point):

| Binary value | Rounded | Action |
|---|---|---|
| `10.00011` | `10.00` | (<1/2) Down |
| `10.00110` | `10.01` | (>1/2) Up |
| `10.11100` | `11.00` | (=1/2) Up |
| `10.10100` | `10.10` | (=1/2) Down |

Bit layout of the exact result: BBGRXXX

G: Guard bit: LSB of result
R: Round bit: first bit removed
X: Sticky bits: the remaining removed bits; S is the OR of them (001 gives S = 1, 000 gives S = 0)

Round up conditions

R = 1, S = 1 -> more than halfway (> 0.5), round up
G = 1, R = 1, S = 0 -> exactly halfway and the LSB is odd, round up to even
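
The guard, round, and sticky bits translate directly into integer code. The sketch below drops the low n bits of an unsigned value using round-to-nearest-even; `grs_round` and `grs_round_demo` are hypothetical names, and 1 <= n <= 31 is assumed. The demo reuses the rounding-to-1/4 examples, written as integers with five fractional bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: drop the low n bits of v (1 <= n <= 31), rounding to
 * nearest with ties to even, using the guard, round, and sticky bits. */
uint32_t grs_round(uint32_t v, unsigned n) {
  uint32_t kept = v >> n;                         /* bits that survive */
  uint32_t g = kept & 1u;                         /* guard: LSB of result */
  uint32_t r = (v >> (n - 1)) & 1u;               /* round: first removed bit */
  uint32_t s = (v & ((1u << (n - 1)) - 1u)) != 0; /* sticky: OR of the rest */
  if (r && (s || g)) {
    kept++; /* above halfway, or exactly halfway with an odd LSB */
  }
  return kept;
}

void grs_round_demo(void) {
  /* Values have 5 fractional bits; dropping 3 leaves 2 (units of 1/4). */
  assert(grs_round(0x43u, 3) == 0x8u); /* 10.00011 -> 10.00 */
  assert(grs_round(0x46u, 3) == 0x9u); /* 10.00110 -> 10.01 */
  assert(grs_round(0x5Cu, 3) == 0xCu); /* 10.11100 -> 11.00 */
  assert(grs_round(0x54u, 3) == 0xAu); /* 10.10100 -> 10.10 */
}
```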

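#include <stdio.h>
#include <string.h>

/* Binary literals with ' digit separators are C23 features. */
int main(void) {
  unsigned long long
           tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
  unsigned xb =    0b0'10000001'01000000000000000000011;
  (void)xb; /* not used below */
  double t;
  memcpy(&t, &tb, sizeof t); /* reinterpret the bits without aliasing issues */
  float x = (float)t; /* double -> float rounds frac from 52 to 23 bits */
  unsigned bits;
  memcpy(&bits, &x, sizeof bits);
  /* Print the float as s/exp/frac. */
  for (int i = 31; i >= 0; i--) {
    if (i == 31 - 1 || i == 31 - 1 - 8) {
      printf("/");
    }
    printf("%u", (bits >> i) & 1u); /* unsigned shift avoids 1 << 31 on int */
  }
  printf("\n");
  printf("%f\n", x);
  return 0;
}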

while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out
