# Floating Point

## Fractional Binary Number

Representation:

* For $w = i + j + 1$ bits of data $b$, with $i + 1$ bits to the left of the binary point and $j$ bits to the right, the value is
  $$\sum_{k = -j}^{i}b_k\times 2^k$$

For example:

* $5 + 3/4 = 23/4 = 101.11_2$
* $1 + 7/16 = 23/16 = 1.0111_2$

**Limitations**

* Can only exactly represent numbers of the form $x/2^k$
* There is just one setting of the binary point within the $w$ bits, so the range is limited: very small and very large values cannot both be represented

## IEEE Floating Point Definition

**IEEE Standard 754**

Driven by numerical concerns:

* Nice standards for rounding, overflow, underflow
* But hard to make fast in hardware
* Numerical analysts predominated over hardware designers in defining the standard

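### Representation

**Form**

$$(-1)^s \times M \times 2^E$$

* $s$: sign bit, which determines whether the number is negative or positive
* $M$: mantissa (significand), normally a fractional value in $[1.0,2.0)$
* $E$: exponent, which weights the value by a power of two
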
**Encoding**

```mermaid
---
title: "Single Precision"
config:
  packet:
    bitsPerRow: 32
    rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"
```

```mermaid
---
title: "Double Precision"
config:
  packet:
    bitsPerRow: 32
    rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"
```
There are three kinds of `float` encoding: **normalized**, **denormalized**, and **special**.

**normalized**

when `exp` is neither `000...0` nor `111...1`

$E = \exp - B$

$B = 2^{k-1}-1$, where $k$ is the number of exp bits

* single: 127
* double: 1023

$M = 1.xxxxx_2$, where `xxxxx` are the bits of frac and the leading 1 is implied

* minimum when $\text{frac} = 0000\ldots \quad (M = 1.0)$
* maximum when $\text{frac} = 1111\ldots \quad (M = 2.0 - \epsilon)$

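
As a worked example under the single-precision layout ($k = 8$, 23 frac bits), the value $5.75$ from the fractional binary section encodes as follows:

$$
\begin{aligned}
5.75 &= 101.11_2 = 1.0111_2 \times 2^2 \\
M &= 1.0111_2 \quad\Rightarrow\quad \text{frac} = 0111\,0000\,0000\,0000\,0000\,000_2 \\
E &= 2 \quad\Rightarrow\quad \exp = E + B = 2 + 127 = 129 = 1000\,0001_2
\end{aligned}
$$

So the bit pattern is `0 10000001 01110000000000000000000`, which is `0x40B80000`.
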
**denormalized**

when `exp = 000...0`

$E = 1 - B$ (instead of $0 - B$)

$M = 0.xxxxx_2$, where `xxxxx` are the bits of frac and there is no implied leading 1

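
For instance, with single precision ($B = 127$, 23 frac bits) the smallest and largest positive denormalized values $V$ work out as:

$$
\begin{aligned}
\text{frac} = 00\ldots01:&\quad M = 2^{-23},\; E = 1 - 127 = -126,\; V = 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45} \\
\text{frac} = 11\ldots11:&\quad M = 1 - 2^{-23},\; E = -126,\; V = (1 - 2^{-23}) \times 2^{-126} \approx 1.18 \times 10^{-38}
\end{aligned}
$$

Using $E = 1 - B$ rather than $0 - B$ puts the largest denormalized value just below the smallest normalized value, $1.0 \times 2^{-126}$, so the transition between the two kinds is smooth.
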
**special**

when `exp = 111...1`

* case `exp = 111...1, frac = 000...0`

    * represents $\infty$
    * the result of an operation that overflows

* case `exp = 111...1, frac != 000...0`

    * represents `NaN` (not a number)
    * used when no numeric value can be determined
    * e.g., `sqrt(-1)`, `inf - inf`, `inf * 0`

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_1.out]}
#include <stdio.h>
#include <string.h>

int main() {
    // Bit patterns written as sign'exp'frac (C23 binary literals with digit separators).
    unsigned x_a = 0b0'11111111'00000000000000000000000; // exp all ones, frac zero: +inf
    unsigned x_b = 0b0'11111111'00000000000000000000001; // exp all ones, frac nonzero: NaN
    unsigned x_c = 0b0'01111111'00000000000000000000000; // exp = bias, frac zero: 1.0
    float a, b, c;
    // memcpy reinterprets the bits without relying on pointer-cast aliasing.
    memcpy(&a, &x_a, sizeof a);
    memcpy(&b, &x_b, sizeof b);
    memcpy(&c, &x_c, sizeof c);
    // Widening to double keeps the value but changes the encoding.
    double cx = c;
    unsigned long long x_cx;
    memcpy(&x_cx, &cx, sizeof x_cx);
    printf("%08x: %f\n", x_a, a);
    printf("%08x: %f\n", x_b, b);
    printf("%08x: %f\n", x_c, c);
    printf("%016llx: %f\n", x_cx, cx);
    return 0;
}
```

```sh {cmd hide}
while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out
```
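### Properties

* FP zero is the same as integer zero: all bits are $0$

* Can (almost) use unsigned integer comparison on the bit patterns. The exceptions are the sign bit, which must be compared first, $-0 = +0$, and `NaN`, whose bit pattern is larger than that of any other value
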
### Arithmetic

$x +_f y = \text{Round}(x+y)$

$x \times_f y = \text{Round}(x\times y)$

Here $+_f$ and $\times_f$ are the floating-point operations, and the right-hand sides use exact arithmetic.

Idea:

1. Compute the exact result
2. Make it fit into the desired precision
    * overflow if the exponent is too large
    * **round** to fit into frac

#### Rounding

* Towards zero
* Round down ($-\infty$)
* Round up ($+\infty$)
* **Nearest Even** (default)

**Nearest Even** is the default rounding mode. Any other rounding mode is hard to get without dropping into assembly, although C99 has support for rounding mode management.

This rounding mode is used because it **reduces statistical bias**: the other modes always push results in one direction, so a set of rounded values is consistently over- or under-estimated.

For binary fractional numbers:

* "Even" when the least significant bit is $0$
* "Half way" when the bits to the right of the rounding position are $100\ldots_2$

So, for example, rounding to the nearest $1/4$ (2 bits to the right of the binary point):

| Binary Value | Rounded | Action       |
| ------------ | ------- | ------------ |
| `10.00011`   | `10.00` | (<1/2) Down  |
| `10.00110`   | `10.01` | (>1/2) Up    |
| `10.11100`   | `11.00` | (=1/2) Up    |
| `10.10100`   | `10.10` | (=1/2) Down  |

`BBGRXXX`

* `G`: **G**uard bit: LSB of the result
* `R`: **R**ound bit: first bit removed
* `X`: **S**ticky bit: OR of the remaining bits (`001` gives 1, `000` gives 0)

Round-up conditions:

1. R = 1, S = 1 -> `> .5`, round up
2. G = 1, R = 1, S = 0 -> exactly `.5`, round up to reach the even value

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_2.out]}
#include <stdio.h>
#include <string.h>

int main() {
    // Double bit pattern written as sign'exp'frac.
    unsigned long long
        tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
    unsigned xb = 0b0'10000001'01000000000000000000011; // not used below
    double t;
    memcpy(&t, &tb, sizeof t);
    // Converting to float must round the 52-bit frac to 23 bits.
    float x = t;
    unsigned xbits;
    memcpy(&xbits, &x, sizeof xbits);
    for (int i = 31; i >= 0; i--) {
        // Print a separator after the sign bit and after the 8 exp bits.
        if (i == 31 - 1 || i == 31 - 1 - 8) {
            printf("/");
        }
        // 1u avoids shifting a signed 1 into the sign bit when i == 31.
        printf("%d", !!(xbits & (1u << i)));
    }
    printf("\n");
    printf("%f\n", x);
    return 0;
}
```
```sh {cmd hide}
while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out
```
## Float Quiz