complement many
209
notes/2.md
@@ -1,6 +1,207 @@
# Machine Level Programming

# Floating Point

Architecture (ISA)

* Intel (x86): CISC
* ARM (aarch64, aarch32): RISC

## Fractional Binary Number

Representation:

* For a $w = i + j + 1$ bit value $b = b_i \cdots b_0 . b_{-1} \cdots b_{-j}$, the represented number is
$$\sum_{k = -j}^{i}b_k\times 2^k$$

For example:

* $5 + 3/4 = 23/4 = 101.11_2$
* $1 + 7/16 = 23/16 = 1.0111_2$

**Limitations**

* Can only exactly represent numbers of the form $x/2^k$
* There is just one position for the binary point within the $w$ bits, so very small and very large values cannot both be represented

## IEEE Floating Point Definition

**IEEE Standard 754**

Driven by numerical concerns:

* Nice standards for rounding, overflow, underflow
* Hard to make fast in hardware
* Numerical analysts predominated over hardware designers in defining the standard

### Representation

**Form**

$$(-1)^s M 2^E$$

* $s$: sign bit
* $M$: significand (mantissa), normally a fractional value in $[1.0, 2.0)$
* $E$: exponent, weights the value by a power of two

**Encoding**

```mermaid
---
title: "Single Precision"
config:
  packet:
    bitsPerRow: 32
    rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"
```

```mermaid
---
title: "Double Precision"
config:
  packet:
    bitsPerRow: 32
    rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"
```

There are three kinds of `float` values: **normalized**, **denormalized**, and **special**.

**normalized**

when `exp` is neither `000...0` nor `111...1`

$E = \text{exp} - B$, where the bias is $B = 2^{k-1}-1$ and $k$ is the number of exp bits

* single: 127
* double: 1023

$M = 1.xxxxx_2$, where `xxxxx` are the bits of frac (the leading 1 is implied, not stored)

* minimum when $\text{frac} = 0000\ldots \quad (M = 1.0)$
* maximum when $\text{frac} = 1111\ldots \quad (M = 2.0 - \epsilon)$

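As a worked example of the normalized rules, here is a sketch that encodes $5.75 = 101.11_2$ (the value from the Fractional Binary Number section) in single precision. The choice of value is only illustrative.

$$5.75 = 101.11_2 = 1.0111_2 \times 2^2 \quad\Rightarrow\quad M = 1.0111_2,\ E = 2$$

$$\text{exp} = E + B = 2 + 127 = 129 = 10000001_2$$

$$\text{frac} = 0111\underbrace{00\ldots0}_{19\ \text{zeros}}$$

Putting the fields together gives `0 10000001 01110000000000000000000`, which is `0x40B80000`.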
**denormalized**

when `exp = 000...0`

$E = 1 - B$ (not $0 - B$, so the denormalized range joins the smallest normalized values without a gap)

$M = 0.xxxxx_2$ (no implied leading 1)

**special**

when `exp = 111...1`

* case `exp = 111...1, frac = 000...0`
    * represents $\infty$ (positive or negative, depending on $s$)
    * result of an operation that overflows
* case `exp = 111...1, frac != 000...0`
    * represents `NaN` (not a number)
    * used when no numeric value can be determined
    * e.g., `sqrt(-1)`, `inf - inf`, `inf * 0`

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_1.out]}
#include <stdio.h>
#include <string.h>

int main() {
    // Bit patterns are written as sign ' exp ' frac
    unsigned x_a = 0b0'11111111'00000000000000000000000; // exp all ones, frac == 0: +inf
    unsigned x_b = 0b0'11111111'00000000000000000000001; // exp all ones, frac != 0: NaN
    unsigned x_c = 0b0'01111111'00000000000000000000000; // exp == bias, frac == 0: 1.0
    float a, b, c;
    // memcpy reinterprets the bits without violating strict aliasing
    memcpy(&a, &x_a, sizeof a);
    memcpy(&b, &x_b, sizeof b);
    memcpy(&c, &x_c, sizeof c);
    double cx = c; // widening float to double is exact
    unsigned long long x_cx;
    memcpy(&x_cx, &cx, sizeof x_cx);
    printf("%08x: %f\n", x_a, a);
    printf("%08x: %f\n", x_b, b);
    printf("%08x: %f\n", x_c, c);
    printf("%016llx: %f\n", x_cx, cx);
    return 0;
}
```

```sh {cmd hide}
while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out
```

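### Properties

* Floating-point $+0$ has the same bit pattern as integer $0$ (all bits are zero)
* Can (almost) use unsigned integer comparison on the bit patterns; the exceptions are the sign bit (compare it first), $-0 = +0$, and `NaN`
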
### Arithmetic

$x +_f y = \text{Round}(x+y)$

$x \times_f y = \text{Round}(x\times y)$

Idea:

1. Compute the exact result
2. Make it fit into the desired precision
    * overflow if the exponent is too large
    * **round** to fit into frac

#### Rounding

* Towards zero
* Round down ($-\infty$)
* Round up ($+\infty$)
* **Nearest even** (default)

**Nearest even** is the default rounding mode.
Other rounding modes are hard to select without dropping into assembly, although C99 supports rounding-mode management through `<fenv.h>`.

This mode is the default because it **reduces statistical bias**: ties are rounded up about half of the time and down the other half.

For binary fractional numbers:

* "Even" when the least significant bit is $0$
* "Half way" when the bits to the right of the rounding position are $100\ldots_2$

For example, rounding to the nearest $1/4$ (2 bits to the right of the binary point):

| Binary Value | Rounded | Action       |
| ------------ | ------- | ------------ |
| `10.00011`   | `10.00` | (< 1/2) down |
| `10.00110`   | `10.01` | (> 1/2) up   |
| `10.11100`   | `11.00` | (= 1/2) up   |
| `10.10100`   | `10.10` | (= 1/2) down |

`BBGRXXX`

* `G`: **G**uard bit: LSB of the result
* `R`: **R**ound bit: first removed bit
* `XXX`: remaining removed bits; their OR is the **S**ticky bit (`001` gives 1, `000` gives 0)

Round-up conditions:

1. R = 1, S = 1: the removed part is greater than 1/2
2. G = 1, R = 1, S = 0: exactly half way and the result is odd, so round up to even

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_2.out]}
#include <stdio.h>
#include <string.h>

int main() {
    // Double bit pattern written as sign ' exp (11 bits) ' frac (52 bits)
    unsigned long long
        tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
    // Float bit pattern, not used below
    unsigned xb = 0b0'10000001'01000000000000000000011;
    double t;
    memcpy(&t, &tb, sizeof t);  // reinterpret the bits without aliasing issues
    float x = t;                // narrowing rounds the 52-bit frac to 23 bits
    unsigned bits;
    memcpy(&bits, &x, sizeof bits);
    // Print the float as s/exp/frac
    for (int i = 31; i >= 0; i--) {
        if (i == 31 - 1) {
            printf("/");
        } else if (i == 31 - 1 - 8) {
            printf("/");
        }
        printf("%d", !!(bits & (1u << i)));  // 1u avoids shifting into the sign bit
    }
    printf("\n");
    printf("%f\n", x);
}
```
```sh {cmd hide}
while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out
```

## Float Quiz