complement many

2025-10-11 08:39:17 +09:00
parent 3d651c4a8a
commit 4c2c0363b3
10 changed files with 992 additions and 6 deletions
--- a/notes/2.md
+++ b/notes/2.md
@@ -1,6 +1,207 @@
-# Machine Level Programming
+# Floating Point

-Architecture (ISA)
-* intel(x86): CISC
-* ARM(aarch64, aarch32): RISC
+## Fractional Binary Number

+Representation:
+
+* A $w = i + j + 1$ bit pattern $b_i b_{i-1} \cdots b_0 . b_{-1} \cdots b_{-j}$ represents the value
+$$\sum_{k = -j}^{i}b_k\times 2^k$$
+
+For example:
+* $5+3/4 = 23/4 = 101.11_2$
+* $1+7/16 = 23/16 = 1.0111_2$
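+
+Expanding the first example term by term with the sum above gives a quick check of the notation:
+$$101.11_2 = 1\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 + 1\cdot 2^{-1} + 1\cdot 2^{-2} = 4 + 1 + \tfrac{1}{2} + \tfrac{1}{4} = \tfrac{23}{4}$$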
+
+**Limitations**
+* Can only exactly represent numbers of the form $x/2^k$; other rational numbers have repeating bit patterns
+* There is only one position for the binary point within the $w$ bits, so the range is limited: very small and very large values cannot both be represented
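+
+As an illustration of the first limitation, $1/5$ is not of the form $x/2^k$, so its binary expansion never terminates and any finite number of bits only approximates it:
+$$\tfrac{1}{5} = 0.001100110011\ldots_2 = 0.\overline{0011}_2$$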
+
+## IEEE Floating Point Definition
+
+**IEEE Standard 754**
+
+Driven by numerical concerns:
+* Good standards for rounding, overflow, and underflow
+* But hard to make fast in hardware
+  * Numerical analysts predominated over hardware designers in defining the standard
+
+
+### Representation
+
+**Form**
+
+$$(-1)^s M 2^E$$
+
+* $s$: sign bit
+* $M$: significand (mantissa), normally a fractional value in $[1.0,2.0)$
+* $E$: exponent
+
+**Encoding**
+
+```mermaid
+---
+title: "Single Precision"
+config:
+    packet:
+        bitsPerRow: 32
+        rowHeight: 32
+---
+packet
+0: "s"
+1-8: "exp"
+9-31: "frac"
+```
+
+```mermaid
+---
+title: "Double Precision"
+config:
+    packet:
+        bitsPerRow: 32
+        rowHeight: 32
+---
+packet
+0: "s"
+1-11: "exp"
+12-63: "frac"
+```
+
+There are three kinds of floating-point values: **normalized**, **denormalized**, and **special**
+
+**normalized**
+
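+when `exp != 000...0` and `exp != 111...1`
+
+* $E = \text{exp} - B$, where $\text{exp}$ is the unsigned value of the exp field
+* $B = 2^{k-1}-1$ is the bias, where $k$ is the number of exp bits
+  * single: 127
+  * double: 1023
+* $M = 1.xxxxx_2$, with an implied leading 1 and $xxxxx$ the bits of frac
+  * minimum when $\text{frac} = 000\ldots0 \quad (M = 1.0)$
+  * maximum when $\text{frac} = 111\ldots1 \quad (M = 2.0 - \epsilon)$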
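+
+As a worked single-precision encoding (bias $127$), take $5+3/4 = 101.11_2$ from the fractional binary examples:
+$$
+\begin{aligned}
+101.11_2 &= 1.0111_2 \times 2^2 && (M = 1.0111_2,\ E = 2)\\
+\text{exp} &= E + 127 = 129 = 10000001_2\\
+\text{frac} &= 0111\,000\ldots0 && (\text{23 bits, leading 1 dropped})\\
+\text{bits} &= 0\;\,10000001\;\,01110000000000000000000 = \text{0x40B80000}
+\end{aligned}
+$$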
+
+**denormalized**
+
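+when `exp = 000...0`
+
+* $E = 1 - B$ (not $0 - B$, so the range continues smoothly from the smallest normalized value)
+* $M = 0.xxxxx_2$, with no implied leading 1
+* `frac = 000...0` represents zero ($+0$ and $-0$ have distinct bit patterns); any other frac gives a number very close to $0.0$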
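+
+For single precision (bias $127$, so $E = 1 - 127 = -126$ for denormalized values), the boundary cases work out as follows; note how the largest denormalized value sits just below the smallest normalized one:
+$$
+\begin{aligned}
+\text{smallest denormalized} &: \text{frac} = 00\ldots01,\quad 2^{-23}\times 2^{-126} = 2^{-149} \approx 1.4\times 10^{-45}\\
+\text{largest denormalized} &: \text{frac} = 11\ldots11,\quad (1 - 2^{-23})\times 2^{-126} \approx 1.18\times 10^{-38}\\
+\text{smallest normalized} &: \text{exp} = 00\ldots01,\ \text{frac} = 00\ldots00,\quad 1.0\times 2^{-126} \approx 1.18\times 10^{-38}
+\end{aligned}
+$$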
+
+**special**
+
+when `exp = 111...1`
+
+* case `exp = 111...1, frac = 000...0`
+
+  * represents $\pm\infty$
+  * result of an operation that overflows; division by zero such as `1.0 / 0.0` also yields it
+
+* case `exp = 111...1, frac != 000...0`
+
+  * represents `NaN` (not a number)
+  * used when no numeric value can be determined
+    * e.g., `sqrt(-1)`, `inf - inf`, `inf * 0`
+
+```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_1.out]}
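+#include <stdio.h>
+#include <string.h>
+
+int main(void) {
+  // Bit patterns written as sign ' exp ' frac
+  unsigned x_a = 0b0'11111111'00000000000000000000000;  // +infinity
+  unsigned x_b = 0b0'11111111'00000000000000000000001;  // NaN (frac != 0)
+  unsigned x_c = 0b0'01111111'00000000000000000000000;  // 1.0 (E = 127 - 127 = 0)
+  float a, b, c;
+  // memcpy reinterprets the bits without violating strict aliasing
+  memcpy(&a, &x_a, sizeof a);
+  memcpy(&b, &x_b, sizeof b);
+  memcpy(&c, &x_c, sizeof c);
+  double cx = c;  // same value, re-encoded with 11 exp bits and 52 frac bits
+  unsigned long long x_cx;
+  memcpy(&x_cx, &cx, sizeof x_cx);
+  printf("%08x: %f\n", x_a, a);
+  printf("%08x: %f\n", x_b, b);
+  printf("%08x: %f\n", x_c, c);
+  printf("%016llx: %f\n", x_cx, cx);
+  return 0;
+}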
+```
+
+```sh {cmd hide}
+while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out
+```
+
+### Properties
+
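+* Floating-point $+0$ has the same bit pattern as integer $0$ (all bits are 0)
+
+* Can (almost) use unsigned integer comparison on the bit patterns
+  * must first compare the sign bits
+  * must treat $-0 = +0$
+  * `NaN` is problematic: its bit pattern is greater than that of any other value, but every ordered comparison with `NaN` is false
+  * otherwise it works, including across denormalized, normalized, and infinity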
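+
+A few single-precision bit patterns illustrate the point. For non-negative values, the ordering of the values matches the ordering of the patterns read as unsigned integers:
+$$0.5 = \text{0x3F000000} \;<\; 1.0 = \text{0x3F800000} \;<\; 2.0 = \text{0x40000000} \;<\; +\infty = \text{0x7F800000}$$
+The sign bit breaks the correspondence: $-1.0 = \text{0xBF800000}$ is larger than all of the patterns above when read as an unsigned integer, which is why the comparison only "almost" works.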
+
+### Arithmetic
+
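+$x +_f y = \text{Round}(x+y)$
+
+$x \times_f y = \text{Round}(x\times y)$
+
+Here $+_f$ and $\times_f$ are the floating-point operations, and the right-hand sides use exact arithmetic.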
+
+Idea:
+1. Compute the exact result
+2. Make it fit into the desired precision
+   * overflow if the exponent is too large
+   * **round** to fit into frac
+
+#### Rounding
+
+* Towards zero
+* Round down ($-\infty$)
+* Round up ($+\infty$)
+* **Nearest Even** (default)
+
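+**Nearest Even** is the default rounding mode.
+Any other rounding mode is hard to get without dropping into assembly, although C99 supports rounding mode management through `fesetround` in `<fenv.h>`.
+
+This mode is used because it **reduces statistical bias**: ties round up about half the time and down the other half, so the errors do not accumulate in one direction.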
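+
+A small illustration of the bias, rounding four exact ties to the nearest integer (the values are chosen only for this example):
+$$
+\begin{aligned}
+\text{exact sum} &: 0.5 + 1.5 + 2.5 + 3.5 = 8\\
+\text{ties always up} &: 1 + 2 + 3 + 4 = 10\\
+\text{ties to even} &: 0 + 2 + 2 + 4 = 8
+\end{aligned}
+$$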
+
+For binary fractional numbers:
+* "Even" when the least significant bit is $0$
+* "Half way" when the bits to the right of the rounding position are $100\ldots_2$
+
+For example, rounding to the nearest $1/4$ (2 bits to the right of the binary point):
+
+| Binary Value | Rounded | Action     |
+| ------------ | ------- | ---------- |
+| `10.00011`   | `10.00` | (<1/2)Down |
+| `10.00110`   | `10.01` | (>1/2)Up   |
+| `10.11100`   | `11.00` | (=1/2)Up   |
+| `10.10100`   | `10.10` | (=1/2)Down |
+
+`BBGRXXX`
+
+* `G`: **G**uard bit: LSB of the result
+* `R`: **R**ound bit: first bit removed
+* `X`: **S**ticky bit: OR of the remaining removed bits (`001` gives 1, `000` gives 0)
+
+Round up conditions:
+1. $R = 1, S = 1$: more than half way (`>.5`)
+2. $G = 1, R = 1, S = 0$: exactly half way and the result is odd, so round up to even
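+
+Applying the guard, round, and sticky bits to the four table rows above (rounding to 2 fractional bits) reproduces each action:
+$$
+\begin{array}{l|c|c|c|l}
+\text{value} & G & R & S & \text{result}\\
+\hline
+10.00011_2 & 0 & 0 & 1 & R = 0 \Rightarrow \text{down to } 10.00_2\\
+10.00110_2 & 0 & 1 & 1 & R = 1,\ S = 1 \Rightarrow \text{up to } 10.01_2\\
+10.11100_2 & 1 & 1 & 0 & G = 1,\ R = 1,\ S = 0 \Rightarrow \text{up to } 11.00_2\\
+10.10100_2 & 0 & 1 & 0 & \text{tie, already even} \Rightarrow \text{down to } 10.10_2
+\end{array}
+$$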
+
+```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_2.out]}
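+#include <stdio.h>
+#include <string.h>
+
+int main(void) {
+  // double with E = 17; frac has 1s at positions 22 and 24 from the left
+  unsigned long long
+           tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
+  double t;
+  memcpy(&t, &tb, sizeof t);  // reinterpret the bits as a double
+  // float keeps 23 frac bits: G = 0, R = 1, S = 0, an exact tie,
+  // so round-to-even rounds down
+  float x = t;
+  unsigned bits;
+  memcpy(&bits, &x, sizeof bits);
+  // Print the float as sign/exp/frac
+  for (int i = 31; i >= 0; i--) {
+    if (i == 31 - 1 || i == 31 - 1 - 8) {
+      printf("/");
+    }
+    printf("%d", !!(bits & (1u << i)));  // 1u avoids signed overflow at i = 31
+  }
+  printf("\n");
+  printf("%f\n", x);
+  return 0;
+}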
+```
+```sh {cmd hide}
+while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out
+```
+
+
+
+
+## Float Quiz