# Optimization
There's more to performance than asymptotic (time) complexity. Not all instructions consume the same amount of time, so constant factors matter too. To optimize performance, we need to understand the system:
* How programs are compiled and executed
* How modern processors and memory system operate
* How to measure performance and identify bottlenecks
* How to improve performance without destroying code modularity and generality
Optimizing compilers provide an efficient mapping of the program to machine code:
* Register allocation
* Code selection and ordering (scheduling)
* Dead code elimination
* Eliminating minor inefficiencies
However, compilers **don't improve asymptotic efficiency**; choosing a better algorithm is up to the programmer.
## Generally Useful Optimizations
### Code Motion (Hoisting)
Reduce the frequency with which a computation is performed. If a computation always produces the same result, move it to a place where it is performed once and the result is reused. The most common case is moving code out of a loop.
```c {cmd=gcc args=[-Og -x c -c $input_file -o 4_1.o]}
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
```
<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>
```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_2.o]}
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
```
</td><td>
```c
void set_row_opt(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i;   /* computed once, outside the loop */
    for (j = 0; j < n; j++) {
        a[ni + j] = b[j];
    }
}
```
</td></tr>
<tr>
<td>
```sh {cmd hide}
while ! [ -r 4_1.o ]; do sleep .1; done; objdump -d 4_1.o
```
`imul` appears inside the loop: `i * n` is recomputed on every iteration.
</td>
<td>
```sh {cmd hide}
while ! [ -r 4_2.o ]; do sleep .1; done; objdump -d 4_2.o
```
Here `imul` has been moved out of the loop.
</td>
</tr>
</table>
Both versions contain a similar number of instructions, but the optimized version **executes fewer instructions**, since the multiply runs once instead of once per iteration.
GCC performs this transformation with the `-O1` option.
### Reduction in Strength
Replace a costly operation with a simpler one, for example a multiply by a power of 2 with a shift. Multiply and divide are the classic expensive operations: on Intel Nehalem, `imul` takes 3 CPU cycles, whereas `add` takes 1.
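A minimal sketch of the power-of-2 case (compilers typically do this rewrite themselves):
```c
/* Both functions compile to the same single shift instruction. */
long mul8(long x)   { return x * 8; }
long shift3(long x) { return x << 3; }
```
The table below applies the same idea to a loop: the per-iteration multiply `n * i` is replaced by an addition carried across iterations.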
<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>
```c
void test_reduction(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        long ni = n * i;   /* one multiply per outer iteration */
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
    }
}
```
</td><td>
```c
void test_reduction_opt(double *a, double *b, long n) {
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
        ni += n;   /* addition replaces the multiply n * i */
    }
}
```
</td></tr>
</table>
### Share Common Subexpressions
Reuse portions of expressions.
GCC will do this with `-O1`.
<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>
```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_3.o]}
double test_scs(double *val, long i, long j, long n) {
    double up, down, left, right;
    up    = val[(i - 1) * n + j];
    down  = val[(i + 1) * n + j];
    left  = val[i * n + (j - 1)];
    right = val[i * n + (j + 1)];
    return up + down + left + right;
}
```
</td><td>
```c
double test_scs_opt(double *val, long i, long j, long n) {
    double up, down, left, right;
    long inj = i * n + j;   /* common subexpression, computed once */
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    return up + down + left + right;
}
```
</td></tr>
</table>
```sh {cmd hide}
while ! [ -r 4_3.o ]; do sleep .1; done; objdump -d 4_3.o
```
The dump above shows only one `imul`, confirming that the common subexpression `i * n` is computed once and shared.
### Remove Unnecessary Procedure Calls
There is no mechanical rule here; use your judgment about which calls are actually needed (a sketch follows).
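A hedged sketch of the idea, using a hypothetical bounds-checked getter `get_elem`: the loop already guarantees the index is valid, so the per-element call and check are pure overhead.
```c
/* Hypothetical accessor that re-checks bounds on every call. */
double get_elem(const double *a, long i, long n) {
    return (i >= 0 && i < n) ? a[i] : 0.0;
}

double sum_checked(const double *a, long n) {
    double s = 0;
    for (long i = 0; i < n; i++)
        s += get_elem(a, i, n);   /* call + redundant check every iteration */
    return s;
}

double sum_direct(const double *a, long n) {
    double s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];                /* the loop condition already bounds i */
    return s;
}
```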
## Optimization Blockers
Compilers cannot always optimize your code.
```c
#include <string.h>

void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
```
The code above performs badly: the running time quadruples when the string length doubles. `strlen` is called on every loop iteration, and `strlen` itself is $O(n)$, so `lower` as a whole is $O(n^2)$.
We can fix this with code motion, hoisting the length computation out of the loop:
```c
#include <string.h>

void lower(char *s) {
    size_t i;
    size_t len = strlen(s);   /* length computed once, before the loop */
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
```
### #1 Procedure Calls
A procedure may have side effects, and a function may not return the same value for the same arguments. The compiler therefore treats a procedure call as a black box and applies only weak optimizations around it; strong optimizations such as **code motion** are blocked.
To enable them, either let the compiler inline the function (GCC can do this at `-O1` within a single file) or **do the transformation yourself**.
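A minimal sketch of the inlining remedy, with a hypothetical helper `square` (not from the original notes): once the call is inlined, the compiler can see that the computation is loop-invariant and hoist it.
```c
/* Hypothetical small helper; `static inline` lets GCC inline it at -O1. */
static inline long square(long x) { return x * x; }

void scale(double *a, long k, long n) {
    for (long i = 0; i < n; i++)
        a[i] /= square(k);   /* after inlining, k * k can be hoisted out of the loop */
}
```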
### #2 Memory Aliasing
```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_4.o]}
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++) {
            b[i] += a[i * n + j];
        }
    }
}
```
```sh {cmd hide}
while ! [ -r 4_4.o ]; do sleep .1; done; objdump -d 4_4.o -Msuffix
```
The compiler stores `b[i]` back to memory on every iteration, because it must allow for the possibility that updating `b` changes the contents of `a` (i.e., that `b` and `a` share memory: memory aliasing).
Memory aliasing means two different memory references specify the same location. It happens easily in C, because of address arithmetic and direct access to storage structures.
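The classic two-pointer illustration from CS:APP (a sketch): if `xp == yp`, the two versions below compute different results, so the compiler cannot rewrite one into the other.
```c
void twiddle1(long *xp, long *yp) {
    *xp += *yp;   /* if xp == yp: x becomes 2x ... */
    *xp += *yp;   /* ... then 4x */
}

void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;   /* if xp == yp: x becomes 3x */
}
```
For `sum_rows`, introducing a local accumulator removes the concern: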
```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_5.o]}
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;   /* local accumulator cannot alias a */
        for (j = 0; j < n; j++) {
            val += a[i * n + j];
        }
        b[i] = val;
    }
}
```
```sh {cmd hide}
while ! [ -r 4_5.o ]; do sleep .1; done; objdump -d 4_5.o -Msuffix
```
By accumulating into a local variable, we let the compiler keep the sum in a register and write `b[i]` only once per row.
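Alternatively, C99's `restrict` qualifier (an addition beyond these notes) lets the programmer promise that the pointers do not overlap; a sketch:
```c
/* restrict asserts a and b never alias, so the compiler may cache b[i]
   in a register even without the local-variable rewrite. */
void sum_rows_restrict(double *restrict a, double *restrict b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i * n + j];
    }
}
```
If the pointers do in fact alias, the behavior is undefined, so the promise must be genuine.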
## Exploiting Instruction-Level Parallelism (ILP)
Executing multiple instructions at the same time reduces the average cycles per instruction. Exploiting it requires a general understanding of modern processor design: the hardware can execute many operations in parallel, but performance is limited by data dependencies. Simple transformations can yield dramatic performance improvements, as the sketch below shows.
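A minimal sketch of one such transformation (function names are illustrative): splitting a sum across two accumulators breaks the single dependency chain, so the two floating-point adds can run in parallel.
```c
/* One accumulator: every add depends on the previous one. */
double sum1(const double *a, long n) {
    double s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two accumulators: two independent dependency chains, roughly
   halving the time when FP-add latency is the bottleneck.
   (Reassociating FP adds can change rounding slightly.) */
double sum2(const double *a, long n) {
    double s0 = 0, s1 = 0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    for (; i < n; i++)   /* handle an odd element count */
        s0 += a[i];
    return s0 + s1;
}
```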
### Superscalar Processors
A superscalar processor issues and executes multiple instructions in one cycle. With deep pipelining, data dependencies become the main limit on throughput.
For example, the Haswell CPU has the following functional units:
* 2 load
* 1 store
* 4 integer
* 2 FP mult
* 1 FP add
* 1 FP div
* 1 int mult
### Programming with AVX2
YMM registers: 256 bits wide, 16 registers in total.
**SIMD Operations**
Single precision (8 floats per instruction):
`vaddps %ymm0, %ymm1, %ymm1`
Double precision (4 doubles per instruction):
`vaddpd %ymm0, %ymm1, %ymm1`
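A minimal intrinsics sketch of the double-precision case (assuming GCC/Clang with `-mavx`; `immintrin.h` provides the `_mm256_*` intrinsics). For simplicity it assumes `n` is a multiple of 4.
```c
#include <immintrin.h>

/* c[i] = a[i] + b[i], 4 doubles per iteration. */
void vec_add(const double *a, const double *b, double *c, long n) {
    for (long i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* load 4 doubles into a YMM register */
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_add_pd(va, vb);    /* compiles to vaddpd */
        _mm256_storeu_pd(c + i, vc);
    }
}
```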