Optimization
There's more to performance than asymptotic (time) complexity.
Not all instructions consume the same amount of time: constant factors matter too. To optimize performance, we need to understand the system:
- How programs are compiled and executed
- How modern processors and memory system operate
- How to measure performance and identify bottlenecks
- How to improve performance without destroying code modularity and generality
The compiler provides an efficient mapping of the program to machine code:
- Register allocation
- Code selection and ordering (scheduling)
- Dead code elimination
- Eliminating minor inefficiencies
These optimizations don't improve asymptotic efficiency.
Generally Useful Optimizations
Code Motion (Hoisting)
Reduce the frequency with which a computation is performed. If it always produces the same result, move it to a place where it is computed once and the result reused, especially by moving code out of loops.
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
(The assembly comparison table is missing from these notes; only empty cells survive.) Both versions have roughly the same number of instructions in total, but the optimized version executes fewer: the loop-invariant product i * n is computed once before the loop instead of on every iteration.
GCC performs this optimization at -O1.
Reduction in Strength
Replace a costly operation with a simpler one.
For example, replace a multiply by a power of 2 with a shift. Multiply and divide are typically the expensive operations: on Intel Nehalem, imul requires 3 CPU cycles while add requires only 1.
(The comparison table is missing from these notes.) A typical instance: replace 16 * x with x << 4.
Share Common Subexpressions
Reuse portions of expressions.
GCC will do this with -O1
(The code comparison and the objdump output are missing from these notes.) The dump showed only one imul, confirming that common-subexpression elimination was applied: a shared index expression such as i * n + j is computed once and reused.
Remove Unnecessary Procedure Calls
The compiler rarely does this for you; use your own judgment about which calls are redundant.
Optimization Blockers
Compilers cannot always optimize your code.
void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
This code performs badly: running time quadruples when the string length doubles.
strlen is called on every loop iteration, and strlen itself is O(n), so the overall performance of lower is O(n^2).
The fix is code motion: move the length computation out of the loop.
void lower(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
#1 Procedure Calls
Procedures may have side effects, and a function may not return the same value for the same arguments.
The compiler therefore treats a procedure call as a black box and applies only weak optimizations near it; strong optimizations such as code motion are blocked.
To enable them, use inline functions (GCC can inline within a file at -O1) or do the optimization yourself.
#2 Memory Aliasing
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++) {
            b[i] += a[i * n + j];
        }
    }
}
The compiler stores b[i] back to memory on every iteration, because it must consider the possibility that the updates affect program behavior: a and b may refer to overlapping memory (memory aliasing).
Memory aliasing means two different memory references specify the same location. It is easy for this to happen in C, because of address arithmetic and direct access to storage structures.
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++) {
            val += a[i * n + j];
        }
        b[i] = val;
    }
}
By accumulating into a local variable, we get the optimized code easily: the compiler can keep val in a register and write b[i] only once.
Exploiting Instruction-Level Parallelism (ILP)
Executing multiple instructions at the same time reduces the average cycles per instruction. Exploiting this requires a general understanding of modern processor design: hardware can execute many operations in parallel.
- performance is limited by data dependencies
Simple transformations can yield dramatic performance improvements.
Superscalar Processors
A superscalar processor issues and executes multiple instructions in one cycle. Because instructions are also pipelined, data dependencies between instructions are what stall progress.
For example, the Haswell CPU has these functional units:
- 2 load
- 1 store
- 4 integer
- 2 FP mult
- 1 FP add
- 1 FP div
- 1 int mult
Programming with AVX2
YMM registers: 256 bits each, 16 registers in total.
SIMD Operations
For single precision (8 floats per instruction):
vaddps %ymm0, %ymm1, %ymm1
For double precision (4 doubles per instruction):
vaddpd %ymm0, %ymm1, %ymm1