Videos: Optimization 2
slides: sse-fp.pdf, optimization-2.pdf
Note: the second optional video is not included in the playlist.
For further optimizations, we need to understand pipelined architectures and how the processor detects and handles instruction-level parallelism. But we start with an aside on floating-point operations at the machine level, since we’ve only covered integer and address handling previously.
- Introduction to SSE2 instructions (as typically used in x86-64) to support floating-point arithmetic.
- About SSE2 support for explicitly parallel, single-instruction multiple-data (SIMD) operations. We don’t use these in the rest of the videos, but it’s good to know that they exist.
- In case you’re curious, SSE2 is one point in a series of floating-point and SIMD instruction sets.
- An introduction to processors that reorder and pipeline instruction sequences.
- A more specific investigation of how instructions are pipelined and how pipelining can affect the overall time needed to perform a computation.
- Attempting to expose opportunities for parallelism to the compiler by “unrolling” a loop to handle multiple array elements in a single iteration.
- Making unrolling more successful by choosing a different order of operations, which is acceptable for many applications.
- How the effect of unrolling can be limited by the amount of parallelism available from pipelining, and how additional functional units can help overcome those limits.
- Testing and reaching the limits of unrolling to improve performance.
- About branch prediction and how it helps enable instruction-level parallelism.