Johann Steinbrecher and Weijia Shang
Loop unrolling, Uniform dependence algorithms, Thread level parallelism, Loop skewing
In most previous work, loops are unrolled for optimizing the performance on single core architectures by reducing the loop overhead, exploiting instruction level parallelism or improving data locality. With the advent of multi core systems, multiple threads are required to saturate the computation resources. This paper applies loop unrolling to create multiple threads where a thread performs one iteration from a super iteration. We characterize loop unrolling by the unrolling factor, the number of iterations in a super iteration and the unrolling direction, the choice of iterations to form the super iteration. Data dependencies exist between different iterations. In this paper, a linear schedule is used to group independent iterations to form a wavefront, a hyperplane in the loop iteration space. Then, the loop is unrolled along the wavefront which guarantees all iterations in the same super iteration are independent. A sufficient condition for creating communication free threads is presented and applied. An example, the matrix multiplication algorithm is used to demonstrate the method. Our technique is applied to the matrix multiplication, the N-body simulation and a MRI reconstruction application and compared against traditional methods. The benchmarks score speedups of up to 11.53 on a 12 core machine.
Important Links:
Go Back