Y. Zhang, K. Ootsu, T. Yokota, and T. Baba (Japan)
Thread pipelining, commodity Chip Multiprocessors, clustered data communication, communication overhead.
In recent years, Chip Multiprocessors (CMPs) have emerged as a dominant architecture for higher performance. To fully utilize multiple cores and speed up legacy sequential programs, the key is to extract threads from them automatically. A thread pipelining technique called Decoupled Software Pipelining (DSWP), which is widely applicable to loops, was proposed recently. However, this technique cannot be applied directly to present commodity CMPs, because the communication overhead is so large that it offsets most of the benefit of parallelization. This paper proposes a new non-speculative thread pipelining technique, called clustered decoupled software pipelining (CDSWP), as an extension to DSWP. The goal of CDSWP is to multithread sequential programs on commodity CMPs without additional hardware. The main insight of CDSWP is that the pipelined threads communicate a clustered data set instead of a single data item. In this way, false sharing can be eliminated and the average cache latency can be reduced greatly. In a preliminary experiment on four commodity CMP architectures, we achieved maximum loop speedups ranging from 16% to 68% on benchmark programs.
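The clustered communication idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `producer`, `consumer`, `run_pipeline`, and `cluster_size` are hypothetical, and the squaring step is a stand-in workload. A Python queue does not model cache latency or false sharing, so the sketch only counts how many inter-thread transfers a two-stage pipeline needs when it forwards clusters instead of single items.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream (clusters are never None)


def producer(items, out_q, cluster_size, stats):
    """Stage 1: walk the input and forward it in clusters of `cluster_size`."""
    cluster = []
    for item in items:
        cluster.append(item)
        if len(cluster) == cluster_size:
            out_q.put(cluster)          # one transfer carries a whole cluster
            stats["transfers"] += 1
            cluster = []
    if cluster:                          # flush the final partial cluster
        out_q.put(cluster)
        stats["transfers"] += 1
    out_q.put(SENTINEL)


def consumer(in_q, results):
    """Stage 2: process each received cluster element by element."""
    while True:
        cluster = in_q.get()
        if cluster is SENTINEL:
            break
        for item in cluster:
            results.append(item * item)  # stand-in for the real loop body


def run_pipeline(items, cluster_size):
    """Run both stages; return (results, number of data transfers)."""
    q = queue.Queue()
    stats = {"transfers": 0}
    results = []
    t1 = threading.Thread(target=producer, args=(items, q, cluster_size, stats))
    t2 = threading.Thread(target=consumer, args=(q, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results, stats["transfers"]
```

For 100 input items, `run_pipeline(list(range(100)), 1)` performs 100 transfers, while `run_pipeline(list(range(100)), 8)` performs 13 (twelve full clusters plus one partial cluster) and produces the same results in the same order.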