New architectural trends and extreme-scale parallelism challenge the efficient mapping of applications to large-scale and emerging systems. Fine-grained Asynchronous Many-Task Runtimes (AMR) offer a unique flexibility to hide the varying latencies of contentious operations and to facilitate the effective exploitation of abundant resources. Decomposing work into smaller asynchronous chunks mitigates the impact of unpredictable operations through a tighter overlapping of tasks. Unfortunately, this decomposition comes at a potentially steep price in the form of runtime overhead, which can at times rival the cost of the computation itself. This leads to the question: how large should a task be?
A common practice for application developers is to determine the granularity of a task experimentally after a code has been parallelized. Instead, we propose a new methodology based on an extended Roofline model to provide practical upper bounds on the throughput performance of an application. First, we extend the Roofline model to support not only latency-hiding analysis, but also a multidimensional amortized analysis. By combining this new methodology with a serial application and an AMR implementation, we can predict the worst-case runtime overhead attribution of individual runtime features prior to the development of parallel code. Thus, this runtime-centric methodology can serve as a vehicle for application/runtime codesign by providing a comprehensive bottleneck analysis based on existing runtime features.
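As a rough illustration of the kind of bound discussed above, the following minimal sketch combines the classic Roofline bound with a simple per-task overhead amortization. It is not the extended model from the paper: the function names, the fixed and non-overlapped per-task overhead, and all numeric values are assumptions chosen only to show how task size can trade off against runtime overhead.

```python
def roofline_bound(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Classic Roofline: attainable FLOP/s is capped by peak compute
    or by memory bandwidth times arithmetic intensity (FLOP/byte)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def amortized_throughput(bound_flops, task_flops, overhead_seconds):
    """Throughput when every task pays a fixed runtime overhead.

    Assumption for illustration: the overhead is not overlapped with
    computation, so it adds directly to each task's execution time.
    """
    compute_seconds = task_flops / bound_flops
    return task_flops / (compute_seconds + overhead_seconds)


if __name__ == "__main__":
    # Hypothetical machine: 1 TFLOP/s peak, 100 GB/s bandwidth, 2 FLOP/byte kernel.
    bound = roofline_bound(1e12, 1e11, 2.0)
    # Hypothetical 1 microsecond of runtime overhead per task.
    for task_flops in (1e3, 1e5, 1e7):
        rate = amortized_throughput(bound, task_flops, 1e-6)
        print(f"task of {task_flops:.0e} FLOP: {rate / bound:.1%} of the Roofline bound")
```

Under these assumptions, larger tasks amortize the fixed overhead and approach the Roofline bound, while very small tasks are dominated by runtime cost.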
Revised: June 5, 2018 | Published: September 13, 2016
Citation
Suetterlein J.D., J.B. Landwehr, A. Marquez, J.B. Manzano Franco, and G.R. Gao. 2016. Extending the Roofline Model for Asynchronous Many-Task Runtimes. In IEEE International Conference on Cluster Computing (CLUSTER 2016), September 12-16, 2016, Taipei, Taiwan, 493-496. Los Alamitos, California: IEEE Computer Society. PNNL-SA-119731. doi:10.1109/CLUSTER.2016.47