Performance Analysis Thrust
The performance of state-of-the-art applications on parallel machines is far below the limit set by Amdahl's law. Whether the machine is based on many-core processors, GPUs, FPGAs, or a heterogeneous combination, the most significant bottleneck is usually accessing data from the memory system. Thus, memory analysis and optimization are critical. The major challenge for memory analysis tools is delivering detailed insight without orders of magnitude of additional time, space, and execution resources.
Detailed application insight requires analysis of both data movement and data reuse. Such insight is typically gathered with memory reuse and modeling tools or memory simulators, which require orders of magnitude more resources: time and execution resources for tracing, and space for intermediate and final data. Some recent measurement techniques permit low-overhead application analysis but capture only limited representations of locality, such as reuse distance.
Our research focuses on low-overhead, high-resolution memory characterization, analysis, and prediction. Our characterization tasks are developing a memory analysis toolset that combines high-resolution system-level memory trace analysis with low-overhead measurement, with respect to both time and space. Our analysis tasks are developing diagnostic application signatures at multiple levels that provide abstract insight into opportunities for improving locality, memory ordering, and schedules for performance and power. Our modeling tasks are developing detailed models of advanced memory systems to reason about novel memory architectures for data-driven science.
MemGaze
As memory systems are the primary bottleneck in many workloads, effective hardware/software co-design requires a detailed understanding of memory behavior. Unfortunately, current analysis of word-level sequences of memory accesses incurs time slowdowns of O(100×). MemGaze is a memory analysis toolset that combines high-resolution trace analysis and low-overhead measurement, both with respect to time and space. MemGaze provides high resolution by collecting word-level memory access traces, where the highest resolution supported is back-to-back sequences. In particular, it leverages emerging Processor Tracing support to collect data. It achieves low overhead in space and time by leveraging sampling and various forms of hardware support for collecting traces. MemGaze provides several post-mortem trace processing methods, including multi-resolution analysis of locations vs. operations, accesses vs. spatio-temporal reuse, and reuse (distance, rate, volume) vs. access patterns.
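To make the reuse analysis concrete, the following minimal Python sketch computes an LRU stack (reuse) distance histogram from a word-level address trace. The trace format, cache-line size, and helper names are illustrative assumptions, not MemGaze's actual interfaces or algorithms.

```python
# Minimal sketch: LRU stack (reuse) distance histogram from an address trace.
# Assumptions (not MemGaze's actual trace format or algorithms): the trace is
# an in-memory list of byte addresses in program order; blocks are 64 bytes.
from collections import Counter

LINE_BYTES = 64

def reuse_distances(addresses, line=LINE_BYTES):
    """Yield the LRU stack distance of each access (None for first touches)."""
    stack = []  # least-recently-used block at the front, most-recent at the end
    for addr in addresses:
        blk = addr // line
        if blk in stack:
            idx = stack.index(blk)
            yield len(stack) - 1 - idx  # distinct blocks touched since last use
            stack.pop(idx)
        else:
            yield None                  # cold (first) access
        stack.append(blk)

def reuse_histogram(addresses):
    return Counter(d for d in reuse_distances(addresses) if d is not None)

if __name__ == "__main__":
    trace = [0x1000, 0x1040, 0x1000, 0x2000, 0x1040, 0x1000]
    print(sorted(reuse_histogram(trace).items()))  # [(1, 1), (2, 2)]
```

With sampled traces such as MemGaze's, histograms like this are computed per sampled window; the reconstruction MemGaze actually uses is described in the Cluster 2022 paper listed under Publications.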
Spatial Affinity
Memory systems achieve their best performance when accesses have high spatial locality. Spatial affinity enables reasoning about how best to obtain locality by changing allocations, data layouts, or code organization. We have developed three complementary metrics that elucidate spatial affinity and provide precise notions of how pairs of memory locations interact within time windows. We demonstrate the analysis on several applications, delivering both high-level and detailed insight into application and memory system performance.
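As a rough illustration of the time-window idea (not the actual MemFriend metrics, which are defined in the MemSys 2024 paper below), the following Python sketch scores spatial affinity by counting how often pairs of memory blocks are touched within the same fixed-size window of a trace; the window and line sizes are assumptions.

```python
# Illustrative sketch only: pairwise co-occurrence within fixed trace windows
# as a stand-in for spatial affinity. Window and line sizes are assumptions.
from collections import Counter
from itertools import combinations

def window_affinity(addresses, window=64, line=64):
    """Count, per block pair, the windows in which both blocks are accessed."""
    pair_counts = Counter()
    blocks = [addr // line for addr in addresses]
    for start in range(0, len(blocks), window):
        touched = sorted(set(blocks[start:start + window]))
        pair_counts.update(combinations(touched, 2))
    return pair_counts  # higher counts suggest the pair benefits from co-location
```

Pairs with high scores are candidates for placement in the same cache line, page, or allocation.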
- Spatial affinity analysis (Yasodha)
- Diagnostic application signatures: opportunities for improving locality or memory ordering
- Custom power templates from spatial-temporal locality (Nanda)
- Formulate fine-grained power control policies that coordinate processor power states with a workload's concurrency and spatial-temporal locality
- Develop a new approach for predictively coordinating processor power (frequency) with a workload's spatial-temporal locality and concurrency. Demonstrate a new ability to collect fine-grained power information, coordinated with workload operations and spatial-temporal locality. Auto-generate a customized application that improves power efficiency while sacrificing minimal performance. (An illustrative policy sketch follows this list.)
- Near-data processing for AI/ML (Jason Hou, summer)
- Offload data-reducing processing and filtering tasks within the Petastorm AI/ML framework so that tasks use near-data processing on storage backends within distributed file systems.
- Accelerate AI/ML applications using near-data processing within the storage layer.
- Emerging (with Dhruv): Evaluate memory architectures, memory performance, and contention using co-design with lightweight actor- and location-based workload analysis
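For the power-templates item above, the sketch below shows one hypothetical shape such a fine-grained policy could take: mapping a phase's measured concurrency and locality to a frequency step. The thresholds, frequency steps, and input metrics are illustrative assumptions, not the project's actual templates or control policy.

```python
# Hypothetical policy sketch: choose a core frequency for the next phase from
# measured concurrency and spatial-temporal locality. All thresholds and
# frequency steps are illustrative assumptions.
FREQ_STEPS_GHZ = [1.2, 2.0, 2.8, 3.5]  # assumed available frequency steps

def select_frequency(concurrency, locality):
    """Return a target frequency (GHz).

    concurrency: fraction of cores with runnable work, in [0, 1]
    locality:    fraction of accesses served by near memory levels, in [0, 1]
    """
    if locality < 0.3:
        # Memory-bound phase: higher frequency mostly adds stalls and power.
        return FREQ_STEPS_GHZ[0]
    if concurrency > 0.8 and locality > 0.7:
        # Compute-bound, cache-friendly phase: run at the highest step.
        return FREQ_STEPS_GHZ[-1]
    # Otherwise scale roughly with how cache-friendly the phase looks.
    idx = min(len(FREQ_STEPS_GHZ) - 1, int(locality * len(FREQ_STEPS_GHZ)))
    return FREQ_STEPS_GHZ[idx]
```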
Publications
- Shiyue Hou, Nathan R. Tallent, Li Wang, and Ningfang Mi, "Performance analysis of data processing in distributed file systems with near data processing," in 11th International Symposium on Networks, Computers and Communications (ISNCC), October 2024.
- Yasodha Suriyakumar, Nathan R. Tallent, Andrés Marquez, and Karen Karavanic, "MemFriend: Understanding memory performance with spatial-temporal affinity," in Proc. of the International Symposium on Memory Systems (MemSys 2024), September 2024.
- Ozgur O. Kilic, Nathan R. Tallent, Yasodha Suriyakumar, Chenhao Xie, Andrés Marquez, and Stephane Eranian, "MemGaze: Rapid and effective load-level memory and data analysis," in Proc. of the 2022 IEEE Conf. on Cluster Computing, IEEE, September 2022.
- Ozgur O. Kilic, Nathan R. Tallent, and Ryan D. Friese, "Rapid memory footprint access diagnostics," in Proc. of the 2020 IEEE Intl. Symp. on Performance Analysis of Systems and Software, IEEE Computer Society, May 2020.