Challenges
Challenges To Be Addressed:
Developing and effectively deploying scientific applications on HPC platforms requires at least three categories of tools: performance analysis, debugging, and correctness checking. Without performance analysis and tuning, applications typically achieve less than 10% of peak performance because of bottlenecks in the memory system and inefficient use of the memory hierarchy. Parallel applications are additionally affected by load imbalance, inefficient communication, and contention from large-scale parallel I/O. Rapidly increasing core counts and concurrency magnify the impact of these bottlenecks.
The primary challenge for a performance analysis tool is to identify these bottlenecks and the sequence of events that causes them. A second challenge is to determine how well the application uses available system resources, such as parallelism, memory, and communication or I/O bandwidth, and whether it exploits them to their fullest extent. Understanding parallel performance requires information from many sources, which implies that the hardware, operating system, runtime system, and programming model all need to offer access to the information that they manage or control.
For applications that will run on Exascale platforms, reducing data movement, already an important goal, will become more critical for both performance and power consumption. Very few current-generation tools target reduced power consumption as a goal for application-level optimization.
As we move towards Exascale systems, we will need a new generation of performance tools that will:
- Enable automatic analysis capabilities that identify and locate performance bottlenecks, attribute them to their root causes, and associate them with application source;
- Have access to information controlled or managed by programming models, operating system, runtime system, and hardware architecture, given that performance data is becoming more hierarchical in modern architectures;
- Offer new in-situ analysis and presentation techniques to turn measured or sampled performance data into application insight; the gap between performance assessment by skilled performance engineers and the performance realized by codes must be closed;
- Measure and analyze on-chip and off-chip network traffic, exposing congestion issues in communication and I/O;
- Measure and analyze memory system performance and hardware support for performance, providing insight into memory locality, data movement across the memory hierarchy, unnecessary replication, inefficient allocations, etc., and helping to reduce data movement;
- Measure and analyze thread metrics, such as loop overhead measurements, detection of synchronization bottlenecks, and the startup time of threaded regions;
- Measure and analyze power consumption of codes, correlating the results to the application source code; these tools need to make use of system-wide monitors available at the board or rack level, as well as sensors internal to the processor or chipset;
- Monitor health and status of system resources, which include fault detection and feedback into the software stack; and
- Provide real-time analysis results and responses to the entire software stack in order to help optimize application codes while they run (for example, process migration may be triggered when the performance analysis tool identifies load imbalance).
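The load imbalance trigger mentioned in the last item can be illustrated with a small sketch. It assumes that per-rank times for some code region have already been collected by a measurement layer, and it uses one common imbalance metric, the maximum time divided by the mean time, minus one. The function names and the 0.25 threshold are hypothetical and chosen only for illustration; a real tool would take its measurements and its policy from the runtime system.

```python
from statistics import mean


def load_imbalance(per_rank_seconds):
    """Return max/mean - 1 for a list of per-rank times.

    0.0 means perfectly balanced; 1.0 means the slowest rank
    takes twice as long as the average rank.
    """
    avg = mean(per_rank_seconds)
    if avg == 0:
        return 0.0  # nothing was measured, so nothing is imbalanced
    return max(per_rank_seconds) / avg - 1.0


def should_rebalance(per_rank_seconds, threshold=0.25):
    """Hypothetical policy: request rebalancing above a fixed threshold."""
    return load_imbalance(per_rank_seconds) > threshold


if __name__ == "__main__":
    # Illustrative input only, not measured data.
    times = [1.0, 1.0, 1.0, 5.0]
    print(load_imbalance(times), should_rebalance(times))
```

A single scalar like this is cheap enough to compute while the application runs, which is the point of the real-time feedback item; attributing the imbalance to a source location requires the richer data described in the other items.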
In order to deal with codes with billions of concurrent threads, we will also need a new generation of debugging tools that automatically or semi-automatically reduce the problem to some form of hierarchical debugging. For example, a tool might first identify the group of cores where the “wolf cries,” and then enable a deep dive into root cause analysis on a single core or a smaller group of cores.
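One way to picture this reduction step is to group processes by some cheap signature of their state and treat the smallest groups as candidates for closer inspection. The sketch below assumes the signature is a call stack given as a list of function names per rank; the function names, the `max_group` parameter, and the sample stacks are hypothetical, and a real tool would gather the stacks through its own scalable collection mechanism.

```python
from collections import defaultdict


def group_by_signature(stack_by_rank):
    """Map each distinct call stack to the list of ranks showing it."""
    groups = defaultdict(list)
    for rank, stack in stack_by_rank.items():
        groups[tuple(stack)].append(rank)
    return dict(groups)


def suspect_ranks(stack_by_rank, max_group=1):
    """Return ranks whose stack is shared by at most max_group ranks.

    Assumption for this sketch: behavior seen on very few ranks is
    more likely to be anomalous and is worth a deeper look first.
    """
    suspects = []
    for ranks in group_by_signature(stack_by_rank).values():
        if len(ranks) <= max_group:
            suspects.extend(ranks)
    return sorted(suspects)


if __name__ == "__main__":
    # Illustrative input only: three ranks wait in a collective, one does not.
    stacks = {
        0: ["main", "solve", "MPI_Allreduce"],
        1: ["main", "solve", "MPI_Allreduce"],
        2: ["main", "solve", "MPI_Allreduce"],
        3: ["main", "solve", "write_checkpoint"],
    }
    print(suspect_ranks(stacks))
```

The grouping collapses many processes into a handful of equivalence classes, so a conventional debugger only needs to attach to one representative of each suspicious class.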
The new generation of debugging tools will need to be preceded by static and dynamic checks that can identify and mitigate errors. Such verification tools have been applied successfully to small-scale MPI programs, but significant new methods will be required to extend them to Exascale systems and to additional and new programming models.
Last but not least, as with previous generations of tools, we will need a strategy to ensure support and availability of the tools across multiple hardware generations. The strategy needs to include training users on the tools, and it needs to address coordination among the various tool development groups as well as coordination with application development groups. For existing tools, these are not research activities, and a new funding model needs to be discussed, which may include Facilities, vendors, and companies to support the full life cycle of the tools.