## TASK 3: Research science-based simulations on present GPUs and future extensions.

As part of the proposed work, we will develop technologies for supporting finite element simulations on multicore streaming processors. The upward trends in graphics processor (GPU) performance are impressive, even relative to conventional CPUs, which are themselves becoming progressively more powerful. A variety of forces in manufacturing, marketing, and applications are driving this trend, but the growing consensus is that the streaming architecture embodied in most modern graphics processors has inherent advantages in scalability.

Most current streaming processors rely on a SIMD programming model in which a single kernel is applied to a very large number of data items with no data dependencies among streams. Extensions to the C language, such as CUDA, provide this SIMD capability on standard C data structures, and the interested parties seem to be converging on open standards (e.g. OpenCL) for general purpose computing on the streaming architectures. The future of high-performance computing promises a variety of new architectures that embody the streaming paradigm.
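As a minimal sketch of the programming model described above, the following Python code emulates on the CPU how a single kernel is applied to many independent data items. The names `saxpy_kernel` and `launch` are hypothetical and chosen only for illustration; they are not part of CUDA or OpenCL, and a real GPU would run the kernel body for each item as a separate hardware thread rather than in a loop.

```python
# CPU emulation of the "one kernel, many independent items" streaming model.
# Assumption: `launch` is a stand-in for a GPU kernel launch, not a real API.

def saxpy_kernel(i, a, x, y, out):
    """Kernel body for item i: reads x[i] and y[i], writes only out[i]."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args):
    """Apply `kernel` to every index 0..n-1.

    Because no item depends on another, the iteration order is irrelevant,
    which is what allows a streaming processor to run the items in parallel.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The key property is that each invocation writes to a distinct output location, so no synchronization between items is required.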

The SCI Institute will leverage its considerable track record and expertise in scientific computing on GPUs to pursue the development of new technologies and algorithms for finite element simulations on GPU clusters. This work will take place in the context of the NVIDIA Center of Excellence at the University of Utah and the SCI Institute, one of three such centers in the country. As part of this Center, the SCI Institute, along with Hewlett Packard and NVIDIA, installed a large GPU computing cluster that consists of 32 NVIDIA S1070 systems (128 GPUs). The GPUs are connected to 64 high-end Hewlett Packard CPU servers by a high-speed InfiniBand network, purchased with grants from NSF, NIH, and the state of Utah. The new high-performance computing system will have a peak performance of over 128 teraflops.

In this context we will investigate several technical issues. First, we will examine alternative methods for solving linear systems on unstructured grids with streaming architectures. The conventional approach is to construct a sparse linear system and rely on a generic sparse linear solver for the GPU (e.g., the Concurrent Number Cruncher), which is promising but has demonstrated only a fraction of the potential compute power of these new architectures. The alternative is to rely on specific knowledge of the topology of the grid and of the equations to map such grids onto GPUs in a way that takes into account the limitations of memory and bandwidth. Understanding these tradeoffs is imperative for making effective use of these vast computational resources. Second, we will study the related problem of how to efficiently map certain classes of finite element simulations to the complete memory/bandwidth hierarchy of a GPU cluster. This will entail a hierarchical decomposition of the problem across multiple nodes, processors, blocks, and, finally, the SIMD threads of the GPU. The outcomes of this project will be quantitative comparisons of alternative approaches for appropriate physical systems, new algorithms and technologies for efficient solution of finite element simulations on GPU clusters, and software implementations that are compatible with the UINTA infrastructure described in the previous section and that are available to the community at large.
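To make the two approaches to the linear system concrete, the following Python sketch contrasts them on a deliberately small case: the stiffness operator for 1D linear finite elements. It illustrates the tradeoff only and is not the proposed method. The function names, the 1D mesh, and the element spacing `h` are all assumptions introduced here. The first path assembles a sparse matrix in compressed sparse row (CSR) form and applies it generically; the second uses knowledge of the mesh and the equations to apply the operator element by element without storing the matrix.

```python
def assemble_csr(num_nodes, elements, h):
    """Assemble the global stiffness matrix for 1D linear elements as CSR."""
    ke = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]  # element stiffness
    rows = [dict() for _ in range(num_nodes)]
    for nodes in elements:
        for a in range(2):
            for b in range(2):
                r, c = nodes[a], nodes[b]
                rows[r][c] = rows[r].get(c, 0.0) + ke[a][b]
    row_ptr, col_idx, vals = [0], [], []
    for r in rows:
        for c in sorted(r):
            col_idx.append(c)
            vals.append(r[c])
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def csr_matvec(row_ptr, col_idx, vals, u):
    """Generic sparse product y = A u: one independent dot product per row."""
    return [
        sum(vals[k] * u[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


def matrix_free_matvec(num_nodes, elements, h, u):
    """Same product y = A u, computed per element without storing A.

    Note: neighboring elements accumulate into shared nodes, so a parallel
    version would need some way to avoid conflicting writes.
    """
    y = [0.0] * num_nodes
    for n0, n1 in elements:
        flux = (u[n0] - u[n1]) / h
        y[n0] += flux
        y[n1] -= flux
    return y


elements = [(0, 1), (1, 2), (2, 3)]  # hypothetical 3-element mesh
h = 0.5
u = [0.0, 1.0, 4.0, 9.0]
row_ptr, col_idx, vals = assemble_csr(4, elements, h)
y_csr = csr_matvec(row_ptr, col_idx, vals, u)
y_free = matrix_free_matvec(4, elements, h, u)
print(y_csr, y_free)  # both: [-2.0, -4.0, -4.0, 10.0]
```

The assembled form must read the stored values and column indices from memory on every product. The matrix-free form recomputes the element contributions from the mesh itself. That trade of memory traffic for arithmetic is one of the memory and bandwidth considerations discussed above.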
