### Many Core Processors GROUP

#### Staff

Leader: Péter Szolgay, PhD, DSc

Researchers: Zoltán Nagy, PhD; András Kiss, PhD

PhD students: Csaba Nemes, Antal Hiba, and László Füredi, as well as several BSc and MSc students

#### Contact

#### Solution of Complex, Computationally Intensive Problems

There is an increasing demand for simulations of complex systems, which require high computing performance, but the scaling down of transistors can no longer support the high performance computing segment: the clock frequency of conventional microprocessors cannot be raised further due to power dissipation limits. There are two ways to use the growing number of transistors: either to create a huge number of simple processing units or to implement a relatively small number of complex processor cores. Unfortunately, application performance does not scale up linearly by simply connecting more and more multi-core processors in high performance computing clusters. To overcome this limitation, various accelerator architectures have appeared on the high performance computing market, such as graphics processing units (GPUs), the IBM Cell architecture, and Field Programmable Gate Arrays (FPGAs). The number of processing elements on these architectures ranges from tens to thousands, promising very high computing performance while requiring a smaller area and dissipating less power than conventional multi-core processors. Unfortunately, the design methodology required to use the computational capabilities of these new devices efficiently is still missing; different algorithmic thinking and fundamentally new design concepts have to be developed.

Some key challenges in this field that require high computing performance are the simulation of complex spatio-temporal dynamical systems, for example computational fluid dynamics (CFD) and molecular dynamics (MD), and the handling of huge databases, such as similarity search in chemical or biological databases. Another common property of these problems is their massive available parallelism: a one-to-one mapping is possible between grid points, atoms, or molecule descriptors and the processing elements, which can then operate in parallel. This description can be treated as a Virtual Cellular Machine, where the structure of the processing elements and the connections between them are defined according to the requirements of the application. Direct implementation of a Virtual Cellular Machine is usually not possible due to area limitations or data dependency issues; therefore, the Virtual Cellular Machine has to be mapped onto a Physical Cellular Machine, which can utilize the special features of the target architecture (IBM Cell: SPU; GPU: stream processor; FPGA: DSP slice and on-chip memory).

Numerical simulation of complex problems evolving in time plays an important role in scientific and engineering applications. The accurate behavior of dynamical systems can be understood using large scale simulations, which traditionally require expensive supercomputing facilities. A wide range of industrial processes and scientific phenomena involve gas or fluid flows over complex obstacles, e.g. air flow around vehicles and buildings, the flow of water in the oceans, or liquid in BioMEMS. In engineering applications the temporal evolution of non-ideal, compressible fluids is quite often modeled by the system of Navier-Stokes equations, which is based on the fundamental laws of mass, momentum, and energy conservation, extended by the dissipative effects of viscosity, diffusion, and heat conduction. The most obvious way to solve complex spatio-temporal problems is numerical approximation over a regular mesh structure; however, practical applications usually contain complex boundaries, which can be handled more efficiently by unstructured meshes.
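For reference, the inviscid core of this system, the 2-D Euler equations targeted by the implementations below, can be written in conservation form as:

```latex
\frac{\partial U}{\partial t}
+ \frac{\partial F(U)}{\partial x}
+ \frac{\partial G(U)}{\partial y} = 0,
\qquad
U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ E \end{pmatrix},\;
F(U) = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ u(E + p) \end{pmatrix},\;
G(U) = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ v(E + p) \end{pmatrix}
```

where \(\rho\) is the density, \((u, v)\) the velocity, \(E\) the total energy density, and the pressure is closed by the ideal gas law \(p = (\gamma - 1)\left(E - \tfrac{1}{2}\rho(u^2 + v^2)\right)\).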

##### Implementation on the IBM Cell Architecture

Using the first-order Lax-Friedrichs discretization method, a C-based CFD solver was developed and optimized for the SPEs of the Cell architecture. Since the relatively small local store of the SPEs cannot hold all the required data, an efficient buffering method was developed to save memory bandwidth. To utilize the full power of the Cell architecture, the computational work must be distributed among the SPEs. Despite the large memory bandwidth of the architecture, the memory bus can easily be saturated; therefore, an appropriate arrangement of data between the SPEs can greatly improve computing performance. One possible solution is to form a pipeline from the SPEs to compute several iterations in parallel. In this case, continuous data flow and synchronization are required between neighboring SPEs, but this communication pattern is well suited to the ring structure of the Element Interconnect Bus (EIB).

To show the efficiency of our solution, a complex test case was used in which a Mach 3 flow over a forward facing step was computed. The simulated region is a two-dimensional cut of a pipe, closed at the upper and lower boundaries and open at the left and right boundaries. The flow enters from the left, and its speed at the left boundary is held constant at three times the speed of sound. The solution contains shock waves reflecting from the closed boundaries. The problem was solved on a 128×512 grid with a 0.0001 s time step.

Density map after 4 s simulation time on the Cell processor

Compared to an Intel Core2 Duo microprocessor running at a 2 GHz clock frequency, our Cell based solution is 33 times faster even when only a single SPE is used during the computation. Utilizing all SPEs of the Cell architecture, the computation can be carried out two orders of magnitude faster, while the power dissipation of the two architectures is in the same range.

##### Implementation on FPGA

A framework for accelerating the solution of the 2-D Euler equations using an explicit unstructured finite volume discretization was implemented on FPGA. Efficient use of the on-chip memory is provided by a node reordering algorithm. Irregular memory access patterns are eliminated by on-chip memory buffers, which results in higher effective memory bandwidth and full utilization of the arithmetic unit.

When an explicit finite volume method is used to solve a PDE, a complex mathematical expression must be computed over the neighborhood of each node. The complexity of the resulting arithmetic unit is determined by the governing equations and the discretization method used. Usually the arithmetic unit is constructed from dozens of floating-point units, which makes manual design tedious and error-prone. Therefore, an automatic tool was developed that generates the arithmetic unit from the discretized governing equations.

A performance comparison showed that a single processor of the architecture running at 390 MHz achieves a 30-fold speedup over a high performance Intel Xeon E5620 microprocessor running at a 2.4 GHz clock frequency. Computing performance can be improved further by implementing three processors on one FPGA, reaching a 90-fold speedup.

Unstructured mesh of a scramjet intake

Density map after 4 s simulation time

#### Infrastructure

##### IBM Cell Cluster

1x IBM BladeCenter H Chassis: up to 14 blades, 1G Ethernet switch, InfiniBand 4X DDR switch

7x QS22 Blade: two PowerXCell 8i processors at 3.2 GHz, 16 Gbyte RAM, InfiniBand 4X DDR

7x LS22 Blade: two quad-core AMD Opteron processors, 8 Gbyte RAM, 73 Gbyte SAS drive, InfiniBand 4X DDR

##### FPGA Development Systems

2x Alpha Data ADM-XRC-6T1: Xilinx XC6VSX475T FPGA, 2 Gbyte SDRAM, PCI Express® Gen2 x4

2x Alpha Data ADM-XRC-7K1: Xilinx XC7K410T FPGA, 512 Mbyte SDRAM, PCI Express® Gen2 x4

#### Publications

- Z. Nagy, Cs. Nemes, A. Hiba, A. Kiss, Á. Csík, P. Szolgay, "Accelerating Unstructured Finite Volume Solution of 2-D Euler Equations on FPGAs", Proc. of the Conference on Modelling Fluid Flow, CMFF'2012, pp. 941-948, Budapest, Hungary, Sept. 4-7, 2012
- Z. Nagy, Cs. Nemes, A. Hiba, A. Kiss, Á. Csík, P. Szolgay, "FPGA Based Acceleration of Computational Fluid Flow Simulation on Unstructured Mesh Geometry", Proc. of the 22nd International Conference on Field Programmable Logic and Applications, FPL'2012, Oslo, Norway, Aug. 29-31, 2012
- S. Kocsárdi, Z. Nagy, Á. Csík, P. Szolgay, "Simulation of 2D inviscid, adiabatic, compressible flows on emulated digital CNN-UM", International Journal of Circuit Theory and Applications, Vol. 37, Issue 4, pp. 569-585, 2009, DOI: 10.1002/cta.565
- S. Kocsárdi, Z. Nagy, Á. Csík, P. Szolgay, "Simulation of two-dimensional supersonic flows on emulated digital CNN-UM", EURASIP Journal on Advances in Signal Processing, Special Issue: CNN Technology for Spatiotemporal Signal Processing, vol. 2009, Article ID 923404, 11 pages, 2009, DOI: 10.1155/2009/923404
- Z. Nagy, A. Kiss, S. Kocsárdi, M. Retek, Á. Csík, P. Szolgay, "A Supersonic Flow Simulation on IBM Cell Processor Based Emulated Digital Cellular Neural Networks", Proc. of CMFF'2009, pp. 502-509, Budapest, Hungary, 2009
- Z. Nagy, A. Kiss, S. Kocsárdi, Á. Csík, "Computational Fluid Flow Simulation on Body Fitted Mesh Geometry with IBM Cell Broadband Engine Architecture", Proc. of ECCTD'2009, pp. 827-830, Antalya, Turkey, 2009
- Z. Nagy, L. Kék, Z. Kincses, A. Kiss, P. Szolgay, "Toward exploitation of cell multi-processor array in time-consuming applications by using CNN model", International Journal of Circuit Theory and Applications, Vol. 36, Issue 5-6, pp. 605-622, 2008, DOI: 10.1002/cta.508