

#### F.D. Witherden

Department of Ocean Engineering, Texas A&M University





# Introduction

- However, over the last decade—on a cost basis—the performance of many industrial CFD codes has plateaued.
- In this presentation we will **investigate the root cause** of this and review alternative coding paradigms and hardware that can **get solver performance back on track**.









- This relationship places **practical limits** on how high a chip can be clocked and still be power efficient.
- The solution here is to **increase the amount of work** we do per clock cycle.

- One issue is that many instructions, especially those operating on floating point data, take **multiple cycles to return a result**.
- A solution to this is **pipelining** which enables a new instruction to start execution before the current one has finished.













## How a CPU Works

| Processor         | Instruction Set | Issue Width<br>(Instructions / Cycle) | Max Clock Speed<br>(GHz) |
|-------------------|-----------------|---------------------------------------|--------------------------|
| Intel Golden Cove | x86-64          | 6                                     | 5.8                      |
| AMD Zen 4         | x86-64          | 6                                     | 5.4                      |
| Apple Firestorm   | AARCH64         | 8                                     | 3.2                      |
| Fujitsu A64FX     | AARCH64         | 4                                     | 2.2                      |

- For numerical applications the key operation is the floating point operation or FLOP (+ or – or \*).
- To improve efficiency most architectures support a **fused multiply-add** instruction (FMA) which computes:

 $c \leftarrow a \cdot b + c$  (two FLOPs).

- The best means of further improving performance is to increase the **amount of work done by each instruction**.
- This can be accomplished by having the instructions operate on small vectors in lieu of simple scalars.

- This approach is known as **single instruction multiple data** (SIMD); typical vector lengths are between 128 and 512 bits.
- SIMD capabilities are a core part of all recent processor architectures.



- Increasing the vector length is a simple means of **improving peak performance**.
- However, not all codes can fully utilise large vectors.
- As such, general purpose processors are yet to go beyond 512 bits.

| Processor         | Vector Width | Multiply-Add Rate<br>(Per Cycle) | Max DP FLOPs<br>(Per Cycle) |
|-------------------|--------------|----------------------------------|-----------------------------|
| Intel Golden Cove | 512-bit      | 2 MADD                           | 32                          |
| AMD Zen 4         | 512-bit      | 1 MADD<br>1 ADD                  | 24                          |
| Apple Firestorm   | 128-bit      | 4 MADD                           | 16                          |
| Fujitsu A64FX     | 512-bit      | 2 MADD                           | 32                          |
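The last column of the table can be reproduced from the other two. The sketch below assumes that a multiply-add counts as two FLOPs per 64-bit lane and a plain add as one; the function and parameter names are illustrative.

```c
/* Peak double precision FLOPs per cycle for a core that can issue
 * n_madd multiply-adds and n_add additions per cycle on vectors of
 * vector_bits bits. */
int peak_dp_flops_per_cycle(int vector_bits, int n_madd, int n_add)
{
    int lanes = vector_bits / 64;   /* doubles per vector */

    /* A multiply-add is two FLOPs per lane, an add is one */
    return lanes * (2 * n_madd + n_add);
}
```

For example, `peak_dp_flops_per_cycle(512, 1, 1)` corresponds to the AMD Zen 4 row and `peak_dp_flops_per_cycle(128, 4, 0)` to the Apple Firestorm row.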

- Having reached the practical limit of what is possible for a single general purpose core, the simplest means of improving performance is to **replicate them**.
- This leads us to multi-core chips with the number of cores on a single package being **between 8 and 128**.

- A typical processor has either 16 or 32 general purpose integer registers and either 16 or 32 vector registers.
- Clearly, this is **not sufficient** to contain all of the data needed for any non-trivial problem.



- The solution here is to attach some memory to our processor.
- This is usually some kind of **dynamic memory** which is **cheap** and has **reasonable densities**.





- This places practical limits on the latency and bandwidth of main memory.
- Specifically latency is usually ~50 ns and bandwidth for an eight channel DDR4 configuration is ~250 GiB/s.



## The Memory Wall

- To put these numbers into perspective a six-issue core running at 3 GHz can execute almost 1,000 instructions in 50 ns!
- If we can **dual-issue 512-bit FMAs**, this is about the same amount of time as is needed to perform 4,800 double precision floating point operations.
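Written out, and assuming the FMA-capable core also runs at 3 GHz so that 50 ns corresponds to 150 cycles, the arithmetic behind these two figures is:

$$6 \times 3 \cdot 10^9 \times 50 \cdot 10^{-9} = 900 \text{ instructions},$$

$$\underbrace{150}_{\text{cycles}} \times \underbrace{2}_{\text{FMAs / cycle}} \times \underbrace{8}_{\text{doubles / vector}} \times \underbrace{2}_{\text{FLOPs / FMA}} = 4{,}800 \text{ FLOPs}.$$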


- Now, let us consider bandwidth.
- Consider a function to perform the following 'AXPY' operation:

$$\mathbf{y} \leftarrow \alpha \mathbf{x} + \mathbf{y},$$

where **x** and **y** are vectors and  $\alpha$  is a scalar.


- This simple vector addition operation is a **building-block of many linear algebra kernels**.
- Running through our vectors each loop iteration requires us to load a component of x and y and write a component of y.
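A straightforward C version of such a function might look as follows. This is a sketch rather than a tuned kernel, and the name and argument order are illustrative.

```c
#include <stddef.h>

/* y <- alpha * x + y for vectors of length n. Each iteration loads
 * x[i] and y[i] and stores y[i], while performing only two FLOPs. */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```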


- On a 2 GHz core with 512-bit vectors that can sustain two loads and one store per cycle, our bandwidth requirements are:

$$\left\{2 \times \frac{512}{8} + \frac{512}{8}\right\} \times 2 \cdot 10^9 \text{ B/s} \approx 358 \text{ GiB/s!}$$
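The same estimate can be scripted so that other clock speeds and vector widths are easy to try. The helper below is a sketch with illustrative names; it assumes every load and store moves one full vector and that 1 GiB is 2^30 bytes.

```c
/* Bandwidth in GiB/s needed to sustain the given number of vector
 * loads and stores every cycle. */
double required_bandwidth_gib(double clock_hz, int vector_bits,
                              int loads_per_cycle, int stores_per_cycle)
{
    double bytes_per_vector = vector_bits / 8.0;
    double bytes_per_cycle =
        (loads_per_cycle + stores_per_cycle) * bytes_per_vector;

    /* 1 GiB = 2^30 bytes */
    return bytes_per_cycle * clock_hz / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `required_bandwidth_gib(2e9, 512, 2, 1)` gives roughly 358 GiB/s for a single core, which already exceeds the ~250 GiB/s quoted earlier for main memory.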






- Although it is possible to increase memory bandwidth it is **not economical at scale**.
- Most general purpose (non-HPC) applications are not bandwidth limited and thus it is not worth the extra expense and power.









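- An alternative to fusion on CPUs is cache blocking.
- The idea is to break up our loops into small blocks of **bs** elements such that the outputs **remain resident in cache**.

```
// Apply both updates one block at a time so that each block of a is
// still in cache when the second loop revisits it
for (int j = 0; j < n; j += bs) {
    int end = (j + bs < n) ? j + bs : n;  // clamp the final block to n

    for (int i = j; i < end; i++)
        a[i] += b[i];
    for (int i = j; i < end; i++)
        a[i] += c[i];
}
```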

## Cache Blocking

- The key advantage is that it enables existing **tried, tested, and optimised kernels** to be used; we simply call them more frequently with different starting offsets and smaller element counts.
- This is not a new idea; it has been used by BLAS for decades.
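A minimal sketch of this pattern is given below. Here `vec_add` is a hypothetical stand-in for an existing optimised kernel, `blocked_update` is an illustrative driver, and the block size `bs` is assumed to be non-zero.

```c
#include <stddef.h>

/* Stand-in for an existing, already optimised kernel: a[i] += b[i] */
static void vec_add(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Computes a += b followed by a += c one block at a time, so that each
 * block of a is still cache-resident for the second kernel call. */
void blocked_update(size_t n, size_t bs, double *a,
                    const double *b, const double *c)
{
    for (size_t j = 0; j < n; j += bs) {
        /* The final block may be shorter than bs */
        size_t len = (n - j < bs) ? n - j : bs;

        vec_add(len, a + j, b + j);
        vec_add(len, a + j, c + j);
    }
}
```

The kernel itself is unchanged; only the starting offsets and element counts passed to it differ from the unblocked version.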

| Intel Sapphire Rapids Xeon<br>(2 GHz / 56 cores) | Capacity<br>(KiB)                      | Latency<br>(Cycles) | Bandwidth<br>(Bytes / Cycle) | Net Bandwidth<br>(GiB/s) |
|--------------------------------------------------|----------------------------------------|---------------------|------------------------------|--------------------------|
| L1<br>(Private per core)                         | 48                                     | 5                   | 128                          | 13,351                   |
| L2<br>(Private per core)                         | 2,048                                  | 14                  | ~50                          | 5,215                    |
| L3<br>(Shared)                                   | 1,920 (per core)<br>107,968 (56 cores) | 88                  | < 32                         | < 1,000                  |
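The net bandwidth figures for the private caches appear to follow from the per-core bandwidth multiplied by the clock rate and the core count. For L1, for example:

$$128 \times 2 \cdot 10^9 \times 56 = 1.4336 \cdot 10^{13} \text{ B/s} \approx 13{,}351 \text{ GiB/s},$$

and the same calculation with 50 bytes per cycle gives approximately 5,215 GiB/s for L2.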


- Effectiveness depends on the working set of the application relative to the size of the cache being blocked for.
- When solving the Euler equations using DG on p = 4 hexahedra, storing U and F(U) for eight elements requires 160 KB.
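The 160 KB figure can be reproduced under a few assumptions that are not stated explicitly above: (p + 1)^3 solution points per hexahedron, five conserved variables for the three-dimensional Euler equations, three flux components per variable, and double precision storage. The function name is illustrative.

```c
#include <stddef.h>

/* Bytes needed to store U and F(U) for n_elems hexahedral elements.
 * Assumes (p + 1)^3 solution points per element, n_vars conserved
 * variables, n_dims flux components per variable, and 8-byte doubles. */
size_t dg_working_set_bytes(int p, int n_vars, int n_dims, int n_elems)
{
    size_t n_pts = (size_t)(p + 1) * (p + 1) * (p + 1);
    size_t per_pt = (size_t)n_vars * (1 + n_dims);   /* U plus F(U) */

    return n_pts * per_pt * sizeof(double) * n_elems;
}
```

With `p = 4`, `n_vars = 5`, `n_dims = 3` and `n_elems = 8` this evaluates to 160,000 bytes, which is larger than the 48 KiB L1 cache in the table above but fits comfortably within the 2,048 KiB L2 cache.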













- This makes them more efficient, but also **more difficult to program**, as the hardware is doing less work for you.
- Moreover, the **minimum problem size** required to fully utilise a GPU is typically much larger than is required by a CPU.




|                                     | Clock Speed<br>(GHz) | Power<br>(W) | DP TFLOP/s<br>(Vector/Matrix) | Ratio<br>(W per TFLOP/s) |
|-------------------------------------|----------------------|--------------|-------------------------------|--------------------------|
| Intel Sapphire Rapids<br>(56 cores) | 2.00                 | 350          | 3.6<br>3.6                    | 97.7<br>97.7             |
| NVIDIA H100<br>(132 cores)          | 1.98                 | 700          | 34.0<br>66.9                  | 20.9<br>10.5             |
| AMD MI250X<br>(2 × 110 cores)       | 1.70                 | 560          | 47.9<br>95.7                  | 11.7<br>5.9              |

## GPUs

- GPUs also typically come with high bandwidth memory.
- However, this **comes at the cost of capacity**, which can be a problem for some (typically implicit) solvers.
- Furthermore, as cache blocking is not practical on GPUs, they often make less efficient use of bandwidth.

|                                       | Memory Type | Memory Capacity<br>(GiB) | Memory Bandwidth<br>(TiB/s) |
|---------------------------------------|-------------|--------------------------|-----------------------------|
| Intel Sapphire Rapids<br>(One Socket) | DDR5        | 1,536                    | 0.25                        |
| NVIDIA H100                           | HBM3        | 80                       | 3.0                         |
| AMD MI250X                            | HBM2e       | 128<br>(2 × 64)          | 3.2<br>(2 × 1.6)            |



- Thankfully, there is a strong trend towards **fully unified memory** which will eliminate this issue.
- The first such HPC GPU doing this is the upcoming AMD MI300A, but we can expect other vendors to follow suit.
- The transfer problem is solved!


- Practically, the biggest downside of GPUs is the use of **vendor-specific programming languages**:
  - NVIDIA: CUDA.
  - AMD: HIP.
  - Intel: OpenCL and oneAPI.

- This makes it difficult to achieve **performance portability** and can lead to **vendor lock-in**.
- Irrespective of which environment one uses there is one common problem: **kernel launch latency**.
- This makes it difficult to port codes **function-by-function** even if memory is unified.


- As such, porting a code to GPUs is a substantial undertaking, and a lot of work is often required before any performance gains are observed.
- Often it is easier to **rewrite a code from scratch**, e.g., Nek5000 to nekRS.



