Chapter 4: Computer Architecture and Scheduling

4.1 Architecture of a modern GPU

above is a high-level view of a CUDA-capable GPU
it’s organized into an array of highly threaded streaming multiprocessors (SMs), each of which contains a bunch of processing units called CUDA cores or just cores (the green blocks)
all cores within an SM share control logic and memory resources
e.g., an A100 has 108 SMs with 64 cores each
In addition to local memory on the SM, the GPU has a global off-chip device memory with tight integration with the SMs—this is the DRAM, usually high-bandwidth memory (HBM) in the newest GPUs
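The counts above can be read off a real device. This is a minimal sketch, assuming device 0 is the GPU of interest; `query_device` is a hypothetical helper name, while `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API. The number of cores per SM is not reported by this struct, so it is not printed.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fills `prop` for the given device, returns false on error.
bool query_device(int device, cudaDeviceProp* prop) {
    return cudaGetDeviceProperties(prop, device) == cudaSuccess;
}

int main() {
    cudaDeviceProp prop;
    if (!query_device(0, &prop)) {  // assumption: inspect the first visible device
        std::printf("no usable CUDA device found\n");
        return 1;
    }
    std::printf("name:                        %s\n", prop.name);
    std::printf("SMs:                         %d\n", prop.multiProcessorCount);
    std::printf("warp size:                   %d\n", prop.warpSize);
    std::printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("global (device) memory:      %.1f GiB\n",
                prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```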

4.2 Block scheduling

when a kernel is called, the CUDA runtime system launches a grid of threads divided into blocks
all threads within a block are always scheduled simultaneously on the same SM
often the grid contains more blocks than can simultaneously execute across SMs, so the runtime system tracks which are running and which still need to be assigned
because threads within a block are launched together and share hardware resources, it’s relatively easy to facilitate their interaction via synchronization (4.3) and/or shared memory (ch. 5)
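To get a feel for how many blocks a grid has compared with how many can be resident at once, here is a rough sketch. The `scale` kernel, `blocks_needed`, and the sizes are made up for illustration; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is a CUDA runtime call that reports how many blocks of a given kernel and block size fit on one SM.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only so the occupancy query has something to inspect.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Number of blocks needed to cover n elements (ceiling division).
int blocks_needed(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

int main() {
    const int n = 1 << 24;       // assumed problem size
    const int block_size = 256;  // assumed block size

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device found\n");
        return 1;
    }

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale, block_size, 0) != cudaSuccess) {
        std::printf("occupancy query failed\n");
        return 1;
    }

    int grid_blocks = blocks_needed(n, block_size);
    int resident_blocks = blocks_per_sm * prop.multiProcessorCount;
    std::printf("blocks in grid:            %d\n", grid_blocks);
    std::printf("blocks resident at once:   %d (%d per SM x %d SMs)\n",
                resident_blocks, blocks_per_sm, prop.multiProcessorCount);
    std::printf("blocks waiting initially:  %d\n",
                grid_blocks > resident_blocks ? grid_blocks - resident_blocks : 0);
    return 0;
}
```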

4.3 Synchronization and transparent scalability

CUDA allows threads within a block to coordinate with one another using barrier synchronization
when a thread reaches a call to __syncthreads(), it waits there until every other thread in its block has reached the same point; this lets threads safely share information produced before the barrier
one important characteristic of __syncthreads() is that using it within if-else statements can lead to undefined behavior when threads of the same block take different branches. in the example below
```
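// Intentionally incorrect: even and odd threads of the same block
// wait at two different barriers.
__global__ void incorrect_barrier(int n) {
  // ...
  if (threadIdx.x % 2 == 0) {
      // ...
      __syncthreads();  // barrier A: only even threads arrive here
  }
  else {
      // ...
      __syncthreads();  // barrier B: only odd threads arrive here
  }
}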
```
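either all the threads in the block must execute the if path or all of them must execute the else path, because the two __syncthreads() calls are different barriers; here even and odd threads split across the two paths, so the behavior is undefined
this kind of set-up can also lead to deadlocks, since threads waiting at one barrier can end up waiting for threads that never arrive there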
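One common way around this is to keep the branch-specific work inside the conditional and hoist a single barrier out of it, so every thread in the block reaches the same __syncthreads(). The sketch below is only an illustration: `paired_sum`, `run_paired_sum`, the scratch buffer `tmp`, and the even block size are all assumptions, not something from the notes above.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: even and odd threads do different work in phase 1,
// then each thread adds its partner's phase-1 result in phase 2.
__global__ void paired_sum(const int* in, int* tmp, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: divergent work, but no barrier inside either branch.
    if (i < n) {
        if (threadIdx.x % 2 == 0) tmp[i] = in[i] * 2;
        else                      tmp[i] = in[i] + 1;
    }

    // One barrier, reached by every thread of the block (in range or not).
    __syncthreads();

    // Phase 2: safe to read what the partner thread wrote before the barrier.
    if (i < n) {
        int j = blockIdx.x * blockDim.x + (threadIdx.x ^ 1);  // partner in same block
        out[i] = tmp[i] + (j < n ? tmp[j] : 0);
    }
}

// Runs the kernel on in[i] = i and checks the result against a host reference.
bool run_paired_sum(int n) {
    const int block_size = 256;  // assumed even, so partners share a block
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + block_size - 1) / block_size;
    paired_sum<<<blocks, block_size>>>(d_in, d_tmp, d_out, n);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return false;

    // Host reference for phase 1.
    auto phase1 = [&](int i) {
        return ((i % block_size) % 2 == 0) ? h_in[i] * 2 : h_in[i] + 1;
    };
    for (int i = 0; i < n; ++i) {
        int j = (i / block_size) * block_size + ((i % block_size) ^ 1);
        int expected = phase1(i) + (j < n ? phase1(j) : 0);
        if (h_out[i] != expected) return false;
    }
    return true;
}

int main() {
    std::printf("paired_sum %s\n", run_paired_sum(1000) ? "matched" : "failed");
    return 0;
}
```

The divergence in phase 1 is still there; the point is only that the barrier itself sits on a path that all threads of the block take.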
the runtime system also has to make sure each thread actually gets the resources it needs to complete execution, because if a thread gets stuck, this will also lead to a deadlock
the runtime system satisfies these constraints by assigning resources to all threads in a block as a unit, and simultaneous execution ensures that wait times aren’t too long
one advantage of this segregation of resources by block is modularity—none of the blocks has to wait for another, so they can be executed in any order
this gives the system flexibility in scheduling–for low-power systems like mobile devices, only a few blocks may execute at a time, but in more powerful devices, more blocks can be executed simultaneously
this facilitates transparent scalability: the ability to execute the same code, unchanged, on hardware with different amounts of execution resources

4.4 Warps and SIMD hardware

In general, you can’t assume anything about the execution order of threads within a block
However, once a block has been assigned to an SM, it’s further divided into 32-thread units called warps
Every thread in a warp is fed the same instruction at the same time by the SM – this is an example of SIMD (single-instruction, multiple-data) hardware
Adjacent threads within a block (based on threadIdx) are assigned to the same warp
That is, in a 1D block organization, threads 0-31 form the first warp, 32-63 form the second, etc.
If the block is 2D, threads are first linearized in row-major order (like how 2D arrays are stored in memory, with threadIdx.x varying fastest) and then split into warps

If the block is 3D, you basically lay out each 2D \( y \times x \) slice in row-major order, one slice after another for each \(z\). That is, for a \(2 \times 8 \times 4\) block written as \( z \times y \times x \) (64 threads total), the first warp would be the \(8 \times 4\) slice corresponding to \(z = 0\) and the second warp would be the \(8 \times 4\) slice corresponding to \(z = 1\)
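The warp a thread lands in can be computed from its coordinates. CUDA linearizes threads within a block with x varying fastest, then y, then z, and cuts the linear order into groups of 32. A small sketch (`linear_thread_index`, `warp_index`, and `kWarpSize` are hypothetical names; the block shape is the 64-thread example with 4 threads in x, 8 in y, and 2 in z):

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // warp size assumed to be 32, as stated above

// Position of a thread in its block's linear order: x fastest, then y, then z.
__host__ __device__ inline int linear_thread_index(dim3 idx, dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Index of the warp (within the block) that the thread belongs to.
__host__ __device__ inline int warp_index(dim3 idx, dim3 dim) {
    return linear_thread_index(idx, dim) / kWarpSize;
}

int main() {
    dim3 block(4, 8, 2);  // blockDim.x = 4, blockDim.y = 8, blockDim.z = 2

    std::printf("%d\n", warp_index(dim3(0, 0, 0), block));  // prints 0
    std::printf("%d\n", warp_index(dim3(3, 7, 0), block));  // prints 0 (linear index 31)
    std::printf("%d\n", warp_index(dim3(0, 0, 1), block));  // prints 1 (linear index 32)
    std::printf("%d\n", warp_index(dim3(3, 7, 1), block));  // prints 1 (linear index 63)
    return 0;
}
```

Inside a kernel the same helpers could be called as `warp_index(threadIdx, blockDim)`.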
The cores in an SM are organized into processing blocks - in the image below, we can see two processing blocks with 8 cores each. In the A100, each SM's 64 cores are organized into four processing blocks with 16 cores each

The cores in each processing block share an instruction fetch/dispatch unit, which feeds all threads of a warp the same instruction at the same time (different warps may receive different instructions)
This sharing is a design choice that allows less of the chip to be taken up by instruction/control units and more by cores to increase throughput
The figure below contrasts the standard von Neumann stored-program architecture (recall the program counter keeps track of the memory address of the next instruction to be executed) with a GPU-ified version to enable SIMD parallelism

4.5 Control divergence

SIMD architectures work best when all threads in a warp are doing the same thing–then they can all just follow the current instruction together
However, threads in the same warp may take different execution paths depending on their data or index; this happens a lot when checking boundary conditions (e.g., making sure a thread's global index, computed from blockIdx.x, blockDim.x, and threadIdx.x or whatever coordinates apply, is within the bounds of the input array)
This is called control divergence, and GPUs handle it by doing multiple passes of instructions through the warp, disabling threads for instructions/control paths which do not apply to them–obviously, this costs extra time and slows down computation
In older GPUs, this cost was higher because each pass had to be performed sequentially, but in newer models starting with the Volta series (e.g., V100s) these passes can be run concurrently, meaning the execution of one pass can be interleaved with another (independent thread scheduling)
Control divergence can also occur in for loops if threads in a warp run different numbers of iterations. In this case, threads that finish early are deactivated for the remaining iterations

When control divergence occurs as a result of a boundary condition (i.e., checking whether a thread's index is within the bounds of the data), the amount of impact it has on runtime decreases as the overall size of the input data (total number of threads) grows
This is because these boundary conditions will occur in an increasingly small proportion of warps as the size of the data grows
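A quick back-of-the-envelope check of this claim, as a sketch under stated assumptions: a 1D launch, a block size that is a multiple of 32, and a kernel whose only branch is the `if (i < n)` bounds check. Under those assumptions at most one warp (the one straddling index n) diverges, so the divergent fraction is roughly 1 / (number of warps). `vec_add`, `DivergenceStats`, and `boundary_divergence` are hypothetical names; the kernel is shown only for context and is not launched here.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Context only: the bounds check whose divergence is being counted.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

struct DivergenceStats {
    int total_warps;      // warps launched to cover n elements
    int divergent_warps;  // warps with some threads in range and some out of range
};

// Assumes block_size is a multiple of 32, so warp boundaries fall on multiples of 32.
DivergenceStats boundary_divergence(int n, int block_size) {
    const int warp_size = 32;
    int blocks = (n + block_size - 1) / block_size;
    int total = blocks * (block_size / warp_size);
    int divergent = (n % warp_size != 0) ? 1 : 0;  // only the warp straddling n
    return {total, divergent};
}

int main() {
    const int sizes[] = {100, 1000, 10000, 1000000};
    for (int n : sizes) {
        DivergenceStats s = boundary_divergence(n, 256);
        std::printf("n = %7d: %5d warps, %d divergent (%.4f%%)\n",
                    n, s.total_warps, s.divergent_warps,
                    100.0 * s.divergent_warps / s.total_warps);
    }
    return 0;
}
```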
Another implication of control divergence is that even though all threads in a warp are passed the same instruction at the same time, you still can’t assume execution times are synched because some threads might go down different control paths–you can ensure synchronization by using a barrier method like __syncwarp()

4.6 Warp scheduling and latency tolerance

Typically, many more threads are assigned to an SM than can simultaneously execute on it
The reason for this is that if a warp has to perform a long-latency operation (e.g., it has to wait for some data to load from DRAM or is deactivated due to control divergence), then having extra warps resident and ready to go means that the GPU can schedule one of them to run while the first one waits (a classic method in concurrency). Because picking a ready warp this way introduces no idle time, this is called zero-overhead scheduling
in general, the techniques a GPU uses to keep cores active during long-latency operations are called latency tolerance or latency hiding
this ability is the main reason that GPUs don’t dedicate nearly as much chip space to things like caches as CPUs do: they hold the execution state of all resident threads in hardware registers, so they can perform this zero-overhead switching (while CPUs must save and restore execution states)
for example, in the A100, each SM has 64 cores (each of which executes one thread at a time), but can have up to 2048 threads assigned to it (thus 32 threads per core)
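The oversubscription arithmetic can be written out explicitly. This is a tiny host-only sketch using the A100 numbers quoted above and a warp size of 32; `SmBudget`, `threads_per_core`, and `resident_warps` are hypothetical names.

```cuda
#include <cassert>
#include <cstdio>

struct SmBudget {
    int cores;                 // cores per SM
    int max_resident_threads;  // threads that can be assigned to one SM
    int warp_size;             // threads per warp
};

// How many assigned threads there are per core.
constexpr int threads_per_core(SmBudget b) {
    return b.max_resident_threads / b.cores;
}

// How many warps can be resident on the SM for the scheduler to choose from.
constexpr int resident_warps(SmBudget b) {
    return b.max_resident_threads / b.warp_size;
}

int main() {
    constexpr SmBudget a100{64, 2048, 32};  // numbers taken from the notes above
    std::printf("threads per core: %d\n", threads_per_core(a100));  // prints 32
    std::printf("resident warps:   %d\n", resident_warps(a100));    // prints 64
    return 0;
}
```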