Memory coalescing in CUDA (2) – Matrix Transpose

Background

In the VecAdd page, we introduced memory coalescing in global memory access. This post continues that topic with another interesting application: matrix transposition.

This post briefly touches on the following topics:

  • Tiles in a matrix, the basis of optimized matrix computation
  • A simple trick to avoid bank conflicts in shared memory access

Kernels

The code for all the kernels is located in 1-matrix-transpose-coalesce.cu.

Read coalesced

template <typename T>
__global__ void transpose_read_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    // The read input[j * n + i] is contiguous in i, hence coalesced;
    // the write output[i * m + j] strides by m between adjacent threads.
    output[i * m + j] = input[j * n + i];
  }
}

Write coalesced

template <typename T>
__global__ void transpose_write_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    // The write output[j * n + i] is contiguous in i, hence coalesced;
    // the read input[i * m + j] strides by m between adjacent threads.
    output[j * n + i] = input[i * m + j];
  }
}
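
Neither kernel achieves coalescing on both the read and the write side, which motivates the tiled version below. For reference, a host-side launch for transpose_read_coalesce might look like the following sketch (transpose_write_coalesce launches identically; the 16×16 block shape and the d_input/d_output device pointers are illustrative assumptions, not from the original code):

// Hypothetical launcher; the 16x16 block shape is an illustrative choice,
// not a requirement of the kernel.
template <typename T>
void launch_transpose_read_coalesce(const T* d_input, T* d_output, int n, int m) {
  dim3 block(16, 16);
  // The grid must cover the full (n, m) index space checked inside the kernel
  dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
  transpose_read_coalesce<T><<<grid, block>>>(d_input, d_output, n, m);
}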

Both read and write coalesced by tiling with shared memory

Tiling is a common technique for optimizing matrix operations: it divides the matrix into smaller, manageable blocks, or “tiles”, that fit into shared memory.

Let’s divide the matrix into tiles of size \(TILE \times TILE\); the overall transpose then decouples into two sub-levels:

  1. the inter-tile transpose, which moves each tile to its target position; and
  2. the intra-tile transpose, which transposes the elements within a single tile.

Inter-tile transpose

Each tile is processed by one thread block, so a tile's coordinate is (blockIdx.y, blockIdx.x) and its target coordinate is (blockIdx.x, blockIdx.y).

We can then move on to the elements within each tile.

Intra-tile transpose

Within a tile, we read the elements, store a transposed version in shared memory, and then write the tile back to global memory at the position determined by the inter-tile transpose phase.

There are two copies:

  1. Copying the tile from the input matrix and storing a transposed version into shared memory
  2. Copying the tile from shared memory into the output matrix in global memory

In each copy, only one side touches global memory, so that access can follow a coalesced pattern. Both copies are performed collectively by all the threads within a thread block.

To obtain coalesced access: in the first copy, each thread reads the element at tile coordinate (threadIdx.y, threadIdx.x), so the global offset varies with threadIdx.x in its lowest term and is contiguous for adjacent threads. In the second copy, the thread block writes a tile to global memory; similarly, each thread should handle the element at (threadIdx.y, threadIdx.x) of the output tile.

Kernel with constant tile size

template <typename T, int TILE>
__global__ void transpose_tiled_coalesce0(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  assert(blockDim.x == blockDim.y && blockDim.x == TILE);

  // TILE + 1 to avoid bank conflict
  // By padding the shared memory array with an extra element, the consecutive threads access
  // memory locations that fall into different banks to avoid bank conflict
  __shared__ T tile[TILE][TILE + 1];

  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < m && j < n) {
    // Coalesced read (j is contiguous in threadIdx.x); store the element
    // transposed into the padded shared tile
    tile[threadIdx.x][threadIdx.y] = input[i * n + j];
  }
  __syncthreads();
  // Recompute global indices for the target tile at (blockIdx.x, blockIdx.y)
  i = blockIdx.x * blockDim.x + threadIdx.y;
  j = blockIdx.y * blockDim.y + threadIdx.x;
  if (i < n && j < m) {
    // Coalesced write (j is contiguous); read back a row of the shared tile
    output[i * m + j] = tile[threadIdx.y][threadIdx.x];
  }
}
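
To see why the padding works (assuming 4-byte elements and the usual 32 shared-memory banks): without padding, the column access tile[tx][ty] maps to word index \(32\,tx + ty\), whose bank \((32\,tx + ty) \bmod 32 = ty\) is the same for every tx in a warp, a 32-way conflict. With the extra column, the word index becomes \(33\,tx + ty\), whose bank \((33\,tx + ty) \bmod 32 = (tx + ty) \bmod 32\) is distinct for each tx.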

Note that since each thread processes exactly one element, both blockDim.x and blockDim.y must equal TILE, and TILE is a compile-time constant.
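
A launch for this kernel must therefore use a TILE × TILE block; for example (a sketch assuming a 32×32 tile, float elements, and hypothetical device pointers d_input/d_output):

// TILE is fixed at compile time, so the block shape must match it exactly.
constexpr int TILE = 32;
dim3 block(TILE, TILE);
dim3 grid((n + TILE - 1) / TILE, (m + TILE - 1) / TILE);
transpose_tiled_coalesce0<float, TILE><<<grid, block>>>(d_input, d_output, n, m);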

Kernel with dynamic tile size

It is also possible to allocate the shared memory dynamically, making TILE a variable that can be set to blockDim.x (equivalently blockDim.y) at launch time.

template <typename T>
__global__ void transpose_tiled_coalesce1(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  const size_t TILE = blockDim.x;
  assert(blockDim.x == blockDim.y);

  // An extern __shared__ array cannot be declared with the template type
  // directly (instantiations with different T would redeclare the same
  // symbol), so declare raw bytes and cast.
  extern __shared__ unsigned char smem[];
  T* tile = reinterpret_cast<T*>(smem);

  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < m && j < n) {
    tile[threadIdx.x * (TILE + 1) + threadIdx.y] = input[i * n + j];
  }
  __syncthreads();
  i = blockIdx.x * blockDim.x + threadIdx.y;
  j = blockIdx.y * blockDim.y + threadIdx.x;
  if (i < n && j < m) {
    output[i * m + j] = tile[threadIdx.y * (TILE + 1) + threadIdx.x];
  }
}
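
Note that the shared memory must now be supplied as the third launch parameter, and its size has to account for the +1 padding column. A hedged launch sketch (32×32 tile, float elements, and hypothetical d_input/d_output, as before):

// TILE is now a runtime choice; 32x32 = 1024 threads is the per-block maximum.
const int TILE = 32;
dim3 block(TILE, TILE);
dim3 grid((n + TILE - 1) / TILE, (m + TILE - 1) / TILE);
// Dynamic shared memory size must cover the TILE x (TILE + 1) padded layout
size_t smem_bytes = TILE * (TILE + 1) * sizeof(float);
transpose_tiled_coalesce1<float><<<grid, block, smem_bytes>>>(d_input, d_output, n, m);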

Performance

On an NVIDIA RTX 3080, these kernels show very similar performance:

Kernel             Latency
Read coalesced     0.0476
Write coalesced    0.0474
Tiled              0.0478
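
Such latencies are typically measured with CUDA events, roughly as in this sketch (not the original harness; the block/grid shapes, float element type, and d_input/d_output pointers are assumptions carried over from the launch examples above; cudaEventElapsedTime reports milliseconds):

// Minimal event-based timing sketch with one warm-up launch.
dim3 block(16, 16);
dim3 grid((n + 15) / 16, (m + 15) / 16);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
transpose_read_coalesce<float><<<grid, block>>>(d_input, d_output, n, m);  // warm-up
cudaEventRecord(start);
transpose_read_coalesce<float><<<grid, block>>>(d_input, d_output, n, m);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_ms = 0.f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);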
