Basics

Question definition

Given an array of \(n\) integers, the goal is to compute the sum of all elements within the array.

Solutions

The implementations for all kernel versions can be found at 2-reduce.cu on GitHub.

Naive Version with `atomicAdd`

The simplest approach involves utilizing each thread to perform an atomicAdd operation on the output variable. Here’s how the kernel is defined:

__global__ void reduce_naive_atomic(int* g_idata, int* g_odata, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    int sum = 0;
    for (unsigned int i = idx; i < n; i += gridSize)
    {
        sum += g_idata[i];
    }

    atomicAdd(g_odata, sum);
}

And the kernel launcher is straightforward, invoking the kernel a single time:

Get GPU Properties

Mar 11, 2024 gpu basics tech

In `cuda_runtime.h`, there are several APIs for retrieving properties for the installed GPU.

Here is the code of the example.

On a Nvidia GTX 3080 GPU, the properties are as below:

Device 0 properties:
  Max block dimensions: 1024 x 1024 x 64
  Max grid dimensions: 2147483647 x 65535 x 65535
  Shared memory bank size: 4 bytes
  Max shared memory per block: 49152 bytes
  Max registers per block: 65536
  Warp size: 32
  Multiprocessor count: 68
  Max resident threads per multiprocessor: 1536 = 48 warps
  L2 cache size: 5242880 bytes
  Global L1 cache supported: yes
  Total global memory: 9 GB
  Processor clock: 1 MHZ
  Memory clock: 9 MHZ

Memory coalescing in CUDA (2) – Matrix Transpose

Mar 5, 2024 cuda basics tech

Background

In the VecAdd page, we’ve introduced the memory coalescing in global memory access. This post will follow the topic with another interesting application: Matrix transposing.

The following content will briefly touch on the following topics:

Tiles in matrix, this is the basis of optimization matrix computation
A simple trick to avoid bank conflict in shared memory access

Kernels

The code for all the kernels locates in 1-matrix-transpose-coalesce.cu.

Read coalesced

template <typename T>
__global__ void transpose_read_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    output[i * m + j] = input[j * n + i];
  }
}

Write coalesced

template <typename T>
__global__ void transpose_write_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    output[j * n + i] = input[i * m + j];
  }
}

Both read and write coalesced by tiling with shared memory

The tiling method is a common methodology for optimizing matrix operation. It divides the matrix into smaller, manageable blocks or “tiles” that can fit into shared memory.

Memory coalescing in CUDA (1) – VecAdd

Feb 25, 2024 cuda basics tech

Background

Memory coalescing is a crucial optimization technique in CUDA programming that allows optimal usage of the global memory bandwidth. When threads in the same warp running the same instruction access to consecutive locations in the global memory, the hardware can coalesce these accesses into a single transaction, significantly improving performance.

Coalescing memory access is vital for achieving high performance. Besides PCIe memory traffic, accessing global memory tends to be the largest bottleneck in GPU’s memory hierarchy. Non-coalesced memory access can lead to underutilization of memory bandwidth.

Basics

Reduce kernel in CUDA

Question definition

Solutions

Naive Version with atomicAdd

Get GPU Properties

Memory coalescing in CUDA (2) – Matrix Transpose

Background

Kernels

Read coalesced

Write coalesced

Both read and write coalesced by tiling with shared memory

Memory coalescing in CUDA (1) – VecAdd

Background

Naive Version with `atomicAdd`