CUDA

Reduce kernel in CUDA

cuda basics tech

Question definition: Given an array of \(n\) integers, compute the sum of all elements in the array.

Solutions: The implementations of all kernel versions can be found in 2-reduce.cu on GitHub.

Naive version with atomicAdd: The simplest approach has each thread perform an atomicAdd on the output variable. Here is how the kernel is defined: __global__ void reduce_naive_atomic(int* g_idata, int* g_odata, unsigned int n) { unsigned int idx = blockIdx.
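A minimal sketch of what this kernel likely looks like in full, assuming a standard global thread index and a zero-initialized output buffer (the authoritative version is in 2-reduce.cu):

```cuda
// Naive reduction sketch: every thread atomically adds its element to the
// single output word. Simple and correct, but the atomic serializes updates.
// Assumes *g_odata was zeroed before launch.
__global__ void reduce_naive_atomic(int* g_idata, int* g_odata, unsigned int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(g_odata, g_idata[idx]);
    }
}
```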

Memory coalescing in CUDA (2) – Matrix Transpose

cuda basics tech

Background: In the VecAdd post, we introduced memory coalescing in global memory access. This post follows up with another interesting application: matrix transposition. It briefly touches on two topics: tiles in a matrix (the basis for optimizing matrix computations) and a simple trick to avoid bank conflicts in shared memory access.

Kernels: The code for all the kernels is in 1-matrix-transpose-coalesce.cu. Read coalesced: template <typename T> __global__ void transpose_read_coalesce( const T* __restrict__ input, T* __restrict__ output, int n, int m) { int i = blockIdx.
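The exact body lives in 1-matrix-transpose-coalesce.cu; a plausible sketch, assuming a row-major n × m input and the usual mapping of threadIdx.x to the fastest-varying (column) index, is:

```cuda
// Read-coalesced transpose sketch: consecutive threads (threadIdx.x) read
// consecutive input elements, so reads coalesce; the corresponding writes to
// the transposed output are strided by n and therefore uncoalesced.
template <typename T>
__global__ void transpose_read_coalesce(
    const T* __restrict__ input, T* __restrict__ output, int n, int m) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index into input
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index into input
    if (i < n && j < m) {
        output[j * n + i] = input[i * m + j];
    }
}
```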

Memory coalescing in CUDA (1) – VecAdd

cuda basics tech

Background: Memory coalescing is a crucial optimization technique in CUDA programming that allows optimal use of global memory bandwidth. When threads in the same warp, executing the same instruction, access consecutive locations in global memory, the hardware can coalesce these accesses into a single transaction, significantly improving performance. Coalescing memory accesses is vital for high performance: apart from PCIe traffic, global memory access tends to be the largest bottleneck in the GPU's memory hierarchy.
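For illustration only (not necessarily the post's exact code), a vector-add kernel whose access pattern is fully coalesced could look like:

```cuda
// Each thread handles one element; within a warp, idx differs only in
// threadIdx.x, so the 32 threads touch 32 consecutive floats of a, b, and c.
// The hardware can serve each warp-wide load/store with a single transaction.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```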