CUDA by Example Study Notes


Introduction

Chapter 3

CUDA is a platform for running code on GPUs. One big perk of CUDA is that it lets us write GPU code with syntax similar to what we already use for CPU programming. To make that possible, CUDA C adds a few keywords on top of standard C.

Some of these keywords distinguish code that runs on the device from code that runs on the host (the host is the CPU, the device is the GPU). I like to think of the host (CPU) as the manager/orchestrator: it “tells” the device (GPU) what to do.

  • __global__ void kernel(args): This qualifier alerts the compiler that the function should be compiled for the device, not the host.
  • kernel<<<m,n>>>(arg1, arg2): This is how a GPU kernel is launched. The m and n inside the angle brackets are arguments that configure the launch itself; arg1, arg2 are the actual arguments passed to the kernel function.
  • To do anything useful on the GPU, we first need to allocate the required memory on the GPU:
    • cudaMalloc((void**)&dev_ptr, size): cudaMalloc is similar to the standard C malloc function, but instead of returning a pointer, it takes the address of a device pointer and returns an error code.

      Why is cudaMalloc designed like this? Because in C, when we need to modify the value of an argument in place, we have to pass a pointer to that argument. (See this post)

      NOTE: remember to always check the error code returned by cudaMalloc (and by any other function that returns one); checking it is the only way to find out that a call failed.

    • Some rules to remember:
      1. Do not dereference the device pointer returned by cudaMalloc from code that executes on the host.
      2. Device pointers can be passed around in host code, but they cannot be used to read or write memory from code running on the host.
    • To free a pointer allocated with cudaMalloc, use cudaFree(ptr) (not the standard free).
  • Some other common methods: cudaMemcpy(dst, src, count, kind) copies memory between host and device, with the kind argument (e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost) giving the direction. A minimal end-to-end sketch follows.
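
Tying these pieces together, here is a minimal sketch along the lines of the book’s Chapter 3 add example (my reconstruction from memory, not the exact listing):

    #include <stdio.h>

    // Compiled for the device, not the host.
    __global__ void add(int a, int b, int *c) {
        *c = a + b;
    }

    int main(void) {
        int c;
        int *dev_c;

        // Pass the address of dev_c so cudaMalloc can fill it in;
        // always check the returned error code.
        if (cudaMalloc((void**)&dev_c, sizeof(int)) != cudaSuccess) {
            printf("cudaMalloc failed\n");
            return 1;
        }

        add<<<1, 1>>>(2, 7, dev_c);   // launch 1 block with 1 thread

        // We must not dereference dev_c on the host;
        // copy the result back to host memory instead.
        cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("2 + 7 = %d\n", c);

        cudaFree(dev_c);
        return 0;
    }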

Chapter 4

Parallel programming in CUDA

  • Let's go back to the kernel launch call: kernel<<<m, n>>>(arg1, arg2, ...)
    • Here, m means the number of parallel blocks in which we would like the device to execute the kernel.
    • Inside the kernel, in order to distinguish different kernel copies:
      • blockIdx: a built-in variable defined by the CUDA runtime that tells each copy of the kernel which block it is running in. It is a uint3 variable, so we can access blockIdx.x and blockIdx.y for indexing the copies.
    • m can also be a grid, instead of a number:
        dim3 grid(DIM,DIM); 
        kernel<<<grid,1>>>( dev_bitmap );
      
      • The dim3 type allows us to specify a 2D launch grid; the unspecified third dimension defaults to 1. dim3 is three-dimensional because CUDA might support 3D launch grids in the future.
  • Sometimes we see CUDA code that indexes memory with a single offset. But if the grid is 2D, why would we use one number as the offset? Because the memory behind a device pointer is linear, CUDA code typically collapses the 2D block index into a linear offset. This gives us a unique index into ptr that ranges from 0 to m*n - 1.
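
    For example, a sketch in the style of the book's bitmap kernels (the body is a placeholder of mine; gridDim.x holds the number of blocks along x):

      __global__ void kernel(unsigned char *ptr) {
          // Collapse the 2D block index into a linear offset into ptr.
          int x = blockIdx.x;
          int y = blockIdx.y;
          int offset = x + y * gridDim.x;  // unique value in [0, DIM*DIM - 1]
          ptr[offset] = 255;               // placeholder: write this block's byte
      }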
  • For a function that runs on the device (not the kernel), we can use:
    • the __device__ qualifier, which indicates the code will run on the GPU. These functions can only be called from other functions declared with __device__ or __global__.
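
      A made-up sketch (clamp255 and brighten are hypothetical names, not listings from the book):

        // Device-only helper: callable from __device__ or __global__
        // code, never from the host.
        __device__ int clamp255(int v) {
            return v > 255 ? 255 : (v < 0 ? 0 : v);
        }

        __global__ void brighten(unsigned char *ptr, int amount) {
            int offset = blockIdx.x + blockIdx.y * gridDim.x;
            ptr[offset] = clamp255(ptr[offset] + amount);  // call into the __device__ helper
        }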

Chapter 5

  • When launching a CUDA kernel, we do:
      kernel<<<Blocks, Threads>>>(args)
    
    • The second argument is the number of threads per block.
    • To index a thread, we have threadIdx.x and threadIdx.y.
  • Recall: Limit on number of blocks in a single launch: 65,535
    • Limit on the number of threads per block: maxThreadsPerBlock (from the device properties structure; usually 512 threads/block on the hardware the book covers)
  • To give each thread a unique linear index, we use the standard way to convert a 2D (block, thread) index to a 1D index: \(tid = threadIdx.x + blockIdx.x \times blockDim.x\)
    • blockDim: a constant that is the same for all blocks; it stores the number of threads along each dimension of the block. It is a 3D variable.
  • When using both blocks and threads, we often want N total threads with K threads per block, which means launching N/K blocks. To make the integer division round up instead of down, we can use the following trick: \((N + K - 1) / K\) (see the sketch below).
    • This result is the smallest integer that is greater than or equal to N/K, i.e. \(\lceil N/K \rceil\).
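
Putting Chapter 5 together, a sketch (N, K, and the add kernel body are placeholders of mine; filling the inputs is omitted for brevity):

    #define N (33 * 1024)

    __global__ void add(int *a, int *b, int *c) {
        // Linear thread index: thread within the block,
        // plus block index times threads per block.
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N)                  // guard: the rounded-up launch may overshoot N
            c[tid] = a[tid] + b[tid];
    }

    int main(void) {
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc((void**)&dev_a, N * sizeof(int));
        cudaMalloc((void**)&dev_b, N * sizeof(int));
        cudaMalloc((void**)&dev_c, N * sizeof(int));

        // K threads per block and (N + K - 1) / K = ceil(N / K) blocks,
        // so the launch covers at least N threads in total.
        int K = 128;
        add<<<(N + K - 1) / K, K>>>(dev_a, dev_b, dev_c);

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
    }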