CUDA by Example Study Notes
Chapter 3
Introduction to CUDA C
CUDA is a platform for running code on the GPU. One big perk of CUDA is that you write GPU code with syntax very similar to ordinary CPU programming. To make that possible, CUDA C introduces a few keywords on top of standard C.
Some of these keywords distinguish code that runs on the device from code that runs on the host (the host is the CPU, the device is the GPU). I like to think of the host (CPU) as the manager/orchestrator: it “tells” the device (GPU) what to do.
- `__global__ void kernel(args)`: this qualifier alerts the compiler that the function should be compiled for the device, not the host.
- `kernel<<<m,n>>>(arg1, arg2)`: this is how a GPU kernel is called. The `m` and `n` in the angle brackets are arguments to the runtime that configure the launch; `arg1, arg2` are the actual arguments fed to the kernel function.
- To do anything useful on the GPU, we first need to allocate the required memory on the GPU: `cudaMalloc((void**)&dev_ptr, size)`. `cudaMalloc` is similar to the standard C `malloc` function, but instead of returning the pointer, it takes a pointer to a device pointer and returns an error code. Why is `cudaMalloc` designed like this? Because in C, when we need to modify the value of an input argument in place, we need to pass in a pointer to that argument. (See this post)
  - NOTE: always check the error code returned from `cudaMalloc` (and any other function that returns one); it is good practice. (A sketch putting these calls together appears after this list.)
- Some rules to remember:
  - Do not dereference the device pointer returned by `cudaMalloc` from code that executes on the host.
  - Device pointers can be passed around in host code, but they cannot be used to read or write memory from the host.
- To free a `cudaMalloc`'d pointer, use `cudaFree(ptr)`.
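Putting these pieces together, here is a minimal sketch in the spirit of the book's first example (the `add` kernel and variable names are illustrative):

```c
#include <stdio.h>

// __global__: compiled for the device, callable from host code.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;

    // cudaMalloc fills in our pointer through the double pointer
    // and uses its return value for the error code instead.
    cudaError_t err = cudaMalloc((void**)&dev_c, sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    add<<<1, 1>>>(2, 7, dev_c);  // launch with 1 block, 1 thread

    // Copy the result back to the host; we never dereference dev_c here.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);
    return 0;
}
```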
- Some other common runtime functions and types:
  - `cudaMemcpy`: copies memory between host and device.
  - `cudaGetDeviceCount`: returns the number of CUDA-capable devices.
  - `cudaDeviceProp`: a struct containing device info.
  - `cudaChooseDevice()`: picks the device that best matches a set of desired properties.
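A hedged sketch of how these query functions fit together (the requested compute capability 1.3 is just an example value):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    int count;
    cudaGetDeviceCount(&count);
    printf("found %d CUDA device(s)\n", count);

    // Describe the device we would like: here, compute capability >= 1.3.
    cudaDeviceProp prop;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;
    prop.minor = 3;

    // Ask the runtime for the closest match, then select it.
    int dev;
    cudaChooseDevice(&dev, &prop);
    cudaSetDevice(dev);
    return 0;
}
```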
Chapter 4
Parallel Programming in CUDA C
- Let's go back to the kernel launch call: `kernel<<<m, n>>>(arg1, arg2...)`
  - Here, `m` is the number of parallel blocks in which we would like the device to execute the kernel.
  - Inside the kernel, to distinguish between the parallel copies: `blockIdx`, defined by the CUDA runtime, tells us which block a given copy is running in. It is a `uint3` variable, so we can access `blockIdx.x` and `blockIdx.y` for indexing the copies.
  - `m` can also be a grid instead of a number: `dim3 grid(DIM,DIM); kernel<<<grid,1>>>( dev_bitmap );`
    - The `dim3` type lets us specify a 2D launch grid. The unspecified third dimension defaults to 1; the type is three-dimensional because CUDA anticipated supporting 3D launch grids.
- Sometimes we see CUDA code that uses offsets: ![[Pasted image 20240622111530.png]] But if the grid is 2D, why would we use a single number as the offset? It turns out CUDA code often uses a linear offset to access memory through pointers. This gives us a unique index into `ptr` that ranges from 0 to `m*n - 1`.
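A sketch of that pattern, modeled on the book's bitmap kernels (`DIM` and the byte written are illustrative): each block handles one element of a `DIM x DIM` grid, and the 2D block index is flattened into a linear offset:

```c
#define DIM 1024

__global__ void kernel(unsigned char *ptr) {
    // 2D position of this block in the launch grid.
    int x = blockIdx.x;
    int y = blockIdx.y;

    // Flatten (x, y) into a linear offset in [0, DIM*DIM - 1].
    int offset = x + y * gridDim.x;

    ptr[offset] = 255;  // e.g., write one byte per block
}

// Host side:
//   dim3 grid(DIM, DIM);
//   kernel<<<grid, 1>>>(dev_bitmap);
```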
- For a function that runs on the device but is not a kernel, we can use the `__device__` qualifier to indicate that the code will run on the GPU. These functions can only be called from functions defined with `__device__` or `__global__`.
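A minimal sketch of this (the helper name is hypothetical):

```c
// __device__: compiled for the device, callable only from
// __device__ or __global__ functions, never from the host.
__device__ float square(float x) {
    return x * x;
}

__global__ void apply_square(float *data) {
    int i = blockIdx.x;
    data[i] = square(data[i]);  // OK: called from a __global__ kernel
}
```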
Chapter 5
- When launching a CUDA kernel, we do: `kernel<<<Blocks, Threads>>>(args)`
  - The second argument is the number of threads per block.
- To index a thread, we have `threadIdx.x` and `threadIdx.y`.
- Recall the limits:
  - Number of blocks in a single launch: 65,535 (in each grid dimension).
  - Number of threads per block: `maxThreadsPerBlock` from the device properties structure (512 on the older GPUs the book targets; 1,024 on most modern ones).
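We can read this limit at runtime from the properties struct, e.g.:

```c
#include <stdio.h>

int main(void) {
    // Query the properties of device 0 and read its thread limit.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```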
- To compute a global thread index, we use the standard way to convert a 2D index to a 1D index: \(tid = threadIdx.x + blockIdx.x \times blockDim.x\) ![[Pasted image 20240712161901.png]]
  - `blockDim`: constant across all blocks; stores the number of threads along each block dimension. It is a 3D variable.
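In kernel code, that indexing looks like this (a vector-add sketch in the spirit of the book's examples):

```c
__global__ void vec_add(int *a, int *b, int *c, int n) {
    // Offset within the block, plus the block's offset into the grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)  // guard: the grid may contain more threads than n
        c[tid] = a[tid] + b[tid];
}
```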
- When using both blocks and threads, we sometimes want N total threads with K threads per block, which requires N/K blocks. Since integer division truncates, we round up with the following trick: \((N + K - 1) / K\)
This result is the smallest integer that is greater than or equal to N/K, so the grid always covers all N elements.
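On the host, the launch then looks like this (N, the 128 threads per block, and the `dev_*` pointers, assumed already allocated with `cudaMalloc`, are illustrative):

```c
#define N 10000

int threadsPerBlock = 128;
// Round up so the grid covers all N elements even when K doesn't divide N.
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

vec_add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);
```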