I’d like to get some developer/user feedback on something I’m working on in my cuda-api-wrappers library. (Actually, I should probably have made an announcement about a recent major feature release, but maybe another day.) Anyway, I’m adding a builder class for kernel launch configurations, and I was hoping people might share the patterns they tend to reproduce in their code when constructing launch configurations.
What I have so far simplifies some common idioms. For example, in the `vectorAdd` NVIDIA sample, we have:

```cpp
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
err = cudaGetLastError();
```
and with the builder, it can now be:
```cpp
auto launch_config = cuda::launch_config_builder()
    .overall_size(numElements)
    .block_size(256)
    .build();
cuda::launch(vectorAdd, launch_config, d_A, d_B, d_C, numElements);
```
where the launch config construction communicates the intent directly, without requiring the reader to notice that we’re doing rounded-up division, and then work out why that’s the right thing to do.
So, that covers one way we create launch configurations. It’s also safer in case `numElements` is close to `std::numeric_limits<int>::max()`, where the naive `numElements + threadsPerBlock - 1` computation overflows. And I have another thing or two up my sleeve, like associating the builder with a kernel-in-a-context or a device, and using their maximum supported block sizes for a linear block. But I want to better cover other common launch-config creation patterns (in particular, but not just, w.r.t. the grid and block dimensions, and not just for linear grids). So, requests are welcome.