Survey: What would you like to see in a kernel launch configuration builder?

I’d like to get some developer/user feedback on something I’m working on in my cuda-api-wrappers library. (Actually, I should probably have made an announcement about a major feature release of it, but maybe another day.) Anyway, I’m adding a builder class for kernel launch configurations to the library, and I was hoping people would share patterns they tend to reproduce in their code when building those.

What I have so far simplifies some common idioms. For example, in the vectorAdd NVIDIA sample, we have:

  int threadsPerBlock = 256;
  int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
  printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
         threadsPerBlock);
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
  err = cudaGetLastError();

and with the builder, it can now be:

auto launch_config = cuda::launch_config_builder()
	.overall_size(numElements)
	.block_size(256)
	.build();

cuda::launch(
	vectorAdd, launch_config,
	d_A, d_B, d_C, numElements
);

where the launch config construction merely communicates the intent (and does not require the reader to note that we’re doing rounded-up division, then realize why it’s the right thing to do).

So, that covers one way we create launch configurations. It’s also safer in case numElements is close to std::numeric_limits<int>::max(), where the intermediate sum numElements + threadsPerBlock - 1 would overflow. And I have another thing or two up my sleeve, like associating the builder with a kernel-in-a-context or a device and using their maximum supported block sizes for a linear block. But I want to better cover other common launch config creation patterns (in particular, but not only, w.r.t. the grid and block dimensions, and not just for linear grids). So, requests are welcome.
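To make the overflow point concrete, here is a minimal sketch of a rounded-up division that never forms the sum dividend + divisor - 1. The name div_rounding_up is hypothetical, chosen for illustration; this is not necessarily how the library computes the grid size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, for illustration only.
// The sample's idiom, (n + block - 1) / block, computes n + block - 1 first,
// which overflows a signed int when n is near the maximum int value.
// Dividing first and then adjusting by the remainder avoids that sum entirely.
inline std::size_t div_rounding_up(std::size_t dividend, std::size_t divisor)
{
    assert(divisor > 0);
    return dividend / divisor + (dividend % divisor != 0 ? 1 : 0);
}
```

With 50000 elements and 256 threads per block this gives 196 blocks, the same as the sample's idiom, but it also stays well-defined for element counts right up to the maximum int value.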

Another launch config pattern is sizing the grid according to the GPU and the kernel to be launched (maximize occupancy, but fit on the device). This is generally useful for any grid-stride-loop pattern, and is also important for cooperative grid launches. A basic example is given in the programming guide.
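As a rough illustration of why a grid sized to the device (rather than to the data) still covers all elements with a grid-stride loop, here is a host-side emulation in plain C++. It is not device code, and count_visits is a hypothetical name used only for this sketch; the comments note which CUDA built-ins each quantity stands in for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a grid-stride loop (illustrative, not device code).
// Each simulated thread starts at its global index and advances by the total
// number of threads in the grid, so a fixed-size grid handles any input size.
inline std::vector<int> count_visits(
    std::size_t num_elements, std::size_t grid_size, std::size_t block_size)
{
    assert(grid_size > 0 && block_size > 0);
    std::vector<int> visits(num_elements, 0);
    const std::size_t stride = grid_size * block_size; // gridDim.x * blockDim.x
    for (std::size_t block = 0; block < grid_size; ++block) {
        for (std::size_t thread = 0; thread < block_size; ++thread) {
            // Start index plays the role of blockIdx.x * blockDim.x + threadIdx.x
            for (std::size_t i = block * block_size + thread;
                 i < num_elements; i += stride) {
                ++visits[i]; // stands in for the per-element work
            }
        }
    }
    return visits;
}
```

Every element ends up visited exactly once whether the grid has more threads than elements or far fewer, which is why the grid size can be chosen from occupancy alone.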

@Robert_Crovella: Yes, I’m definitely thinking about utilizing the maximum-occupancy-calculation functions here. Right now you can do:

auto launch_config = cuda::launch_config_builder()
	.block_size(128)
	.grid_size( 
		my_wrapped_kernel.max_active_blocks_per_multiprocessor() 
		* device.multiprocessor_count()
	)
	.build();

but maybe it could be:

auto launch_config = cuda::launch_config_builder()
	.kernel(my_kernel) // doesn't even have to be wrapped I think
	.block_size(128)
	.device(device)
	.saturate_with_active_blocks()
	.build();

How does that sound to you?

Either one should be fine.

The first one works right now; I mean, on my development branch. For the second one, I need to implement the second-to-last line, i.e. saturate_with_active_blocks().

@Robert_Crovella: OK, I have added this feature, but it’s not tested yet. It’s on the development branch; you’re welcome to play with it if you have the time.