Dynamic launching of kernels

I am trying to find the best way to generalize the launching of different CUDA kernels by treating them uniformly from the host, which means calculating the grid and block dimensions dynamically. The goal is to make it easier to run CUDA applications on input files of different sizes, and with kernels that, for instance, process different numbers of elements per thread.

Say a kernel works on 16 elements per thread, and the host receives an input file of 4305 bytes.

That means the total number of threads required is 4305 / 16 = 269.06, which is not a whole number, so we pad the file size with 15 bytes to 4320, which gives exactly 270 threads. We want at least a couple of blocks per SM, so with, say, two blocks on each of 15 SMs we get 30 blocks, and 270 / 30 = 9 threads per block (144 bytes of input per block). To make sure the padded data is not written back to the host, we copy only the original number of bytes from the result to the output.

This is a simplified approach, of course, and it does not really work in practice. When I try to adapt the parameters I keep running into endless loops: I change the grid_size to fit the number of bytes that need to be computed, which then forces a change of the block_size, and so on.

Has anyone had any experience with this? I am tempted to switch to an alternative where I divide the job into parts and do separate launches, instead of adjusting the parameters in an endless loop until they fit.

I use a fixed block size, chosen by benchmarking as the block size that gives the kernel the best performance. And I don't pad any data; I use if statements so that the threads "off the end" of the data don't do anything.

This doesn't meet your blocks-per-SM requirement, though; I just forget about that. If the data size isn't big enough to keep all the SMs running, then using a smaller block size will not make much of a difference: you still have the same number of GPU threads to run.

I usually use a fixed block size myself; however, I am working on a solution where I want to run different kernels from the same host code, so some sort of dynamic parameter selection is needed. One approach is to have each kernel specify its block size and the number of elements each thread works on; from those numbers the grid size can be derived. However, if the resulting grid is very small, there is a chance performance will suffer because some SMs will sit idle. It could therefore be an idea to adapt the block size downward to get a larger grid.

The idle-threads idea seems to be the tidier solution, even though it's a shame to let processors go to waste :P Do you have an example of how you have done this?

Any examples of dynamic adaptation would be helpful and give me some ideas.