I am trying to find a good way to generalize the launching of different CUDA kernels by treating them in a generic manner from the host. Concretely, that means computing the grid and block dimensions dynamically. The goal is to make it easier to run CUDA applications on input files of different sizes, and with kernels that, for instance, process a different number of elements per thread.
Say a kernel wants to work on 16 elements per thread, and the host receives an input file of 4305 bytes.
That means the total number of threads required is 4305 / 16 = 269.06, which does not add up, so we pad the file size by 15 bytes to 4320, which gives exactly 270 threads. We want at least a couple of blocks per SM, so we might pick 30 blocks (two per SM on a 15-SM device, say), which gives 270 / 30 = 9 threads per block, with each block covering 4320 / 30 = 144 bytes. To make sure the padded data is not written back to the host, we copy only the real number of bytes from the result to the output.
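For concreteness, a minimal sketch of that host-side arithmetic might look like this (ELEMS_PER_THREAD and NUM_BLOCKS are placeholder names of mine, and 30 blocks is just the example value):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const size_t ELEMS_PER_THREAD = 16;
        const size_t NUM_BLOCKS       = 30;

        size_t file_size = 4305;
        // Round up to the next multiple of ELEMS_PER_THREAD: 4305 -> 4320.
        size_t padded    = (file_size + ELEMS_PER_THREAD - 1)
                           / ELEMS_PER_THREAD * ELEMS_PER_THREAD;
        size_t threads   = padded / ELEMS_PER_THREAD;   // 270 threads in total
        // Only works when threads divides evenly by NUM_BLOCKS, which is
        // exactly where this approach starts to break down.
        size_t block_sz  = threads / NUM_BLOCKS;        // 9 threads per block
        printf("padded=%zu threads=%zu block_size=%zu\n",
               padded, threads, block_sz);
        return 0;
    }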
This is a simplified approach, of course, and it does not really work. I have run into problems adapting the parameters: I keep ending up in an endless loop of changing grid_size to fit the number of bytes to be computed, and then having to change block_size to match, and so on.
Has anyone had any experience with this? I am more tempted to switch to an alternative where I divide the job into parts and do separate launches, instead of adjusting the parameters in an endless loop until they fit.
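Something along these lines is what I have in mind. This is only a sketch: process_chunk is a hypothetical kernel that takes a byte offset and a byte count and guards its own bounds, and the block/grid sizes are arbitrary example values:

    #include <cstddef>
    #include <cuda_runtime.h>

    __global__ void process_chunk(const unsigned char *in, unsigned char *out,
                                  size_t offset, size_t n)
    {
        const size_t ELEMS_PER_THREAD = 16;
        size_t base = offset
                    + ((size_t)blockIdx.x * blockDim.x + threadIdx.x)
                      * ELEMS_PER_THREAD;
        for (size_t i = 0; i < ELEMS_PER_THREAD; ++i) {
            size_t idx = base + i;
            if (idx < offset + n)      // bounds check instead of padding
                out[idx] = in[idx];    // placeholder for the real work
        }
    }

    void launch_in_chunks(const unsigned char *d_in, unsigned char *d_out,
                          size_t total)
    {
        const int    BLOCK_SIZE       = 128;
        const int    GRID_SIZE        = 60;  // e.g. a couple of blocks per SM
        const size_t ELEMS_PER_THREAD = 16;
        const size_t CHUNK = (size_t)GRID_SIZE * BLOCK_SIZE * ELEMS_PER_THREAD;

        // Fixed launch configuration; only the offset and byte count vary.
        for (size_t offset = 0; offset < total; offset += CHUNK) {
            size_t n = (total - offset < CHUNK) ? total - offset : CHUNK;
            process_chunk<<<GRID_SIZE, BLOCK_SIZE>>>(d_in, d_out, offset, n);
        }
        cudaDeviceSynchronize();
    }

The idea would be that the launch configuration stays fixed and only the offset and byte count change, so the last launch simply covers fewer bytes and no padding or re-fitting of parameters is needed.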