I’d like to do something like this:
float* vmem_in;
float* vmem_out;
cudaMalloc( (void**)&vmem_in, width*height*sizeof(float) );
cudaMalloc( (void**)&vmem_out, width*height*sizeof(float) );
// Block and grid sizes, just to illustrate
dim3 block, grid;
block.x = 32;
block.y = 16;
grid.x = width/32;
grid.y = height/16;
// mykernel does per-element computation
mykernel<<<grid,block,shmsz>>>( vmem_in, vmem_out, width );
// Here is the interesting part: a second launch, offset to the leftover width%32 columns
grid.x = 1;
block.x = width%32;
mykernel<<<grid,block,shmsz>>>( vmem_in + 32*(width/32), vmem_out + 32*(width/32), width );
This is useful when no block size (other than 1x1) evenly tiles the whole matrix and you don’t want to apply an offset inside the kernel code (which costs not only performance, but also more registers per thread). I don’t know enough about the architecture to tell whether this is even possible. It doesn’t seem to work right now, so I’d like to know whether it is technically possible and, if not, whether it will be implemented in the (near) future.
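To make the intended coverage concrete, here is a minimal CPU-side sketch in plain C++ (no GPU involved) that emulates the two launches above and counts how often each element would be touched. It assumes mykernel computes x = blockIdx.x*blockDim.x + threadIdx.x, y likewise, and accesses element y*width + x relative to the pointer it is given; the real kernel is not shown, so that indexing is only an assumption. The names emulate_launch, emulate_two_launches and count_hits are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one launch of a per-element kernel on the CPU.
// Assumption: each thread touches base[y * width + x], where x and y are
// built from block and thread indices the usual way.
static void emulate_launch(std::vector<int>& hits, std::size_t base_offset,
                           int grid_x, int grid_y, int block_x, int block_y,
                           int width)
{
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            for (int ty = 0; ty < block_y; ++ty)
                for (int tx = 0; tx < block_x; ++tx) {
                    const int x = bx * block_x + tx;
                    const int y = by * block_y + ty;
                    const std::size_t idx = base_offset
                        + static_cast<std::size_t>(y) * width + x;
                    ++hits[idx];  // one "thread" touched this element
                }
}

// Emulates the two launches from the post and returns, for every element of
// the width x height matrix, how many threads touched it.
std::vector<int> emulate_two_launches(int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);

    const int full_x = width / 32;   // number of full 32-wide blocks
    const int full_y = height / 16;  // number of full 16-high blocks
    const int rem_x  = width % 32;   // leftover columns on the right edge

    // First launch: full 32x16 blocks only.
    emulate_launch(hits, 0, full_x, full_y, 32, 16, width);

    // Second launch: one column of (width%32)x16 blocks, with the pointer
    // offset to the first leftover column. Skipped when nothing is left over,
    // since a block with zero threads is not a valid launch configuration.
    if (rem_x > 0)
        emulate_launch(hits, static_cast<std::size_t>(32) * full_x,
                       1, full_y, rem_x, 16, width);

    return hits;
}

// Counts the elements that were touched exactly `value` times.
int count_hits(const std::vector<int>& hits, int value)
{
    int n = 0;
    for (int h : hits)
        if (h == value) ++n;
    return n;
}
```

Under those assumptions, width = 70 and height = 40 give 2240 elements touched exactly once and 560 never touched: the offset launch picks up the 6 leftover columns, but the last height%16 = 8 rows would still need a similar strip launch of their own.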
Thank you!