I am doing some processing on a large chunk of data that cannot fit into shared memory if I try to process it all with a single kernel call; however, the data can be broken down into smaller pieces for independent processing. I have determined the maximum number of data elements that can fit into shared memory per multiprocessor.
I am wondering if I would get better performance by using a larger number of blocks per multiprocessor with fewer threads each (8 blocks with 32 threads each) or a smaller number of blocks per multiprocessor with more threads each (1 block with 256 threads). Or does the answer lie somewhere in the middle?
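For concreteness, here is a sketch of the two launch configurations I am comparing. The kernel name and the per-block loading pattern are made up for illustration; assume elemsPerBlock elements fit in shared memory per block:

```cuda
// Hypothetical kernel: each block stages its chunk into shared memory,
// processes it, and writes it back. elemsPerBlock is the element count
// I determined fits in shared memory.
__global__ void processChunk(const float *in, float *out, int elemsPerBlock)
{
    extern __shared__ float tile[];  // dynamic shared memory

    // Each thread strides over the block's chunk of elements.
    int base = blockIdx.x * elemsPerBlock;
    for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // ... process tile[] in shared memory ...

    for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
        out[base + i] = tile[i];
}

// Option A: many small blocks, 32 threads each
processChunk<<<numChunks, 32, elemsPerBlock * sizeof(float)>>>(d_in, d_out, elemsPerBlock);

// Option B: fewer, larger blocks, 256 threads each
processChunk<<<numChunks, 256, elemsPerBlock * sizeof(float)>>>(d_in, d_out, elemsPerBlock);
```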
Also, do I take any sort of performance hit for making many (>64) independent kernel launches? My feeling is that any launch overhead would be more than made up for by being able to use shared memory instead of global memory (as I have been doing).
Thanks for any help.