Performance Threads, blocks, registers and shared memory

Please, I would like to know whether the following configuration is possible:

KernelName<<<1000,256>>>(parameters)
with 10 registers per thread (1000 blocks and 256 threads per block), assuming everything else is OK.

What about the same launch with 32 bytes of shared memory per thread?

And with 96 bytes of shared memory per thread?

A compute 1.1 device has 768 threads and 8192 registers per multiprocessor. With 256 threads per block, that’s 3 blocks per multiprocessor (so you’ll be using the max number of threads), giving you 8192 / (3 * 256) ≈ 10.67 registers per thread. So 10 registers per thread fits.

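All compute 1.x devices have 16 KB of shared memory per multiprocessor, so using the same calculation, you get 16384 / (3 * 256) ≈ 21.33 bytes per thread. That means 32 bytes per thread only fits if fewer than 3 blocks are resident per multiprocessor, and 96 bytes per thread (24576 bytes per block) exceeds the 16 KB limit outright.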
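The calculation above can be sketched as a small host-side helper. This is only a rough model, not the exact allocation rule: it ignores register allocation granularity and any shared memory the runtime reserves per block. The function name `residentBlocks` is made up for illustration, and the limit of 8 blocks per multiprocessor is an assumption taken from NVIDIA's published compute 1.x figures, not something stated in this thread.

```cpp
#include <algorithm>

// Per-multiprocessor limits for a compute 1.1 device. The first three are the
// figures quoted above; the 8-block limit is an assumption from NVIDIA's
// published compute 1.x specifications.
const int kMaxThreadsPerSm  = 768;
const int kRegistersPerSm   = 8192;
const int kSharedBytesPerSm = 16384;
const int kMaxBlocksPerSm   = 8;

// Hypothetical helper: how many blocks of this shape can be resident on one
// multiprocessor at the same time. A result of 0 means a single block already
// exceeds one of the limits.
int residentBlocks(int threadsPerBlock, int regsPerThread, int sharedBytesPerThread) {
    if (threadsPerBlock <= 0) return 0;

    // Limit from the thread count and the block count.
    int blocks = std::min(kMaxBlocksPerSm, kMaxThreadsPerSm / threadsPerBlock);

    // Limit from the register file.
    int regsPerBlock = regsPerThread * threadsPerBlock;
    if (regsPerBlock > 0) blocks = std::min(blocks, kRegistersPerSm / regsPerBlock);

    // Limit from shared memory.
    int sharedPerBlock = sharedBytesPerThread * threadsPerBlock;
    if (sharedPerBlock > 0) blocks = std::min(blocks, kSharedBytesPerSm / sharedPerBlock);

    return blocks;
}
```

Under these assumptions, residentBlocks(256, 10, 0) gives 3, residentBlocks(256, 10, 32) gives 2, and residentBlocks(256, 10, 96) gives 0. In other words, the 32-byte case should still run but with only 2 blocks per multiprocessor, and the 96-byte case asks for more shared memory than one multiprocessor has.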

*** compute 1.1 device ***

KernelName<<<(number of blocks), (number of threads per block) >>>(parameters)

So (number of blocks) x (number of threads per block) must always be less than 769.

Is that right?

Or…

KernelName<<<1000,64>>>(parameters)

125 sets of 8x64=512 threads

8192 / (8x64) = 16 registers per thread

16384 / (8x64) = 32 bytes per thread

Is that right?
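The per-thread figures in the 64-thread case can be checked with a short host-side sketch. It assumes the compute 1.1 limits quoted earlier in the thread plus a limit of 8 resident blocks per multiprocessor (an assumption from NVIDIA's published compute 1.x figures). The names `PerThreadBudget` and `perThreadBudget` are made up for illustration, and the result is an upper bound that ignores allocation granularity.

```cpp
#include <algorithm>

// Hypothetical result type: what one multiprocessor can hold for a given block size.
struct PerThreadBudget {
    int blocksPerSm;     // resident blocks per multiprocessor
    int threadsPerSm;    // resident threads per multiprocessor
    double registers;    // registers available per thread
    double sharedBytes;  // shared memory bytes available per thread
};

// Assumes a compute 1.1 device and 1 <= threadsPerBlock <= 512.
PerThreadBudget perThreadBudget(int threadsPerBlock) {
    const int maxThreadsPerSm  = 768;
    const int maxBlocksPerSm   = 8;      // assumed limit, see the note above
    const int registersPerSm   = 8192;
    const int sharedBytesPerSm = 16384;

    // Resident blocks are capped by both the block limit and the thread limit.
    int blocks  = std::min(maxBlocksPerSm, maxThreadsPerSm / threadsPerBlock);
    int threads = blocks * threadsPerBlock;

    PerThreadBudget budget;
    budget.blocksPerSm  = blocks;
    budget.threadsPerSm = threads;
    budget.registers    = static_cast<double>(registersPerSm) / threads;
    budget.sharedBytes  = static_cast<double>(sharedBytesPerSm) / threads;
    return budget;
}
```

For perThreadBudget(64) this gives 8 blocks, 512 threads, 16 registers and 32 bytes per thread, which matches the numbers above. For perThreadBudget(256) it gives 3 blocks, 768 threads, about 10.67 registers and about 21.33 bytes. Note that these limits apply per multiprocessor, not to the whole grid, so launching 1000 blocks is fine: blocks that are not resident wait until earlier ones finish.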