GTX 259 blocks/threads configuration to achieve optimal performance

Hi,

Does the CPU context-switching concept apply to the GPU as well? When the number of threads is greater than the number of cores, will context-switching overhead be incurred on the GPU?

We have a GTX 259 card which has two GPU units. Each has 30 multiprocessors and 240 cores. According to the book "CUDA by Example", "the optimal performance is achieved when the number of blocks we launch is exactly twice the number of multiprocessors our GPU contains." (page 176).

Does that mean we need to configure the card to run the heavy-calculation kernel function using 60 blocks with 4 threads in each block for each GPU on the GTX 259?

As there is also the concept of a 32-thread warp involved, should we use a configuration of 60 blocks with 32 threads each instead? Will this generate too much overhead from context switching?

In the 60-block, 4-thread configuration, will the GPU generate an extra 28 threads in each block to make up at least one warp?

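For concreteness, here is a minimal sketch (the kernel body, data, and problem size are placeholders, not from the actual application) of how the two configurations in question would be launched:

#include <cuda_runtime.h>

// Placeholder kernel standing in for the heavy-calculation kernel.
__global__ void heavyCalc(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                  // stand-in for the real work
}

int main()
{
    int n = 1 << 20;                      // placeholder problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 60 blocks of 4 threads: each block still occupies a whole 32-thread warp,
    // so 28 lanes per block sit idle.
    heavyCalc<<<60, 4>>>(d_data, n);

    // 60 blocks of 32 threads: exactly one full warp per block.
    heavyCalc<<<60, 32>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}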
Thanks

You just made up a new card model. I only know about GTX 295.

  1. Context switching applies to the GPU too, but it is done in dedicated hardware and can be treated as overhead free.
  2. Definitely not.
  3. Yes.

Hi,

Thanks for the reply.

If there is no overhead from using more threads than cores, will a configuration of 60 blocks with 512 threads in each block give better performance, based on the device query results below? Assume that memory is sufficient.

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

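For reference, those figures come from cudaGetDeviceProperties; here is a minimal sketch (the device index and variable names are placeholders) of querying them and applying the book's "twice the number of multiprocessors" rule of thumb:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0 = one GPU of the card

    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);

    // Rule of thumb from "CUDA by Example": launch twice as many blocks
    // as there are multiprocessors (60 on a 30-SM GPU).
    int blocks = 2 * prop.multiProcessorCount;
    int threadsPerBlock = prop.maxThreadsPerBlock;  // 512 here; worth benchmarking
    printf("Candidate launch: <<<%d, %d>>>\n", blocks, threadsPerBlock);
    return 0;
}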
Thanks

You would be much better served by reading Chapter 5 of the CUDA programming guide. Optimizing execution parameters requires both understanding of how the programming model works and benchmarking of your code.

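As one way to do that benchmarking (a sketch only; the kernel, data, and candidate sizes below are placeholders), CUDA events can be used to time each candidate configuration:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real heavy-calculation kernel.
__global__ void heavyCalc(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Time a single launch configuration with CUDA events (returns milliseconds).
float timeConfig(int blocks, int threadsPerBlock, float *d_data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    heavyCalc<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    int n = 1 << 20;                            // placeholder problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Compare a few candidate block sizes at 60 blocks and keep the fastest.
    for (int t = 32; t <= 512; t *= 2)
        printf("60 blocks x %3d threads: %.3f ms\n", t, timeConfig(60, t, d_data, n));

    cudaFree(d_data);
    return 0;
}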