CUDA Device Query (Driver API) statically linked version
There is 1 device supporting CUDA
Device 0: “GeForce 210”
CUDA Driver Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 2
Total amount of global memory: 536870912 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.40 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
My question is about the total amount of shared memory: how is the shared memory organized? How many blocks should I use to get better performance from shared memory in my code? On one SM (8 cores), if I have 8 blocks with 1 thread each, do I get 8 separate shared memories? And if I have 10 blocks (1 thread per block), do the 2 blocks that are not yet resident still have shared memory allocated, or do they get swapped?
Total shared memory available is 16 KB per multiprocessor.
Shared memory is declared inside the kernel. The allocated memory belongs to a block and is shared by all the threads within that block.
For example, if you declare __shared__ float s_buffer[100];
and launch 10 blocks of 32 threads each (320 threads in total), that buffer is shared by the 32 threads within a block, and each of the 10 blocks gets its own separate copy. Note that the scope of shared memory is a CUDA block: it is visible to all the threads of that block and to nothing outside it.
During execution, all the threads of a block become active at the same time, and the shared memory allocated to that block stays resident until the block finishes.
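As a minimal sketch (the kernel and buffer names are only illustrative), each block below gets its own private copy of s_buffer, and every thread in that block sees the same copy:

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float s_buffer[100];        // one copy per block, visible to all its threads

    int tid = threadIdx.x;                 // 0..31 with 32 threads per block
    s_buffer[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // make every thread's write visible to the others

    if (tid == 0) {                        // thread 0 reads what the other 31 threads wrote
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += s_buffer[i];
        out[blockIdx.x] = sum;
    }
}

Launched as blockSum<<<10, 32>>>(d_in, d_out), each of the 10 blocks allocates its own 100-float s_buffer (400 bytes) out of the 16 KB available per multiprocessor.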
If I have 1 thread per block, is the whole 16 KB for that thread? So 1 thread gets 16 KB, 2 threads (same block) get 8 KB each (for example)? In other words, is s_buffer shared between the threads of the same block, or are there 32 copies of s_buffer (one per thread)?
My problem concerns an image, 640x480 (PGM). I want to divide this image among blocks/threads, but I don't know the principle (based on my GF210 specs) for improving performance. How can I tell whether it is better to split the image into 64 blocks rather than 32?
No, there is just one s_buffer shared by all 32 threads in a block.
Where and how to use shared memory depends on your algorithm. In image processing applications where you apply a set of operations to every pixel of the input image, your kernel should typically launch one thread per pixel. In your case 640*480 = 307200 pixels, which you can divide into 256 threads per block and 1200 blocks (see the launch sketch below). The important thing is to make sure all the GPU cores are kept busy.
You can use the CUDA Occupancy Calculator (an Excel sheet shipped in your CUDA SDK path) to check multiprocessor occupancy and find the best launch configuration for your problem.
Shared memory is worthwhile only if you access the same global memory locations multiple times. Otherwise it is better to keep the input image in read-only texture memory.
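For instance, a rough sketch of the one-thread-per-pixel launch described above (processPixels and d_img are placeholder names):

__global__ void processPixels(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];   // example per-pixel operation
}

// 640 x 480 = 307200 pixels, one thread per pixel.
// 16 x 16 = 256 threads per block; 40 x 30 = 1200 blocks covers the image exactly.
dim3 threads(16, 16);
dim3 blocks((640 + threads.x - 1) / threads.x,   // = 40
            (480 + threads.y - 1) / threads.y);  // = 30
processPixels<<<blocks, threads>>>(d_img, 640, 480);

With 256 threads per block you stay under the 512 threads-per-block limit of your GF210 and still give the scheduler plenty of blocks to keep both multiprocessors busy.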
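If the image really is read-only, here is a rough sketch using the legacy texture-reference API (the API available in the CUDA 3.0 era of this card; texImage, d_out and invert are only illustrative names):

texture<unsigned char, 2, cudaReadModeElementType> texImage;   // read-only, cached 2D fetches

__global__ void invert(unsigned char *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)                               // fetch through the texture cache,
        out[y * width + x] = 255 - tex2D(texImage, x + 0.5f, y + 0.5f);   // sampling at the texel center
}

On the host side you would allocate the image with cudaMallocPitch, copy the PGM pixels in, and bind the texture with cudaBindTexture2D before launching the kernel.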