Hi all,
I’m trying to speed up the execution of a kernel. To reach the speed I need, I want to use as many threads as possible.
Since I’m using a Quadro FX 4600, the compute capability is 1.0, so the limit on the block size is 512 threads and on the grid size 65535 in each dimension.
When I increase the size of my blocks I get the error cudaError_enum.
This is certainly related to a shared memory problem, since I use the block dimensions to determine how much shared memory will be occupied:
At most my block dimensions are (17,17), which corresponds to 289 (< 512) threads per block.
And the grid dimensions are (17408,13056), each below 65535.
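For reference, here is a minimal sketch of how I set up the launch (the kernel name myKernel is a placeholder and the arguments are dummies, not my real host code):

#include <cstdio>
#include <cuda_runtime.h>

#define U16 unsigned short int

// Placeholder kernel with the same signature as mine (body omitted).
__global__ void myKernel(float* d_idata, size_t pitch_in, U16* d_RGBmap,
                         float* d_odata, size_t pitch_out,
                         unsigned int width, unsigned int height) {}

int main()
{
    dim3 block(17, 17);        // 289 threads per block (< 512)
    dim3 grid(17408, 13056);   // each dimension < 65535

    myKernel<<<grid, block>>>(0, 0, 0, 0, 0, 0, 0); // dummy arguments

    // Checking right after the launch gives a readable message
    // instead of a bare cudaError_enum later on.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}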
#define U16 unsigned short int
//Shared memory
__shared__ float disp_sdata[block_size_x][block_size_y]; // shared memory for the disparity map
__shared__ U16 colorR_sdata[block_size_x][block_size_y]; // shared memory for the red channel
__shared__ U16 colorG_sdata[block_size_x][block_size_y]; // shared memory for the green channel
__shared__ U16 colorB_sdata[block_size_x][block_size_y]; // shared memory for the blue channel
So this means that I’m using
sizeof(float)*block_size_x*block_size_y + 3*sizeof(U16)*block_size_x*block_size_y
which, with sizeof(float) = 4 and sizeof(U16) = 2, is equal to
4*289 + 3*2*289 = 2890 bytes
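Just to double-check that arithmetic, a tiny host-side sketch (the block_size_x/block_size_y defines mirror my compile-time constants):

#include <cstdio>
#include <cstddef>

#define U16 unsigned short int
#define block_size_x 17
#define block_size_y 17

int main()
{
    // Static shared memory used by the four per-block arrays
    size_t bytes = sizeof(float) * block_size_x * block_size_y     // disp_sdata
                 + 3 * sizeof(U16) * block_size_x * block_size_y;  // colorR/G/B_sdata
    printf("%u bytes\n", (unsigned)bytes);   // prints 2890
    return 0;
}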
I also read in the sticky note that the built-in variables use 16 bytes of shared memory.
As for the kernel parameters, I use the following:
float* d_idata, size_t pitch_in,
U16 * d_RGBmap,
float* d_odata, size_t pitch_out,
unsigned int width, unsigned int height
Since the size of a pointer is the size of an address (4 bytes on a 32-bit platform), as is the size of size_t and of unsigned int, the 7 parameters use 4*7 = 28 bytes. (On compute capability 1.x the kernel parameters are passed through shared memory, which is why I count them here.)
So the grand total is
2890 + 16 + 28 = 2934 bytes, which is quite far from the shared memory limit of 16 KB.
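If your toolkit is recent enough (cudaFuncGetAttributes appeared around CUDA 2.2, so this is only a sketch of what I would check), the runtime can report what the compiler actually reserved per block, which would confirm or refute my hand count. Again, myKernel is a placeholder:

#include <cstdio>
#include <cuda_runtime.h>

#define U16 unsigned short int

// Placeholder kernel with my signature (real body omitted).
__global__ void myKernel(float* d_idata, size_t pitch_in, U16* d_RGBmap,
                         float* d_odata, size_t pitch_out,
                         unsigned int width, unsigned int height) {}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // Statically declared shared memory, registers, and the largest block
    // the kernel can actually be launched with given those resources.
    printf("shared memory per block: %u bytes\n", (unsigned)attr.sharedSizeBytes);
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return 0;
}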
However, whenever I use a block size of (18,18), which corresponds to 324 threads/block and 4*324 + 3*2*324 = 3240 bytes of shared memory, I get the error cudaError_enum. And I don’t understand why…
Is there something else that I’m forgetting?
Has anybody else run into this kind of trouble before?
I’m pretty sure this is related to a shared memory issue, since I’ve gone through some topics in the Nvidia forum that made it very clear that there is a link between the error “cudaError_enum” and shared memory. Moreover, the block size is the only parameter I change before the error occurs, so it definitely comes from there.
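In case it helps, this is the kind of error-checking wrapper I’m now putting around every runtime call to pin down exactly which call reports the failure (a generic sketch, not my real code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps a runtime call so the failing call and the decoded error string are
// printed immediately, instead of surfacing later as an opaque cudaError_enum.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t e = (call);                                            \
        if (e != cudaSuccess) {                                            \
            fprintf(stderr, "%s:%d: %s failed: %s\n",                      \
                    __FILE__, __LINE__, #call, cudaGetErrorString(e));     \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main()
{
    float* d_buf;
    CUDA_CHECK(cudaMalloc((void**)&d_buf, 1024 * sizeof(float)));
    // ... kernel launch goes here, then:
    CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
    CUDA_CHECK(cudaThreadSynchronize());   // catches errors during execution
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}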