CUDAkk
June 26, 2009, 3:30pm
21
True, a compute 1.3 device can coalesce 8-bit reads. In benchmarks, however, coalesced 8-bit reads are still painfully slow compared to coalesced 32/64/128-bit reads.
Can you give any hints about my code (given at the beginning of this thread) to improve performance?
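For illustration, a minimal sketch of what a wider access looks like (the buffers src and dst and the count n4 are hypothetical): reading the same bytes through uchar4 lets each thread issue one 32-bit transaction instead of four separate 8-bit ones.

// Minimal sketch (src, dst and n4 are made-up names): each thread moves
// 4 bytes in a single 32-bit transaction instead of four 8-bit ones.
__global__ void copyBytes4( const uchar4 *src, uchar4 *dst, int n4 )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n4 )            // n4 = number of uchar4 elements (total bytes / 4)
        dst[i] = src[i];     // one coalesced 32-bit load and store per thread
}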
Hi All,
Here is my kernel:
__global__ void getValue( unsigned char *Value, long xValue, long yValue )
{
    // total number of pixels; here xValue = 2000 and yValue = 1500
    long limit = __mul24( xValue, yValue );
    long idx   = __mul24( blockIdx.x, blockDim.x ) + threadIdx.x;

    if ( idx < limit )
    {
        int y  = idx / xValue;          // row of this thread's pixel
        int y_ = __mul24( y, 8 );
        int dt = __mul24( y, idx );

        // each thread produces 256 bytes: 4 x 8 x 8 texture fetches
        for ( int i = 0; i < 4; i++ )
        {
            for ( int l = 0; l < 8; l++ )
            {
                int dl  = tex1Dfetch( hDY, y_ + l ) + dt + i;
                int ind = __mul24( idx, 256 ) + __mul24( i, 64 ) + l * 8;

                Value[ind    ] = tex1Dfetch( hTexture, (0 << 2) + dl );
                Value[ind + 1] = tex1Dfetch( hTexture, (1 << 2) + dl );
                Value[ind + 2] = tex1Dfetch( hTexture, (2 << 2) + dl );
                Value[ind + 3] = tex1Dfetch( hTexture, (3 << 2) + dl );
                Value[ind + 4] = tex1Dfetch( hTexture, (4 << 2) + dl );
                Value[ind + 5] = tex1Dfetch( hTexture, (5 << 2) + dl );
                Value[ind + 6] = tex1Dfetch( hTexture, (6 << 2) + dl );
                Value[ind + 7] = tex1Dfetch( hTexture, (7 << 2) + dl );
            }
        }
    }
}
This is called using 256 threads per block.
My problem is this: the kernel's execution time is 93 ms, and I don't understand why it takes so long.
Does anyone have any ideas? Please help, I have the same problem.
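For reference, a minimal host-side sketch of how such a launch is typically configured and timed with CUDA events (d_Value is a hypothetical device pointer; the textures hDY and hTexture are assumed to be bound already):

// Host-side sketch: d_Value must hold limit * 256 bytes, since each
// thread writes 256 bytes.
int  limit = 2000 * 1500;                        // xValue * yValue
dim3 block( 256 );                               // 256 threads per block
dim3 grid( (limit + block.x - 1) / block.x );    // one thread per pixel

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
getValue<<< grid, block >>>( d_Value, 2000, 1500 );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );        // elapsed kernel time in ms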
CUDAkk
June 27, 2009, 10:37am
23
Hi Manjunath Gudisi,
Actually, the problem in my code is the writes to Value. If I replace Value with a shared array, the kernel does nothing but fetch values from texture memory and discard them, and the time drops to 20 ms, which is a big improvement. But so far I have not succeeded in using a shared array in this kernel; one of the problems is that Value is unsigned char.
Can anybody tell me how to use shared memory in my kernel in place of Value?
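One common pattern for this is to stage each thread's unsigned char results in shared memory and then have the whole block flush the buffer to global memory as 32-bit words, so consecutive threads store to consecutive addresses. The sketch below is only that, a rough sketch and not a drop-in rewrite of getValue; the kernel name stagedWrite, BYTES_PER_THREAD, and the dummy byte values are made up for illustration.

#define BYTES_PER_THREAD 4                 // kept small for the sketch

__global__ void stagedWrite( unsigned char *out, int n )
{
    // shared staging buffer, declared as int so the 32-bit flush is aligned
    __shared__ int sInt[(256 * BYTES_PER_THREAD) / 4];    // 256 threads/block
    unsigned char *s = (unsigned char *)sInt;

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x * BYTES_PER_THREAD;

    // 1) each thread fills its own bytes (dummy values here; in the real
    //    kernel these would be the tex1Dfetch results)
    for ( int k = 0; k < BYTES_PER_THREAD; k++ )
        s[tid * BYTES_PER_THREAD + k] = (unsigned char)(tid + k);

    __syncthreads();

    // 2) the block flushes the buffer as ints: thread t writes word t,
    //    t + blockDim.x, ... so the stores are coalesced
    int *outInt = (int *)(out + base);     // base is a multiple of 1024,
                                           // so the cast stays 4-byte aligned
    int  words  = (256 * BYTES_PER_THREAD) / 4;
    for ( int w = tid; w < words; w += blockDim.x )
        if ( base + (w + 1) * 4 <= n )     // stay inside the output buffer
            outInt[w] = sInt[w];
}

Applied to getValue, the same idea would mean letting each thread stage its 256 bytes (or a tile of them) in shared memory and having the block write them out cooperatively, instead of each thread scattering single bytes 256 apart in global memory as the current ind = idx * 256 indexing does.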