Shared memory writes really slow

Hello!

I have a kernel function in which threads write to different elements of an array in shared memory. The problem is that the writes are extremely slow. Since the threads write to different elements of the array, I don’t think there are any bank conflicts.

Does anyone know what could be the problem here? The compute capability of my GPU device is 1.3. I’m launching 65536 x 65536 blocks and 500 threads in each block. The array size is well below the shared memory size per block limit.

I would really appreciate your suggestions.

Thanks!

Hi!

Your grid size is 65536 x 65536? The maximum for each grid dimension is 65535. It is strange that your example runs at all!
Are you sure about this?

I’m sorry, the grid size is actually 65,535 x 65,535. I’m using the dimensions returned by maxGridSize[0] and maxGridSize[1]. Should I just use 65,535 x 1?
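
On the grid-size question: maxGridSize is an upper limit, not a recommended launch configuration, so the grid is usually sized from the amount of work. Here is a minimal sketch, assuming a hypothetical problem with n elements and one thread per element. The names blocks_for and choose_grid are made up for illustration and are not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements with
// threads_per_block threads each (rounds up).
static unsigned int blocks_for(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: prints the limits of device 0 and fills in a 1D grid
// sized to the problem. Returns false if the device query fails or if the
// grid would exceed maxGridSize[0].
static bool choose_grid(unsigned int n, unsigned int threads_per_block, dim3 *grid)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    printf("maxGridSize = %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1]);

    unsigned int blocks = blocks_for(n, threads_per_block);
    if (blocks > (unsigned int)prop.maxGridSize[0])
        return false;   // would need a 2D grid or several elements per thread

    *grid = dim3(blocks, 1, 1);
    return true;
}
```

With 500 threads per block, one million elements need blocks_for(1000000, 500) = 2000 blocks, which is far below the 65535 limit.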

Can you show the code that does the write and the values of the indexes?
Most likely there are bank conflicts, I think.

Hi! Thanks for writing back. Here’s the kernel function:

__global__ void kernel()
{
   __shared__ float array[500];
   int tx = threadIdx.x;
   int i;

   for (i = 0; i < 100; ++i)
   {
      array[tx] = 0.0f;
   }
}

The grid and block dimensions are:

dimBlock.x = 500;
dimBlock.y = 1;
dimBlock.z = 1;

dimGrid.x = 65535;
dimGrid.y = 65535;
dimGrid.z = 1;

Thanks for your help!

This access is suspicious.

array[tx] = 0.0f;

Look at the PTX code and search for this access. It is a store to shared memory. When generating PTX, the compiler sometimes changes or removes instructions that it considers useless. This write to array[tx] looks useless because the value is never read afterwards. Please check this. I think this code is not being executed at all.

The program may be slow because there are so many blocks, but this access does not contribute to that.

Regards!
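
To illustrate the point about removed stores, here is a minimal sketch of a variant whose shared-memory values are read back and written to global memory, so the compiler has a reason to keep the store. The names kernel_keep, run_keep and the buffer out are hypothetical and not part of the original code. The compiler may still collapse the loop into a single store, so the PTX (for example from nvcc -ptx) is worth inspecting for st.shared instructions either way.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Variant of the kernel whose shared-memory store feeds a global-memory
// result. `out` is a hypothetical buffer with one float per thread.
// Assumes blockDim.x <= 500 and a 1D grid small enough for int indexing.
__global__ void kernel_keep(float *out)
{
    __shared__ float array[500];
    int tx = threadIdx.x;

    for (int i = 0; i < 100; ++i)
    {
        array[tx] = (float)i;   // last iteration leaves 99.0f
    }
    __syncthreads();            // make all stores visible to the block

    // Read a neighbouring thread's element so the store is observable.
    out[blockIdx.x * blockDim.x + tx] = array[(tx + 1) % blockDim.x];
}

// Hypothetical host wrapper: runs kernel_keep on `blocks` blocks of
// `threads` threads and copies the results into host_out.
static cudaError_t run_keep(int blocks, int threads, float *host_out)
{
    float *dev_out = NULL;
    size_t bytes = (size_t)blocks * threads * sizeof(float);

    cudaError_t err = cudaMalloc((void **)&dev_out, bytes);
    if (err != cudaSuccess)
        return err;

    kernel_keep<<<blocks, threads>>>(dev_out);
    err = cudaGetLastError();   // catches invalid launch configurations
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err;
}
```

If the launch succeeds, every element of host_out should hold the value written in the last loop iteration.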

Thanks! I’ll try to check the ptx code. That should give me some clue.

You have 65535 x 65535 x 500 threads to run, so that's about 2*10^12 threads, each writing 100 times (or just once if the compiler optimises the code).

How long is ‘really slow’ ?
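
To spell out that arithmetic, here is a small host-side sketch. The function names are made up for illustration. The product does not fit in 32 bits, so it is computed with 64-bit integers.

```cuda
#include <cstdio>

// Total number of threads in a launch with a grid_x x grid_y grid and
// block_x threads per block. 64-bit arithmetic avoids overflow.
static unsigned long long total_threads(unsigned long long grid_x,
                                        unsigned long long grid_y,
                                        unsigned long long block_x)
{
    return grid_x * grid_y * block_x;
}

// Total number of shared-memory stores if each thread really performs
// writes_per_thread stores (that is, if the compiler removes none of them).
static unsigned long long total_stores(unsigned long long threads,
                                       unsigned long long writes_per_thread)
{
    return threads * writes_per_thread;
}

// Prints the figures for the launch configuration posted in this thread.
static void print_launch_size(void)
{
    unsigned long long threads = total_threads(65535ULL, 65535ULL, 500ULL);
    printf("threads: %llu\n", threads);
    printf("stores:  %llu\n", total_stores(threads, 100ULL));
}
```

For the posted configuration this gives 2,147,418,112,500 threads in 4,294,836,225 blocks, and 100 times that many stores if none are optimised away.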

The code that I included above is a simplified form of my actual code. However, even the simplified version has the same issue.

I have never seen the program complete. I usually wait for 15-20 minutes, and if the program is still running, I just terminate it. So it could be that the program simply stalls without reporting an error.
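
One way to tell a stall from a very long run is to time the same kernel on a small grid and to check for errors explicitly, since kernel launches are asynchronous and failures are not reported unless you ask. This is a minimal sketch under assumptions: timed_launch is a hypothetical helper, and the small grid size in the usage comment is arbitrary. If the GPU also drives a display, a watchdog in the driver may terminate long-running kernels, which would show up as an error here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same body as the simplified kernel posted in this thread.
__global__ void kernel()
{
    __shared__ float array[500];
    int tx = threadIdx.x;
    for (int i = 0; i < 100; ++i)
        array[tx] = 0.0f;
}

// Hypothetical helper: launches the kernel, waits for it to finish, and
// stores the elapsed time in milliseconds in *ms.
// Returns the first CUDA error encountered, or cudaSuccess.
static cudaError_t timed_launch(dim3 grid, dim3 block, float *ms)
{
    cudaEvent_t start, stop;
    cudaError_t err = cudaEventCreate(&start);
    if (err != cudaSuccess)
        return err;
    err = cudaEventCreate(&stop);
    if (err != cudaSuccess) {
        cudaEventDestroy(start);
        return err;
    }

    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    err = cudaGetLastError();               // invalid launch configuration
    cudaEventRecord(stop, 0);
    if (err == cudaSuccess)
        err = cudaEventSynchronize(stop);   // blocks until the kernel is done
    if (err == cudaSuccess)
        err = cudaEventElapsedTime(ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err;
}

// Usage (inside main or any host function):
//   float ms = 0.0f;
//   cudaError_t err = timed_launch(dim3(1024, 1, 1), dim3(500, 1, 1), &ms);
//   if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));
//   else printf("1024 blocks took %f ms\n", ms);
```

Scaling the measured time from the small grid up to 65535 x 65535 blocks gives a rough idea of whether 15-20 minutes is simply not long enough.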
