Howto: Compute a value ONCE per block and reuse it for all threads?


I have a variable that has the same value for all threads within a block because it depends only on the block's index. The calculation of this value is very time-consuming, which is why I want to do it only once per block.

My proposed method is the following:

dim3 gridSize; // number of blocks
gridSize.x = 256;
gridSize.y = 256;
int numThreads = 256; // number of threads per block

kernel<<<gridSize, numThreads>>>(d_array);

__global__ void kernel(float* _array)
{
    int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

    __shared__ float psi;

    if (threadIdx.x == 0)
        psi = __cosf((float)blockIdx.x) * __sinf((float)blockIdx.y);

    __syncthreads(); // all threads must wait until thread 0 has written psi

    float beta = psi * __sinf((float)threadIdx.x);

    _array[idx] = beta;
}


In my real code the computation of psi is much more complex, which is why I try to do it only once per block and store the result in shared memory. The downside of the method above is that all other threads have to wait (due to __syncthreads()) until thread 0 is done.
Does anyone know a more effective way to calculate values that are identical for all threads within a block but differ from block to block?

thank you and best regards, rob

Compute them on the CPU and load them into shared memory in the kernel?
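That could look something like the sketch below (illustrative only; the `cosf * sinf` formula stands in for the real, more expensive per-block computation, and names like `d_psi` and `launch` are placeholders). One value per block is computed on the host, copied to device memory, and indexed by block inside the kernel; I've used a plain global-memory read here, though each block could also stage its value into shared memory first.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

// Each thread reads its block's precomputed psi - no branch, no syncthreads.
__global__ void kernel(const float* d_psi, float* d_array)
{
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int idx   = block * blockDim.x + threadIdx.x;
    d_array[idx] = d_psi[block] * __sinf((float)threadIdx.x);
}

void launch(float* d_array, dim3 gridSize, int numThreads)
{
    int numBlocks = gridSize.x * gridSize.y;

    // Host-side precompute of the per-block table (placeholder formula).
    float* h_psi = (float*)malloc(numBlocks * sizeof(float));
    for (int by = 0; by < (int)gridSize.y; ++by)
        for (int bx = 0; bx < (int)gridSize.x; ++bx)
            h_psi[by * gridSize.x + bx] = cosf((float)bx) * sinf((float)by);

    float* d_psi;
    cudaMalloc(&d_psi, numBlocks * sizeof(float));
    cudaMemcpy(d_psi, h_psi, numBlocks * sizeof(float),
               cudaMemcpyHostToDevice);

    kernel<<<gridSize, numThreads>>>(d_psi, d_array);

    cudaFree(d_psi);
    free(h_psi);
}
```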

Your method is fine as it is.
But if that per-block compute is so slow, you could precompute the results as a table. You could use the CPU as gshi recommended, or perhaps even better, use the GPU itself.
Do this by running TWO kernels. The first kernel builds the table: each of its threads computes the per-BLOCK value that one block of the second kernel will use.
This would be efficient if you have more than 5000 blocks or so.
The efficiency comes from the fact that the init kernel will be using every thread and nobody is waiting. The time to write the result to device memory and then for the second kernel to read it is pretty small.
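A rough sketch of that two-kernel idea, assuming the 256x256 grid from the example (the `initTable` name and the `cos * sin` placeholder are mine, not part of the original code):

```cuda
#include <cuda_runtime.h>

// Kernel 1: every thread computes the expensive per-block value for ONE
// block of the second kernel, so no thread sits idle.
__global__ void initTable(float* d_psi, int numBlocks, int gridWidth)
{
    int block = blockIdx.x * blockDim.x + threadIdx.x;
    if (block < numBlocks)
        // Placeholder for the real expensive computation; bx/by recover
        // the 2D block coordinates from the flat index.
        d_psi[block] = __cosf((float)(block % gridWidth))
                     * __sinf((float)(block / gridWidth));
}

// Kernel 2: psi is now just a cheap table lookup per block.
__global__ void kernel(const float* d_psi, float* d_array)
{
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int idx   = block * blockDim.x + threadIdx.x;
    d_array[idx] = d_psi[block] * __sinf((float)threadIdx.x);
}

// Host side:
//   int numBlocks = gridSize.x * gridSize.y;
//   initTable<<<(numBlocks + 255) / 256, 256>>>(d_psi, numBlocks, gridSize.x);
//   kernel<<<gridSize, numThreads>>>(d_psi, d_array);
```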

However small that overhead is, though, if your per-block compute is fast and cheap, the overhead will kill you. As a totally unsupported guess, I'd expect the per-block computation would need to be 50 or more operations for the init kernel to be worthwhile.

Kernel launch overhead is 10-25 us, but that's not really terrible compared to most computations.

A constant memory lookup table would be another option. You only need to pre-calculate the sine and cosine over [0, max(gridSize.x, gridSize.y)), which is only 256 floats per function in your example. That would eliminate the overhead of computing the transcendentals and leave you with a multiply. Every thread could do it - the constant memory cache would make the reads fast and the floating-point multiply is cheap, plus you eliminate the __syncthreads() call and some conditional execution. Just a thought.
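A minimal sketch of that constant-memory variant, assuming the 256x256 grid from the example (the table names and `setupTables` helper are illustrative):

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define TABLE_SIZE 256  // max(gridSize.x, gridSize.y) in the example

__constant__ float c_sin[TABLE_SIZE];
__constant__ float c_cos[TABLE_SIZE];

__global__ void kernel(float* d_array)
{
    int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    // psi is now two cached constant-memory reads and a multiply -
    // no __syncthreads(), no divergent branch on threadIdx.x == 0.
    float psi = c_cos[blockIdx.x] * c_sin[blockIdx.y];
    d_array[idx] = psi * __sinf((float)threadIdx.x);
}

// Host side: fill the tables once before launching the kernel.
void setupTables()
{
    float h_sin[TABLE_SIZE], h_cos[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; ++i) {
        h_sin[i] = sinf((float)i);
        h_cos[i] = cosf((float)i);
    }
    cudaMemcpyToSymbol(c_sin, h_sin, sizeof(h_sin));
    cudaMemcpyToSymbol(c_cos, h_cos, sizeof(h_cos));
}
```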