Using a block-wide counter with shared memory. Opinions are welcome...

I am writing a kernel function that will read X elements from a global array into shared memory, and then needs a counter that increments whenever a loaded sample meets some condition.
The global memory reads will be coalesced, and there are no bank conflicts while reading the samples.
Let’s say I launch the kernel this way:

const unsigned short Num_Samples = 256; // Which is also the number of threads
const unsigned short Num_Blocks = 2000;

our_kernel < Num_Samples > <<< Num_Blocks, Num_Samples, (Num_Samples + 1) * sizeof(int) >>> (in, out, len);

The third launch parameter requests shared memory for Num_Samples + 1 elements, i.e. 257, so the block has an extra element for the counter. The beginning of the kernel definition, which I'm still writing, is:

template <const unsigned short NUM_SAMPLES>
__global__ void our_kernel(float *array_in, float *array_out, size_t array_len)
    {
    extern volatile __shared__ int sdata[];
    size_t  tid        = threadIdx.x,
            gridSize   = NUM_SAMPLES * gridDim.x,
            i          = blockIdx.x * NUM_SAMPLES + tid;
    sdata[tid] = 0;

    while (i < array_len)
        {
        sdata[tid] += array_in[i];
        i += gridSize;
        }
    __syncthreads();

    // MORE STUFF FOLLOWS, BUT I'M NOT FINISHED WORKING ON IT
    }

I will be doing a test after loading the data in the while() loop (line 10 of the snippet above), and if it passes, I have to increment this block's counter. So in this case, the threads need to write to the same resource.
The kernel launch allocates one extra position in the shared memory, but I am not sure this is the correct approach.
What do you guys suggest for implementing this counter?

Your data in global memory is float, but the shared memory you are loading it into is declared as int?

That looks broken to me.

It's also not clear that you need volatile on your shared memory declaration, but you may.

One possible approach to update the counter would be an atomicAdd:

    extern volatile __shared__ float sdata[];
    // the counter lives one element past the per-thread data; the C-style cast also drops volatile
    int *ctr = (int *)(sdata + NUM_SAMPLES);
    size_t  tid        = threadIdx.x,
            gridSize   = NUM_SAMPLES * gridDim.x,
            i          = blockIdx.x * NUM_SAMPLES + tid;
    sdata[tid] = 0;
    if (!tid) *ctr = 0;                        // one thread zeroes the block-wide counter
    __syncthreads();

    while (i < array_len)
        {
        float temp = array_in[i];
        if (meets_condition(temp))             // meets_condition() is a placeholder for your test
            atomicAdd(ctr, 1);
        sdata[tid] += temp;
        i += gridSize;
        }

(coded in browser, not tested, for illustrative purposes)
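
One more note: the per-block count is only complete once every thread has finished its atomicAdd, so whatever follows that loop should pass a __syncthreads() before any thread reads *ctr. A minimal sketch of such a tail (writing the count to array_out is just a hypothetical use of it, not necessarily what your kernel needs):

    __syncthreads();                                    // wait until every thread in the block has done its atomicAdd
    if (!tid) array_out[blockIdx.x] = (float)(*ctr);    // e.g. one thread per block publishes the count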

Good afternoon, Robert.
Apologies, the input/output are all float and so is the shared memory. For correctness, the kernel definition and call are:

template <const unsigned short NUM_SAMPLES>
__global__ void our_kernel(float *array_in, float *array_out, size_t array_len)
    {
    extern volatile __shared__ float sdata[];
    // .....
    }

our_kernel < Num_Samples > <<< Num_Blocks, Num_Samples, (Num_Samples + 1) * sizeof(float) >>> (in, out, len);

I borrowed the volatile from the reduction code, which populates the shared memory exactly the way I need.
I will implement this concept and test it today. Thanks a lot for the suggestion.
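
For reference, the part of the reduction code that the volatile comes from is the warp-synchronous tail, roughly like this (a sketch from memory, not the verbatim sample; it assumes a block of at least 64 threads, and on recent architectures __syncwarp() or warp shuffles are the recommended replacement):

    if (tid < 32)
        {
        volatile float *smem = sdata;   // volatile stops the compiler from caching these values in registers
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
        }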

I have used a portion of the shared memory for a counter and the way I call the kernel is:

kernel_func <<< grid_Size, block_Size, block_Size * sizeof(float) + sizeof(int) >>> (PARAMETERS);

So I am requesting block_Size elements of size float plus 1 element of size int, the counter for the block.
The declaration inside the kernel function is:

extern __shared__ float sdata[];
int *scounter = (int *) &sdata[block_Size];

However, I am in doubt about if/how to account for bank conflicts when iterating the index of sdata, since there is now an extra 4 bytes in the bank but it is not an extra index of sdata.
Is there a bank conflict in this situation if I iterate over sdata as if these extra 4 bytes were not there, or do they have to be accounted for?
If so, how?

This organization shouldn’t introduce bank conflicts, per se. The pointer passed as the shared memory pointer will be the same as the sdata pointer in your example. Thereafter, subsequent 32-bit quantities will belong to separate banks.

Your statement “there is now an extra 4 bytes in the bank” makes no sense to me.

Starting with the pointer passed to the kernel code as the shared memory pointer, the first 4 bytes will belong to bank 0, the next 4 bytes will belong to bank 1, etc. until bank 31. Thereafter, the next 4 bytes will belong to bank 0, the next 4 bytes will belong to bank 1, etc.

So everything should be fine for sdata.

And with respect to scounter, I see no possibility for bank conflicts there. It only lives in a single bank.

Probably you need to learn more about banks and bank conflicts.
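
If it helps, the mapping I described can be written out explicitly (a minimal sketch, assuming the usual configuration of 32 banks that are 4 bytes wide):

    // bank of a 4-byte word, counting byte offsets from the start of the dynamic shared memory allocation
    __device__ unsigned bank_of(size_t byte_offset)
        {
        return (byte_offset / 4) % 32;
        }

So sdata[tid] falls in bank tid % 32, and with a block size of 256 the counter word at &sdata[block_Size] falls in bank 256 % 32 = 0.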

Good evening, Robert, and thanks again for your input.
This certainly helps clarify my understanding of shared memory, as I always think I've got it right and then come across a situation that proves me wrong.

I also noticed that the compiler complains when we reuse a symbol like sdata if it is declared differently in separate kernels, e.g. __shared__ float sdata[] in one kernel and __shared__ double sdata[] in another. I thought it went totally out of scope when the kernel finished…

This is due to the extern keyword. If you define the shared memory arrays statically, you won’t run into this.

I don’t think this is really specific to CUDA. I think if you had ordinary C functions with extern definitions like this, they also would clash. The extern keyword places the symbol at global scope, if I am not mistaken.
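
For example, a minimal sketch of both cases in one .cu file (the kernel names here are made up for illustration):

// these two clash: each extern declaration refers to the same symbol, so the types must agree
__global__ void kernel_a() { extern __shared__ float  sdata[]; sdata[0] = 0.0f; }
__global__ void kernel_b() { extern __shared__ double sdata[]; sdata[0] = 0.0;  } // compiler reports the conflict

// statically sized shared arrays are ordinary local declarations, so they do not clash
__global__ void kernel_c() { __shared__ float  sdata[256]; sdata[0] = 0.0f; }
__global__ void kernel_d() { __shared__ double sdata[256]; sdata[0] = 0.0;  }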

Thanks for the clarification!