Problems with using shared memory

Hello,

My CUDA code runs 10 times slower when I use shared memory, and I am not sure why. I have it set up so that each thread in the block reads or writes only an exclusive location in the array (in shared memory). Maybe I am not using it as it is meant to be used. I am new to CUDA and this is my first attempt at writing a kernel.

So, in my CUDA kernel I have something like this:

[codebox]
#define NUM_BLOCKS 320

__global__ void MyKernel()
{
    __shared__ int compareCode[NUM_BLOCKS];

    compareCode[threadIdx.x] = someValue;

    // Now later in the code, I have something like this:
    if (compareCode[threadIdx.x] > currentCode)
    {
        compareCode[threadIdx.x] = currentCode;
    }
}
[/codebox]

This is basically the essence of the code. As you can see, my kernel allocates the array in shared memory, and each thread should be writing to and reading from its own exclusive location (determined by threadIdx.x). This runs very, very slowly. When I have something like:

[codebox]
__global__ void MyKernel()
{
    int compareCode = someValue;

    if (compareCode > currentCode)
    {
        compareCode = currentCode;
    }
}
[/codebox]

This runs 10 times faster! Am I supposed to allocate shared memory some other way, or are there conflicts going on? I fail to understand how, though. Each thread requests an exclusive index, and no other thread should be asking for that index. This has me completely stumped.

Would really appreciate any help, because right now I am using a lot of registers and this is preventing me from using the full power of the GPU.

Many thanks,

xarg

If any NVIDIA developers read this, the visual profiler is so awesome! Thanks for that wonderful tool.

What's your execution configuration, a 1-D thread block or a 2-D thread block?

What does "NUM_BLOCKS" mean?

  1. total number of blocks or

  2. total number of threads in a block

Check warp serialize in the profiler; that gives you information about bank conflicts in your shared memory.

It is a 1-D thread block.

NUM_BLOCKS is the total number of threads in a block.

It seems quite probable that there are bank conflicts going on. However, I must confess that I understand very little about bank conflicts, and I am hoping someone can help me understand what is going on in this scenario.

So, say I have something as follows:

[codebox]
#define NUM_OF_THREADS_IN_BLOCK 320

__global__ void MyKernel()
{
    __shared__ float shared_mem[NUM_OF_THREADS_IN_BLOCK];

    shared_mem[threadIdx.x] = 0.0f;

    for (int i = 0; i < 1000; ++i)
    {
        shared_mem[threadIdx.x]++;
    }
}
[/codebox]

So, in this trivial example, each thread block has 320 threads. My understanding is that since each thread is accessing a unique 4-byte value, there should be no bank conflicts. Obviously, I am quite wrong somewhere. So can someone explain, with this simple scenario, what might be going wrong? I would be really grateful.

Thanks,

xarg

Well, the code you've pasted above will compile to an empty kernel, because the values from shared_mem are never written to global memory. To the compiler this is dead code: it does nothing persistent, so it is discarded (rightfully so). I assume you write the results to memory somewhere that you just haven't shown us?
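For the sake of concreteness, here is one way to make the trivial kernel "observable" so the compiler can't eliminate it (the `out` parameter is hypothetical, just to give the result somewhere to go):

```cuda
#define NUM_OF_THREADS_IN_BLOCK 320

// Same trivial kernel, but the shared-memory result is copied out to
// global memory, so the compiler cannot discard the loop as dead code.
__global__ void MyKernel(float *out)
{
    __shared__ float shared_mem[NUM_OF_THREADS_IN_BLOCK];

    shared_mem[threadIdx.x] = 0.0f;

    for (int i = 0; i < 1000; ++i)
    {
        shared_mem[threadIdx.x]++;
    }

    // This write makes the work reach global memory, so it survives.
    out[blockIdx.x * blockDim.x + threadIdx.x] = shared_mem[threadIdx.x];
}
```

Without that final store, timing the kernel mostly measures launch overhead, which can make the "fast" version look artificially good in comparisons.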

There wouldn't be any bank conflicts in this, though. I have no idea why it would be so much slower.

Yeah, I had written it like that to keep it compact and just highlight the important point.

So, originally it is something like this:

[codebox]
#define NUM_THREADS_IN_BLOCK 320

__global__ void MyKernel(float *result)
{
    __shared__ float track[NUM_THREADS_IN_BLOCK];

    track[threadIdx.x] = 0.0f;

    for (int i = 0; i < 1000; ++i)
    {
        if (/* SOME CONDITION */)
        {
            track[threadIdx.x] = 1.0f;
        }
    }

    for (int i = 0; i < 1000; ++i)
    {
        if (track[threadIdx.x] > SOME_VALUE)  // This line is REALLY slow!
        {
            /* DO SOMETHING */
        }
    }
}
[/codebox]

The comparison in the second loop is REALLY slow. It drops speed by about 10 times.

cheers,

/xarg