Problems with using shared memory

xargon · September 13, 2009, 10:24pm

Hello,

My CUDA code runs 10 times slower when I use shared memory. However, I am not sure as to why. I have it setup so that each thread in the block just reads or writes to an exclusive memory in the array (in shared memory). Maybe I am not using it as it is meant to be. I am new to CUDA and this is my first kernel writing attempt.

So, in my CUDA kernel I have something like this:

[codebox]

#define NUM_BLOCKS 320

global void MyKernel()

{

shared int compareCode [NUM_BLOCKS];

compareCode [threadIdx.x] = someValue;

// Now later in the code, I have something like this:

if (compareCode[threadIdx.x] > currentCode]

{

   compareCode[threadIdx.x] = currentCode;

}

[/codebox]

This is basically the essense of the code. As you can see, my kernel allocates the array in shared memory and each thread should be writing and reading from its its own exclusive location (determined by threadIdx.x). This, runs very very slow. When I have something like:

[codebox]

global void MyKernel()

{

int compareCode = someValue;

if (compareCode > currentCode]

{

   compareCode = currentCode;

}

[/codebox]

This runs 10 times faster! Am I supposed to allocate shared memory some other way or are there some conflicts going on??! I fail to understand how though. Each thread should make a request for an exclusive index and no other thread should be asking for that index. Completely has me stumped.

Would really appreciate any help because right now I am using a lot of registers and this is preventing me from using the power of the GPU.

Many thanks,

xarg

If any NVIDIA developers read this, the visual profiler is so awesome! Thanks for that wonderful tool.

LSChien · September 14, 2009, 1:37am

Hello,

[codebox]

define NUM_BLOCKS 320

global void MyKernel()

{

shared int compareCode [NUM_BLOCKS];

compareCode [threadIdx.x] = someValue;

// Now later in the code, I have something like this:

if (compareCode[threadIdx.x] > currentCode]

{
   compareCode[threadIdx.x] = currentCode;
}

}

[/codebox]

what’s your execution configuration, 1-D thread block or 2-D thread block?

what does “NUM_BLOCKS” meas?

total number of blocks or
total number of threads in a block

CUDAkk · September 14, 2009, 3:13am

Hello,

My CUDA code runs 10 times slower when I use shared memory. However, I am not sure as to why. I have it setup so that each thread in the block just reads or writes to an exclusive memory in the array (in shared memory). Maybe I am not using it as it is meant to be. I am new to CUDA and this is my first kernel writing attempt.

So, in my CUDA kernel I have something like this:

[codebox]

define NUM_BLOCKS 320

global void MyKernel()

{

shared int compareCode [NUM_BLOCKS];

compareCode [threadIdx.x] = someValue;

// Now later in the code, I have something like this:

if (compareCode[threadIdx.x] > currentCode]

{
   compareCode[threadIdx.x] = currentCode;
}

}

[/codebox]

This is basically the essense of the code. As you can see, my kernel allocates the array in shared memory and each thread should be writing and reading from its its own exclusive location (determined by threadIdx.x). This, runs very very slow. When I have something like:

[codebox]

global void MyKernel()

{

int compareCode = someValue;

if (compareCode > currentCode]

{
   compareCode = currentCode;
}

}

[/codebox]

This runs 10 times faster! Am I supposed to allocate shared memory some other way or are there some conflicts going on??! I fail to understand how though. Each thread should make a request for an exclusive index and no other thread should be asking for that index. Completely has me stumped.

Would really appreciate any help because right now I am using a lot of registers and this is preventing me from using the power of the GPU.

Many thanks,

xarg

If any NVIDIA developers read this, the visual profiler is so awesome! Thanks for that wonderful tool.

See wrap serialize in the profiler that gives you information about bank conflict in your shared memory .

xargon · September 14, 2009, 5:11am

It is a 1-D thread block.

NUM_BLOCKS is also the total number of threads in a block.

It seems quite probable that there are bank conflicts going on. However, I must confess that there is very little I understand about bank conflicts and am hoping someone can help me what is going on what this scenario.

So, say I have something as follows:

[codebox]

define NUM_OF_THREADS_IN_BLOCK 320

global void MyKernel()

{

__shared__ float shared_mem[NUM_OF_THREADS_IN_BLOCK];

shared_mem[threadIdx.x] = 0.0;

for (int i = 0; i < 1000; ++i)

{

   shared_mem[threadIdx.x]++;

}

}

[/codebox]

So, in this trivial example, each thread block has 320 threads. My understanding is that since each thread is acessing a unique 4 byte value and there should be no bank conflicts. Obviously, I am quite wrong. So can someone explain with this simple scenario, what might be going wrong? I would be really grateful.

Thanks,

xarg

_Big_Mac · September 14, 2009, 10:22am

Well the code you’ve pasted above will compile to an empty kernel because the values from shared_mem aren’t written to global memory. This is dead code to the compiler, it doesn’t do anything persistent so it’s disregarded (rightfully so). I assume you write the results to memory somewhere that you just haven’t shown us?

There wouldn’t be any bank conflicts in this though. I have no idea why it would be so much slower.

xargon · September 14, 2009, 12:41pm

Yeah, I had written it like that to keep it compact and just highlight the important point:

So, originally it is something like this:

[codebox]

define NUM_THREADS_IN_BLOCK 320

global void MyKernel(float * result)

{

shared float track [NUM_THREADS_IN_BLOCK];

track[threadIdx.x] = 0.0f;

for (int i = 0;i < 1000; ++i)

{

   if (//SOME CONDITION)

   {

        track[threadIdx.x] = 1.0f;

   }

}

for (int i = 0; i < 1000; ++i)

{

  if (track[threadIdx.x] > SOME_VALUE) // This line is REALLY slow!   

  { // DO SOMETHING }

}

[/codebox]

The comparison operator in the second loop is REALLY slow. Drops speed by about 10 times.

cheers,

/xarg

Topic		Replies	Views
Newbie: __shared__, what am I doing wrong? shared memory seems to be slowing things down - why? CUDA Programming and Performance	2	1845	November 21, 2009
Shared memory bank conflicts? CUDA Programming and Performance	0	862	June 4, 2009
Accessing Shared memory: Unexpected timing result "Naive" access seems to be twice as fast a CUDA Programming and Performance	3	1228	February 27, 2012
beginner question regarding shared memory CUDA Programming and Performance	4	7008	November 16, 2009
Little questions about conflict bank CUDA Programming and Performance	7	7464	April 26, 2011
Shared memory writes really slow CUDA Programming and Performance	17	8893	November 3, 2010
Shared memory avoiding bank conflict less effective CUDA Programming and Performance	3	3766	May 6, 2010
convolution using shared memory slowdown instead of speedup... CUDA Programming and Performance	1	5619	March 11, 2010
shared memory without bank conflict slower than that with bank conflict CUDA Programming and Performance	2	944	November 28, 2019
Shared memory bank conflict CUDA Programming and Performance	2	1632	October 1, 2021

Problems with using shared memory

Related topics