My CUDA code runs 10 times slower when I use shared memory, and I am not sure why. I have it set up so that each thread in the block reads or writes only its own exclusive slot in the array (in shared memory). Maybe I am not using it as it is meant to be used. I am new to CUDA and this is my first attempt at writing a kernel.
[codebox]
__global__ void MyKernel()
{
    // Each thread gets its own slot (NUM_BLOCKS is the number of threads per block)
    __shared__ int compareCode[NUM_BLOCKS];

    // Now later in the code, I have something like this:
    if (compareCode[threadIdx.x] > currentCode)
    {
        compareCode[threadIdx.x] = currentCode;
    }
}
[/codebox]
This is basically the essence of the code. As you can see, my kernel allocates the array in shared memory, and each thread should be reading and writing only its own exclusive location (determined by threadIdx.x). This runs very, very slowly. When I have something like:
[codebox]
__global__ void MyKernel()
{
    // Same comparison, but with a per-thread local variable instead of shared memory
    int compareCode = someValue;
    if (compareCode > currentCode)
    {
        compareCode = currentCode;
    }
}
[/codebox]
This runs 10 times faster! Am I supposed to allocate shared memory some other way, or are there conflicts going on? I fail to understand how there could be: each thread requests an exclusive index, and no other thread should be asking for that index. It completely has me stumped.
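For reference, the only other way I know to declare the array is the dynamic form, where the size comes from the launch configuration (sketched below in case it is relevant):
[codebox]
// Dynamic shared memory: the size is supplied at launch time through the
// third <<<...>>> launch parameter instead of being fixed at compile time.
__global__ void MyKernel()
{
    extern __shared__ int compareCode[];
    // ... same body as before ...
}

// Launched as, e.g.:
// MyKernel<<<numBlocks, NUM_BLOCKS, NUM_BLOCKS * sizeof(int)>>>();
[/codebox]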
Would really appreciate any help because right now I am using a lot of registers and this is preventing me from using the power of the GPU.
Many thanks,
xarg
If any NVIDIA developers read this, the visual profiler is so awesome! Thanks for that wonderful tool.
NUM_BLOCKS is also the total number of threads in a block.
It seems quite probable that there are bank conflicts going on. However, I must confess that I understand very little about bank conflicts, and I am hoping someone can help me figure out what is going on in this scenario.
So, in this trivial example, each thread block has 320 threads. My understanding is that since each thread is accessing a unique 4-byte value, there should be no bank conflicts. Obviously, I am quite wrong. So can someone explain, with this simple scenario, what might be going wrong? I would be really grateful.
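Here is how I understand the mapping, spelled out in code (a hypothetical kernel, assuming 16 banks serviced per half-warp as the programming guide describes for current hardware; stride-1 should be conflict-free, while a stride-2 pattern should cause two-way conflicts):
[codebox]
// Illustration of my understanding: with 16 banks, a 4-byte word at
// index i lives in bank i % 16.
__global__ void BankDemo(int *out)
{
    __shared__ int buf[320];
    buf[threadIdx.x] = threadIdx.x;   // stride-1 write: bank = threadIdx.x % 16
    __syncthreads();

    // Stride-1 read: each thread of a half-warp hits a distinct bank.
    int conflictFree = buf[threadIdx.x];

    // Stride-2 read: threads t and t+8 of a half-warp hit the same bank,
    // so this should serialize into two transactions.
    int twoWayConflict = buf[(threadIdx.x * 2) % 320];

    // Write out so the compiler can't eliminate the reads.
    out[blockIdx.x * blockDim.x + threadIdx.x] = conflictFree + twoWayConflict;
}
[/codebox]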
Well, the code you've pasted above will compile to an empty kernel, because the values from shared memory aren't written to global memory. This is dead code to the compiler: it doesn't do anything persistent, so it's discarded (rightfully so). I assume you write the results to memory somewhere and just haven't shown us?
There wouldn’t be any bank conflicts in this though. I have no idea why it would be so much slower.
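To be concrete, here is a minimal sketch of what I mean by writing the result out (the out buffer and the value parameters are placeholders, not from your original code):
[codebox]
#define NUM_BLOCKS 320  // threads per block, per the earlier posts

// Storing the shared-memory result to global memory gives the kernel an
// observable side effect, so the compiler cannot discard the work.
__global__ void MyKernel(int *out, int someValue, int currentCode)
{
    __shared__ int compareCode[NUM_BLOCKS];
    compareCode[threadIdx.x] = someValue;

    if (compareCode[threadIdx.x] > currentCode)
    {
        compareCode[threadIdx.x] = currentCode;
    }

    // Without this store, the whole kernel is dead code.
    out[blockIdx.x * blockDim.x + threadIdx.x] = compareCode[threadIdx.x];
}
[/codebox]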