Use shared memory in chunks

Greetings!

I'm currently facing an issue trying to optimize the following function:

#define BLOCKS 32U
#define THREAD_PER_BLOCK 1024U
typedef double2 Complex;

static __device__ inline Complex complex_round(Complex a) {
    Complex c;
    c.x = round(a.x);
    c.y = round(fabs(a.y));
    return c;
}

static __global__ void round_complex(Complex *a, int size, int dev) {
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads) {
        a[i] = complex_round(a[i]);
    }
    if (threadID == 0 && dev == 0) a[i].x -= 2.0;
}

I tried to move the data into shared memory, since I have a lot of global memory accesses, but my data is 568 KB while shared memory is only 48 KB on my GPU.

I was able to move it partly into shared memory (occupying as much as I could), but the speedup was not significant.
So my question is: is there any way to fit everything into shared memory? I think it should be possible, since as far as I know shared memory is allocated per block.

I wonder how this code compiles: "a[i].x -= 2.0;". The variable 'i' is not in scope at that point.

In any event, this is a streaming kernel that just updates a vector in global memory. Naturally, it will have lots of global memory accesses, and its performance will be limited by global memory throughput. As it’s simply streaming the data, there is no data reuse, so buffering in shared memory cannot speed up the code.

Thank you, njuffa, sir.

That was a simple typo.

But anyway, in the general case: imagine I do have data reuse in my function, but my data is much bigger than the available shared memory. How do I optimize that?

The usual approach is to break the work into tiles, so that a tile of data will fit in shared memory, and the tile offers reuse opportunity. Often, each threadblock will work on a separate tile. This assumes there is some data (reuse) locality in your program.

A category of algorithms that exhibits this behavior is stencil operations.

Refer to the slides starting at slide 46 here:

https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
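As a rough sketch of the tiling idea, here is a 1D 3-point mean stencil that stages a tile plus its halo in shared memory. This assumes the element count is a multiple of BLOCK_SIZE and the caller pads the input with RADIUS ghost cells on each side; the names and sizes are illustrative, not from the thread.

```cuda
#define RADIUS 1
#define BLOCK_SIZE 256

__global__ void stencil_1d(const double *in, double *out) {
    // One tile per threadblock, plus RADIUS halo cells on each side.
    __shared__ double tile[BLOCK_SIZE + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x + RADIUS; // global index (skips left ghost cells)
    int l = threadIdx.x + RADIUS;                           // index within the tile

    // Each thread loads its own element into shared memory...
    tile[l] = in[g];
    // ...and the first RADIUS threads also load the left and right halos.
    if (threadIdx.x < RADIUS) {
        tile[l - RADIUS]     = in[g - RADIUS];
        tile[l + BLOCK_SIZE] = in[g + BLOCK_SIZE];
    }
    __syncthreads();

    // Each input element is now read up to three times from shared
    // memory instead of three times from global memory.
    out[g - RADIUS] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0;
}
```

The reuse factor here is only 2*RADIUS+1, so the benefit grows with the stencil radius; for a pure streaming kernel like round_complex above (reuse factor 1), the same staging buys nothing.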

Robert_Crovella, thank you, sir, for the brilliant answer and provided materials!

You might be interested in using Xtream: it automates per-block shared-memory caching and lets you define a stencil.

Check out the shared memory example (3x3 mean filter) here:
https://gitlab.com/jipe4153/xtream/blob/master/doc/xtream_intro.pdf

Xtream is just a thin wrapper on top of CUDA, but it simplifies this type of operation.