Performance unchanged by shared memory in 1D heat equation attempt

So as my first CUDA program, I thought I’d try a 1D heat equation solver and compare it to one I’d made in GLSL in the old-school GPGPU fashion.

So I now have a working version which renders the heat distribution to screen every so often (often enough to appear animated). It was a quick, unoptimised first attempt, and it worked. My plan was then to improve it by using shared and constant memory where applicable (assuming I knew what ‘applicable’ entailed).

My solution was as follows: store the temperature values in a linear block of global memory, take each diagonal of the tri-diagonal update matrix, and store each in its own array. To update/iterate, I call the kernel and it performs the necessary operation: x_new[i] = L[i] * x[i-1] + D[i] * x[i] + U[i] * x[i+1]; where L, D, U are the lower, main and upper diagonals respectively.
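For reference, a single update sweep on the CPU would look something like this (a minimal sketch; the names x_new and N are mine, and the boundary elements are assumed to be handled separately):

[codebox]// One explicit update sweep over the interior points
for (int i = 1; i < N - 1; ++i)
    x_new[i] = L[i] * x[i-1] + D[i] * x[i] + U[i] * x[i+1];[/codebox]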

As you can see, each thread accesses its neighbours’ values too, so I thought shared memory would be useful, so that only one (rather than three) global memory access per element would occur. However, this did not change anything performance-wise.

My kernel looks like this:

[codebox]__global__ static void HeatEqn2(float2* vbo, float* X, float* Y, float* L, float* D, float* U)
{
    __shared__ float line[BLOCK_SIZE+2];

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    line[threadIdx.x+1] = X[idx];

    if (threadIdx.x==1)           // thread 1 fetches the left halo element
        line[0] = X[idx-2];
    else if (threadIdx.x==0)      // thread 0 fetches the right halo element
        line[blockDim.x+1] = X[idx+blockDim.x];

    __syncthreads();

    // OLD WAY --> float val = L[idx] * X[idx-1]  +  D[idx]*X[idx]  +  U[idx]*X[idx+1];
    float val = L[idx] * line[threadIdx.x]  +  D[idx]*line[threadIdx.x+1]  +  U[idx]*line[threadIdx.x+2];

    Y[idx] = val;

    vbo[idx].y = val*300;
}[/codebox]

The shared block has two extra elements to account for the out-of-block accesses. vbo is the vertex buffer used for rendering. In the ‘unoptimised’ (but equally performing) kernel, I had no shared line array, and instead read directly from X.

Can anyone explain to me why this does not yield a significant improvement?

In addition to this, since L, D, U do not change, I thought I should place them in on-chip constant memory for faster access, but that gave no improvement either.

My block size is 256x1x1 and for now the grid size is just 4x1.

Thanks in advance

The global memory loads aren’t going to be coalesced using that strategy, so there is no advantage to using shared memory.

Could you please explain what global memory loads & stores are?

global memory load = reading from global memory
global memory store = writing to global memory

You’ll have to excuse my ignorance, but I’m not too sure I understand why it wouldn’t be coalesced. I’ve spent the day trying to work it out, and even saw an NVIDIA finite difference example (in a slide) nearly identical to mine. Are you saying the accesses to the X array won’t be coalesced? Doesn’t each thread (except for 0 and 1) access exactly one global memory element, meaning that chunk of memory can be fetched at once, plus the two additional boundary elements’ segments?

I obviously don’t get it (seeing that my code doesn’t speed up), so if it’s not too much of a hassle, would you mind elaborating?

thanks again :)

It’s your X[idx-2] access that is not coalesced. On compute 1.0/1.1 hardware, you pay a massive performance penalty for this shifted read. On compute 1.3 hardware, the penalty isn’t so bad, but it is still there. Read the newly released “Best Practices Guide” for a very good discussion of non-coalesced reads due to offsets: http://forums.nvidia.com/index.php?showtopic=101432

Your vbo[idx].y write is also going to be uncoalesced, as it is essentially writing to an array with a stride of 2. Again, the effect of stride on memory performance is very nicely discussed in the Best Practices Guide.

MisterAnderson42 has covered most of it. The other aspect to coalescing memory access is that the threads within a half warp (16 threads) must load and store from/to a 16-word-aligned block of global memory (16 floats, or 64 bytes, in this case). So memory access with anything other than strides of 1 within a half warp, or 16 within blocks, is going to be slow.
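To illustrate the difference (my own sketch of the two access patterns, not code from this thread):

[codebox]int base = blockDim.x * blockIdx.x;   // assume base is 16-word aligned

// Coalesced: thread k of a half warp reads word k of a 64-byte segment,
// so all 16 reads are serviced together
float a = X[base + threadIdx.x];

// Shifted by one: the half warp straddles a segment boundary. On compute
// 1.0/1.1 this serializes into 16 separate transactions; on compute 1.3
// it still costs an extra transaction
float b = X[base + threadIdx.x - 1];[/codebox]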

Thanks guys. I get what you’re saying. My device has compute capability 1.3 (GTX 295), so I may not be seeing as great a performance problem, but I still need to learn how to improve it. I’m getting quite frustrated with myself that I’m taking so long to grasp many of the memory coalescing concepts, banks, etc.

As I said in my last post, I saw a similar problem in an NVIDIA slide (from the CUDA webinars) where they used a kernel nearly identical to mine to solve a finite difference problem (except their halos were 3 elements wide, not 1). I wish I could find a more optimised solution.

At the moment I’m running the kernel ~170 times per OpenGL draw, and these iterations total ~2.05ms. So this is the time I’m trying to decrease. (Only the final kernel launch in the set writes to the vbo, so the write is pretty negligible for now.)

I’m assuming a finite difference method is a common and fundamental technique to have implemented on the GPU, so surely there are much more optimal ways to do this (especially in 1D). Do you guys have any links directly related to 1D FD methods, or could you explain the thought process you might adopt if you tried to improve my kernel?

I’ve removed the shared memory parts of the code so that the kernel is now plainly:

[codebox]__global__ static void HeatEqn(float* X, float* Y, float* L, float* D, float* U)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    Y[idx] = L[idx] * X[idx-1]  +  D[idx]*X[idx]  +  U[idx]*X[idx+1];
}[/codebox]

I did read that Best Practices Guide and will give it another look later; I’m just trying to understand the step-by-step activities that a warp will undergo during execution. For example:

  • Will the reads from L, U, D be coalesced and read in chunks?

  • Will the read from X entail one chunked read (for all 1-to-1 mapped threads) plus two extra chunk reads (for the halo, ‘out-of-bounds’ references)?

I really apologise for all the (probably) trivial questions, but resources are limited.

As always, any help will be SUPER appreciated.

Thanks

rewolf

Why aren’t you just hardcoding the L, D, and U values? 1, -2, and 1, I imagine? Just put them directly in the code. Do you have a non-uniform mesh? Even so, just store the spacing of the mesh, not each diagonal separately. That is just wasting memory space and bandwidth.

Also, edited to add: constant memory likely isn’t showing any performance increase because each thread will be accessing a different element of the constant memory. Constant memory is fast when every thread accesses the same element at the same time.

Edited to further add: if you really want flexibility in the stencil coefficients and maximum speed, generate the kernels at runtime and JIT compile them.
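To make the hardcoding suggestion concrete, here is a sketch of what the kernel might reduce to (my own version, assuming the standard explicit scheme with r = alpha*dt/dx^2, and ignoring boundary handling as the original does):

[codebox]// Hardcoded 1/-2/1 stencil: no L/D/U arrays, so three fewer global loads per point
__global__ static void HeatEqnHardcoded(float* X, float* Y, float r)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    Y[idx] = r * X[idx-1] + (1.0f - 2.0f * r) * X[idx] + r * X[idx+1];
}[/codebox]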

Yeah, hardcoding is definitely a good idea - I was just trying to make the solution overly general (in case the stencils weren’t the same for each element, as for nonlinear equations, say), but for heat equations like this it’s definitely overkill. Thanks.

These values will be different on the boundaries though, but I assume a conditional/branch in that case won’t be too dramatic, given that only two threads will be affected? Also, if I wanted these values to be specified dynamically, would this be a worthy use of constant memory, seeing that (nearly) every thread would access the same element and thus the caching would speed it up - though I could also just pass them as kernel parameters.
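For what it’s worth, the constant-memory route is only a few lines (a sketch, using the LL/DD/UU names that appear later in the thread; the coefficient values here are just placeholders):

[codebox]__constant__ float LL, DD, UU;   // every thread reads the same value, so the broadcast is fast

// Host side, once at startup (placeholder values)
float h_L = 0.25f, h_D = 0.5f, h_U = 0.25f;
cudaMemcpyToSymbol(LL, &h_L, sizeof(float));
cudaMemcpyToSymbol(DD, &h_D, sizeof(float));
cudaMemcpyToSymbol(UU, &h_U, sizeof(float));[/codebox]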

The reason I understood constant memory incorrectly was because I thought the Programming Guide said it was ‘on-chip’, whereas it actually says the constant cache is on-chip. My bad.

I think I’ll avoid JIT kernel compilation for now - I’m trying to avoid the Driver API.

As for improving the misaligned reads from the main vector, there doesn’t seem to be any shared memory trick that can fix it (at least in 1D), so I’ll just live with it :P

Thanks again to all that helped!

rewolf

Do try binding X to a texture and reading it with tex1Dfetch. That would be the easiest way to improve your simple kernel.
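For reference, the texture path is only a few extra lines (a sketch using the compute 1.x texture reference API; d_X and N are my own names for the device array and its length):

[codebox]// File-scope texture reference bound to the X array
texture<float, 1, cudaReadModeElementType> texX;

// Host side, after allocating d_X
cudaBindTexture(0, texX, d_X, N * sizeof(float));

// In the kernel: the texture cache removes the coalescing constraint
// on the shifted idx-1 / idx+1 reads
float val = L[idx] * tex1Dfetch(texX, idx-1)
          + D[idx] * tex1Dfetch(texX, idx)
          + U[idx] * tex1Dfetch(texX, idx+1);[/codebox]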

There is another issue at play here, though:

How many threads are you running?

You said:

“At the moment I’m running the kernel ~170 times per opengl draw, and these iterations total to ~2.05ms.”

That works out to a mere 12 microseconds per call, which indicates to me that your kernel launches are very small and the time is dominated by the kernel launch overhead.
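One way to confirm that is to time the whole batch of launches with CUDA events (a sketch; the launch configuration matches the 256-thread, 4-block setup mentioned above, and the X/Y pointer swap between iterations is omitted):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 170; ++i)
    HeatEqn<<<4, 256>>>(X, Y, L, D, U);   // pointer swap omitted for brevity
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);   // ~2ms here means ~12us per launch[/codebox]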

You can easily get the reads and writes coalesced for something like the finite difference method you are using. The concept is the same in any number of dimensions - after all, memory is linear, and contiguous loads and stores can only be achieved along a single dimension. To get it done, you only need three things:

[list=1]

[*]Pad the storage and use a block size such that everything works on 16-word-aligned boundaries

[*]Add your ghost cell/boundary value data to your main global memory array, and use a separate kernel to do the assignment before a differencing run (and your idea of using constant memory to hold boundary data or coefficients is a good one)

[*]Use a computation scheme where blocks overlap, with a number of “non-participatory” threads per block whose sole function is to load from global to shared memory for coalescing purposes

[/list]

The block overlap gives a scheme like this (for the minimal 32 threads per block; the two digit rows give the tens and units digits of each global element index):

[codebox]Block 0
00000000001111111111222222222233
01234567890123456789012345678901
xccccccccccccccccccccccccccccccx

Block 1
11112222222222333333333344444444
67890123456789012345678901234567
xxxxxxxxxxxxxxxccccccccccccccccx

Block 2
33333333444444444455555555556666
23456789012345678901234567890123
xxxxxxxxxxxxxxxccccccccccccccccx[/codebox]

where “c” is a load from global to shared memory, a stencil calculation and a store to global memory, and “x” is just a load. Even though only half the threads per block actually do calculations, it is still a lot faster than the naive alternative, because global memory loads carry such a high latency penalty when they are not coalesced.

Wow guys, thanks a lot! Gonna try your ideas now - really appreciate the insight.

At the moment just 1024 threads, which I would obviously increase for greater resolution later. My block size is 128 at the moment; I’ve been changing it around a bit, and there’s no particular reason I settled on 128. So that means I need 8 blocks in my grid. AFAIK there are 30 multiprocessors, so there’s plenty of room to increase the grid size without impacting performance, I think.

Thanks for pointing out the launch overhead, I will try performing multiple updates per kernel.

K, I’m off to try these things - thanks again!

So after MisterAnderson42 mentioned that the launch overhead contributes a lot to the low performance, I decided to see what would happen if I iterated the equation from within the kernel rather than calling the kernel repeatedly. I couldn’t believe how fast I got it working (even with uncoalesced memory reads) - it was simulating 3s of virtual time in just 1s of real time - I was very happy.

But then I tried cranking up the number of mesh points (which was originally 1024). When I tried 4096, everything failed, and I realised that the iterations can’t be done within the kernel, because each iteration depends on data being updated concurrently in other blocks, which obviously doesn’t work for big grids.

avidday, your block-overlapping idea was my next attempt. Initially I didn’t know what all the numbers in your example meant, haha, but after trying to devise the scheme you explained, I slapped my forehead yet again. My implementation of the scheme is shown below:

[codebox]__global__ static void HeatEqn2(float2* vbo, float* X, float* Y)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    __shared__ float block[BLOCK_SIZE];
    float tmp;

    if (blockIdx.x == 0){
        block[threadIdx.x] = X[threadIdx.x];
        __syncthreads();
        if (idx > 0 && idx < BLOCK_SIZE-1){
            tmp = LL*block[idx-1] + DD*block[idx] + UU*block[idx+1];
            Y[threadIdx.x] = tmp;
            vbo[threadIdx.x].y = tmp * 300;
        }
    }else{
        // later blocks start half a block back, so the first 15 threads
        // only load the overlap region and do no calculation
        int overlapIdx = blockIdx.x * HALF_BLOCK_SIZE + threadIdx.x;
        block[threadIdx.x] = X[overlapIdx];
        __syncthreads();
        if (threadIdx.x > 14 && threadIdx.x < BLOCK_SIZE-1){
            tmp = LL*block[threadIdx.x-1] + DD*block[threadIdx.x] + UU*block[threadIdx.x+1];
            Y[overlapIdx] = tmp;
            vbo[overlapIdx].y = tmp*300;
        }
    }
}[/codebox]

where each block overlaps the previous one by 16 elements (actually I haven’t yet tried other block sizes - just 32). LL, DD, UU are in constant memory; X, Y, and the vertex buffer vbo are initialised with the initial distribution before any kernels run.

I think I’m finally understanding the coalescing stuff, though I’m not sure whether my code is any good. Also, I’m bothered that I have to repeatedly call this small kernel and face the penalties of the launch overheads, as this runs much slower than my (flawed) internally iterating kernel. But at least it’s faster than my original uncoalesced kernel :P

Thanks again so much for all the tips; it has seriously helped my understanding of the memory concepts and kernel design (as basic as the kernels are).

Gonna attempt a tex1Dfetch setup now, as MisterAnderson42 suggested.

rewolf

That is very roughly the idea, but your implementation is really suboptimal in a number of ways (and, if I am not mistaken, contains a pretty major error). Assuming you have suitably padded your global memory arrays, it can be as simple as something like this:

[codebox]#define OFFSET (15)

__global__ static void HeatEqn2(float2* vbo, float* X, float* Y)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    __shared__ float block[BLOCK_SIZE];
    float tmp;

    block[threadIdx.x] = X[idx];
    __syncthreads();

    // interior blocks skip their first OFFSET threads (the overlap region);
    // block 0 only skips the global boundary element
    int startidx = (blockIdx.x > 0) ? OFFSET : 1;
    int endidx = blockDim.x - 2;

    if ((threadIdx.x >= startidx) && (threadIdx.x <= endidx)) {
        tmp = LL*block[threadIdx.x-1] + DD*block[threadIdx.x] + UU*block[threadIdx.x+1];
        Y[idx] = tmp;
        vbo[idx].y = tmp * 300.0f;
    }
}[/codebox]

That coalesces the loads from global memory, but the float2 store will not be coalesced. Ideally, you should load the float2 values into shared memory, have the calculating threads update the value in shared memory, and then have all threads store back. A half warp writing float2 values in the correct pattern should coalesce the writes into 128-byte segments.
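That staging pattern might look roughly like this inside the kernel above (my own sketch, reusing its idx, startidx, endidx and tmp):

[codebox]__shared__ float2 vbuf[BLOCK_SIZE];

vbuf[threadIdx.x] = vbo[idx];              // coalesced float2 load by all threads
__syncthreads();
if ((threadIdx.x >= startidx) && (threadIdx.x <= endidx))
    vbuf[threadIdx.x].y = tmp * 300.0f;    // only the calculating threads update
__syncthreads();
vbo[idx] = vbuf[threadIdx.x];              // coalesced float2 store by all threads[/codebox]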

As for block size, you obviously want to use as many threads as shared memory allows. 32 threads per block is far too few to get enough latency hiding to come close to full computational efficiency, but it is the minimum required to make the scheme work. But do bear in mind that 1D calculations are really the worst performing unless your grid is very large.

Hey, thanks again.
Apart from not being as elegant as your solution, mine seems to do the same thing, so I can’t quite see how it is very suboptimal. I’m assuming the error you were referring to was to do with my calculation of overlapIdx? I’ve fixed that now, so it works properly for any block size and overlap. I’ll give the float2 coalescing a shot. I’m using a block size of 128 now, which yields an occupancy of 1.0 according to the profiler.
Thanks for pretty much teaching me, haha.

rewolf

Edit: I tried the write coalescing, but it seemed to slow it down a bit.