Memory optimization when all kernels are working with the same data

Hi everyone,

Consider a kernel that has a for loop of the following form.

__device__ void func(double *out, double *src)
{
    int i, ip;

    i = threadIdx.x + blockIdx.x * blockDim.x;

    for (ip = 0; ip < 10000; ip++) {
        out[i] = oper(i, ip) * src[ip] * constants;
    }
}

oper(i,ip) is some operation that depends on both i and ip. In any given
iteration, every thread therefore reads the same index of src, so all
threads are working on the same data. Now, since src is quite large, we
could stage it in shared memory, and that could improve the computational
speed. However, since all the parallel threads use the same index of src
in each iteration, I was wondering whether there is a better way to handle
the memory allocation of src than using shared memory.

Thank you very much for your time,

akj

You need 80,000 bytes (10,000 doubles) to hold all of src, so you could only put part of src in shared memory at a time, since the maximum shared memory size is 48 KB.
If you use shared memory, you’ll have to control the prefetching yourself and probably use a few __syncthreads(). Doing one large prefetch in a single go will leave the stream cores idle for quite some time, while doing many small prefetches requires a larger number of __syncthreads() calls, which has its own penalty.
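For concreteness, here is a minimal sketch of the tiled shared-memory approach. TILE and tile are names I’m making up, I’m assuming 10000 is a multiple of TILE, and oper and constants are exactly as in your post (loop body unchanged, including the plain assignment):

#define TILE 1000   // 1000 doubles = 8 KB of shared memory per block

__device__ void func(double *out, double *src)
{
    __shared__ double tile[TILE];
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int base, j;

    for (base = 0; base < 10000; base += TILE) {
        // cooperative prefetch: the threads of the block stripe the loads of one tile
        for (j = threadIdx.x; j < TILE; j += blockDim.x)
            tile[j] = src[base + j];
        __syncthreads();   // the tile must be complete before anyone uses it

        for (j = 0; j < TILE; j++)
            out[i] = oper(i, base + j) * tile[j] * constants;   // loop body as in your post

        __syncthreads();   // don't refill the tile while other threads still read it
    }
}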

If oper(i, ip) involves a rather long computation and has absolutely no divergence, you could try doing small prefetches without using __syncthreads(). This way an even distribution of the global memory loads over time could be guaranteed, and the stream cores can be kept busy most of the time as well.

You could consider using constant memory to store src. This way the hardware might generate some balanced global memory reads. However, if the uniform cache does not do any prefetching, there is the possibility that you may get a slowdown compared to the shared memory solution.
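A rough sketch of the constant-memory variant (CHUNK, c_src and chunkBase are names I’m introducing; constant memory is limited to 64 KB per module, so your 80,000-byte src would have to be staged in pieces, one kernel launch per piece; the loop body is kept exactly as in your post):

#define CHUNK 8000   // 8000 doubles = 64,000 bytes, fits in the 64 KB limit

__constant__ double c_src[CHUNK];

__global__ void func(double *out, int chunkBase)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int ip;

    for (ip = 0; ip < CHUNK; ip++)
        out[i] = oper(i, chunkBase + ip) * c_src[ip] * constants;
}

// host side, once per chunk:
// cudaMemcpyToSymbol(c_src, src + chunkBase, CHUNK * sizeof(double));
// func<<<blocks, threads>>>(d_out, chunkBase);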

Thank you for the reply. While I could implement the optimizations you suggested, the main point I would like to exploit is that all the threads are going to be using the same data at all times.
Also, there is one thing I forgot to mention about the previous code: it has a divergent condition, as below:

__device__ void func(double *out, double *src)
{
    int i, ip;

    i = threadIdx.x + blockIdx.x * blockDim.x;

    for (ip = 0; ip < N; ip++) {
        if (i != ip) {
            out[i] = oper(i, ip) * src[ip] * constants;
        }
    }
}

where N is greater than 48K.

I would be glad to get any suggestions on how I could exploit the fact that all threads will be working on the same data.

Thank you once again
akj

L1 would cache the data for all threads. If L1 doesn’t behave in some silly manner, that alone would already be a speed boost without any other optimisation. Constant memory is optimised for exactly the kind of memory access you are talking about. You just have to experiment and make sure you don’t get bogged down by some weird, unknown hardware behaviour.

Can’t you reorganize the problem to loop over i and have ip be determined by the thread number?

Hi hyqneuron,

Thank you for the reply. I can see your point: with Fermi, global memory loads are cached in L1. However, won’t it still be true that every time I access a different element of src, I incur a different global memory access? That is what I am looking to avoid. As you suggested earlier, prefetching could help, but if the number of variables I prefetch is high, that would leave the cores idle. I guess I will have to play around with the different configurations.

Thanks Tera. I could do that, but in what ways would it help, apart from removing the divergence in the loop? I did not want to do it because the number of loop iterations in the global function would become too high, plus I would not be exploiting the spatial locality of the src array in the kernel code.

The obvious advantage would be that you could load src[ip] only once (or probably the compiler already does that optimization for you). In case the loop gets too large, you could at least do some tiling (each block only runs part of the loop), as in the sketch below.
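Very roughly, something like this (just a sketch; I’m assuming you actually want to accumulate into out[i], and note that a double-precision atomicAdd() only exists natively on newer GPUs, so on Fermi it would have to be emulated with atomicCAS(), which is part of the problem I mention below):

__device__ void func(double *out, double *src)
{
    int ip = threadIdx.x + blockIdx.x * blockDim.x;   // each thread owns one ip
    double mySrc = src[ip];                           // read from global memory exactly once
    int i;

    for (i = 0; i < N; i++) {
        if (i != ip) {
            // many threads update the same out[i] in the same iteration, so
            // the accumulation needs an atomic (or a per-block reduction)
            atomicAdd(&out[i], oper(i, ip) * mySrc * constants);
        }
    }
}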

Of course that might just move the problem from the input to the output side. What is the output supposed to look like? The example code just overwrites the output for all but the largest ip value, which probably wasn’t intended.

Hi Tera,

Thanks for the reply and for pointing out the error in the code. In the example code, out[i] should be accumulated in each iteration. The correct code is below:

__device__ void func(double *out, double *src)
{
    int i, ip;

    i = threadIdx.x + blockIdx.x * blockDim.x;

    for (ip = 0; ip < N; ip++) {
        if (i != ip) {
            out[i] += oper(i, ip) * src[ip] * constants;
        }
    }
}

Tiling was an option, and I was therefore planning on using shared memory. However, that would imply checking for the divergent condition inside the kernel again. If only one iteration takes a different branch from the others, is the parallelization going to suffer a lot?

Have you used cuobjdump to check how many instructions are in your app, i.e. the cost of oper(x,y)? (For example, run cuobjdump -sass on your binary and count the instructions in the loop body.)

Also, what GPU are you targeting, a gaming card or the Tesla brand?

If you run your app on a gaming card and oper(x,y) is heavy on double-precision operations, then tiling may not improve the performance: gaming cards have much lower double-precision throughput, so the kernel would remain compute-bound no matter how the memory accesses are arranged.

Once you know how many instructions are in your app, you can estimate the performance and know the performance gap.

I am targeting the Tesla brand. My guess is that oper(x,y) will have a large number of instructions.

Another possible trick:

Eliminate the divergence: just do the write even if i is equal to ip (if the i == ip term must not be included, you can subtract it once after the loop). Prefetch a single element from src before you start oper for the current src element.

Like this:

__device__ void kernel(double *out, double *src)
{
    int i, ip;
    double currentsrc, nextsrc;

    i = threadIdx.x + blockIdx.x * blockDim.x;
    currentsrc = src[0];

    for (ip = 0; ip < N; ip++) {
        // L1 loads 128 bytes (16 doubles) at a time, so this costs nothing
        // for 15 out of 16 iterations. If oper is long enough, the load that
        // does occur would be completely hidden as well.
        // Remember to allocate 8 extra bytes for src so that reading src[N]
        // on the last iteration causes no memory access problem.
        nextsrc = src[ip + 1];

        out[i] += oper(i, ip) * currentsrc * constants;
        currentsrc = nextsrc;
    }
}

The above method generates rather balanced global read accesses. To make them even more balanced you would have to create some divergence between warps, though I believe that would be unnecessary in your case.

I think you could try putting src in constant memory. This way the compiler will certainly generate Load Uniform instructions, which are supposedly optimised for exactly the kind of uniform access taking place across all your threads. Considering that your kernel is probably compute-bound (oper is long), the memory part actually wouldn’t matter much.

You will want to use shared memory for prefetching only when you want to prefetch multiple elements in one go and do not have enough registers to store them. Do check the cuobjdump output to ensure that nextsrc doesn’t get spilled to local memory. Of course, when oper has sufficient ILP, you can reduce the number of threads per block and increase the number of registers per thread, though I doubt you will actually be able to do that with all the meddlesome optimisation of ptxas.
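One way to nudge the register allocation in that direction (just a sketch; the attribute only caps the block size and ptxas still decides the final allocation; you can also limit registers globally with nvcc's -maxrregcount flag):

// Capping the threads per block lets ptxas allocate more registers per thread.
__global__ void __launch_bounds__(128)   // at most 128 threads per block
func(double *out, double *src)
{
    /* kernel body as before */
}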