How would you do this?


I want to iterate a simple four-point update, new[i][j]=(old[i+1][j]+old[i-1][j]+old[i][j+1]+old[i][j-1]-edge[i][j])/4, over a 2D array.


I did it using global memory (not worrying about coalescing or anything), but it is not much faster than a serial CPU version. Obtaining maximum performance is my goal.

How would you attempt this?

I am considering using texture memory to improve performance, would this be a good choice?

Thank you!

If you want to read it, here is a fuller description and code snippet:

I am writing a program to do the following:

  1. I have a 2D (N x M) array ‘buf_h’

  2. I copy this to ‘buf’ on device

  3. I allocate (N+2) x (M+2) size blocks of memory on device called ‘old’, ‘newa’, and ‘edge’

  4. ‘edge’ and ‘old’ start out as ‘buf’ surrounded by a one-cell border of 0s.

  5. The following is iterated ITS times for 0<i<N+1, 0<j<M+1:

    5a. newa[i][j]=(old[i+1][j]+old[i-1][j]+old[i][j+1]+old[i][j-1]-edge[i][j])/4

    5b. old[i][j]=newa[i][j]

  6. ‘buf’ is copied to ‘buf_h’
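
For clarity, here is what steps 1-6 amount to as a plain-C serial sketch (row-major (N+2) x (M+2) padded arrays with a fixed border of 0s; the function name and layout are my assumptions):

```c
#include <string.h>

/* Serial reference for steps 5a/5b: only the interior is updated,
   the one-cell border stays 0 throughout. */
void iterate(float *old, float *newa, const float *edge,
             int N, int M, int ITS)
{
    int W = M + 2;                          /* padded row width */
    for (int it = 0; it < ITS; it++) {
        for (int i = 1; i <= N; i++)        /* step 5a */
            for (int j = 1; j <= M; j++)
                newa[i*W + j] = (old[(i+1)*W + j] + old[(i-1)*W + j] +
                                 old[i*W + j+1] + old[i*W + j-1] -
                                 edge[i*W + j]) / 4.0f;
        for (int i = 1; i <= N; i++)        /* step 5b */
            memcpy(&old[i*W + 1], &newa[i*W + 1], M * sizeof(float));
    }
}
```

This is the version any GPU implementation can be checked against.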




	cudaMalloc((void **) &buf, sizeof(float)*N*M);

	cudaMalloc((void **) &old, sizeof(float)*(N+2)*(M+2));

	cudaMalloc((void **) &newa, sizeof(float)*(N+2)*(M+2));

	cudaMalloc((void **) &edge, sizeof(float)*(N+2)*(M+2));


	cudaMemcpy(buf, buf_h, sizeof(float)*N*M, cudaMemcpyHostToDevice);

	cudaMemset(edge, 0, sizeof(float)*(N+2)*(M+2));


	load<<<grid, block>>>(buf, edge, old);



	/* launched inside a loop that runs ITS times (loop trimmed here) */
	edgeker<<<grid, block>>>(old, edge, newa);



	finish<<<grid, block>>>(buf, old);


	cudaMemcpy(buf_h, buf, sizeof(float)*N*M, cudaMemcpyDeviceToHost);



__global__ void load(float *buf, float *edge, float *old){

	int el=(threadIdx.x+blockIdx.y*M), corr=M+3+2*blockIdx.y;

	/* copy buf into the interior of the padded edge and old arrays */
	*(edge + el+corr)=*(buf + el);

	*(old + el+corr)=*(buf + el);
}


__global__ void edgeker(float *old, float *edge, float *newa){

	float *newacorr=newa + threadIdx.x+blockIdx.y*M+M+3+2*blockIdx.y;

	float *oldcorr=old + threadIdx.x+blockIdx.y*M+M+3+2*blockIdx.y;

	float *edgecorr=edge + threadIdx.x+blockIdx.y*M+M+3+2*blockIdx.y;

	*(newacorr)=(*(oldcorr + 1) + *(oldcorr - 1) + *(oldcorr + (M+2)) + *(oldcorr - (M+2)) - *(edgecorr))/4;
}



__global__ void finish(float *buf, float *old){

	int el=(threadIdx.x+blockIdx.y*M), corr=M+3+2*blockIdx.y;

	/* copy the interior of ‘old’ back into ‘buf’ (step 6) */
	*(buf + el)=*(old + el+corr);
}



Uh, how is that supposed to work at all? Your other threads/blocks still need the old value of oldcorr.

Also you absolutely should use shared memory. E.g. let each block calculate the results for a 16x16 area. Loading the input values (an 18x18 area) into shared memory with minimal uncoalesced reads could be a bit tricky but should work well enough (hardly worse than what you have now - though it would be much simpler if you make sure that N+2 is always divisible by 16).

Probably you should calculate at least two steps of the algorithm in one kernel call. While it means you will duplicate some calculation for the border elements in the neighbouring blocks, halving the number of global memory accesses should make up for it by far.
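
To illustrate the trade-off in serial C (helper names are mine): the step-2 value at (i,j) can be rebuilt entirely from step-0 data within a 2-cell radius, at the price of recomputing the four step-1 neighbours that adjacent blocks also compute.

```c
/* Step-1 value at (i,j); W is the padded row width.
   Valid only for interior cells (1 <= i <= N, 1 <= j <= M). */
static float step1(const float *old, const float *edge, int W, int i, int j)
{
    return (old[(i+1)*W + j] + old[(i-1)*W + j] +
            old[i*W + j+1] + old[i*W + j-1] - edge[i*W + j]) / 4.0f;
}

/* Step-2 value at (i,j), computed without ever storing step-1 results
   to global memory; needs (i,j) to have four interior neighbours. */
float step2(const float *old, const float *edge, int W, int i, int j)
{
    return (step1(old, edge, W, i+1, j) + step1(old, edge, W, i-1, j) +
            step1(old, edge, W, i, j+1) + step1(old, edge, W, i, j-1) -
            edge[i*W + j]) / 4.0f;
}
```

Each block then loads a halo two cells thick instead of one, but touches global memory half as often.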

And also get rid of the cudaThreadSynchronize calls after each kernel call, they will only slow things down and also make it impossible to interleave CPU with GPU calculation.

Woops! I don’t know how I missed that… Somehow the output came out OK though… I was just lucky perhaps.

Yes, I have considered using shared memory also… So you think that would be superior to using texture memory, then.

I presumed that the call to cudaThreadSynchronize() would be necessary, especially inside the loop, to make sure that all of the threads in the previous kernel had finished before another kernel is launched. I also assumed that if all the threads had completed (and thus it would have been safe to omit cudaThreadSynchronize()) then it wouldn’t waste much time as it doesn’t have anything to do.

Thanks for the ideas.

cudaThreadSynchronize is only needed just before performing a timing measurement on the host. CUDA always makes sure that previous kernel calls are complete before starting new ones.

Shared memory would probably give you the most performance, but it will also result in the most complex code. A few small changes to a few lines to use array memory and tex2D to read the values will probably give you a factor of 10-20 performance boost. Not coalescing reads really does hurt you that much in performance, in my experience.

Make sure to coalesce the memory writes, too.

To see how effectively your code is utilizing the device, count up the number of global memory reads + writes and divide by the kernel execution time to get GiB/s of global memory bandwidth. At peak performance, an 8800 GTX can reach 70 GiB/s.
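
The suggested arithmetic, spelled out as a small C helper (names are mine):

```c
/* Effective global-memory bandwidth of a kernel: (reads + writes) times
   the element size, divided by execution time, in GiB/s. */
double kernel_gib_per_s(double n_reads, double n_writes,
                        double bytes_per_elem, double seconds)
{
    return (n_reads + n_writes) * bytes_per_elem
           / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Compare the number this gives against the card's theoretical peak to see how far off you are.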

Your real performance win will come from rearranging your filter kernel’s compute.
You’re currently reading 9 values from memory for every pixel update.
You can make it 1 if you sweep and reuse previous reads.

You should store three rows of your “old” image in shared memory and logroll them… when you’re done with the oldest row, you load a new row into that shared RAM and it’s now your r+2 row.

Your rough algorithm would be:
Initialize by loading the first two full rows of source pixels into shared memory arrays A and B.

  1. Read a full row into shared memory array C. This would be a coalesced read and write.
  2. You now have three rows, A B C all in shared memory. A is the top row, B is the middle row, C is the bottom row.
  3. Compute your effect for each pixel of the middle row, using fast shared memory reads. These memory reads are all coalesced as well!
  4. Write your answer into your output buffer in device RAM. It’s even OK to overwrite your input array; you won’t affect your compute, since all the local values are cached in shared memory.
  5. swap pointers… just pointers, not data. So A=B and B=C.
  6. go back to 1, filling your new C array.

This algorithm reads and writes each pixel from device memory exactly ONCE, and those are coalesced.
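
A serial sketch of that sweep (plain C standing in for one block; the three row buffers play the role of shared memory, and only pointers are swapped):

```c
#include <string.h>

/* One sweep down the image: W is the padded row width, N the number of
   interior rows.  'out' may be a separate buffer, or may even alias the
   input, since row i is only written after it has been consumed. */
void sweep(const float *old, const float *edge, float *out, int N, int W)
{
    float rowA[W], rowB[W], rowC[W];        /* the "shared memory" rows */
    float *A = rowA, *B = rowB, *C = rowC;
    memcpy(A, &old[0], W * sizeof(float));  /* init: first two rows */
    memcpy(B, &old[W], W * sizeof(float));
    for (int i = 1; i <= N; i++) {
        memcpy(C, &old[(i+1)*W], W * sizeof(float)); /* 1. read row i+1 */
        for (int j = 1; j <= W - 2; j++)             /* 3. compute row i */
            out[i*W + j] = (C[j] + A[j] + B[j+1] + B[j-1]
                            - edge[i*W + j]) / 4.0f; /* 4. write answer  */
        float *t = A; A = B; B = C; C = t;           /* 5. swap pointers */
    }
}
```

In the CUDA version the inner `j` loop becomes the threads of the block and the `memcpy`s become coalesced global loads.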

There’s one more complication, how to deal with multiblocks, but that can be done by splitting your image into areas that are filtered independently and in parallel. Here you DO have to worry about one block changing a pixel that’s used by another at the edge, though maybe you could avoid that completely by using a temporary device buffer and flip-flopping your reads and writes each iteration. [IE image A is filtered and output to image B, then image B is filtered and output to image A.]

Finally, if this is iterated more than twice, your design is still wrong. This is a convolution, and instead of repeating a small one pixel convolution 10 times or 100 or whatever, you can do it once with a larger 10 pixel kernel… and that’d still use only one read and write from device memory as long as your kernel was small enough to fit all the temp storage in shared memory.
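
For a pure convolution (the update here also subtracts the edge term, so this is only an illustration of the principle), iterating a small kernel really is the same as one pass with its self-convolution, away from the borders. A 1D sketch:

```c
/* 1D convolution with zero padding at the array ends. */
void conv1d(const float *in, float *out, int n, const float *k, int klen)
{
    int r = klen / 2;                       /* kernel radius */
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
        for (int t = -r; t <= r; t++)
            if (i + t >= 0 && i + t < n)    /* zero padding */
                s += k[t + r] * in[i + t];
        out[i] = s;
    }
}
```

Two passes of the 3-tap blur {1,2,1}/4 match a single pass of its self-convolution {1,4,6,4,1}/16 wherever the support doesn't touch the border.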

Thank you all very much for the useful comments and suggestions.

I have been thinking about how to implement SPWorley’s suggestion, but I do not understand the explanation of how to handle multiblocks - how do I guarantee (without a lot of overhead) that the value I obtain from the ‘halo swap’ with neighbouring blocks is the updated value? The only way I know of is to have the iteration loop outside the kernel, but then I will lose a lot of speed as values will need to be re-read into shared memory again at each iteration.

Well I have done everything I know of to optimize the program (‘old’ and ‘edge’ are 2D textures, and the relevant section of ‘old’ is copied to shared memory before it is used; ‘newa’ is a cudaArray. I also tried implementing SPWorley’s idea of swapping pointers to shared memory in order for the block to progress across the image).

I am doing this project for the purpose of evaluating the speed-up that can be obtained using CUDA, therefore I do not want to change the algorithm in any way or use ‘tricks’ (such as loading a 2-cell thick buffer around each block so that I can do two iterations for each kernel call). Yet, I wish to ensure that I really am making full use of its potential.

The serial CPU version takes around 5-6 times as long to execute the central algorithm as my CUDA version. Does that sound correct? I would imagine that the main thing slowing down my code is the necessity to relaunch the kernel at every iteration (and therefore reload memory, etc.). The concept of ‘halo-swapping’ from parallel computing (whereby the edge pixels can be passed to other processors) does not seem to be possible on the GPU, so I don’t see any way to avoid this slow-down.

Here is a trimmed version of the code:

__global__ void computeker(float *newa, size_t newap){

	__shared__ float scol[3][16];

	int x=threadIdx.x;

	int blockx=blockIdx.x*(blockDim.x-2);

	scol[0][x]=tex2D(oldtex, blockx+x, blockIdx.y);

	scol[1][x]=tex2D(oldtex, blockx+x, blockIdx.y+1);

	scol[2][x]=tex2D(oldtex, blockx+x, blockIdx.y+2);

	__syncthreads();	/* the scol[1][x +/- 1] reads below use other threads' loads */

	if(x==0 || x==blockDim.x-1) return;	/* halo threads only load, they do not compute */

	if((blockx+x-1)>=M) return;

	*((float *)((char *)newa + blockIdx.y * newap) + blockx+x - 1)=fabsf(scol[0][x] + scol[2][x] + scol[1][x + 1] + scol[1][x - 1] - tex2D(edgetex, blockx+x - 1, blockIdx.y))/4;
}






	computeker<<<grid, block>>>(newa, newap);

	if(it<ITS-1){

		cudaMemcpy2DToArray(old, 4, 1, newa, newap, sizeof(float)*M, N, cudaMemcpyDeviceToDevice);

	}





So far I have not heard of many applications that could even get close to “full potential” on a GPU without changing the algorithm. If you find one, that is probably because they chose a really bad implementation for the CPU version.

Don’t guess, find it out! E.g. in the CUDA Visual Profiler’s width plot, anything that is white tends to correspond to overhead. Or just approximate it by assuming 20 - 40 us per call.

Then just calculate those border pixels in both blocks! At least for rather small problems that is still bound to be faster. Though just calculating the appropriate 4x4 kernel for two iterations instead is likely to be simpler, faster and generally saner.

Also, I see little point in these:

if((blockx+x-1)>=M) return;


The cost of divergent branches and uncoalesced stores is likely to be higher than (if necessary) padding the newa buffer and doing some extra calculation.

I have tried using the Profiler for the first time.

It seems to be showing that all of my global memory stores are incoherent. If I changed the kernel so the result of the calculation wasn’t written directly to global memory but instead to registers or shared memory, then put in a __syncthreads() near the end of the kernel and after that wrote the calculated values to global memory, all of my global memory stores should be coalesced. I shall have to see whether it improves performance.

The next problem that I see is that it reports there are about 9152 branches for each ‘method’ (kernel call, I believe) and 2364 of these are divergent. I will look into Reimar’s suggestion to remove the if statements and instead perform unnecessary calculations.

The width plot does not appear to have any white. It has pink at the beginning and end for the memcpys (and for some reason the end one is longer than the beginning one, even though I transfer more data with the memcpy at the beginning of the program), and in between the two is solid yellow indicating my kernel calls. The CPU time for each of my kernel calls is approximately 1067 usec, GPU time about 574 usec. If the overhead per kernel launch is only about 40 usec, that means only about 4% of the CPU time per kernel is overhead.

Well I can’t see any way to make my Global Memory writes coalesced or solve the problem of the divergent threads.

As the program must load border cells which are necessary for the calculation, but does not (and can not) calculate results for those boundary cells, I either have to have non-coalesced memory reads or non-coalesced memory writes.

While I can remove the
if((blockx+x-1)>=M) return;
statement, I cannot remove the other if statement, as that is necessary to prevent the calculation being performed on the border cells (which, as I said, is not possible) and data being written by those threads to global memory.

Unless I make some pretty radical changes to the program, I don’t think it will be possible to solve either of these problems. Is it possible to work out how much they are actually slowing down my program?

Remove all arithmetic ops, leave in your memory operations only, and time it. Figure out how much bandwidth you’re actually getting in memory. I’m guessing that getting around the coalesced memory problem would get you a factor of 5 at least.

Finally, what are your grid dimensions?

Hi tmurray and thanks for the reply.

Well if I comment out everything in the kernel after the writes to shared memory, then the Profiler reports that each kernel is taking about 284usec GPU time and 906 usec CPU time.

My own timing shows that the kernel iteration loop (which includes a kernel launch and a Memcpy D->D) takes about 75% of the time it took with the additional commands left in.

So with no arithmetic operations, no non-coalesced memory reads or writes, no divergent branches, and one global memory read, four shared memory reads, and one texture read less, it looks like I am not obtaining drastically improved speed.

In this situation, the width plot on the Profiler does show lots of white gaps (unlike the previous time). So I presume the main reason the program does not run much faster, even though it does much less work, is that in the previous set-up the overhead was somehow being hidden.

I was thinking that one possible way of eliminating non-coalesced memory writes would be to make a new 2D array in global memory with 14 columns, created using cudaMallocPitch. Then I could write the results of my computation to this array (one row per block) and afterwards copy the whole array to ‘newa’. I think that should solve the problem, but I don’t know whether it would result in a performance improvement.

Currently my grid and block dimensions are:
dim3 block(16, 1);
dim3 grid(M/14+(M%14==0?0:1), N);
where M=840 and N=600
and the kernel loop is iterated 1000 times.

Well, cudaMemcpy2D doesn’t seem to like copying from a 2D array with 14 columns and (M/14)*N rows to one with M columns and N rows, so that idea won’t work.