I’m new to CUDA programming, and I’m having trouble making my code faster. My main concern is that the memory accesses aren’t being coalesced, and I can’t get the profiler to work with the counters or timestamps :(. Any suggestions would be appreciated. Thanks!

This is my kernel:

#define BLOCK_SIZE 128

__global__ void transform(cufftComplex* output, cufftComplex* f, int fM, int fN, float TWOPI, int fxN, int fyN)
{
	int bx = blockIdx.x;
	int tx = threadIdx.x;
	int index = tx + (bx * BLOCK_SIZE);

	int ifx = index % fxN;
	int ify = index / fxN;
	int end = fN * fM;
	int step = BLOCK_SIZE;

	// All threads in the block must reach the __syncthreads() calls below,
	// so the bounds check guards only the per-thread work, not the tile loop.
	bool active = (ifx < fxN && ify < fyN);

	int check = 0;
	float x, y, fx, fy, dx, dy, m, n;
	int ix, iy;

	cufftComplex tot = make_cuFloatComplex(0, 0);
	cufftComplex med;

	for (int i = 0; i < end; i += step)
	{
		__shared__ cufftComplex fTile[BLOCK_SIZE];

		if (tx + i < end)
			fTile[tx] = f[tx + i];
		__syncthreads();	// wait until the whole tile is loaded

		if (end - i < BLOCK_SIZE)
			check = end - i;
		else
			check = BLOCK_SIZE;

		if (active)
		{
			for (int j = 0; j < check; j++)
			{
				ix = (j + i) % fM;
				iy = (j + i) / fM;
				x = x_const[ix];
				y = y_const[iy];
				fx = fx_const[ifx];
				fy = fy_const[ify];
				dx = dx_const[ix];
				dy = dy_const[iy];
				m = -TWOPI * ((x * fx) + (y * fy));
				n = dx * dy;
				med = make_cuFloatComplex(n * __cosf(m), n * __sinf(m));
				tot = cuCaddf(tot, cuCmulf(med, fTile[j]));
			}
		}
		__syncthreads();	// don't overwrite the tile while it may still be read
	}

	if (active)
		output[index] = tot;
}




Any ideas? :wacko:

I can’t see exactly where the problem with coalescing is. What is BLOCK_SIZE?

BTW, is declaring the shared array within the loop intended?

Let’s see:

  • I’m assuming you used cudaMalloc() to allocate memory for output and f. If so, then the memory is properly aligned (to 256 bytes, to be precise), and the reads and writes can be coalesced.
  • You’re reading 8-byte (64-bit) elements (cufftComplex is two floats).
  • You’re using a constant base, with the kth thread accessing the kth element (evidenced by [constBase + tx]).

Yes, you have all the requirements for coalesced reads. In your case, you will get one coalesced 128-byte transaction per half-warp.

EDIT - Oh, and in f[tx + i], i must be a multiple of 16 (the number of threads in a half-warp). That way you make sure that the memory segment you are reading is aligned to the proper size (128 bytes in this case). I see that i is incremented by 128, so you’re fine there as well.

Thanks for the reply! I’ve been going over the reduction example for ideas on how to improve performance, but is there anything obvious in my code that could be a bottleneck? I checked the occupancy in the profiler, and it’s very low (0.5). :unsure:
@Big Mac
Thanks for pointing that out. I moved the shared memory declaration out of the loop.
BLOCK_SIZE is set to 128 at the top.

50% occupancy is not that low, actually. Occupancy is generally overrated as a performance metric. (It definitely does not mean that your card is idle 50% of the time, which is what the term might lead you to believe.)

A multiprocessor can support 768 resident threads, or 1024 for compute capability 1.2 and higher. This translates into 24 warps per multiprocessor (32 on a G200 or newer). Occupancy is just a measure of how many warps, out of the maximum supported, can be resident on a multiprocessor at a given time.

You may use the CUDA Occupancy Calculator to find the limiting factor, and try to increase occupancy by optimizing your kernel. Sometimes you’ll see big gains, sometimes a minute difference; it depends on the specific kernel. From my experience, an occupancy of 50% to 67% is fairly good. At 50%, 12 warps can be resident on a multiprocessor, which, on a GPU with 16 MPs, amounts to 192 warps, or 6144 threads. Unless you have many more threads than that, improving occupancy is not likely to give large gains.

As for your kernel, something does jump out: *_const. I’m assuming those are arrays copied into constant memory using cudaMemcpyToSymbol(). Although you get 64KB of constant memory to play with, the cached working set is only 8KB, so if your constant data adds up to more than that, constant memory reads can get very expensive. I’ve found it is sometimes better to read whatever data is needed into shared memory, and use it from there, as opposed to reading it directly from constant memory. You can also read part of the data from global memory into shared memory, and part from the constant cache. You will have to test to see which method works faster for you.

Second, you use several if statements to check whether you have passed the end of the data (end). Is it possible to pad your data and do the calculations without the if statements? If so, that should give you an extra percent or two in speed, as the repeated comparisons and divergent branches are not free.