Global memory optimisation

I am tuning some CUDA code at the moment. I made some changes and was expecting to see a large speedup, but it made no difference.

I have some arrays of data that I am processing with blocks of threads. Successive threads were processing non-sequential entries in the array, based on an itemIndex array, which is an array of items that need to be processed. Something like this…

__global__ void MyFunction(int *itemIndex, int itemCount, int *d_p1, int *d_p2,
	float *d_length, float *d_percent, float *d_precalc, float *d_output)
{
	int i = (blockIdx.x * THREADSPERBLOCK) + threadIdx.x;
	if (i >= itemCount) return;
	int lcix = itemIndex[i];	// get the next item to process

	// Copy global data into registers
	int p1 = d_p1[lcix];
	int p2 = d_p2[lcix];
	float length = d_length[lcix];
	float percent = d_percent[lcix];
	float precalc = d_precalc[lcix];
	float result;

	// Process item
	// ... various calculations on the data

	d_output[lcix] = result;
}

Now originally, the itemIndex array was sort of random, but actually ended up processing alternate items {0,2,4,6, etc}, so I did some preprocessing on the arrays to make itemIndex sequential, and I was expecting a significant speedup, but I didn’t see any improvement at all. Just to confirm, I also tried complete random ordering of the items, so that itemIndex was something like {18427, 3123, 23014, 15923, 8129, etc}, and it still made no difference to the execution time.

From what I’ve read, reading from global memory is slow, but by reading sequential entries with each thread in a warp, it should be much faster. I’m running several thousand threads, in blocks of 32 threads per block.

Any ideas? Am I correct that in-order access should be faster?
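
To make the difference concrete, the two access patterns I'm comparing are roughly these (a sketch, using d_length and itemIndex from the kernel above):

```cuda
// Sequential: thread i of a warp reads element i, so the 32 threads of a
// warp touch one contiguous region -> coalesced into few transactions.
float a = d_length[i];

// Indirect: thread i reads wherever itemIndex[i] points. If the indices
// are scattered, the warp's reads can land in many separate transactions.
float b = d_length[itemIndex[i]];
```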

Have you tried using nvvp or nvprof? What’d they say?

Also, I think you can only coalesce memory if your block dimension is a multiple of 16, as coalesced reads only happen for half-warps, iirc. I could be wrong. But it would explain why using blocks of only 32 threads per block is the issue.

Instead, use something like this:

const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
const int stride = gridDim.x * blockDim.x;	// total threads in the grid

for (int tid = thread_id; tid < size; tid += stride) {
	// process element tid
}

This way your threads march evenly as a grid over the data set.
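
Applied to the kernel above, the grid-stride pattern might look something like this (a sketch only, keeping the original variable names; the per-item work is elided):

```cuda
__global__ void MyFunction(int *itemIndex, int itemCount, int *d_p1, int *d_p2,
	float *d_length, float *d_percent, float *d_precalc, float *d_output)
{
	const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
	const int stride = gridDim.x * blockDim.x;

	// Each thread handles every stride-th item, so a fixed-size grid
	// covers the whole array regardless of itemCount.
	for (int i = thread_id; i < itemCount; i += stride) {
		int lcix = itemIndex[i];
		// ... same per-item work as before ...
		// d_output[lcix] = result;
	}
}
```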

There are also the const and __restrict__ keywords, which help the compiler resolve aliasing issues and may allow further optimization and better caching effects.
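
For example, the kernel signature could be qualified like this (a sketch; on devices that support it, const __restrict__ pointers can also be serviced through the read-only data cache):

```cuda
// const __restrict__ promises the compiler that the read-only arrays do
// not alias the output, enabling reordering and read-only caching.
__global__ void MyFunction(const int * __restrict__ itemIndex, int itemCount,
	const int * __restrict__ d_p1, const int * __restrict__ d_p2,
	const float * __restrict__ d_length, const float * __restrict__ d_percent,
	const float * __restrict__ d_precalc, float * __restrict__ d_output)
```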

Looking at nvvp at the moment. It’s saying lots of things, but none of them make sense yet.

Also, I could be wrong, but isn’t 32 a multiple of 16? :) Just for you, I tried a block size of 16 threads, but as I expected, it slowed down by about 30%. Block sizes of less than 32 threads are just wasting threads, because the warp size is 32 threads. By my understanding, coalescing happens in half-warps, but CUDA is perfectly able to coalesce a full warp in two halves. So there shouldn’t be any issue with 32 threads per block. I also tried 128 threads per block, but it was the same as 32 threads per block.

Do bank conflicts only occur in shared memory? All the technical docs seem to talk about “shared memory bank conflicts”, but then the examples all seem to be accessing global memory to avoid bank conflicts…

Global memory has partitions, and this is a separate issue. Is global memory access susceptible to both bank conflicts and partition camping?

Bank conflicts only occur in shared memory. There is no concept of bank conflicts in global memory. Global memory accesses are subject to partition camping, but not bank conflicts.

Coalescing happens in half warps only for devices of cc1.x compute capability. For cc2.x and newer, coalescing occurs in the context of the accesses generated from a full 32-thread warp.

When ordinary optimization methods yield no benefit, it’s usually because the limiter to performance you think is in effect is not the actual limiter. In these cases, analysis-driven optimization is the best approach (IMO), and that starts with the profiler. You might wish to google “CUDA analysis driven optimization”.

in addition to profiling, you do not mention how you measure the ‘speedup’, or even just the execution time
are you doing the prep work on the host or the device?
it may be that you have simply moved execution time (from the primary kernel to the prep), instead of actually reducing it, particularly when you have a 1-to-1 ratio between prep work and prep work consumption

you do not show how you process/utilize the data; perhaps it is possible to fold up/reuse more data, and/or push more data into the prep, which would imply execution time gains

I have my own timing code implemented, which independently measures the time for the prep, copy to device, kernel, and copy to host, so I can see the various times in my logs.
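
For what it’s worth, the kernel-timing part is essentially the standard CUDA event pattern, something like this (a sketch; numBlocks and the argument names are just placeholders for my actual launch):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MyFunction<<<numBlocks, THREADSPERBLOCK>>>(d_itemIndex, itemCount,
	d_p1, d_p2, d_length, d_percent, d_precalc, d_output);
cudaEventRecord(stop);

cudaEventSynchronize(stop);	// wait for the kernel to complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);	// elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```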

There may certainly be opportunity to fold up the data into tiles, so that each thread works on a larger dataset, reducing the amount of data that comes from global memory. That would require a major restructure of the implementation and a much more complex algorithm, so I’m trying to avoid it at the moment, but perhaps I will need to do this eventually.