Global memory write cost

Hi all,

A collegue and myself have implemented a naive cuda kernel that is designed to calculate the covarience of a sparse matrix on the GPU.

We aim to do this mostly as an effort to do some statistical inference, the company I work for operates in the web domain, and part of its operation is datamining large quantities of keywords, tokens and n-grams for relevance against each other. In this system we aim to find relations between tokens.

During tuning we were suprised to find that the kernel is very slow, but seemingly only at writing the result back to global memory

The following (also available here… is the implementation of the kernel, this is wrapped in a program

__device__ float get_intersections(int* intr, int t1, int t2, int wI) {

	int n = 0;

	for (int i = 0; i < wI; i++) {

		int x1 = (t1 * wI) + i;

		if (intr[x1] == -1)


		for (int j = 0; j < wI; j++) {

			int x2 = (t2 * wI) + j;

			if (intr[x2] == -1)


			if (intr[x1] == intr[x2])




	return (float) n;


__global__ void calc(float* result, int* tokens, int* intr, int wT, int wK, int wI) {

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	int j = blockIdx.y * blockDim.y + threadIdx.y;

	float t1 = (float) tokens[i];

	float t2 = (float) tokens[j];

	float v = 0;

	if (i >= j) {

		float t00 = -t1 / wK;

		float t01 = 1 - t1 / wK;

		if (i == j) {

			// calculate diagonal

			v = ((t01 * t01 * t1) + (t00 * t00 * (wK - t1))) / wK;

		} else {

			float nn = get_intersections(intr, i, j, wI);

			float t10 = -t2 / wK;

			float t11 = 1 - t2 / wK;

			v = ((nn * t01 * t11) + ((t1 - nn) * t01 * t10) + ((t2 - nn) * t00 * t11) + ((wK - (t2 + t1 - nn)) * t00 * t10)) / wK;


		result[i + (j * wT)] = v;

		result[j + (i * wT)] = v;



If we run the kernel as is, we find that it is slower than a CPU implementation (taking with our example datasets about 5 seconds)

If however the following lines are commented out, then the kernel is extremely fast, offering a very good speedup over the CPU

result[i + (j * wT)] = v;

		result[j + (i * wT)] = v;

For background reading I am guessing that the writes to global memory are not aligned or coalesced, but I cannot see how we would be able to coalesce these, this is more my lack of understanding and immaturity about CUDA more than anything else.

Does anyone have any ideas on what we are doing wrong, and how we can improve this kernel ?.

I can naturally provide more details as needed

Many thanks in advance

If you comment out the write back to memory, the compiler optimizes away the whole kernel.

Your kernel is very slow due to the repeated, uncoalesced memory accesses. Note that pre-Fermi GPUs do not cache global memory accesses. You need to either move the relevant data into shared memory and operate there, or use textures, which are cached.

Or try your code on a GTX470/GTX480.

Thanks for the speedy reply, you have saved us from attempting to optimise the wrong thing !, from further experimentation it turns out the the intr array calls in get_intersections are the expensive part, most likely down to, as you suggested uncoallesced memory calls.

Could you refer me to a document describing the optimizations of nvcc? I have a similar problem and I would like to gain some more insight.

nvopencc is based on the open64 compiler. Unfortunately, there is little documentation available I’m aware of, and reading the source code yourself might not be what you are after. :-) Maybe you can find something with google though.