Writing results into a global array for only some threads

Hello everybody!

I am wondering about returning results from a kernel when not every thread generates a result. Currently, I return the data through an array stored in global memory and use a global index variable that points to the next free entry; I increment this index with atomicAdd.

The (very simplified) code looks like this:

__global__ void my_kernel(int* index, float* results)
{
	// Get the thread and block id.
	const int blockId = blockIdx.x;
	const int threadId = threadIdx.x;

	// Solve the sub-problem corresponding to the thread and block id.
	const int score = solveSubproblem(blockId, threadId);

	// If the score is high enough, write it into the results array.
	if (score > 0)
	{
		// Reserve the next free slot and write the result.
		const int resultIndex = atomicAdd(index, 1);
		results[resultIndex] = score;
	}
}
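For context, the host side is driven roughly like the following (a simplified sketch; the buffer capacity, launch configuration and function name are just placeholders, and overflow of the results buffer is not checked here):

#include <cuda_runtime.h>
#include <vector>

void launchAndCollect(int numBlocks, int threadsPerBlock)
{
	// Placeholder capacity; in the real code this is bounded by the
	// memory constraints mentioned below.
	const int maxResults = 1 << 20;

	int*   d_index   = nullptr;
	float* d_results = nullptr;
	cudaMalloc(&d_index, sizeof(int));
	cudaMalloc(&d_results, maxResults * sizeof(float));
	cudaMemset(d_index, 0, sizeof(int));

	my_kernel<<<numBlocks, threadsPerBlock>>>(d_index, d_results);

	// The counter's final value tells how many results were actually written.
	int numResults = 0;
	cudaMemcpy(&numResults, d_index, sizeof(int), cudaMemcpyDeviceToHost);

	std::vector<float> results(numResults);
	cudaMemcpy(results.data(), d_results, numResults * sizeof(float),
	           cudaMemcpyDeviceToHost);

	cudaFree(d_index);
	cudaFree(d_results);
}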

Well, the good thing first: I already get a very nice speedup using CUDA. But judging from the runtime measurements, the kernel is significantly faster when only few results are produced. I have already ruled out memory reads and the computation inside the kernel; both should take constant time regardless of the result of solveSubproblem(…). I suspect the atomicAdd is the reason, especially when many threads running in parallel contend for the same counter.

The straightforward solution would be an array with one entry per thread, but that is not possible due to memory constraints. Another solution mentioned in this forum is stream compaction using scan primitives; however, I do not really see how that would solve the problem.

Is there a smarter way than the one I am using to return results in an array when not every thread produces a result?

Any input on this is highly appreciated! Thanks in advance. :)

If the problem is contention on the atomic counter, you could make your results array two-dimensional. One index would be the block index, the other would be the self-incrementing index you have above. You then just have to convert the 2D index into a flat one:

results[blockId][resultIndex] = score;

becomes

results[blockId * blockDim.x + resultIndex] = score;

Verify the indexing scheme I used above; I sometimes get blockDim and gridDim mixed up.
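In kernel form it might look something like this (an untested sketch reusing the names from your kernel; blockCounts is assumed to be one int per block, zeroed before the launch, and results then needs room for gridDim.x * blockDim.x entries):

__global__ void my_kernel_blockwise(int* blockCounts, float* results)
{
	const int blockId = blockIdx.x;
	const int threadId = threadIdx.x;

	const int score = solveSubproblem(blockId, threadId);

	if (score > 0)
	{
		// Only the threads of the same block contend for this counter.
		const int resultIndex = atomicAdd(&blockCounts[blockId], 1);

		// Each block owns a slice of blockDim.x entries in the results array.
		results[blockId * blockDim.x + resultIndex] = score;
	}
}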

Perhaps if each thread calculated multiple values, you could queue up the values locally and then instead of atomicAdd(index, 1), you could atomicAdd(index, nresults). If the performance drop is due to contention on atomics, it should help.
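Something along these lines (a rough, untested sketch; MAX_LOCAL and the way the extra sub-problems are assigned to a thread are made up purely for illustration):

__global__ void my_kernel_batched(int* index, float* results)
{
	// Made-up upper bound on how many results one thread can produce.
	const int MAX_LOCAL = 8;
	float localBuffer[MAX_LOCAL];
	int nresults = 0;

	// Each thread now handles several sub-problems and buffers its results.
	for (int i = 0; i < MAX_LOCAL; ++i)
	{
		const int score = solveSubproblem(blockIdx.x, threadIdx.x * MAX_LOCAL + i);
		if (score > 0)
		{
			localBuffer[nresults++] = score;
		}
	}

	// One atomic per thread reserves room for the whole batch.
	if (nresults > 0)
	{
		const int base = atomicAdd(index, nresults);
		for (int i = 0; i < nresults; ++i)
		{
			results[base + i] = localBuffer[i];
		}
	}
}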

ctierney42 and Jamie K, thank you for your replies!

Introducing a two-dimensional array and splitting the index into several per-block ones should result in fewer conflicts and seems worth trying. Computing multiple results in one thread and writing the whole batch into the results array also sounds promising, because it directly reduces the number of accesses to the index. I will let you know how it turns out. :)

Both solutions aim at reducing the bottleneck introduced by the atomic add. Is there a completely different way to solve my problem that would circumvent the bottleneck altogether?

What kind of memory constraints are we talking about? I assume your grids are huge, so a full-size results array won't fit. You could circumvent this by running smaller grids, writing out every result, and compacting/moving the results on the CPU, which should be much faster at that. You can then run your kernel multiple times with an offset into the data, for example to compute the results for other parts of your problem.
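Roughly what I have in mind, as an untested sketch (one result slot per thread, a negative value marks "no result" since your scores are positive, and the chunking scheme is just an example with totalBlocks assumed to be a multiple of blocksPerChunk):

#include <cuda_runtime.h>
#include <vector>

__global__ void my_kernel_dense(float* results, int blockOffset)
{
	const int globalId = blockIdx.x * blockDim.x + threadIdx.x;
	const int score = solveSubproblem(blockIdx.x + blockOffset, threadIdx.x);

	// Every thread writes exactly one slot; -1.0f marks "no result".
	results[globalId] = (score > 0) ? (float)score : -1.0f;
}

// Host side: run the kernel chunk by chunk and compact on the CPU.
void runChunked(int totalBlocks, int blocksPerChunk, int threadsPerBlock,
                std::vector<float>& allResults)
{
	const int chunkSize = blocksPerChunk * threadsPerBlock;
	float* d_results = nullptr;
	cudaMalloc(&d_results, chunkSize * sizeof(float));
	std::vector<float> h_chunk(chunkSize);

	for (int offset = 0; offset < totalBlocks; offset += blocksPerChunk)
	{
		my_kernel_dense<<<blocksPerChunk, threadsPerBlock>>>(d_results, offset);
		cudaMemcpy(h_chunk.data(), d_results, chunkSize * sizeof(float),
		           cudaMemcpyDeviceToHost);

		// Compaction on the CPU: keep only the slots that hold a real result.
		for (float v : h_chunk)
			if (v >= 0.0f)
				allResults.push_back(v);
	}

	cudaFree(d_results);
}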

Ojiisan, thanks for your reply. I think splitting the grid into multiple smaller ones would not solve the problem, because the atomicAdd bottleneck remains. Nevertheless, thanks for your input. :)

A small update: following Jamie K's advice, I now compute multiple results in one thread, buffer them, and write them back in a single step at the end. This means I use the atomic add only once per batch of results instead of once per result. I got a significant speedup and am quite happy right now. :)

However, it is still a workaround, and I would be even happier if there were a solution that does not rely on atomic add at all. (Even though I get the feeling that this is not entirely possible, because the writes into one single results array have to be coordinated somehow …)