Race condition? How to define thread-local variables in CUDA?

I am trying to rewrite some shared-memory CPU code in CUDA, and I am having trouble finding a substitute for thread-local storage. In other words, I want each CUDA thread to work with its own local variables, without threads interfering with each other.

A toy example I wrote is pasted below; it prints “Bad” results because of race conditions. The atomicAdd version works fine, but I do not actually need an atomic add. So the question is: how do I make sure no other thread writes to the same index of the var array?

__global__ void thread_local_var_test_kernel1(int** worker_array, int* var, int size)
{
	const int
		blockId = blockIdx.y * gridDim.x + blockIdx.x,
		idx = blockId * blockDim.x + threadIdx.x;
	if (idx < size)
	{
		int* worker = worker_array[threadIdx.x];		// <------- I want the worker array to work without a race condition
		(void)worker;	// unused for now; placeholder for the per-thread scratch space

		var[threadIdx.x]++;		// race condition: threads in different blocks share the same threadIdx.x
		//atomicAdd(&var[threadIdx.x], 1);	// OK
	}
}

static void thread_local_var_test_kernel_test1(int size)
{
	int threadsPerBlock = block_dim;
	int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

	int* dev_var = 0;
	cudaMalloc(&dev_var, threadsPerBlock * sizeof(int));
	cudaMemset(dev_var, 0, threadsPerBlock * sizeof(int));

	int** dev_worker_array = 0;
	cudaMalloc(&dev_worker_array, threadsPerBlock * sizeof(int*));

	thread_local_var_test_kernel1 <<<blocksPerGrid, threadsPerBlock>>>(dev_worker_array, dev_var, size);
	cudaDeviceSynchronize();

	printf("================================ CUDA kernel is completed ================================\n");
	int* host_var = (int*)malloc(threadsPerBlock * sizeof(int));
	cudaMemcpy(host_var, dev_var, threadsPerBlock * sizeof(int), cudaMemcpyDeviceToHost);
	cudaFree(dev_var);
	cudaFree(dev_worker_array);
	int sum_var = 0;
	for (int i = 0; i < threadsPerBlock; i++)
	{
		printf("%d, ", host_var[i]);
		sum_var += host_var[i];
	}
	printf("CUDA sum = %d (%d is expected).\n", sum_var, size);
	(sum_var == size) ? printf("Good.\n") : printf("Bad.\n");
	free(host_var);
}

Atomics should work.

If you have an orderly pattern (such as in your toy code), you can use a classical parallel reduction:

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
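For reference, the core idea from that PDF boils down to a shared-memory tree reduction. This is a minimal sketch (not from the thread), assuming the block size is a power of two and that the launch passes `blockDim.x * sizeof(int)` as dynamic shared memory:

```cuda
// Minimal block-level reduction sketch; assumes blockDim.x is a power of two.
__global__ void block_sum(const int* in, int* out, int size)
{
	extern __shared__ int sdata[];		// one int per thread in the block
	int tid = threadIdx.x;
	int idx = blockIdx.x * blockDim.x + tid;

	sdata[tid] = (idx < size) ? in[idx] : 0;	// load (0-pad the tail)
	__syncthreads();

	// Tree reduction: halve the number of active threads each step.
	for (int s = blockDim.x / 2; s > 0; s >>= 1)
	{
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	if (tid == 0)
		out[blockIdx.x] = sdata[0];		// one partial sum per block
}

// Launch sketch:
// block_sum<<<blocksPerGrid, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(dev_in, dev_out, size);
// The per-block partials in dev_out are then summed on the host or by a second launch.
```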

Thanks, Robert. However, I do not actually need sum_var; I use it only to show the “Bad” results. I have updated the code with comments. What I really need is for the worker array to work correctly without interference from other threads. For example, I could pre-allocate the worker arrays on the host or in another kernel, and use them in the core kernel later.

The way it should work is: once an array has been allocated for each thread, each thread should be able to reuse that memory multiple times without repeated allocate/free calls.
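One common way to get that behaviour (a sketch, not taken from this thread) is to allocate a single flat pool sized for the whole grid once, and have each thread index into it by its *global* thread id. Since `threadIdx.x` repeats across blocks, indexing by the global id is what removes the race; the pool then persists across any number of kernel launches. `WORKER_SIZE` and `use_worker` here are hypothetical names:

```cuda
#define WORKER_SIZE 64		// assumed per-thread scratch size (pick to fit your algorithm)

__global__ void use_worker(int* worker_pool, int size)
{
	const int
		blockId = blockIdx.y * gridDim.x + blockIdx.x,
		idx = blockId * blockDim.x + threadIdx.x;	// global thread id: unique per thread
	if (idx < size)
	{
		// Each thread gets its own WORKER_SIZE-int slice of the pool;
		// no two threads ever touch the same slot, so no race and no atomics.
		int* worker = &worker_pool[idx * WORKER_SIZE];
		worker[0] = idx;	// safe: this slice is private to thread idx
	}
}

// Host side: allocate once, then reuse across many launches without free/realloc.
// int totalThreads = blocksPerGrid * threadsPerBlock;
// int* dev_pool = 0;
// cudaMalloc(&dev_pool, (size_t)totalThreads * WORKER_SIZE * sizeof(int));
// use_worker<<<blocksPerGrid, threadsPerBlock>>>(dev_pool, size);	// launch 1
// use_worker<<<blocksPerGrid, threadsPerBlock>>>(dev_pool, size);	// launch 2, same pool
// ...
// cudaFree(dev_pool);		// only when truly done
```

The same pattern works with a `int**` table of per-thread pointers filled in by a setup kernel, but a flat pool avoids the extra indirection and keeps neighbouring threads' data coalesced if you interleave by thread id instead.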