Coalescing question, possibly a write-commit question

I am getting some strange behavior when trying to read global memory. My application performs operations on elements of a large array, which is initialized to zero. The device code sets the first element of the array to a value, then all subsequent elements are copied from that first element. I don’t expect all the values to be set: there are multiple blocks running simultaneously, so until the first block of threads completes, the initial element will still be zero. The problem is that the results are inconsistent, with a different number of errors each run.

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

#define mesh 512

const dim3 dimBlock(512);
const dim3 dimGrid(mesh/dimBlock.x, mesh/dimBlock.y, 1);
const int Num_Elements = mesh*mesh;

static long *data_elements_h = NULL, *data_elements_d = NULL;

// Device Function
__global__ void Process_Elements(long* data_elements_d)
{
	int ThreadID = (blockIdx.y*gridDim.x+blockIdx.x)*blockDim.y*blockDim.x+threadIdx.x;

	if (ThreadID==0) data_elements_d[ThreadID] = 0xF;
	else data_elements_d[ThreadID] = data_elements_d[0]; // Suspect read/write
}

// Host Function
bool InitCudaDevice(void) { cudaSetDevice(0); return true; }

int main(int argc, char* argv[])
{
	long input_h = 0xF;
	int elements_error = 0;

	if (!InitCudaDevice()) return 0;

	size_t Data_size = Num_Elements*sizeof(long);
	data_elements_h = (long *)malloc(Data_size);
	data_elements_h[0] = input_h; // Set first element to the input value
	for (int ThreadID=1; ThreadID<Num_Elements; ThreadID++) data_elements_h[ThreadID] = 0; // Init other elements to zero

	cudaMalloc((void **) &data_elements_d, Data_size);
	cudaMemcpy(data_elements_d, data_elements_h, Data_size, cudaMemcpyHostToDevice);

	Process_Elements<<<dimGrid, dimBlock, 0, 0>>>(data_elements_d);

	cudaMemcpy(data_elements_h, data_elements_d, Data_size, cudaMemcpyDeviceToHost);

	for (int k=0; k<Num_Elements; k++) {
		if (data_elements_h[k] != input_h) {
			elements_error++;
			//printf("data_elements_h(%d): %lX\n", k, data_elements_h[k]);
		}
	}

	if (elements_error) printf("%d OK, %d Errors out of %d\n", Num_Elements-elements_error, elements_error, Num_Elements);
	else printf("No Element Errors\n");

	cudaFree(data_elements_d);
	free(data_elements_h);

	return 0;
}[/codebox]


If I initialize the data with the first element already set, then I get no errors and every data element holds the value 0xF after one iteration. However, I need the code to run on the device without having to copy all the memory back and forth.

How do I determine what span to leave before reading a value written by a previous thread? I’d like to be able to reference data based on the ThreadID, like so:

data_elements_d[ThreadID] = data_elements_d[ThreadID - padding];

Thanks for any pointers,


Check out __threadfence() in the 2.2 beta.
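In case it helps, the idea in the 2.2 guide is: write the data, call __threadfence() so the write becomes visible to the whole device, and only then set a flag that readers poll. A rough sketch based on my reading of the docs (the flag name and kernel name are mine, and I haven’t run this):

[codebox]__device__ volatile int ready = 0; // published only after the write is fenced

__global__ void Fenced_Broadcast(long* data_elements_d)
{
	int ThreadID = blockIdx.x*blockDim.x + threadIdx.x;

	if (ThreadID == 0) {
		data_elements_d[0] = 0xF;  // 1. write the value
		__threadfence();           // 2. fence: the write is now visible device-wide
		ready = 1;                 // 3. publish the flag
	} else {
		while (ready == 0);        // spin until thread 0 has published
		data_elements_d[ThreadID] = data_elements_d[0];
	}
}[/codebox]

One caveat: the spin loop only terminates if the block containing thread 0 is actually resident. If you launch more blocks than fit on the chip at once, blocks that get scheduled first can spin forever waiting on a block that hasn’t started yet.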

Thanks! That was easier than expected. I’ve just registered for the Beta.

Just to check my understanding. The following should allow me to access data written by a previous block:

[codebox]__global__ void Process_Elements(long* data_elements_d)
{
	int ThreadID = (blockIdx.y*gridDim.x+blockIdx.x)*blockDim.y*blockDim.x+threadIdx.x;

	if (ThreadID<512)  data_elements_d[ThreadID] = 0xF;
	else data_elements_d[ThreadID] = data_elements_d[ThreadID-512]; // Suspect read/write
}[/codebox]

But this WOULDN’T allow me to access data written by the previous thread (though it would be much slower anyway):

[codebox]__global__ void Process_Elements(long* data_elements_d)
{
	int ThreadID = (blockIdx.y*gridDim.x+blockIdx.x)*blockDim.y*blockDim.x+threadIdx.x;

	if (ThreadID==0) data_elements_d[ThreadID] = 0xF;
	else data_elements_d[ThreadID] = data_elements_d[ThreadID-1]; // Suspect read/write
}[/codebox]

Thanks again!

Just to clarify, my post above was a question based on reading the much-improved Version 2.2 Beta Programming Guide. I don’t have the 2.2 Beta myself, so don’t interpret the above code as correct.

If anyone does, I’d be most grateful for any confirmation.


How do you define previous? In CUDA, threads are run in a completely non-deterministic order.

With a data set larger than the number of threads times the number of multiprocessors (512*30), the execution is divided into blocks. Each thread’s global ID is blockIdx.x * blockDim.x + threadIdx.x, with blockIdx starting at zero and increasing until the entire data set is covered; the “first” block is the one containing thread ID 0.

My question is whether data calculated and updated in block 0 is available to “later” blocks (block 1000, for example), and if so, how many blocks I should leave before assuming the data has been updated.

You cannot make any assumptions on the order in which blocks are executed. The hardware is free to run block 1000 first if it chooses to do so.

In fact, I currently have a kernel that uses atomicInc to update the index where I write into my output array (I write from very few threads, compared to the number of threads running).

When looking at the output array, it seems that “higher-numbered” threads (threadIdx.x + blockIdx.x*blockDim.x) are running before “lower-numbered” threads.
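To illustrate, the kernel does something along these lines (heavily simplified; the counter, kernel, and array names are made up for this post):

[codebox]__device__ unsigned int out_idx = 0; // next free slot in the output array

__global__ void Record_Order(unsigned int* out, unsigned int max_out)
{
	if (threadIdx.x == 0) {  // one thread per block writes
		// atomicInc returns the old counter value, wrapping back to 0 past max_out
		unsigned int slot = atomicInc(&out_idx, max_out);
		out[slot] = blockIdx.x; // record which block claimed each slot
	}
}[/codebox]

Dumping `out` on the host afterwards shows the slots are not claimed in blockIdx order.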

Understood, thanks for the clarification. Will this change in 2.2 with __threadfence();?