__global__ void kernel_two_steps_in_one(int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < size ){
        //some threads write to data, some do not
    }
    __threadfence(); //I thought this would have worked.
    __syncthreads();
    if( i < size ){
        //however, here every thread must read its own data[i]. I get an error here; I think some threads reach this point before the writes have completed.
    }
}
I solved this problem by splitting it into two kernels and calling cudaThreadSynchronize() between the two launches.
But I thought I could do it with a single kernel call, and that __threadfence() would solve my problem. It didn't: the kernel crashes, I think because some threads still read locations in data that other threads haven't written yet.
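For reference, the two-kernel fix can be sketched like this (kernel names and bodies are hypothetical placeholders; the point is the launch pattern, not the arithmetic):

```cuda
// Sketch of the two-kernel workaround. Splitting the work across two
// launches is what provides the device-wide ordering: a kernel on a
// stream does not start until the previous kernel on that stream has
// finished, so kernel_read sees every write made by kernel_write.
__global__ void kernel_write(int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // some threads write to data[i], some do not
        data[i] = i;
    }
}

__global__ void kernel_read(const int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // every thread reads its own data[i]; all writes from
        // kernel_write are guaranteed to be visible here
        int v = data[i];
        (void)v; // placeholder for the real second step
    }
}

// Host side:
//   kernel_write<<<blocks, threads>>>(d_data, size);
//   kernel_read<<<blocks, threads>>>(d_data, size);
//   cudaDeviceSynchronize(); // modern name; cudaThreadSynchronize() is deprecated
```

Note that the synchronize call is only needed before the host touches the results; the two kernels are already ordered with respect to each other because they run on the same stream.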
Edit: I'm still confused about __threadfence() and its purpose. I thought that if I put __threadfence() there, all threads on the device would wait at that point for the pending writes to global memory (to data) to complete, so that from that point on threads would read correct data; like a synchronization based on memory state.
I guess I am mistaken. Some explanation beyond the CUDA reference manual would be welcome.
__threadfence() does not block other threads. It forces the calling thread to wait until its own memory writes have completed and been pushed back up the memory hierarchy to the appropriate level (device-wide for __threadfence(), block-wide for __threadfence_block(), etc.). This is a much more subtle guarantee than a true synchronization point, because threads are allowed to progress past the fence instruction at any time.
I'm definitely no expert on when __threadfence() is actually useful; the cases tend to be non-trivial parallel algorithms where explicit synchronization is not needed but some kind of memory-ordering guarantee is. I do know that if you ever think __threadfence() is the answer to a problem, stop and re-evaluate for a moment, because it usually is not. :)
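One place where __threadfence() genuinely is the answer is the "last block" pattern shown in the CUDA programming guide: each block publishes a partial result, and the last block to finish combines them. The sketch below assumes a fixed grid size and hypothetical names (blocksDone, partialSums); the per-block reduction itself is elided to a placeholder.

```cuda
#define NUM_BLOCKS 64

__device__ unsigned int blocksDone = 0;     // how many blocks have published
__device__ float partialSums[NUM_BLOCKS];   // one slot per block

__global__ void sumAll(const float* in, int n, float* out) {
    __shared__ bool amLast;

    // ... each block reduces its slice of `in` into blockSum ...
    float blockSum = 0.0f; // placeholder for the real per-block reduction

    if (threadIdx.x == 0) {
        partialSums[blockIdx.x] = blockSum;
        // Make the write to partialSums[] visible device-wide BEFORE
        // bumping the counter. Without the fence, another block could
        // observe the incremented counter yet still read a stale
        // partial sum.
        __threadfence();
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += partialSums[b];
        *out = total;
    }
}
```

Here the fence orders two writes from the same thread as seen by other blocks; it never makes anyone wait for anyone else, which is exactly the distinction the original question tripped over.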