Question related to __threadfence

Hi all,

If my question is off topic or has already been asked, I'm sorry. I've read articles about it, but it's still not clear in my mind. Thanks for your help in advance.

// counter = 0 before the kernel launch
__global__ void kernel(volatile int *buffer, int *counter)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (/* condition */)
    {
        // append tid to the buffer; atomicAdd hands out unique slots
        buffer[atomicAdd(counter, 1)] = tid;
        __threadfence();
    }
    else
    {
        /* Some computation */
    }

    /* [HERE] Can I be sure that every thread in every block reads the same values from the buffer array? */
}

I have this sort of kernel with a massive launch configuration (<<<512,1024>>>). Some of its threads write data into the buffer array in order. I'm keeping my counter in global memory and incrementing it using atomics.
My question is about the last line: can I be sure that every thread in every block reads the same values from the buffer array? Does __threadfence guarantee that?

i would think that threadfence implies a memory barrier, which in turn helps to preserve memory transaction ordering
i would also think that the above expects or implies across-block synchronization
i do not see how threadfence can be used to ensure across-block synchronization

ask njuffa to code an across-block memory barrier/synchronization primitive for you

@little_jimmy thanks for the answer. I was thinking more or less the same as you, but in this case it didn't help at all :) I also tried moving the __threadfence() call from inside the if block down to just before the [HERE] comment; nothing changed. Have I misunderstood the semantics of threadfence?

__threadfence() by itself won’t help here. It doesn’t matter where you put it.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

If you want to guarantee that a thread sees a value written by another thread in another threadblock, you will need some kind of grid-wide synchronization (execution) barrier.

the simplest canonical grid-wide execution barrier is the kernel launch

other, constructed grid-wide execution barriers generally require considerable care to use correctly and to avoid deadlock.
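To make that concrete, here is a minimal sketch of using the kernel launch boundary as the grid-wide barrier for the kernel in the original post. The names produce_kernel and consume_kernel, the placeholder condition, and the host-side calls are assumptions for illustration, not code from this thread:

// Split the work at the point where grid-wide visibility is needed.
__global__ void produce_kernel(int *buffer, int *counter)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if ((tid & 1) == 0)                       // placeholder for the real condition
        buffer[atomicAdd(counter, 1)] = tid;  // hand out unique slots via the atomic counter
}

__global__ void consume_kernel(const int *buffer, const int *counter)
{
    // Every thread in every block of this second kernel is guaranteed to see
    // all writes made by produce_kernel: the launch boundary is the barrier.
}

// Host side:
//   produce_kernel<<<512, 1024>>>(d_buffer, d_counter);
//   consume_kernel<<<512, 1024>>>(d_buffer, d_counter);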

thank you very much @txbob.

So __threadfence doesn't mean that it orders memory writes grid-wide. In which cases can we use it then, for example?

I also tried to use __gpu_sync() from this paper: http://eprints.cs.vt.edu/archive/00001087/01/TR_GPU_synchronization.pdf. But my GPU hung :/

__threadfence() guarantees ordering of global memory writes. This means that given this:

(assume global_data was initialized to all zero prior to invocation of kernel)

global_data[idx+1000] = 1;
__threadfence();
global_data[idx+2000] = 2;

no thread, anywhere in the grid, will have the possibility of doing

int2 = global_data[idx+2000];
int1 = global_data[idx+1000];

and observing that int2 == 2 whereas int1 == 0 (subject to the caching restrictions described in the doc link I provided).
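A concrete example of where this ordering matters is the usual data-then-flag publishing pattern. The sketch below is only an illustration under assumed names (data, flag, writer, reader); it is not code from the docs:

__device__ volatile int data = 0;   // payload
__device__ volatile int flag = 0;   // "payload is ready" signal

__device__ void writer(void)
{
    data = 42;            // (1) produce the payload
    __threadfence();      // ensure (1) is visible device-wide before (2)
    flag = 1;             // (2) publish
}

__device__ void reader(void)
{
    if (flag == 1) {
        // Any thread that observes flag == 1 must also observe data == 42,
        // because the fence orders the writer's two stores (and volatile keeps
        // the reads from being satisfied by stale cached values).
        int x = data;
        (void)x;
    }
}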

A possible use case is given in the threadFenceReduction CUDA sample code.

http://docs.nvidia.com/cuda/cuda-samples/index.html#threadfencereduction

There it serves 2 purposes:

  1. guarantee that a given threadblock’s reduction result is not consumed until it is ready (via ordering of writes)
  2. enable a special kind of global synchronization that I call “threadblock-draining sync”
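Roughly, the draining part of that sample follows the pattern sketched below (a simplified paraphrase, not the sample's actual code; partial[] and the block-level reduction are placeholders):

__device__ unsigned int retirementCount = 0;   // zeroed before launch

__global__ void reduce_with_drain(const float *in, float *partial, float *out, int n)
{
    __shared__ bool isLastBlockDone;

    float blockSum = 0.0f;  /* ...block-level reduction of this block's slice of in[0..n)... */

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = blockSum;               // publish this block's result
        __threadfence();                              // purpose 1: result visible before the signal
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);  // purpose 2: am I the last block to retire?
    }
    __syncthreads();

    if (isLastBlockDone) {
        // Only the last block to finish runs this; all other blocks have already
        // published their partial sums, so combining them here is safe.
        /* ...reduce partial[0..gridDim.x) into *out... */
        if (threadIdx.x == 0) retirementCount = 0;    // reset for the next launch
    }
}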

The GPU memory model is a bit tricky and txbob has the main idea correct here, but it is important to remember that none of the thread fences affect the GPU caches. If you really want to preserve memory ordering across threads on different SMs, make sure to use memory instructions that bypass (or write through as needed) the L1 caches.

That is why I made the comment in my posting:

" (subject to the caching restrictions described in the doc link I provided)."

to wit:

“Note that for this ordering guarantee to be true, the observing threads must truly observe global memory and not cached versions of it; this is ensured by using the volatile keyword…”
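In code, that just means the observing side reads through a volatile-qualified pointer, e.g. (a tiny sketch reusing the names from the earlier snippet; observer and result are made up):

__global__ void observer(volatile int *global_data, int *result, int idx)
{
    // volatile forces each load to go out to global memory instead of
    // reusing a register or a stale cached value.
    int int2 = global_data[idx + 2000];
    int int1 = global_data[idx + 1000];
    result[0] = int1 + int2;
}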

You may want to read section 5 of that VT paper you linked. In-kernel grid-wide synchronization schemes frequently depend on all threads being schedulable (the threadblock-draining approach being a notable exception). Without going into a careful description of what this means or would require, a fairly simple way to satisfy it would be to launch at most 1 threadblock per SM, which is exactly what is specified as a requirement in section 5 of that paper. In particular:

“Our solution to this problem is to have an one-to-one mapping between thread blocks and SMs. In other words, for a GPU with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in the kernel.”

If you are attempting such a scheme with a kernel launch such as the one you have described, it cannot possibly be reliable, as you have violated that requirement. There are no GPUs currently that have 512 SMs.
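For what it's worth, a sketch of respecting that one-block-per-SM requirement on the host side might look like the following (my_sync_kernel is a placeholder name, and it also has to be small enough in registers/shared memory/threads that one block per SM can actually be resident):

#include <cuda_runtime.h>

__global__ void my_sync_kernel(int *buffer, int *counter) { /* ... */ }

int main()
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    int *d_buffer = 0, *d_counter = 0;   // allocations omitted for brevity
    // Launch at most one block per SM, per section 5 of the linked paper.
    my_sync_kernel<<<numSMs, 1024>>>(d_buffer, d_counter);
    cudaDeviceSynchronize();
    return 0;
}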

You can be sure that they don't. A Maxwell SM can run up to 2048 threads simultaneously, and a Titan X has 24 SMs, so some of your threads need to finish in order to allow other threads to start executing. You need to split it into two kernels.

Hi,

@txbob @Gregory Diamos thank you for the detailed explanation. Now I'm quite clear on threadfence :)

And I missed that in the gpu_sync paper. Yes, you're right, their approach doesn't fit my case.

@BulatZiganshi what do you mean exactly? As far as I know, the Maxwell SMM is different from the Kepler SMX: an SMX has 192 cores, an SMM has 128 cores.

They can also run multiple threads per core, with a technology like Hyper-Threading on CPUs. So each SMX/SMM can run up to 2048 threads simultaneously, and the entire GPU runs fewer than 50K threads at once. Since your job has 512K threads, they can't all run simultaneously.
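To put numbers on that, here is a small host-side sketch (the 24-SM / 2048-threads-per-SM figures are the Titan X values quoted above; the queries just read the same numbers from the device):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Threads that can be resident on the whole GPU at once,
    // e.g. 24 SMs * 2048 threads/SM = 49,152 on a Titan X (Maxwell).
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    // Threads created by the <<<512,1024>>> launch: 512 * 1024 = 524,288.
    int launched = 512 * 1024;

    printf("resident capacity: %d, launched: %d\n", resident, launched);
    // launched is roughly ten times resident, so most blocks can only start after
    // earlier blocks finish; that is why an in-kernel grid-wide sync across all
    // 512 blocks hangs.
    return 0;
}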

from the manual:

At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 5, and only the runtime system needs to know the physical multiprocessor count.

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions.

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 1-4, enabling programmers to write code that scales with the number of cores.

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses.

It's because some of the first ~50K threads need to finish before the GPU can start new ones.