__constant__ memory not thread safe in CUDA 4.0

Hi,
We have kernel code that uses __constant__ memory (as a filter). We have now started to use this code with CUDA 4.0 multithreading, i.e. several CPU threads call the same kernel. Is this a problem? Assuming each thread has its own, different filter, will this cause trouble? Is a __constant__ variable (defined globally in the .cu file) shared between host threads? If so, how can one avoid the problem? And will we have the same issue with other globally defined CUDA-specific types, such as textures?

In CUDA 4.0 multithreading, textures, __constant__ variables, and __device__ variables are per-device and shared across all host threads. The CUDA API's multithreaded behavior guarantees that the result of a multithreaded sequence of API calls is that of some serial interleaving of those calls, respecting any host-side synchronization done by the user (although independent calls may execute concurrently). A copy to a __device__ or __constant__ symbol takes effect at a time determined by the stream specified in the copy call (and that determination in turn depends on the host ordering of calls). An update to a texture binding affects any subsequently launched kernels (in any thread and in any stream), but none that have already been launched, even if their execution has not yet begun.
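For concreteness, here is a minimal sketch of the setup the question describes (all names hypothetical): a filter held in a file-scope __constant__ symbol. On each device there is exactly one instance of d_filter, and every host thread that uses that device reads and writes that same instance.

   #include <cuda_runtime.h>

   #define FILTER_SIZE 9

   // One instance of this symbol per device, visible to ALL host threads.
   __constant__ float d_filter[FILTER_SIZE];

   // Hypothetical convolution kernel reading the shared filter.
   __global__ void applyFilter(const float *in, float *out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i >= n) return;
       float acc = 0.0f;
       for (int k = 0; k < FILTER_SIZE; ++k) {
           int j = i + k - FILTER_SIZE / 2;
           if (j >= 0 && j < n)
               acc += d_filter[k] * in[j];
       }
       out[i] = acc;
   }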

Consider the example of a device symbol s that is used by a kernel K. If you write a program that does the following (all in the same stream):

   Thread A                   | Thread B
   ---------------------------|---------------------------
   cudaSetDevice(0);          | cudaSetDevice(0);
   cudaMemcpyToSymbol(s,...); | cudaMemcpyToSymbol(s,...);
   K<<<.>>>();                | K<<<.>>>();

then there is a race: thread B's cudaMemcpyToSymbol can be enqueued between thread A's copy and A's launch of K (or vice versa), so either kernel may execute with the other thread's data.
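In code form, the racy pattern might look like the following sketch (POSIX threads; s, K, and the filter values are hypothetical stand-ins for the names above). Both threads enqueue into the same default stream, so one thread's copy can land between the other thread's copy and its launch of K:

   #include <cuda_runtime.h>
   #include <pthread.h>

   __constant__ float s[4];                        // shared per-device symbol
   __device__ float result[4];

   __global__ void K() { result[threadIdx.x] = s[threadIdx.x]; }

   void *worker(void *arg)                         // run by both threads
   {
       const float *filter = (const float *)arg;   // this thread's filter
       cudaSetDevice(0);
       cudaMemcpyToSymbol(s, filter, 4 * sizeof(float));  // races with the
       K<<<1, 4>>>();                                     // other thread
       cudaDeviceSynchronize();
       return NULL;
   }

   int main()
   {
       float a[4] = {1, 1, 1, 1}, b[4] = {2, 2, 2, 2};
       pthread_t ta, tb;
       pthread_create(&ta, NULL, worker, a);
       pthread_create(&tb, NULL, worker, b);
       pthread_join(ta, NULL);
       pthread_join(tb, NULL);
       return 0;
   }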

To get around this there are a couple of options.

You can use host-thread synchronization combined with stream synchronization to enforce a particular ordering, e.g. (a code sketch follows the table):

   Thread A                   | Thread B
   ---------------------------|---------------------------
   cudaSetDevice(0);          | cudaSetDevice(0);
   cudaMemcpyToSymbol(s,...); |
   K<<<.>>>();                |
   pthread_barrier();         | pthread_barrier();
                              | cudaMemcpyToSymbol(s,...);
                              | K<<<.>>>();
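In code, that ordering might be sketched like this (reusing the hypothetical s, K, and filters from the sketch above). Since both threads use the same stream, the barrier only needs to guarantee that thread A's calls are enqueued first; the stream then serializes the two copy/launch pairs on the device:

   static const float filterA[4] = {1, 1, 1, 1};
   static const float filterB[4] = {2, 2, 2, 2};

   pthread_barrier_t barrier;     // pthread_barrier_init(&barrier, NULL, 2)
                                  // before the threads are started

   void *thread_a(void *arg)
   {
       cudaSetDevice(0);
       cudaMemcpyToSymbol(s, filterA, sizeof(filterA));
       K<<<1, 4>>>();                       // enqueued before the barrier
       pthread_barrier_wait(&barrier);
       return NULL;
   }

   void *thread_b(void *arg)
   {
       cudaSetDevice(0);
       pthread_barrier_wait(&barrier);      // wait until A has enqueued
       cudaMemcpyToSymbol(s, filterB, sizeof(filterB));
       K<<<1, 4>>>();
       return NULL;
   }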

Alternatively, if you want multiple concurrent kernel executions to each work on its own filter, you are best served by giving each thread a genuinely separate copy of the data rather than sharing one symbol (just as one would do when writing a C-plus-threads host-only program); see the sketch below.
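For instance (again a hypothetical sketch, with error checking omitted), each thread can keep its coefficients in its own cudaMalloc'd buffer and pass the pointer as a kernel argument; concurrent launches then cannot clobber each other's data:

   __global__ void K2(const float *filter, float *out)
   {
       out[threadIdx.x] = filter[threadIdx.x];      // reads this launch's copy
   }

   void *worker2(void *arg)
   {
       const float *hostFilter = (const float *)arg;  // per-thread filter
       float *dFilter, *dOut;
       cudaSetDevice(0);
       cudaMalloc((void **)&dFilter, 4 * sizeof(float));
       cudaMalloc((void **)&dOut, 4 * sizeof(float));
       cudaMemcpy(dFilter, hostFilter, 4 * sizeof(float),
                  cudaMemcpyHostToDevice);
       K2<<<1, 4>>>(dFilter, dOut);                 // no shared symbol involved
       cudaDeviceSynchronize();
       cudaFree(dFilter);
       cudaFree(dOut);
       return NULL;
   }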

If you have a project that depends on the CUDA 3.2 behavior of "one context per thread," you can restore that behavior through runtime-driver interoperability, explicitly creating a context for each thread (this is supported, but it is not recommended, for both performance and complexity reasons). A sketch follows.
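A rough sketch of that interop path (driver API, error checking omitted; names hypothetical): each thread creates its own CUcontext and keeps it current, so the runtime calls that follow in that thread, including copies to __constant__ symbols, operate on that thread's private context and its own instances of the module's symbols:

   #include <cuda.h>                 // driver API

   void *worker3(void *arg)
   {
       CUdevice dev;
       CUcontext ctx;
       cuInit(0);                    // safe to call from every thread
       cuDeviceGet(&dev, 0);
       cuCtxCreate(&ctx, 0, dev);    // private context, current to this thread
       // ... runtime API calls here (cudaMemcpyToSymbol, kernel launches)
       //     now use this thread's context ...
       cuCtxDestroy(ctx);
       return NULL;
   }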

Thanks a lot for the detailed reply!