__constant__ memory not thread safe in CUDA 4.0

We have kernel code that uses constant memory (as a filter). We have now started to use this code with CUDA 4.0 and multiple host threads, i.e., several CPU threads call the same kernel. Is this a problem? Assuming that each thread uses its own, different filter, will this cause a conflict? Is constant memory (defined globally in the .cu file) shared between host threads? If so, how can one avoid the problem? And will we have the same problem with other globally defined CUDA-specific types, such as textures?

In CUDA 4.0 multi-threading, textures, constant variables (and device variables) are per-device and shared across all host threads. The CUDA API’s multi-threaded behavior promises that the result of a multi-threaded sequence of API calls will be that of some serial interleaving of the calls, respecting any host-side synchronization done by the user (although independent calls may be executed concurrently). A copy to a device or constant symbol will be effected at a time determined by the stream specified to the copy call (this determination in turn depends on the host ordering of calls). An update to a texture binding will affect any subsequent kernels launched (in any thread and in any streams), but none that have already been launched (even if their execution has not begun).
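
As a concrete sketch of why this matters (the symbol and kernel names here are illustrative, not from the question): a `__constant__` symbol lives in the device's constant bank, so every host thread targeting the same device reads, and overwrites, the same single copy.

```cuda
#include <cuda_runtime.h>

// One filter symbol per device -- shared by ALL host threads that
// target this device, not one copy per thread.
__constant__ float d_filter[16];

__global__ void applyFilter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * d_filter[i % 16];
}

// Host side (callable from any thread -- each call updates the same symbol):
//   cudaMemcpyToSymbol(d_filter, h_filter, sizeof(d_filter));
//   applyFilter<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```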

Consider the example of a device symbol s which is used by kernel K. If you write a program which does the following (all in the same stream):

Thread A                      | Thread B
------------------------------+------------------------------
   cudaSetDevice(0);          | cudaSetDevice(0);
   cudaMemcpyToSymbol(s,...); | cudaMemcpyToSymbol(s,...);
   K<<<.>>>();                | K<<<.>>>();

then there exists a race: either thread's copy may land before either kernel, so a kernel may run with the other thread's data.

To get around this, there are a couple of options.

You can use host-thread synchronization combined with stream synchronization to enforce a particular ordering, e.g.:

Thread A                      | Thread B
------------------------------+------------------------------
   cudaSetDevice(0);          | cudaSetDevice(0);
   cudaMemcpyToSymbol(s,...); |
   K<<<.>>>();                |
   pthread_barrier();         | pthread_barrier();
                              | cudaMemcpyToSymbol(s,...);
                              | K<<<.>>>();
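
The ordering above can be sketched as follows, assuming a `pthread_barrier_t` initialized for two threads elsewhere (all names are illustrative):

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

__constant__ float s[16];     // the device symbol from the example
__global__ void K(void);      // kernel that reads s

pthread_barrier_t barrier;    // assumed: pthread_barrier_init(&barrier, NULL, 2)

void *threadA(void *filter)
{
    cudaSetDevice(0);
    cudaMemcpyToSymbol(s, filter, sizeof(s));
    K<<<1, 256>>>();
    // Both threads enqueue into the same (default) stream, so the device
    // executes work in enqueue order; the barrier guarantees that thread B
    // enqueues its copy and launch after ours.
    pthread_barrier_wait(&barrier);
    return NULL;
}

void *threadB(void *filter)
{
    cudaSetDevice(0);
    pthread_barrier_wait(&barrier); // wait until thread A has enqueued
    cudaMemcpyToSymbol(s, filter, sizeof(s));
    K<<<1, 256>>>();
    return NULL;
}
```

Because B's copy is enqueued into the same stream after A's kernel, the device will not overwrite s until A's kernel has finished.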

Alternatively, if you want multiple concurrent kernel executions to use separate filters, then you will be best served by creating separate copies of the data (just as one would when writing a C-plus-threads host-only program).
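
One way to do that (a sketch under the assumption that your filter fits in a small buffer; names are illustrative) is to drop the single `__constant__` symbol and instead pass each host thread its own device buffer as a kernel argument:

```cuda
#include <cuda_runtime.h>

// The filter arrives as an ordinary pointer argument, so each host
// thread can supply its own private copy.
__global__ void applyFilter(const float *filter, const float *in,
                            float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * filter[i % 16];
}

// Per-thread host routine: allocates and fills a private device copy
// of the filter, then launches with it. No symbol is shared.
void runWithPrivateFilter(const float *h_filter, const float *d_in,
                          float *d_out, int n)
{
    float *d_filter;
    cudaMalloc(&d_filter, 16 * sizeof(float));
    cudaMemcpy(d_filter, h_filter, 16 * sizeof(float),
               cudaMemcpyHostToDevice);
    applyFilter<<<(n + 255) / 256, 256>>>(d_filter, d_in, d_out, n);
    cudaFree(d_filter);  // cudaFree waits for pending work on the buffer
}
```

You give up the constant-cache broadcast of `__constant__` memory, but each thread's data is fully independent.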

If you have a project which depended on the CUDA 3.2 behavior of “one context per thread,” you can go back to this behavior by using runtime-driver interoperability and explicitly creating contexts for each thread (this is supported, but is not recommended for performance and complexity reasons).
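
A sketch of that interoperability approach (error checking omitted; names are illustrative): each thread creates its own driver-API context, and because module globals such as `__constant__` symbols are loaded per context, each thread then gets its own copy.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

// Per-thread entry point: create a private context before touching
// the runtime, so this thread's runtime calls use that context.
void *threadMain(void *arg)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);  // context private to this thread

    // Runtime calls made here (cudaMemcpyToSymbol, kernel launches)
    // operate on this thread's own copy of module globals.

    cuCtxDestroy(ctx);
    return NULL;
}
```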

Thanks a lot for the detailed reply!