Writes to global memory are not visible

I have a simple kernel defined as follows:

global void
ker_memset(float* x, uint size, float value)
uint idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size)
x[idx] = value;

In my init() function that sets up the parameters, I call the kernel and once init() is finished it returns to main where I print the array x by copying it from device to host memory and writing to stdout, using print_dvector() function.

void print_dvector(float* da, uint n)
cublasGetVector(n, sizeof(float), da, 1, htemp_nn, 1);
print_vector(htemp_nn, n);

Now a weird thing happens: When I call the print_dvector() function from init() after the kernel call the print_dvector() call in main prints the correct (updated) value of the array x stored in the device global memory.
But if I omit the print_dvector() call from init() then the output of print_dvector() from main does not seem to reflect the changes that should’ve been effected by the kernel ker_memset.

I’ve tried inserting cudaThreadSynchronize() calls in different places but it doesn’t seem to help at all!

Could someone enlighten me as to what exactly is happening here? It seems I’ve not grasped the basics of kernel invocation. How do I block the host code till the all device kernel threads are completed?


Memory copies (like the one I assume cublasGetVector() is using) are supposed to wait until the previous kernel is complete before running. Are you sure that your ker_memset() kernel actually ran? Do you have any error checking to verify that it didn’t immediately abort?

Yes. I have error checking in place that seems to indicate that the kernel ran just fine. However the print routine still prints the values from global memory before the kernel call even executes.

Here’s what I think is happening:

  1. At kernel execution, the device is now scheduled to run the kernel

  2. The host however, immediately continues onto the next statement (print_dvector) which fires a cudaMemcpy call to the device

  3. For some reason, the Memcpy happens before the kernel and I get the values of the vector prior to the update (kernel execution)

I have tried inserting cudaThreadSynchronize() in almost all the usual places (after the kernel call) but it never seems to help.

I have pasted my code here: http://pastebin.com/zu3KbyAu

I compile and run like this:

$ nvcc test.cu -o test

$ ./test

And I get the output with all 512 elements initialized to zero instead of the value 2.0 as I’d wanted.

Can’t get myself to believe that I’m stuck on something this trivial!! Please help!!

The cudaMemcpy direction and pointer order in your print_dvector and print_dmatrix are reversed. You want this (in the dvector case):

cudaMemcpy(htemp_nn, da, n * sizeof(float), cudaMemcpyDeviceToHost);

Destination pointer first, and you are copying from device to host. The original version copied the contents of htemp_nn from the host to the device, also overwriting the device array you had just filled.

Although this code doesn’t run the 2D case, I noticed that your 2D block size is 32x32. The maximum number of threads on pre-Fermi cards is 512, so I just wanted to warn you in case you are using something other than the GTX 400 series or the Tesla C2050.

Thanks a lot seibert!

I have a Quadro 5800. But I still don’t get what you’re trying to say regarding the kernel configuration in the 2D case. Could you “dumb” it down a little for a newbie? :)

There are limits on blockDim. For your device, the limits are:

  • x and y dimensions must be less than or equal to 512

  • z dimension must be less than or equal to 64

  • The product of the dimensions (x * y * z) must be less than or equal to 512.

Your 2D block configuration in this code (which you don’t use but I wanted to warn anyway) is x = 32, y = 32. This passes the first two rules, but 32*32 = 1024, so it breaks the 3rd rule. You have to use a smaller 2D block.