Non-zero matrix in kernel becomes zero matrix outside of kernel

I am populating a device matrix in a kernel. In emulation mode, the last thread of the last block prints the contents of this matrix to a file, and there are a lot of non-zero elements, which I expect. But once outside the kernel, before any more processing is done in host code, I print out the contents of that device matrix and every element is zero. Before I print out the contents in host code I make a call to cudaGetLastError(). So the sequence is

kernel<<<>>> //produces non-zero elements
print out matrix elements //now all zero elements

The matrix is defined as
__device__ int Matrix[NUMBLOCKS][NUMBLOCKS];

Does anyone have any suggestions?

At least with __device__ variables on the GPU, the only way to access them from host code is with cudaMemcpyToSymbol/cudaMemcpyFromSymbol, or by using cudaGetSymbolAddress and then cudaMemcpy. I don’t know whether the same is true in emulation.
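As a minimal sketch of the first route (cudaMemcpyFromSymbol), assuming a small NUMBLOCKS and a hypothetical kernel named fillMatrix that stands in for the real one:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUMBLOCKS 4  // assumed size, for illustration only

__device__ int Matrix[NUMBLOCKS][NUMBLOCKS];

// Hypothetical stand-in kernel: one block per row, one thread per column.
__global__ void fillMatrix()
{
    Matrix[blockIdx.x][threadIdx.x] = blockIdx.x * NUMBLOCKS + threadIdx.x + 1;
}

int main()
{
    int hostMatrix[NUMBLOCKS][NUMBLOCKS] = {};

    fillMatrix<<<NUMBLOCKS, NUMBLOCKS>>>();
    cudaError_t err = cudaGetLastError();        // reports launch errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();           // reports execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Pass the symbol itself (not &Matrix); the runtime resolves the
    // device address of the __device__ variable.
    err = cudaMemcpyFromSymbol(hostMatrix, Matrix, sizeof(hostMatrix));
    if (err != cudaSuccess) {
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int r = 0; r < NUMBLOCKS; ++r) {
        for (int c = 0; c < NUMBLOCKS; ++c)
            printf("%d ", hostMatrix[r][c]);
        printf("\n");
    }
    return 0;
}
```

The host only ever touches hostMatrix here; Matrix itself is read through the runtime call.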

What I am trying to do is to populate an array in shared memory, shMatrix, and then write those elements to the device Matrix in a coalesced fashion. This appears to work fine in the kernel because, as I said, once all the processing in the kernel is done, the very last thread of the very last block prints out the contents of the device Matrix and the elements are as I would expect. But once outside the kernel the elements of the device Matrix are zero.
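For reference, the pattern described here could look roughly like the sketch below. The kernel name populate, the size of NUMBLOCKS, the one-row-per-block layout and the placeholder values are all assumptions, not the actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUMBLOCKS 4  // assumed size, for illustration only

__device__ int Matrix[NUMBLOCKS][NUMBLOCKS];

// Hypothetical kernel: each block stages one row in shared memory,
// then writes it out to the __device__ Matrix.
__global__ void populate()
{
    __shared__ int shMatrix[NUMBLOCKS];

    // Placeholder computation: each thread fills one shared element.
    shMatrix[threadIdx.x] = blockIdx.x * 100 + threadIdx.x;

    // Required because each thread reads an element written by a
    // different thread below.
    __syncthreads();

    // Consecutive threads write consecutive ints of the same row,
    // which is the access pattern that allows coalescing.
    Matrix[blockIdx.x][threadIdx.x] = shMatrix[(threadIdx.x + 1) % NUMBLOCKS];
}

int main()
{
    int hostMatrix[NUMBLOCKS][NUMBLOCKS] = {};

    populate<<<NUMBLOCKS, NUMBLOCKS>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(hostMatrix, Matrix, sizeof(hostMatrix));
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int r = 0; r < NUMBLOCKS; ++r) {
        for (int c = 0; c < NUMBLOCKS; ++c)
            printf("%d ", hostMatrix[r][c]);
        printf("\n");
    }
    return 0;
}
```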

During execution of the kernel, is there a second copy of the device Matrix which the kernel uses, and am I supposed to copy the kernel version into the host version? I have never had to do that before, and I have been using this method of coalesced writes to populate the device Matrix for a month now.

I was assuming that your reading of the device matrix “just after the kernel completes” was performed in host code. The most common mistake with __device__ variables is to read them on the host with cudaMemcpy(host_ptr, &deviceVar, …). That cannot work, since the host cannot take the address of a variable that exists only on the device (the compiler silently allows this, however…).
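A sketch of the alternative that does work with plain cudaMemcpy: ask the runtime for the device address first with cudaGetSymbolAddress. NUMBLOCKS and the kernel fillMatrix are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUMBLOCKS 4  // assumed size, for illustration only

__device__ int Matrix[NUMBLOCKS][NUMBLOCKS];

// Hypothetical stand-in kernel that writes a non-zero value everywhere.
__global__ void fillMatrix()
{
    Matrix[blockIdx.x][threadIdx.x] = 1 + blockIdx.x + threadIdx.x;
}

int main()
{
    int hostMatrix[NUMBLOCKS][NUMBLOCKS] = {};

    fillMatrix<<<NUMBLOCKS, NUMBLOCKS>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Mistaken version, left commented out: &Matrix evaluated in host code
    // is not the device address, so this call may fail or may not read the
    // device data at all.
    // cudaMemcpy(hostMatrix, &Matrix, sizeof(hostMatrix), cudaMemcpyDeviceToHost);

    // Working version: obtain the real device address of the symbol.
    void *devPtr = nullptr;
    err = cudaGetSymbolAddress(&devPtr, Matrix);
    if (err == cudaSuccess)
        err = cudaMemcpy(hostMatrix, devPtr, sizeof(hostMatrix),
                         cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int r = 0; r < NUMBLOCKS; ++r) {
        for (int c = 0; c < NUMBLOCKS; ++c)
            printf("%d ", hostMatrix[r][c]);
        printf("\n");
    }
    return 0;
}
```

Checking the return value of every runtime call, as above, is what would surface the mistaken copy if it fails rather than silently returning zeros.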

There shouldn’t be any reason for your device memory to be zeroed after a kernel launch. Without a minimal and compilable code example that demonstrates this behavior, I cannot even guess at the reason.