Problem in getting __device__ dynamic array data back to host

Dear developers, I would like to ask for some help with using GPU device memory. I need to dynamically allocate an array on the device which is updated by different calls to a __global__ function within the application run. I am not able to get the results back on the host, so I am probably making some mistake (and probably I did not understand something; I am a chemist, not a programmer). Could you help me, please? Here follows the minimal sandbox code that is giving me problems.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas.h>

__device__ double *test;

__global__ void test_init()
{
   test[0] = sqrt(2.0);
   test[1] = sqrt(3.0);
}

int main( int argc, char** argv)
{
   double *tt = (double *)malloc(2 * sizeof(double));
   cudaMalloc((void**)&test, 2 * sizeof(double));
   test_init<<<1,1>>>();
   cudaMemcpyFromSymbol((void**)&tt, "test", 2 * sizeof(double), 0, cudaMemcpyDeviceToHost);
   printf("\n\n%lf, %lf\n\n", tt[0], tt[1]);
   return 0;
}

This code (run on Tesla M2050, compute capability 2.0, nvcc v4.2) gives as output:

0.000000, 0.000000

while it should print the square roots of 2.0 and 3.0. What am I doing wrong? Any help will be greatly appreciated.

Best regards,

Mirco

Try:

...
cudaMemcpyFromSymbol((void**)&tt, "test", sizeof(tt), 0, cudaMemcpyDeviceToHost);
...

in line 21.

MK

Thank you for your prompt reply. I tried as you suggested, but it is still giving me the same output, i.e.

0.000000, 0.000000

It seems like there is no transfer of data. I also tried to initialize the tt elements to 1.0 and, as a result, the program outputs

1.000000, 1.000000

So tt is not changed at all by the cudaMemcpyFromSymbol call. If I do the same with a scalar double it works, but unfortunately I need an array, possibly dynamically allocated.

Any other clues / suggestions?

Thank you.

Mirco

cudaMalloc((void**)&test, 2 * sizeof(double));
This is probably wrong.

You need to get a pointer and copy it to test; btw, better to just use kernel parameters.

double* t;
cudaMalloc(&t, 1000);
test_init<<<1,1>>>(t);
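Spelled out, the parameter-based version might look like this (an untested sketch; kernel and variable names are taken from your post, error checking omitted):

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime_api.h>

// the kernel receives the device pointer as a parameter,
// so no __device__ symbol and no cudaMemcpyFromSymbol are needed
__global__ void test_init(double *t)
{
   t[0] = sqrt(2.0);
   t[1] = sqrt(3.0);
}

int main(void)
{
   double tt[2];
   double *t;                                   // host-side handle to device memory
   cudaMalloc((void**)&t, 2 * sizeof(double));
   test_init<<<1,1>>>(t);
   // a plain cudaMemcpy through the host-side pointer brings the data back
   cudaMemcpy(tt, t, 2 * sizeof(double), cudaMemcpyDeviceToHost);
   printf("%lf, %lf\n", tt[0], tt[1]);
   cudaFree(t);
   return 0;
}
```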

How about dropping the quotes (“”) from the symbol name?

I tried to drop the quotes, also trying the combination of your suggestion with the one from cmaster.matso, but I keep obtaining the same behavior.

I am sorry, I think I am not getting your suggestion. If I understand correctly, you are suggesting to remove the “__device__ double *test;” declaration and declare the variable inside main. If I do as you suggest, things work. However, this does not solve my issue. I shall post a more complete code so that you can better understand what I need to do (and probably this could help you find what I am doing wrong, even from a conceptual point of view):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas.h>

__device__ double *test;

__global__ void test_init()
{
    test[0] = sqrt(2.0);
    test[1] = sqrt(3.0);
}

__global__ void test_update()
{
    test[0] += 1.0;
    test[1] += 2.0;
}

int main( int argc, char** argv)
{
    double *tt = (double *)malloc(2*sizeof(double));

   // Init the array on device
   cudaMalloc((void**)&test, 2*sizeof(double));
   test_init<<<1,1>>>();

   // First update with respect to initial values
   test_update<<<1,1>>>();

   // Some CPU calculations
   [...]

   // New update of device variable with respect
   // to last update
   test_update<<<1,1>>>();
 
   // Other calculations
   [...]

   // Gather the last state of the variable
   cudaMemcpyFromSymbol((void**)&tt, test, 2*sizeof(double), 0, cudaMemcpyDeviceToHost);
   printf("%lf, %lf",tt[0],tt[1]);
   return 0;
}

I hope this is clear.

Mirco

Didn’t you forget to synchronize after the kernel call?

No problem, just call

test_init<<<1,1>>>(t);
test_update<<<1,1>>>(t);

Memory is preserved between kernel calls.

But one needs to synchronize before accessing it from the host.

No, the calls are synchronized of course: memcpy, kernel, kernel, memcpy.

So, if I understand, I should avoid the external “__device__ double *test;” declaration and work directly from main() using a pointer such as the one that you call “t”. I am wondering whether this still works if different threads in different blocks access “t”. Let me explain myself better.

I have to parallelize this code (x is 2D array of size M*N, xt is a 1D array of size N):

[...]
for (i = 0; i < N; ++i)
{
   for (j = 0; j < M; ++j)
   {
       xt[i] += x[j*N+i];
   }
}
[...]

this is like left-multiplying x by a vector of size 1*M filled with ones. This vector-matrix multiplication is found in the examples and I have no problems implementing it.

However, my data is stored in a file in a number of blocks (nb) such that M*nb may be too large to be loaded in memory. Thus, my idea was to read blocks of data of size M and call a kernel parallelizing the operation shown in the latter piece of code. At every call of the kernel, the xt vector elements must be updated with respect to their previous values. I would like to avoid moving the partial xt array back and forth between host and GPU. This is why I was thinking about using an array stored in the device global memory. Is this conceptually wrong?

Mirco

“__device__ double *test;” stores some pointer. Where is this pointer changed? The host changes it. So the host could just pass this pointer to the kernels after the memory is allocated and the data is copied there.
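In code, the symbol-based scheme would be roughly this (an untested sketch; error checking omitted): the host allocates through an ordinary pointer, copies that pointer value into the __device__ symbol with cudaMemcpyToSymbol, and copies the data itself back through the ordinary pointer.

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime_api.h>

__device__ double *test;   // device-side copy of the pointer

__global__ void test_init()
{
   test[0] = sqrt(2.0);
   test[1] = sqrt(3.0);
}

int main(void)
{
   double tt[2];
   double *t;                                   // host-side copy of the same pointer
   cudaMalloc((void**)&t, 2 * sizeof(double));
   // copy the pointer *value* into the __device__ symbol
   cudaMemcpyToSymbol(test, &t, sizeof(t), 0, cudaMemcpyHostToDevice);
   test_init<<<1,1>>>();
   // the data itself comes back through the host-side pointer
   cudaMemcpy(tt, t, 2 * sizeof(double), cudaMemcpyDeviceToHost);
   printf("%lf, %lf\n", tt[0], tt[1]);
   cudaFree(t);
   return 0;
}
```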

Dear all, I solved my specific problem (I made a mistake in planning the implementation). However, the question about dynamically allocating an array in device global memory and getting it back to the host remains open. I think this would be interesting to solve even for other people who might want to use it. What do you think?

Thank you for the clarification. I will do some tests tomorrow.