Problem in getting __device__ dynamic array data back to host

Dear developers, I would like to ask for some help with using GPU device memory. I need to dynamically allocate an array on the device which is updated by different calls to a __global__ function within the application run. I am not able to get the results back on the host, so I am probably making some mistake (and probably I did not understand something; I am a chemist, not a programmer). Could you help me, please? Here follows the minimal sandbox code that is giving me problems.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas.h>

__device__ double *test;

__global__ void test_init()
{
   test[0] = sqrt(2.0);
   test[1] = sqrt(3.0);
}

int main( int argc, char** argv)
{
   double *tt = (double *)malloc(2 * sizeof(double));
   cudaMalloc((void**)&test, 2 * sizeof(double));
   test_init<<<1,1>>>();
   cudaMemcpyFromSymbol((void**)&tt, "test", 2 * sizeof(double), 0, cudaMemcpyDeviceToHost);
   printf("\n\n%lf, %lf\n\n", tt[0], tt[1]);
   return 0;
}

This code (run on Tesla M2050, compute capability 2.0, nvcc v4.2) gives as output:

0.000000, 0.000000

while it should print the square roots of 2.0 and 3.0. What am I doing wrong? Any help will be greatly appreciated.

Best regards,

Mirco

Try:

...
cudaMemcpyFromSymbol((void**)&tt, "test", sizeof(tt), 0, cudaMemcpyDeviceToHost);
...

in line 21.

MK

Thank you for your prompt reply. I tried as you suggested, but it is still giving me the same output, i.e.

0.000000, 0.000000

It seems like there is no transfer of data. I also tried to initialize the tt elements to 1.0 and, as a result, the program outputs

1.000000, 1.000000

So tt is not changed at all by the cudaMemcpyFromSymbol call. If I do the same with a scalar double it works, but unfortunately I need an array, possibly dynamically allocated.

Any other clues / suggestions?

Thank you.

Mirco

cudaMalloc((void**)&test, 2 * sizeof(double));
This is probably wrong.

You need to get a pointer and copy it to test; btw, better to just use kernel parameters.

double* t;
cudaMalloc(&t, 1000);
test_init<<<1,1>>>(t);
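Spelled out, the parameter-based version might look like this (an untested sketch; kernel and variable names are taken from your post, error checking omitted):

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime_api.h>

// the kernel receives the device pointer as a parameter,
// so no __device__ symbol and no cudaMemcpyFromSymbol are needed
__global__ void test_init(double *t)
{
   t[0] = sqrt(2.0);
   t[1] = sqrt(3.0);
}

int main(void)
{
   double tt[2];
   double *t;                                   // host-side handle to device memory
   cudaMalloc((void**)&t, 2 * sizeof(double));
   test_init<<<1,1>>>(t);
   // a plain cudaMemcpy through the host-side pointer brings the data back
   cudaMemcpy(tt, t, 2 * sizeof(double), cudaMemcpyDeviceToHost);
   printf("%lf, %lf\n", tt[0], tt[1]);
   cudaFree(t);
   return 0;
}
```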

How about dropping the quotes (“”) from the symbol name?

I tried to drop the quotes, also trying the combination of your suggestion with the one from cmaster.matso, but I keep obtaining the same behavior.

I am sorry, I think I am not getting your suggestion. If I understand correctly, you are suggesting to remove the “__device__ double *test;” declaration and declare the variable inside main. If I do as you suggest, things work. However, this does not solve my issue. I shall post a more complete code so that you can better understand what I need to do (and probably this could help you find what I am doing wrong, even from a conceptual point of view):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas.h>

__device__ double *test;

__global__ void test_init()
{
    test[0] = sqrt(2.0);
    test[1] = sqrt(3.0);
}

__global__ void test_update()
{
    test[0] += 1.0;
    test[1] += 2.0;
}

int main( int argc, char** argv)
{
    double *tt = (double *)malloc(2*sizeof(double));

   // Init the array on device
   cudaMalloc((void**)&test, 2*sizeof(double));
   test_init<<<1,1>>>();

   // First update with respect to initial values
   test_update<<<1,1>>>();

   // Some CPU calculations
   [...]

   // New update of device variable with respect
   // to last update
   test_update<<<1,1>>>();
 
   // Other calculations
   [...]

   // Gather the last state of the variable
   cudaMemcpyFromSymbol((void**)&tt, test, 2*sizeof(double), 0, cudaMemcpyDeviceToHost);
   printf("%lf, %lf",tt[0],tt[1]);
   return 0;
}

I hope this is clear.

Mirco

Didn’t you forget to synchronize after the kernel call?

No problem, just call

test_init<<<1,1>>>(t);
test_update<<<1,1>>>(t);

Memory is preserved between kernel calls.

But one needs to synchronize before accessing it from the host.

No, the calls are synchronized of course: memcpy, kernel, kernel, memcpy.

So, if I understand, I should avoid the external “__device__ double *test;” declaration and work directly from main() using a pointer such as the one that you call “t”. I am wondering whether this still works if different threads in different blocks access “t”. Let me explain myself better.

I have to parallelize this code (x is 2D array of size M*N, xt is a 1D array of size N):

[...]
for (i = 0; i < N; ++i)
{
   for (j = 0; j < M; ++j)
   {
       xt[i] += x[j*N+i];
   }
}
[...]

this is like left-multiplying x by a vector of size 1*M filled with ones. This vector-matrix multiplication is found in the examples and I have no problems implementing it.

However, my data is stored in a file in a number of blocks (nb) such that M*nb may be too large to be loaded in memory. Thus, my idea was to read blocks of data of size M and call a kernel parallelizing the operation shown in the latter piece of code. At every call of the kernel, the xt vector elements must be updated with respect to their previous values. I would like to avoid moving the partial xt array back and forth between host and GPU. This is why I was thinking about using an array stored in the device global memory. Is this conceptually wrong?

Mirco

“__device__ double *test;” stores some pointer. Where is this pointer changed? The host changes it. So the host could just pass this pointer to the kernels after the memory is allocated and the data is copied there.
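In code, the symbol-based scheme would be roughly this (an untested sketch; error checking omitted): the host allocates through an ordinary pointer, copies that pointer value into the __device__ symbol with cudaMemcpyToSymbol, and copies the data itself back through the ordinary pointer.

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime_api.h>

__device__ double *test;   // device-side copy of the pointer

__global__ void test_init()
{
   test[0] = sqrt(2.0);
   test[1] = sqrt(3.0);
}

int main(void)
{
   double tt[2];
   double *t;                                   // host-side copy of the same pointer
   cudaMalloc((void**)&t, 2 * sizeof(double));
   // copy the pointer *value* into the __device__ symbol
   cudaMemcpyToSymbol(test, &t, sizeof(t), 0, cudaMemcpyHostToDevice);
   test_init<<<1,1>>>();
   // the data itself comes back through the host-side pointer
   cudaMemcpy(tt, t, 2 * sizeof(double), cudaMemcpyDeviceToHost);
   printf("%lf, %lf\n", tt[0], tt[1]);
   cudaFree(t);
   return 0;
}
```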

Dear all, I solved my specific problem (I made a mistake in planning the implementation). However, the question about dynamically allocating an array in device global memory and getting it back to the host remains open. I think this would be interesting to solve even for other people who might want to use it. What do you think?

Thank you for the clarification. I will do some tests tomorrow.