Hi,
I’ve been trying to dynamically allocate a class array on the CPU, allocate the corresponding memory on the GPU, and copy the array over.
I then pass a pointer to the GPU array as the kernel’s argument and try to call the class’s methods, but every method call fails (no errors are reported during compilation).
Here is an example of the code:
#include <iostream>
using namespace std;
#define THREADS_NUMBER 2
#define BLOCKS_NUMBER 1
class MyClass{
  private:
    double member;
  public:
    __device__ __host__ MyClass() { member = 0; }
    __device__ __host__ void add(double value) { member += value; }
    __device__ __host__ double get_member() { return member; }
};

__global__ void kernel(MyClass** class_array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    (*class_array)[tid].add(5.);
    printf("%lf \n", (*class_array)[tid].get_member());
}

int main(){
    MyClass* classarray = new MyClass[BLOCKS_NUMBER*THREADS_NUMBER]();
    MyClass* dev_classarray;
    size_t size = BLOCKS_NUMBER*THREADS_NUMBER*sizeof(MyClass);
    cudaMalloc(&dev_classarray, size);
    cudaMemcpy(dev_classarray, classarray, size, cudaMemcpyHostToDevice);
    kernel<<<BLOCKS_NUMBER,THREADS_NUMBER>>>(&dev_classarray);
    cudaMemcpy(classarray, dev_classarray, size, cudaMemcpyDeviceToHost);
    for (int s = 0; s < BLOCKS_NUMBER*THREADS_NUMBER; s++){
        cout << classarray[s].get_member() << endl;
    }
    return 0;
}
The same code works just fine with an identical host function in place of the kernel (using a for loop over the array instead of the thread grid).
Any ideas as to where the problem is?
Thank you
Any time you are having trouble with CUDA code, it’s good practice to employ proper CUDA error checking, and also to run your code with cuda-memcheck. Even if you don’t understand the error output it generates, it will be useful to others you ask for help.
I suggest doing this before asking others for help. Not sure what proper CUDA error checking is? Google “proper CUDA error checking”, take the first hit, read it, and apply it to your code.
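As a sketch of what that looks like (the macro name here is my own, not part of the CUDA runtime API):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; print the error string and bail on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                           \
        }                                                                  \
    } while (0)

// Usage in this code would look like:
//   CUDA_CHECK(cudaMalloc(&dev_classarray, size));
//   kernel<<<BLOCKS_NUMBER,THREADS_NUMBER>>>(dev_classarray);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches execution errors
```

Kernel launches need the two-step check at the end because a launch returns no status directly: cudaGetLastError reports launch failures, and cudaDeviceSynchronize surfaces faults that occur while the kernel runs.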
There is certainly a problem here:
kernel<<<BLOCKS_NUMBER,THREADS_NUMBER>>>(&dev_classarray);
^
The address of dev_classarray is a location in host memory, and that sort of construct is almost never usable in ordinary CUDA device code. Attempting to dereference that pointer in device code will result in a device code execution fault: the device is not allowed to dereference a pointer into host memory space when that space was allocated with an ordinary host allocation method (excepting Power9 coherency/ATS, which is presumably not at issue here).
There is an additional usage error indicating a lack of comprehension around the use of double pointers and cudaMalloc, but we can leave that aside for now.
It’s not clear why you wish to pass a double-pointer to your kernel. For the functionality you have shown here, single pointer usage should suffice:
$ cat t129.cu
#include <iostream>
#include <stdio.h>
using namespace std;
#define THREADS_NUMBER 2
#define BLOCKS_NUMBER 1
class MyClass{
  private:
    double member;
  public:
    __device__ __host__ MyClass() { member = 0; }
    __device__ __host__ void add(double value) { member += value; }
    __device__ __host__ double get_member() { return member; }
};

__global__ void kernel(MyClass* class_array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    class_array[tid].add(5.);
    printf("%f \n", class_array[tid].get_member());
}

int main(){
    MyClass* classarray = new MyClass[BLOCKS_NUMBER*THREADS_NUMBER]();
    MyClass* dev_classarray;
    size_t size = BLOCKS_NUMBER*THREADS_NUMBER*sizeof(MyClass);
    cudaMalloc(&dev_classarray, size);
    cudaMemcpy(dev_classarray, classarray, size, cudaMemcpyHostToDevice);
    kernel<<<BLOCKS_NUMBER,THREADS_NUMBER>>>(dev_classarray);
    cudaMemcpy(classarray, dev_classarray, size, cudaMemcpyDeviceToHost);
    for (int s = 0; s < BLOCKS_NUMBER*THREADS_NUMBER; s++){
        cout << classarray[s].get_member() << endl;
    }
    return 0;
}
$ nvcc -o t129 t129.cu
$ cuda-memcheck ./t129
========= CUDA-MEMCHECK
5.000000
5.000000
5
5
========= ERROR SUMMARY: 0 errors
$
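As an aside, if a double-pointer kernel parameter really were required, the pointer itself would also have to live in device memory: cudaMalloc storage for a MyClass*, copy the device array’s address into it, and pass that device location to the kernel. A sketch only, reusing the names from the original question:

```cuda
// Sketch: making the original MyClass** kernel signature work.
// The extra indirection buys nothing here; it just illustrates the idea.
MyClass*  dev_classarray;   // device array of objects
MyClass** dev_ptr;          // device storage holding the array pointer

cudaMalloc(&dev_classarray, size);
cudaMalloc(&dev_ptr, sizeof(MyClass*));
cudaMemcpy(dev_classarray, classarray, size, cudaMemcpyHostToDevice);
// Copy the value of the device pointer itself into device memory.
cudaMemcpy(dev_ptr, &dev_classarray, sizeof(MyClass*), cudaMemcpyHostToDevice);

// Now (*class_array)[tid] in the kernel dereferences device memory only.
kernel<<<BLOCKS_NUMBER,THREADS_NUMBER>>>(dev_ptr);
```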