How to initialize a Class on the device

Hi all, I’m struggling with the following problem: I need to initialize a class on the GPU and fill it with data using a kernel, so that the filling can be done in parallel. The only documentation I was able to find explains how to initialize the class and fill it with data on the host, moving it to the device at the end, which is not what I need.
Here is a simplified example of the code:

#include <iostream>

class Dict {
private:
    struct KeyValuePair {
        float key;
        float *value;
        int value_size;
    };

    KeyValuePair *data;

    int max_size;

public:
    Dict(int max_dict_size, int max_array_size) :  max_size(max_dict_size) {
        data = new KeyValuePair[max_size];
        for (int i = 0; i < max_size; i++) {
            data[i].value = new float[max_array_size];
            data[i].value_size = max_array_size; }}

    ~Dict() { for (int i = 0; i < max_size; i++) {
              delete[] data[i].value;}
        delete[] data;
    }

    void add_entry(float key, float value[],  int position) {
        if (position == max_size){return;}

        data[position].key = key;
        int value_size = data[position].value_size;
        for (int i = 0; i < value_size; i++){data[position].value[i] = value[i];} }


    void get_value(float key, float value[]) {
        for (int i = 0; i < max_size; i++) {
            if (data[i].key == key){
                for (int j = 0; j < data[i].value_size; j++) {value[j] = data[i].value[j];}
                }}}
};


float cuda_max(float i, float j){
  if(i>j){return i;} else{return j;}}
float cuda_min(float i, float j){
  if(i>j){return j;} else{return i;}}



int main() {
    Dict my_dict(11, 11);

    int wthdrt_sz = 6;
    int njctn_sz = 5; 

    int cnjtn_sz = wthdrt_sz + njctn_sz;
    
    float fllng_lvl [cnjtn_sz]  = {0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.}; 
    float max_vol               = 10.;
    float min_vol               = 0.;
    float wthdrt    [wthdrt_sz] = {5.,4.,3.,2.,1.,0};
    float njctn     [njctn_sz]  = {0.,1.,2.,3.,4.};
    

    for(int fllng_lvl_idx = 0; fllng_lvl_idx<max_vol+1; fllng_lvl_idx++){
        
    float fllng_lvl_slc = fllng_lvl[fllng_lvl_idx];
    
    float rl [cnjtn_sz];
    for(int i=0; i < cnjtn_sz; i++){ 
        if(i<wthdrt_sz){rl[i] = cuda_max(min_vol , fllng_lvl_slc - wthdrt[i]);}
        if(i>=wthdrt_sz){rl[i] = cuda_min(max_vol , fllng_lvl_slc + njctn[i-wthdrt_sz]);}}
   

   // remove duplicate values from rl, shifting the remaining entries left
   int temp = cnjtn_sz;
   int count_db = 0;

   for(int i=0; i<temp; i++){
       for(int j=i+1; j<temp; j++){
           if(rl[i] == rl[j]){
               for(int k = j; k < temp-1; k++){rl[k] = rl[k+1];}
               count_db++; j--; temp--;}}}
   
  
   // mark the slots freed by duplicate removal with a sentinel value
   for(int i = cnjtn_sz - count_db; i < cnjtn_sz; i++){rl[i] = -1.;}

    my_dict.add_entry(fllng_lvl_slc, rl, fllng_lvl_idx);}

   /* this part displays the result; it is not important */


    float result[cnjtn_sz];
    for(int i = 0; i < cnjtn_sz; i++){
        my_dict.get_value(fllng_lvl[i], result);
        std::cout << fllng_lvl[i] << std::endl;
        for(int j = 0; j < cnjtn_sz; j++){ if(result[j] > -1){std::cout << result[j] << " ";} }
        std::cout << "\n" << std::endl;}
  
    return 0;
}

The previous code performs everything on the host, while I want to construct the Dict class on the device and then fill it with the results of the computation in parallel using a CUDA kernel.
For reference I have consulted the following material:

CUDA : How to allocate memory for data member of a class - Stack Overflow
How to pass a C++ class with array of pointers to CUDA? - Stack Overflow
c - Memory allocation on GPU for dynamic array of structs - Stack Overflow

Additional question: is this a good idea, or are there reasons to always create the class on the host and move it to the device after it has been filled?

Thank you for your support.

Setting up an object of a class by a single thread in CUDA device code, for use on the device only, should be identical to how you would do it in host code. CUDA C++ is conformant enough to C++ to have confidence in that statement. The only change would be the necessity to mark the class methods with the __device__ decorator.
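
For illustration, a minimal sketch of that idea (a toy class with hypothetical names, not your Dict):

#include <cstdio>

// Any class whose methods are marked __device__ can be constructed and
// used from device code just as in host C++.
class Pair {
    float a, b;
public:
    __device__ Pair(float x, float y) : a(x), b(y) {}
    __device__ float sum() const { return a + b; }
};

__global__ void build_and_use(){
    // a single thread constructs and uses the object, exactly as in host code
    if (blockIdx.x == 0 && threadIdx.x == 0){
        Pair p(1.0f, 2.0f);
        printf("sum = %f\n", p.sum());
    }
}

int main(){
    build_and_use<<<1,32>>>();
    cudaDeviceSynchronize();
    return 0;
}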

I’m not aware of any considerations that indicate you should “always create the class on the host and move it to the device after it has been filled”.

As with most other things you do on the GPU, where there is performance-sensitive code we generally want those operations to be distributed across threads. Having an object of a class in global memory that was set up by a single thread does not prevent the data of that object from later being operated on by multiple threads.

Also, if you intend for multiple threads to make changes/updates (e.g. add_entry) you will need to sort that out. CUDA does not automatically sort out multi-threaded access to shared or global data.
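
As an aside, a minimal sketch of one common way to coordinate such updates (hypothetical names, not your code): reserve a distinct slot per thread with an atomic, so no two threads ever write to the same entry:

// Each thread reserves a unique slot with atomicAdd, so concurrent
// "add" operations never write to the same position of the table.
__device__ int next_slot = 0;                  // device-global counter

__global__ void add_many(const float *keys, float *table, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        int slot = atomicAdd(&next_slot, 1);   // unique index per thread
        table[slot] = keys[i];
    }
}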

Hi Robert, thank you for your reply. I’m not in a position to test the code until Monday; I just want to ask for a couple of clarifications.

  1. By adding the __device__ marker to the constructor of the class, I can execute it inside a kernel (which should run with a single thread in order to create a single instance of the class). My point is: do I need to allocate the memory for the class on the GPU before doing this?
  2. Once the class instance has been created, I can fill it in parallel with a kernel, using the methods, which are also decorated with the __device__ marker. In this way I can eliminate the for loop below:
 for(int fllng_lvl_idx = 0; fllng_lvl_idx<max_vol+1; fllng_lvl_idx++){.....}.

Is it correct?
Thank you for your attention.

It can be done either way (just like, perhaps, you could imagine with C++). You could provide an allocation for the object ahead of time, in which case you might use placement new to initialize the object. Or you could use, for example, new in your device code to create a new object/instance of the class, which will both allocate and run the constructor. Note that calling new in device code creates an allocation that is not readily accessible to host data copy functions such as cudaMemcpy, but such an allocation is certainly shareable with other threads, and you’ve already indicated that your focus here is creating the object on the device, not the host. Still, if you didn’t like this, allocating space from host code (e.g. with cudaMalloc) and then using placement new in device code would probably be one possible way to work around this limitation in CUDA.
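
For concreteness, a rough sketch of the two options (assuming your Dict class with a __device__ constructor is in scope; the kernel names are just placeholders):

#include <new>   // for placement new

// Option 1: the host allocates raw storage for the object (cudaMalloc),
// and a single device thread runs the constructor in that storage.
__global__ void construct_in_place(Dict *slot, int dict_sz, int arr_sz){
    if (blockIdx.x == 0 && threadIdx.x == 0){
        new (slot) Dict(dict_sz, arr_sz);    // placement new: no allocation here
    }
}

// Option 2: a single device thread allocates *and* constructs with new on
// the device heap; that allocation is not reachable via cudaMemcpy from the host.
__global__ void construct_on_heap(Dict **slot, int dict_sz, int arr_sz){
    if (blockIdx.x == 0 && threadIdx.x == 0){
        *slot = new Dict(dict_sz, arr_sz);
    }
}

On the host side, option 1 would be preceded by something like Dict *d_dict; cudaMalloc(&d_dict, sizeof(Dict));, while option 2 needs a Dict ** location (itself cudaMalloc’ed) in which the kernel can store the returned pointer.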

Yes, correct. I haven’t studied your code carefully, but in general a host code C++ for-loop might sensibly be replaced by a kernel to do the same thing in device code. The class would either need to provide access to the underlying data to make this convenient to write, or appropriately designed getter/setter methods.
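
As a rough sketch of that idea, reusing the names from your host loop (this assumes a Dict* that is already valid on the device, uses the built-in fmaxf/fminf in place of your helpers, and leaves out the duplicate-removal step):

// Hypothetical kernel replacing the host loop over fllng_lvl_idx:
// each thread handles one filling level and writes one dictionary entry.
__global__ void fill_dict(Dict *d_dict, const float *fllng_lvl,
                          const float *wthdrt, const float *njctn,
                          int wthdrt_sz, int njctn_sz,
                          float min_vol, float max_vol, int n_levels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_levels) return;

    const int cnjtn_sz = wthdrt_sz + njctn_sz;
    float rl[16];                        // sketch assumes cnjtn_sz <= 16
    float lvl = fllng_lvl[idx];

    for (int i = 0; i < cnjtn_sz; i++){
        if (i < wthdrt_sz) rl[i] = fmaxf(min_vol, lvl - wthdrt[i]);
        else               rl[i] = fminf(max_vol, lvl + njctn[i - wthdrt_sz]);
    }
    // duplicate removal omitted for brevity

    d_dict->add_entry(lvl, rl, idx);     // each thread writes a distinct slot
}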

Hi @Robert_Crovella , I’m facing some difficulties generating working code. I post a non-working version:

#include <iostream>
using namespace std;

class Dict {
private:
    struct KeyValuePair {
        float key;
        float *value;
        int value_size;
    };

    KeyValuePair *data;

    int max_size;

public:
    __device__   Dict(int max_dict_size, int max_array_size) :  max_size(max_dict_size) {
        data = new KeyValuePair[max_size];
        for (int i = 0; i < max_size; i++) {
            data[i].value = new float[max_array_size];
            data[i].value_size = max_array_size; }}

    __device__   ~Dict() {
        for (int i = 0; i < max_size; i++){delete[] data[i].value;}
        delete[] data;}

    __device__   void add_entry(float key, float value[],  int position) {
        if (position == max_size){return;}

        data[position].key = key;
        int value_size = data[position].value_size;
        for (int i = 0; i < value_size; i++) {
            data[position].value[i] = value[i];}}


    __device__   void get_value(float key, float value[]) {
        for (int i = 0; i < max_size; i++) {
            if (data[i].key == key){
                for (int j = 0; j < data[i].value_size; j++) {value[j] = data[i].value[j];}
                }}}
};



__device__   void cuda_max(float i, float j, float& result){
  if(i>j){result = i;} else{result = j;}}
__device__  void  cuda_min(float i, float j, float& result){
  if(i>j){result = j;} else{result = i;}}





__global__  void Set_on_Gpu(int max_dict_size, int max_array_size,  Dict *d_dict){
	int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thr_idx==0) {d_dict = &Dict(max_dict_size, max_array_size);}
}

__global__ void Add_on_Gpu(float key, float value[],  int position, Dict* d_dict){
	int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thr_idx==0) {d_dict->add_entry(key, value, position);}
}

__global__  void Show_on_Gpu(float key, Dict* d_dict, int size, float * res ){
	int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thr_idx==0){
	d_dict->get_value(key, res);}}

int main() {

	int max_array_size = 3;
	int max_dict_size = 1;
	float key = 0.;

	float *d_values; float *d_result; float *h_result;
	float *values = (float *)malloc(3 * sizeof(float));
	values[0]=1.0; values[1]=2.0; values[2]=3.0;

	cudaMalloc(&d_values, 3 * sizeof(float));
	cudaMemcpy(d_values, values, 3 * sizeof(float), cudaMemcpyHostToDevice);

	cudaMalloc(&d_result, 3 * sizeof(float));


	Dict *dev_test;


	Set_on_Gpu<<<32,64>>>(max_dict_size,  max_array_size,  dev_test);
	Add_on_Gpu<<<32,64>>>(key, d_values,  0 , dev_test);
	Show_on_Gpu<<<32,64>>>(key, dev_test, max_array_size, d_result);
	cudaMemcpy(h_result, d_result, 3 * sizeof(float), cudaMemcpyDeviceToHost);
       for(int i=0; i<3; i++){cout << h_result[i]<<endl; }

	return 0;
}

I strongly suspect that the problem is that the instance of the class is not cudaMalloc’ed. I have seen similar cases in the other examples, but here the attributes of the class are private and I’m not sure how to design a method to deal with the memory allocation. Can you show me a reference or write a simple example?
Thank you for your time and attention.

Yes, that is correct.

In C++, if you pass a variable by value to a called function, then a copy of that variable is made for use by the called function. Therefore this is going to be problematic:

if(thr_idx==0) {d_dict = &Dict(max_dict_size, max_array_size);}

It is legal to reassign a pointer passed by value that way, but the reassignment/modification only affects the local copy passed to that function. It will have no “global” effect. (Taking the address of a temporary object like that also gives me some concern, but I haven’t chased down the rathole of C++ to determine the guaranteed lifetime of that temp object.) Therefore, this attempt to “set” the d_dict pointer for later use by other functions could not possibly work. This is all C++ so far, not specific to CUDA. If you are going to be writing this sort of code, my hope would be that you would have enough C++ knowledge to spot problems like this.
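
To see the pass-by-value issue in isolation, here is a plain C++ sketch (nothing CUDA-specific about it; names are just for illustration):

#include <cstdio>

void set_by_value(int *p)    { p = new int(42); }   // changes only the local copy
void set_by_address(int **p) { *p = new int(42); }  // changes the caller's pointer

int main(){
    int *a = nullptr, *b = nullptr;
    set_by_value(a);       // a is still nullptr here
    set_by_address(&b);    // b now points at the new int
    std::printf("a=%p  b=%p\n", (void*)a, (void*)b);
    delete b;
    return 0;
}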

There are a few ways to fix this. A fairly straightforward approach, which relies mainly just on C++ to address this, is to pass the value to be modified as a pointer-to-pointer. The pointer must point to a pointer location that has a proper allocation (in this case, accessible from device code). The pointer location that has the allocation is what we will modify and use from device code.

In addition to that observation, you’ve provided no proper allocation for h_result; the compiler should issue a warning related to this.

If I address those 2 items, your code seems to produce sensible results for me:

$ cat t2187.cu
#include <iostream>
using namespace std;

class Dict {
private:
    struct KeyValuePair {
        float key;
        float *value;
        int value_size;
    };

    KeyValuePair *data;

    int max_size;

public:
    __device__   Dict(int max_dict_size, int max_array_size) :  max_size(max_dict_size) {
        data = new KeyValuePair[max_size];
        for (int i = 0; i < max_size; i++) {
            data[i].value = new float[max_array_size];
            data[i].value_size = max_array_size; }}

    __device__   ~Dict() {
        for (int i = 0; i < max_size; i++){delete[] data[i].value;}
        delete[] data;}

    __device__   void add_entry(float key, float value[],  int position) {
        if (position == max_size){return;}

        data[position].key = key;
        int value_size = data[position].value_size;
        for (int i = 0; i < value_size; i++) {
            data[position].value[i] = value[i];}}


    __device__   void get_value(float key, float value[]) {
        for (int i = 0; i < max_size; i++) {
            if (data[i].key == key){
                for (int j = 0; j < data[i].value_size; j++) {value[j] = data[i].value[j];}
                }}}
};

__device__   void cuda_max(float i, float j, float& result){
  if(i>j){result = i;} else{result = j;}}
__device__  void  cuda_min(float i, float j, float& result){
  if(i>j){result = j;} else{result = i;}}


__global__  void Set_on_Gpu(int max_dict_size, int max_array_size,  Dict **d_dict){
        int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
        if(thr_idx==0) {*d_dict = new Dict(max_dict_size, max_array_size);}
}

__global__ void Add_on_Gpu(float key, float value[],  int position, Dict** d_dict){
        int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
        if(thr_idx==0) {(*d_dict)->add_entry(key, value, position);}
}

__global__  void Show_on_Gpu(float key, Dict** d_dict, int size, float * res ){
        int thr_idx = blockIdx.x * blockDim.x + threadIdx.x;
        if(thr_idx==0){
        (*d_dict)->get_value(key, res);}}

int main() {

        int max_array_size = 3;
        int max_dict_size = 1;
        float key = 0.;

        float *d_values; float *d_result;
        float *values = (float *)malloc(3 * sizeof(float));
        float *h_result = (float *)malloc(3 * sizeof(float));
        values[0]=1.0; values[1]=2.0; values[2]=3.0;

        cudaMalloc(&d_values, 3 * sizeof(float));
        cudaMemcpy(d_values, values, 3 * sizeof(float), cudaMemcpyHostToDevice);

        cudaMalloc(&d_result, 3 * sizeof(float));


        Dict **dev_test;
        cudaMalloc(&dev_test, sizeof(Dict *));

        Set_on_Gpu<<<32,64>>>(max_dict_size,  max_array_size,  dev_test);
        Add_on_Gpu<<<32,64>>>(key, d_values,  0 , dev_test);
        Show_on_Gpu<<<32,64>>>(key, dev_test, max_array_size, d_result);
        cudaMemcpy(h_result, d_result, 3 * sizeof(float), cudaMemcpyDeviceToHost);
       for(int i=0; i<3; i++){cout << h_result[i]<<endl; }

        return 0;
}
$ nvcc -o t2187 t2187.cu
$ compute-sanitizer ./t2187
========= COMPUTE-SANITIZER
1
2
3
========= ERROR SUMMARY: 0 errors
$

Perfect, your example resolved my doubts.
Thank you very much for your help; it is very much appreciated.

Best regards.
Giovanni.