How to pass large arguments in CUDA kernels

Hi All,

At the moment I have around 60 arguments that I need to pass to a CUDA kernel.

To do that, I was copying a host pointer struct to the device pointer struct (element by element, as shown in NVIDIA_CUDA_Programming_Guide_2.2.pdf).

But I am not able to access the elements of the pointer-type struct inside the kernel with the (->) operator (it gives me: Warning: Cannot tell what pointer points to, assuming global memory space).

Can anybody please suggest a better way to pass around 60 parameters to a CUDA kernel?

regards,
Nabarun.

Indeed use a pointer to struct:

GGPUParams *pDeviceParams;

 cudaMalloc( ( void ** )&( pDeviceParams), sizeof( GGPUParams ) );

 cudaMemcpy( pDeviceParams, pHostParams, sizeof( GGPUParams ), cudaMemcpyHostToDevice );

 ....

 myKernel<<< >>>( pDeviceParams );

 ...

 ...

 cudaFree( pDeviceParams  );
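
For reference, the snippet above can be filled out into a complete program. This is only a sketch under assumptions: the fields of GGPUParams, the extra `out` argument and the launch configuration are invented for illustration, since the thread does not show them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter block standing in for GGPUParams; the fields are illustrative.
struct GGPUParams {
    float scale;
    float offset;
    int   count;
};

// The kernel receives one pointer and reads every parameter through it.
__global__ void myKernel(const GGPUParams *params, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < params->count) {
        out[idx] = params->scale * idx + params->offset;
    }
}

int main()
{
    const int n = 8;
    GGPUParams hostParams = { 2.0f, 1.0f, n };

    GGPUParams *pDeviceParams = NULL;
    float *d_out = NULL;
    cudaMalloc((void **)&pDeviceParams, sizeof(GGPUParams));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // The struct holds plain values only, so a single copy is enough.
    cudaMemcpy(pDeviceParams, &hostParams, sizeof(GGPUParams), cudaMemcpyHostToDevice);

    myKernel<<<1, n>>>(pDeviceParams, d_out);

    float h_out[n];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(pDeviceParams);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Compare against the same formula evaluated on the host.
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * i + 1.0f) {
            printf("mismatch at index %d\n", i);
            return 1;
        }
    }
    printf("all %d values match\n", n);
    return 0;
}
```

Note that this simple form only covers structs of plain values. If the struct contains pointers, those pointers must already hold device addresses before the struct is copied.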

Hope that helps,

eyal


Constant memory is another possibility. You can write constants and the addresses of device pointers into constant memory before you launch the kernel, and they are then available to the kernel when it executes. Constant memory is cached and has a broadcast mechanism, so performance-wise it should be little different from passing the same data to the kernel as arguments.
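
A minimal sketch of the constant-memory route, assuming a hypothetical `KernelParams` struct and `scaleKernel` kernel (neither name comes from this thread). The parameter block is declared `__constant__` and filled with `cudaMemcpyToSymbol` before the launch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter block; a real one would carry the ~60 values and pointers.
struct KernelParams {
    const float *input;   // device address, obtained from cudaMalloc
    float        scale;
    int          count;
};

// One copy of the parameters, readable by every thread of every kernel in this file.
__constant__ KernelParams c_params;

__global__ void scaleKernel(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < c_params.count) {
        out[idx] = c_params.scale * c_params.input[idx];
    }
}

int main()
{
    const int n = 4;
    float h_in[n] = { 1.0f, 2.0f, 3.0f, 4.0f };

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(h_in));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // Fill the block on the host (with device addresses), then write it to constant memory.
    KernelParams h_params = { d_in, 3.0f, n };
    cudaMemcpyToSymbol(c_params, &h_params, sizeof(KernelParams));

    scaleKernel<<<1, n>>>(d_out);

    float h_out[n];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 3.0f * h_in[i]) {
            printf("mismatch at index %d\n", i);
            return 1;
        }
    }
    printf("all %d values match\n", n);
    return 0;
}
```

The kernel takes no parameter pointer at all here; the trade-off is that the block is a file-scope symbol, so it has to be rewritten before any launch that needs different values.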


Even though you get the warning, it should still work perfectly fine (provided the assumption about global memory is actually correct - which in this case it seems like it is).

I use a structure of dynamically allocated arrays for a couple of implementations and I also get one of those warnings anytime I access an array through the struct pointer, but the implementation works fine and I don’t appear to have any coalescing issues within the arrays themselves.

Hello Eyal,

No, that is not working. I am again getting garbage values inside the kernel.

But one interesting thing: once, by mistake, I copied a structure pointer (one that I initialized on the host and whose elements I then set on the device, element by element, as I did earlier) to a pointer that I allocated on the device (like you were doing).

Surprisingly, that worked without any memcpy error, but only when I used “cudaMemcpyHostToDevice”.

When I tried “cudaMemcpyDeviceToDevice” (which is the right way, I guess **), it again did not work.

Any ideas?

Please let me know if my English was not clear.

**

Since I allocate the memory for all elements of the structure (element by element) on the device, the entire structure has to be on the device (am I right?). So when I copy such a struct to a struct defined by your method, the copy should be of the device-to-device type (I am not so clear on this point).

regards,

Nabarun

But in my case I am getting garbage values the moment I try to access the elements with (->) inside the kernel. Allocation in global memory is not the problem: I get the elements back in host memory if I don't go through the kernel, but once the kernel is involved I get garbage values.

Maybe you can post the code you use… that might be helpful.

eyal

Well Eyal,

Here is a struct

typedef struct{

float *p00 , *p50;

} testStruct;

which I keep in a common header file (commonStruct.h) so that it is available to any file that needs it.

This is my main.cpp file:

[codebox]#include <stdio.h>

#include <cutil_inline.h>

#include <iostream>

#include "commonStruct.h"

using namespace std;

extern void goCuda(testStruct *);

void setNull(testStruct *locStruct)

{

locStruct->p00 = NULL;

locStruct->p50 = NULL;

}

void initStruct(testStruct *locStruct)

{

locStruct->p00 = new float[4];

if( 0==locStruct->p00)

{

printf("couldn't allocate memory\n");

exit(1);

}

locStruct->p50 = new float[4];

}

void MallocCUDDA(testStruct *locStruct, size_t sizeP)

{

cudaMalloc((void**)&locStruct->p00, sizeP);

cudaMalloc((void**)&locStruct->p50, sizeP);

}

void cuddaMemCopy(testStruct *d_locStruct, testStruct *h_locStruct,size_t sizeP)

{

cudaMemcpy(d_locStruct->p00, h_locStruct->p00, sizeP,cudaMemcpyHostToDevice);

cudaMemcpy(d_locStruct->p50, h_locStruct->p50, sizeP,cudaMemcpyHostToDevice);

}

//device structName d_sn;

int main()

{

testStruct *h_sn = new testStruct; // declaring host struct 

testStruct *d_sn = new testStruct; // declaring device struct 

setNull(h_sn); // initialising each elements to null

setNull(d_sn); // initialising each elements to null for device



//testStruct d_sn;

size_t sizeP = 4*sizeof(float); // size of array 

initStruct(h_sn); // allocating memory for each elements in the host



MallocCUDDA(d_sn, sizeP); // allocating memory for each elements in the device



for(int i=0; i<4;i++)

{

h_sn->p00[i] = float(i+10) ; // passing same values to each host elements 

h_sn->p50[i] = float(i*4) ;

}



cuddaMemCopy(d_sn, h_sn, sizeP); // copying each elements from host to device 

goCuda(d_sn); // passing the device struct to cuda 

delete[] h_sn->p00; delete[] h_sn->p50; // allocated with new[], so release with delete[]

cudaFree(d_sn->p00);cudaFree(d_sn->p50);

/*h_sn.p00 = new float[4];

h_sn.p50 = new float[4];*/

/*

cudaMalloc((void**)&d_sn.p00, sizeP);

cudaMalloc((void**)&d_sn.p50, sizeP);

int arry[4][4];

arry[0][1] = 2222222;

float a = 20.0;



cudaMemcpy(d_sn.p00, h_sn.p00, sizeP,cudaMemcpyHostToDevice);

cudaMemcpy(d_sn.p50, h_sn.p50, sizeP,cudaMemcpyHostToDevice);	

for(int i=0; i<4;i++)

cout << h_sn.p00[i] << endl;

goCuda(d_sn, arry);

// allocate and populate h_sn

// allocate to device with cudaMalloc

// copy memory with cudaMemcpy

//kernel<<<4,4>>>(d_sn);

// copy back and do what ever

*/

cin.get();

return 0;

}[/codebox]

And this is the cudaSolver.cu file:

[codebox]#include <iostream>

#include "CUSetup.h"

#include <cutil_inline.h>

#include "commonStruct.h"

#include "cublas.h"

using namespace std;

__global__ void kernelSolver(testStruct * d_get, float *d_check)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

d_check[idx] = d_get->p50[idx];

}

void goCuda(testStruct * d_sn)

{

dim3 dimBlock(2);

dim3 dimThread(2);

size_t memsize = sizeof(float)*4; // array of four floats

float *d_check, *h_check;//, *d_check2;



h_check = (float *) malloc(memsize); // a host float pointer 

cudaMalloc( (void **) &d_check, memsize); // a device float pointer to check values inside the kernel

cudaMemset( d_check, 0, memsize );



testStruct *pDeviceParams; // struct defined in your WAY 

cudaMalloc( ( void ** )&( pDeviceParams), sizeof( testStruct ) );

cudaMemcpy( pDeviceParams, d_sn, sizeof( testStruct ), cudaMemcpyHostToDevice );



cuCheckError("Mem copy failed : ");

kernelSolver<<<dimBlock, dimThread>>>(pDeviceParams, d_check );

cuCheckError("Kernel execution failed");

cudaFree( pDeviceParams  );



cudaMemcpy( h_check, d_check, memsize, cudaMemcpyDeviceToHost );

for(int i=0; i<4;i++)

cout << "here : you got it  " << h_check[i] << endl;

free(h_check);

cudaFree(d_check);

}

[/codebox]

So what I was saying is this:

When I copy d_sn, which is a device struct pointer, using cudaMemcpyHostToDevice, it works, but if I use cudaMemcpyDeviceToDevice it does not work.

And if I use the h_sn pointer, I get no values.

It is quite a tedious job, but you can simply copy the code into three files:

  1. commonStruct.h: put the struct there

  2. main.cpp: copy the main code there

  3. cudaSolver.cu: copy the CUDA code there. That should give you a test setup in a new VC++ project.

Please let me know if something is not understandable.

Anyway, I want to thank you for all the help from your side.

regards,

Nabarun.

The code looks OK. I think this is what you mean, as far as I've understood:

  1. You cannot pass h_sn to the kernel, since it is data on the CPU (on the host).

  2. If you do this:

cudaMemcpy( pDeviceParams, d_sn, sizeof( testStruct ), cudaMemcpyHostToDevice );

this works? Good. This is what you should do.

  3. If you do this:

cudaMemcpy( pDeviceParams, d_sn, sizeof( testStruct ), cudaMemcpyDeviceToDevice );

this doesn't work? Good. It shouldn't.

If I understand you correctly, then what you need to realize is that d_sn itself is allocated by you on the host (CPU):

testStruct *d_sn = new testStruct;

The data members in d_sn (p00 and p50) are allocated on the device, in MallocCUDDA:

cudaMalloc((void**)&locStruct->p00, sizeP);

cudaMalloc((void**)&locStruct->p50, sizeP);

Therefore, if you want to copy the entire data structure (d_sn) to the device, you need to copy it from host to device.

The pointers inside it already point to device memory.
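
To make the layout concrete, here is a condensed sketch of the same pattern in a single file. It reuses testStruct and kernelSolver from the posted code; the name `staging`, the stack-allocated host arrays and the omitted error-checking helpers are simplifications assumed for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    float *p00, *p50;
} testStruct;

__global__ void kernelSolver(const testStruct *d_get, float *d_check)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_check[idx] = d_get->p50[idx];
}

int main()
{
    const int n = 4;
    const size_t sizeP = n * sizeof(float);
    float h_p00[n], h_p50[n];
    for (int i = 0; i < n; ++i) {
        h_p00[i] = float(i + 10);
        h_p50[i] = float(i * 4);
    }

    // 'staging' lives in host memory, but its members hold device addresses.
    testStruct staging;
    cudaMalloc((void **)&staging.p00, sizeP);
    cudaMalloc((void **)&staging.p50, sizeP);
    cudaMemcpy(staging.p00, h_p00, sizeP, cudaMemcpyHostToDevice);
    cudaMemcpy(staging.p50, h_p50, sizeP, cudaMemcpyHostToDevice);

    // The struct itself is host data, so this copy is host to device.
    testStruct *pDeviceParams = NULL;
    cudaMalloc((void **)&pDeviceParams, sizeof(testStruct));
    cudaMemcpy(pDeviceParams, &staging, sizeof(testStruct), cudaMemcpyHostToDevice);

    float *d_check = NULL;
    cudaMalloc((void **)&d_check, sizeP);
    kernelSolver<<<2, 2>>>(pDeviceParams, d_check);

    float h_check[n];
    cudaError_t err = cudaMemcpy(h_check, d_check, sizeP, cudaMemcpyDeviceToHost);

    cudaFree(d_check);
    cudaFree(pDeviceParams);
    cudaFree(staging.p00);
    cudaFree(staging.p50);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // The kernel copied p50, so the result must equal the host-side p50 values.
    for (int i = 0; i < n; ++i) {
        if (h_check[i] != h_p50[i]) {
            printf("mismatch at index %d\n", i);
            return 1;
        }
    }
    printf("kernel read p50 correctly through the struct pointer\n");
    return 0;
}
```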

Hope that I explained myself well :)

eyal

Yes Eyal,

I was explaining something like this to myself as well; fortunately, your point bolsters my assumption.

Since the allocation was done on the host, the copy should be from host to device. OK.

But now, coming back to your example, where both the host struct pointer and its elements are on the host (h_sn in my case): the struct does get passed to the device when I copy it, no doubt.

But when I pass the pointer as an argument to the kernel, I get garbage values.

In short, I mean that copying an entire true host struct (one that truly resides on the host) to the device does not work inside the kernel.

I hope I am clear this time.

regards,
Nabarun.

:rolleyes:
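
On that last point: cudaMemcpy of a struct is a shallow copy. If h_sn is copied as it is, the p00 and p50 members arriving on the device still contain host addresses, and dereferencing them in a kernel gives garbage. Each array has to be copied separately and the pointer members replaced with device addresses. A minimal sketch follows, assuming a hypothetical helper `deepCopyToDevice` and a hypothetical kernel `sumKernel`. It also passes the struct by value as a kernel argument, which is a variation on the approach discussed above and assumes the struct fits within the kernel parameter size limit of the target device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    float *p00, *p50;
} testStruct;

// Hypothetical helper: takes a struct whose members point to host arrays and
// returns a struct (itself on the host) whose members point to device copies.
static testStruct deepCopyToDevice(const testStruct &h, size_t bytes)
{
    testStruct d;
    cudaMalloc((void **)&d.p00, bytes);
    cudaMalloc((void **)&d.p50, bytes);
    cudaMemcpy(d.p00, h.p00, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d.p50, h.p50, bytes, cudaMemcpyHostToDevice);
    return d;
}

// The struct is passed by value, so no separate device allocation is needed for it.
__global__ void sumKernel(testStruct s, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = s.p00[idx] + s.p50[idx];
}

int main()
{
    const int n = 4;
    const size_t sizeP = n * sizeof(float);

    // A "true" host struct: the struct and both arrays live in host memory.
    testStruct h_sn;
    h_sn.p00 = new float[n];
    h_sn.p50 = new float[n];
    for (int i = 0; i < n; ++i) {
        h_sn.p00[i] = float(i + 10);
        h_sn.p50[i] = float(i * 4);
    }

    testStruct d_sn = deepCopyToDevice(h_sn, sizeP);

    float *d_out = NULL;
    cudaMalloc((void **)&d_out, sizeP);
    sumKernel<<<2, 2>>>(d_sn, d_out);

    float h_out[n];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeP, cudaMemcpyDeviceToHost);

    // Compare against the same sum computed on the host.
    int bad = (err != cudaSuccess);
    for (int i = 0; i < n && !bad; ++i) {
        if (h_out[i] != h_sn.p00[i] + h_sn.p50[i]) {
            bad = 1;
        }
    }

    cudaFree(d_out);
    cudaFree(d_sn.p00);
    cudaFree(d_sn.p50);
    delete[] h_sn.p00;
    delete[] h_sn.p50;

    printf(bad ? "result mismatch or CUDA error\n" : "deep copy verified\n");
    return bad;
}
```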