How to pass large arguments in CUDA kernels (Kernel arguments)

nabarunpaul · December 16, 2009, 11:11am

Hi All,

At the moment I have around 60 arguments that I need to pass to a CUDA kernel.

To do that, I was copying a host struct pointer to a device struct pointer (element by element, as shown in NVIDIA_CUDA_Programming_Guide_2.2.pdf).

But I am not able to access the elements of the struct pointer inside the kernel with the (->) operator; the compiler gives me "Warning: Cannot tell what pointer points to, assuming global memory space".

Can anybody please suggest a better way to pass around 60 parameters to a CUDA kernel?

regards,
Nabarun.

eyalhir74 · December 16, 2009, 11:25am

Indeed use a pointer to struct:

GGPUParams *pDeviceParams;

 cudaMalloc( ( void ** )&( pDeviceParams), sizeof( GGPUParams ) );

 cudaMemcpy( pDeviceParams, pHostParams, sizeof( GGPUParams ), cudaMemcpyHostToDevice );

 ....

 myKernel<<< >>>( pDeviceParams );

 ...

 ...

 cudaFree( pDeviceParams  );

Hope that helps,

eyal

avidday · December 16, 2009, 11:28am

Constant memory is another possibility. You can write constants and the addresses of device pointers into constant memory before you launch the kernel, and they are then available to the kernel when it executes. There is a constant memory cache and a broadcast mechanism, so performance-wise it should be little different from passing the same data to the kernel by argument.
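A minimal sketch of the constant-memory approach described above, written against the current CUDA runtime API. The struct `KernelParams`, its fields, the symbol `c_params`, and the kernel `scaleKernel` are hypothetical names chosen for illustration, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter block; the field names are illustrative only.
struct KernelParams {
    float *input;   // device pointer
    float *output;  // device pointer
    float  scale;
    int    n;
};

// Lives in constant memory; every thread reads the same values.
__constant__ KernelParams c_params;

// The kernel takes no arguments: everything comes from c_params.
__global__ void scaleKernel()
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < c_params.n)
        c_params.output[idx] = c_params.scale * c_params.input[idx];
}

// Returns 0 on success, nonzero on a CUDA error or a wrong result.
int runConstantDemo()
{
    const int n = 4;
    float h_in[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_out[n] = {0.0f, 0.0f, 0.0f, 0.0f};

    KernelParams h_params;   // staged on the host
    h_params.scale = 2.0f;
    h_params.n = n;
    if (cudaMalloc((void **)&h_params.input, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&h_params.output, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(h_params.input, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Write the whole parameter block into constant memory before the launch.
    cudaMemcpyToSymbol(c_params, &h_params, sizeof(KernelParams));

    scaleKernel<<<1, n>>>();

    // This blocking copy also reports any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, h_params.output, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(h_params.input);
    cudaFree(h_params.output);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; i++)
        if (h_out[i] != 2.0f * h_in[i]) return 2;
    return 0;
}

int main()
{
    int rc = runConstantDemo();
    printf("%s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

With around 60 parameters, the same pattern applies with a larger struct; the pointers stored in it still have to be device addresses obtained from cudaMalloc.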

apangborn · December 16, 2009, 3:23pm

Even though you get the warning, it should still work perfectly fine (provided the assumption about global memory is actually correct - which in this case it seems like it is).

I use a structure of dynamically allocated arrays for a couple of implementations and I also get one of those warnings anytime I access an array through the struct pointer, but the implementation works fine and I don’t appear to have any coalescing issues within the arrays themselves.

nabarunpaul · December 17, 2009, 4:22am

Hello Eyal,

No, that is not working; I am again getting garbage values inside the kernel.

But one interesting thing: by mistake I once copied a struct pointer (which I had created on the host and whose elements I had then set up on the device, element by element, as I did earlier) to a pointer that I had allocated on the device (as you were doing).

Surprisingly, that worked without any memcpy error, but only when I used “cudaMemcpyHostToDevice”.

But when I tried “cudaMemcpyDeviceToDevice” (which is the right way, I guess **), it again did not work.

Any idea?

Please let me know if my English was not clear.

**

Since I allocate memory for all elements of the struct (element by element) on the device, the entire struct has to be on the device (am I right?). So when I copy such a struct to a struct defined by your method, the copy should be of the device-to-device type (I am not so clear on this point).

regards,

Nabarun

nabarunpaul · December 17, 2009, 8:53am

But in my case I get garbage values the moment I try to access (->) the elements inside the kernel. Allocation in global memory is not the problem, since I can copy the elements back to host memory as long as I don't go through the kernel; but when they are accessed inside the kernel I get garbage values.

eyalhir74 · December 17, 2009, 10:10am

Maybe you can post the code you use… that might be helpful.

eyal

nabarunpaul · December 17, 2009, 11:35am

Well Eyal,

Here is the struct:

typedef struct{

float *p00 , *p50;

} testStruct;

which I keep in a common header file (commonStruct.h) so that it is available to any file that needs it.

This is my main.cpp file:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>
#include <iostream>
#include "commonStruct.h"

using namespace std;

extern void goCuda(testStruct *);

// Reset both member pointers.
void setNull(testStruct *locStruct)
{
    locStruct->p00 = NULL;
    locStruct->p50 = NULL;
}

// Allocate the member arrays in host memory.
void initStruct(testStruct *locStruct)
{
    locStruct->p00 = new float[4];
    if (0 == locStruct->p00)
    {
        printf("couldn't allocate memory\n");
        exit(1);
    }
    locStruct->p50 = new float[4];
}

// Allocate the member arrays in device memory.
void MallocCUDDA(testStruct *locStruct, size_t sizeP)
{
    cudaMalloc((void**)&locStruct->p00, sizeP);
    cudaMalloc((void**)&locStruct->p50, sizeP);
}

// Copy the array contents from the host arrays to the device arrays.
void cuddaMemCopy(testStruct *d_locStruct, testStruct *h_locStruct, size_t sizeP)
{
    cudaMemcpy(d_locStruct->p00, h_locStruct->p00, sizeP, cudaMemcpyHostToDevice);
    cudaMemcpy(d_locStruct->p50, h_locStruct->p50, sizeP, cudaMemcpyHostToDevice);
}

//device structName d_sn;

int main()
{
    testStruct *h_sn = new testStruct; // declaring host struct
    testStruct *d_sn = new testStruct; // declaring device struct

    setNull(h_sn); // initialising each element to null
    setNull(d_sn); // initialising each element to null for device

    //testStruct d_sn;
    size_t sizeP = 4*sizeof(float); // size of array

    initStruct(h_sn);         // allocating memory for each element on the host
    MallocCUDDA(d_sn, sizeP); // allocating memory for each element on the device

    for (int i = 0; i < 4; i++)
    {
        h_sn->p00[i] = float(i+10); // filling the host arrays with test values
        h_sn->p50[i] = float(i*4);
    }

    cuddaMemCopy(d_sn, h_sn, sizeP); // copying each element from host to device
    goCuda(d_sn);                    // passing the device struct to cuda

    delete[] h_sn->p00; delete[] h_sn->p50; // allocated with new[], so delete[] rather than free
    cudaFree(d_sn->p00); cudaFree(d_sn->p50);
    delete h_sn; delete d_sn;

    /*h_sn.p00 = new float[4];
    h_sn.p50 = new float[4];*/

    /*
    cudaMalloc((void**)&d_sn.p00, sizeP);
    cudaMalloc((void**)&d_sn.p50, sizeP);
    int arry[4][4];
    arry[0][1] = 2222222;
    float a = 20.0;

    cudaMemcpy(d_sn.p00, h_sn.p00, sizeP,cudaMemcpyHostToDevice);
    cudaMemcpy(d_sn.p50, h_sn.p50, sizeP,cudaMemcpyHostToDevice);
    for(int i=0; i<4;i++)
    cout << h_sn.p00[i] << endl;
    goCuda(d_sn, arry);
    // allocate and populate h_sn
    // allocate to device with cudaMalloc
    // copy memory with cudaMemcpy
    //kernel<<<4,4>>>(d_sn);
    // copy back and do what ever
    */

    cin.get();
    return 0;
}[/codebox]

And this is the cudaSolver.cu file:

[codebox]#include <iostream>
#include <stdlib.h>
#include "CUSetup.h"
#include <cutil_inline.h>
#include "commonStruct.h"
#include "cublas.h"

using namespace std;

// Each thread copies one element of p50 into d_check so the host can inspect it.
__global__ void kernelSolver(testStruct *d_get, float *d_check)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_check[idx] = d_get->p50[idx];
}

void goCuda(testStruct *d_sn)
{
    dim3 dimBlock(2);
    dim3 dimThread(2);
    size_t memsize = sizeof(float)*4; // array size four
    float *d_check, *h_check; //, *d_check2;

    h_check = (float *) malloc(memsize);     // a host float pointer
    cudaMalloc((void **) &d_check, memsize); // a device float pointer to check values inside the kernel
    cudaMemset(d_check, 0, memsize);

    testStruct *pDeviceParams; // struct defined in your WAY
    cudaMalloc((void **) &pDeviceParams, sizeof(testStruct));
    cudaMemcpy(pDeviceParams, d_sn, sizeof(testStruct), cudaMemcpyHostToDevice);

    cuCheckError("Mem copy failed : ");
    kernelSolver<<<dimBlock, dimThread>>>(pDeviceParams, d_check);
    cuCheckError("Kernel execution failed");
    cudaFree(pDeviceParams);

    cudaMemcpy(h_check, d_check, memsize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++)
        cout << "here : you got it  " << h_check[i] << endl;

    free(h_check);
    cudaFree(d_check);
}

[/codebox]

So what I was saying is this:

When I copy d_sn, which is a device struct pointer, using cudaMemcpyHostToDevice, it works, but if I use cudaMemcpyDeviceToDevice it does not work.

And if I use the h_sn pointer I get no values.

It is quite a tedious job, but if you simply copy the code into three files:

commonStruct.h - put the struct there
main.cpp - copy the main code there
cudaSolver.cu - copy the CUDA code there

it will definitely give you a testing platform in a new VC++ project.

Please let me know if something is not understandable.

Anyway, I want to thank you for all your help.

regards,

Nabarun.

eyalhir74 · December 17, 2009, 12:02pm

The code looks OK. I think this is what you mean (or at least what I've understood):

You cannot pass h_sn to the kernel, since it is data on the CPU (on the host).
If you do this:

cudaMemcpy( pDeviceParams, d_sn, sizeof( testStruct ), cudaMemcpyHostToDevice );

this works? Good. This is what you should do.

If you do this:

cudaMemcpy( pDeviceParams, d_sn, sizeof( testStruct ), cudaMemcpyDeviceToDevice );

this doesn't work? Good. It shouldn't.

If I understand you correctly, then what you need to realize is that d_sn itself is allocated by you on the host (CPU):

testStruct *d_sn = new testStruct;

The data members of d_sn (p00 and p50) are allocated on the device, in MallocCUDDA:

cudaMalloc((void**)&locStruct->p00, sizeP);

cudaMalloc((void**)&locStruct->p50, sizeP);

Therefore, if you want to copy the entire data structure (d_sn) to the device, you need to copy it from host to device.

The pointers inside it already point to device memory.

Hope that I explained myself well :)

eyal
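Because d_sn is a host-resident struct that holds only device addresses, one further option consistent with this explanation is to pass the struct to the kernel by value and skip the extra cudaMalloc/cudaMemcpy for the struct itself. The sketch below assumes the struct fits within the kernel parameter size limit (256 bytes on the compute capability 1.x hardware of this thread's era, 4 KB on later devices), so 60 pointers may or may not fit depending on the target. The kernel name `readByValue` and the helper `runByValueDemo` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same layout as testStruct in the thread.
typedef struct {
    float *p00, *p50;
} testStruct;

// The struct arrives by value: the launch copies its bytes (two device
// addresses) into the kernel's parameters, so members are read with '.'.
__global__ void readByValue(testStruct s, float *d_check)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_check[idx] = s.p00[idx] + s.p50[idx];
}

// Returns 0 on success, nonzero on a CUDA error or a wrong result.
int runByValueDemo()
{
    const int n = 4;
    const size_t bytes = n * sizeof(float);
    float h_p00[n], h_p50[n], h_check[n];
    for (int i = 0; i < n; i++) {
        h_p00[i] = float(i + 10);   // same test values as main.cpp in the thread
        h_p50[i] = float(i * 4);
    }

    testStruct d_sn;        // lives on the host; members point to device memory
    float *d_check = NULL;
    if (cudaMalloc((void **)&d_sn.p00, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_sn.p50, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_check, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d_sn.p00, h_p00, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_sn.p50, h_p50, bytes, cudaMemcpyHostToDevice);

    readByValue<<<2, 2>>>(d_sn, d_check);   // 2 blocks of 2 threads, as in the thread

    // This blocking copy also reports any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_check, d_check, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_check);
    cudaFree(d_sn.p00);
    cudaFree(d_sn.p50);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; i++)
        if (h_check[i] != h_p00[i] + h_p50[i]) return 2;
    return 0;
}

int main()
{
    int rc = runByValueDemo();
    printf("%s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

The pointer-to-struct version from the thread remains the safer choice when the parameter block is too large for the launch limit.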

nabarunpaul · December 18, 2009, 3:04am

Yes Eyal,

I had been explaining something like this to myself; fortunately your point bolsters my assumption.

Since the allocation was done on the host, the copy should be from host to device. OK.

But now coming back to your example, where both the host struct pointer and its elements are on the host (which is h_sn in my case): the struct does get passed to the device when I copy it, no doubt.

But when I pass that pointer as an argument to the kernel I get garbage values.

In short, I mean that copying an entire true host struct (one that truly resides on the host) to the device does not work inside the kernel.

I hope I am clear this time.

regards,
Nabarun.
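The garbage values described here are consistent with the member pointers of h_sn still holding host addresses after the struct's bytes are copied: cudaMemcpy copies the two pointer values, not the arrays they point to. A minimal sketch of a "deep copy" that starts from a pure host struct follows; the helper `deepCopyToDevice`, the kernel `readThroughPointer`, and the `staging` struct are hypothetical names, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same layout as testStruct in the thread.
typedef struct {
    float *p00, *p50;
} testStruct;

__global__ void readThroughPointer(const testStruct *d_get, float *d_check)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_check[idx] = d_get->p50[idx];
}

// Hypothetical helper: fills 'staging' (a host struct) with device copies of
// the arrays in 'h_src', then copies 'staging' itself to the device.
// Returns the device address of the struct, or NULL on failure.
testStruct *deepCopyToDevice(const testStruct *h_src, size_t bytes, testStruct *staging)
{
    testStruct *d_struct = NULL;
    if (cudaMalloc((void **)&staging->p00, bytes) != cudaSuccess) return NULL;
    if (cudaMalloc((void **)&staging->p50, bytes) != cudaSuccess) return NULL;
    cudaMemcpy(staging->p00, h_src->p00, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(staging->p50, h_src->p50, bytes, cudaMemcpyHostToDevice);

    if (cudaMalloc((void **)&d_struct, sizeof(testStruct)) != cudaSuccess) return NULL;
    // 'staging' is in host memory, so this copy is host-to-device.
    cudaMemcpy(d_struct, staging, sizeof(testStruct), cudaMemcpyHostToDevice);
    return d_struct;
}

// Returns 0 on success, nonzero on a CUDA error or a wrong result.
int runDeepCopyDemo()
{
    const int n = 4;
    const size_t bytes = n * sizeof(float);
    float a[n], b[n], h_check[n];
    for (int i = 0; i < n; i++) {
        a[i] = float(i + 10);
        b[i] = float(i * 4);
    }
    testStruct h_sn = { a, b };   // a "true" host struct: host pointers inside

    testStruct staging;           // host struct that will hold device pointers
    testStruct *d_struct = deepCopyToDevice(&h_sn, bytes, &staging);
    if (d_struct == NULL) return 1;

    float *d_check = NULL;
    if (cudaMalloc((void **)&d_check, bytes) != cudaSuccess) return 1;
    readThroughPointer<<<2, 2>>>(d_struct, d_check);

    // This blocking copy also reports any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_check, d_check, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_check);
    cudaFree(d_struct);
    cudaFree(staging.p00);
    cudaFree(staging.p50);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; i++)
        if (h_check[i] != b[i]) return 2;
    return 0;
}

int main()
{
    int rc = runDeepCopyDemo();
    printf("%s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

The `staging` struct plays the same role as d_sn in the code posted earlier: it is kept on the host so that the device pointers can later be passed to cudaFree.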

nabarunpaul · December 18, 2009, 3:09am

:rolleyes:
