passing an array to a kernel?

I want to pass a small array of integers (max of up to 10 values) to my CUDA kernel from the host code.

How can I do that without having to create a device pointer, doing a memcpy to copy the contents of the host array to the device array, and then using that in the kernel?

What I actually want to do is:

[codebox]Host code

int h_arr[8] = {0, 1, 3, 4, 5, 6, 7};

my_kernel<<<grid, threads>>>(d_in, d_out, h_arr, n);

Device code

__global__ void my_kernel(int* in, int* out, int* arr, int n)
{
	for (int i = 0; i < n; i++)
		dev_func(in, out, arr);
}

__device__ void dev_func(int* in, int* out, int* arr)
{
	int x = threadIdx.x;
	int val = x % 8;
	out[x] = arr[val] * in[x];
}[/codebox]

What would be the best way to go about doing this? (d_in and d_out are device pointers with appropriate space allocated for them in device global memory!)

Even if I copy this small array to device global memory (using a memcpy…),

then copy the contents of that array from global memory into shared memory and pass the shared-memory array to the device function…

would it make more sense to do this instead? Is it possible? How would I go about doing it (especially passing the shared-memory array to the device function…)?
Any help would be great…

thanks in advance…

Use constant memory instead. Look for cudaMemcpyToSymbol in the SDK:

__constant__ int ProjectIntParams[ PROJECT_INT_PARAMETERS_COUNT ];

cudaMemcpyToSymbol( ProjectIntParams, &constHostIntParams[0], PROJECT_INT_PARAMETERS_COUNT * sizeof( int ), 0 );
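Applied to the original question, a minimal sketch of this approach might look like the following (the array size of 8 is taken from the first post; the names c_arr, grid, and threads are illustrative):

[codebox]// device code: small lookup table placed in constant memory
__constant__ int c_arr[8];

__global__ void my_kernel(const int* in, int* out, int n)
{
	int x = threadIdx.x;
	out[x] = c_arr[x % 8] * in[x];  // every thread reads the constant table directly
}

// host code: fill the constant symbol once before launching
int h_arr[8] = {0, 1, 3, 4, 5, 6, 7};
cudaMemcpyToSymbol(c_arr, h_arr, 8 * sizeof(int));
my_kernel<<<grid, threads>>>(d_in, d_out, n);[/codebox]

Note that the array no longer appears in the kernel's argument list at all - the __constant__ symbol is visible to every kernel in the same compilation unit.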

eyal

thanks eyal…

I did try to use constant memory…

but I had some issues freeing it after setting it once… it required a restart…

so I was looking for some other alternatives…

There is no need to free the constant memory. What did you try to do?

BTW - as general advice - it's not logical that nVidia would put in a piece of hardware (constant memory)

that doesn't work or requires a restart after using it - that usually means one isn't using it properly.

Constant memory is exactly what you want - check the SDK and programming guide for more information.

cheers

eyal

yeah, there might have been a mistake in how I was using it…

actually I had tried to use more memory than was allocated, and I guess this is why I needed to restart…

I'll fix it up…

thanks again :)

constant memory has some limitations when accessing it. If all threads access the same field, everything is OK, but if all threads access different fields (e.g. param[threadIdx.x]) the code is serialized - at least that’s what the manuals say.

If you really want to pass an array, pack it into a structure:

struct MyStruct {
	int params[8];
};
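A sketch of that idea, assuming the same small integer array as in the first post (the struct is passed by value through the kernel's argument list, so no cudaMemcpy is needed; grid and threads are illustrative):

[codebox]struct MyStruct {
	int params[8];
};

__global__ void my_kernel(const int* in, int* out, MyStruct s, int n)
{
	int x = threadIdx.x;
	out[x] = s.params[x % 8] * in[x];  // s arrives via the kernel arguments
}

// host code
MyStruct s;
for (int i = 0; i < 8; i++) s.params[i] = i;
my_kernel<<<grid, threads>>>(d_in, d_out, s, n);[/codebox]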

[quote name=‘PDan’ post=‘550120’ date=‘Jun 8 2009, 11:26 AM’]

constant memory has some limitations when accessing it. If all threads access the same field, everything is OK, but if all threads access different fields (e.g. param[threadIdx.x]) the code is serialized - at least that’s what the manuals say.

If you really want to pass an array, pack it into a structure:

[/quote]

[codebox]Host code

my_kernel<<<grid, threads>>>(d_in, d_out, n);

Device code

__global__ void my_kernel(int* in, int* out, int n)
{
	__shared__ int dim[3];

	for (int i = 0; i < n; i++) {
		dev_func(in, out, dim);
		dim[0] += 1;
		dim[1] += 1;
		dim[2] += 1;
	}
}

__device__ void dev_func(int* in, int* out, int arr[3])
{
	int x = threadIdx.x;
	out[x] = arr[0] * in[x] + arr[1] * in[x] * 2 + arr[2];
}[/codebox]

In the above, using global memory tends to hit the performance of the entire kernel pretty badly… so I was intending to use shared memory… is there any way to do this? I went through the programming guide but didn't find much…

Just pass it. With a little luck, the compiler will know that you are now using a shared-memory pointer instead of a global pointer inside your device function.

I use this trick many times, calling the same device function with different pointers. All device functions are inlined - that’s how it is resolved.
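A minimal sketch of what this looks like (names are illustrative; because the __device__ function is inlined, the compiler can resolve the pointer's address space at each call site):

[codebox]__device__ void dev_func(int* out, const int* arr)
{
	int x = threadIdx.x;
	out[x] += arr[x % 3];  // arr may resolve to global or shared memory
}

__global__ void my_kernel(int* out, const int* g_arr)
{
	__shared__ int s_arr[3];
	if (threadIdx.x < 3)
		s_arr[threadIdx.x] = g_arr[threadIdx.x];  // stage the array in shared memory
	__syncthreads();

	dev_func(out, g_arr);  // called with a global pointer
	dev_func(out, s_arr);  // same function, called with a shared pointer
}[/codebox]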

Hmm… interesting… but I am not very convinced about the reason why it works - yeah, the device functions are all inlined, but how does that resolve the pointer and make the compiler realize that it's pointing to shared memory? … but since this is done by each block within the kernel, logically it should be possible…

but as long as it works, it's good… but PDan, do you also use CUDA 2.2, and this works for you?

because in my case it doesn't seem to work! - I use a Quadro FX 5600 with CUDA 2.1…

Further, it works only if a single block is launched… that is somewhat explicable, since shared memory is local to each SM and there may be some problem with a shared-memory array spanning multiple SMs…

but that's also weird, because in my application I do not need the shared-memory array to span across the SMs…