I want to pass a small array of integers (at most 10 values) from the host file to my CUDA kernel.
How can I do that without having to create a device pointer, doing a memcpy to copy the contents of the host array to the device array, and then using that in the kernel?
__global__ void my_kernel(int* in, int* out, int* arr, int n)
{
    // …
    for (int i = 0; i < n; i++)
        dev_func(in, out, arr);
    // …
}

__device__ void dev_func(int* in, int* out, int* arr)
{
    int x = threadIdx.x;
    int val = x % 8;
    out[x] = arr[val] * in[x];
}
What would be the best way to go about doing this? (d_in and d_out are device pointers with appropriate space allocated for them in device global memory.)
Even if I copy this small array onto the device global memory (using a memcpy),
then copy the contents of this array from global memory into shared memory, and then pass the shared-memory array to the device function,
would it make more sense to do that instead? Is it possible, and how would I go about it (especially passing the shared-memory array to the device function)?
Any help would be great.
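(For context, one common way to avoid a separate cudaMalloc'd pointer, and presumably what the reply below is reacting to, is constant memory. A minimal sketch, assuming a fixed maximum of 10 elements; the names c_arr, h_arr, and launch are made up for illustration:)

__constant__ int c_arr[10];              // small array in constant memory, no device pointer needed

__global__ void my_kernel(const int* in, int* out)
{
    int x = threadIdx.x;
    out[x] = c_arr[x % 8] * in[x];       // every thread reads c_arr directly
}

void launch(const int* d_in, int* d_out, const int h_arr[10])
{
    cudaMemcpyToSymbol(c_arr, h_arr, 10 * sizeof(int));  // copy host array into the symbol
    my_kernel<<<1, 64>>>(d_in, d_out);
}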
Constant memory has some limitations when accessing it. If all threads access the same element, everything is OK, but if threads access different elements (e.g. param[threadIdx.x]), the accesses are serialized - at least that's what the manuals say.
If you really want to pass an array, pack it into a structure:
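(The example that followed this post appears to have been lost; a minimal sketch of the struct-packing idea, with made-up names, might look like this:)

struct SmallArray {
    int data[10];                          // at most 10 values, packed into a struct
};

__global__ void my_kernel(const int* in, int* out, SmallArray arr)
{
    int x = threadIdx.x;
    out[x] = arr.data[x % 8] * in[x];      // struct arrives as a plain kernel argument
}

void launch(const int* d_in, int* d_out)
{
    SmallArray h_arr;
    for (int i = 0; i < 10; i++)
        h_arr.data[i] = i;                 // fill on the host
    my_kernel<<<1, 64>>>(d_in, d_out, h_arr);  // passed by value, no cudaMalloc/cudaMemcpy
}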
In the above, using global memory tends to hurt the performance of the entire kernel pretty badly, so I was intending to use shared memory. Is there any way to do this? I went through the programming guide but didn't find much.
Just pass it. With a little luck, the compiler will know that you are now using a shared-memory pointer instead of a global pointer inside your device function.
I have used this trick many times, calling the same device function with different pointers. All device functions are inlined - that's how it is resolved.
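(A minimal sketch of what this might look like, reusing the kernel shape from the original post; because __device__ functions are inlined, the same function body can be compiled for either pointer's address space:)

__device__ void dev_func(const int* in, int* out, const int* arr)
{
    int x = threadIdx.x;
    out[x] = arr[x % 8] * in[x];
}

__global__ void my_kernel(const int* in, int* out, const int* g_arr)
{
    __shared__ int s_arr[10];
    if (threadIdx.x < 10)
        s_arr[threadIdx.x] = g_arr[threadIdx.x];  // stage the small array into shared memory
    __syncthreads();

    dev_func(in, out, g_arr);   // same device function, global-memory pointer
    dev_func(in, out, s_arr);   // same device function, shared-memory pointer (overwrites out; shown only to illustrate both calls)
}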
Hmm, interesting, but I am not entirely convinced about why it works. Yes, the device functions are all inlined, but how does that resolve the pointer and make the compiler realize that it is pointing to shared memory? Still, since this is done by each block within the kernel, it logically should be possible.
But as long as it works, it's good. Pdan, are you using CUDA 2.2, and does this work for you?
Because in my case it doesn't seem to work! I use a Quadro FX 5600 with CUDA 2.1.
Further, it works only if a single block is launched. That is somewhat explainable, since shared memory is local to each SM, and there may be some problem with shared memory spanning multiple SMs.
But that is also weird, because in my application I do not need the shared-memory array to span SMs.