Hi all, I have written some code that optimizes the positions of many particles. It works by making repeated kernel calls, copying the data back to the host, modifying it there, and sending it back to device memory, over and over. These memory transfers kill the speed of the program (obviously), so I began porting the host code to the GPU in little pieces so that all the data can stay on the GPU until the end of the algorithm. The code is in little pieces because it needs to stop whenever it reaches a point where one of the old kernels must be called, which the host then launches. So the problem is that the old kernels must remain kernels (they are too hard to rewrite as device functions), yet the host code needs to know when to call which kernel.
The problem can be summarized as this: is there a way to return a boolean (indirectly) without doing a memory transfer? I was thinking that maybe, when some condition changes, some kind of exception could be thrown, so that the host code could catch it and know to call a different kernel. Or is there a fast way to copy a single int back to the host? Any thoughts would be very much appreciated. Thanks.
Run your first kernel as you do now, and have it write its decision flag to device memory. Don’t even bother sending it to the CPU.
Now run a second, bigger kernel which is actually a wrapper with two or more alternate code paths in it. When the kernel starts, it reads that “which kernel” decision flag from DEVICE memory and decides which subkernel to call. The host CPU doesn’t need to be involved. You can set the kernels up in the same stream so they execute one after the other with no latency.
This is how I do my own “GPU kernel scheduling” and it works fine even when the subkernels are quite large. The main downside is longer compile times.
The other potential problem is that your kernel resources (thread count, shared memory, register count) are the worst case of the two subkernels. In my app it didn’t matter, but it might be a bigger deal if your alternate code paths have very different requirements.
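A minimal sketch of the pattern (every name here, first_kernel, wrapper_kernel, subkernel_a, subkernel_b, and the flag d_which, is a placeholder I made up, not anything from the original code):

```cuda
// The decision flag lives in device memory; the host never reads it.
__device__ int d_which;

__global__ void first_kernel(float *data)
{
    // ... do the usual work on data ...
    bool condition = false; // whatever your algorithm decides
    if (condition)
        d_which = 1;        // pick the other code path for the next launch
}

// The old kernels become __device__ functions ("subkernels").
__device__ void subkernel_a(float *data) { /* old kernel A body */ }
__device__ void subkernel_b(float *data) { /* old kernel B body */ }

// Wrapper kernel: every thread reads the flag and takes the same branch,
// so the conditional costs essentially nothing.
__global__ void wrapper_kernel(float *data)
{
    if (d_which == 0)
        subkernel_a(data);
    else
        subkernel_b(data);
}
```

Launching first_kernel and wrapper_kernel in the same stream guarantees the flag is written before the wrapper reads it.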
Thanks for the responses, guys. Could someone expound on what SPWorley suggested? I am confused about what a decision flag is (does it have another name I could look up and read about?). I am also a little confused about the subkernels. Do you mean that you are combining all the kernels into one giant kernel? I don’t think that would work in this instance. Or is there a way to call a kernel from the device (I thought that wasn’t possible)? I am sorry if these are very obvious questions.
SPWorley’s solution can certainly work in many circumstances.
Here is an alternate solution that I use. Keep the same device flag value, but cudaMemcpy it back to the host after the kernel call and make the code-path decision on the host. Such a small cudaMemcpy only costs ~10 microseconds. I use it in cases where a condition in one kernel (e.g., a check to see whether a certain computation needs to be done) results in a branch in host code leading to several other only somewhat related kernel calls. I also use it to check when kernels that generate lists of indeterminate length overflow their maximum size, at which point the host must respond by growing the array and then re-calling the kernel.
for example:
__global__ void do_something(..., unsigned int *d_conditions)
{
    // ...
    if (error)
        d_conditions[0] = 1;
    // ...
    if (would_overflow)
        atomicMax(&d_conditions[1], my_element_idx + 1);
    else
        /* write element my_element_idx */;
}
// ... on host
do_something<<<...>>>(..., d_conditions);
cudaMemcpy(h_conditions, d_conditions, sizeof(unsigned int) * 2,
           cudaMemcpyDeviceToHost);
if (h_conditions[0])
{
    cudaMemset(d_conditions, 0, sizeof(unsigned int) * 2);
    throw some_error;
}
if (h_conditions[1])
{
    cudaMemset(d_conditions, 0, sizeof(unsigned int) * 2);
    // reallocate the arrays with the new size from h_conditions[1]
}
Note how the code nicely avoids calling cudaMemset before every launch by resetting the flags only when one is actually set.
Is there a performance hit with this method? Yes, but it is essentially zero as long as the kernel itself takes longer than about 1 ms.