If the above concept is applied in C/C++, the output value at each position should be a multiple of 10, but when I try the above code in CUDA, every value just stays at 10. I assume this is because CUDA processes the elements in parallel. Is there any way that CUDA can do sequential processing, that is, wait until the first output value is computed before moving on to the next one?
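To illustrate, here is a hypothetical sketch of the kind of kernel I mean (the names and the step of 10 are only for illustration, not my exact code):

__global__ void accumulate(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n)
        out[i] = out[i - 1] + 10;   // races: out[i-1] may not be written yet
}

Run serially on a zero-initialized buffer this would produce 10, 20, 30, and so on; run in parallel, every thread reads the initial zero in out[i-1] and writes 10.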
Hi, Luis, thanks for the pointer, but the values I see are all still 10; the output doesn’t increase from one element to the next. As additional information, I use a 16x16 blockDim and a 20x20 gridDim. Perhaps you could give me another hint?
Hi, Luis, thanks again for the help. By the statement that only one block should do all the processing, do you mean that I can only have a dimBlock(WIDTH, HEIGHT) and a dimGrid(1, 1)?
If WIDTH and HEIGHT are large, wouldn’t CUDA get pretty slow? In other words, if I use CUDA for sequential processing, the processing speed won’t be delightful, and the results might even get messed up.
I’ve tried your code, and I think I’ve got the grasp of it.
When I run it with a buffer size of 50x50, I can still see something, and the results come out in sequential order. But when I try a larger buffer, say 1000x2000, it takes CUDA so long to finish that it even crashes after some time (the program exits by itself).
I think this might be caused by the number of threads allowed per block being very limited, say 256 or 512 depending on the GPU type. Can you give me some insight about this? Is there a way for CUDA to do fast sequential processing?
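As a side note, I found the per-block limit can be queried at runtime; a small sketch (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}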
About your first suggestion of giving up on CUDA: that would be my last resort. My system is pretty big; I have about 10 processes, eight of which can be done in parallel and two of which must be done sequentially, such that processes 1-4 are parallel, 5 is sequential, 6-8 are parallel, 9 is sequential, and 10 is parallel.
I’m still considering the possibility of making the algorithm parallel, but so far no luck. The simple logic behind my algorithm is out[thread] = out[thread-1]; that is, computing the current output value requires the previous output to have been calculated first.
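One direction I’m still checking: if the dependency happened to be an associative combination, e.g. out[i] = out[i-1] + in[i], the whole chain would be a prefix scan, which does parallelize. A sketch using Thrust’s inclusive_scan (the additive form of the recurrence is just my assumption for illustration):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// If (and only if) the recurrence is associative, e.g. a running sum,
// the whole dependency chain can be computed in parallel as a scan.
void runningSum(const thrust::device_vector<int> &in,
                thrust::device_vector<int> &out)
{
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());
}

But if the dependency really is an arbitrary function of the previous output, this doesn’t apply.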
I could just divide the processes in two, CUDA for the parallel parts and the CPU for the sequential ones, but that would require at least twice the memory transfer between device and host, which is somewhat expensive time-wise.
So I’m wondering if there is a way to do all of them in CUDA, in the hope that the processing time can be pushed down to its limit.
Thanks for the link about histograms; I’ll take a good look at it.
If there is any additional information that you could share with me, I’d really appreciate it.
Thanks.
Start your sequential kernel with a <<<1, 1>>> block/thread configuration. There is obviously no need for threadIdx/blockIdx then, and you should use registers/shared memory as temporary storage.
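A minimal sketch of what I mean (the kernel name and the per-element step are placeholders of mine):

__global__ void sequentialKernel(int *out, int n)
{
    int acc = 0;                    // register as temporary storage
    for (int i = 0; i < n; ++i)     // the single thread walks the data in order
    {
        acc += 10;                  // whatever the per-element step is
        out[i] = acc;
    }
}

// launched as: sequentialKernel<<<1, 1>>>(d_out, n);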
Or if the configuration is <<<1, x>>> for whatever reason, the following pattern can be used:

__syncthreads();        // make all threads' prior writes visible
if (threadIdx.x == 0)
{
    <sequential bit>    // a single thread does the ordered work
}
__syncthreads();        // hold the other threads until it has finished
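Put together, a sketch of how the pattern interleaves parallel and sequential phases inside one kernel (the actual work shown is a placeholder):

__global__ void mixedKernel(int *data, int n)
{
    int i = threadIdx.x;

    if (i < n)
        data[i] *= 2;               // parallel phase: all threads work

    __syncthreads();                // wait for the parallel phase

    if (threadIdx.x == 0)           // sequential phase: thread 0 only
        for (int k = 1; k < n; ++k)
            data[k] += data[k - 1];

    __syncthreads();                // release the other threads

    if (i < n)
        data[i] += 1;               // next parallel phase
}

Note that this only works within a single block; __syncthreads() does not synchronize across blocks.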