I am new to CUDA programming. I want to write a kernel that efficiently processes data from a set of files. To do this, the host code reads each file's contents and passes device struct pointers to the kernel, which then processes all of the file data. Each file is approximately 4 MB. Because the kernel code has a lot of branch divergence, I launch thread blocks with only one thread per block. Because all the data is allocated in global memory, the kernel runs more slowly than expected. It seems that all the kernel parameters end up in the same global memory partition (partition camping), so the thread blocks queue up. I would like to place the kernel parameters in different partitions so that the thread blocks' accesses are evenly distributed across global memory. Where can I find more details about partition camping?
I think it is nearly impossible to have significant partition camping if your block size is 1. Keep in mind that the scheduling unit on CUDA devices is a “warp” of 32 threads from the same block. The scheduler cannot combine threads from different blocks into the same warp, so with one thread per block each warp has only 1 of its 32 lanes doing work, and you are guaranteed the multiprocessor is operating at 1/32 efficiency. Even if you have lots of branch divergence, running with a block size that is a multiple of 32 should always be at least as fast as a block size of 1, if not faster.
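To make the block-size point concrete, here is a minimal sketch of a launch that keeps one thread per file but packs 32 threads into each block. The `FileDesc` struct, the `processFiles` kernel, and the bit-counting loop are all hypothetical stand-ins, since the real data layout and decoder are not shown in this thread.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-file descriptor; the real struct is not shown in the thread.
struct FileDesc {
    const unsigned char *bits;  // device pointer to the file's bit section
    size_t numBytes;            // length of the bit section in bytes
};

// One thread per file. The body only counts set bits and is a placeholder
// for the real decoder.
__global__ void processFiles(const FileDesc *files, int numFiles,
                             unsigned int *results)
{
    int file = blockIdx.x * blockDim.x + threadIdx.x;
    if (file >= numFiles) return;  // threads past the last file do nothing

    unsigned int count = 0;
    for (size_t i = 0; i < files[file].numBytes; ++i)
        count += __popc(files[file].bits[i]);
    results[file] = count;
}

// Number of blocks needed to cover numFiles with threadsPerBlock threads each.
inline int blocksFor(int numFiles, int threadsPerBlock)
{
    return (numFiles + threadsPerBlock - 1) / threadsPerBlock;
}

// Launch with a block size of one full warp instead of a single thread.
void launchProcessFiles(const FileDesc *dFiles, int numFiles,
                        unsigned int *dResults)
{
    const int threadsPerBlock = 32;
    processFiles<<<blocksFor(numFiles, threadsPerBlock), threadsPerBlock>>>(
        dFiles, numFiles, dResults);
}
```

With 250 files this launches 8 blocks of 32 threads, and the bounds check covers the 6 unused threads in the last block. Threads in a warp that take different branches will still be serialized, so this is only a starting point for comparison against the one-thread-per-block version.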
I set a timer to see exactly how long the kernel execution takes. Running the kernel on 55 files takes 35 seconds, and on 250 files it takes 37 seconds (in both cases with one thread per block). Every file has the same format: a header and a bit section. The header is processed on the host, which builds the device pointers for the kernel; the bit section is processed in the kernel. The branch divergence in the kernel depends on the value of the current bit being read; for example, “if (bit value == 1) decode next n bits”.
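For readers following along, the kind of data-dependent branch described here might look roughly like the sketch below. The `BitReader` struct, the most-significant-bit-first order, and the rule "a 1 flag is followed by an n-bit field, a 0 flag by nothing" are assumptions made for illustration only; the actual file format is not given.

```cuda
#include <cstddef>

// Hypothetical bit cursor over a byte buffer, most significant bit first.
struct BitReader {
    const unsigned char *data;
    size_t bitPos;
};

// Read the next n bits (n <= 32) as an unsigned value and advance the cursor.
__host__ __device__ inline unsigned int readBits(BitReader &r, int n)
{
    unsigned int value = 0;
    for (int i = 0; i < n; ++i) {
        size_t byte = r.bitPos >> 3;
        int shift = 7 - (int)(r.bitPos & 7);
        value = (value << 1) | ((r.data[byte] >> shift) & 1u);
        ++r.bitPos;
    }
    return value;
}

// Assumed format: a flag bit of 1 is followed by an n-bit field,
// a flag bit of 0 is followed by nothing. Returns the number of fields read.
__host__ __device__ inline int decodeFields(const unsigned char *data,
                                            size_t totalBits, int n,
                                            unsigned int *out, int maxOut)
{
    BitReader r{data, 0};
    int count = 0;
    while (r.bitPos < totalBits && count < maxOut) {
        if (readBits(r, 1) == 1) {                  // data-dependent branch
            if (r.bitPos + n > totalBits) break;    // truncated field: stop
            out[count++] = readBits(r, n);
        }
    }
    return count;
}
```

When each thread of a warp walks a different file with a loop like this, the threads will generally disagree on the flag bit, so the warp executes both sides of the branch one after the other. That is the divergence cost being discussed, and it applies per warp, not per block.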
When I instead configure the launch as a single block of threads, with every thread processing one file, I get a kernel execution time of 4 minutes (!!!). In this case the parameters are also passed in global memory.
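When comparing launch configurations like these, it can help to time only the kernel, so that file reading and host-to-device copies are excluded. The sketch below uses CUDA events for that. `dummyKernel` and `timeKernelMs` are hypothetical names, and the kernel is a trivial stand-in that should be replaced with the real one.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel so the sketch compiles; replace with the real kernel.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the GPU time of one kernel launch in milliseconds,
// or -1.0f if allocation or the launch failed.
float timeKernelMs(int n, int threadsPerBlock)
{
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dData, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    cudaEventRecord(start);                 // enqueue start marker
    dummyKernel<<<blocks, threadsPerBlock>>>(dData, n);
    cudaEventRecord(stop);                  // enqueue stop marker
    cudaEventSynchronize(stop);             // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return ok ? ms : -1.0f;
}
```

Calling, for example, `timeKernelMs(1 << 20, 1)` and `timeKernelMs(1 << 20, 32)` gives a like-for-like comparison of two block sizes on the same data. The first CUDA call in a process also pays for context creation, so a warm-up launch before the timed one is usually worthwhile.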
I run the project in NetBeans IDE 6.8 on Ubuntu.
Does anyone have an explanation for this kernel behavior? As I said, I suspect the branch divergence is what slows it down so much.