I am running a CUDA code inside a mex file in Matlab, to accelerate the computation of a radiation pattern. I am using Tesla C2075. If I use 512 threads, everything works fine. However, if I use 1024 threads, I get garbage output. I always thought that the computed results on a GPU are independent of the execution configuration. Not only that, I have to call the mex file multiple times, and each time I get the same output from the GPU, although the input to the mex file is different for each call. It is as if the memory got stuck in the GPU when I use 1024 threads. Actually, this phenomenon starts at 604 threads/block.
I don’t use CUDA with Matlab, but this sounds like there are not enough registers to run a block with more than 603 threads and therefore the kernel is failing to launch. Are you checking for CUDA errors in your host code?
Thanks. I think you are right. I just added cutilCheckMsg, and the kernels were indeed failing to launch. This explains why, without error checking, the mex function ran extremely fast: none of the kernels were actually launching. My conclusion is that more threads/block is not necessarily better. Since there is probably nothing I can do about registers, I will just have to wait until we buy the K20 Tesla card.
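For anyone reading along: cutilCheckMsg comes from the old CUTIL helper library, which shipped with the SDK samples and was later deprecated. The same check can be done with the plain CUDA runtime API. A minimal host-side sketch (requires the CUDA toolkit to compile; the kernel name in the usage comment is hypothetical):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Call right after a kernel launch. A launch configuration that needs
 * more registers than the SM provides fails with
 * cudaErrorLaunchOutOfResources instead of silently doing nothing. */
static void checkLastCudaError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

/* Usage after launching a (hypothetical) kernel:
 *   myKernel<<<blocks, threadsPerBlock>>>(d_in, d_out);
 *   checkLastCudaError("myKernel launch failed");
 *   cudaDeviceSynchronize();  // surfaces errors from the kernel body
 *   checkLastCudaError("myKernel execution failed");
 */
```

The synchronize-then-check step matters because kernel launches are asynchronous: the launch error appears immediately, but execution errors only appear after the kernel has actually run.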
Why are you set on running your kernel with 512+ threads per block? What if you run 256 threads per block (or fewer)? In general, I don't see any issue with that. Do you have a specific reason for the larger block size?
– Mandar Gurav
I was under the mistaken impression that using more threads/block leads to higher occupancy. As I am learning from websites and tutorials, this is not the case. However, it is interesting that we tried the same thing on GTX 680 (i.e., 1024 threads/block), but the threads launched successfully, with no improvement in run-time performance. I am learning from this experience that one should always regression-test the GPU version against the 100%-sure CPU version.
You are talking about maximum theoretical occupancy, which you can achieve only for simple kernels such as A[i] = B[i] + C[i]. In practice, one should check all possibilities of threads/block. One of my codes was fastest with 128 threads/block.
In one of my cases it is 64 threads per block!!!
Can you tell us the number of registers used by your kernel? All of this depends on the register count. As you told us, your code runs correctly on the GTX 680, which is a Kepler card with a 255-registers-per-thread limit and 64K registers in all (correct me if I am wrong!). So 1024 threads might fit, and hence you could run your kernel.
As I have mentioned above, we are using very few threads (and hence lower occupancy), yet we get the best performance out of the given card, mainly because of "register spill" - my main enemy in the Fermi-CUDA game! The number of registers needed by each thread is much more than the limit (64 - 1 = 63 registers per thread), and the excess registers were spilling to the caches. To reduce this spill we use fewer threads.
So the important point here is the number of registers per thread used by your kernel.
Tesla C2075 has 32768 registers/block. However, I don't know how to count the number of registers needed by the code. Also, I noticed that the execution configuration affects not only run-time performance but also the correctness of the results. I learned that I needed to serialize the kernel calls: after regression-testing against the 100%-sure CPU results, the error without serialization was higher than machine precision alone would produce. I think some memory was somehow being overwritten inside the GPU, and I am very confident that I correctly allocated and released memory. It seems that one has to be very careful and gentle when implementing algorithms on a GPU.
Floating-point results can differ slightly between systems and even between compilers because of the floating-point model: compile with a different CPU compiler and you can get different results. The difference should not be huge, though. The difference may also come from a numerically unstable algorithm. From what you say, I suggest reading the programming guide; you have missed some important information about CUDA.
Floating-point computations aren’t black magic: cards with compute capability 2.0+ are IEEE 754 compliant for both single and double precision. That said, your results may still differ slightly from the CPU (x87 FPUs, for example, use 80-bit double-extended precision internally by default).
How much error were you getting?
Without serialization, the error is about 1e-6; with serialization, it is on the order of 1e-15. The 1e-15 error can be attributed to differences in how GPUs and CPUs compute, but I think the 1e-6 error is caused by memory being overwritten somewhere inside the GPU. I am OK with serializing the kernel execution, as long as the error remains within 1e-15 after regression testing against the 100%-sure CPU code.
You can find out how many registers a kernel uses with the flags -Xptxas -v when compiling. The CUDA profiler also gives this information. Multiply the number of registers per thread by the number of threads per block to find how many registers a block needs.
Thank you. The -Xptxas -v flag worked: the kernel uses 52 registers per thread. Since Tesla C2075 has only 32768 registers/block, this explains why the GPU code works only up to 608 threads/block; adding another warp of 32 threads exceeds the 32768 limit. The first half of the mystery is solved. I will investigate the possible memory overwriting further.
Since the GPU code is part of a large Matlab-based simulation, I could not use the CUDA profiler.
Thanks for the info.