Hi, I am running into a strange situation. I have a CUDA kernel that uses many arrays in shared memory and many automatic variables. Say A is a float array in shared memory and x is an automatic variable of type float. Both A and x are used many times during the execution of the kernel.

My application takes 3 seconds to complete. If I add a single assignment statement, A[???] = x, at the end of the kernel, the runtime increases to 8 seconds. However, if I instead add A[???] = 1.234 or float y = x * 2.345, the runtime stays at 3 seconds.
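
To make the three cases concrete, here is a stripped-down sketch of where the extra statement goes. It is not the real kernel: the names (kernel, in, out, N, VARIANT), the sizes, the loop body, and the index i that stands in for the elided A[???] index are all placeholders chosen for illustration, and the sketch is not claimed to reproduce the 3 second and 8 second timings.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Selects the statement appended at the end of the kernel (hypothetical switch):
//   0 = nothing, 1 = A[i] = x, 2 = A[i] = 1.234, 3 = float y = x * 2.345
#ifndef VARIANT
#define VARIANT 1
#endif

#define N 256  // placeholder block size and shared array length

__global__ void kernel(const float *in, float *out)
{
    __shared__ float A[N];            // stands in for the many shared arrays
    int i = threadIdx.x;              // placeholder for the elided index
    A[i] = in[blockIdx.x * N + i];
    __syncthreads();

    float x = 0.0f;                   // stands in for the automatic variables
    for (int k = 0; k < N; ++k)
        x += A[k] * A[(i + k) % N];   // A and x are both used many times

    out[blockIdx.x * N + i] = A[i];   // placeholder for the real output

    // The single statement added at the end of the kernel.
#if VARIANT == 1
    A[i] = x;                         // case reported as slow
#elif VARIANT == 2
    A[i] = 1.234;                     // case reported as fast
#elif VARIANT == 3
    float y = x * 2.345;              // case reported as fast
    (void)y;
#endif
}

int main()
{
    const int blocks = 1024;          // placeholder grid size
    const size_t bytes = blocks * N * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one launch with CUDA events.
    cudaEventRecord(start);
    kernel<<<blocks, N>>>(in, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("variant %d: %.3f ms (%s)\n", VARIANT, ms,
           cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Each case can be built separately, for example with nvcc -DVARIANT=1 sketch.cu (the file name is arbitrary), so the timings of the variants can be compared one at a time.
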
Could anyone please help me? Any comments are welcome. Thanks a lot in advance.