I have noticed that my kernel reports a different execution time on almost every run. I am using CUDA events for timing. The variation is sometimes as large as 5 to 10%. Any idea why this happens?
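For reference, here is a minimal sketch of event-based timing with one warm-up launch and several repeated runs, which makes the run-to-run spread visible. The kernel `myKernel`, the problem size, and the run count are placeholders chosen for illustration, not the actual code behind the numbers above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;        // assumed problem size
    const int threads = 256;      // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const int runs = 20;          // assumed number of timed runs

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not included in the timings.
    myKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    float minMs = 1e30f, maxMs = 0.0f, sumMs = 0.0f;
    for (int r = 0; r < runs; ++r) {
        cudaEventRecord(start);
        myKernel<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the stop event has completed

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < minMs) minMs = ms;
        if (ms > maxMs) maxMs = ms;
        sumMs += ms;
    }

    printf("min %.4f ms, max %.4f ms, mean %.4f ms over %d runs\n",
           minMs, maxMs, sumMs / runs, runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Reporting the minimum, maximum, and mean over several runs shows whether the 5 to 10% spread persists after the warm-up or comes mostly from the first launch.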
Also, when I increased the amount of shared memory by a factor of 5 to 10, performance degraded drastically, by as much as a factor of 300. I did not get any error saying that there was insufficient shared memory. I am wondering why an increased amount of shared memory would degrade performance. Could anybody tell me why this happens?
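One way to narrow this down might be to ask the runtime how many blocks can be resident per multiprocessor for different shared memory sizes, and to check the launch explicitly for errors. The sketch below assumes a hypothetical kernel `smemKernel` that uses dynamically sized shared memory. The sizes in the table are arbitrary examples, and whether reduced occupancy explains the slowdown described above is only a guess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using dynamically sized shared memory.
__global__ void smemKernel(float *out) {
    extern __shared__ float buf[];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int threads = 256;   // assumed block size
    const int blocks = 4;      // assumed grid size

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory limit per block: %zu bytes\n", prop.sharedMemPerBlock);

    // Static resource usage of the kernel; a nonzero local size hints at spilling.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, smemKernel);
    printf("registers/thread: %d, static shared: %zu bytes, local: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);

    // Arbitrary example sizes: resident blocks per SM for each dynamic allocation.
    const size_t sizes[] = {1024, 4096, 8192, 16384, 32768};
    for (size_t s : sizes) {
        int blocksPerSM = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, smemKernel, threads, s);
        printf("%6zu bytes shared -> %d resident blocks per SM (%s)\n",
               s, blocksPerSM, cudaGetErrorString(err));
    }

    // Launch with explicit error checks; a failed launch is otherwise silent.
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) return 1;
    smemKernel<<<blocks, threads, threads * sizeof(float)>>>(d_out);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    printf("launch: %s, execution: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));

    cudaFree(d_out);
    return 0;
}
```

If the number of resident blocks per multiprocessor drops sharply at the larger sizes, or the launch check reports an error, that would at least point to where the difference comes from.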
Thanks in advance