I’m getting a very strange behaviour in CUDA. When I run two different kernels A and B with different block sizes sequentially, changing the block size of one of them affects the performance of the other. Both use their own extern shared memory whose size is specified at kernel invocation, similarly to the convolution example.
Any input would be appreciated.
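To illustrate, the launch pattern is roughly the following (kernel names, grid/block sizes, and bodies are made up for the example; only the shape of the launches matters):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel A: uses dynamically sized extern shared memory.
__global__ void kernelA(float *out)
{
    extern __shared__ float smemA[];   // size set at launch time
    smemA[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = smemA[threadIdx.x];
}

// Hypothetical kernel B: its own, independent dynamic shared allocation.
__global__ void kernelB(float *out)
{
    extern __shared__ float smemB[];
    smemB[threadIdx.x] = 2.0f * threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = smemB[threadIdx.x];
}

int main()
{
    float *dA, *dB;
    cudaMalloc(&dA, 256 * 64  * sizeof(float));
    cudaMalloc(&dB, 256 * 128 * sizeof(float));

    // Different block sizes; the shared-memory size is passed as the
    // third launch parameter, as in the SDK convolution example.
    kernelA<<<256, 64,  64  * sizeof(float)>>>(dA);
    kernelB<<<256, 128, 128 * sizeof(float)>>>(dB);
    cudaDeviceSynchronize();

    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

Changing the block size (and the matching shared-memory byte count) of either launch changes the timing of the other one.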
I’ve experienced similar problems, which I believe may be related to running my kernels on the primary video card.
If you run under the emulator, do you get consistent results?
What OS and configuration are you using?