// One thread per element; each thread repeats the same add NUM_ITERATIONS_IN_KERNEL times.
__global__ void addKernel(int* c, const int* a, const int* b)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < NUM_ITERATIONS_IN_KERNEL; i++)
    {
        c[index] = a[index] + b[index];
    }
}
I also tried other, more complex kernels that still use only int32 arithmetic. Same result.
I also compiled for sm_89, sm_100, sm_101, and sm_120. Strangely, sm_89 gave the best result.
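For anyone who wants to reproduce this, a minimal timing harness around the kernel above might look like the sketch below. It is not the code behind the measurements in this post: the value of NUM_ITERATIONS_IN_KERNEL, the helper name timeAddKernel, the problem size, the block size, and the file name in the build comment are all assumptions made for illustration.

```cuda
// Build for one target at a time, e.g. (file name is hypothetical):
//   nvcc -arch=sm_89 add_bench.cu -o add_bench
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed value; the post does not state the real one.
#define NUM_ITERATIONS_IN_KERNEL 1000

__global__ void addKernel(int* c, const int* a, const int* b)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < NUM_ITERATIONS_IN_KERNEL; i++)
    {
        c[index] = a[index] + b[index];
    }
}

// Hypothetical helper: runs addKernel over n ints and returns the elapsed
// kernel time in milliseconds, or -1.0f on a CUDA error or a wrong result.
// n must be a multiple of threadsPerBlock because the kernel has no bounds check.
float timeAddKernel(int n, int threadsPerBlock)
{
    if (n <= 0 || threadsPerBlock <= 0 || n % threadsPerBlock != 0) return -1.0f;

    std::vector<int> ha(n), hb(n), hc(n, 0);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2 * i; }

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    // Allocate device buffers, upload inputs, and create the timing events.
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMalloc(&dc, bytes) == cudaSuccess
           && cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok)
    {
        // Time only the kernel, not the transfers.
        cudaEventRecord(start);
        addKernel<<<n / threadsPerBlock, threadsPerBlock>>>(dc, da, db);
        cudaEventRecord(stop);
        ok = cudaGetLastError() == cudaSuccess
          && cudaEventSynchronize(stop) == cudaSuccess
          && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess
          && cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Check every element against the host-side sum.
    for (int i = 0; ok && i < n; i++) ok = (hc[i] == ha[i] + hb[i]);

    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return ok ? ms : -1.0f;
}

int main()
{
    float ms = timeAddKernel(1 << 20, 256);
    if (ms < 0.0f) { printf("run failed\n"); return 1; }
    printf("addKernel: %.3f ms\n", ms);
    return 0;
}
```

Rebuilding the same file with a different -arch value and comparing the printed times is one way to check the per-architecture difference. A single run includes warm-up cost, so calling timeAddKernel several times and ignoring the first result is probably a fairer comparison.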