I am trying to use this kernel for different matrix sizes.

My kernel looks like this:

```
// Host side: launch configuration
int num_blocks = (nac + threads - 1) / threads; // nac = 1900, threads = 32
recon_reduce<<<num_blocks, threads>>>(mtxa_imd, xsha_d, ysha_d, fac_d);

// Compile-time constants
const int threads = 32;
const int iters = 150;  // change iters depending on matrix size
const int nact = 1900;

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float val = 0;
    int index1 = 0;
    int index2 = 0;

    __shared__ float x_val[iters];
    __shared__ float y_val[iters];
    x_val[tid] = x[tid];
    y_val[tid] = y[tid];

    #pragma unroll
    for (int i = 0; i < iters; i++)
    {
        index1 = 2 * nact * i + tid;
        index2 = (2 * i + 1) * nact + tid;
        val += (im[index1] * x_val[i]) + (im[index2] * y_val[i]);
    }
    out[tid] = val;
}
```
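
For reference, the two `__shared__` arrays in this kernel hold `iters` floats each, so the static shared memory per block should come to `2 * iters * sizeof(float)` bytes. A small sketch of that arithmetic, assuming a 4-byte `float` (the helper name `recon_smem_bytes` is made up for illustration and is not part of the original code):

```cuda
#include <cstddef>

// Hypothetical helper: static shared memory (in bytes) that recon_reduce
// declares per block for a given iters, i.e. x_val[iters] plus y_val[iters].
constexpr std::size_t recon_smem_bytes(int iters)
{
    return 2 * static_cast<std::size_t>(iters) * sizeof(float);
}

// Compile-time checks, assuming sizeof(float) == 4.
static_assert(recon_smem_bytes(150) == 1200, "150 iterations -> 1200 bytes");
static_assert(recon_smem_bytes(1750) == 14000, "1750 iterations -> 14000 bytes");
```

These values agree with the `smem` figures in the ptxas output listed under Observations, so the shared-memory cost grows linearly with `iters`.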

Observations:

- For iters = 150, cuda_profiler reports a GPU time of 8.64 and a CPU time of 34, with occupancy 0.167. Timing with cudaEventRecord gives a total GPU time of 0.19990 ms. I also get 294 precision errors, in the range -32 to 32.

  `ptxas info : Used 12 registers, 1200+0 bytes smem, 64 bytes cmem[0]`

- For iters = 438, cuda_profiler reports a GPU time of 6.432 and a CPU time of 32, with occupancy 0.167. Timing with cudaEventRecord gives a total GPU time of 0.19840 ms. I also get 1687 errors; the output values are completely different from the expected values.

  `ptxas info : Used 12 registers, 3504+0 bytes smem, 64 bytes cmem[0]`

- For iters = 584, cuda_profiler reports a GPU time of 8.64 and a CPU time of 34, with occupancy 0.104. Timing with cudaEventRecord gives a total GPU time of 0.19728 ms. I also get 1687 errors; the output values are completely different from the expected values.

  `ptxas info : Used 12 registers, 4672+0 bytes smem, 64 bytes cmem[0]`

- For iters = 1168, cuda_profiler reports a GPU time of 8.672 and a CPU time of 41, with occupancy 0.104. Timing with cudaEventRecord gives a total GPU time of 0.28381 ms. I also get 1687 errors; the output values are completely different from the expected values.

  `ptxas info : Used 12 registers, 9344+0 bytes smem, 64 bytes cmem[0]`

- For iters = 1750, cuda_profiler reports a GPU time of 2794.02 and a CPU time of 2821, with occupancy 0.062. Timing with cudaEventRecord gives a total GPU time of 2.98550 ms. I also get 1735 errors; the output values are completely different from the expected values.

  `ptxas info : Used 12 registers, 14000+0 bytes smem, 64 bytes cmem[0]`
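
The error counts above depend on how the GPU output is compared against the reference. Single-precision results from the GPU and the CPU generally do not match bit for bit (summation order and fused multiply-add can differ), so an exact comparison tends to over-count. A minimal host-side sketch of a tolerance-based count, assuming a CPU reference array is available; the function name and the default tolerances are illustrative choices, not taken from the original code:

```cuda
#include <cmath>
#include <cstddef>

// Hypothetical helper: counts elements where the GPU result differs from the
// reference by more than abs_tol + rel_tol * |ref[i]|.
std::size_t count_mismatches(const float* gpu, const float* ref, std::size_t n,
                             float abs_tol = 1e-5f, float rel_tol = 1e-4f)
{
    std::size_t bad = 0;
    for (std::size_t i = 0; i < n; ++i) {
        float diff  = std::fabs(gpu[i] - ref[i]);
        float limit = abs_tol + rel_tol * std::fabs(ref[i]);
        // Negated comparison so that a NaN in either array counts as a mismatch.
        if (!(diff <= limit)) ++bad;
    }
    return bad;
}
```

Differences that survive a check like this, such as values off by tens, are unlikely to be rounding effects and more likely point to an indexing or synchronization problem in the kernel.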

Questions:

(1) How do I fix the precision errors?

(2) As the number of iterations increases, the kernel produces incorrect results. How do I eliminate these errors?

(3) The code is structured so that the shared memory usage depends on the number of iterations, which may be a poor choice. Any suggestions for using less shared memory at large iteration counts such as 584 and above?
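
One possible direction for (2) and (3), offered as a sketch rather than a verified fix. In the kernel as posted, the shared arrays have `iters` entries but are indexed with the global `tid`, each thread writes only one entry (so a 32-thread block initializes at most 32 of them), there is no `__syncthreads()` between the writes and the reads, and the last block runs past `nact`. The version below stages `x` and `y` through fixed-size tiles, so shared memory stays at `2 * TILE * sizeof(float)` bytes regardless of `iters`. `TILE`, `recon_reduce_tiled`, the host wrapper `recon_reduce_tiled_host`, and passing `nact` and `iters` as kernel arguments are all assumptions of this sketch, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 32;  // threads per block, as in the question
constexpr int TILE = 128;    // hypothetical tile length, independent of iters

__global__ void recon_reduce_tiled(const float* im, const float* x, const float* y,
                                   float* out, int nact, int iters)
{
    __shared__ float x_val[TILE];
    __shared__ float y_val[TILE];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float val = 0.0f;

    // The loop bounds are identical for every thread in the block, so the
    // __syncthreads() calls inside are reached by all threads.
    for (int base = 0; base < iters; base += TILE) {
        int count = min(TILE, iters - base);

        // Cooperative load: shared memory is indexed by threadIdx.x, not by tid.
        for (int j = threadIdx.x; j < count; j += blockDim.x) {
            x_val[j] = x[base + j];
            y_val[j] = y[base + j];
        }
        __syncthreads();  // tile is fully written before any thread reads it

        if (tid < nact) {
            for (int j = 0; j < count; ++j) {
                int i = base + j;
                val += im[2 * nact * i + tid] * x_val[j]
                     + im[(2 * i + 1) * nact + tid] * y_val[j];
            }
        }
        __syncthreads();  // all reads finish before the tile is overwritten
    }

    if (tid < nact) out[tid] = val;  // guard the padding threads of the last block
}

// Hypothetical host wrapper: copies inputs to the device, launches the kernel,
// and copies the nact results back.
std::vector<float> recon_reduce_tiled_host(const std::vector<float>& im,
                                           const std::vector<float>& x,
                                           const std::vector<float>& y,
                                           int nact, int iters)
{
    float *im_d, *x_d, *y_d, *out_d;
    cudaMalloc(&im_d, im.size() * sizeof(float));
    cudaMalloc(&x_d, x.size() * sizeof(float));
    cudaMalloc(&y_d, y.size() * sizeof(float));
    cudaMalloc(&out_d, nact * sizeof(float));
    cudaMemcpy(im_d, im.data(), im.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(x_d, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y.data(), y.size() * sizeof(float), cudaMemcpyHostToDevice);

    int num_blocks = (nact + THREADS - 1) / THREADS;
    recon_reduce_tiled<<<num_blocks, THREADS>>>(im_d, x_d, y_d, out_d, nact, iters);

    std::vector<float> out(nact);
    cudaMemcpy(out.data(), out_d, nact * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(im_d); cudaFree(x_d); cudaFree(y_d); cudaFree(out_d);
    return out;
}
```

With `TILE = 128` the kernel declares 1024 bytes of shared memory whether `iters` is 150 or 1750. Since every thread in a block reads the same `x_val[j]` and `y_val[j]`, constant memory or plain cached global loads might serve equally well; that would need measuring on the target device.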

Thanks in advance :)