Kernel call inside a for loop produces wrong results

Hi, I want to compute a matrix-vector multiplication inside a for loop, changing the vector data on each iteration and writing the output to a different memory location each time. My code is given below:

__global__ void MatVectMultiplication(float *device_Mat, float *device_Vect, int matRowSize, int vlength, float *device_ResVect)
{
    // Flatten the 2D thread coordinates into a single row index.
    // Note: BLOCKSIZE is not defined in the posted code.
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    int tindex = tidx + gridDim.x * BLOCKSIZE * tidy;
    if (tindex < matRowSize)
    {
        int i;
        int m = tindex * vlength;  // offset of this thread's matrix row
        device_ResVect[tindex] = 0.00;
        for (i = 0; i < vlength; i++)
        {
            device_ResVect[tindex] += device_Mat[m + i] * device_Vect[i];
        }
    }
    __syncthreads();
}  // end of MatVect device function

Kernel call:
dim3 blockSize(16, 16);
dim3 gridSize(1, 1);

for (int i_l = 0; i_l <= 3; i_l += iste)
   {
      MatVectMultiplication<<<gridSize,blockSize>>>(d_B, &d_A[i_l], 256, 256, &d_C[i_l*256], i_l);
       cudaErrCheck(cudaThreadSynchronize());
   }

cudaErrCheck(cudaMemcpy(C_lp, d_C, 256 * 3 * sizeof(float), cudaMemcpyDeviceToHost));

Only the first 256 values match the results of the CPU function; all the remaining GPU outputs come back as zeros.

Can anyone give some clues?

  1. Please format your code correctly. You can edit your post. The way to format code is not to put > at the beginning of each line. There are several methods, one is to select your code in the edit box then click the </> button at the top of the edit box.

  2. I generally recommend that people who are asking for help with a debugging issue provide a complete code. What you have posted here is not a complete code.

  3. It should be evident that your kernel code design calculates exactly one output value per thread. Therefore the number of computed output values cannot exceed the number of threads you launch. Since you are launching one block of 16x16 threads, it's not surprising that your results only contain 16x16=256 computed output values.
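
To illustrate point 3, here is a minimal sketch of a one-thread-per-row kernel together with a launch size derived from the row count. It is not the poster's code: the names `matVecRowPerThread`, `blocksFor` and `launchMatVec`, the 1D indexing, and the choice of 256 threads per block are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>

// One thread computes one output value:
//   y[row] = sum over j of A[row * cols + j] * x[j]
// So a launch can produce at most (blocks * threadsPerBlock) outputs.
__global__ void matVecRowPerThread(const float *A, const float *x,
                                   int rows, int cols, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {                 // threads past the last row do nothing
        float sum = 0.0f;             // accumulate in a register
        for (int j = 0; j < cols; ++j)
            sum += A[row * cols + j] * x[j];
        y[row] = sum;
    }
}

// Smallest block count such that blocks * threadsPerBlock >= rows.
inline int blocksFor(int rows, int threadsPerBlock)
{
    return (rows + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical launch helper: sizes the grid from the number of rows.
// dA, dx and dy are assumed to be valid device pointers.
inline cudaError_t launchMatVec(const float *dA, const float *dx,
                                int rows, int cols, float *dy)
{
    const int threads = 256;          // assumed block size
    matVecRowPerThread<<<blocksFor(rows, threads), threads>>>(dA, dx, rows, cols, dy);
    return cudaGetLastError();        // reports launch configuration errors
}
```

With 256 rows this launches a single block, which matches the 256 outputs seen in the question; 1024 rows would need four blocks of 256 threads.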

Hi @Robert_Crovella, thank you for your suggestions; I will follow them.

Regarding the third point: I agree that a single kernel call produces only 256 values. However, I am calling the kernel inside a for loop with different input and output pointer offsets. After the loop completes, the output buffer should contain 256*4 values (the loop iterates 4 times), but I am getting only 256 values.

for (int i_l = 0; i_l <= 3; i_l += iste)
   {
      MatVectMultiplication<<<gridSize,blockSize>>>(d_B, &d_A[i_l], 256, 256, &d_C[i_l*256], i_l);
       cudaErrCheck(cudaThreadSynchronize());
   }

I understood my mistake; it is working fine now.
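
The thread does not say what the mistake was. For anyone landing here with the same symptom, the posted snippet has a few details worth checking: the kernel is declared with five parameters but called with six, the vector pointer advances by `i_l` elements while the output pointer advances by `i_l*256`, and the final copy transfers `256 * 3` floats although the loop appears to run four times. Below is a minimal sketch of a consistent version. It assumes the vectors are stored back to back with `n` floats each; the names `matVecOnePerRow` and `multiplyEachVector` are hypothetical, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per matrix row: out[row] = dot(mat[row, :], vec).
__global__ void matVecOnePerRow(const float *mat, const float *vec,
                                int rows, int cols, float *out)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j)
            sum += mat[row * cols + j] * vec[j];
        out[row] = sum;
    }
}

// Hypothetical host helper (assumed layout): hB is an n x n row-major matrix,
// hA holds numVecs vectors of n floats stored contiguously. Returns the
// numVecs result vectors stored contiguously. CUDA return codes are not
// checked here; real code should check every call.
std::vector<float> multiplyEachVector(const std::vector<float> &hB,
                                      const std::vector<float> &hA,
                                      int n, int numVecs)
{
    const size_t matBytes = size_t(n) * n * sizeof(float);
    const size_t vecBytes = size_t(numVecs) * n * sizeof(float);

    float *dB = nullptr, *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dB, matBytes);
    cudaMalloc(&dA, vecBytes);
    cudaMalloc(&dC, vecBytes);
    cudaMemcpy(dB, hB.data(), matBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dA, hA.data(), vecBytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < numVecs; ++k) {
        // Input and output pointers advance by the same stride, n floats.
        matVecOnePerRow<<<blocks, threads>>>(dB, dA + k * n, n, n, dC + k * n);
    }
    // cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() replaces it.
    cudaDeviceSynchronize();

    // Copy back every iteration's output: numVecs * n floats in total.
    std::vector<float> hC(size_t(numVecs) * n);
    cudaMemcpy(hC.data(), dC, vecBytes, cudaMemcpyDeviceToHost);

    cudaFree(dB);
    cudaFree(dA);
    cudaFree(dC);
    return hC;
}
```

The two points the sketch tries to make are that the input and output offsets use the same stride, and that the size passed to the final `cudaMemcpy` covers every loop iteration rather than one fewer.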