Does the operation result depend on the number of threads?

Dear all,

I am new to the community and am trying to gain some experience using CUDA for my project.

One typical problem I have to tackle is a matrix-vector multiplication where each row of the matrix has to be multiplied element-wise with a vector.

My solution reads as follows:

template<typename T>
__global__ void gpu_mat_vec (T *matrix,
                             const T *vector,
                             const int nRows,
                             const int nCols)
{
   // Multiply a matrix in column-major format with a vector:
   // each row of the matrix is multiplied element-wise with the vector.
   // Grid-stride loops let a fixed launch configuration cover any size.
   size_t y = threadIdx.y + blockIdx.y * blockDim.y;
   while (y < nCols)  {
      size_t x = threadIdx.x + blockIdx.x * blockDim.x;
      while (x < nRows)  {
         size_t offsetVec = y;
         size_t offsetMat = x + y * nRows;   // column-major indexing
         matrix[offsetMat] *= vector[offsetVec];
         x += blockDim.x * gridDim.x;
      }
      y += blockDim.y * gridDim.y;
   }
}

The call then uses two-dimensional grids and blocks. In a serial test (grids = dim3(1,1); threads = dim3(1,1)) the result is what I expect. But when I increase the thread count, at some point the resulting matrix is identical to the initial one, so no multiplication is performed anymore. This happens around threads = dim3(50,50).
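For context, the call looks roughly like this (a sketch; d_matrix and d_vector stand for the device pointers, which are not shown here, and float stands in for T):

   dim3 threads(50, 50);   // 2500 threads per block -- this is where it breaks
   dim3 grids((nRows + threads.x - 1) / threads.x,
              (nCols + threads.y - 1) / threads.y);
   gpu_mat_vec<float><<<grids, threads>>>(d_matrix, d_vector, nRows, nCols);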

In my opinion race conditions cannot be an issue, since every matrix element address is unique.

Is there a flaw in my logic?

I would appreciate any suggestions.

The per-block thread limit is 1024, so you are (on most cards) limited to threads = dim3(32,32).

After that, your kernel will not launch.

Did you check cudaGetLastError()?
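Something along these lines after the launch (a sketch; the launch arguments are just placeholders):

   gpu_mat_vec<float><<<grids, threads>>>(d_matrix, d_vector, nRows, nCols);

   // Catches launch-configuration errors such as too many threads per block.
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess)
      printf("Launch failed: %s\n", cudaGetErrorString(err));

   // Synchronizing catches errors that only surface during execution.
   err = cudaDeviceSynchronize();
   if (err != cudaSuccess)
      printf("Execution failed: %s\n", cudaGetErrorString(err));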

Thank you very much for the answer.

I misunderstood the maxThreads output from the device query: I took 1024 to be the limit in each dimension, when it is actually the limit on the total number of threads per block.
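For anyone else who trips over this, the two values can be read back like so (a sketch querying device 0):

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);   // device 0; adjust as needed

   // Limit on the total number of threads per block (typically 1024).
   printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

   // Per-dimension limits; the product must still not exceed the total.
   printf("maxThreadsDim: %d x %d x %d\n",
          prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);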

I will use cudaGetLastError() from now on.

Again, thank you very much.