Does the operation result depend on the number of threads?

Dear all,

I am new to the community and am trying to gain some experience using CUDA for my project.

One typical problem I have to tackle is a matrix-vector multiplication where each row of the matrix has to be multiplied element-wise with a vector.

My solution reads as follows:

template<typename T>
__global__ void gpu_mat_vec (T *matrix,
                             const T *vector,
                             const int nRows,
                             const int nCols)
{
   // Multiply a matrix in column-major format with a vector:
   // each row of the matrix is multiplied element-wise with the vector.
   // Grid-stride loops let a fixed launch configuration cover any size.
   size_t y = threadIdx.y + blockIdx.y * blockDim.y;
   while (y < nCols)  {
      size_t x = threadIdx.x + blockIdx.x * blockDim.x;
      while (x < nRows)  {
         size_t offsetVec = y;
         size_t offsetMat = x + y * nRows;   // column-major indexing
         matrix[offsetMat] *= vector[offsetVec];
         x += blockDim.x * gridDim.x;
      }
      y += blockDim.y * gridDim.y;
   }
}

The call then uses two-dimensional grids and blocks. In a serial test (grids = dim3(1,1); threads = dim3(1,1)) the result is what I expect. But when I increase the thread count, at some point the resulting matrix is identical to the initial one, so no multiplication is performed anymore. This happens around threads = dim3(50,50).
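For context, the call looks roughly like this (a sketch; d_matrix and d_vector stand for the device pointers, which are not shown here, and float stands in for T):

   dim3 threads(50, 50);   // 2500 threads per block -- this is where it breaks
   dim3 grids((nRows + threads.x - 1) / threads.x,
              (nCols + threads.y - 1) / threads.y);
   gpu_mat_vec<float><<<grids, threads>>>(d_matrix, d_vector, nRows, nCols);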

In my opinion race conditions cannot be an issue, since every matrix element address is unique.

Is there a flaw in my logic?

I would appreciate any suggestions.

The per-block thread limit is 1024, so you are (on most cards) limited to threads = dim3(32,32).

After that, your kernel will not launch.

Did you check cudaGetLastError()?
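Something along these lines after the launch (a sketch; the launch arguments are just placeholders):

   gpu_mat_vec<float><<<grids, threads>>>(d_matrix, d_vector, nRows, nCols);

   // Catches launch-configuration errors such as too many threads per block.
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess)
      printf("Launch failed: %s\n", cudaGetErrorString(err));

   // Synchronizing catches errors that only surface during execution.
   err = cudaDeviceSynchronize();
   if (err != cudaSuccess)
      printf("Execution failed: %s\n", cudaGetErrorString(err));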

Thank you very much for the answer.

I misunderstood the maxThreads output from the device query: I took 1024 to be the limit in each dimension, when it is actually the limit on the total number of threads per block.
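For anyone else who trips over this, the two values can be read back like so (a sketch querying device 0):

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);   // device 0; adjust as needed

   // Limit on the total number of threads per block (typically 1024).
   printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

   // Per-dimension limits; the product must still not exceed the total.
   printf("maxThreadsDim: %d x %d x %d\n",
          prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);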

I will use cudaGetLastError() from now on.

Again, thank you very much.