# Operation result depends on number of threads?

Dear all,

I am new to the community and am trying to gain some experience using CUDA for my project.

One typical problem I have to tackle is a matrix-vector multiplication where each row of the matrix has to be multiplied element-wise with a vector.

```cpp
template <typename T>
__global__ void gpu_mat_vec(T *matrix,
                            const T *vector,
                            const int nRows,
                            const int nCols)
{
    // Multiply a matrix in column-major format with a vector:
    // each row of the matrix is multiplied element-wise with the vector.
    size_t y = threadIdx.y + blockIdx.y * blockDim.y;
    while (y < nCols) {
        size_t x = threadIdx.x + blockIdx.x * blockDim.x;
        while (x < nRows) {
            size_t offsetVec = y;
            size_t offsetMat = x + y * nRows;
            matrix[offsetMat] = matrix[offsetMat] * vector[offsetVec];
            x += blockDim.x * gridDim.x;
        }
        y += blockDim.y * gridDim.y;
    }
}
```

The kernel is then launched with two-dimensional grids and blocks. In a serial test (grids = dim3(1,1); threads = dim3(1,1)) the result is what I expect. When I increase the block size, though, the resulting matrix eventually comes back identical to the initial one, so no multiplication is performed anymore. This happens on the order of threads = dim3(50,50).

In my opinion, race conditions cannot be the issue, since every matrix element address is unique.

Is there a flaw in my logic?

I would appreciate any suggestions.

The per-block thread limit is 1024, so (on most cards) you are limited to threads = dim3(32,32). dim3(50,50) requests 50 × 50 = 2500 threads per block, which exceeds that limit.

Beyond that, your kernel will not launch, and the matrix is simply left unchanged.

Did you check cudaGetLastError()?
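For illustration, a hedged host-side sketch of that check (CUDA runtime API; the buffer names, grid/block sizes, and wrapper function are my assumptions, not from the original post). It cannot run without a CUDA-capable device, so treat it as a pattern rather than a tested program:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// d_matrix and d_vector are assumed to be device buffers already filled;
// gpu_mat_vec is the kernel from the question above.
void launch_and_check(double *d_matrix, const double *d_vector,
                      int nRows, int nCols)
{
    dim3 grid(4, 4);
    dim3 block(32, 32);  // 32 * 32 = 1024 threads: at the per-block limit.
    // A block of dim3(50, 50) would request 2500 threads; the launch
    // would then fail silently unless the error is queried explicitly.
    gpu_mat_vec<<<grid, block>>>(d_matrix, d_vector, nRows, nCols);

    cudaError_t err = cudaGetLastError();  // catches invalid launch configs
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel launch failed: %s\n",
                     cudaGetErrorString(err));
}
```

With a dim3(50,50) block, cudaGetLastError() would report an invalid-configuration error instead of cudaSuccess.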

Thank you very much for the answer.

I misunderstood the maxThreads output from the device and assumed the limit of 1024 applied to each dimension, rather than to the block as a whole.

I will use cudaGetLastError() from now on.

Again, thank you very much.