Dear all,

I am new to the community and am trying to gain some experience using CUDA for my project.

One typical problem I have to tackle is a matrix-vector operation where each row of the matrix has to be multiplied element-wise with a vector (one vector entry per column).

My solution reads as follows:

```
template <typename T>
__global__ void gpu_mat_vec(T *matrix,
                            const T *vector,
                            const int nRows,
                            const int nCols)
{
    // Multiply a matrix in column-major format with a vector:
    // each row of the matrix is multiplied element-wise with the vector.
    // Grid-stride loops cover matrices larger than the launched grid.
    size_t y = threadIdx.y + blockIdx.y * blockDim.y;
    while (y < nCols) {
        size_t x = threadIdx.x + blockIdx.x * blockDim.x;
        while (x < nRows) {
            size_t offsetVec = y;              // one vector entry per column
            size_t offsetMat = x + y * nRows;  // column-major indexing
            matrix[offsetMat] = matrix[offsetMat] * vector[offsetVec];
            x += blockDim.x * gridDim.x;       // grid stride in x
        }
        y += blockDim.y * gridDim.y;           // grid stride in y
    }
}
```

The call then uses two-dimensional grids and blocks. In a serial test (grids = dim3(1,1); threads = dim3(1,1)) the result is what I expect. But when I increase the block size, at some point the resulting matrix comes back identical to the initial one, i.e. no multiplication is performed anymore. This happens on the order of threads = dim3(50,50).
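For reference, the host-side call is along these lines (a minimal sketch, not my exact code; `d_matrix` and `d_vector` stand for the device pointers, and the launch configuration shown is illustrative):

```
dim3 grids(8, 8);
dim3 threads(16, 16);  // illustrative block size
gpu_mat_vec<<<grids, threads>>>(d_matrix, d_vector, nRows, nCols);

// Check whether the launch itself succeeded:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
```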

In my opinion race conditions cannot be an issue, since every matrix element is written by exactly one thread at a unique address.

Is there a flaw in my logic?

I would appreciate any suggestions.