CUDA program causes the Windows desktop to freeze (or stop rendering)

I implemented a program for matrix multiplication using CUDA. Everything works fine when the matrix is small, but once the matrix size reaches the order of 1000*1000, the desktop freezes for 3 to 5 seconds when the program finishes. No error message is shown.

After some searching I worked around the problem by increasing the Windows TDR limit from 2 seconds to 4 seconds. But I don't think the TDR limit is the real cause, since the running time of the program is much less than 2 seconds and so should not reach the limit.

So, what is the real cause of this problem?

My graphics card is an NVIDIA GeForce RTX 3070 Laptop GPU (used for both display and CUDA computing).
My CUDA version is 11.4.
My OS is Windows 11.

My CUDA code is as follows:

template<typename T>
__global__ void mat_mul_kernel(const T *a, const T *b, T *c, size_t n) {
    __shared__ T as[tiled_2d][tiled_2d];
    __shared__ T bs[tiled_2d][tiled_2d];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int row = bx * tiled_2d + tx;
    int col = by * tiled_2d + ty;
    int N = (n + tiled_2d - 1) / tiled_2d;
    float value = 0;
    for (int i = 0; i < N; i++) {
        int aR = row; int aC = i * tiled_2d + ty;
        int bR = i * tiled_2d + tx; int bC = col;
        if (aR < n && aC < n && bR < n && bC < n) {
            as[tx][ty] = a[aR * n + aC];
            bs[ty][tx] = b[bR * n + bC];
        } else {
            as[tx][ty] = 0;
            bs[ty][tx] = 0;
        }
        for (int k = 0; k < tiled_2d; k++) {
            value += as[tx][k] * bs[ty][k];
        }
    }
    if (row < n && col < n)
        c[row * n + col] = value;
}

template<typename T>
void mat_mul(const T *a, const T *b, T *c, size_t n) {
    T *d_a, *d_b, *d_c;
    size_t ds = n * n * sizeof(T); // data size
    cudaMalloc((void **)&d_a, n * n);
    cudaMemcpy(d_a, a, ds, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_b, n * n);
    cudaMemcpy(d_b, b, ds, cudaMemcpyHostToDevice);

    cudaMalloc((void **)&d_c, n * n);

    dim3 dimGrid((unsigned int)ceil(n / tiled_2d), (unsigned int)ceil(n / tiled_2d), 1);
    dim3 dimBLock(tiled_2d, tiled_2d, 1);
    mat_mul_kernel<T><<<dimGrid, dimBLock>>>(d_a, d_b, d_c, n);
    // kernel

    cudaMemcpy(c, d_c, ds, cudaMemcpyDeviceToHost);
}

I suggest using proper CUDA error checking. Google "proper CUDA error checking", take the first hit, and apply it to your code.
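As a rough illustration of what such error checking usually looks like, here is a minimal sketch. The macro name `CUDA_CHECK` and the kernel `fill_kernel` are hypothetical names made up for this example and are not taken from the posted code. The runtime calls themselves (`cudaGetErrorString`, `cudaGetLastError`, `cudaDeviceSynchronize`) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is illustrative): abort with file and
// line information if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,   \
                         __LINE__, cudaGetErrorString(err_));             \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Illustrative kernel: writes one value into every element of out.
__global__ void fill_kernel(float *out, size_t count, float value) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) out[i] = value;
}

int main() {
    const size_t count = 1000 * 1000;
    float *d_out = nullptr;

    // Every runtime call is wrapped, so a failure is reported where it occurs.
    CUDA_CHECK(cudaMalloc((void **)&d_out, count * sizeof(float)));

    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((count + threads - 1) / threads);
    fill_kernel<<<blocks, threads>>>(d_out, count, 1.0f);

    CUDA_CHECK(cudaGetLastError());       // reports kernel launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors during execution

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Kernel launches do not return a status themselves, which is why the sketch checks `cudaGetLastError()` right after the launch and `cudaDeviceSynchronize()` afterwards.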

You are not computing the sizes correctly for your cudaMalloc operations.

This is correct:

    size_t ds = n * n * sizeof(T); // data size

This is not:

    cudaMalloc((void **)&d_a, n * n);

cudaMalloc, like malloc, takes a size in bytes.
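To make the size mismatch concrete, here is a small host-only sketch that compiles as plain C++ (and therefore as CUDA host code). The helper names `matrix_bytes` and `element_count` are hypothetical and exist only for this illustration. It assumes `T` is `float`, which the thread does not state.

```cpp
#include <cstddef>

// Bytes needed to hold an n-by-n matrix of T. This is the value
// cudaMalloc expects as its size argument.
template <typename T>
constexpr std::size_t matrix_bytes(std::size_t n) {
    return n * n * sizeof(T);
}

// What the posted code passes to cudaMalloc: an element count, not bytes.
constexpr std::size_t element_count(std::size_t n) {
    return n * n;
}

// For n = 1000 and T = float (an assumption), the allocation holds only a
// quarter of the data, so the copies and the kernel run past its end.
static_assert(element_count(1000) == 1000000, "element count for n = 1000");
static_assert(matrix_bytes<float>(1000) == 4000000, "byte count for n = 1000");
```

With such a helper, the allocation would read `cudaMalloc((void **)&d_a, matrix_bytes<T>(n))`, which equals the `ds` value the posted code already passes to `cudaMemcpy`.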

You could also study a CUDA sample code like vectorAdd and spot your error.

In addition, you haven't defined tiled_2d anywhere in the code you have shown. It's inconvenient to work on problems like this without a complete example. I'll assume that tiled_2d is some sort of constant definition. In that case, this construct won't do what you want:

    ceil(n / tiled_2d)

That isn’t the source of the problem you’re describing, however.
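The reason the `ceil` construct falls short is that, with integral operands, `n / tiled_2d` is an integer division, so the fraction is already gone before `ceil` sees it. A minimal host-only sketch of the difference follows. The value 16 for `tiled_2d` is an assumption chosen for illustration, and the function names are hypothetical. The rounding-up form is the same idiom the posted kernel already uses to compute `N`.

```cpp
#include <cstddef>

// Assumption: the tile width is an integral compile-time constant.
// The value 16 is illustrative; the thread never shows the real one.
constexpr std::size_t tiled_2d = 16;

// Mirrors the posted expression: integer division discards the remainder,
// so wrapping the result in ceil() cannot round it back up.
constexpr std::size_t truncated_blocks(std::size_t n) {
    return n / tiled_2d;
}

// Integer ceiling division: rounds up without going through floating point.
constexpr std::size_t ceil_div_blocks(std::size_t n) {
    return (n + tiled_2d - 1) / tiled_2d;
}

// n = 1000 needs 63 tiles of width 16, but truncation yields only 62,
// leaving the last partial tile of rows and columns uncomputed.
static_assert(truncated_blocks(1000) == 62, "truncating division");
static_assert(ceil_div_blocks(1000) == 63, "ceiling division");
// When n is a multiple of the tile width, both forms agree.
static_assert(ceil_div_blocks(1024) == 64, "exact multiple");
```

The grid could then be built as `dim3 dimGrid((unsigned int)ceil_div_blocks(n), (unsigned int)ceil_div_blocks(n), 1)`.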

I also generally suggest that people on Windows who are evaluating code performance (or who just want a better chance of avoiding the WDDM TDR) be sure to build a release project, not a debug project.


Thank you very much for your answer. I fixed the incorrect use of cudaMalloc and then changed the TDR limit back to 2 seconds. After that, everything works well.
