Question about cudaMalloc Behavior When Exceeding Physical VRAM on GTX 1070

Hello, I have a question about some CUDA memory allocation behavior I encountered during testing.

Environment

  • WSL2 (32GB allocated)
  • Ubuntu 20.04.6 LTS
  • CUDA 12.2
  • GPU: NVIDIA GTX 1070 (8GB VRAM)

Situation

I wrote a test program that attempts to allocate 28GB of memory with cudaMalloc on a GTX 1070, which has only 8GB of VRAM. In theory this should fail because it exceeds the physical VRAM, yet the allocation succeeds and the program runs to completion.

Code

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Error checking macro
#define CHECK_CUDA(call) do {                                \
    cudaError_t err = call;                                  \
    if (err != cudaSuccess) {                                \
        fprintf(stderr, "CUDA Error: %s (err num=%d) at %s:%d\n", \
                cudaGetErrorString(err), err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE);                                  \
    }                                                        \
} while(0)

// Simple kernel: fill array with a value
__global__ void fillKernel(int* arr, size_t N, int value) {
    // Promote to 64-bit before multiplying: blockIdx.x * blockDim.x is
    // evaluated as a 32-bit unsigned int and overflows once N exceeds
    // 2^32 elements (28GB of ints is ~7.5 billion elements).
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        arr[idx] = value;
    }
}

int main() {
    // Calculate number of ints for 28GB
    const size_t totalBytes = 28ULL * 1024ULL * 1024ULL * 1024ULL; // 28GB
    const size_t N = totalBytes / sizeof(int);

    printf("Attempting to allocate total memory: %.2f GB\n", totalBytes / (1024.0 * 1024.0 * 1024.0));
    printf("Number of array elements: %zu\n", N);

    // Print VRAM info before allocation
    size_t freeMem, totalMem;
    CHECK_CUDA(cudaMemGetInfo(&freeMem, &totalMem));
    printf("Before allocation - Total VRAM: %.2f GB, Free VRAM: %.2f GB\n",
           totalMem / (1024.0 * 1024.0 * 1024.0),
           freeMem / (1024.0 * 1024.0 * 1024.0));

    // Memory allocation
    int* d_arr = nullptr;
    cudaError_t mallocErr = cudaMalloc(&d_arr, totalBytes);
    if (mallocErr != cudaSuccess){
      fprintf(stderr, "cudaMalloc failed! CUDA Error: %s (err num=%d) at %s:%d\n",
            cudaGetErrorString(mallocErr), mallocErr, __FILE__, __LINE__);
        return EXIT_FAILURE;
    }
    printf("cudaMalloc successful!\n");

    // Print VRAM info after allocation
    CHECK_CUDA(cudaMemGetInfo(&freeMem, &totalMem));
    printf("After allocation - Total VRAM: %.2f GB, Free VRAM: %.2f GB\n",
           totalMem / (1024.0 * 1024.0 * 1024.0),
           freeMem / (1024.0 * 1024.0 * 1024.0));

    // Execute fillKernel: write value (1234) to all elements
    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);

    fillKernel<<<grid, block>>>(d_arr, N, 1234);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaDeviceSynchronize());
    printf("Kernel completed initializing all 28GB with 1234!\n");

    // Prepare host array of same size (main memory)
    int* h_arr = (int*)malloc(totalBytes);
    if (!h_arr) {
        fprintf(stderr, "Host memory allocation failed. System memory might be insufficient.\n");
        cudaFree(d_arr);
        return EXIT_FAILURE;
    }

    // Copy from GPU to host
    printf("Attempting to copy all 28GB from GPU to host...\n");
    CHECK_CUDA(cudaMemcpy(h_arr, d_arr, totalBytes, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaDeviceSynchronize());
    printf("cudaMemcpy successful!\n");

    // Verify some samples to check if values are correct
    bool dataValid = true;
    for (int i = 0; i < 10; i++) {
        if (h_arr[i] != 1234) {
            printf("Verification failed: Value at index %d is %d instead of 1234\n", i, h_arr[i]);
            dataValid = false;
            break;
        }
    }
    if (dataValid) {
        printf("Sample verification successful: Memory correctly written with 1234!\n");
    }

    // Free memory
    free(h_arr);
    CHECK_CUDA(cudaFree(d_arr));

    return 0;
}

Questions

  1. I understand that cudaMalloc allocates memory directly in physical VRAM. How is it possible for a 28GB allocation to succeed on a GPU with only 8GB of VRAM?
  • Am I misunderstanding something about how cudaMalloc works?
  • Is there something special happening in the WSL2 environment?
  2. What would nvidia-smi or other monitoring tools show for VRAM usage in this situation?
  • How is the actual memory allocation/usage handled in this case?

I suspect this might be using CPU memory instead of GPU memory. If so:

  3. How can I verify whether the allocation is actually using CPU memory instead of GPU VRAM?
  • Are there specific monitoring tools or commands that can help distinguish this?
  4. Could you recommend any documentation or resources about:
  • CUDA memory allocation behavior in WSL2
  • How to properly monitor actual VRAM usage / CPU memory usage?

I would greatly appreciate a detailed explanation of this behavior. Thank you!

(This is my first post on the NVIDIA Developer Forums, so please let me know if I need to provide any additional information or if there’s anything I should clarify further. Thank you!)

Even though it is WSL2, the memory management behavior is under the control of Windows WDDM. It is possible for WDDM to oversubscribe GPU memory. I’ve not personally witnessed this level of oversubscription, but that may be the explanation.
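
One way to check empirically whether the allocation is actually resident in device memory is to time a kernel that touches the whole buffer and look at the effective bandwidth. Below is a rough sketch, not a definitive test: it reuses the CHECK_CUDA macro, fillKernel, and the grid/block/d_arr/N/totalBytes variables from your program, and is meant to drop in after the cudaMalloc call. On a GTX 1070, device-resident memory should sustain write bandwidth on the order of a couple hundred GB/s; an allocation that WDDM is paging between VRAM and system RAM will typically be limited to something much closer to PCIe transfer rates.

// Rough diagnostic sketch: time a full-buffer fill with CUDA events
// and report the effective write bandwidth. A figure far below the
// GPU's memory bandwidth suggests the buffer is not fully resident
// in VRAM.
cudaEvent_t start, stop;
CHECK_CUDA(cudaEventCreate(&start));
CHECK_CUDA(cudaEventCreate(&stop));

CHECK_CUDA(cudaEventRecord(start));
fillKernel<<<grid, block>>>(d_arr, N, 1234);
CHECK_CUDA(cudaGetLastError());
CHECK_CUDA(cudaEventRecord(stop));
CHECK_CUDA(cudaEventSynchronize(stop));

float ms = 0.0f;
CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
printf("Fill of %.2f GB took %.1f ms -> %.2f GB/s effective\n",
       totalBytes / (1024.0 * 1024.0 * 1024.0), ms,
       (totalBytes / 1e9) / (ms / 1000.0));

Watching the dedicated vs. shared GPU memory graphs in Windows Task Manager while this runs may also help show where the pages actually live, since WDDM manages both pools.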

Since this behavior is entirely unspecified by NVIDIA and CUDA, and not under the control of CUDA, my recommendation would be to not rely on WDDM oversubscription. There are no guarantees of its behavior that I know of.
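
If you want the program to fail fast instead of depending on that behavior, one conservative approach (again just a sketch, reusing the CHECK_CUDA macro from your program in place of the unchecked allocation step) is to query cudaMemGetInfo and refuse any single allocation larger than the reported free device memory. Note that the free figure is only a snapshot and ignores fragmentation, so this is a heuristic, not a guarantee:

// Conservative guard (sketch): query free device memory first and bail
// out on requests that could only be satisfied via oversubscription.
size_t freeMem = 0, totalMem = 0;
CHECK_CUDA(cudaMemGetInfo(&freeMem, &totalMem));
if (totalBytes > freeMem) {
    fprintf(stderr, "Requested %zu bytes but only %zu bytes free on device\n",
            totalBytes, freeMem);
    return EXIT_FAILURE;
}
int* d_arr = nullptr;
CHECK_CUDA(cudaMalloc(&d_arr, totalBytes));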