Hello, I have a question about CUDA memory allocation behavior that I’ve encountered during testing.
Environment
- WSL2 (32GB of RAM allocated to the WSL2 VM)
- Ubuntu 20.04.6 LTS
- CUDA 12.2
- GPU: NVIDIA GTX 1070 (8GB VRAM)
Situation
I wrote test code that attempts to allocate 28GB of memory using cudaMalloc on a GTX 1070, which has only 8GB of VRAM. I expected this to fail because the request exceeds the physical VRAM, but the allocation appears to succeed and the rest of the program runs normally.
Code
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
// Error checking macro
#define CHECK_CUDA(call) do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA Error: %s (err num=%d) at %s:%d\n", \
cudaGetErrorString(err), err, __FILE__, __LINE__); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Simple kernel: fill array with a value
__global__ void fillKernel(int* arr, size_t N, int value) {
    // Cast to size_t before multiplying: blockIdx.x * blockDim.x alone is a
    // 32-bit product and would wrap around for an array this large.
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        arr[idx] = value;
    }
}
int main() {
// Calculate number of ints for 28GB
const size_t totalBytes = 28ULL * 1024ULL * 1024ULL * 1024ULL; // 28GB
const size_t N = totalBytes / sizeof(int);
printf("Attempting to allocate total memory: %.2f GB\n", totalBytes / (1024.0 * 1024.0 * 1024.0));
printf("Number of array elements: %zu\n", N);
// Print VRAM info before allocation
size_t freeMem, totalMem;
CHECK_CUDA(cudaMemGetInfo(&freeMem, &totalMem));
printf("Before allocation - Total VRAM: %.2f GB, Free VRAM: %.2f GB\n",
totalMem / (1024.0 * 1024.0 * 1024.0),
freeMem / (1024.0 * 1024.0 * 1024.0));
// Memory allocation
int* d_arr = nullptr;
cudaError_t mallocErr = cudaMalloc(&d_arr, totalBytes);
if (mallocErr != cudaSuccess){
fprintf(stderr, "cudaMalloc failed! CUDA Error: %s (err num=%d) at %s:%d\n",
cudaGetErrorString(mallocErr), mallocErr, __FILE__, __LINE__);
return EXIT_FAILURE;
}
printf("cudaMalloc successful!\n");
// Print VRAM info after allocation
CHECK_CUDA(cudaMemGetInfo(&freeMem, &totalMem));
printf("After allocation - Total VRAM: %.2f GB, Free VRAM: %.2f GB\n",
totalMem / (1024.0 * 1024.0 * 1024.0),
freeMem / (1024.0 * 1024.0 * 1024.0));
// Execute fillKernel: write value (1234) to all elements
dim3 block(256);
dim3 grid((N + block.x - 1) / block.x);
fillKernel<<<grid, block>>>(d_arr, N, 1234);
CHECK_CUDA(cudaGetLastError());
CHECK_CUDA(cudaDeviceSynchronize());
printf("Kernel completed initializing all 28GB with 1234!\n");
// Prepare host array of same size (main memory)
int* h_arr = (int*)malloc(totalBytes);
if (!h_arr) {
fprintf(stderr, "Host memory allocation failed. System memory might be insufficient.\n");
cudaFree(d_arr);
return EXIT_FAILURE;
}
// Copy from GPU to host
printf("Attempting to copy all 28GB from GPU to host...\n");
CHECK_CUDA(cudaMemcpy(h_arr, d_arr, totalBytes, cudaMemcpyDeviceToHost));
CHECK_CUDA(cudaDeviceSynchronize());
printf("cudaMemcpy successful!\n");
// Verify some samples to check if values are correct
bool dataValid = true;
for (int i = 0; i < 10; i++) {
if (h_arr[i] != 1234) {
printf("Verification failed: Value at index %d is %d instead of 1234\n", i, h_arr[i]);
dataValid = false;
break;
}
}
if (dataValid) {
printf("Sample verification successful: Memory correctly written with 1234!\n");
}
// Free memory
free(h_arr);
CHECK_CUDA(cudaFree(d_arr));
return 0;
}
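For reference, I compiled and ran the test roughly like this (alloc_test.cu is just the file name I used, and -arch=sm_61 is what I picked for the GTX 1070's compute capability 6.1):
nvcc -O2 -arch=sm_61 -o alloc_test alloc_test.cu
./alloc_test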
Questions
- My understanding is that cudaMalloc allocates memory directly in physical VRAM. How can a 28GB allocation succeed on a GPU with only 8GB of VRAM?
- Am I misunderstanding something about how cudaMalloc works?
- Is there something special happening in the WSL2 environment?
- What would nvidia-smi or other monitoring tools show for VRAM usage in this situation? (The command I have been using to watch it is shown right after this list.)
- How is the actual memory allocation/usage handled in this case?
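So far the only monitoring I have tried is watching nvidia-smi from a second terminal while the test runs, along the lines of:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
but I am not sure how accurately this reflects the actual device-side usage under WSL2, which is part of what I am asking.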
I suspect this might be using CPU memory instead of GPU memory. If so:
- How can I verify whether the allocation is actually backed by CPU memory rather than GPU VRAM? (One check I was considering is shown after this list.)
- Are there specific monitoring tools or commands that can help distinguish this?
- Could you also recommend any documentation or resources about:
- CUDA memory allocation behavior in WSL2
- How to properly monitor actual VRAM usage / CPU memory usage
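For the verification question above, this is the kind of check I was considering adding right after the cudaMalloc call in the code above. I am not sure whether cudaPointerGetAttributes is the right way to tell device-resident memory apart from host-backed memory under WSL2, so please correct me if this is the wrong approach:
// Hypothetical check, inserted after cudaMalloc(&d_arr, totalBytes) succeeds
cudaPointerAttributes attr;
CHECK_CUDA(cudaPointerGetAttributes(&attr, d_arr));
printf("Pointer memory type: %s (device %d)\n",
       attr.type == cudaMemoryTypeDevice  ? "device"  :
       attr.type == cudaMemoryTypeHost    ? "host"    :
       attr.type == cudaMemoryTypeManaged ? "managed" : "unregistered",
       attr.device);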
I would greatly appreciate a detailed explanation of this behavior. Thank you!
(This is my first post on the NVIDIA Developer Forums, so please let me know if I need to provide any additional information or if there’s anything I should clarify further. Thank you!)