Dear colleagues,
I am facing an issue with a Tesla P100 GPU: my test process stalls at 8 MiB of GPU memory usage and never completes. I have run the test with several different NVIDIA driver and CUDA versions, but the issue persists.
I have dedicated considerable time to troubleshooting and trying various solutions. Despite my efforts, the problem remains unresolved, and I would greatly appreciate any advice or suggestions.
Configuration and Tests Performed:
GPU: Tesla P100 PCIe 16GB with 3584 CUDA cores and 16 GB of HBM2 memory.
Hardware details (lshw output):
*-display
description: 3D controller
product: GP100GL [Tesla P100 PCIe 16GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:21:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=0 mode=1024x768 visual=truecolor xres=1024 yres=768
resources: iomemory:2c00-2bff iomemory:2c40-2c3f irq:210 memory:cd000000-cdffffff memory:2c000000000-2c3ffffffff memory:2c400000000-2c401ffffff
Operating System: Oracle Linux 8.3
Drivers and CUDA Versions Tested:
- Driver 465.19.01 | CUDA 11.3 | Worked
- Driver 560.35.03 | CUDA 12.6 | Did not work
- Driver 535.183.06 | CUDA 12.2 | Did not work
- Driver 535.183.06 | CUDA-Runtime 12.2 | Did not work
- Driver 525.147.05 | CUDA 12.0 | Did not work
- Driver 550.90.12 | CUDA 12.4 | Did not work
Results:
Only driver 465.19.01 with CUDA 11.3 processed correctly. With every other combination, the process halts at 8 MiB of GPU memory and never finishes.
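For reference, here is a minimal sketch of how the driver and runtime versions can be queried from inside a CUDA program, using the standard cudaDriverGetVersion / cudaRuntimeGetVersion calls (this is only a diagnostic helper, not the test program itself):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;

    // Highest CUDA version supported by the installed driver (e.g. 12040 for CUDA 12.4)
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime the binary was built against
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("Driver supports CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Runtime is CUDA %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    return 0;
}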
Test Code Used:
The following code was used to test the GPU. It performs an element-wise multiplication of two large vectors (4,194,304 floats each, about 16 MB per vector, well above the 8 MiB where the process stalls).
#include <stdio.h>
#include <cuda_runtime.h>

// Vector size: 4,194,304 elements (~16 MB per vector, assuming sizeof(float) = 4 bytes)
#define VECTOR_SIZE (4 * 1024 * 1024)

// Kernel function for element-wise vector multiplication
__global__ void vectorMultiply(float *a, float *b, float *c, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] * b[idx];
    }
}

int main() {
    // Vector size in bytes
    size_t size = VECTOR_SIZE * sizeof(float);

    // Pointers for host (CPU) vectors
    float *h_a, *h_b, *h_c;

    // Allocate memory on the host
    h_a = (float *)malloc(size);
    h_b = (float *)malloc(size);
    h_c = (float *)malloc(size);

    // Initialize host vectors
    for (int i = 0; i < VECTOR_SIZE; i++) {
        h_a[i] = i * 0.5f;
        h_b[i] = i * 2.0f;
    }

    // Pointers for device (GPU) vectors
    float *d_a, *d_b, *d_c;

    // Allocate memory on the device
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy vectors from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Set up the number of threads and blocks
    int threadsPerBlock = 256;
    int blocksPerGrid = (VECTOR_SIZE + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel on the GPU
    vectorMultiply<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, VECTOR_SIZE);

    // Copy the result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify the result (only first 10 elements)
    printf("Result (first 10 elements):\n");
    for (int i = 0; i < 10; i++) {
        printf("%f * %f = %f\n", h_a[i], h_b[i], h_c[i]);
    }

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
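Note that the program above does not check CUDA return codes, so a failed allocation, copy, or kernel launch would go unnoticed. For reference, a minimal error-checked version of the launch and copy section would look like the following sketch (the nvidia-smi output below was captured with the unchecked program above; CHECK is a local helper macro, not part of the CUDA API, and it also needs <stdlib.h> for exit):

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

    CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));

    vectorMultiply<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, VECTOR_SIZE);
    CHECK(cudaGetLastError());       // catches launch-configuration / missing-kernel-image errors
    CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs

    CHECK(cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost));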
Demonstration of the Issue with nvidia-smi:
The following nvidia-smi output shows that the process's GPU memory usage stays at 8 MiB, with no further progress:
# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------------------------------------------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================================================================|
| 0 Tesla P100-PCIE-16GB Off | 00000000:XX:00.0 Off | 0 |
| N/A 28C P0 30W / 250W | 11MiB / 16384MiB | 0% Default |
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 17214 C ./multVec 8MiB |
+-----------------------------------------------------------------------------------------+
These are the commands I ran for the test:
$ module purge
$ module load cuda-11.2.2-gcc-9.3.0-gaiqybr
$ nvcc -o multVec multVec.cu
$ ./multVec
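For completeness, the build above uses nvcc's default architecture settings. An explicit Pascal build (the P100 is compute capability 6.0) would look like the following, though I have not confirmed whether it changes the behavior:
$ nvcc -arch=sm_60 -o multVec multVec.cu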
I would greatly appreciate any suggestions or insights into what might be causing this, and whether there are additional configurations or driver adjustments that could resolve it.
Thank you in advance for your help!