Tesla P100 Issue – Processing Stops at 8MiB, Multiple Driver Versions Tested

Dear colleagues,
I am facing an issue with a Tesla P100 GPU, where processing stops at 8MiB of memory usage without completing. I have performed several tests using different NVIDIA driver versions and CUDA versions, but the issue persists.

I have dedicated considerable time to troubleshooting and trying various solutions. Despite my efforts, the problem remains unresolved, and I would greatly appreciate any advice or suggestions.

Configuration and Tests Performed:
GPU: Tesla P100 PCIe 16GB with 3584 CUDA cores and 16 GB of HBM2 memory.

-display
            description: 3D controller
            product: GP100GL [Tesla P100 PCIe 16GB]
            vendor: NVIDIA Corporation
            physical id: 0
            bus info: pci@0000:21:00.0
            logical name: /dev/fb0
            version: a1
            width: 64 bits
            clock: 33MHz
            capabilities: pm msi pciexpress bus_master cap_list fb
            configuration: depth=32 driver=nvidia latency=0 mode=1024x768 visual=truecolor xres=1024 yres=768
            resources: iomemory:2c00-2bff iomemory:2c40-2c3f irq:210 memory:cd000000-cdffffff memory:2c000000000-2c3ffffffff memory:2c400000000-2c401ffffff

Operating System: Oracle Linux 8.3
Drivers and CUDA Versions Tested:

  • Driver 465.19.01 | CUDA 11.3 | Worked
  • Driver 560.35.03 | CUDA 12.6 | Did not work
  • Driver 535.183.06 | CUDA 12.2 | Did not work
  • Driver 535.183.06 | CUDA-Runtime 12.2 | Did not work
  • Driver 525.147.05 | CUDA 12.0 | Did not work
  • Driver 550.90.12 | CUDA 12.4 | Did not work

Results:
Only driver 465.19.01 with CUDA 11.3 processed correctly. With all other versions, processing halts at 8MiB of GPU memory usage.

Test Code Used:
The following code was used to test the GPU. It performs an element-wise multiplication on two large vectors (exceeding 8MB).

#include <stdio.h>
#include <cuda_runtime.h>

// Vector size (exceeding 8 MB)
#define VECTOR_SIZE (4 * 1024 * 1024) // ~4.2 million elements (~16 MB per vector, assuming sizeof(float) = 4 bytes)

// Kernel function for vector multiplication
__global__ void vectorMultiply(float *a, float *b, float *c, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] * b[idx];
    }
}

int main() {
    // Vector size in bytes
    size_t size = VECTOR_SIZE * sizeof(float);

    // Pointers for host (CPU) vectors
    float *h_a, *h_b, *h_c;

    // Allocate memory on the host
    h_a = (float *)malloc(size);
    h_b = (float *)malloc(size);
    h_c = (float *)malloc(size);

    // Initialize host vectors
    for (int i = 0; i < VECTOR_SIZE; i++) {
        h_a[i] = i * 0.5f;
        h_b[i] = i * 2.0f;
    }

    // Pointers for device (GPU) vectors
    float *d_a, *d_b, *d_c;

    // Allocate memory on the device
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy vectors from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Set up the number of threads and blocks
    int threadsPerBlock = 256;
    int blocksPerGrid = (VECTOR_SIZE + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel on the GPU
    vectorMultiply<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, VECTOR_SIZE);

    // Copy the result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify the result (only first 10 elements)
    printf("Result (first 10 elements):\n");
    for (int i = 0; i < 10; i++) {
        printf("%f * %f = %f\n", h_a[i], h_b[i], h_c[i]);
    }

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
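Note that the test program above never checks any CUDA return codes, so a failed kernel launch or copy would go unnoticed. Below is a minimal, standalone error-checking sketch for illustration only; the CHECK macro and dummyKernel are hypothetical names, not part of the original code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                   \
                    __FILE__, __LINE__, cudaGetErrorString(err_));         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummyKernel(void) { }

int main(void) {
    // Launch a trivial kernel and report any launch or runtime error.
    dummyKernel<<<1, 1>>>();
    CHECK(cudaGetLastError());      // launch errors, e.g. "no kernel image is available"
    CHECK(cudaDeviceSynchronize()); // errors raised while the kernel runs
    printf("Kernel launched and completed successfully.\n");
    return 0;
}

Wrapping the vectorMultiply launch and the cudaMemcpy calls in the same kind of checks would report, rather than silently swallow, any failure.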

Demonstration of the Issue with nvidia-smi:
The following output from the nvidia-smi command shows that the process's GPU memory usage stops at 8MiB, with no further progress:

# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------------------------------------------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================================================================|
|   0  Tesla P100-PCIE-16GB           Off |   00000000:XX:00.0 Off |                    0 |
| N/A   28C    P0             30W /  250W |      11MiB /  16384MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     17214      C   ./multVec                                    8MiB |
+-----------------------------------------------------------------------------------------+

I executed the following commands for the test:

$ module purge
$ module load cuda-11.2.2-gcc-9.3.0-gaiqybr
$ nvcc -o multVec multVec.cu
$ ./multVec

I have spent significant time troubleshooting this issue and tested multiple solutions, but the problem persists. I would greatly appreciate any suggestions or insights into what might be causing it, and whether any additional driver configurations or adjustments could help resolve it.

Thank you in advance for your help!

Welcome marcos.mk.melo,

What’s the actual error? When I run your code, it completes, but the “c” array contains zeros. However, once I add the target architecture flag, it gives correct answers.

 % nvcc -o multVec  multVec.cu; ./multVec
Result (first 10 elements):
0.000000 * 0.000000 = 0.000000
0.500000 * 2.000000 = 0.000000
1.000000 * 4.000000 = 0.000000
1.500000 * 6.000000 = 0.000000
2.000000 * 8.000000 = 0.000000
2.500000 * 10.000000 = 0.000000
3.000000 * 12.000000 = 0.000000
3.500000 * 14.000000 = 0.000000
4.000000 * 16.000000 = 0.000000
4.500000 * 18.000000 = 0.000000
% nvcc -o multVec -arch=sm_60 multVec.cu ; ./multVec
Result (first 10 elements):
0.000000 * 0.000000 = 0.000000
0.500000 * 2.000000 = 1.000000
1.000000 * 4.000000 = 4.000000
1.500000 * 6.000000 = 9.000000
2.000000 * 8.000000 = 16.000000
2.500000 * 10.000000 = 25.000000
3.000000 * 12.000000 = 36.000000
3.500000 * 14.000000 = 49.000000
4.000000 * 16.000000 = 64.000000
4.500000 * 18.000000 = 81.000000
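
For reference, the right -arch value can be confirmed by querying the device’s compute capability (the P100 reports 6.0, i.e. sm_60). A minimal standalone sketch, for illustration only:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    // Print the compute capability of device 0; a P100 should report 6.0 (sm_60).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d -> use -arch=sm_%d%d\n",
           prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}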

-Mat

Hello MatColgrove, thank you very much for your response. I’ll test it and get back to you.

Hello Mat, thank you very much for waiting; I was only able to test again today.
Unfortunately, the same issue occurs, and the processing freezes.
I have just tried another, older version (535).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:21:00.0 Off |                    0 |
| N/A   30C    P0              25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off | 00000000:E2:00.0 Off |                    0 |
| N/A   27C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ nvcc -o multVec-v4 -arch=sm_60 multVec.cu ; ./multVec-v4
$ module purge 
$ module load cuda-11.2.2-gcc-9.3.0-gaiqybr 
$ nvcc -o multVec-v4 -arch=sm_60 multVec.cu ; ./multVec-v4


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:21:00.0 Off |                    0 |
| N/A   30C    P0              25W / 250W |     10MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off | 00000000:E2:00.0 Off |                    0 |
| N/A   26C    P0              25W / 250W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     17525      C   ./multVec-v4                                  8MiB |
+---------------------------------------------------------------------------------------+

I am investigating, and my suspicion is that the CUDA version is not compatible with the GPU driver.
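
A quick way to check this suspicion is to print the CUDA runtime version the binary was built against and the highest CUDA version the installed driver supports; a minimal sketch, for illustration only:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    // Versions are encoded as 1000*major + 10*minor (e.g. 12020 = CUDA 12.2).
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);
    printf("CUDA runtime version: %d\n", runtimeVersion);
    printf("CUDA driver version : %d\n", driverVersion);
    // If the driver version is lower than the runtime version,
    // the runtime/driver combination is incompatible.
    return 0;
}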

What are the versions of the GPU driver and CUDA you used in your test?

The P100 system I’m using has a CUDA 12.0 (525.85.12) driver installed, though, while possible, I doubt it’s a driver issue.

Could it be a hardware issue?

Are you able to run other CUDA programs on this device? What happens if you use the second P100 instead (i.e. set the env var CUDA_VISIBLE_DEVICES=1)?
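
For example (using the binary name from earlier in the thread):

$ CUDA_VISIBLE_DEVICES=1 ./multVec-v4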

Hi, Mat. Thank you for your patience.

I performed the test as requested using CUDA_VISIBLE_DEVICES=1 and CUDA_VISIBLE_DEVICES=0. However, the issue persists.

Additionally, I ran a memory test on the GPU using the tool from this site: [CUDA GPU memtest download | SourceForge.net]. I compiled it with CUDA 12.2, and during execution the following error appeared in the server’s dmesg:

My next step will be to perform a test using the same CUDA and driver versions you mentioned (CUDA 12.0, driver 525.85.12), since they worked successfully in your case.

If you have any additional recommendations, I’d greatly appreciate them!