Dear colleagues,
I am facing an issue with a Tesla P100 GPU: my test process stalls at 8 MiB of GPU memory usage and never completes. I have run the test with several different NVIDIA driver and CUDA versions, but the issue persists.
I have dedicated considerable time to troubleshooting and trying various solutions. Despite my efforts, the problem remains unresolved, and I would greatly appreciate any advice or suggestions.
Configuration and Tests Performed:
GPU: Tesla P100 PCIe 16GB with 3584 CUDA cores and 16 GB of HBM2 memory.
Hardware details (lshw output):
*-display
description: 3D controller
product: GP100GL [Tesla P100 PCIe 16GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:21:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=0 mode=1024x768 visual=truecolor xres=1024 yres=768
resources: iomemory:2c00-2bff iomemory:2c40-2c3f irq:210 memory:cd000000-cdffffff memory:2c000000000-2c3ffffffff memory:2c400000000-2c401ffffff
Operating System: Oracle Linux 8.3
Drivers and CUDA Versions Tested:
- Driver 465.19.01 | CUDA 11.3 | Worked
- Driver 560.35.03 | CUDA 12.6 | Did not work
- Driver 535.183.06 | CUDA 12.2 | Did not work
- Driver 535.183.06 | CUDA-Runtime 12.2 | Did not work
- Driver 525.147.05 | CUDA 12.0 | Did not work
- Driver 550.90.12 | CUDA 12.4 | Did not work
Results:
Only driver 465.19.01 with CUDA 11.3 processed correctly. With every other combination, the process halts at 8 MiB of GPU memory and never finishes.
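For reference, here is a minimal sketch of how the driver and runtime versions can be queried from inside a CUDA program, using the standard cudaDriverGetVersion / cudaRuntimeGetVersion calls (this is only a diagnostic helper, not the test program itself):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;

    // Highest CUDA version supported by the installed driver (e.g. 12040 for CUDA 12.4)
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime the binary was built against
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("Driver supports CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Runtime is CUDA %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    return 0;
}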
Test Code Used:
The following code was used to test the GPU. It performs an element-wise multiplication of two large vectors (4,194,304 floats each, about 16 MB per vector, well above the 8 MiB where the process stalls).
#include <stdio.h>
#include <cuda_runtime.h>

// Vector size: 4,194,304 elements (~16 MB per vector, assuming sizeof(float) = 4 bytes)
#define VECTOR_SIZE (4 * 1024 * 1024)

// Kernel function for element-wise vector multiplication
__global__ void vectorMultiply(float *a, float *b, float *c, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] * b[idx];
    }
}

int main() {
    // Vector size in bytes
    size_t size = VECTOR_SIZE * sizeof(float);

    // Pointers for host (CPU) vectors
    float *h_a, *h_b, *h_c;

    // Allocate memory on the host
    h_a = (float *)malloc(size);
    h_b = (float *)malloc(size);
    h_c = (float *)malloc(size);

    // Initialize host vectors
    for (int i = 0; i < VECTOR_SIZE; i++) {
        h_a[i] = i * 0.5f;
        h_b[i] = i * 2.0f;
    }

    // Pointers for device (GPU) vectors
    float *d_a, *d_b, *d_c;

    // Allocate memory on the device
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy vectors from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Set up the number of threads and blocks
    int threadsPerBlock = 256;
    int blocksPerGrid = (VECTOR_SIZE + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel on the GPU
    vectorMultiply<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, VECTOR_SIZE);

    // Copy the result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify the result (only first 10 elements)
    printf("Result (first 10 elements):\n");
    for (int i = 0; i < 10; i++) {
        printf("%f * %f = %f\n", h_a[i], h_b[i], h_c[i]);
    }

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
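Note that the program above does not check CUDA return codes, so a failed allocation, copy, or kernel launch would go unnoticed. For reference, a minimal error-checked version of the launch and copy section would look like the following sketch (the nvidia-smi output below was captured with the unchecked program above; CHECK is a local helper macro, not part of the CUDA API, and it also needs <stdlib.h> for exit):

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

    CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));

    vectorMultiply<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, VECTOR_SIZE);
    CHECK(cudaGetLastError());       // catches launch-configuration / missing-kernel-image errors
    CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs

    CHECK(cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost));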
Demonstration of the Issue with nvidia-smi:
The following nvidia-smi output shows that the process's GPU memory usage stays at 8 MiB, with no further progress:
# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------------------------------------------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================================================================|
| 0 Tesla P100-PCIE-16GB Off | 00000000:XX:00.0 Off | 0 |
| N/A 28C P0 30W / 250W | 11MiB / 16384MiB | 0% Default |
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 17214 C ./multVec 8MiB |
+-----------------------------------------------------------------------------------------+
These are the commands I ran for the test:
$ module purge
$ module load cuda-11.2.2-gcc-9.3.0-gaiqybr
$ nvcc -o multVec multVec.cu
$ ./multVec
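For completeness, the build above uses nvcc's default architecture settings. An explicit Pascal build (the P100 is compute capability 6.0) would look like the following, though I have not confirmed whether it changes the behavior:
$ nvcc -arch=sm_60 -o multVec multVec.cu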
I would greatly appreciate any suggestions or insights into what might be causing this, and whether there are additional configurations or driver adjustments that could resolve it.
Thank you in advance for your help!