Green Context SM Allocation Not Affecting Kernel Runtime

Hi everyone,

I’m experimenting with CUDA Green Contexts to assign a specific number of SMs to each context and run different kernels in parallel. However, I’m observing that regardless of how many SMs I allocate to each Green Context (via cuDevSmResourceSplitByCount), the execution time for the kernel remains essentially the same in both contexts.

Below is a simplified version of my code. I’m launching a computationally heavy kernel using two different Green Contexts, each associated with its own stream. I allocate only minCount = 1 SM to one of the contexts, while the other context gets the remaining SMs. I expected the kernel in the smaller context to take noticeably longer, but both timings are almost identical.

Is there something I’m misunderstanding about how Green Contexts work, or is there an additional step required to enforce the SM limits?

Any insights would be greatly appreciated!

Thanks in advance!

🖥️ System Info:

  • GPU: NVIDIA Orin (nvgpu)
  • CUDA Version: 12.6
  • Driver Version: 540.4.0
  • OS: Ubuntu 20.04
  • Compiler: nvcc from CUDA 12.6

Executed code

#include <iostream>
#include <cuda_runtime.h>
#include <cmath>
#include <cuda.h>
#include <vector>

__global__ void heavyKernel(float *data, int n, int iterations) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float x = data[idx];
    for (int i = 0; i < iterations; ++i) {
        x = x * 1.0000001f + 0.0000001f;
        x = sinf(x);
        x = sqrtf(fabsf(x));
    }
    data[idx] = x;
}

#define CUDA_RT(call)                                                   \
    do {                                                                \
        cudaError_t _err = (call);                                      \
        if ( cudaSuccess != _err ) {                                    \
            fprintf(stderr, "CUDA error in file '%s' at line %i: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(_err));      \
            return _err;                                                \
        } else {                                                        \
            printf("CUDA Runtime call at %s:%d succeeded.\n", __FILE__, __LINE__); \
        }                                                               \
    } while (0)

#define CUDA_DRV(call)                                                  \
    do {                                                                \
        CUresult _status = (call);                                      \
        if ( CUDA_SUCCESS != _status) {                                 \
            fprintf(stderr, "CUDA error in file '%s' at line %i: %i\n", \
                    __FILE__, __LINE__, _status);                       \
            return _status;                                             \
        } else {                                                        \
            printf("CUDA Driver call at %s:%d succeeded.\n", __FILE__, __LINE__); \
        }                                                               \
    } while (0)

int main() {
    CUdevResource input;
    CUdevResource resources[2];
    CUdevResourceDesc desc[2];
    CUgreenCtx gctx[2];
    CUstream streamA, streamB;

    unsigned int nbGroups = 1; // number of groups to create
    unsigned int minCount = 1; // minimum SM count to assign to a green context

    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount); // error variable
    const int n = 1 << 20;  // 1 million elements
    const int iterations = 100000;
    const int total_runs = 10;

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    float tiemposA[total_runs];
    float tiemposB[total_runs];

    cudaEvent_t startA, stopA, startB, stopB;
    cudaEventCreate(&startA);
    cudaEventCreate(&stopA);
    cudaEventCreate(&startB);
    cudaEventCreate(&stopB);

    float *h_data = new float[n];
    for (int i = 0; i < n; ++i) {
        h_data[i] = static_cast<float>(i) / n;
    }

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    float tiempos[total_runs];
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    // Warm-up launch so initialization overhead is not included in the timings
    heavyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, iterations);

    for (int i = 0; i < total_runs; ++i) {
        std::cout << "Launching kernel " << i << "...\n";

        cudaEventRecord(start);
        heavyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, iterations);
        cudaEventRecord(stop);

        cudaEventSynchronize(stop);

        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
            return 1;
        }

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        tiempos[i] = ms;
    }

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int i = 0; i < total_runs; ++i) sum += tiempos[i];
    std::cout << "Average kernel time: " << (sum / total_runs) << " ms\n";

    // Cleanup
    delete[] h_data;
    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Initializing device...\n");
    CUDA_RT(cudaInitDevice(0, 0, 0));

    printf("Getting SM resources from device...\n");
    CUDA_DRV(cuDeviceGetDevResource((CUdevice)0, &input, CU_DEV_RESOURCE_TYPE_SM));
    printf("Total number of SMs: %u\n", input.sm.smCount);

    printf("Dividing resources: (%u SMs) for the first green context.\n", minCount);

    CUDA_DRV(
        cuDevSmResourceSplitByCount(
            &resources[0],   // Array where the groups are written (first group in this case)
            &nbGroups,       // Number of groups to create
            &input,          // Original resource (all SMs from the device)
            &resources[1],   // Remaining resource (SMs not assigned to the group)
            0,               // flags (usually 0)
            minCount         // Minimum number of SMs in the first group
        )
    );
    printf("Resources divided.\n");

    printf("Generating descriptors\n");
    CUDA_DRV(cuDevResourceGenerateDesc(&desc[0], &resources[0], 1));

    printf("Creating green contexts...\n");
    CUDA_DRV(cuGreenCtxCreate(&gctx[0], desc[0], (CUdevice)0, CU_GREEN_CTX_DEFAULT_STREAM));
    printf("Green context A created.\n");

    CUDA_DRV(cuDevResourceGenerateDesc(&desc[1], &resources[1], 1));
    CUDA_DRV(cuGreenCtxCreate(&gctx[1], desc[1], (CUdevice)0, CU_GREEN_CTX_DEFAULT_STREAM));
    printf("Green context B created.\n");

    printf("Creating and associating the streams to the GC\n");
    CUDA_DRV(cuGreenCtxStreamCreate(&streamA, gctx[0], CU_STREAM_NON_BLOCKING, 0));
    CUDA_DRV(cuGreenCtxStreamCreate(&streamB, gctx[1], CU_STREAM_NON_BLOCKING, 0));
    printf("Successfully done\n");

    for (int i = 0; i < total_runs; i++) {
        printf("Launching kernel %d...\n", i);

        // Kernel in streamA
        cudaEventRecord(startA, (cudaStream_t)streamA);
        heavyKernel<<<blocksPerGrid, threadsPerBlock, 0, (cudaStream_t)streamA>>>(d_data, n, iterations);
        cudaEventRecord(stopA, (cudaStream_t)streamA);

        // Kernel in streamB
        cudaEventRecord(startB, (cudaStream_t)streamB);
        heavyKernel<<<blocksPerGrid, threadsPerBlock, 0, (cudaStream_t)streamB>>>(d_data, n, iterations); // same kernel, launched in context B's stream
        cudaEventRecord(stopB, (cudaStream_t)streamB);

        // Synchronization
        cudaEventSynchronize(stopA);
        cudaEventSynchronize(stopB);

        float msA = 0.0f, msB = 0.0f;
        cudaEventElapsedTime(&msA, startA, stopA);
        cudaEventElapsedTime(&msB, startB, stopB);

        tiemposA[i] = msA;
        tiemposB[i] = msB;

        // Error check
        err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
            return 1;
        }
    }

    // Print averages
    float sumA = 0.0f, sumB = 0.0f;
    for (int i = 0; i < total_runs; ++i) {
        sumA += tiemposA[i];
        sumB += tiemposB[i];
    }

    std::cout << "Average time for kernel A: " << (sumA / total_runs) << " ms\n";
    std::cout << "Average time for kernel B: " << (sumB / total_runs) << " ms\n";

    // Destroy events
    cudaEventDestroy(startA);
    cudaEventDestroy(stopA);
    cudaEventDestroy(startB);
    cudaEventDestroy(stopB);

    return 0;
}

Execution results

Average kernel time: 4482.25 ms
Initializing device...
CUDA Runtime call at expetimento_ingles.cu:124 succeeded.
Getting SM resources from device...
CUDA Driver call at expetimento_ingles.cu:128 succeeded.
Total number of SMs: 8
Dividing resources: (1 SMs) for the first green context.
CUDA Driver call at expetimento_ingles.cu:133 succeeded.
Resources divided.
Generating descriptors
CUDA Driver call at expetimento_ingles.cu:147 succeeded.
Creating green contexts...
CUDA Driver call at expetimento_ingles.cu:151 succeeded.
Green context A created.
CUDA Driver call at expetimento_ingles.cu:154 succeeded.
CUDA Driver call at expetimento_ingles.cu:155 succeeded.
Green context B created.
Creating and associating the streams to the GC
CUDA Driver call at expetimento_ingles.cu:160 succeeded.
CUDA Driver call at expetimento_ingles.cu:161 succeeded.
Successfully done
Average time for kernel A: 8962.44 ms
Average time for kernel B: 8962.4 ms

Your posted code seems to be cut off midway through.

Sorry, you’re absolutely right; it’s fixed now.

  • I would recommend CUDA 12.8, not CUDA 12.6.
  • When I run your posted code as-is under compute-sanitizer, I get many reports of the kernel making invalid out-of-bounds accesses.
  • If I run your code without compute-sanitizer on CUDA 12.8, I get the following printout at the end:
$ ./t22
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average kernel time: 247.705 ms
Initializing device...
CUDA Runtime call at t22.cu:122 succeeded.
Getting SM resources from device...
CUDA Driver call at t22.cu:125 succeeded.
Total number of SMs: 84
Dividing resources: (1 SMs) for the first green context.
CUDA Driver call at t22.cu:130 succeeded.
Resources divided.
Generating descriptors
CUDA Driver call at t22.cu:143 succeeded.
Creating green contexts...
CUDA Driver call at t22.cu:146 succeeded.
Green context A created.
CUDA Driver call at t22.cu:149 succeeded.
CUDA Driver call at t22.cu:150 succeeded.
Green context B created.
Creating and associating the streams to the GC
CUDA Driver call at t22.cu:154 succeeded.
CUDA Driver call at t22.cu:155 succeeded.
Successfully done
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average time for kernel A: 5111.43 ms
Average time for kernel B: 262.373 ms

Example snippet of compute-sanitizer output:

$ compute-sanitizer ./t22
========= COMPUTE-SANITIZER
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average kernel time: 247.598 ms
Initializing device...
CUDA Runtime call at t22.cu:122 succeeded.
Getting SM resources from device...
CUDA Driver call at t22.cu:125 succeeded.
Total number of SMs: 84
Dividing resources: (1 SMs) for the first green context.
CUDA Driver call at t22.cu:130 succeeded.
Resources divided.
Generating descriptors
CUDA Driver call at t22.cu:143 succeeded.
Creating green contexts...
CUDA Driver call at t22.cu:146 succeeded.
Green context A created.
CUDA Driver call at t22.cu:149 succeeded.
CUDA Driver call at t22.cu:150 succeeded.
Green context B created.
Creating and associating the streams to the GC
CUDA Driver call at t22.cu:154 succeeded.
CUDA Driver call at t22.cu:155 succeeded.
Successfully done
Launching kernel 0...
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (32,0,0) in block (0,0,0)
=========     Address 0x71882ba00080 is out of bounds
=========     and is 262,016 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x71882ba00084 is out of bounds
=========     and is 262,012 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (34,0,0) in block (0,0,0)
=========     Address 0x71882ba00088 is out of bounds
=========     and is 262,008 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (35,0,0) in block (0,0,0)
=========     Address 0x71882ba0008c is out of bounds
=========     and is 262,004 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (36,0,0) in block (0,0,0)
=========     Address 0x71882ba00090 is out of bounds
=========     and is 262,000 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (37,0,0) in block (0,0,0)
=========     Address 0x71882ba00094 is out of bounds
=========     and is 261,996 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (38,0,0) in block (0,0,0)
=========     Address 0x71882ba00098 is out of bounds
=========     and is 261,992 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (39,0,0) in block (0,0,0)
=========     Address 0x71882ba0009c is out of bounds
=========     and is 261,988 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (40,0,0) in block (0,0,0)
=========     Address 0x71882ba000a0 is out of bounds
=========     and is 261,984 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (41,0,0) in block (0,0,0)
=========     Address 0x71882ba000a4 is out of bounds
=========     and is 261,980 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (42,0,0) in block (0,0,0)
=========     Address 0x71882ba000a8 is out of bounds
=========     and is 261,976 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
========= Invalid __global__ read of size 4 bytes
=========     at heavyKernel(float *, int, int)+0xa0
=========     by thread (43,0,0) in block (0,0,0)
=========     Address 0x71882ba000ac is out of bounds
=========     and is 261,972 bytes before the nearest allocation at 0x71882ba40000 of size 65,536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x97f2] in t22
=========
...

CUDA 12.8, A40 GPU

Hi Robert Crovella,

First of all, thank you very much. I’m trying to test your suggestion, but I’m running into a problem; maybe you can help me with it. I’m running my tests on a Jetson Orin Nano. Although I was able to install the toolkit containing CUDA 12.8, I get an error when running, since the most recent drivers are designed for CUDA Driver 12.6. Do you know whether it’s possible to work with CUDA 12.8 on JetPack 6.2, which ships CUDA 12.6, or is that impossible?

If it were my code, the first thing I would do is fix the illegal accesses. My general advice is not to draw conclusions from broken code; the only conclusion I would draw from my post is that your code is broken and should be fixed.

Once you fix it, run the tests again. The reason I suggested updating from 12.6 to 12.8 was the comments in this thread, specifically:

after I upgrade my CUDA from 12.6 to 12.8, this issue has been fixed.

I have not tested 12.6 myself. I don’t know for certain that 12.8 would make any difference in your observations.

Regarding your question:

I suggest asking that on the Orin forum of your choice. A quick search turns up this (but don’t post in that forum; post in an Orin forum), so I think it may be possible to update from 12.6 to 12.8 (indeed, even 12.9 has just become available). The basic idea appears to be that you leave the 12.6 driver installed via JetPack untouched, install the CUDA 12.8 toolkit, and install the compatibility package. On discrete GPUs where the compatibility package is supported, that is a valid method, so check with the Orin team to confirm.

And since I don’t know whether there is any dependency on CUDA 12.6 vs. CUDA 12.8 for this, and since the compatibility path doesn’t actually update the GPU driver, it may be that the compatibility method doesn’t resolve the issue; if there is a problem, I don’t know where it is.


Hi, first of all, thank you again for your response. I was able to resolve the error in my code: as you pointed out, it was caused by freeing the memory for d_data and then trying to use it again in the later kernel launches.

cudaFree(d_data);
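For completeness, this is roughly how the end of my program looks after the fix (a minimal sketch, not the full code): the early free is gone and everything is released only once, after the green-context runs.

// End of main() after the fix (sketch): d_data stays allocated until the
// kernels launched in the green-context streams have finished.
cudaEventDestroy(start);
cudaEventDestroy(stop);

// ... green-context setup and the second timing loop, unchanged ...

// Single cleanup at the very end:
delete[] h_data;
cudaFree(d_data);
cudaEventDestroy(startA);
cudaEventDestroy(stopA);
cudaEventDestroy(startB);
cudaEventDestroy(stopB);
return 0;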

Now, having fixed that, I’m still facing the original issue: I cannot observe any effect from using separate green-context streams. Based on my results, the execution time of each kernel remains the same, which is surprising to me, especially since the effect of the SM partition is clearly visible in the run you posted.

I’m not sure if this is due to the limited number of SMs on the NVIDIA Orin device, or if something else is not working as expected.
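One thing I still want to verify is how many SMs each green context actually ended up with, since, as far as I understand, the split can round the requested count up to a hardware-dependent granularity. This is the small check I’m planning to add (a sketch only; I’m assuming cuGreenCtxGetDevResource is available in my driver version):

// Ask each green context for the SM resource it actually holds
// (reuses the CUDA_DRV macro from the code above).
CUdevResource resA, resB;
CUDA_DRV(cuGreenCtxGetDevResource(gctx[0], &resA, CU_DEV_RESOURCE_TYPE_SM));
CUDA_DRV(cuGreenCtxGetDevResource(gctx[1], &resB, CU_DEV_RESOURCE_TYPE_SM));
printf("Green context A holds %u SMs\n", resA.sm.smCount);
printf("Green context B holds %u SMs\n", resB.sm.smCount);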

I’ve attached the output of the execution using compute-sanitizer.

========= COMPUTE-SANITIZER
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average kernel time: 4485.83 ms
Initializing device...
CUDA Runtime call at expetimento.cu:121 succeeded.
Getting SM resources from device...
CUDA Driver call at expetimento.cu:124 succeeded.
Total number of SMs: 8
Dividing resources: (1 SMs) for the first green context.
CUDA Driver call at expetimento.cu:129 succeeded.
Resources divided.
Generating descriptors
CUDA Driver call at expetimento.cu:142 succeeded.
Creating green contexts...
CUDA Driver call at expetimento.cu:145 succeeded.
Green context A created.
CUDA Driver call at expetimento.cu:148 succeeded.
CUDA Driver call at expetimento.cu:149 succeeded.
Green context B created.
Creating and associating the streams to the GC
CUDA Driver call at expetimento.cu:153 succeeded.
CUDA Driver call at expetimento.cu:154 succeeded.
Successfully done
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average time for kernel A: 8965.25 ms
Average time for kernel B: 8965.11 ms
========= ERROR SUMMARY: 0 errors

I agree that when you remove that line, the code runs without errors on my CUDA 12.8/A40 setup under compute-sanitizer. Here is the non-compute-sanitizer output:

$ ./t22
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average kernel time: 247.409 ms
Initializing device...
CUDA Runtime call at t22.cu:122 succeeded.
Getting SM resources from device...
CUDA Driver call at t22.cu:125 succeeded.
Total number of SMs: 84
Dividing resources: (1 SMs) for the first green context.
CUDA Driver call at t22.cu:130 succeeded.
Resources divided.
Generating descriptors
CUDA Driver call at t22.cu:143 succeeded.
Creating green contexts...
CUDA Driver call at t22.cu:146 succeeded.
Green context A created.
CUDA Driver call at t22.cu:149 succeeded.
CUDA Driver call at t22.cu:150 succeeded.
Green context B created.
Creating and associating the streams to the GC
CUDA Driver call at t22.cu:154 succeeded.
CUDA Driver call at t22.cu:155 succeeded.
Successfully done
Launching kernel 0...
Launching kernel 1...
Launching kernel 2...
Launching kernel 3...
Launching kernel 4...
Launching kernel 5...
Launching kernel 6...
Launching kernel 7...
Launching kernel 8...
Launching kernel 9...
Average time for kernel A: 5109.53 ms
Average time for kernel B: 262.279 ms

It certainly seems like there is a large disparity between the two reported cases. I don’t know if you are still running on CUDA 12.6 or have switched to CUDA 12.8, but for additional inquiry you may want to ask about it on an Orin forum.


Thank you very much for your time. Best regards.
