Accessing beyond bank 0 in __constant__ memory from PTX?

My application uses three kernels, running serially in stream 1, with run times of 5 µs, 1.5 ms, and 2.4 ms respectively.

Kernel 1 writes 1024 bytes to global memory and, upon its completion, an event triggers a cudaMemcpyToSymbolAsync in stream 2 that copies this array to statically declared constant memory. This constant array is then accessed by kernel 3. This works well.
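In outline, the mechanism looks something like the sketch below (hypothetical kernel and stream names; kernel 2 omitted, as it is not involved in the copy; error checking omitted):

#include <stdio.h>

__device__   float staging[256];   // kernel 1's output in global memory
__constant__ float cbuf[256];      // the 1024-byte __constant__ destination

__global__ void kernel1 (void)     // producer: fills the staging area
{
    staging[threadIdx.x] = (float)threadIdx.x;
}

__global__ void kernel3 (void)     // consumer: reads the __constant__ copy
{
    printf ("cbuf[255] = %f\n", cbuf[255]);
}

int main (void)
{
    cudaStream_t stream1, stream2;
    cudaEvent_t  k1_done, copy_done;
    void *staging_ptr;

    cudaStreamCreate (&stream1);
    cudaStreamCreate (&stream2);
    cudaEventCreateWithFlags (&k1_done,   cudaEventDisableTiming);
    cudaEventCreateWithFlags (&copy_done, cudaEventDisableTiming);
    cudaGetSymbolAddress (&staging_ptr, staging);

    kernel1<<<1, 256, 0, stream1>>>();
    cudaEventRecord (k1_done, stream1);

    // stream 2 waits for kernel 1, then copies staging -> __constant__
    cudaStreamWaitEvent (stream2, k1_done, 0);
    cudaMemcpyToSymbolAsync (cbuf, staging_ptr, sizeof (cbuf), 0,
                             cudaMemcpyDeviceToDevice, stream2);
    cudaEventRecord (copy_done, stream2);

    // kernel 3, back in stream 1, waits for the copy to finish
    cudaStreamWaitEvent (stream1, copy_done, 0);
    kernel3<<<1, 1, 0, stream1>>>();

    cudaStreamSynchronize (stream1);
    return 0;
}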

I now wish to enlarge the copied array to 64 KB, and doing so triggers the expected error of exceeding the permitted 64 KB constant memory limit, because a small amount of constant memory is used elsewhere.

A footnote in the “Technical Specification” table of the CUDA Wikipedia page, in the “Constant memory size” field, states:

“Constant memory size accessible by CUDA C/C++(1 bank, PTX can access 11 banks, SASS can access 18 banks)”.

Looking in the PTX ISA here and here makes me wonder whether I could write a small device PTX function to utilise a whole 64 kB bank, for the exclusive use of the copy outlined above.

A comment in the PTX ISA:

"Constant buffers allocated by the driver are initialized by the host, and pointers to such buffers are passed to the kernel as parameters. "

Two things I’m unsure of, assuming this can be done:

  1. How to initialise this on the host - cudaMalloc?
  2. Would I actually have to rewrite kernels 1 and 3 completely in PTX, in order to utilise the Kernel Parameter Attribute: .ptr?

This seems like questionable information to me. In my observation, no more than three banks of constant memory were ever used by compiler-generated SASS for compute capability 3.0 or higher. The constant bank numbering and the assignment of tasks to banks have changed multiple times.

Only one bank is definitely 64 KB in size, namely the one backing __constant__ data. Two other banks were used to pass kernel arguments and to store immediate data (insofar as it cannot be encoded directly into instructions). Those banks may be (significantly) smaller than 64 KB each.

Based on a cursory look at generated SASS, I am under the impression that the most recent GPU architectures may have reduced this to two constant memory banks, each 64 KB in size. Please note that I have not investigated this in detail, and this first impression may in fact be incorrect. If you need to know these undocumented implementation details, I would suggest performing your own due diligence.

What is always safe to assume is that you have one constant memory bank of 64 KB available for __constant__ data (although to the extent that the CUDA standard math library uses __constant__ data, that usage must be subtracted from the 64 KB). You do not need to drop to the PTX level to use this constant memory bank.

You have another constant memory bank available for kernel arguments, at least 4 KB in size. For compute capability >= 7.0 and CUDA versions >= 12.1, a total of 32 KB of kernel arguments is supported. Again, there is no need to drop to PTX to utilize it; simply create a large struct that is passed as a kernel argument. See also:

This is an interesting development, because (1) it allows programmers to utilize a large block of constant data in addition to __constant__, and (2) it provides a way of updating constant memory that is faster than using cudaMemcpyToSymbol{Async}: simply pass the updated data along with the kernel launch, without the need for a separate API call.

With all this in place, the largest single contiguous block of constant memory available to you is still the 64 KB of __constant__ memory. If you need the full 64 KB as a single large buffer, try moving the miscellaneous data you currently store in __constant__ to the kernel argument space instead.
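For illustration, here is a minimal sketch of that approach, assuming compute capability >= 7.0 and CUDA >= 12.1 for the enlarged 32 KB argument space (all names made up; error checking omitted):

#include <stdio.h>
#include <stdlib.h>

#define BIG_N 8000   /* 8000 floats = 32000 bytes, just under the 32 KB limit */

struct big_params {
    float data [BIG_N];
};

__global__ void kernel (int i, struct big_params p)
{
    /* p.data resides in the constant bank that backs kernel arguments */
    printf ("p.data[%d] = %15.8e\n", i, p.data[i]);
}

int main (void)
{
    static struct big_params p;   /* static: keep the 32 KB off the host stack */
    for (int i = 0; i < BIG_N; i++) {
        p.data[i] = (float)i;
    }
    kernel<<<1,1>>>(BIG_N - 1, p);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

This needs to be compiled for -arch=sm_70 or above with CUDA >= 12.1; on earlier targets or CUDA versions the launch would exceed the 4 KB argument limit.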


Thanks for the considered reply, Norbert. I hadn’t caught up on the link you provided; I’ll certainly reconsider things around that.

The only reason I thought my outlined plan was feasible was this quote from the first PTX ISA link above:

"Constant memory is restricted in size, currently limited to 64 KB which can be used to hold statically-sized constant variables. There is an additional 640 KB of constant memory, organized as ten independent 64 KB regions. "

Perhaps the blog note you reference provides some access to this - I haven’t read it yet.

Frankly, this is news to me. Surely some software would have exploited this by now, yet I have never come across such a case. I would suggest running some experiments to see whether those 640 KB are in fact present and usable at the PTX level.
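Should you want to experiment without rewriting entire kernels in PTX, inline PTX is one avenue. Below is a minimal sketch (hypothetical names) that merely demonstrates an explicit ld.const through a const-space address obtained with cvta.to.const; by itself it only touches the bank backing __constant__, but it could serve as a starting point:

#include <stdio.h>

__constant__ float cprobe[16];

__global__ void probe (int i)
{
    const float *gp = &cprobe[i];   // generic address of the element
    unsigned long long caddr;
    float v;
    // convert the generic address into the .const window, then load from it
    asm ("cvta.to.const.u64 %0, %1;" : "=l"(caddr) : "l"(gp));
    asm ("ld.const.f32 %0, [%1];" : "=f"(v) : "l"(caddr));
    printf ("cprobe[%d] = %f\n", i, v);
}

int main (void)
{
    float host[16];
    for (int i = 0; i < 16; i++) host[i] = (float)i;
    cudaMemcpyToSymbol (cprobe, host, sizeof (host));
    probe<<<1,1>>>(5);
    cudaDeviceSynchronize ();
    return 0;
}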

Unfortunately, this avenue is closed to me, as I’m still languishing at CC 6.1. I’ll attempt some PTX exploration.

CUDA programmers can use the constant bank reserved for kernel arguments on any currently supported GPU and any CUDA version by passing an appropriate struct together with their other “normal” kernel arguments. However, the utility of doing so is limited, as kernel arguments are capped at 4 KB on older platforms and CUDA versions.

What has changed recently is that for compute capability >= 7.0 and CUDA versions >= 12.1, kernel arguments can total 32 KB, which makes this approach more attractive.

Here is a worked example, using CC 3.0 and CUDA 9.2:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>   /* for sqrtf() on the host */

#define SCALE    (32)
#define FOO_SIZE ((int)(65536/sizeof(float)))        /* 16384 floats = 64 KB */
#define BAR_SIZE ((int)(65536/sizeof(float)/SCALE))  /*   512 floats =  2 KB */

float foo_shadow [FOO_SIZE];        /* host-side shadow copy of foo */
__constant__ float foo [FOO_SIZE];  /* fills the entire 64 KB __constant__ bank */
struct soa {                        /* 2 KB of data passed via kernel arguments */
    float arr [BAR_SIZE];
} bar_shadow;

__global__ void kernel (int i, struct soa bar)
{
    float s = foo[i];
    float t = bar.arr[i / SCALE];
    printf ("i=%d s=%15.8e t=%15.8e prod=%15.8e\n", i, s, t, s*t);
}

int main (void)
{
    int pos = FOO_SIZE-1;

    for (int i = 0; i < FOO_SIZE; i++) {
        foo_shadow [i] = sqrtf ((float)i);
    }
    for (int i = 0; i < BAR_SIZE; i++) {
        bar_shadow.arr [i] = (float)(9999 - i);
    }
    cudaMemcpyToSymbol (foo, foo_shadow, sizeof foo);
    kernel<<<1,1>>>(pos, bar_shadow);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

This prints:

i=16383 s= 1.27996094e+02 t= 9.48800000e+03 prod= 1.21442700e+06

The generated code shows __constant__ mapped to constant bank 3, and the kernel argument space mapped to constant bank 0:

        code for sm_30
                Function : _Z6kerneli3soa
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"

        /*0008*/                   MOV R1, c[0x0][0x44];
        /*0010*/                   MOV R0, c[0x0][0x140];
        /*0018*/                   IADD32I R1, R1, -0x20;
        /*0020*/                   MOV32I R3, 0x20;
        /*0028*/                   MOV R7, RZ;
        /*0030*/                   SHR R2, R0, 0x1f;
        /*0038*/                   LOP.OR R6, R1, c[0x0][0x24];
        /*0048*/                   IMAD.U32.U32.HI R2, R2, R3, c[0x0][0x140];
        /*0050*/                   LOP32I.AND R12, R6, 0xffffff;
        /*0058*/                   SHL R3, R0, 0x2;
        /*0060*/                   MOV32I R5, 0x0;
        /*0068*/                   SHR R2, R2, 0x5;
        /*0070*/                   STL [R12], R0;
        /*0078*/                   ISCADD R2, R2, 0x144, 0x2;
        /*0088*/                   LDC R4, c[0x3][R3];          <<<< foo[]
        /*0090*/                   LDC R8, c[0x0][R2];          <<<< bar.arr[]
        /*0098*/                   F2F.F64.F32 R2, R4;
        /*00a0*/                   FMUL R4, R4, R8;
        /*00a8*/                   STL.64 [R12+0x8], R2;
        /*00b0*/                   F2F.F64.F32 R8, R8;
        /*00b8*/                   F2F.F64.F32 R10, R4;
        /*00c8*/                   STL.128 [R12+0x10], R8;
        /*00d0*/                   MOV32I R4, 0x0;
        /*00d8*/                   JCAL 0x0;
        /*00e0*/                   EXIT;
        /*00e8*/                   BRA 0xe8;

Thanks for the code. I wasn’t clear in my previous reply - I meant that the 32 KB option was closed to me, due to pre-CC 7.0 hardware.

If my PTX explorations show anything beyond what you’ve outlined, I’ll let you know.

I would very much appreciate it. When I read that quote from the PTX ISA manual yesterday I was really dumbfounded: “How could I possibly have missed that in 16 years of working with CUDA?”