__constant__ memory in function scope

Since we can’t have __constant__ at function scope (for example in kernel args, as in OpenCL), are the following two reads going to have the same performance:

//1
struct Foo { float f[32]; };
__constant__ Foo foo;

__global__ void test(int i) {
    printf("%f\n", foo.f[i]);
}
//2
struct Foo { float f[32]; };
__constant__ Foo foo;

__global__ void test(int i) {
    Foo* fooPtr = &foo;
    printf("%f\n", fooPtr->f[i]);
}

So, my question is whether the second read is going to go through the same constant cache?

Kernel arguments (that is, function arguments to global functions) are passed in constant memory on all architectures supported by the current CUDA version (that is, all GPU architectures >= sm_20). They are just passed in a separate bank of constant memory, not the same bank used to map user-defined constant data. You can easily see this for yourself by disassembling a small test app with cuobjdump --dump-sass.
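To illustrate (a hypothetical minimal example; the file name and the sm_35 target are arbitrary), the constant-bank addressing of kernel arguments is visible directly in the SASS:

```cuda
// args.cu -- minimal kernel for inspecting how arguments are passed.
// Build and disassemble, e.g.:
//   nvcc -arch=sm_35 -cubin -o args.cubin args.cu
//   cuobjdump --dump-sass args.cubin
__global__ void scale(float *out, const float *in, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];
}
```

In the disassembly, the pointers and the scalar k show up as constant-bank operands such as c[0x0][...], in a different bank/offset range than user-declared __constant__ data (the exact banks and offsets vary by architecture).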

But I am passing some gigabytes of input data? Or does this apply only to arguments that are passed by copy, not through a pointer? Is there then no point in declaring memory that is often going to be read by many threads as __constant__?

constant memory is limited to 64KB. The advantage of constant memory is that it is read through a small constant cache (I think around 4KB in size), which provides a broadcast feature that allows data to be sent to all threads in a warp in the same cycle. This requires that all threads in the warp present the same address, i.e., the access is uniform. If the access is not uniform, serialization occurs.

If you are passing GBs of data to a kernel, simply allocate memory with cudaMalloc(), then pass the pointer(s) it returns to your kernel(s). Use the “const” and “__restrict__” qualifiers on these pointers where possible to allow the compiler greater freedom in re-ordering loads for maximum performance.

I would suggest carefully reviewing the sections describing the various memory spaces in the CUDA documentation, in particular the CUDA C Programming Guide and the Best Practices Guide.
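To make the uniform-access requirement concrete, here is a small sketch (not from the discussion above; the transaction behavior described in the comments is approximate and architecture-dependent):

```cuda
__constant__ float table[32];

// Uniform access: all threads in the warp present the same address,
// so the constant cache can broadcast the value in one transaction.
__global__ void uniformRead(float *out, int j)
{
    out[threadIdx.x] = table[j];              // same j for every thread
}

// Non-uniform access: threads present different addresses, and the
// constant cache serializes into one transaction per distinct address.
__global__ void divergentRead(float *out)
{
    out[threadIdx.x] = table[threadIdx.x % 32];
}
```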

Thank you for your answer.
However, I was not asking that at all.
I am very familiar with constant memory and the constant cache (it is actually 8KB per SMX), and I am already using __restrict__, __ldg() (or the ld.global.nc variants).

I am asking whether, if I make a regular pointer from __constant__ memory, I would get the same broadcast & cache behavior as when using the regular constant data. The purpose of doing that, as I have already described, is to mimic OpenCL constant memory, which is declared at function scope (in contrast to the file scope in CUDA). In OpenCL 2.0 there is constant at file scope, but we have not yet seen even 1.2 implemented by some of the major vendors.

My second question was to figure out what you meant in the first place: if it is true that memory passed to a kernel is implicitly constant, is that true only for memory passed by copy (i.e., it obviously does not apply to a pointer that could point to gigabytes)…

However, I highly appreciate the effort you made to answer me. But please, let’s stick to the point; we have all read the docs.

“I am asking if I make a regular pointer from a constant memory, I would have the same broadcast & cache features as if I am using the regular constant data.”

if i (attempt to) unpack this:

a pointer, which may be constant itself (in most cases the pointer would be constant)
the pointer pointing to constant memory space, or global memory space
(if it points to global memory space,) the data in global memory space may be constant itself
(if it points to constant memory space, the data is implicitly constant)

i do not see how you would manage to load tons of data into constant memory space, so i do not quite follow how you then link this to OpenCL constant memory

“to mimic the OpenCL constant memory which is declared in function scope (in contrast to the file scope in CUDA)”

kernels can not write to constant memory space, but apart from that, i do think that with some effort, the host can manage constants at the function or kernel level, as it can track the progress of kernels, as well as the expiry of constants and thus constant memory variables
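as a sketch of that host-managed approach (names are made up here, error checking omitted): the host can refresh a __constant__ variable between launches, giving each kernel its own “function scope” constants

```cuda
#include <cuda_runtime.h>

__constant__ float params[32];

__global__ void someKernel(float *out)
{
    // uniform read of params -> broadcast through the constant cache
    out[threadIdx.x] = params[0] * 2.0f;
}

// Refresh the __constant__ variable before each launch that needs a
// different "per-function" constant set.
void launchWithConstants(const float *hostVals, float *devOut)
{
    cudaMemcpyToSymbol(params, hostVals, 32 * sizeof(float));
    someKernel<<<32, 128>>>(devOut);
}
```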

Sorry, apparently I am having trouble explaining what I want.
Imagine that I have this struct …

struct VeryOftenUsedData {
    float something[32];
};

struct KernelParams {
    // tons of pointers to global memory here, like:
    float* __restrict__ ptr0;
    //...
    VeryOftenUsedData* __restrict__ foo;
};

// this will be called once to store some pointers in the params struct;
// we do this for convenience
__global__
void setParams(KernelParams* params, float* __restrict__ ptr0 /* points to gigabytes of data, we don't want that to be __constant__... */, /* tons of other pointers, */ VeryOftenUsedData* __restrict__ foo) {
    params->ptr0 = ptr0;
    //...
    params->foo = foo;
    //...
}

// this will be called many times and does the actual work
__global__
void kernel(KernelParams* params) {
    doALotOfWork(params); // uses the vars set in params; the VeryOftenUsedData is read a lot, in a __constant__-memory-friendly way
}

setParams<<<1,1>>>(params, ptr0, /*...*/, veryOftenUsedData);
while (haveWorkToDo)
    kernel<<<32,128>>>(params);

Do you think it might be better to use this instead:

struct VeryOftenUsedData {
    float something[32];
};
__constant__ VeryOftenUsedData kOftenUsed;

struct KernelParams {
    // tons of pointers to global memory here, like:
    float* __restrict__ ptr0;
    //...
    VeryOftenUsedData* __restrict__ foo;
};

__global__
void setParams(KernelParams* params, /* tons of pointers, */ float* __restrict__ ptr0) {
    params->foo = &kOftenUsed;
    params->ptr0 = ptr0;
}

setParams<<<1,1>>>(params, ptr0 /*, ...*/);
while (haveWorkToDo)
    kernel<<<32,128>>>(params);

That is, is the cast from __constant__ T to T* __restrict__ going to confuse the compiler so much that it will not do the proper constant-cache / broadcast reads that are expected from constant memory?

I hope this makes it a bit clearer.
Thanks.

are you asking whether a (derivative) structure containing a number of pointers - or simply a pointer for that matter - pointing to - conveying - an original pointer with attribute declarations, like restrict or const or what have you, would retain or lose these attribute declarations of the original pointer…?

Yes (in fact I was just asking about the particular case of casting __constant__ T to global T* __restrict__, but a general answer would be fine too).
In the end, I could just check what the SASS looks like, but I thought somebody might have a ready answer (or has done in-depth tests or has a ready explanation). Though, I highly suspect that the general attribute behaviour should stay the same (which means my 2 examples would behave the same).

njuffa must know; he wrote several scholarly papers on the point of pointers, and sits on the c/c++ panel

i think the compiler might not be smart enough to track the inherited attributes or properties; but at the same time, it should be easy and acceptable to reinforce such attributes…

(i actually do not know the true answer)

I assume this was meant to be humorous because nothing in the above quote is true.

If my answers so far seem somewhat off the mark it is because I am trying to figure out what exactly the original poster is asking. My suggestion regarding study of the documentation was meant to be helpful, as the question(s) seemed to indicate that this might clarify things. My apologies if I was off the mark there as well.

If the question is, “is direct access to constant memory always going to result in exactly the same code as accessing said constant memory through a pointer”, I believe the answer is “not necessarily” because in certain instances such accesses through pointers may force the constant memory data to be “demoted” to simple global data. This has something to do with how memory space specific pointers are mapped to generic pointer space and how this interacts with the CUDA ABI. It has been several years since I last looked into this. It would be best to compare the generated SASS code.

I am also under the impression that the Maxwell architecture has abolished the separate constant cache, which has been subsumed into a general read-only cache, but it would be best to double-check that recollection against the relevant documentation.

Thanks guys.
It turns out that the answer to this is not obvious to anybody. English is not my primary language, so I happen to lack means of expression from time to time, and my questions might not be easy to follow; sorry about that.

I will track down what happens (when I find some available time) in the PTX and SASS and will update you with the results. I have a bunch of GPUs here, so I can test on different architectures (2.1, 3.0, 3.5, 3.7, 5.2).
Instead of only looking at the code, it would be nice to run some benchmarks too. Can you think of any appropriate code to show us whether the constant cache is there and working as expected (through a pointer and through regular __constant__ memory)?

What is the ultimate goal of this work? Designing a microbenchmark to reverse engineer certain GPU hardware aspects? One idea could be to use the fact that the performance of the constant cache differs for non-uniform versus uniform access, while this makes no difference to the other caches (to my knowledge). In other words, perform one run where all threads in a warp access the same address, then another where all threads have a different address.
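A possible shape for such a microbenchmark (a sketch, untested here; an aggressive compiler may still hoist or combine the loads, so verify the generated SASS):

```cuda
__constant__ float tbl[32];

// stride = 0: all threads read the same address each iteration
//             (uniform -> broadcast, one transaction per warp)
// stride = 1: threads in a warp read 32 distinct addresses
//             (non-uniform -> serialized transactions)
__global__ void readConst(float *out, int iters, int stride)
{
    float acc = 0.0f;
    int base = threadIdx.x * stride;
    for (int i = 0; i < iters; ++i)
        acc += tbl[(base + i) & 31];
    out[threadIdx.x] = acc;   // keep the loop from being optimized away
}
```

Timing readConst(…, 0) against readConst(…, 1) with cudaEvent timers should show a large gap if the reads go through the constant cache, and little or none if they go through a generic cache.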

To expand a bit on the “demotion to global” aspect. GPUs provide multiple memory spaces and offer specialized load instructions for each memory space, e.g. LDC for constant memory, LDL for local memory. But C/C++ are largely designed around a “a pointer is a pointer is a pointer” concept meaning there is a single unified address space with generic pointers.

To support full C++, GPU architectures therefore support generic pointers and an ABI for sm_20 and later (= all architectures supported by CUDA 7.0). For function calls, the ABI requires any pointers passed to be generic pointers. The hardware supplies pointer conversion facilities (exposed as the CVTA instruction at the PTX level) to convert between generic and memory-space specific pointers. As I recall there were limitations to that for pointers to constant memory, at least initially. These hardware limitations may have disappeared with more recent GPU architectures, or a software workaround may have been found.

Since most functions get inlined even when the ABI is in use (which it is by default), situations that require constant-to-generic pointer conversions are unlikely to occur. They can probably be forced by using separate compilation which precludes inlining and therefore requires pointers-to-constant to be converted to generic pointer. Where conversion is not feasible, the constant data must be “demoted” to global data by the compiler, as the conversion between pointers to global memory and generic pointer is always possible.
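One way to try to provoke that conversion (a sketch; whether demotion actually occurs is toolchain- and architecture-dependent, so compare the SASS):

```cuda
__constant__ float kData[32];

// __noinline__ (or separate compilation with nvcc -rdc=true) keeps
// this function from being inlined, so the ABI requires 'p' to be a
// generic pointer; passing kData then forces a constant-to-generic
// conversion, or demotion of kData to global data.
__device__ __noinline__ float readThrough(const float *p, int i)
{
    return p[i];
}

__global__ void test(float *out, int i)
{
    out[0] = readThrough(kData, i);   // check SASS: LDC vs. LD/LDG
}
```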

If anybody cares, as I promised, here are the results.
Using constant data through a global pointer does NOT use constant reads when the initialization of the pointer happens in one kernel and its use in another.
Using constant data on 3.7 and 5.2 didn’t seem to have any benefit for our kernel (the render time was exactly the same), whereas on 2.1 the speed increase was drastic (25% overall).