CUDA C++ - Class accessing constant memory results in load from global memory instead


I have been working on a class that manages an array of data. The class has an overloaded () operator to carry out the array indexing. Since the indexing relies on the array dimensions (NX & NY), and these dimensions are constant, it would be beneficial to store the dimension values in device constant memory instead of global memory. The code below is intended to do so. However, when the code is profiled, the values appear to be loaded from global memory instead of constant memory.

For example, in the code below, the class “Foo” is used to create, free, copy and access the array members (arr, d_arr) on the host and device sides. The dimensions of the array are copied to device constant memory (nx & ny). The array index is evaluated properly and retrieves the correct data. However, when I profiled the code with Nsight Compute (attached file), the overloaded () operator carries out 3 loads from global memory, instead of 2 loads from constant memory (nx & ny) and 1 load from global memory (d_arr).

Any suggestions on how to make the overloaded operator() load the (nx & ny) values from constant memory instead of global memory?


class_constMem_1.ncu-rep (207.5 KB)

#include <iostream>
#include <algorithm>
#ifdef _WIN64
#include <device_launch_parameters.h>
#endif
#include <cuda_runtime.h>

using namespace std;

__constant__ int nx, ny;

struct Foo {

	Foo(int NX, int NY, int NZ) {
		arr = new float[NX * NY * NZ]; /* Allocate memory on the host side */
		for (int i = 0; i != NX * NY * NZ; ++i) { /* Fill the host memory with data */
			arr[i] = float(i);
		}
		cudaMalloc(&d_arr, NX * NY * NZ * sizeof(float)); /* Allocate memory on the device side */
		cudaMemcpy(d_arr, arr, NX * NY * NZ * sizeof(float), cudaMemcpyHostToDevice); /* Copy data from host to device */
	}

	~Foo() {
		delete[] arr; /* Free memory on the host side */
		cudaFree(d_arr); /* Free memory on the device side */
	}

	__device__ inline float& operator()(int x, int y, int z) { return d_arr[x + nx * (y + ny * z)]; } /* Overloaded operator for loading data */

	float* arr;
	float* d_arr;
};

__global__ void cuda_kernel(Foo* obj) {
	printf("loaded value from device side = %f\n", (*obj)(1, 1, 1));
}

int main() {
	int NX = 10, NY = 10, NZ = 10; /* Array dimensions */
	Foo h_obj(NX, NY, NZ); /* Create a class object on the host side */
	cout << "Loaded value from host side = " << h_obj.arr[1 + 10 * (1 + 10 * 1)] << endl; /* Test: load a value from the array on the host side */

	Foo* d_obj; /* Device pointer to a class object */
	cudaMalloc(&d_obj, sizeof(Foo)); /* Allocate memory for the device object */
	cudaMemcpy(d_obj, &h_obj, sizeof(Foo), cudaMemcpyHostToDevice); /* Copy the object from host to device */
	cudaMemcpyToSymbol(nx, &NX, sizeof(int)); /* Copy data to device constant memory */
	cudaMemcpyToSymbol(ny, &NY, sizeof(int)); /* Copy data to device constant memory */

	cuda_kernel<<<1, 1>>>(d_obj); cudaDeviceSynchronize(); cudaPeekAtLastError(); /* Test kernel */
	cout << "Program end!" << endl;
	return 0;
}
__constant__ memory is a logical space that is backed by a cache and, effectively, global memory - i.e. it has the same physical backing as global memory, but is not the same logical space.

If you miss in the __constant__ cache, then a load is going to be triggered from this physical backing.

There is no possibility that the operator() you have shown is actually loading nx or ny from the logical global space rather than the logical __constant__ space.

If you have further questions about why the profiler reports these details the way it does, I suggest asking on the Nsight Compute forum.

If you paste your code into Compiler Explorer (CUDA C++, NVCC 12.0.1, SASS for compute_80), you will see that nx and ny are loaded via immediate constant operands in the IADD3 and IMAD instructions. Constant bank 3 (c[3]) is the constant bank reserved for user-defined constants. Constant bank 0 (c[0]) is the location for driver/compiler constants (e.g. the location of the stack, gridDim, blockDim, …) and kernel parameters.

 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] 
 IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x160] 
 MOV R5, c[0x0][0x164] 
 ULDC.64 UR4, c[0x0][0x118] 
 IADD3 R1, R1, -0x8, RZ 
 LDG.E.64 R2, [R4.64+0x8] 
 IMAD.MOV.U32 R7, RZ, RZ, 0x1 
 IADD3 R0, R7, c[0x3][0x4], RZ         <<-- ny is loaded from c[3][4]
 IMAD R7, R0, c[0x3][0x0], R7          <<-- nx is loaded from c[3][0]
 IMAD.WIDE R2, R7, 0x4, R2 
 LD.E R2, [R2.64] 
 IADD3 R6, P0, R1, c[0x0][0x20], RZ 
 UMOV UR4, 32@lo($str) 
 UMOV UR5, 32@hi($str) 
 IADD3.X R7, RZ, c[0x0][0x24], RZ, P0, !PT 
 IMAD.U32 R4, RZ, RZ, UR4 
 MOV R5, UR5 
 F2F.F64.F32 R8, R2 
 STL.64 [R1], R8 
 MOV R20, 32@lo((cuda_kernel(Foo*) + .L_x_0@srel)) 
 MOV R21, 32@hi((cuda_kernel(Foo*) + .L_x_0@srel)) 
 CALL.ABS.NOINC `(vprintf) 
 BRA `(.L_x_1)
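If you prefer to check this locally rather than in Compiler Explorer, the same SASS can be dumped with the CUDA binary utilities. This is just a sketch; the file name class_constMem.cu is a placeholder for your source file:

```shell
# Compile for sm_80 and dump the SASS to inspect constant-bank operands.
nvcc -arch=sm_80 -O3 -cubin -o class_constMem.cubin class_constMem.cu
# Any c[0x3][...] operand is a load from the user __constant__ bank;
# LDG instructions are loads from the logical global space.
cuobjdump -sass class_constMem.cubin | grep -E 'c\[0x3\]|LDG'
```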

Thanks for the reply, I will check the nsight compute forum and documentation.

Thanks for the reply,

I was compiling in debug mode, which results in a more complicated instruction sequence: I was getting several load operations (LD) instead of only one. Compiling in release mode removed most of those loads, but there are still two: (LDG.E.64 R2, [R4.64+0x8]) and (LD.E R2, [R2.64]). Would you please elaborate on that? I expected only one load operation from global memory.


Why? From a very quick perusal of the code (so it is entirely possible I missed something), I see a pointer obj being passed to the kernel, so wouldn’t one expect: load obj->d_arr (a 64-bit load), copy d_arr[0] (32-bit load followed by 32-bit store)?
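One way to avoid that first 64-bit load entirely is to pass the object to the kernel by value instead of by pointer, so the d_arr pointer arrives in constant bank 0 along with the other kernel parameters. This is a hedged sketch, not code from the thread; FooView is a hypothetical trivially-copyable "view" struct (the original Foo cannot safely be passed by value because its destructor frees the memory):

```cuda
#include <cstdio>

__constant__ int nx, ny;

/* Hypothetical plain view of the device array: no constructor/destructor,
 * so it is safe to copy into the kernel's parameter space. */
struct FooView {
    float* d_arr;
    __device__ float& operator()(int x, int y, int z) {
        return d_arr[x + nx * (y + ny * z)];
    }
};

/* Passed by value: d_arr is read from c[0] (kernel parameters),
 * leaving only the one LD for the array element itself. */
__global__ void cuda_kernel(FooView v) {
    printf("loaded value from device side = %f\n", v(1, 1, 1));
}
```

With this layout there is no Foo object in global memory to dereference, so the LDG that fetched obj->d_arr disappears from the SASS.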

Thanks for the reply,

The way you described it makes it clearer to me. It seems I don’t know much about the process of loading the object and the different access widths, such as 64-bit and 32-bit loads. Would you suggest a reference to read on this topic? I work in scientific computing and try to minimize the bottleneck of global memory loads/stores.

I would claim that to understand the different memory spaces it is sufficient to read the CUDA documentation.

To correlate machine instructions with high-level code requires general understanding of processor instruction sets and studying how HLL code gets translated to machine instructions. I picked this up “organically” over 40 years of programming on half a dozen processor families. It is made more difficult by the extensive transformations applied by modern optimizing compilers and the fact that NVIDIA provides only very basic documentation of GPU machine instructions.
