Do const restrict pointers ever generate LDG.CI loads on CUDA 7?

allanmac · March 4, 2015, 10:57pm

I’m very very clear on what __ldg() does and when to use it but am surprised to see that, as far I can tell, none of my dutifully decorated pointers across a large CUDA codebase generate an LDG.CI.

On CUDA 7 and sm_50, it appears that an explicit call to __ldg() is required to generate the expected constant load.

Is this known? The “likelihood” of detection appears to be approaching zero.

CudaaduC · March 4, 2015, 11:07pm

So is there a difference between CUDA 6.5 and CUDA 7.0 in this regard?

Is looking at the verbose compile output a valid indicator of a constant load?

For example:

1>0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 48 registers, 2884 bytes smem, 376 bytes cmem[0], 12 bytes cmem[2]

What is the difference between cmem[0] and cmem[2] ?

Robert_Crovella · March 4, 2015, 11:12pm

I don’t observe this casually (1 datapoint).

When I compile the code in the answer here:

http://stackoverflow.com/questions/27239835/parallelizing-a-for-loop-1d-naive-convolution-in-cuda

(the one that has been decorated with const restrict) for either sm_35, sm_50, or sm_52 on CUDA 7 RC, and dump the sass, I find LDG instructions.

fully worked example, based on code above:

$ cat test.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

// RG*RG*MAXN must fit within mytype

#define MAXN 100000
#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256

typedef double mytype;

void conv(const mytype *A, const mytype *B, mytype* out, int N) {

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}



__global__ void conv_Kernel2(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *out, const int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      mytype my_sum = 0;
      for (int i = 0; i < N; i++)
        if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
      out[idx] = my_sum;
    }
}

int main(int argc, char *argv[]){


  mytype *h_A, *d_A, *h_result, *d_result, *result, *h_B, *d_B, *A, *B;
  if (argc != 2) {printf("must specify N on the command line\n"); return 1;}
  int my_N = atoi(argv[1]);
  if ((my_N < 1) || (my_N > MAXN)) {printf("N out of range\n"); return 1;}
  B   = (mytype *)malloc(my_N*sizeof(mytype));
  A   = (mytype *)malloc(my_N*sizeof(mytype));
  h_A = (mytype *)malloc(my_N*sizeof(mytype));
  h_B = (mytype *)malloc(my_N*sizeof(mytype));
  h_result = (mytype *)malloc(2*my_N*sizeof(mytype));
  result   = (mytype *)malloc(2*my_N*sizeof(mytype));

  cudaMalloc(&d_B, my_N*sizeof(mytype));
  cudaMalloc(&d_A, my_N*sizeof(mytype));
  cudaMalloc(&d_result, 2*my_N*sizeof(mytype));

  for (int i=0; i < my_N; i++){
    A[i] = rand()%RG;
    B[i] = rand()%RG;
    h_A[i] = A[i];
    h_B[i] = B[i];}

  for (int i=0; i < 2*my_N; i++){
    result[i]   = 0;
    h_result[i] = 0;}

  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, result, my_N);
  cpu_time = dtime_usec(cpu_time);

  cudaMemset(d_result, 0, 2*my_N*sizeof(mytype));

  unsigned long long gpu_time = dtime_usec(0);
  cudaMemcpy(d_A, h_A, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  conv_Kernel2<<<((2*(my_N-1))+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_result, my_N);
  cudaDeviceSynchronize();
  cudaMemcpy(h_result, d_result, 2*my_N*sizeof(mytype), cudaMemcpyDeviceToHost);
  gpu_time = dtime_usec(gpu_time);

  for (int i = 0; i < 2*my_N; i++) if (result[i] != h_result[i]) {printf("mismatch2 at %d, cpu: %d, gpu %d\n", i, result[i], h_result[i]); return 1;}
  printf("Finished.  Results match.  cpu time: %ldus, gpu time: %ldus\n", cpu_time, gpu_time);
  printf("cpu/gpu = %f\n", cpu_time/(float)gpu_time);
  return 0;
}
$ nvcc -arch=sm_50 -o test test.cu
$ cuobjdump -sass test |grep LDG
        /*0138*/               @P0 LDG.E.CI.64 R12, [R2];                                      /* 0xeed5a0000000020c */
        /*0148*/               @P0 LDG.E.CI.64 R6, [R4];                                       /* 0xeed5a00000000406 */
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Tue_Dec__9_18:10:46_CST_2014
Cuda compilation tools, release 7.0, V7.0.17
$

allanmac · March 4, 2015, 11:19pm

@CudaaduC, PTXAS output won’t help. You can only tell if LDG.CI’s are being generated by dumping the SASS.

allanmac · March 4, 2015, 11:22pm

@txbob, thanks for the simple counter-example to reassure me that they’re being generated.

I dug up code from a previously filed LDG bug report and added another test case and… now have a reproducer.

Filing a bug (#1617935).

Synopsis:

Packaging up your properly decorated pointers into a const struct which is then passed to the kernel appears to ignore the const _restrict qualifiers:

SPWorley · March 5, 2015, 12:04am

That’s a subtle bug catch!
NVidia should award you two Tshirts for that bug report instead of just one as usual. Or make it a different color to stand out at GTC.

njuffa · March 5, 2015, 12:05am

I am not enough of a C/C++ language lawyer, but in my experience the various modifiers/qualifiers (const, volatile, restrict) have non-obvious semantics once they are being used in non-trivial situations. I learned that the hard way. Best I recall, packaging up attributed pointers in a struct is one of those non-trivial uses that results in unintuitive semantics. Even after re-reading the relevant standards and talking to compiler engineers I never developed a comprehensive understanding about these attributes.

So while it is entirely possible that the compiler misses a chance for optimization here, I think it is also possible that the semantics prescribed by C/C++ simply do not allow the compiler to extract all the information it needs to determine that use of LDG is in fact safe.

Filing a bug is of course an excellent course of action, as it will bring clarity about the situation. The observed behavior will either be accepted as a bug or the desired behavior accepted as an RFE, or you willl get feedback that everything works as it should.

For now I would suggest sticking the const and restrict modifiers on simple pointer arguments to functions and leaving it at that.

allanmac · March 5, 2015, 12:14am

I was waiting for a lawyer to point out the exact line of the C++ spec where my expectations were incorrect but it looks like I still have legal standing. :)

Until then, there are two reasons to use the idiom I’m flogging:

Clarity: it packages the constant kernel args under one name.
Efficiency: one marshalling call on the host instead of one per kernel argument.

The defense rests!

@SPWorley, I’m definitely a redshirt when it comes to CUDA bugs.

Robert_Crovella · March 5, 2015, 12:17am

Posting of pictures to contain code seems incovenient. Perhaps that is your intent - you don’t want your code to be searchable or easily copyable?

Or is there a clever way to copy the code from that picture if I wanted to play with it?

allanmac · March 5, 2015, 12:25am

Nope, that is never my intent.

I usually throw snippets up on gist but it was easier to cut and paste color-coded text since the actual bug report, which has a link, also has the code.

You can now find the snippet here.

It would be great if someone could find a quick workaround!

Topic		Replies	Views
Does the use of 16-bit, __restrict__ const kernel arguments hurt performance? CUDA Programming and Performance	4	4290	May 24, 2018
Maxwell (sm_50) instruction: LDG.E ? CUDA Programming and Performance	25	8536	August 15, 2015
Random execution times and freezes with concurent kernels - 2 CUDA Programming and Performance	5	2640	November 10, 2015
Problem occurred when compiling julia_gpu.cu from Cuda by example using nvcc CUDA Programming and Performance	13	4406	November 11, 2014
first install of cuda CUDA Setup and Installation	6	7637	February 12, 2017
Undefined reference to a CUDA function in a dll (MSVC 2017 + MingW) CUDA Programming and Performance	10	1919	May 18, 2018
CUDA very slow performance CUDA Programming and Performance	21	16719	March 6, 2020
slow pointers initiation in kernel CUDA Programming and Performance	21	1773	July 8, 2016
Always got this warning when nvprof cuda file "This can happen if device ran out of memory or if a device kernel was stopped due to an assertion" on just HellowWorld GPU CUDA Programming and Performance	9	2557	January 31, 2019
CUDA compile trouble CUDA Programming and Performance	47	5117	November 8, 2010

Do const __restrict__ pointers *ever* generate LDG.CI loads on CUDA 7?

Related topics

Do const restrict pointers ever generate LDG.CI loads on CUDA 7?