cudaMemcpyToSymbol not working

Hello,

I have a simple program that assigns values to some global constant variables on the device, such as:

#include <iostream>
#include <cuda.h>

__constant__ float FCONST[2];

int main(){
     float a[2];
     float b[2];
     a[0] = 1.0f;
     a[1] = 2.0f;
     
     cudaMemcpyToSymbol( FCONST, a, 2*sizeof(float), 0, cudaMemcpyHostToDevice);
     cudaMemcpy(b, FCONST, 2*sizeof(float), cudaMemcpyDeviceToHost);    
     
     std::cout<< "a[0]=" << a[0] << " , a[1]=" << a[1] << std::endl;    
     std::cout<< "b[0]=" << b[0] << " , b[1]=" << b[1] << std::endl;     
     return 0;
}

However, the above code does not run correctly, and I cannot figure out why.
It gives me:

a[0]=1 a[1]=2
b[0]=0 b[1]=0

I tried googling the problem, but all the threads/discussions I found imply that the above code is correct.

I’ve tested using “cudaMemcpyDefault”, and &FCONST[0] instead of FCONST, but neither method works.
Maybe I missed something here. Please help me clarify it.

Great thanks!
Best

#edit:
I also tested the code with “cudaMemcpyFromSymbol”.
This time “b” gets the correct results, but when I checked with Nsight, “FCONST” remained 0 when I paused to inspect the value.
So I’m a bit confused: can variables on the device not be inspected at run time with Nsight, or are they shown incorrectly?

Thanks.

This is not legal:

cudaMemcpy(b, FCONST, 2*sizeof(float), cudaMemcpyDeviceToHost);

In the future, use proper CUDA error checking.

When copying to or from a symbol, you must use the appropriate API.

try:

cudaMemcpyFromSymbol(b, FCONST, 2*sizeof(float));

If you continue to have trouble, please implement proper CUDA error checking, before asking others for help.
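To put the pieces together, here is a minimal corrected sketch of the original program, using the symbol-copy APIs in both directions plus a basic error-checking macro (the macro is an assumption for illustration; the CUDA samples ship a similar `checkCudaErrors` helper). It needs a CUDA-capable machine to run:

```cuda
#include <iostream>
#include <cuda_runtime.h>

// Minimal error-checking helper; hypothetical, shown for illustration.
#define CHECK(call) do {                                              \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__  \
                  << ": " << cudaGetErrorString(err) << std::endl;    \
        return 1;                                                     \
    }                                                                 \
} while (0)

__constant__ float FCONST[2];

int main() {
    float a[2] = {1.0f, 2.0f};
    float b[2] = {0.0f, 0.0f};

    // Both directions go through the symbol APIs, not plain cudaMemcpy.
    CHECK(cudaMemcpyToSymbol(FCONST, a, 2 * sizeof(float)));
    CHECK(cudaMemcpyFromSymbol(b, FCONST, 2 * sizeof(float)));

    std::cout << "b[0]=" << b[0] << " , b[1]=" << b[1] << std::endl;
    return 0;
}
```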

Hello,

Thanks for your reply first.

In fact I have some other problems with “cublas”, and the above issue was one possible cause, which has now been proven untrue. And I do use CUDA error checking in my code.

Here is the problem: I would like to run cuBLAS matrix operations (such as “cublasDgeam”) using the global device constants.

#include <iostream>
#include <cuda.h>
#include <cublas_v2.h>
// checkCudaErrors / checkCublasErrors are error-checking helper macros
// (e.g., like the ones in the CUDA samples' helper_cuda.h).

__constant__ double DCONST[2];

int main(){
     double a[2];
     a[0] = 0.0;
     a[1] = 1.0;
     checkCudaErrors( cudaMemcpyToSymbol( DCONST, a, 2*sizeof(double), 0, cudaMemcpyHostToDevice) );

     cublasHandle_t global_cublasHandle;
     checkCublasErrors( cublasCreate(&global_cublasHandle) );

     double * dat_rsc, * dat_dst;
     size_t pitch_rsc, pitch_dst, nrow_rsc, ncol_rsc;
     nrow_rsc = 1024;    // some value
     ncol_rsc = 800;     // some value
     size_t T_s = sizeof(double);

     // allocate data on device and set to 0
     checkCudaErrors( cudaMallocPitch(&dat_rsc, &pitch_rsc, ncol_rsc*T_s, nrow_rsc) );
     checkCudaErrors( cudaMemset(dat_rsc, 0, pitch_rsc*nrow_rsc) );
     checkCudaErrors( cudaMallocPitch(&dat_dst, &pitch_dst, nrow_rsc*T_s, ncol_rsc) );
     checkCudaErrors( cudaMemset(dat_dst, 0, pitch_dst*ncol_rsc) );

     // ... dat_rsc is initialized with some values by cudaMemcpy, successfully ... //

     // a simple transpose by cublasDgeam
     checkCublasErrors( cublasDgeam( global_cublasHandle, CUBLAS_OP_T, CUBLAS_OP_T,
                                     nrow_rsc, ncol_rsc,
                                     (const double*) &DCONST[1],
                                     (const double*) dat_rsc,  (int) (pitch_rsc/T_s),
                                     (const double*) &DCONST[0],
                                     (const double*) dat_rsc,  (int) (pitch_rsc/T_s),
                                                     dat_dst,  (int) (pitch_dst/T_s) ) );

     checkCublasErrors( cublasDestroy(global_cublasHandle) );

     return 0;
}

However, every time I run it, dat_dst is always 0, which means the operation is incorrect. I therefore suspected the cause might be that “DCONST” remains 0 all the time (which is not the case).

Would you please check what is wrong in the above code?

Thanks a lot!

It is not legal in CUDA to take the address of a device variable (or device function) in host code.

This is a device variable:

__constant__ double DCONST[2];

This is taking the address of a device variable in host code:

(const double*) &DCONST[1],

Since constant memory is not writable from device code, I can’t imagine why you would do this, it certainly could not be for functionality or performance reasons. You already have alpha and beta available to you in host code - it’s what you used to initialize constant memory. Why not just pass pointers to those values in host code?

checkCublasErrors(  cublasDgeam ( global_cublasHandle,     CUBLAS_OP_T,   CUBLAS_OP_T,
                                        nrow_rsc, ncol_rsc,
                                        (const double*) a+1,
                                        (const double*) dat_rsc,  (int) (pitch_rsc/T_s),
                                        (const double*) a,
                                        (const double*) dat_rsc,  (int) (pitch_rsc/T_s),
                                                        dat_dst,  (int) (pitch_dst/T_s) ) );

That makes sense, and I also suspected it was because of the device pointers, though I did not have evidence for it.

In fact, we would like to put all variables (such as dat_rsc, dat_dst, pitch_rsc, nrow_rsc, ncol_rsc) on the device and launch “cublasDgeam” from the host. Is it possible to do that?

Thanks for your clarification.

dat_rsc and dat_dst are already on the device (i.e. they point to data/allocations that are located in device memory). You may be confused in general about what constitutes a host pointer in CUDA and what constitutes a device pointer in CUDA.

No, you can’t put everything, including the scalar (non-pointer) variables, on the device and call the function from host code. The API isn’t designed to work that way. You could do that in device code, but you have specifically said “and launch “cublasDgeam” from host”.

BTW, CUDA provides no guarantee that I know of that the pitch will be evenly divisible by a particular element size. It will probably work in practice, but I’m not sure it is guaranteed. I think it’s not typical to use the cuBLAS API with a pitched allocation, although it’s possible to make it work with an appropriate choice of leading dimension parameters (and acknowledging the aforementioned lack of guarantee). I personally think the safe thing to do would be a run-time check that the pitch is evenly divisible by the element size.

https://stackoverflow.com/questions/36362306/how-do-the-cusparse-and-cublas-libraries-deal-with-memory-allocated-using-cudama
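The run-time check can be a one-liner; a fragment, assuming the pitches and T_s from the code earlier in the thread:

```cuda
// Verify the pitch is an exact multiple of the element size, so that
// pitch / T_s is a valid leading dimension to hand to cuBLAS.
if (pitch_rsc % T_s != 0 || pitch_dst % T_s != 0) {
    std::cerr << "pitch is not a multiple of sizeof(double); "
                 "consider a plain cudaMalloc allocation instead" << std::endl;
    return 1;
}
```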

Thanks a lot for your explanation!
I will watch out for the pitch issue for sure.

Best.