CUDA header input file: declaring device variables in a separate .h or .cuh file

I will begin this post as I always do: I am a noob so I may describe things in a somewhat strange way… you have been warned :-)

A little bit of background: I am working in a research lab that is trying to model plant growth patterns using simple dynamical equations for plant biomass and nutrient and water concentrations. When I came in, all of their input variables were scattered throughout 500+ lines of code, some declared in multiple places (once for the extern “C” functions and once for the device functions). Being an engineer, I couldn’t stand this; it seemed inefficient and too easy to forget to update every instance of a variable, so I pushed to have all the input variables put into a single header file used by both the host and device functions. The ultimate goal is to have the header file initialize all of the variables, then read their values from a .txt file so the model can be changed without rebuilding the source each time.

So I went through and moved ALL the variables to a separate header file. This appeared to work: there were no complaints about variables not being initialized properly or the like. However, I noticed that the behavior of the code was not at all what we expected. It turned out that although the variables were being declared properly and given values, null values were being passed into the functions. So I decided to take a step back and work with just one variable, to make sure I fully understood what was happening before tackling all of them.

We have a variable called “ppt” which represents precipitation. It is used in one device function and one extern “C” function. I thought that declaring it in a header file as

__constant__ float ppt = 123.456;

would allow both functions to access it. In fact both could use it, but a printf statement in the extern “C” function showed the value being passed in as ppt = 0.000000. I couldn’t get cuPrintf to work, so I’m not sure what is being passed into the device function (that is the subject of another of my posts: http://forums.nvidia.com/index.php?showtopic=202551). Either way, it does not seem correct.

So I played around with different setups and I finally found that declaring two variables in a header file with the form

__device__ float dev_ppt = 123.456;

float ppt = 123.456;

seemed to do the trick; the dynamics were working again. However, this messed up the graphics being displayed. (We are using OpenGL for the graphics; I don’t know whether that is pertinent to this issue or not.) To find the source of the problem, I commented out the variable declarations in the header file and in the functions themselves, in turn, to see at which point the graphics broke. It turned out that declaring the host ppt in either place makes no difference, but declaring the device ppt in the header file is what caused the strange graphics. I tried both __device__ float and __constant__ float and got the same results.

The correct graphics should look like this (screenshot omitted):

The messed-up graphics look like this (screenshot omitted):

I tried to do the screen grabs at about the same time, but as you can see the fps rate is pretty high, so don’t worry too much about the shade of orange of the dots. The messed-up part is that they look distorted; they should have clean, smooth edges.

Any ideas of what is going on and how to fix it? Also, any ideas on how to declare the variable once so the user doesn’t have to enter it twice? When I move things to a .txt file this may become unimportant, since I can just assign the same value from the text file to two separate variables in the code, but that doesn’t seem like the best way to go about it…

Any help on this (or my other post) would be GREATLY appreciated!

Good morning there,

"

I couldn’t get my cuPrintf to work, so I’m not sure what is being passed into the device function

"

Why are you using cuPrintf to debug a variable on the GPU? Use Parallel Nsight; it is easy to use.

About your question:

"Any ideas of what is going on and how to fix it?"

How do you pass the variable to the device function? If you have a host scalar variable, you can pass it directly to the device function by value (but without & or *!). If you want to pass a reference or pointer, you should first declare your variable, then move your data from host to device, and only then can you use the device copy from your device function.

Hope this helps.

Paolo

Thank you Paolo!

I have not heard of Parallel Nsight, but I will look into it.

As of now I am not passing variables into the device function; rather, I am trying to declare all of the variables at once and just access them from the functions, both on the host and the device. From what I understand, repeatedly passing variables from the host to the device is slow and inefficient, so I’d like to copy each variable one time so it is accessible from both the host and the device. I have tried copying the memory, unsuccessfully, but I think that is just user error from not fully understanding what I’m doing.

If I declare, for example,

float ppt = 123.456;

__constant__ float dev_ppt;

CUDA_SAFE_CALL( cudaMalloc( (void**) &dev_ppt, sizeof(float)));

CUDA_SAFE_CALL( cudaMemcpy( dev_ppt, ppt, sizeof(float), cudaMemcpyHostToDevice) );

Will this allow me to set the value for ppt once and have it readily accessible from the host and the device without further transfers between them?

Thanks again, Paolo!

PS: I just looked up Parallel Nsight. Unfortunately I am not running Windows; my coding is all done on a MacBook. I will look around for similar products though; that is exactly what I need! I tried using NVIDIA’s cuprof, but my programs are not self-terminating, so it not only crashed cuprof, I also had to restart my computer twice because the graphics were completely messed up.

I’m under the impression that __constant__ objects can only be used by kernels that reside in the same .cu file.

__constant__ objects exist as symbols in .cubin files. Maybe a kernel in one .cubin file has problems accessing symbols in another .cubin file.
I’m not sure whether it’s a linking problem or whether it’s true that kernels cannot access symbols defined in other cubins.

For debugging on Linux and Mac, read http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/cuda-gdb.pdf (or go to your CUDA installation directory and open doc/cuda-gdb.pdf). cuda-gdb has no GUI, so you have to use the command line, but it is better than cuPrintf ;-). I think cuda-gdb is the only way to debug a CUDA application on a Mac.

Some notes:

  • If you want to share the same variables between host and device, you can use zero-copy memory (you no longer need cudaMemcpy and similar), but this is not the best solution for performance; use it only on integrated GPUs;

  • __constant__ memory is not always a good choice for performance; use it when all 16 threads of a half-warp read the same value (if each thread of the same warp needs a different value from a constant array, the reads are serialized, which increases “traffic” on the GPU and kills performance!);

  • The best way to achieve good performance (in transfer rate) is to use pinned memory, sometimes called page-locked memory, with asynchronous calls; but you should still make sure to reduce the number of memcpy calls!

  • In kernels, when possible, use the cached memory spaces (shared, constant, texture) instead of plain global memory;

  • If you have many scalar variables, you can put them in one array (or struct); that way a single memcpy transfers all your data.
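That last point can be sketched like this (the struct and field names are illustrative, not from your code); a single cudaMemcpyToSymbol moves every scalar at once:

```cuda
// All simulation scalars packed into one block.
struct SimParams {
    float ppt;          // precipitation
    float growth_rate;  // illustrative second parameter
};

// Readable by kernels defined in this same .cu file.
__constant__ SimParams dev_params;

// Host side, once at startup: the __constant__ symbol already has
// storage, so there is no cudaMalloc, just one copy to the symbol.
SimParams h_params = { 123.456f, 0.01f };
cudaMemcpyToSymbol(dev_params, &h_params, sizeof(SimParams));
```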

Below are some examples of how to use memcpy with the various kinds of memory.

[zero-copy memory]

Sorry, I have never used it, so read doc/CUDA_C_Programming_Guide.pdf from your CUDA installation directory.
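For what it’s worth, my understanding of the zero-copy pattern from the programming guide is roughly the following (an untested sketch; error checking omitted). The key point is that the device pointer aliases the host allocation, so no cudaMemcpy is needed, but the GPU reads the data over the bus on every access:

```cuda
// Allow host allocations to be mapped into the device address space.
// Must be called before the CUDA context is created.
cudaSetDeviceFlags(cudaDeviceMapHost);

float *hP, *dP;

// Mapped, page-locked host allocation.
cudaHostAlloc((void **)&hP, sizeArray * sizeof(float), cudaHostAllocMapped);

// TODO: add data to hP

// Get the device-side alias of the same memory.
cudaHostGetDevicePointer((void **)&dP, hP, 0);

// TODO: kernel call using dP

cudaFreeHost(hP);
```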

[global memory]

#define sizeArray 512

int *dP;
int *hP;

// malloc on host
hP = (int *)malloc(sizeArray * sizeof(int));

// TODO: add data to hP

// malloc on device
cudaMalloc((void **)&dP, sizeArray * sizeof(int));

// memcpy from host to device
cudaMemcpy(dP, hP, sizeArray * sizeof(int), cudaMemcpyHostToDevice);

// TODO: kernel call

// free memory
free(hP);
cudaFree(dP);

[pinned memory]

#define sizeArray 512

int *dP;
int *hP;

// pinned (page-locked) malloc on host
cudaMallocHost((void **)&hP, sizeArray * sizeof(int));

// TODO: add data to hP

// malloc on device
cudaMalloc((void **)&dP, sizeArray * sizeof(int));

// memcpy from host to device
cudaMemcpy(dP, hP, sizeArray * sizeof(int), cudaMemcpyHostToDevice);

// TODO: kernel calls

// free memory
cudaFreeHost(hP);
cudaFree(dP);

[constant memory]

#define sizeArray 512

__constant__ int dP[sizeArray];

int hP[sizeArray];

// TODO: add data to hP

// copy to the constant symbol (no cudaMalloc needed)
cudaMemcpyToSymbol(dP, hP, sizeArray * sizeof(int), 0, cudaMemcpyHostToDevice);

// TODO: kernel calls
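For a single scalar like your ppt, the same mechanism applies; a minimal sketch (dev_ppt is just an illustrative name, and note the source argument is an address):

```cuda
// In the .cu file that also defines the kernels that read it.
__constant__ float dev_ppt;

// Host side, once at startup; no cudaMalloc for a __constant__ symbol.
float ppt = 123.456f;
cudaMemcpyToSymbol(dev_ppt, &ppt, sizeof(float));

// Kernels in this same .cu file can now read dev_ppt directly.
```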

If you need the texture cache, asynchronous memcpy, and so on, read doc/CUDA_C_Programming_Guide.pdf from your CUDA installation directory; for best practices, read doc/CUDA_C_Best_Practices_Guide.pdf.

Cheers

Paolo

I’ve been getting the sense from some of what I read on the forums that this may be the case. I know that CUDA files link differently from C++ files, but I’m confused as to why it allocates and recognizes the symbol, but not the value.

Thank you for your input!

Wow, this looks great! Unfortunately I am away from my desk for the weekend, but first thing Monday morning I’ll be testing these methods out.

The idea is to initialize and set the variables once, copy them to the device once, and then use them repeatedly. The variables I am trying to move to this file are all constant, though they may vary in size between runs. My end goal is to have a header file that initializes all of the variables for both the host and the device, gives them safe default values, and then attempts to load a text file that overwrites some of them, similar to a Fortran namelist file. This will allow the user to modify the simulation’s parameters without having to rebuild each time, and the code will not rely on a complete (or even existing) input text file; it will just use whatever is provided.

Thanks again, and have a great weekend! I’m not sure about where you are, but the weather is beautiful here in Atlanta. Time to get rid of this LCD tan I’ve been working on all week! haha

~Josh

At the moment, I am in Italy… but I hope to go back next week to my homeland in Croatia :D.

Have a nice weekend

Paolo