Hi,
I have a project that is organized in classes. My code is separated into .cpp and .h files. Inside the .h files I have the class definitions, and inside the .cpp files I have the implementations.
I’d like to know the clean way to create hybrid methods (that run on both host and device hardware). I’ve tried everything, but I can’t make it work.
Thanks.
What was your problem exactly? What did you stumble upon?
Some general notes on your problem:
-
__device__ functions (and hence __device__ member functions) cannot be linked. Everything has to be processed in the same .cu file. While you can split your code into separate .h files, all of the implementation (not just the declaration) must somehow be included in your main .cu file. I hope there will be some linking in the future, though…
-
Up to compute capability 1.3, there is no support for function calls. What the compiler does is inline everything. This also implies you cannot have virtual functions. While 2.0 devices support function calls, I bet it is an expensive operation (correct me if I am wrong) – much more expensive than on the host side, as you have to keep a stack for every single thread that is running.
-
If you have a method (or constructor/destructor) that is supposed to run on both host and device, put __host__ __device__ in front of its name. If you provide only host constructors, the class will have no constructors on the device side (not even a default constructor). See the sketch after these notes.
-
If you want your code to diverge depending on whether it is running on the host or the device side, use the macro:
#ifdef __CUDA_ARCH__
//device code
#else
//host code
#endif
Your code is parsed twice, once for host and once for device. The macro is defined only when parsing device code. This functionality was somewhat broken in previous versions of CUDA, but I hope it works fine now.
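For example, here is a minimal sketch of a hybrid method, using a toy struct Vec that is not from your project (__fadd_rn is a device-only intrinsic, used here just to show the two paths diverging):

// vec.h – must be #included from your main .cu file, since device code cannot be linked
struct Vec {
    float x, y;

    // usable on both sides; also gives the device side a constructor
    __host__ __device__ Vec(float x_, float y_) : x(x_), y(y_) {}

    __host__ __device__ float sum() const {
#ifdef __CUDA_ARCH__
        // device path: round-to-nearest device intrinsic
        return __fadd_rn(x, y);
#else
        // host path: plain C++ addition
        return x + y;
#endif
    }
};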
Ok. Here’s my problem: I have a huge project. This project has many classes, each in a separate file. Now I want to “see” those classes in my kernel function. So I have to put them all in just one .cu file? I can’t keep my code divided? I think the .cu file will be too big, which will make it hard to find what I want and to understand the code.
I was testing some things, and it worked when, inside a .cu file, I called a function from another .cu file. But now my doubt is: does it make my program slower?
Many thanks for the help.
I’ve done this before… check out my CUDA ray tracer at
http://www.cs264.org/projects/web/Patel_Krunal/index.html
to see how best to divvy up your classes.
Thanks for your answer. I didn’t know about this inline-file approach. Is it compiled without linking, so CUDA can use it?
Hey, if you look at my post a little above yours, I have the same issue. The thing is, you will have to define all your class member functions as __device__ functions. If you go the inline way, basically everything is included in one file. You can make separate files, but in the end, when you build, you only build one file (i.e. nvcc -c MainWithInlines.cu -o main.o)… It’s a pain to debug.
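For example, a rough sketch of that single-translation-unit layout (the file names example.h and example.inl are just illustrative, not from the actual project):

// MainWithInlines.cu – the one file nvcc actually compiles
#include "example.h"    // class declarations
#include "example.inl"  // member function implementations, pulled in textually
// … kernels and host code that use the classes go here …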
Also, you will probably have to modify how your classes operate on data, as you can’t just take a class like
class example {
protected:
    float *a;  // data array
    int n;     // number of elements (added here so the loop bound is defined)
public:
    void AddA(void);
};

void example::AddA() {
    // loop over all indexes serially
    for (int i = 0; i < n; ++i)
        a[i] += 1.0f;
}
and then throw it into CUDA. It will (a) throw errors, as you know, and (b) probably won’t operate on your data in an efficient/correct fashion.
You will still have to modify the class so that instead of doing serial operations it does parallel operations… otherwise it is pointless and there’s no reason to use your classes in CUDA. I.e., your new AddA function would have to look like:
__device__ void example::AddA() {
    // one thread per element instead of a serial loop
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)  // guard against extra threads in the last block
        a[idx] += 1.0f;
}
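To actually run it, you would launch a kernel with one thread per element. A hypothetical sketch (AddAKernel and d_obj are my names, and d_obj is assumed to point to an example instance already copied to device memory):

__global__ void AddAKernel(example *d_obj) {
    d_obj->AddA();  // each thread increments one element
}

// host side, covering all n elements with 256-thread blocks:
// AddAKernel<<<(n + 255) / 256, 256>>>(d_obj);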
And on top of that, you will then have to inline your entire class prototype and the class definition… So much fun! Again, see my newer post on how to inline structures to give you an idea of what you need to do.
So, in conclusion, changing over all your classes is going to be a manual process… and it may not be feasible for your application (or maybe it is – you’re lucky then!).