Hi there!
I have a cryptographic algorithm that gets a huge performance gain if it is compiled to SASS dynamically. Each hash function has its own unique set of operations, and the only way to avoid an enormous number of ifs and the resulting divergence is to compile it on the fly.
I have a compiler that compiles it to a cubin. The problem is loading the cubin onto the device: the cost of cuModuleLoadData[Ex] is very high, and the slower the PCI-E link, the slower the module load.
So the question is: can I load SASS code directly into device memory from device memory? Is there any way to compile SASS on the GPU and then execute a kernel without CPU/PCI-E involvement?
I read some posts from 2008-2010 where people were able to solve a similar problem with the Debugger API and CUDAOpen64, but things have changed since then, I suppose.
Does gdev still work? If it does, it might be a solution.