I am trying to convert BlackScholes from the CUDA SDK 1.1 to use the driver API (running on Linux x86_64 with driver 169.09 in Xen). BlackScholes invokes the same kernel (BlackScholesGPU) NUM_ITERATIONS times. If I load the module in every iteration using cuModuleLoad() (with or without a cuModuleUnload() before the loop ends), I get a segmentation fault. If I load the module once (outside the loop) and then call cuModuleGetFunction() inside the loop, it fails after a few iterations. If I also move cuModuleGetFunction() out of the loop and only call cuFuncSetBlockShape(), cuFuncSetSharedSize(), cuParamSet*(), cuParamSetSize() and cuLaunchGrid() inside it, the code runs, but the results are still wrong. I have two concerns:
- Why are the results incorrect, given that the code corresponds exactly to the host API version?
- If I have multiple functions in the same module and need to call cuLaunchGrid() on all of them in each iteration of the loop, how can I avoid calling cuModuleGetFunction() for every function inside a loop that runs many times?
I am working against a very tight deadline. Any help with this is greatly appreciated. Thanks.