I have been studying different approaches to executing code on the GPU in a managed fashion. In my studies I have noticed very strange behavior and am looking for a way to fix it.
In a .NET solution I have a console app that runs computations using:
- a) The ManagedCuda library. The library takes a "cubin" file as one of its parameters and creates a kernel from the contents of that file. Then, using the managed functions, we can allocate the memory necessary for on-device execution, etc. Internally this library uses the P/Invoke (DllImport) approach to invoke the unmanaged CUDA functions
- b) A C++ static library that contains the ".cu" file and produces the CUBIN for a). On top of it there is a C++/CLI dynamic wrapper library, which exposes the unmanaged C++ functions to the managed environment. This wrapper is then used by the console app
- c) A library that executes the same logic on the CPU
So basically, both approaches share the same base ".cu" file. In case a) we use P/Invoke to perform the computations; in case b) we use a wrapper built on top of the native code.
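For context, the cubin path in a) boils down to the CUDA driver API, which ManagedCuda wraps via P/Invoke. A minimal native sketch of that flow (error handling omitted; the file name "kernel.cubin", the kernel name, and the argument list are placeholders, not the actual repo code):

```cuda
#include <cuda.h>

// Load a precompiled .cubin, look up a kernel in it, and launch it.
// This is roughly what ManagedCuda does through DllImport calls.
int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;   cuModuleLoad(&mod, "kernel.cubin");      // placeholder file
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "proccess");

    int inputCount = 1024;                                   // placeholder value
    void* args[] = { /* device pointers and scalars, in kernel order */ &inputCount };
    // Equivalent of <<<inputCount, 1>>>: gridDim.x = inputCount, blockDim.x = 1
    cuLaunchKernel(fn, inputCount, 1, 1,  1, 1, 1,  0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```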
The results: while ManagedCuda executes the code in 60 ms, it takes about 1000 ms to execute the same code through the wrapper. The CPU is a few times slower than that (2x in Release, 5x in Debug). All results are verified, and the CPU approach is used as the baseline for result verification:
- CPU: 5300 ms
- C++/CLI: 1000 ms
- P/Invoke: 60 ms
So my question is: why does it take so much longer to execute the same code using approach b)?
I have profiled the native function a bit. If we comment out the main line that launches the kernel:
    proccess KERNEL_ARGS2(inputCount, 1) (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);
then b) executes the whole code in 7 ms. Setting aside the fact that no results are produced, this means there is nothing wrong with the pipeline or with how the code is built: the wrapper library, pointer initialization, memory allocation, and host/device memory copies together take only about 10 ms of the 1000 ms. The main problem must have something to do with how the CUDA code is executed, or with how some optimizations are applied by the linker?
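Instead of commenting out the launch, the kernel alone can be timed with CUDA events. Since kernel launches are asynchronous, where each side synchronizes can also skew a 60 ms vs 1000 ms comparison. A sketch, reusing the launch line above:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
proccess KERNEL_ARGS2(inputCount, 1) (d_output, d_outputCalc, d_in1, d_in2,
                                      d_in3, d_in4, inputCount, width, height);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // the launch is asynchronous; wait for completion

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

If the event-measured kernel time differs between the two builds, the problem is in how the kernel was compiled for each; if it is the same, the extra time is spent on the host side of the wrapper.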
The code demonstrating the problem is in this repo: https://github.com/pavlexander/gpu_tests
You are welcome to check out the code and try it yourself.
p.s. I have set the build configuration to "Release" for the C++ project so that the CUBIN file receives all optimizations. When running in Debug I noticed that even ManagedCuda performs much slower, which is expected. The optimized CUBIN file is 7 KB, compared to 77 KB before optimization, so I am sure the CUDA compiler works as expected. The problem must be in the linker or some related config…
p.p.s. Using CUDA 10.2, x64 builds, VS2019, .NET Core 3.1