Problem: Managed call through C++/CLI performs many times slower than P/Invoke, even though both use the same source

I have been studying different approaches to executing code on the GPU from a managed environment. During these studies I have noticed very strange behavior and am searching for a way to fix it.

In a .NET solution I have a console app that runs the computations using:

  • a) the ManagedCuda library. The library takes the "cubin" file as one of its parameters and creates a kernel from the contents of that file. Then, using its managed functions, we can allocate the memory needed for on-device execution, etc. Internally this library uses the P/Invoke (DllImport) approach to invoke unmanaged CUDA functions.
  • b) a C++ static library that contains the ".cu" file (the same file that produces the CUBIN for a)). On top of it there is a dynamic C++/CLR wrapper library that exposes the unmanaged C++ function to the managed environment; this wrapper is then used by the console app (a rough sketch of such a wrapper is shown below).
  • c) a library that executes the same logic on the CPU.

So basically, both approaches share the same base “.cu” file. In case a) we use P/Invoke to run the computations; in case b) we use a wrapper built on top of the native code.
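For reference, here is a minimal sketch of what such a C++/CLI wrapper typically looks like. The class, namespace, and header names are placeholders, not the actual code from the repo; the idea is simply to pin the managed arrays and forward raw pointers to the native function:

// Hypothetical C++/CLI wrapper sketch (names are placeholders, not the repo's actual types).
// The managed ref class pins the .NET arrays and forwards the raw pointers to the
// native CudaProccess function exported by the static library.
#include "CudaLib.h" // assumed header declaring the native CudaProccess

using namespace System;

namespace GpuWrapper
{
	public ref class CudaRunner
	{
	public:
		static void Process(array<Byte>^ output, array<double>^ outputCalc,
			array<int>^ in1, array<int>^ in2,
			array<double>^ in3, array<Byte>^ in4,
			int inputCount, int width, int height)
		{
			// pin the managed arrays so the GC cannot move them while native code runs
			pin_ptr<Byte> pOutput = &output[0];
			pin_ptr<double> pOutputCalc = &outputCalc[0];
			pin_ptr<int> pIn1 = &in1[0];
			pin_ptr<int> pIn2 = &in2[0];
			pin_ptr<double> pIn3 = &in3[0];
			pin_ptr<Byte> pIn4 = &in4[0];

			CudaProccess(pOutput, output->Length, pOutputCalc, outputCalc->Length,
				pIn1, in1->Length, pIn2, in2->Length,
				pIn3, in3->Length, pIn4, in4->Length,
				inputCount, width, height);
		}
	};
}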

The results: while managedCuda executes the code in 60 ms, it takes about 1000 ms to execute the same code through the wrapper. The CPU version is several times slower than that (x2 in Release, x5 in Debug). All results are verified; the CPU approach is used as the baseline for verifying the results:

  • CPU: 5300ms
  • C++/CLR: 1000ms
  • P/Invoke: 60ms

So my question is: why does it take so much longer to execute the same code using approach b)?

I have profiled the native function a bit. If we comment out the main line that launches the execution:

proccess KERNEL_ARGS2(inputCount, 1) (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);

then b) executes the whole code in 7 ms. If we set aside the fact that no results are produced, this suggests there is nothing wrong with the pipeline or with how the code is built: the wrapper library, pointer initialization, memory allocation, and host/device copies together take only about 10 ms out of the 1000 ms. The main problem must have something to do with how the CUDA code is executed, or with how some optimizations are applied by the linker?
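Rather than commenting the launch out, one way to confirm where the time actually goes is to time the kernel in isolation with CUDA events. A minimal sketch (not in the repo), assuming the same d_* buffers as in the launch line above:

// Sketch (not in the repo): time the kernel in isolation with CUDA events.
// Kernel launches are asynchronous, so a host-side timer only sees the real cost
// at the next synchronizing call (e.g. the cudaMemcpy back to the host).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
proccess KERNEL_ARGS2(inputCount, 1) (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);
cudaEventRecord(stop);
cudaEventSynchronize(stop); // wait until the kernel has actually finished

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);
printf("kernel time: %.3f ms\n", kernelMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);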

My proofs are in this repo: https://github.com/pavlexander/gpu_tests
You are welcome to check out the code and try it yourself.

p.s. I have set the build configuration to “Release” for the C++ project so that the CUBIN file receives all optimizations. When running in Debug I noticed that even managedCuda performs much slower, which is expected. The CUBIN file is 7 KB when optimized, compared to 77 KB before optimizations, so I am sure the CUDA compiler works as expected. The problem must be in the linker or some related config…

p.p.s. using CUDA 10.2, x64 builds, VS2019, .NET Core 3.1

Milliseconds or microseconds? Does measuring the kernel execution time inadvertently include CUDA context initialization or copy overhead by any chance?

The measurements are in ms, i.e. milliseconds, so 1000 ms = 1 s.

Regarding the second question: yes, the approach b) measurements include context initialization, copying to/from the device, and allocation; the 1000 ms contains all of it. However, this overhead only takes about 10 ms out of the total 1000 ms execution time. I have confirmed this by excluding the main execute function. See my comments below:

I just noticed another strange behavior: on a much slower PC, no results are produced when the data set is large.

To elaborate: I have a desktop PC with a 1080 Ti. When the DataGenerator.Height value is set to 5_000, it works fine. On a laptop, however, the kernel produces no results. If I set a lower value (500), it starts working again… Could there be something wrong with the code, or is this a hardware limitation? Strangely, managedCuda works just fine even with the larger dataset on the slow PC, but the wrapper does not. Attaching the kernel source. Could somebody verify that it contains no errors?

#include <cuda.h> 
#include "cuda_runtime.h"
#include <device_launch_parameters.h> 
#include <texture_fetch_functions.h> 
#include <builtin_types.h> 
#include <vector_functions.h>

#include <stdio.h>
#include <iostream>

#ifdef __CUDACC__
#define KERNEL_ARGS2(grid, block) <<< grid, block >>>
#else
#define KERNEL_ARGS2(grid, block)
#endif

extern "C" // has to be extern, so that managed cuda could see it
{
	//kernel code
	__global__ void proccess(
		unsigned char* __restrict__ output,
		double* __restrict__ outputCalc,

		const int* __restrict__ in1,
		const int* __restrict__ in2,
		const double* __restrict__ in3,
		const unsigned char* __restrict__  in4,

		const int inputCount,
		const int width,
		const int height
	)
	{
		//int index = threadIdx.x;
		int index = blockIdx.x * blockDim.x + threadIdx.x;

		if (index >= inputCount) {
			return;
		}

		bool isTrue = false;
		int varA = in1[index];
		int varB = in2[index];

		double calculatable = 0;
		bool isLastFirstCondition = false;
		for (int row = 0; row < height; row++)
		{
			if (isTrue)
			{
				int idx = width * row + varA;

				if (!in4[idx]) {
					continue;
				}

				calculatable = calculatable + in3[row];
				isTrue = false;

				isLastFirstCondition = true;
			}
			else
			{
				int idx = width * row + varB;

				if (!in4[idx]) {
					continue;
				}

				calculatable = calculatable - in3[row];
				isTrue = true;

				isLastFirstCondition = false;
			}
		}

		output[index] = isLastFirstCondition;
		outputCalc[index] = calculatable;
	}
}

void CudaProccess(
	unsigned char* output,
	const int output_size,
	double* outputCalc,
	const int outputCalc_size,

	const int* in1,
	const int in1_size,
	const int* in2,
	const int in2_size,
	const double* in3,
	const int in3_size,
	const unsigned char* in4,
	const int in4_size,

	const int inputCount,
	const int width,
	const int height
) {
	// init dev variables
	unsigned char* d_output;
	double* d_outputCalc;
	int* d_in1;
	int* d_in2;
	double* d_in3;
	unsigned char* d_in4;

	// calculate native total sizes
	int output_totalSize = output_size * sizeof(unsigned char);
	int outputCalc_totalSize = outputCalc_size * sizeof(double);
	int in1_totalSize = in1_size * sizeof(int);
	int in2_totalSize = in2_size * sizeof(int);
	int in3_totalSize = in3_size * sizeof(double); // TBD: issue?
	int in4_totalSize = in4_size * sizeof(unsigned char);

	// allocate memory for device variables
	cudaMalloc((void**)&d_output, output_totalSize);
	cudaMalloc((void**)&d_outputCalc, outputCalc_totalSize);
	cudaMalloc((void**)&d_in1, in1_totalSize);
	cudaMalloc((void**)&d_in2, in2_totalSize);
	cudaMalloc((void**)&d_in3, in3_totalSize);
	cudaMalloc((void**)&d_in4, in4_totalSize);

	// write host -> device
	cudaMemcpy(d_in1, in1, in1_totalSize, cudaMemcpyHostToDevice);
	cudaMemcpy(d_in2, in2, in2_totalSize, cudaMemcpyHostToDevice);
	cudaMemcpy(d_in3, in3, in3_totalSize, cudaMemcpyHostToDevice);
	cudaMemcpy(d_in4, in4, in4_totalSize, cudaMemcpyHostToDevice);

	proccess KERNEL_ARGS2(inputCount, 1) (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);
	//proccess << <inputCount, 1 >> > (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);

	// Copy output array from GPU back to CPU.
	cudaMemcpy(output, d_output, output_totalSize, cudaMemcpyDeviceToHost);
	cudaMemcpy(outputCalc, d_outputCalc, outputCalc_totalSize, cudaMemcpyDeviceToHost);

	// Free up the arrays on the GPU.
	cudaFree(d_output);
	cudaFree(d_outputCalc);
	cudaFree(d_in1);
	cudaFree(d_in2);
	cudaFree(d_in3);
	cudaFree(d_in4);
}

EDIT: I have figured out why the program produces no results on large datasets. The kernel throws an error, but since we never check for errors, the program just assumes there are no results. I have added:

// check for error
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
{
	// print the CUDA error message and exit
	printf("CUDA error: %s\n", cudaGetErrorString(error));
	exit(-1);
}

Apparently the error is: “the launch timed out and was terminated”, i.e. “cudaErrorLaunchTimeout” (CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT?). The default timeout value is 2 seconds (seen in the NSight monitor). Since the laptop is a very slow computing device, that is not enough time for the kernel to finish processing; that is why the 1080 Ti PC works but the laptop doesn't. Regardless, there is still a problem with the slow execution time in the managed environment (approach b)). I would appreciate it if someone checked the code for obvious mistakes…
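To check whether a given GPU is subject to this watchdog at all, the timeout attribute can be queried at runtime. A small sketch using the runtime API (not part of the repo):

// Sketch (not in the repo): check whether the device enforces the display watchdog
// that terminates long-running kernels with cudaErrorLaunchTimeout.
int device = 0;
cudaGetDevice(&device);

int timeoutEnabled = 0;
cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, device);

printf("kernel execution timeout enabled: %s\n",
	timeoutEnabled ? "yes (kernels can be killed by the watchdog)" : "no");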

I have created a plain C++ application that includes the CUDA kernel directly. No middlemen, no C#, no wrappers, nothing… I had to migrate some C# code to accomplish that. And the result: the bare-bones C++ is about as slow as the C++/CLR wrapper!

So the conclusion is that the wrapper has nothing to do with the slowness… I will have to dig into the linker config or something else. Sorry for the confusion.

I wish I could close this topic here. I will have to create another one with more precise and up-to-date information, since the current subject and description are misleading. Is there any way I can remove the topic?

Once again, sorry for that.

Your measurement methodology appears to be flawed, because CUDA context initialization is performed in a lazy manner.
I.e. if you comment out the CUDA calls from your code, no context initialization is performed either.
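One simple way to exclude the lazy initialization from the measurement is to force the context to be created before the timer starts, for example (a minimal sketch, not from the repo):

// Sketch (not in the repo): force CUDA context creation before the timed region,
// so the one-time initialization cost is not attributed to the kernel.
cudaSetDevice(0);
cudaFree(0);              // harmless call that triggers lazy context creation
cudaDeviceSynchronize();

// ... start the host timer here, then call CudaProccess(...) as before ...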

This supports njuffa’s conclusion.

Thank you for the input. Last night I finally found the issue, after countless sleepless nights! It turns out that instead of launching hundreds of parallel threads, I was only launching one thread per block…

so the fix is to change this:

proccess << <inputCount, 1 >> > (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);

to this:

int grid = (inputCount + 255) / 256;
int block = 256;
proccess << <grid, block >> > (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);

It ended up being a code error… Now, with the wrapper, I get almost the same execution time as with the managedCuda approach. I can't believe the answer was this close the whole time.
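For reference, the same rounding can be factored into a small helper (the helper name below is made up, not from the repo). The existing `if (index >= inputCount) return;` guard in the kernel already keeps the rounded-up grid safe:

// Hypothetical helper (not in the repo): 1-D launch configuration with one thread
// per input element, rounding the grid size up so all inputCount items are covered.
inline int gridSizeFor(int elementCount, int threadsPerBlock = 256)
{
	return (elementCount + threadsPerBlock - 1) / threadsPerBlock;
}

// usage:
// int block = 256;
// proccess << <gridSizeFor(inputCount, block), block >> > (d_output, d_outputCalc, d_in1, d_in2, d_in3, d_in4, inputCount, width, height);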