why cudaGetDeviceProperties and cudaMallocPitch consume a lot of time

Enviroment :
Windows 10 64bit
Notebook Dell precision with Intel I7-2820
Nvidia Quadro 4000M driver 376.33
Cuda sdk 8.0

Using Nsight profiler with a sample test program that call cudaGetDeviceProperties the time statistics show an average execution time of 1.363 ms. Some functions from NPP library call implicitly cudaGetDeviceProperties before execution with an high speed penalty especially for the shortest ones.

I’ve another performance problem with cudaMallocPitch.
Using Nsight profiler I’ve measured the execution time of cudaMallocPitch and I’ve found that this time span a large range from 0.7 ms to 250 ms depending from the allocation dimension and from the allocation/deallocation history happened before. For example if I allocate 2 block each of 400 MB with or without a cudaFree of the first block between the two calls, I register an execution time of 250 ms vs 70 ms.
It appear very strange that the worst time happen when I call cudaMallocPitch, cudaFree and cudaMallocPitch again with the same allocation size.
Is this behavior normal?

Thanks

You don’t show code, but I assume these calls are the first CUDA API calls in your code. CUDA is stateful, and the state is initialized lazily, not until the first call to a CUDA API is made. So I suspect your CUDA function calls trigger the creation and initialization of a CUDA context, which takes some time.

To verify that this is the case, you can insert a call cudaFree(0) as the very first CUDA API call, which then triggers CUDA context creation.

Note that some CUDA functions may require acquisition of memory or other resources from the operating system, especially in a Windows WDDM driver environment, so in those cases CUDA is at the mercy of the operating system facilities involved. Highest CUDA performance under Windows is achieved with the TCC driver, however only certain GPUs are supported by it.

This thread may be of interest regarding npp:

https://devtalk.nvidia.com/default/topic/824289/npp-libray-fucntions-call-speed-issue-/

There were supposed to be some improvements in CUDA 8 vs. CUDA 7.5, so not directly applicable to your case. However various npp library issues are discussed there. Without more specifics about your test case, I don’t have anything else to mention.

Regarding the cudamallocpitch performance problem see thread

https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/

Thanks for the answers.
@txbob
I will open a new specific thread for NPP function overhead related to my problem.

@njuffa and @HannesF99
A warm up phase is present and so I think it isn’t the answer.
Here the code. Is partially derived from HannesF99 sample presented in https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/

#include <string>
#include <iostream>
#include <chrono>
#include <ctime>

#include <cuda.h>
#include <cuda_runtime_api.h>

class timer
{
private:
	std::chrono::time_point<std::chrono::high_resolution_clock> m_start;
	std::chrono::time_point<std::chrono::high_resolution_clock> m_stop;
	std::string m_msg;

public:
	timer() :
		m_start(std::chrono::high_resolution_clock::now())
	{}

	double timeElapsed()
	{
		auto end = std::chrono::high_resolution_clock::now();
		std::chrono::duration<double> elapsed_seconds = end - m_start;
		return elapsed_seconds.count();
	}
};


void testCudaMallocPitch(bool bFree)
{
	// force context creation
	cudaFree(0);

	// width and height of image 'k', in byte -> for an image with ~ 1 MB, 10 MB, 20 MB and 400 MB
	int width[4] = { 1000, 3000, 6000, 12000 };
	int height[4] = { 1000, 3200, 3200, 32000 };

	cudaEvent_t start, stop;
	void* ptr;
	size_t dummy2;

	// warm-up 
	cudaError_t er = cudaMallocPitch(&ptr, &dummy2, 1000, 1000);

	if (cudaSuccess != er)
		throw std::runtime_error("cudaMallocPitch");

	er = cudaFree(ptr);
	if (cudaSuccess != er)
		throw std::runtime_error("cudaFree");

	// events
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// now measure the runtime for a cudaMallocPitch of an image with ~ 1 MB, 20 MB and 400 MB
	// two measure are taken for each allocation size
	for (int k = 0; k < sizeof(width) / sizeof(width[0]); ++k)
	{
		std::cout << "Allocation of one image with " << (width[k] * height[k] / 1e6) << " MB\n";
		for (int jj = 0; jj < 3; ++jj)
		{
			timer StartCounter;

			void* bufferPtr = 0;
			size_t pitch = 0;

			cudaEventRecord(start, 0);

			er = cudaMallocPitch(&bufferPtr, &pitch, width[k], height[k]);

			cudaEventRecord(stop, 0);

			if (er != cudaSuccess)
				throw std::runtime_error("cudaMallocPitch");

			double osTime = StartCounter.timeElapsed();

			float cudaTimeInMs=0;
			cudaEventSynchronize(stop);
			cudaEventElapsedTime(&cudaTimeInMs, start, stop);

			if (bFree)
			{
				er = cudaFree(bufferPtr);
				if (cudaSuccess != er)
					throw std::runtime_error("cudaFree");
			}

			std::cout << "cuda time : " << cudaTimeInMs << " ms  ";
			std::cout << "os time : " << osTime * 1000 << " ms \n";
		}
	}

	cudaEventDestroy(stop);
	cudaEventDestroy(start);
}


void testcudaGetDeviceProperties()
{
	// force context creation
	cudaFree(0);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// warmup
	struct cudaDeviceProp 	prop;
	cudaError_t e = cudaGetDeviceProperties(&prop, 0);

	// now measure the runtime for a cudaGetDeviceProperties of an image with ~ 1 MB, 20 MB and 400 MB
	for (int k = 0; k < 10; ++k)
	{
		timer StartCounter;
		cudaEventRecord(start);

		cudaError_t e = cudaGetDeviceProperties(&prop, 0);

		cudaEventRecord(stop);
		cudaEventSynchronize(stop);

		double osTime = StartCounter.timeElapsed();

		float cudaTimeInMs;
		cudaEventElapsedTime(&cudaTimeInMs, start, stop);

		std::cout << "cuda time : " << cudaTimeInMs << " ms   ";
		std::cout << "os time : " << osTime * 1000 << " ms \n";
	}

	cudaEventDestroy(stop);
	cudaEventDestroy(start);
}

int main()
{
	std::cout << "\t cudaGetDeviceProperties\n\n";

	testcudaGetDeviceProperties();

	std::cout << "\n\n\t testCudaMallocPitch with cudaFree\n\n";

	testCudaMallocPitch(true);

	std::cout << "\n\n\t testCudaMallocPitch without cudaFree\n\n";

	testCudaMallocPitch(false);

	return 0;
}

The program output for my configuration is :

 cudaGetDeviceProperties

cuda time : 0.001632 ms os time : 0.999552 ms
cuda time : 0.001632 ms os time : 0.962961 ms
cuda time : 0.001632 ms os time : 0.97724 ms
cuda time : 0.001632 ms os time : 0.958945 ms
cuda time : 0.001632 ms os time : 0.958499 ms
cuda time : 0.001632 ms os time : 1.14145 ms
cuda time : 0.001728 ms os time : 2.98393 ms
cuda time : 0.001632 ms os time : 1.34582 ms
cuda time : 0.001664 ms os time : 1.05979 ms
cuda time : 0.001664 ms os time : 1.05533 ms

 testCudaMallocPitch with cudaFree

Allocation of one image with 1 MB
cuda time : 0.001632 ms os time : 1.14458 ms
cuda time : 0.001632 ms os time : 0.599731 ms
cuda time : 0.001632 ms os time : 0.589914 ms
Allocation of one image with 9.6 MB
cuda time : 0.001664 ms os time : 2.10486 ms
cuda time : 0.001632 ms os time : 5.35072 ms
cuda time : 0.001632 ms os time : 3.01918 ms
Allocation of one image with 19.2 MB
cuda time : 0.001632 ms os time : 4.22935 ms
cuda time : 0.001664 ms os time : 9.38463 ms
cuda time : 0.001632 ms os time : 5.33645 ms
Allocation of one image with 384 MB
cuda time : 0.00176 ms os time : 71.0601 ms
cuda time : 0.001632 ms os time : 98.4781 ms
cuda time : 0.001632 ms os time : 97.8061 ms

 testCudaMallocPitch without cudaFree

Allocation of one image with 1 MB
cuda time : 0.001664 ms os time : 0.86479 ms
cuda time : 0.001632 ms os time : 0.556001 ms
cuda time : 0.001728 ms os time : 0.564925 ms
Allocation of one image with 9.6 MB
cuda time : 0.001664 ms os time : 2.09504 ms
cuda time : 0.001632 ms os time : 2.10798 ms
cuda time : 0.001664 ms os time : 2.13431 ms
Allocation of one image with 19.2 MB
cuda time : 0.001632 ms os time : 4.01962 ms
cuda time : 0.00176 ms os time : 5.31993 ms
cuda time : 0.001632 ms os time : 3.94778 ms
Allocation of one image with 384 MB
cuda time : 0.001632 ms os time : 75.2377 ms
cuda time : 0.001632 ms os time : 71.7473 ms
cuda time : 0.001664 ms os time : 71.6317 ms

  1. cudaGetDeviceProperties average time measured with os high-precision timer is about 1ms and it is too long for this function.
  2. another problem arise : the same time measured using cuda events is very different (wrong).
  3. Nsight show that os time is the correct one. This is an extract from Nsight log function Duration(μs) cudaEventRecord 7.279 cudaGetDeviceProperties 1036.729 cudaEventRecord 6.644 cudaEventSynchronize 48.578
  4. cudaMallocPitch time depend from allocation size and from allocation history. Timing is different if a cudaFree is executed after each cudaMallocPitch or not
  5. same problem using cudaEvent for timing. Nsight show again that os time is the right one

I forgot something using cuda events?!?!?

Thanks

results.xlsx (20.3 KB)

cudaEvents on windows are not reliable for timing host activity. I wouldn’t recommend it.

I have been using the timer code for two decades without issues, it provides microsecond resolution.

I have never used NPP, so I don’t have insights into NPP-specific problems. However, I do know that Windows’s WDDM driver model is responsible for all kind of weird performance artifacts, and would advise staying away from WDDM if at all possible (use TCC on Windows, or use Linux) to maintain sanity. For that reason I stayed on XP as long as I could, obviously that is not an option a this time.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

@njuffa
Thanks for your code. My timer class is a simple wrapper around std library .
I think it take the same results.

@txbob
What do you mean exactly with “host activity”. cudaEvent can’t be used to measure cuda runtime api functions like cudaMemcpy or cudaMallocPitch or cudaGetDeviceProperties?

Yes, that’s what I mean. That is just an observation based on my own experience.

But doesn’t your data also support it?

Greg Smith (whose judgement I trust in these matters) has pointed out more than once that cudaEvent should not be used for timing host activity. An example is his comment in this Stackoverflow thread:

http://stackoverflow.com/questions/31046158/cudaeventrecord-does-not-time-correctly-on-visual-studio-cpu-code
http://stackoverflow.com/questions/13676102/strategies-for-timing-cuda-kernels-pros-and-cons

@txbob
Ok cudaEvents can’t be used to measure host-only functions. I can realize that cudaDeviceProperties is one of that but why cudaMemcpy can be correctly profiled (as CUDA samples show) and cudaMallocPitch can’t? I imagine that cudaMallocPitch executes on GPU not on CPU. I’m wrong?

@njuffa
thank for links they are very useful

Now coming back to my original problem why is cudaMallocPitch so slow? Is a problem of my enviromment?
My quadro 4000M can’t be switched in TCC mode perhaps because optimus technology doesn’t allow that driver mode.

It’s likely that cudaMallocPitch is slow for the same reason that it is slow for HannesF99.

If you want to avoid these issues, get a GPU in TCC mode or else a GPU running under linux.

It’s acknowledged that Q4000M can’t be placed in TCC mode. If you want to avoid these issues, Q4000M is not a good GPU choice on windows.

WDDM imposes a great many limits on behavior of the CUDA GPU. This is one of them.

I would be extremely surprised if there is any code executed on the GPU for a call to cudaMallocPitch(), or any other kind of GPU memory allocation. If there were, corresponding kernel executions would show up in the profiler. At most, a tiny amount of GPU memory could be manipulated (by host code) that represent control structures of a memory allocator.

But I would think (but don’t have proof) that in general, and with WDDM in particular, all control structures for the GPU memory allocator are actually kept on the host side, and nothing is kept on the GPU. Allocating and freeing GPU memory requires inspecting and modifying these control structures.

We can conclude that with high probability a call to cudaMallocPitch() is 100% host-side activity.

I would think not all hope is lost. With the first generation WDDM introduced with Windows 7, NVIDIA engineers were able to work around the performance limitations inherent in this driver model in various clever ways. They might yet find better ways to do this for the second-generation WDDM used by Windows 10. The basic design issue seems to be that Microsoft designed this driver model to provide the most convenient abstraction and maximum isolation from the operating system’s perspective, performance be damned.

I hope also that the clever NVIDIA engineers might find ways to work around the performance limitations by the Windows 10 WDDM driver model, especially with regarding to the very high runtime of the cudaMallocXXX routines for larger allocations.

In my opinion, it’s too easy to just say ‘get a GPU in TCC mode or switch to Linux’.

Regarding porting to Linux, when you have a big (lots of code, lots of libraries, …) commercial Windows application in media & broadcasting application field, the porting effort itself can be huge. But porting is not enough, you have to test and qualify it for various CUDA GPUS of multiple generations, etc. etc. Furthermore, your customers will also not easily switch to Linux only because you recommend it for your application - they might have installed also other applications which do not have a Linux variant, they might not be comfortable with Linux etc.

Regarding ‘get a GPU in TCC mode’, there are also a few things which can make that difficult in practice, it least in media & broadcasting application field (a core business area of NVIDIA I suppose). First is that recent Tesla cards are only passively cooled, so you cannot put them into a standard graphics workstation like a HP Z800. For example, I don’t know any active cooled Tesla card of the ‘Pascal’ generation and I doubt that there will be ever one. The big Quadro cards (4000 / 5000 / 6000) or Titan cards could be put into TCC mode, which is fine. BUT in our application field it’s quite often the case that one wants to use such a beefy card also for display, rendering, color grading in other applications he uses, like Nuke, Fusion, Resolve, Nucoda, CAD programs etc. etc… So it’s often practically impossible to switch the card to TCC mode. So that makes it a “bit” of a disappointing experience for a customer when he buys e.g. a Quadro M6000 for several thousand dollars and then has to notice that the CUDA acceleration is not getting the expected performance improvement.

I understand your pain. I have extensive development experience with both the Linux and Windows environments, and I am generally OS agnostic as a developer. However, from time to time, my frustration with certain aspects of Windows relevant to programmer producivity and application performance gets the better of me … which is when I start actively recommending Linux. Linux has its own issues, but efficiency is usually not one of them, and when you stick to a sane distro (e.g. no “leaping lemurs”) problems can be minimized.

In the bigger picture I believe it is important that criticism be directed at the most appropriate party in order to achieve the desired effect. It was Microsoft who switched from the old XP driver model, which was easily as efficient as the Linux driver model, to a much less efficient WDDM model. My understanding is that one motivator was that the old model gave the OS insufficient protection from bugs in graphics drivers, another one was that they desired virtualization of all memory, including that of the GPU. That is all nice and good, but what they apparently failed to take into account is that many graphics tasks (and with the advent of CUDA, compute tasks) are performance sensitive, one can’t just go for convenient abstraction layers with a graphics driver and leave it at that. That is an issue that severely affected the original Windows NT, and one would have hoped Microsoft learned the lesson, but apparently not (the institutional memory might be too short).

The NVIDIA recommended solution for the case presented by HannesF99 is a maximus workstation. This is widely documented.

Such workstations have 2 GPUs, one of which is in TCC mode, for compute workloads, and one of which is in WDDM mode for display purposes.

I don’t think anybody ever recommends “Put it in TCC mode” with the concept that you will then have no display device in windows. Of course you need another GPU in that case. That is implicit in the recommendation.

TCC for compute is great when pairing up a powerful compute GPU with a simple GPU <= $100 for driving the GUI. But in HannesF99’s use case, it seems that both powerful compute and powerful graphics are required, which could lead to a pricey solution in the form of a Maximus workstation.

By the way, it seems NVIDIA has widened the original Maximus configurations into a more generic multi-GPU approach: http://www.nvidia.com/object/multi-gpu-technology.html. The FAQ linked there states: “Maximus remains a subset of Multi-GPU technology.”

@njuffa: Exaclty, both powerful compute and graphics are required quite often in media & broadcast application field. Typically, Quadro GPUs are employed.

I am sure NVIDIA would love to sell not one but two Quadro P5000 run in a Maximus-type configuration, so one of them can be dedicated to compute :-)

But seriously, if the market media & broadcast market is not overly price sensitive, and you must stay with Windows, that seems worth exploring to achieve the best possible performance.