Establishing GPU processor and memory usage

Hi

I'm looking for a way to assess the current GPU processor and memory usage/allocation when using a CUDA app (I am interested in Windows XP and Vista).

I've tried PerfSDK, which works just fine for getting the GPU usage, except that it seems quite developer-focused: I was unable to get the app to work without installing PerfSDK on each PC I want the CUDA app to run on* (I don't want to tell the user to go and install PerfSDK just to use this app).

So my question is: are there any parts of the CUDA API, or other techniques I'm missing (I am quite inexperienced in graphics work), that I can use to gather these two metrics?

I basically am looking for the ability to:

  1. Provide some GPU status to help the user decide whether it's worth trying to run another instance of the CUDA app.
  2. Warn the user, before launching another instance, if the GPU or its memory is getting too full, should they try despite the info in 1.

Regards
Matt

  • I tried copying the DLL mentioned in the header file to the app directory and making some changes to the registry to try to enable metric collection, but it wouldn't play nicely. I also couldn't get memory allocation assessment to work at all.

cuMemGetInfo() will return the amount of free memory. See: http://forums.nvidia.com/index.php?showtopic=67856

I have not heard of a function that reports GPU utilization (i.e. like Task Manager).

Hi, thanks for the reply, very helpful indeed.

I have one issue though, it works perfectly in XP but in Vista all calls to cuMemGetInfo are returning the same values after allocations and frees.

The other thing I was surprised to note was how much memory is used before really doing anything in CUDA: around 31 MB on my system is consumed when I just allocate a very small buffer. Thereafter memory is used up roughly as I'd expect (though allocations seem to be padded quite generously).

I was a bit surprised to see it report a different total bytes figure in XP and Vista too!

I am using drivers 178.08 in both XP and Vista.

I've included the code below in case anybody can spot something silly I'm doing (I know I'm mixing the Driver and Runtime APIs, but I gather from [post="0"]here[/post] this is OK when doing this kind of thing).

[codebox]// memory_stats_test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#include <stdio.h>

bool cuda_get_memory_stats(unsigned int* free_mem, unsigned int* total_mem)
{
	CUresult cu_status;
	CUcontext temporary_context;
	CUdevice temporary_device;

	*total_mem = *free_mem = 0;

	cu_status = cuInit(0);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuDeviceGet(&temporary_device, 0);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuCtxCreate(&temporary_context, 0, temporary_device);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuMemGetInfo(free_mem, total_mem);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuCtxDetach(temporary_context);
	if(cu_status != CUDA_SUCCESS) return false;

	return true;
}

void show_result(unsigned int free_mem, unsigned int total_mem, bool ret)
{
	if(ret)
		printf("Total bytes: %u, Free bytes: %u\n", total_mem, free_mem);
	else
		printf("Encountered error\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
	unsigned int free_mem_bytes = 0;
	unsigned int total_mem_bytes = 0;
	cudaError_t status;
	bool ret;
	float* tmp_mem1;
	float* tmp_mem2;
	float* tmp_mem3;
	float* tmp_mem4;

	// get initial memory stats
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// allocate, re-checking memory stats after each allocation
	status = cudaMalloc((void**)&tmp_mem1, 128 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem2, 10000000 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem3, 10000000 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem4, 8194 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// free memory, re-checking stats after each free
	status = cudaFree(tmp_mem4);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem3);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem2);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem1);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	return 0;
}[/codebox]

In XP it returns on my 8600GT with 256MB:

[i]Total bytes: 268107776, Free bytes: 222431744
Total bytes: 268107776, Free bytes: 189007872
Total bytes: 268107776, Free bytes: 148965376
Total bytes: 268107776, Free bytes: 108922880
Total bytes: 268107776, Free bytes: 108857344
Total bytes: 268107776, Free bytes: 108922880
Total bytes: 268107776, Free bytes: 148965376
Total bytes: 268107776, Free bytes: 189007872
Total bytes: 268107776, Free bytes: 189007872[/i]

In Vista it returns:

[i]Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696[/i]

I ran your code on Vista and got the same result. But then I tried to put a cuMemAlloc() in there instead of a cudaMalloc(), and it looks like that does subtract from free memory on Vista.

I don’t know much about running multiple contexts, but I think this may be Vista’s GPU virtualization at work, allowing each application to allocate from the GPU as much memory as it wants, and then swapping it in and off the video card on demand when the applications’ kernels (or shaders) actually run.

Does Vista actually do swapping like that? I didn’t think it did… but Vista has its own magical resource manager that manages all memory allocations and such for you instead of our driver, so that’s going to be part of the problem.

Now I have to investigate this cudaMalloc versus cuMemAlloc, because that sounds pretty wacky…

Good idea Alex; I verified this also works on my system when swapping over to cuMemAlloc. Code is included below for anybody interested.

If it's a Vista GPU virtualisation issue, I guess it makes sense that mixing the Driver API and Runtime API creates different GPU contexts and could affect it. After a bit of a look around with Google, and having read a Microsoft paper on the new graphics system (http://www.microsoft.com/whdc/device/display/WDDM_VA.mspx), I guess Vista advertises to each GPU context the whole of the video card's memory, regardless of how much any other context has already allocated in its own virtualised space. I get the impression any GPU application has access to the sum total of the memory on the GPU, so some kind of swapping must be done.

This has some quite substantial performance implications as we approach full GPU memory loading, since everything gets dragged over the bus, right? I guess that's always the way with virtual memory management. I wonder if it can be disabled; that may be a silly thought.

It would appear that to get at this info in the Runtime API using Vista is a non-starter for now then. If anybody knows any other methods, please shout up.

[codebox]// memory_stats_test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "cuda.h"
#include <stdio.h>

bool cuda_get_memory_stats(unsigned int* free_mem, unsigned int* total_mem)
{
	*total_mem = *free_mem = 0;
	CUresult cu_status = cuMemGetInfo(free_mem, total_mem);
	if(cu_status != CUDA_SUCCESS) return false;
	return true;
}

void show_result(unsigned int free_mem, unsigned int total_mem, bool ret)
{
	if(ret)
		printf("Total bytes: %u, Free bytes: %u\n", total_mem, free_mem);
	else
		printf("Encountered error\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
	CUresult cu_status;
	CUcontext temporary_context;
	CUdevice temporary_device;

	cu_status = cuInit(0);
	if(cu_status != CUDA_SUCCESS) return -1;

	cu_status = cuDeviceGet(&temporary_device, 0);
	if(cu_status != CUDA_SUCCESS) return -1;

	cu_status = cuCtxCreate(&temporary_context, 0, temporary_device);
	if(cu_status != CUDA_SUCCESS) return -1;

	// init memory stats to zeros
	unsigned int free_mem_bytes = 0;
	unsigned int total_mem_bytes = 0;

	// --- Attempt using Driver API ---
	bool ret;
	CUdeviceptr tmp_mem1;
	CUdeviceptr tmp_mem2;
	CUdeviceptr tmp_mem3;
	CUdeviceptr tmp_mem4;

	// get initial memory stats
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// allocate, re-checking memory stats after each allocation
	cu_status = cuMemAlloc(&tmp_mem1, 128 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem2, 10000000 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem3, 10000000 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem4, 8194 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// free memory, re-checking stats after each free
	cu_status = cuMemFree(tmp_mem4);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem3);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem2);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem1);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuCtxDetach(temporary_context);
	if(cu_status != CUDA_SUCCESS) return -1;

	return 0;
}[/codebox]

If anybody knows how to get at GPU usage (maybe via DirectX, who knows) I'd still be very interested! :thumbup:

Currently there's no way to get GPU usage; there's an NVAPI call, but that doesn't seem to work with CUDA.

Thanks for confirming; could it go onto the API features wish-list please (along with a cudaGetMemInfo)? :whistling:

cudaGetMemInfo I can understand, but what’s the advantage of a call to check the load (as opposed to a “give me the best GPU according to these parameters, which includes number of active contexts” call)?

I am beginning to write a CUDA VST audio plugin (see http://en.wikipedia.org/wiki/Virtual_Studio_Technology) and I suspect users may want to know how close they are to the limits of their GPU, to help them decide whether to use a CPU- or GPU-based plugin for the next effect.

For instance, if they know they're nearing the limit (say it's displayed in a window somewhere showing live GPU and memory usage), they may choose not to load a plugin known to be computationally expensive or memory-hungry, and pick another instead, perhaps one that is CPU-bound or a GPU plugin with lighter demands. There are similar dedicated devices on the market (such as www.uad-2.com) that provide exactly such a reading. I know from experience that, workflow-wise, it's very useful not to be interrupted as the PC judders to a halt because you've loaded up an external card too much and have to decide which plugins to bypass to get everything running smoothly again. Without visual feedback you tend to go gung-ho, frivolously using up an external resource, and wind up running out of it too quickly.

This is just my 2c, but in my experience anything more than one context trying to run kernels on a GPU simultaneously slows things down dramatically. Specifically, two simultaneous instances of my app each run at 1/4 the speed of one instance.

Thanks for the pointer; I am pretty new/naive at this stage. I've done some tests using perhaps 4-6 instances (of some pretty trivial code, I admit) but it's not been too disheartening yet :) I figure that the processing I want to do isn't going to be maddeningly challenging for the GPU, so I'll probably end up limited by memory bandwidth rather than kernel execution speed. I'm not saying this is definitely going to work very well, but it's good fun trying and I think it just might fly.

So if it's not doing very intensive calculations (and perhaps it's even less efficient on the GPU), why not just do it on the CPU, you may ask. Well, so the CPU can be used for some other plugin, I would say :) When you max out your CPU and the GPU is just sat there drawing Aero glass effects, it's frustrating not being able to use it for something more focused!

My situation is simple because each instance pegs the GPU with kernel calls 100% of the time. Your situation is more complicated, I guess. You've got lots of plugins all running concurrently? Are they "bursty" (i.e., run for a little bit and then pause, waiting on some condition for more data to process)? Maybe that is why your 4-6 kernel test didn't show any drastic slowdown: they probably time-shared the GPU nicely. It may be feasible to code your own % load counter in this case: just have each plugin time how long it spends running CUDA calls vs. how long it spends waiting, although that will probably only work when there is a single plugin running. Now I can see why you want a GPU usage monitor :)

You've got a point on data movement, though; that can be a performance killer. Do your plugins ever operate in chains (output of one becomes the input of another)? In that case you can avoid the host<->device copies with a little extra bookkeeping in the data structure (although maybe you already do this).

I don’t know much about VST plugins, but if they live in the same process (they may not, I believe Photoshop plugins all have their own address space) it’s definitely worth it to just use the thread migration API. It’s way faster than switching contexts.

Also keep in mind that all kernel launches are serialized right now, even if you could fit multiple kernels on the GPU at once (which may be true given the sizes of data you’d probably want to use).

Lots of concurrent plugins? Ohh yes! Bursty? Yep, and sometimes you don’t even know how big your bursts are going to be, it can range from a few tens of samples to thousands. Chains of plugins, well yes, but you would typically let the host manage those chains for more flexibility (though I could envisage a ‘chainer’ of GPU plugins).

Although all VST hosts are different, they tend to have one thread for audio processing from which all plugins are called with a batch of samples, and another from which plugins are configured (set up, parameters changed etc). This presents context difficulties right away, I thought of two solutions:

  1. Go for the Driver API and push contexts around as needed to copy/allocate memory and for the actual processing;
    or
  2. Introduce a new thread and use it for the Runtime API when doing anything remotely related to CUDA calls or memory allocations/transfers (clearly I could do this with the Driver API too).

Since I'm new to CUDA, I chose number 2 initially, but I'll try number 1 later on.
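For anyone weighing option 1, a rough sketch of the context-floating approach with the Driver API (assuming the cuCtxPopCurrent/cuCtxPushCurrent calls; some external locking is needed so only one thread holds the context at a time):

```c
CUcontext ctx;

void init_once(void)              /* call once, from any thread */
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);    /* context becomes current on this thread... */
    cuCtxPopCurrent(&ctx);        /* ...so detach it for later claimants */
}

void do_cuda_work(void)           /* call from audio OR config thread */
{
    cuCtxPushCurrent(ctx);        /* attach the floating context here */
    /* cuMemAlloc / cuMemcpyHtoD / kernel launches ... */
    cuCtxPopCurrent(&ctx);        /* release it for the next caller */
}
```

This keeps a single context (and hence a single view of cuMemGetInfo) shared across threads, at the cost of serialising access to it.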

Using a thread plus a double-buffering system helps with the burstiness: I can execute CUDA transfers/operations in a background thread while outputting/collecting other samples in the main audio engine thread (trading off some latency), so the audio engine thread doesn't get stalled waiting for memory transfers or kernel executions (so long as it can all be processed by the time the buffers need swapping).

I am hoping my kernels will run pretty quickly (as I expect them to be quite simple), so having many small, serialised kernels hopefully won't cause too many issues. If it does, I'll just increase the processing buffer latencies to give everything more time to complete before things start blocking on data that isn't ready. As CUDA cards get quicker, presumably that latency can come down with improved bandwidth and lower kernel execution times (and I recognise my 8600GT is getting a bit long in the tooth anyway, so if it works acceptably on that it should work on anything decent).

Regarding the % load counter, I guess I could get it per instance using the timing scheme you mentioned, but then I'd need to start summing across my plugin threads to give a total, and given Vista uses the GPU for Aero too, it would be ideal to have that factored in so I'm not reporting false statistics. I used NVPerfKit and it gave me exactly the GPU usage info I wanted, but it didn't seem possible to get it running without installing the whole SDK, which isn't so user-friendly.

So yeh, a Runtime API memory and GPU load metric would be really useful!

I think the best strategy would be to code all your transformations into one kernel, and partition the GPU to run multiple plugins simultaneously. I don’t think any single plugin will really be able to take advantage of the whole GPU and so this scheme will be much more efficient. Plus, you’ll be able to track the GPU load easily. I don’t think this will be difficult to implement, either. In some ways, simpler. You could, for example, hardcode each plugin to execute as a block with 64 threads and 4KB smem. You launch one block per plugin. If one plugin takes longer, has more work to do, etc, the GPU scheduler handles the load-balancing. A data structure records what each block should be doing, and at the top level you just have switch(work[blockIdx.x]){}.
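A minimal sketch of that per-block dispatch (plugin IDs and effect bodies are made up for illustration):

```c
// One block per plugin instance; work[] says which effect each block runs.
enum { PLUGIN_GAIN, PLUGIN_DELAY };

__global__ void run_all_plugins(const int* work, float** in, float** out, int n)
{
    switch (work[blockIdx.x]) {
    case PLUGIN_GAIN:
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[blockIdx.x][i] = in[blockIdx.x][i] * 0.5f;
        break;
    case PLUGIN_DELAY:
        /* ... other effects dispatch the same way ... */
        break;
    }
}

// Launched once per audio buffer, e.g.:
//   run_all_plugins<<<num_plugins, 64>>>(work, in, out, samples_per_buffer);
```

One launch per buffer means one context and one serialised kernel, with the GPU's own block scheduler doing the load-balancing between plugin instances.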

Do you mean like setting up a daemon / windows service on the CPU such that one single entity communicates with the GPU and have every single plugin locate the daemon and interact with that, rather than the GPU directly itself? I can see how such an approach would simplify management of the GPU and reduce unnecessary contention on the resource, though I’ve never done such a thing. Or perhaps I missed your point entirely…?

Yup, exactly. Daemons are actually really easy. I’m figuring out how to do virtually the exact same thing we’re talking about right now. (Minus the musical aspect.) Actually, I kind of like the market for high-end audio. (There’s something else I’m doing that does involve music.) Maybe we could work together?

Sounds like an interesting idea; how far have you got with implementing the daemon technique and are you noticing improvements?

If you’re working with daemons as services, I wonder if you’re having the same issues as these people:

http://forums.nvidia.com/index.php?showtopic=77197

Matt