Establishing GPU processor and memory usage

Hi

I'm looking for a way to assess the current GPU processor and memory usage/allocation when using a CUDA app (I am interested in Windows XP and Vista).

I've tried PerfSDK, which works just fine for getting the GPU usage, except that it seems quite developer-focused: I was unable to get the app to work without installing PerfSDK on each PC I want the CUDA app to run on* (I don't want to tell the user to go and install PerfSDK just to use this app).

So my question is: are there any parts of the CUDA API, or other techniques I'm missing (I am quite inexperienced in graphics work), that I can use to gather these two metrics?

I basically am looking for the ability to:

  1. Provide some GPU status to help the user decide whether it's worth trying to run another instance of the CUDA app.
  2. Warn the user, before launching another instance, if the GPU or its memory is getting too full, should they try despite the info in 1.

Regards
Matt

  • I tried copying the DLL mentioned in the header file to the app directory and making some changes to the registry to try to enable metric collection, but it wouldn't play nicely. I also couldn't get memory allocation assessment to work at all.

cuMemGetInfo() will return the amount of free memory. See: http://forums.nvidia.com/index.php?showtopic=67856

I have not heard of a function that reports GPU utilization (i.e. like Task Manager).

Hi, thanks for the reply, very helpful indeed.

I have one issue though, it works perfectly in XP but in Vista all calls to cuMemGetInfo are returning the same values after allocations and frees.

The other thing I was surprised to note was how much memory is used before really doing anything in CUDA: around 31 MB on my system is consumed when I just allocate a very small buffer. Thereafter memory is used up roughly as I'd expect (though allocations seem to be padded quite generously).

I was a bit surprised to see it report a different total bytes figure in XP and Vista too!

I am using drivers 178.08 in both XP and Vista.

I've included the code below in case anybody can spot something silly I'm doing (I know I'm mixing the Driver and Runtime APIs, but I gather from [post="0"]here[/post] this is OK when doing this kind of thing).

[codebox]// memory_stats_test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#include <stdio.h>

bool cuda_get_memory_stats(unsigned int* free_mem, unsigned int* total_mem)
{
	CUresult cu_status;
	CUcontext temporary_context;
	CUdevice temporary_device;

	*total_mem = *free_mem = 0;

	cu_status = cuInit(0);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuDeviceGet(&temporary_device, 0);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuCtxCreate(&temporary_context, 0, temporary_device);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuMemGetInfo(free_mem, total_mem);
	if(cu_status != CUDA_SUCCESS) return false;

	cu_status = cuCtxDetach(temporary_context);
	if(cu_status != CUDA_SUCCESS) return false;

	return true;
}

void show_result(unsigned int free_mem, unsigned int total_mem, bool ret)
{
	if(ret)
		printf("Total bytes: %u, Free bytes: %u\n", total_mem, free_mem);
	else
		printf("Encountered error\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
	unsigned int free_mem_bytes = 0;
	unsigned int total_mem_bytes = 0;
	cudaError_t status;
	bool ret;
	float* tmp_mem1;
	float* tmp_mem2;
	float* tmp_mem3;
	float* tmp_mem4;

	// get initial memory stats
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// allocate, re-checking memory stats after each allocation
	status = cudaMalloc((void**)&tmp_mem1, 128 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem2, 10000000 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem3, 10000000 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaMalloc((void**)&tmp_mem4, 8194 * sizeof(float));
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// free memory, re-checking stats after each free
	status = cudaFree(tmp_mem4);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem3);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem2);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	status = cudaFree(tmp_mem1);
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	return 0;
}[/codebox]

In XP it returns on my 8600GT with 256MB:

[i]Total bytes: 268107776, Free bytes: 222431744
Total bytes: 268107776, Free bytes: 189007872
Total bytes: 268107776, Free bytes: 148965376
Total bytes: 268107776, Free bytes: 108922880
Total bytes: 268107776, Free bytes: 108857344
Total bytes: 268107776, Free bytes: 108922880
Total bytes: 268107776, Free bytes: 148965376
Total bytes: 268107776, Free bytes: 189007872
Total bytes: 268107776, Free bytes: 189007872[/i]

In Vista it returns:

[i]Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696
Total bytes: 268435456, Free bytes: 233373696[/i]

I ran your code on Vista and got the same result. But then I tried to put a cuMemAlloc() in there instead of a cudaMalloc(), and it looks like that does subtract from free memory on Vista.

I don’t know much about running multiple contexts, but I think this may be Vista’s GPU virtualization at work, allowing each application to allocate from the GPU as much memory as it wants, and then swapping it in and off the video card on demand when the applications’ kernels (or shaders) actually run.

Does Vista actually do swapping like that? I didn’t think it did… but Vista has its own magical resource manager that manages all memory allocations and such for you instead of our driver, so that’s going to be part of the problem.

Now I have to investigate this cudaMalloc versus cuMemAlloc, because that sounds pretty wacky…

Good idea Alex; I verified this also works on my system when swapping over to cuMemAlloc. Code is included below for anybody interested.

If it's a Vista GPU virtualisation issue, I guess it makes sense that mixing the Driver API and Runtime API creates different GPU contexts and could affect it. After a bit of a look around with Google, and having read a Microsoft paper on the new graphics system (http://www.microsoft.com/whdc/device/display/WDDM_VA.mspx), I guess Vista advertises to each GPU context the whole of the video card's memory, regardless of how much any other context has already allocated in its own virtualised space. I get the impression any GPU application has access to the sum total of the memory on the GPU, so some kind of swapping must be done.

This has some quite substantial performance implications as we approach full GPU memory loading, since everything gets dragged over the bus, right? I guess that's always the way with virtual memory management. I wonder if it can be disabled; that may be a silly thought.

It would appear that to get at this info in the Runtime API using Vista is a non-starter for now then. If anybody knows any other methods, please shout up.

[codebox]// memory_stats_test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "cuda.h"
#include <stdio.h>

bool cuda_get_memory_stats(unsigned int* free_mem, unsigned int* total_mem)
{
	*total_mem = *free_mem = 0;
	CUresult cu_status = cuMemGetInfo(free_mem, total_mem);
	if(cu_status != CUDA_SUCCESS) return false;
	return true;
}

void show_result(unsigned int free_mem, unsigned int total_mem, bool ret)
{
	if(ret)
		printf("Total bytes: %u, Free bytes: %u\n", total_mem, free_mem);
	else
		printf("Encountered error\n");
}

int _tmain(int argc, _TCHAR* argv[])
{
	CUresult cu_status;
	CUcontext temporary_context;
	CUdevice temporary_device;

	cu_status = cuInit(0);
	if(cu_status != CUDA_SUCCESS) return -1;

	cu_status = cuDeviceGet(&temporary_device, 0);
	if(cu_status != CUDA_SUCCESS) return -1;

	cu_status = cuCtxCreate(&temporary_context, 0, temporary_device);
	if(cu_status != CUDA_SUCCESS) return -1;

	// init memory stats to zeros
	unsigned int free_mem_bytes = 0;
	unsigned int total_mem_bytes = 0;

	// --- Attempt using Driver API ---
	bool ret;
	CUdeviceptr tmp_mem1;
	CUdeviceptr tmp_mem2;
	CUdeviceptr tmp_mem3;
	CUdeviceptr tmp_mem4;

	// get initial memory stats
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// allocate, re-checking memory stats after each allocation
	cu_status = cuMemAlloc(&tmp_mem1, 128 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem2, 10000000 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem3, 10000000 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemAlloc(&tmp_mem4, 8194 * sizeof(float));
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	// free memory, re-checking stats after each free
	cu_status = cuMemFree(tmp_mem4);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem3);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem2);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuMemFree(tmp_mem1);
	if(cu_status != CUDA_SUCCESS) return -1;
	ret = cuda_get_memory_stats(&free_mem_bytes, &total_mem_bytes);
	show_result(free_mem_bytes, total_mem_bytes, ret);

	cu_status = cuCtxDetach(temporary_context);
	if(cu_status != CUDA_SUCCESS) return -1;

	return 0;
}[/codebox]

If anybody knows how to get at GPU usage (maybe via DirectX, who knows) I'd still be very interested! :thumbup:

Currently there's no way to get GPU usage; there's an NVAPI call, but that doesn't seem to work with CUDA.

Thanks for confirming; could it go onto the API features wish-list please (along with a cudaGetMemInfo)? :whistling:

cudaGetMemInfo I can understand, but what’s the advantage of a call to check the load (as opposed to a “give me the best GPU according to these parameters, which includes number of active contexts” call)?

I am beginning to write a CUDA VST audio plugin (see http://en.wikipedia.org/wiki/Virtual_Studio_Technology) and I suspect users may want to know how close they are to the limits of their GPU, to help them decide whether to use a CPU- or GPU-based plugin for the next effect.

For instance, if they know they're nearing the limit (say it's displayed in a window somewhere showing live GPU and memory usage), they may choose not to load a plugin known to be computationally expensive or memory-hungry, and pick another instead, perhaps one that is CPU-bound or a GPU plugin with lighter demands. There are similar dedicated devices on the market (such as www.uad-2.com) that provide exactly such a reading. I know from experience that, workflow-wise, it's very useful not to be interrupted as the PC judders to a halt because you've loaded up an external card too much and have to decide which plugins to bypass to get everything running smoothly again. Without visual feedback you tend to go gung-ho, frivolously using up an external resource, and wind up running out of it too quickly.

This is just my 2c, but in my experience anything more than one context trying to run kernels on a GPU simultaneously slows things down dramatically. Specifically, two simultaneous instances of my app each run at 1/4 the speed of one instance.

Thanks for the pointer; I am pretty new/naive at this stage. I've done some tests using perhaps 4-6 instances (of some pretty trivial code, I admit) but it's not been too disheartening yet :) I figure that the processing I want to do isn't going to be maddeningly challenging for the GPU, so I'll probably end up limited by memory bandwidth rather than kernel execution speed. I'm not saying this is definitely going to work very well, but it's good fun trying and I think it just might fly.

So if it's not doing very intensive calculations (and perhaps it's even less efficient on the GPU), why not just do it on the CPU, you may ask. Well, so the CPU can be used for some other plugin, I would say :) When you max out your CPU and the GPU is just sat there drawing Aero glass effects, it's frustrating not being able to use it for something more focused!

My situation is simple because each instance pegs the GPU with kernel calls 100% of the time. Your situation is more complicated, I guess. You've got lots of plugins all running concurrently? Are they "bursty" (i.e., run for a little bit and then pause, waiting on some condition for more data to process)? Maybe that is why your 4-6 kernel test didn't show any drastic slowdown: they probably time-shared the GPU nicely. It may be feasible to code your own % load counter in this case: just have each plugin time how long it spends running CUDA calls vs. how long it spends waiting, although that will probably only work when there is a single plugin running. Now I can see why you want a GPU usage monitor :)

You've got a point on data movement, though; that can be a performance killer. Do your plugins ever operate in chains (output of one becomes the input of another)? In that case you can avoid the host<->device copies with a little extra bookkeeping in the data structure (although maybe you already do this).

I don’t know much about VST plugins, but if they live in the same process (they may not, I believe Photoshop plugins all have their own address space) it’s definitely worth it to just use the thread migration API. It’s way faster than switching contexts.

Also keep in mind that all kernel launches are serialized right now, even if you could fit multiple kernels on the GPU at once (which may be true given the sizes of data you’d probably want to use).

Lots of concurrent plugins? Ohh yes! Bursty? Yep, and sometimes you don’t even know how big your bursts are going to be, it can range from a few tens of samples to thousands. Chains of plugins, well yes, but you would typically let the host manage those chains for more flexibility (though I could envisage a ‘chainer’ of GPU plugins).

Although all VST hosts are different, they tend to have one thread for audio processing from which all plugins are called with a batch of samples, and another from which plugins are configured (set up, parameters changed etc). This presents context difficulties right away, I thought of two solutions:

  1. Go for the Driver API and push contexts around as needed to copy/allocate memory and for the actual processing;
    or
  2. Introduce a new thread and use it for the Runtime API when doing anything remotely related to CUDA calls or memory allocations/transfers (clearly I could do this with the Driver API too).

Since I'm new to CUDA, I chose number 2 initially, but I'll try number 1 later on.
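For anyone weighing option 1, a rough sketch of the context-floating approach with the Driver API (assuming the cuCtxPopCurrent/cuCtxPushCurrent calls; some external locking is needed so only one thread holds the context at a time):

```c
CUcontext ctx;

void init_once(void)              /* call once, from any thread */
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);    /* context becomes current on this thread... */
    cuCtxPopCurrent(&ctx);        /* ...so detach it for later claimants */
}

void do_cuda_work(void)           /* call from audio OR config thread */
{
    cuCtxPushCurrent(ctx);        /* attach the floating context here */
    /* cuMemAlloc / cuMemcpyHtoD / kernel launches ... */
    cuCtxPopCurrent(&ctx);        /* release it for the next caller */
}
```

This keeps a single context (and hence a single view of cuMemGetInfo) shared across threads, at the cost of serialising access to it.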

Using a thread plus a double-buffering system helps with the burstiness: I can execute CUDA transfers/operations in a background thread while outputting/collecting other samples in the main audio engine thread (trading off some latency), so the audio engine thread doesn't get stalled waiting for memory transfers or kernel executions (so long as it can all be processed by the time the buffers need swapping).

I am hoping my kernels will run pretty quickly (as I expect them to be quite simple), so having many small, serialised kernels hopefully won't cause too many issues. If it does, I'll just increase the processing buffer latencies to give everything more time to complete before things start blocking on data that isn't ready. As CUDA cards get quicker, presumably that latency can come down with improved bandwidth and lower kernel execution times (and I recognise my 8600GT is getting a bit long in the tooth anyway, so if it works acceptably on that it should work on anything decent).

Regarding the % load counter, I guess I could get it per instance using the timing scheme you mentioned, but then I'd need to start summing across my plugin threads to give a total, and given Vista uses the GPU for Aero too, it would be ideal to have that factored in so I'm not reporting false statistics. I used NVPerfKit and it gave me exactly the GPU usage info I wanted, but it didn't seem possible to get it running without installing the whole SDK, which isn't so user-friendly.

So yeh, a Runtime API memory and GPU load metric would be really useful!

I think the best strategy would be to code all your transformations into one kernel, and partition the GPU to run multiple plugins simultaneously. I don’t think any single plugin will really be able to take advantage of the whole GPU and so this scheme will be much more efficient. Plus, you’ll be able to track the GPU load easily. I don’t think this will be difficult to implement, either. In some ways, simpler. You could, for example, hardcode each plugin to execute as a block with 64 threads and 4KB smem. You launch one block per plugin. If one plugin takes longer, has more work to do, etc, the GPU scheduler handles the load-balancing. A data structure records what each block should be doing, and at the top level you just have switch(work[blockIdx.x]){}.
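A minimal sketch of that per-block dispatch (plugin IDs and effect bodies are made up for illustration):

```c
// One block per plugin instance; work[] says which effect each block runs.
enum { PLUGIN_GAIN, PLUGIN_DELAY };

__global__ void run_all_plugins(const int* work, float** in, float** out, int n)
{
    switch (work[blockIdx.x]) {
    case PLUGIN_GAIN:
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[blockIdx.x][i] = in[blockIdx.x][i] * 0.5f;
        break;
    case PLUGIN_DELAY:
        /* ... other effects dispatch the same way ... */
        break;
    }
}

// Launched once per audio buffer, e.g.:
//   run_all_plugins<<<num_plugins, 64>>>(work, in, out, samples_per_buffer);
```

One launch per buffer means one context and one serialised kernel, with the GPU's own block scheduler doing the load-balancing between plugin instances.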

Do you mean like setting up a daemon / windows service on the CPU such that one single entity communicates with the GPU and have every single plugin locate the daemon and interact with that, rather than the GPU directly itself? I can see how such an approach would simplify management of the GPU and reduce unnecessary contention on the resource, though I’ve never done such a thing. Or perhaps I missed your point entirely…?

Yup, exactly. Daemons are actually really easy. I’m figuring out how to do virtually the exact same thing we’re talking about right now. (Minus the musical aspect.) Actually, I kind of like the market for high-end audio. (There’s something else I’m doing that does involve music.) Maybe we could work together?

Sounds like an interesting idea; how far have you got with implementing the daemon technique and are you noticing improvements?

If you’re working with daemons as services, I wonder if you’re having the same issues as these people:

http://forums.nvidia.com/index.php?showtopic=77197

Matt