High Xorg CPU usage during kernel execution

I’ve been noticing Xorg processor usage spikes sometimes when running my CUDA programs. I only really see it when the kernel takes a while to finish. I’m including some code that attempts to recreate the problem.

The kernel is meant to do some busy work. The usleep is meant to reduce the busy wait from cudaThreadSynchronize so my process’s CPU usage stays low.

I’m running

OS: Fedora Core 7 x86_64

Linux Kernel: 2.6.21-1.3194.fc7

NVIDIA Driver: 177.67

CUDA: NVIDIA_CUDA_Toolkit_2.0_rhel5.1_x86_64

CUDA SDK: NVIDIA_CUDA_SDK_2.02.0807.1535_linux

Server Version Number: 11.0

Server Vendor String: The X.Org Foundation

Server Vendor Version: 1.3.0 (10300000)

NV-Control Version: 1.17

[codebox]#include <stdio.h>
#include <unistd.h>

__global__ void kernel(float *x)
{
    // Just do something that will take time
    *x = 2.0;
    float a = *x;
    for (int i = 0; i < 10000; i++)
    {
        a *= cos(a);
    }
    *x = a;
}

int main(int argc, char **argv)
{
    cudaSetDevice(0);

    float *deviceMem;
    cudaMalloc((void**)&deviceMem, sizeof(float));

    for (int i = 0; i < 30; i++)
    {
        kernel<<<dim3(1024,1,1), dim3(512,1,1)>>>(deviceMem);

        // Reduce CUDA busy wait on thread synchronize
        usleep(500000);
        cudaThreadSynchronize();
    }

    return 0;
}
[/codebox]

So, does that code recreate the problem? I was seeing VERY similar behavior a while back. It turned out to be a problem with accessing out-of-bounds memory in my kernel. The strange thing was that it sometimes produced errors and sometimes it didn’t. If there are problems in your code, they can be difficult to pinpoint because the behavior becomes erratic.

That being said, I’ve found the following few lines to be EXTREMELY useful in troubleshooting. Just place them after each and every call you make to a CUDA function:

[codebox]cudaError_t err;

err = cudaGetLastError();
if (cudaSuccess != err)
{
    fprintf(stderr, "Cuda error: %s: %s.\n", "FunctionName()", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
[/codebox]

Of course, the ‘err’ variable only needs to be declared once. The above will (hopefully) catch any error immediately after it occurs, so you can pinpoint what’s going on.
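
If you end up pasting that after every call, one option is to fold it into a macro. This is just a convenience sketch of the same check (the CHECK_CUDA name is made up, it’s not part of the toolkit):

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical convenience macro wrapping the check above; CHECK_CUDA is
// not part of the CUDA toolkit, just a local helper name.
#define CHECK_CUDA(where)                                              \
    do {                                                               \
        cudaError_t err = cudaGetLastError();                          \
        if (cudaSuccess != err)                                        \
        {                                                              \
            fprintf(stderr, "Cuda error: %s: %s.\n", (where),          \
                    cudaGetErrorString(err));                          \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Usage, e.g. right after a call:
//   cudaMalloc((void**)&deviceMem, sizeof(float));
//   CHECK_CUDA("cudaMalloc()");
[/codebox]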

Another thing I just noticed is that there is an updated driver out for 64-bit Linux (version 177.80). Try updating to see if that helps with anything.

My code recreates the problem for me, but I don’t know if it will for someone else.
I did not include error checking in the posted code for brevity, but I did run with error checking and there are no errors. The included code doesn’t do much anyway, and only writes to one memory location.
I’m wondering if the NVIDIA Xorg module is also doing a busy wait while the kernel is running.
I will update the driver, but I don’t think that will solve the issue, because I’ve seen Xorg go crazy on an updated system.
Again, the included code is just something I wrote to recreate the problem by executing a kernel that takes almost a second to run.
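
For what it’s worth, the fixed usleep could probably be replaced by polling an event so the wait adapts to however long the kernel actually takes. A rough sketch against the CUDA 2.0 runtime (the 1 ms polling interval is arbitrary); I haven’t verified that it changes the Xorg behavior:

[codebox]#include <unistd.h>
#include <cuda_runtime.h>

// Sketch: poll for kernel completion instead of sleeping a fixed 500 ms.
// Record an event right after the launch, then sleep in short intervals
// until the event reports that the preceding work is done.
void waitForKernel(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done, 0);            // enqueued after the kernel launch
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1000);                    // yield the CPU while the GPU works
    cudaEventDestroy(done);
}

// In the loop from the first post:
//   kernel<<<dim3(1024,1,1), dim3(512,1,1)>>>(deviceMem);
//   waitForKernel();   // instead of usleep + cudaThreadSynchronize
[/codebox]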

Yeah, I had checked out that code before posting and it looked fine. I guess I was kind of hoping that the problem was with an earlier call (cudaMalloc or whatever) failing.

Anyway, here’s the thread I started about the problems I was having. There are some suggestions from NVIDIA folks at the beginning, but my questions were largely unanswered and I ended up just replying to myself towards the end. It may be worth at least looking at their suggestions:
http://forums.nvidia.com/index.php?showtopic=77612&hl=

In the meantime, I’ll run your code and see what happens.

I’m seeing the same results: Xorg is eating up CPU. You ARE reading and writing the same memory location from multiple threads at the same time, though. Looking back at the thread I created about this problem, I was doing the same thing. I wonder if that’s what’s causing the problem.
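
If you want to rule that race out, a quick variant where each thread writes to its own slot might be worth a shot. Just a sketch based on your repro code (the allocation in main would need to grow to one float per thread):

[codebox]// Sketch: same busy loop, but every thread gets its own output slot,
// so there is no concurrent read/write of a single float.
__global__ void kernel(float *x)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 2.0f;
    for (int i = 0; i < 10000; i++)
    {
        a *= cos(a);
    }
    x[idx] = a;
}

// In main(), allocate one float per thread to match the launch config:
//   cudaMalloc((void**)&deviceMem, 1024 * 512 * sizeof(float));
[/codebox]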

For those experiencing this problem, is the GPU driving X the same GPU as you’re using for CUDA, or do you have multiple GPUs in the system?

Something interesting I found out.

I ran the program without X running, and it ran fine.

I ran the program under X, logged in from another computer, ran top, and the Xorg process wasn’t doing much at all.

I then ran top in a gnome-terminal under Xorg on the computer running the CUDA program, and Xorg started taking up more of the processor.

While running top in that gnome-terminal, I logged in from another computer again, attached gdb to Xorg, and saw that it was consistently in:

#0 0x00002aaaab8a986b in _nv001232X () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#1 0x00002aaaab8a9b8c in _nv001462X () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#2 0x00002aaaaba29987 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#3 0x00002aaaaba40d04 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#4 0x00002aaaaba41214 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#5 0x00002aaaab9f364a in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#6 0x000000000051c5c4 in ?? ()
#7 0x00000000004bf76d in ?? ()
#8 0x000000000044f7e0 in BlockHandler ()
#9 0x0000000000568395 in WaitForSomething ()
#10 0x000000000044b96a in Dispatch ()
#11 0x000000000043480d in main ()

or

#0 0x00002aaaab8a986b in _nv001232X () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#1 0x00002aaaab8a9b8c in _nv001462X () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#2 0x00002aaaaba29987 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#3 0x00002aaaaba39f80 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#4 0x00002aaaaba3aa25 in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#5 0x00002aaaaba3abdf in BitOrderInvert () from /usr/lib64/xorg/modules/drivers//nvidia_drv.so
#6 0x0000000000529e75 in ?? ()
#7 0x0000000000449f38 in ProcCopyArea ()
#8 0x000000000044baab in Dispatch ()
#9 0x000000000043480d in main ()

So my theory is that X has a bunch of drawing events queued up while CUDA is running a long kernel. When the kernel finishes, those events get dispatched and Xorg executes them in a burst, spiking its CPU usage. But what do I know?
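
If that theory is right, one thing that might be worth trying is splitting the long kernel into several shorter launches so X can get serviced in between. Just a rough sketch based on the code in my first post (the split into 10 chunks of 1000 iterations is arbitrary):

[codebox]// Sketch: run the same total amount of busy work as several shorter
// launches instead of one long one, so the driver/X can run in between.
__global__ void kernelChunk(float *x, int iters)
{
    float a = 2.0f;
    for (int i = 0; i < iters; i++)
    {
        a *= cos(a);
    }
    *x = a;
}

// In the host loop, replacing the single long launch:
//   for (int chunk = 0; chunk < 10; chunk++)
//   {
//       kernelChunk<<<dim3(1024,1,1), dim3(512,1,1)>>>(deviceMem, 1000);
//       cudaThreadSynchronize();
//   }
[/codebox]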

Only 1 card, doing both Xorg and CUDA.

In my case, I have one GPU. It runs X and does my CUDA computations.

When I first monitored the CPU usage with top, I had it in a separate tab in my konsole. I noticed that when I ran the program and then switched tabs, Xorg always seemed to be at a peak and slowly dropped just after switching. I don’t know as much as I should about this stuff, but switching tabs would cause more events to get queued up, right? That would explain why I always saw a peak when switching over (switching tabs created the extra events). It would also explain why the Xorg CPU usage seems much less “peaky” when running top in a different terminal altogether, since I wasn’t doing as much on-screen during execution.

I think you may be onto something here.

Can you give me full code? I’ve got some debugging tools I can use to test this.

I posted the full code in the first post.

My assumption is that while the kernel is doing its work, X events generated by other programs are queued up for drawing until the kernel is done, at which point the events are quickly processed by X, spiking its usage.

Or X is actually trying to process an event, and in order to complete it the NVIDIA driver busy-waits for the kernel to finish.

Not sure where the spike in usage comes from, but I’m pretty sure X events are getting queued up.

Damnit, code boxes scroll now. Sorry about that.

This is probably related: I use NX (www.nomachine.com) to access (CUDA) boxes remotely. All of these machines use the CUDA device as the primary display, and X is always running because I don’t shut it down when I travel home… (there is no fancy GL screensaver involved, btw)

One of my benchmarks gave me on my trusty G80:

66 GB/s when sitting in front of the box at university
48 GB/s (same binary, same everything) when NX’ing to that box from home
66 GB/s when NX’ing to another box and ssh’ing to the CUDA machine
66 GB/s when NX’ing to the CUDA box, opening up another shell, ssh’ing to localhost and running the app

Back then, I thought NX was to blame because it forks Xorg, but now, I am no longer sure. It was surely a pain in the lower end of the back to track this down…

I’m trying to figure out what’s different about the last case. Is X forwarding not enabled, so sshing to localhost effectively unsets $DISPLAY?

Nope, and setting / unsetting $DISPLAY does not affect this behaviour. The last observation is indeed weird…

Any luck with the debugging tools, tmurray?

Haven’t had a chance yet, will play with this as soon as I have time. (which might not be for another week, sorry)