Irregularity in the timings: a few statistics

Hey there,

I've been doing some timings of my 2 kernels, which I call one after the other. I call cudaThreadSynchronize() before launching the first kernel and again after the second kernel finishes. I average over 10,000 iterations (the first 10,000 are not taken into account). Here are a few statistics; a sketch of the timing loop follows them:

Average: 0.87ms

0.2% are below 0.81ms
30% are below 0.83ms
92% are below 0.87ms

7.4% are above 0.9ms
6.5% are above 1ms
2% are above 1.5ms

Maximum: 3.25ms
Minimum: 0.805ms
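
For reference, the timing loop looks roughly like this. It is only a sketch of what I described above: the kernels are placeholders for my two real kernels, and the sizes and CPU timer are illustrative.

    // Sketch of the timing loop: sync, time two kernel launches, sync again.
    #include <cuda_runtime.h>
    #include <sys/time.h>
    #include <stdio.h>

    __global__ void kernel1(float4 *v, int n) { /* placeholder for my first kernel */ }
    __global__ void kernel2(float4 *v, int n) { /* placeholder for my second kernel */ }

    static double ms_between(timeval a, timeval b)
    {
        return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
    }

    int main()
    {
        const int n = 4096, block = 256, grid = (n + block - 1) / block;
        float4 *d_v;
        cudaMalloc((void **)&d_v, n * sizeof(float4));

        for (int iter = 0; iter < 20000; ++iter)      // the first 10,000 are discarded
        {
            timeval t0, t1;
            cudaThreadSynchronize();                  // make sure nothing is still pending
            gettimeofday(&t0, NULL);

            kernel1<<<grid, block>>>(d_v, n);
            kernel2<<<grid, block>>>(d_v, n);

            cudaThreadSynchronize();                  // wait for both kernels to finish
            gettimeofday(&t1, NULL);

            if (iter >= 10000)
                printf("time: %g ms\n", ms_between(t0, t1));
        }
        cudaFree(d_v);
        return 0;
    }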

I was wondering whether these variations in the timings are expected. Has anyone else experienced this?
Being above 1.1x the average roughly 7% of the time seems fair enough, I reckon. But why does the maximum reach more than 3ms? My application needs to run at 1,000Hz, so I need to stay below 1ms on every iteration. Because of these peaks, my rendering is a bit jerky (I compute the positions of some vertices with these kernels and render them afterwards). I don't think I'm doing anything wrong.
It's worth noting that I also tried decreasing the size of my model. With a smaller model I see roughly the same pattern (average 0.32ms, 6% above 0.4ms, maximum 1.2ms).

Can I fix this, or is it normal? And if it's normal, perhaps someone from NVIDIA can explain where it comes from and whether this irregularity will be fixed in a future release. That would be great.

Have you already checked whether the slowdown is data dependent? I don't see anywhere near the same variations, at least when I call the same kernel with the same data that many times.

I process the same data at each iteration. Basically it's a finite element solver that computes the vertex positions at each iteration. So it's not exactly the same data, since the values depend on the previous iteration, but it is exactly the same number of operations (same number of texture fetches, multiplications, etc.). Is that what you meant by data dependent?

If I don't average, here is a sample of what I get:

time: 0.842 ms
time: 0.835 ms
time: 1.337 ms
time: 0.834 ms
time: 0.835 ms
time: 0.835 ms
time: 0.832 ms
time: 0.822 ms
time: 0.83 ms
time: 0.837 ms
time: 0.831 ms
time: 0.834 ms
time: 0.836 ms
time: 0.832 ms
time: 0.83 ms
time: 0.832 ms
time: 0.828 ms
time: 0.832 ms
time: 0.838 ms
time: 0.812 ms
time: 0.827 ms
time: 2.175 ms
time: 0.847 ms
time: 0.827 ms
time: 0.837 ms
time: 0.844 ms
time: 0.838 ms
time: 0.835 ms
time: 0.839 ms
time: 0.842 ms
time: 0.839 ms
time: 0.836 ms
time: 1.337 ms
time: 0.835 ms
time: 0.827 ms
time: 0.819 ms

The large times are not regular; they look random to me.

Any cache access can potentially introduce data dependency and make your kernels run at different speeds. You mentioned that you use texture fetches. Is it possible that the coordinates of your fetches change from run to run? This would change the cache hit/miss ratio, causing your kernel to run longer or shorter as a result.
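
For example, something along these lines would be data dependent: the location of the fetch itself comes from the data, so the cache hit/miss ratio changes whenever the data changes (the names here are purely illustrative, not your code).

    // Illustration: the texture coordinate is read from the data, so the
    // access pattern, and therefore the cache behaviour, is data dependent.
    texture<float4, 1, cudaReadModeElementType> nodeTex;

    __global__ void gather(const int *indices, float4 *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
        {
            int j = indices[tid];                 // data-dependent fetch coordinate
            out[tid] = tex1Dfetch(nodeTex, j);    // hit or miss depends on j
        }
    }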

The other thing to check is your constant memory access. Constant memory is cached too, and if you access different elements of constant memory depending on your data, that is another data dependency that could skew your timings.
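
In code it would look something like this (again just an illustration): when the index into constant memory depends on the data, different elements hit the constant cache on different runs.

    // Illustration: a data-dependent index into cached constant memory.
    __constant__ float coeff[256];

    __global__ void apply(const int *idx, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] *= coeff[idx[tid]];          // which constant is read depends on the data
    }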

It would be interesting to look at your code/data and see why you see mostly fast timings (0.8ms) and then a sudden jump (2ms). Does your FEM reach some boundary condition that causes a dramatic change in the data access pattern?

Since you mention rendering, I assume you are using the same CUDA device for compute and display. Would it be possible, for testing purposes, to run your code without the display, in a Linux console environment (no X)? I've noticed while benchmarking my code that the slightest display update can slow down the CUDA compute measurably (and I average over 10 s).

Maybe there is some kind of context switch to update a 2D window going on when you see those blips. I'll check the fluctuations in per-iteration timings for my kernels later today to see whether I get the blips too.

I have some numbers to support my hypothesis:

Timings were performed over 10 seconds (about 1,700 kernel invocations), and the first run was thrown out. Every kernel invocation performs exactly the same operations on exactly the same dataset, though it does perform many semi-random texture reads, so the order of warp execution may change performance slightly.

I would generate more statistics, but an as-yet-undiagnosed bug prevents me from running more than a few thousand kernels in a row.

Platform: AMD64 Gentoo, using CUDA 1.0

When running X:
mean: 5.76 ms
min: 5.70 ms
max: 7.17 ms
2.5% above 6.00 ms

Running console only (no X):
mean: 5.70 ms
min: 5.68 ms
max: 5.87 ms
0% above 6.00 ms

MisterAnderson> Hmm, interesting. I did notice something like this a few days ago but I didn't see the connection right away. I'm running Kubuntu, and when I switch between desktops (or even just between windows) the timings soar. If I switch 3 or 4 times, at about one switch per second, so not even very fast, my averages reach several ms during that period. They come back to normal when I stop. And in some cases (but it's not very regular, so it's not easy to reproduce) I got a:

sphyraena>

i) I'm using texture fetches indeed, but the fetches only depend on the connectivity of my mesh, so they are constant at each iteration. The read value may change, but not how the access happens. The only thing is that I change some texture bindings prior to launching my kernels. I have 3 different arrays and 2 texture references, and I use 3 different combinations depending on the iteration number modulo 3 (roughly as sketched below). But I think the influence is minor, and even if there were an influence, it would follow a 3-iteration cycle, which is not what I observe. And the coordinates of my fetches are the same at each iteration, so I don't think this can be the cause of my problem.
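
Roughly, the binding rotation looks like this; it's only a sketch with made-up texture and array names, not my actual code.

    // Sketch: two texture references rotated over three arrays, chosen by iteration % 3.
    #include <cuda_runtime.h>

    texture<float4, 1, cudaReadModeElementType> texA;
    texture<float4, 1, cudaReadModeElementType> texB;

    void bind_for_iteration(int iter, float4 *buf0, float4 *buf1, float4 *buf2, size_t bytes)
    {
        float4 *bufs[3] = { buf0, buf1, buf2 };
        cudaBindTexture(0, texA, bufs[iter % 3], bytes);        // one of three combinations,
        cudaBindTexture(0, texB, bufs[(iter + 1) % 3], bytes);  // repeating every 3 iterations
    }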

ii) I’m not using constant memory.

iii) No, my FEM doesn't reach any boundary condition. My test model is a cube, and I move one face of this cube through a cycle, so it's always the same motion; nothing changes over time. And no, I can't reveal my code, it's too sensitive. This kind of FEM solver on the GPU is a world first, so there's no way I'll reveal any part of the code.

I did further tests. Normally I use 2 kernels in my algorithm. For testing purposes, I removed the first one, and the second one is:

    int tid = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    if (tid < size)
    {
        Array[tid] = make_float4(0, 0, 0, 0);
    }

So basically it does nothing. I also disable the rendering I usually do (so no other kernel runs). When I time this kernel, I get an average of 0.079ms, but 6% of the iterations are above 0.1ms and the maximum reaches 0.93ms! This is crazy.
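
For completeness, here is roughly how that kernel is launched and timed each iteration (a sketch: the kernel name and the BLOCK_SIZE value are made up for the example).

    // Sketch: the body shown above wrapped in a kernel, launched, and synchronised.
    #include <cuda_runtime.h>
    #define BLOCK_SIZE 256

    __global__ void clearKernel(float4 *Array, int size)   // hypothetical name
    {
        int tid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        if (tid < size)
            Array[tid] = make_float4(0, 0, 0, 0);
    }

    void run_once(float4 *d_Array, int size)
    {
        dim3 block(BLOCK_SIZE);
        dim3 grid((size + BLOCK_SIZE - 1) / BLOCK_SIZE);    // enough blocks to cover every element
        clearKernel<<<grid, block>>>(d_Array, size);
        cudaThreadSynchronize();                            // so the measured time includes the kernel
    }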

Do you have any idea what it could be? This is very important: my team will demo something next week at MICCAI (Medical Image Computing and Computer Assisted Intervention, http://www.miccai2007.org/), and I really need to fix this before then.

Thanks for your help.

Which OS?

Have you tried to run with 2 cards, one for display, one for CUDA?

I'm running Kubuntu 6.06 on a Dell Precision 690 with a Xeon and an 8800 GTX.

And no, I haven't tried running 2 cards. Should I? The thing is, in my complete version I use vertex buffers for rendering. So if I use 2 cards, one for rendering and one for the physics computation, I'm going to have to copy the vertex and normal buffers from one card to the other. I'm currently avoiding any copy by writing the vertex and normal buffers directly with CUDA (roughly as sketched below). But if you think that would fix the irregularities, I'll try it and see what performance I can get.
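
For reference, the direct-write path is basically the standard CUDA/OpenGL buffer-object interop. This is only a sketch with made-up names; the real code just does the same register/map/write/unmap sequence on the vertex and normal buffers.

    // Sketch: writing an OpenGL VBO directly from a CUDA kernel (CUDA 1.x interop API).
    #include <cuda_runtime.h>
    #include <GL/gl.h>
    #include <cuda_gl_interop.h>

    void update_vbo(GLuint vbo, int numVerts, dim3 grid, dim3 block)
    {
        float4 *d_verts = 0;

        // One-time setup elsewhere: cudaGLRegisterBufferObject(vbo);
        cudaGLMapBufferObject((void **)&d_verts, vbo);   // device pointer into the VBO

        // Hypothetical kernel writing the new vertex positions straight into the VBO:
        // computePositions<<<grid, block>>>(d_verts, numVerts);

        cudaGLUnmapBufferObject(vbo);                    // give the buffer back to OpenGL for rendering
    }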

If you are running X and CUDA at the same time, the card has to respond to two sources of requests. I would try running the display on a non-G80 card and see whether the CUDA timing becomes more uniform.

I can't do that with the spare graphics card I have; I don't have enough slots on my motherboard. So I guess there's no solution… :(

I bet the card->CPU->card memory copies would have killed your performance anyway. Since your app displays the results on screen, you obviously cannot get rid of X completely for the demo, but what about displaying your output in a fullscreen window? Perhaps then X will no longer try to update other windows…

Yes, definitely. I would need to copy the vertex and normal buffers more than 1,000 times per second between the two cards. That would kill my performance.

I tried the full screen mode. It helps a bit.

On the model size I mentioned earlier in this thread, it didn't have much effect (still 5.6% instead of roughly 6% above 1ms, for an average of 0.85ms).

But on the smaller model, it helped a bit more:

  • Window mode: average 0.58ms; 3.6% above 0.7ms

  • Fullscreen mode: average 0.58ms; 2.8% above 0.7ms

And that's the model we're going to demo at MICCAI next week anyway, and of course we'll demo it in fullscreen. So yes, it's less bad than I thought.