Maximum number of queued kernels

My kernel contains a loop with something like 8 to 256 iterations. I’ve noticed that for some reason I get better performance if I limit each kernel execution to maybe 16 or 32 iterations and make multiple kernel executions. I don’t really understand why. My best guess is that it stops the different threads that are running on one multiprocessor from getting too far out of step with each other and therefore improves the cache coherence when sampling textures. The problem is that if I make too many kernel executions I seem to fill up the command queue and I start eating up CPU cycles. What is the maximum number of executions I can queue up? Can anyone suggest an easy way to extend it? Is there a better approach? I looked at thread synchronisation within the kernel but it seemed to slow everything down massively. Maybe I was doing something wrong though.
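To make it concrete, the split looks roughly like this (a heavily simplified sketch; the names, launch sizes and placeholder work are made up, not my real kernel):

__global__ void process_slice(float *d_data, int firstIter, int numIters)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = firstIter; i < firstIter + numIters; ++i) {
        // ... per-iteration work: read linear memory, sample textures, write ...
        d_data[i * stride + tid] += 1.0f;               // placeholder work
    }
}

void run_split(float *d_data, int numBlocks, int totalIters, int itersPerLaunch)
{
    // e.g. 256 total iterations issued as 8 launches of 32 iterations each
    for (int first = 0; first < totalIters; first += itersPerLaunch)
        process_slice<<<numBlocks, 512>>>(d_data, first, itersPerLaunch);
}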

I don’t have any idea what the max queue length is, but I’ve also experienced that more, smaller kernels tend to perform better than fewer, longer kernels (see this topic, though I wasn’t doing texture lookups).

The implementation I wound up using only launches one kernel at a time, then waits for it to complete using a CUDA event before starting the next call. You might try launching several kernels, putting events on every 5th one, then adding more to the queue as the events complete… just a thought.
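Something like this rough sketch, maybe (the in-flight cap and the kernel are placeholders, not anything from a real app):

__global__ void my_kernel(float *out) { out[threadIdx.x] += 1.0f; }   // placeholder; out holds at least 512 floats

void run_throttled(float *d_out, int totalCalls)
{
    const int MAX_IN_FLIGHT = 10;                      // arbitrary cap on queued launches
    cudaEvent_t ev[MAX_IN_FLIGHT];
    for (int i = 0; i < MAX_IN_FLIGHT; ++i)
        cudaEventCreate(&ev[i]);

    for (int call = 0; call < totalCalls; ++call) {
        int slot = call % MAX_IN_FLIGHT;
        if (call >= MAX_IN_FLIGHT)
            cudaEventSynchronize(ev[slot]);            // wait for the launch that last used this slot
        my_kernel<<<128, 512>>>(d_out);
        cudaEventRecord(ev[slot], 0);
    }

    for (int i = 0; i < MAX_IN_FLIGHT; ++i)
        cudaEventDestroy(ev[i]);
}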

Hope that helps!

Ben

The queue depth is 16 on compute 1.0 devices and 24 on compute 1.1. I don’t know about compute 1.3.

Thanks. That’s probably just about enough for my application.

Thanks for sharing your findings. It’s very interesting that you see the same effect even without texture lookups. It’s starting to sound like a CUDA bug to me.

Thanks for the suggestion as well. I’ll probably end up doing something along those lines.

After your comments I decided to test my theory that in my case the problem is related to the texture cache. I modified my program so that the texture sampling coordinates don’t change from one loop iteration to the next. This had no noticeable effect on the behaviour - I still get best performance when I use 16 to 32 iterations per execution. This makes no sense to me. My threads are 100% non-divergent (every thread in the entire grid will execute the exact same statements and perform the exact same number of loop iterations).

For the time being I will work around the issue using multiple kernel executions but since my kernel does quite a few calculations before entering its loop this must be a sub-optimal solution.

Edit: In fact, after a couple more experiments it looks like the size of the address range in linear memory that I read from/write to is far more important. Does that ring any bells with anyone?

Could it be that with 32 threads per block you are reading from addresses close together, so almost all reads are cache hits?

You can test this by letting each thread read locations more than 8 KB apart. That should prevent cache hits.
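For example, something like this (untested; the 16 KB stride is just picked to be safely past the cache’s working set):

texture<float, 1, cudaReadModeElementType> texIn;     // bound to a large linear buffer elsewhere

__global__ void strided_read(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // 4096 floats = 16 KB between consecutive threads, so no two threads
    // should ever hit the same cached data
    out[tid] = tex1Dfetch(texIn, tid * 4096);
}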

I’m not quite sure what you mean by “32 threads per block”. I actually have 512 threads per block. Each thread will perform say 16 or 32 iterations of a loop. In general I would expect all 32 iterations of all 512 threads to result in cache hits most of the time.

Maybe it would help if I provide a little more information.

My kernel looks a bit like this:

for (…)
{
    read from array in linear memory
    sample from textures
    combine values
    write to array in linear memory
}

Normally the addresses in linear memory and the texture sampling coordinates would both depend on the threadId and the loop counter. Initially I suspected that by allowing too many iterations in a single kernel execution I was polluting my texture cache, so I forced the texture sampling coordinates to depend only on the threadId and not the loop counter. It didn’t help. So then I made it so that instead of every iteration reading from/writing to a new address in linear memory, I recycled the addresses after 16 iterations. This made a big difference. But global memory is uncached, so what’s going on?
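In code, the address change was roughly this (illustrative only, not my real indexing):

__global__ void my_kernel(const float *d_in, float *d_out, int iterations)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = 0; i < iterations; ++i) {
        // original: a fresh address every iteration, so the address range
        // touched grows with the iteration count
        int addr = i * stride + tid;

        // experiment: recycle addresses after 16 iterations, capping the
        // range at 16 * stride elements
        // int addr = (i % 16) * stride + tid;

        float v = d_in[addr];
        // ... sample textures, combine values ...
        d_out[addr] = v;
    }
}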

In case it’s important, 256 iterations would normally require about 256MB of linear memory whereas I see the highest performance when I limit each kernel execution to about 32 iterations (or 32MB of linear memory). Also note that all of my global memory access is fully coalesced. A rough calculation indicates that I am using about 30GB/sec of memory bandwidth on an 8800 GT (not including texture sampling, which will be very heavily cached).

How many GiB/s do you get when including the texture reads? See the other recent post on the texture cache (The Official NVIDIA Forums | NVIDIA) for my standard discussion, but I’ve never seen the texture “cache” provide more than the max device memory bandwidth even when reading the same value over and over.

IIRC, the 8800 GT can attain 50 GiB/s of memory bandwidth so 30 not counting texture reads is doing pretty good.

Regarding your observation that the width of memory access makes a difference, see this thread http://forums.nvidia.com/index.php?showtop…&hl=memory+bank , although I’m not sure if the issues are related or not.

Thanks for your comments. I think some of my results differ slightly from yours though.

I don’t think that your comment about temporal locality is true in my case. My application is backprojection of a 2D image into a 3D volume. A single iteration of a single thread processes one voxel (volume element). It samples the 2D image and adds it to the value already present in the 3D volume. Thus 32 iterations of one block process a 16x32x32 voxel region. The projection of this region onto the 2D image is (very very roughly) still only 32x32 pixels by 4 bytes per pixel = 4k. Furthermore, there will be lots of other blocks that also project onto roughly the same small region of the 2D image. This is a massive oversimplification but you get the idea.
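A very crude sketch of one thread’s work, with the projection geometry replaced by a placeholder (projTex and the linear z-dependence are made up, not my real code):

texture<float, 2, cudaReadModeElementType> projTex;   // the 2D image

// launched with a 16x32 thread block, so 32 iterations cover a 16x32x32 voxel slab
__global__ void backproject_slab(float *volume, int nx, int ny, int zFirst, int zCount)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    for (int z = zFirst; z < zFirst + zCount; ++z) {
        // placeholder projection of voxel (x, y, z) onto the image plane;
        // the real geometry is considerably more involved
        float u = x + 0.1f * z;
        float v = y + 0.1f * z;

        volume[(z * ny + y) * nx + x] += tex2D(projTex, u, v);
    }
}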

I also don’t agree with your description of the texture cache behaviour. The texture cache is local to the MP (actually to one pair of MPs) and (as you say) the documentation describes it as being as fast as shared memory in the case of a cache hit. My understanding is that only cache misses will cause reads from global memory and therefore use up global memory bandwidth. If I were to assume your “almost coalesced memory reader” model then my total bandwidth would be about 84GB/sec on a stock 8800 GT!

Thanks for the link about memory access. I haven’t had a chance to read it all yet but it looks very helpful.

That link was indeed very helpful. I’ve changed the layout of my data in memory and I’m now getting slightly higher bandwidth and, more importantly, it looks like I’ve got rid of all the weird slowdowns for particular problem sizes and/or lots of iterations. I’m now getting 35GB/sec (not including texture sampling) out of my 8800 GT (which I’m pretty happy about).

Just curious: My reading of the manual (Programming Guide 1.1, sec 5.1.2.3) seems to indicate that texture cache doesn’t actually decrease latency, just reduces load on the memory bus (It looks like it’s designed to be constant-time to aid in streamed pixel processing when in graphics mode). Is that what people have been seeing? I haven’t had time to test it myself. Of course, if there are enough blocks available, the latency should just get hidden, so maybe “as fast as shared memory” is, in the aggregate, true?

I would agree with your interpretation of the guide. By all accounts, the texture cache is documented to act as a cache and should in principle increase the effective bandwidth when values in the cache are reused. I’ve just never seen the texture units provide more than ~70 GiB/s in practice, even when only accessing elements in a small region of memory. Someone please enlighten me with a microbenchmark showing how the cache can be taken advantage of to get more effective bandwidth… I’d love to find a way to get it working in my bandwidth starved app :)

http://forums.nvidia.com/index.php?showtop…76&#entry256376

Here is one of my microbenchmarks comparing constant vs shared vs texture memory for randomly reading a small 256 element array which by all accounts should be in cache for the entire run. With all threads in a warp accessing the same element in the array, the effective bandwidth (counting each texture read) was 67.95 GiB/s on an 8800 GTX. Shared memory put out 233.36 GiB/s, and constant memory achieved 271.84 GiB/s.
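In outline the three kernels look something like this (a simplified sketch, not the exact code I ran; the index expression just stands in for the random-but-uniform-per-warp access pattern):

#define N     256
#define READS 1000

__constant__ float c_arr[N];
texture<float, 1, cudaReadModeElementType> t_arr;     // bound to a 256-float linear buffer

__global__ void read_constant(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < READS; ++i) {
        int j = (i * 13 + blockIdx.x * 7) & (N - 1);  // identical for every thread in the warp
        sum += c_arr[j];
    }
    out[tid] = sum;
}

__global__ void read_texture(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < READS; ++i) {
        int j = (i * 13 + blockIdx.x * 7) & (N - 1);
        sum += tex1Dfetch(t_arr, j);
    }
    out[tid] = sum;
}

__global__ void read_shared(const float *g_arr, float *out)
{
    __shared__ float s_arr[N];
    s_arr[threadIdx.x] = g_arr[threadIdx.x];          // assumes blockDim.x == N
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < READS; ++i) {
        int j = (i * 13 + blockIdx.x * 7) & (N - 1);
        sum += s_arr[j];
    }
    out[tid] = sum;
}

Effective bandwidth then counts every read: total threads x READS x 4 bytes, divided by the kernel time.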

When I get into the office tomorrow, I might try a new microbenchmark with a 1-element texture and see how much bandwidth can be pulled from that.

I can probably prepare an example demonstrating what I’m seeing but it might be a while before I have time. In the meantime you might like to do some tests with textures mapped to 2D CUDA arrays using normalised sampling coordinates and see what happens.

Edit: You might also want to run the kernels through decuda and see what they look like. In particular, do you get 100% occupancy with all three kernels?

Scrap that. I think the problem is that the card has a limited texture fill rate. On the 8800 GTX it’s meant to be about 36 billion texels per sec, but I’ve done a little googling and it looks like it’s actually only 18 billion texels per sec under some circumstances. That would be 72GB/sec for your 4 byte reads. Sound plausible?

I agree that this is low compared to global memory bandwidth but remember:

  1. If you are making good use of the cache then you might be getting all this bandwidth without using much of your real global memory bandwidth

  2. Bi-linear filtering is available for free

  3. You are not using any of your precious shared memory

  4. You don’t have to worry about bank conflicts

  5. You don’t have to waste instructions on cache logic

  6. It might be possible to obtain more than 4 bytes in a single texturing operation (see the sketch below)
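For example (untested):

texture<float4, 1, cudaReadModeElementType> t4;       // bound to a linear float4 buffer

__global__ void fetch4(float4 *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tex1Dfetch(t4, tid);                   // 16 bytes in a single texture operation
}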

Sounds very plausible. It also explains why my kernels using textures run better on G92 parts. I’ll run that microbenchmark on a G92 board to see if things change at all (as it has more texture units). I feel stupid for not looking this number up myself. And it explains why my early tests using 3 textures (one each for x, y, z) were slower than one float4 read that used more bandwidth.

Interesting idea. I’ll try another microbenchmark to test this. If things work out, it could mean better performance than I’ve been expecting for a few kernels I haven’t written yet.

Agreed. Although bilinear filtering is of little use in my application.

Most of the reads in my real app are float4 reads, and I still don’t see > device mem bandwidth. However, it is reading a large array in a fairly random pattern (although I use data-reordering to make the data locality as good as possible). I’ve never tried benchmarking texture reads of a small float4 array, but I will do so.

I have one thing to add: adding a texture read to a kernel often increases the register usage by ~3 registers. Constant memory might be the best option (if the array is small enough) in these situations, although with the doubled registers on the G200 chip the tradeoffs might work out differently.
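For a small table, the constant memory alternative is just something like this (a sketch; c_table and h_table are made-up names):

__constant__ float c_table[256];

// host side, once at startup (h_table is a 256-float host array):
// cudaMemcpyToSymbol(c_table, h_table, 256 * sizeof(float));

__global__ void use_table(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread in the block reads the same element, which is the fast
    // (broadcast) case for the constant cache
    out[tid] = c_table[blockIdx.x & 255];
}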

Interesting. May I ask where you found that information, or did you simply try it out?

I wrote a small app that calls a kernel many times, printing a wall clock time after each kernel call (I can post the code if you want).

The maximum number of queued kernels quoted above may be out of date, though. I was talking to John Stone and he mentioned that the latest drivers have much deeper queues (hundreds of calls). I haven’t checked this with my test code yet.

I just wrote a quick test program and can confirm this with the CUDA 2.0 beta driver 177.13. On an 8800 GTX, the queue depth was 145 calls, and on a GTX 280 in a different computer, the depth was 154 calls.

Here is the code I used to check this:

#include <stdio.h>
#include <stdlib.h>

#define _DEBUG
#include "cutil.h"

__global__ void waste_time(float *out)
{
    // pointless kernel, just to keep the GPU busy
    out[0] += sin(cos(exp(out[0])));
}

int main(int argc, char **argv)
{
    CUT_DEVICE_INIT(argc, argv);

    float *out;
    CUDA_SAFE_CALL(cudaMalloc((void**) &out, sizeof(float)));

    unsigned int timer;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    for (int i = 0; i < 1000; i++) {
        CUT_SAFE_CALL(cutResetTimer(timer));
        // kernel launches are asynchronous, so this normally times only the
        // launch call itself; once the queue fills up, the launch blocks and
        // the printed time jumps to roughly one kernel's execution time
        waste_time<<<10240, 512>>>(out);
        printf("Call %d: %f ms\n", i, cutGetTimerValue(timer));
    }
}

Okay, so the maximum queue depth does not seem to be the pitfall in my case, because I am nowhere near those limits. However, I am clearly misunderstanding something here.

I changed your code snippet slightly by creating a stream and adding a memcpy in between the kernel calls. For what reason does the memcpy synchronize?

#include <stdio.h>
#include <stdlib.h>

#define _DEBUG
#include "cutil.h"

__global__ void waste_time(float *out)
{
    // pointless kernel
    out[0] += sin(cos(exp(out[0])));
}

int main(int argc, char **argv)
{
    CUT_DEVICE_INIT(argc, argv);

    float *d_out, *d_out2, *h_out3;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_out, sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_out2, sizeof(float)));
    CUDA_SAFE_CALL(cudaMallocHost((void**) &h_out3, sizeof(float)));

    unsigned int timer;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    cudaStream_t streamIdx;
    cudaStreamCreate(&streamIdx);

    for (int i = 0; i < 200;) {
        cutResetTimer(timer);
        cudaMemcpyAsync(d_out2, &h_out3, sizeof(float), cudaMemcpyHostToDevice, streamIdx);
        waste_time<<<10240, 512, 0, streamIdx>>>(d_out);
        printf("kernel %03d: %6.2f ms\n", i++, cutGetTimerValue(timer));
    }

    CUDA_SAFE_CALL(cudaFree(d_out));
    CUDA_SAFE_CALL(cudaFree(d_out2));
    CUDA_SAFE_CALL(cudaFreeHost(h_out3));
}

Output is:

Using device 0: GeForce 8800 GTX

kernel 000:   0.02 ms
kernel 001:   0.01 ms
kernel 002:  31.79 ms
kernel 003:  31.79 ms
kernel 004:  31.77 ms
kernel 005:  31.87 ms
kernel 006:  31.84 ms
kernel 007:  31.84 ms
kernel 008:  31.85 ms
kernel 009:  31.84 ms
kernel 010:  31.80 ms
kernel 011:  31.77 ms
kernel 012:  31.78 ms
kernel 013:  31.81 ms
kernel 014:  31.76 ms
..