Minimum size of element in rtContextLaunch2D

Hello All,

I’m a new OptiX user, and I have to reuse code written by someone who can’t be reached right now.

I would like to know whether there is a minimum size for the window that is launched with rtContextLaunch2D, because the code launches a window of Nx1 elements. If I launch with N = 100000 everything works, but if I launch with N = 10000 the function rtContextLaunch2D throws a standard C++ exception… The buffers that I use to communicate between the context and the C++ side have size N, so I can’t understand this…

The current configuration is CUDA 4.1, OptiX SDK 2.5.0, Windows 7 64-bit, Visual Studio 2010 with a 32-bit project, and a GTX 460.

Thanks a lot!

Imanol.

First thing to find out is what the exception reason was by decoding its error code.
Look for rtContextGetErrorString() in the OptiX SDK.

Thanks for replying so fast!

The error string returned is:
“Unknown error (Details: Function “_rtContextLaunch2D” caught C++ standard exception: unknown error)”

I should mention that I launch the context like this:

_context->launch( PROGRAM_PHOTON, static_cast<unsigned int>(_numPhotons), static_cast<unsigned int>(1) );

Thanks!

Although it shouldn’t be a problem, is there a reason for the 2D launch with all of the work in the x dimension? I can understand launching it in ( 1, N ) to distribute work to multiple GPUs, but ( N, 1 ) could simply be launched as 1D:

_context->launch( PROGRAM_PHOTON, static_cast<unsigned int>(_numPhotons) );

That won’t likely solve your problem, though.

But to answer your other question:

No, there is no minimum launch size.

But if I had to guess, there is some assumption of a certain size within the kernels themselves. Without knowing much detail about what you’re doing, I can only recommend revisiting all buffer reads/writes.
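
One way to check this on the host, as a rough sketch: replay the index arithmetic a kernel uses and look for the first launch index that would touch memory outside the buffer. firstOutOfBounds and the indexOf callback are hypothetical names for illustration. You would have to copy the real index computation out of the device code yourself.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical host-side check (not an OptiX function).
// indexOf mimics the buffer index a kernel computes for launch index i.
// Returns the first launch index whose buffer index is >= bufferSize,
// or launchWidth if every access stays in range.
std::size_t firstOutOfBounds(
    std::size_t launchWidth,
    std::size_t bufferSize,
    const std::function<std::size_t(std::size_t)>& indexOf) {
    for (std::size_t i = 0; i < launchWidth; ++i) {
        if (indexOf(i) >= bufferSize) {
            return i;  // this thread would read or write past the end
        }
    }
    return launchWidth;  // all accesses are within the buffer
}
```

For example, a kernel that adds a hard-coded offset sized for a larger N would pass this check with the large launch and fail it at index 0 with the small one.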

The code is from another developer who is not currently available… I’m only implementing new functionality on top of his code, and I don’t know why he did this… So I suppose I can use a 1D launch… But from your comment, should I understand that a 2D launch as (1,N) will reduce the computation time compared with a 2D launch as (N,1) or a 1D launch?

I don’t understand what you mean by revisiting all buffer reads/writes. Do you mean looking for out-of-bounds memory accesses and that kind of thing?

Thanks!

Not necessarily. Nvidia processors run a maximum of 32 threads per warp, but depending on the dimensions of your launch, some warps may run at less than maximum capacity. Also, depending on the amount of coherence in your work (that is, the extent to which all threads do the same thing at the same time), it may or may not be beneficial to use all 32 threads. See http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf for details.
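
To put a number on "less than maximum capacity", here is a small arithmetic sketch. It assumes threads are packed linearly into 32-wide warps. How OptiX actually maps launch indices to warps is not something I can state, so treat this as an illustration only. The function names are made up for the example.

```cpp
#include <cstddef>

// Warp width on NVIDIA GPUs.
const std::size_t kWarpSize = 32;

// Number of warps needed for the given thread count, assuming threads
// are packed linearly into warps (an assumption, see above).
std::size_t warpCount(std::size_t threads) {
    return (threads + kWarpSize - 1) / kWarpSize;  // round up
}

// Lanes left unused in the last, partially filled warp.
std::size_t idleLanes(std::size_t threads) {
    return warpCount(threads) * kWarpSize - threads;
}
```

Under that assumption, 10000 threads need 313 warps with 16 idle lanes in the last one, while 100000 threads fill 3125 warps exactly.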

So to answer your first question, without testing each option ((N,1), (1,N), and 1D), you’ll never know which one is faster for your particular application.

Mmmmmmm,

ok I understand!

I should do some tests then!

Thanks!!!

Yes.

That’s not really what I meant. I forget the exact number, but when you do a 2D launch as (N,1), the first ~65k threads or so will be launched on the first device. So, you’ll get very low utilization on the second device. Launching it as (1, N) will more evenly distribute the workload to both devices.

This only matters in a multi GPU environment though.