OptiX Prime error in cudaMemcpyAsync when ray buffer exceeds 2MB

When using OptiX Prime to test the visibility of the vertices of a mesh from a specific point, I encounter the following exception:

Function "_rtpQueryExecute" caught exception: Encountered a CUDA error: 
cudaMemcpyAsync( dst, src, size, cudaMemcpyHostToDevice, m_streams[streamIdx] ) 
returned (11): invalid argument, [65863812]

I have narrowed this down a bit: The exception occurs with “query->execute(0)” each time the ray buffer set with

query->setRays( nverts, RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION, RTP_BUFFER_TYPE_HOST, rayBuffer );

exceeds 2MB (not GB), i.e. nverts >= 87382. Values below run fine. Any ideas how to fix this?
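For reference, the 87382 threshold follows from the ray size: RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION stores six floats per ray. A minimal sketch of that arithmetic, assuming sizeof(float) is 4 and a C++11 compiler (newer than VS2008); the helper names are hypothetical and not part of OptiX Prime:

```cpp
#include <cstddef>

// Bytes per ray for RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION:
// three floats of origin plus three floats of direction.
constexpr std::size_t kBytesPerRay = 6 * sizeof(float);

// 2 MiB, the buffer size above which the copy was observed to fail.
constexpr std::size_t kLimitBytes = 2u * 1024u * 1024u;

// Largest ray count whose buffer still fits within the limit.
constexpr std::size_t maxRaysWithinLimit() {
    return kLimitBytes / kBytesPerRay;
}

// Size in bytes of a ray buffer holding rayCount rays.
constexpr std::size_t rayBufferBytes(std::size_t rayCount) {
    return rayCount * kBytesPerRay;
}
```

With 24 bytes per ray, 87381 rays occupy 2097144 bytes (just under 2 MiB) and 87382 rays occupy 2097168 bytes (just over).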

I am using Win7 x64, VS2008 and a 32-bit compiled executable. The simplePrimePP example, on the other hand, runs fine with many more rays (32-bit executable, same VS2008, same system).

My complete code is here:

...
using namespace optix::prime;

  //create prime context
  Context context = Context::create(RTP_CONTEXT_TYPE_CUDA);

  //create model
  Model model = context->createModel();
  MyMesh& m = getMesh();
  model->setTriangles(m.NumBoundaryFaces, RTP_BUFFER_TYPE_HOST, m.FaceVertexIndices, m.NumBoundaryVertices, RTP_BUFFER_TYPE_HOST, m.Vertices, 0);
  model->update(RTP_MODEL_HINT_NONE);

  unsigned int nverts = m.NumBoundaryVertices;

  // create ray buffer
  float* rayBuffer = new float[2*3*nverts];
  rtpHostBufferLock(rayBuffer, 2*3*nverts*sizeof(float));

  // create hit buffer
  float* hitBuffer = new float[nverts];
  rtpHostBufferLock(hitBuffer, nverts*sizeof(float));

  try {
    for (...) {

      // do sth unrelated

      // fill ray origins
      float origin[3] = {getPosition().X, getPosition().Y, getPosition().Z};
      for (int i=0; i<nverts; ++i) {
        memcpy(rayBuffer+(2*3*i), origin, 3*sizeof(float));
      }

      // fill ray directions
      F3Map verts = m.verticesAsMap();
      Eigen::Vector3f orig(origin[0], origin[1], origin[2]);
      for (int i=0; i<nverts; ++i) {
        Eigen::Vector3f direction = (verts.col(i)-orig).normalized();
        memcpy(rayBuffer+((2*3*(i))+3), direction.data(), 3*sizeof(float));
      }

      
      //create query
      Query query = model->createQuery(RTP_QUERY_TYPE_CLOSEST);
      query->setRays( nverts, RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION, RTP_BUFFER_TYPE_HOST, rayBuffer );
      query->setHits( nverts, RTP_BUFFER_FORMAT_HIT_T, RTP_BUFFER_TYPE_HOST, hitBuffer);
      
      //execute
      query->execute( 0 ); // FAILS HERE
      
      //analyze hit buffer
      ...
    }
  } catch ( optix::prime::Exception& e ) {
    LogErr(QString("An error occurred with error code %1 and message %2").arg(e.getErrorCode()).arg(QString::fromStdString(e.getErrorString())));
  }  

  rtpHostBufferUnlock(rayBuffer);
  delete[] rayBuffer;
  rtpHostBufferUnlock( hitBuffer );
  delete[] hitBuffer;
  
  ...

I just found out that this happens with RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION, but not with RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX. With the latter I can use plenty of rays (a million at least).
Also, with a CPU context both formats work.
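Switching formats means repacking the ray buffer from six to eight floats per ray. A rough sketch of that conversion, assuming the layout origin (3 floats), tmin, direction (3 floats), tmax; the struct and function names are hypothetical, and the tmin/tmax values are whatever the caller chooses:

```cpp
#include <cstddef>
#include <vector>

// Layout assumed for RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX:
// eight consecutive floats per ray.
struct RayTminTmax {
    float origin[3];
    float tmin;
    float direction[3];
    float tmax;
};
static_assert(sizeof(RayTminTmax) == 8 * sizeof(float), "unexpected padding");

// Expands a packed origin/direction buffer (six floats per ray) into the
// eight-float layout, giving every ray the same tmin and tmax.
std::vector<RayTminTmax> expandRays(const float* originDirection,
                                    std::size_t rayCount,
                                    float tmin, float tmax) {
    std::vector<RayTminTmax> rays(rayCount);
    for (std::size_t i = 0; i < rayCount; ++i) {
        const float* src = originDirection + 6 * i;
        for (int k = 0; k < 3; ++k) {
            rays[i].origin[k] = src[k];
            rays[i].direction[k] = src[3 + k];
        }
        rays[i].tmin = tmin;
        rays[i].tmax = tmax;
    }
    return rays;
}
```

The resulting vector's data() pointer could then be passed to setRays in place of rayBuffer, with the format constant changed accordingly.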

Seems like a bug in OptiX Prime to me?!

To help isolate the issue, would you be able to cross-check this with a 64-bit version of the application?

I would strongly recommend not building 32-bit ray tracing applications at all.
You won’t be able to use all available resources on today’s systems with a 32-bit application.

Please always list the following system details when reporting issues to reduce turnaround times:
OS version, OS bitness, installed GPU(s), display driver version, OptiX version, CUDA version.

You provided OS version and bitness.

Unfortunately I cannot compile for 64-bit at the moment due to complicated library dependencies. As already stated, I have now switched to the other buffer format and it works; other than that I have changed nothing. Are there any specific resources that are unavailable to a 32-bit build other than more memory? (My app is computation-bound and memory has not been a problem so far.)

Also, I have read in the NVIDIA publication “advanced optix” about +/-300 Mrays/s/GPU. Allocating the ray and hit buffers with cudaMalloc (and copying them to the device with cudaMemcpy) and then executing the same query 10000 times in a loop (which really only contains the “query->execute(0)”), I get rates of about 52-74 Mrays/s on a non-overclocked GeForce GTX Titan Black. I used between 18000 and a million rays per query. Since a Titan Black is not that slow (yet), I wonder whether that is in the expected range?
(Don’t get me wrong, it’s still very fast - I just need to know if I am doing everything right.)
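The Mrays/s figures above come from timing the query loop. A sketch of that measurement, where runQuery is a hypothetical callable standing in for “query->execute(0)” and the function names are made up for illustration; it assumes each call blocks until the query has finished:

```cpp
#include <chrono>
#include <cstddef>

// Throughput in millions of rays per second.
double mraysPerSecond(std::size_t raysPerQuery, std::size_t queryCount,
                      double seconds) {
    return static_cast<double>(raysPerQuery) *
           static_cast<double>(queryCount) / seconds / 1.0e6;
}

// Calls runQuery queryCount times and returns the measured throughput.
// runQuery is assumed to block until its query is complete.
template <typename RunQuery>
double measureMrays(RunQuery runQuery, std::size_t raysPerQuery,
                    std::size_t queryCount) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < queryCount; ++i) {
        runQuery();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;
    return mraysPerSecond(raysPerQuery, queryCount, elapsed.count());
}
```

For example, 10000 queries of a million rays each that take 100 seconds in total work out to 100 Mrays/s.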

Or are the 300 Mrays/s meant for OptiX excluding Prime, using geometry nodes with simpler hit tests than triangles?

(Win7 x64, GeForce GTX Titan Black, driver 335.23, OptiX 3.6.3, CUDA 5.0)

Well, looking more closely at this post
https://devtalk.nvidia.com/default/topic/794982/optix/announcing-optix-3-7-beta/
* Don’t ship 32-bit OptiX Prime libraries, which was never a supported configuration.

This means you must switch to 64-bit to use OptiX Prime. No way around it.

Absolute performance numbers don’t make much sense without knowing the scene complexity, but here you go: 50 MRays/sec is something I normally reach with OptiX when running my global illumination path tracer (lots of divergence).
I’ve seen 600 to 900 MRays/sec for ambient occlusion rays with a very simple scene on a Quadro K6000, which is comparable to your Titan but has twice the memory and some other workstation features.
So I would consider 52-74 MRays/sec to be in the low range for OptiX Prime without knowing the scene.

Okay, I see. At the moment I am testing whether OptiX Prime is faster than what I have, and I will have to switch to 64-bit if it is. However, it does not look like it is at the moment.

My tested models are all spheres of radius 100 with 2^k equally distributed vertices centered at the origin. I am using one ray per vertex towards a point P=(0,0,200) with tmin=1e-3 and tmax equalling the distance of the vertex to the point P (direction vector normalized) to check visibility between vertices and P. 74 Mrays/s are reached for 2^20 points.
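The per-vertex ray setup described above can be sketched as follows. Vec3, VisibilityRay and makeVisibilityRay are hypothetical names, and the sketch assumes the vertex never coincides with P (the distance would be zero and the direction undefined):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// One ray in origin / tmin / direction / tmax order.
struct VisibilityRay {
    Vec3 origin;
    float tmin;
    Vec3 direction;
    float tmax;
};

// Builds a ray from a vertex towards the target point. The direction is
// normalized and tmax equals the vertex-to-target distance, so only
// geometry between the two points can produce a hit.
VisibilityRay makeVisibilityRay(const Vec3& vertex, const Vec3& target,
                                float tmin = 1e-3f) {
    const Vec3 d = {target.x - vertex.x, target.y - vertex.y,
                    target.z - vertex.z};
    const float dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    const Vec3 dir = {d.x / dist, d.y / dist, d.z / dist};

    VisibilityRay ray;
    ray.origin = vertex;
    ray.tmin = tmin;
    ray.direction = dir;
    ray.tmax = dist;
    return ray;
}
```

For the sphere's top vertex (0,0,100) and P=(0,0,200) this gives direction (0,0,1) and tmax=100.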

I thought that this lower performance might be related to all rays starting deep within the BVH, without any complete bounding volume misses, but using rays completely outside the model is not faster.

Also, I have noticed that CPU usage seems to be 100% (for my application) during the query loop. I think that shows something is wrong, since no buffers should be transferred. GPU-Z shows 100% GPU usage, 31% memory controller load (GPU) and 1.8 / 6 GB memory in use.

Wrapping the simplePrimePP example query in a loop also yields 100% CPU load.

Any ideas?

Can’t help without exact source code.

  • You’re using RTP_CONTEXT_TYPE_CUDA and RTP_BUFFER_TYPE_CUDA_LINEAR?
  • simplePrimepp is using RTP_QUERY_TYPE_CLOSEST. For pure visibility testing you should use RTP_QUERY_TYPE_ANY.
    It also makes sense to read all other OptiX Prime related posts on this forum. There are some touching visibility tests. Mind, the triangle intersection routine is not watertight for performance reasons.
  • simplePrimepp is also not showing anything performance related, queries are not asynchronous with query->execute(0). For performance you should look at the more advanced examples using multiple buffers (search for RTP_QUERY_HINT_ASYNC). Make sure you use a query size bigger than 65536 to get any asynchronicity.
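For the visibility test itself, only hit-or-miss matters, whichever query type is used. A small sketch of reading back an RTP_BUFFER_FORMAT_HIT_T buffer, assuming a miss is reported as a negative t value; countUnoccluded is a hypothetical helper, not an OptiX Prime function:

```cpp
#include <cstddef>

// Counts the rays that hit nothing, i.e. the vertices that are visible
// from the target point. Assumption: a negative t encodes a miss.
std::size_t countUnoccluded(const float* hitT, std::size_t rayCount) {
    std::size_t visible = 0;
    for (std::size_t i = 0; i < rayCount; ++i) {
        if (hitT[i] < 0.0f) {
            ++visible;
        }
    }
    return visible;
}
```

Because the triangle intersection is not watertight, a ray that slips through a shared edge would be counted as visible here.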

-> I am using RTP_CONTEXT_TYPE_CUDA and RTP_BUFFER_TYPE_CUDA_LINEAR.

-> I have tested switching from CLOSEST to ANY - no speedup.

-> I have read through some parts of the forum and have not found any clues.

-> Not being watertight is indeed a slight problem. Is it planned to “fix” this some time, e.g. by providing a slower, optional watertight mode?

-> As for the asynchronous processing: since I do not download or upload any buffers during this test, using RTP_QUERY_HINT_ASYNC could only speed up command passing from app to driver to GPU, right? I have tested this and it shows no change in Mrays/s. Strangely though, when using the async hint, Windows (GUI interaction with all programs) completely freezes during the test.