Ray cache

Hello,

we are currently implementing a ray cache.

We decided to use a hash table on the device side.

Since we cannot call new on the device, we use a buffer of pre-allocated elements and fill them with
data. During tracing, data with the same hash code is reused.
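To make the idea concrete, here is a minimal host-side C++ sketch of such a pre-allocated, hash-indexed pool. It is not the code from this thread: `Slot`, `SlotCache`, `store` and `find` are hypothetical names, and comparing the stored hash on lookup is an assumption added for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache element; the field names are illustrative only.
struct Slot {
    std::uint32_t hash = 0;
    float value = 0.0f;
    bool used = false;
};

// Fixed pool of slots allocated once up front; nothing is allocated later.
class SlotCache {
public:
    explicit SlotCache(std::size_t bins) : slots_(bins) {}

    // Overwrite the slot that the hash maps to.
    void store(std::uint32_t hash, float value) {
        Slot& s = slots_[hash % slots_.size()];
        s.hash = hash;
        s.value = value;
        s.used = true;
    }

    // Return the slot on a hit, nullptr on a miss.
    // The stored hash is compared so two keys sharing a bin are not confused.
    const Slot* find(std::uint32_t hash) const {
        const Slot& s = slots_[hash % slots_.size()];
        return (s.used && s.hash == hash) ? &s : nullptr;
    }

private:
    std::vector<Slot> slots_;
};
```

On the device the pool would be an OptiX buffer rather than a `std::vector`, but the indexing scheme is the same.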

This helped us to increase our efficiency.

We tested the code on an unchanging scene for a sufficient number of cycles, and it worked properly without failing.

However, when we test the code in a changing environment (real-time tracing), after a sufficiently
large number of rays we get multiple exceptions.

[2013-12-13 12:48:18.580046] [0x91b52700] [error] An exception occured while tracing: Launching context 1D: OptiX result code: -1, Unknown error (Details: Function “RTresult _rtContextLaunch1D(RTcontext_api*, unsigned int, RTsize)” caught exception: Encountered a CUDA error: driver().cuMemcpyHtoD(dst + dstOffset, static_cast<const char*>(src) + srcOffset, bytes) returned (700): Launch failed, [6750287])

The exceptions happen when we try to get performance data from the device:

context->getFreeDeviceMemoryInBytes(device);

However, if we disable read operations on our cache buffer, the exceptions do not occur.
Moreover, the exceptions occur on the return statement, so they are thrown neither during
index generation nor when accessing the buffer. Only the return statement makes the code break.

I would like to know what could possibly cause the error?

kind regards,

Alexander

Have you tried enabling all of OptiX’s exceptions? See the exceptions or device_exceptions sample in the SDK for an example of how to implement this. One of the enabled exceptions checks all buffer accesses.

rtContextSetExceptionEnabled( context, RT_EXCEPTION_ALL, 1 )

You should also make sure you have an exception program as well.

Hello,

thank you for your answer.

We already have all exceptions enabled.

The exception I have posted previously is the result we get with enabled exceptions.

Could you give an explanation why this can happen?

I have a feeling that at some point the buffer is destroyed.

I would also like to know how to handle such situations.

Let me know if you need any additional info.

kind regards,

Alexander

JBigler meant that you should

  1. Set up an exception program (an RT_PROGRAM in the device code)
  2. Register it with your context
  3. Enable print capabilities with rtContextSetPrintEnabled (this allows you to use rtPrintf on the device)
  4. Enable all the exceptions with rtContextSetExceptionEnabled
  5. Use rtPrintExceptionDetails to get additional info on what happened on the device while your code was executing. Notice that this last step should be done in your exception program (the one that you set up at point 1).

Hello marknv,

We already have all these set up except the last item.

I added rtPrintExceptionDetails to the exception program but it did not make a big difference.

So the output you asked for is already in the first post.

Could you give an explanation why this happens on the device?

regards,

Alexander

Just to know… are you updating your hash table somehow? It might also be a synchronization problem. Is the hash table using shared/const memory somehow?

  1. We start with the buffer creation

nodeBuffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT, RT_FORMAT_USER, numOfEmitters * configuration.cacheBufferSize);
nodeBuffer->setElementSize(sizeof(CacheNode));
context["cache_node_buffer"]->set(nodeBuffer);

CacheNode is our cache element; it stores the data, the hash value and some other fields, for example for benchmarking.

numOfEmitters is the number of active “cameras” which shoot rays on a scene.

int numOfBins = configuration.cacheBufferSize/configuration.cacheLoadFactor;
context["num_of_bins"]->setInt(numOfBins);

numOfBins is the number of bins in the hash table. It is calculated from the size of the hash buffer
and the load factor, which in our case is 1.
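As a worked example of this arithmetic, the sketch below computes the bin count and a per-emitter bucket index. The region layout (`baseIndex`) and the modulo in `bucketIndex` are assumptions, since get_bucket_index is not shown in the thread; all function names here are hypothetical.

```cpp
#include <cstdint>

// Number of bins per emitter, as in the post: buffer size divided by load factor.
inline int numBins(int cacheBufferSize, float loadFactor) {
    return static_cast<int>(cacheBufferSize / loadFactor);
}

// Assumed layout: emitter e owns slots [e * size, (e + 1) * size).
inline int baseIndex(int emitter, int cacheBufferSize) {
    return emitter * cacheBufferSize;
}

// Assumed stand-in for get_bucket_index: hash modulo the number of bins.
inline int bucketIndex(std::uint32_t hash, int bins) {
    return static_cast<int>(hash % static_cast<std::uint32_t>(bins));
}

// True if every index an emitter can produce stays inside its own region.
inline bool staysInRegion(int cacheBufferSize, float loadFactor) {
    return numBins(cacheBufferSize, loadFactor) <= cacheBufferSize;
}
```

Under these assumptions a load factor of 1 keeps every index inside the emitter's region, while a load factor below 1 would yield more bins than slots and could index past the region, so it may be worth checking if the factor is ever changed.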

After that we initialize our cache elements, setting the initial state of the variables
and so on.

  2. After we have done all the initialization procedures we launch the tracing.

rtContextLaunch1D

  3. We have generated a certain number of ray directions to be shot into the scene.
    Our ray generation program takes the directions from the buffer and shoots them into the scene.
    The tracing runs in a loop until a certain stopping criterion is met. However, in the tracing
    loop there are rays with almost the same directions; with some reduction of precision
    we could reuse the results of the previous computations for them.
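One way to picture the "reduction of precision" is to snap each direction component to a coarse grid before hashing, so that nearly identical directions produce the same key. This is only a sketch of that general idea, not the makeKey function from this thread; `Vec3`, `quantize`, `directionKey` and the grid resolution `cells` are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Snap one component of a unit vector in [-1, 1] to an integer cell in [0, cells].
inline std::uint32_t quantize(float c, std::uint32_t cells) {
    float t = (c + 1.0f) * 0.5f;                  // map [-1, 1] to [0, 1]
    t = std::fmin(std::fmax(t, 0.0f), 1.0f);      // clamp against rounding noise
    return static_cast<std::uint32_t>(std::lround(t * static_cast<float>(cells)));
}

// Combine the three quantized components into one key.
inline std::uint32_t directionKey(const Vec3& d, std::uint32_t cells) {
    const std::uint32_t n = cells + 1;            // distinct values per axis
    return (quantize(d.x, cells) * n + quantize(d.y, cells)) * n + quantize(d.z, cells);
}
```

A coarser grid gives more cache hits at the cost of reusing results for directions that differ more.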

So for every direction we check whether the cache already contains this direction.

cachedNodeRead = getFromCache(base_index, prev_data, benchmark, pos_hash);

If the result of the query is NULL, we do the ray tracing and write the result to the cache:

if(cachedNodeRead == NULL)
{
    rtTrace(top_object, nextRay, data);
    writeToCache(base_index, prev_data, data, cachedNodeWrite, traceTime);
}
else
{
    data = cachedNodeRead->data;
}

  4. Here are the write and read operations which are performed on the cache buffer

inline __device__ void writeToCache(int base_index, PerRayData prev_data, PerRayData data,
CacheNode* &cachedNodeWrite, float trace_time, bool benchmark, int pos_hash)
{

if(cache_init)
{       
    //first, we check the buffer size
    int buf_s = node_buffer.size();
    //create a key
    Key key;
    key = makeKey(key, prev_data, false); 
    //get the bucket index
    int bucket_ind = base_index + get_bucket_index(key.hash);      
    //get an element from the buffer according to the bucket index
    CacheNode* node = &node_buffer[bucket_ind];
    
  
    node->hash = key.hash;
    node->data = data;      
    node->used = true;      
    node->pos_hash = pos_hash;           
    node->timestamp = time;        
    atomicAdd(&(node->counter), 1);
        
  
    //here we try to link all the elements in one trace
    if(cachedNodeWrite != NULL)
    {
        atomicCAS(&( cachedNodeWrite->queue), -1, node->index);
        atomicCAS(&( node->parent ), -1, cachedNodeWrite->index);
       
                 
    }   
     cachedNodeWrite = node;
 }

}

inline __device__ CacheNode* getFromCache(int base_index, PerRayData data, bool benchmark, int pos_hash)
{
    if(cache_init)
    {
        //create a key
        Key key;
        key = makeKey(key, data, false);
        //get the bucket index
        int bucket_ind = base_index + get_bucket_index(key.hash);
        //take the node from the buffer according to the index
        CacheNode* node = &node_buffer[bucket_ind];

        //if the element is not used we return NULL
        if(!node->used)
            return NULL;
        //if the position hash of the caller does not match the cached
        //position hash, the cache entry is purged
        if(root->pos_hash != pos_hash)
        {
            invalidateRay(root);
            return NULL;
        }

        //HERE THE EXCEPTION HAPPENS
        return node;
    }
    //cache not initialized
    return NULL;
}

The exception happens on the return statement of the get call or, to be more precise, on the assignment statement when we assign the element to the variable in the calling function.

regards,

Alexander

node->hash = key.hash;
node->data = data; 
node->used = true; 
node->pos_hash = pos_hash; 
node->timestamp = time;

These lines look suspicious to me. The CUDA execution model doesn’t guarantee that simultaneous writes from threads in the same block to the same global memory location will succeed. Could work… could.

Also: are you using atomics in a multi-GPU environment? If so, each buffer would need to have the RT_BUFFER_GPU_LOCAL flag (each device has its own copy of the buffer).

Could you give advice on how to synchronize the code?

Alexander

Are you running your code in a multi-GPU environment? If not, it is safe to use atomics to control your write accesses.
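A minimal sketch of what "controlling write accesses with atomics" can look like is given below as host-side C++, with `std::atomic` standing in for the CUDA atomics. `Entry`, `tryLock`, `unlock` and `tryWrite` are hypothetical names. The lock is attempted once rather than spun on, since spinning inside a warp can hang; on the device, note that CUDA atomic functions are not memory fences by themselves, so a `__threadfence()` before releasing the lock would play the role of the release ordering used here.

```cpp
#include <atomic>

// Hypothetical cache entry guarded by a single lock word (0 = free, 1 = held).
struct Entry {
    std::atomic<int> lock{0};
    int payload = 0;
};

// Try once to take the lock; analogous to atomicCAS(&lock, 0, 1) == 0.
inline bool tryLock(Entry& e) {
    int expected = 0;
    return e.lock.compare_exchange_strong(expected, 1, std::memory_order_acquire);
}

// Release the lock; analogous to atomicExch(&lock, 0).
// Release ordering publishes the payload write before the lock reads as free.
inline void unlock(Entry& e) {
    e.lock.store(0, std::memory_order_release);
}

// Write only if the lock was free; a thread that loses the race skips the update.
inline bool tryWrite(Entry& e, int value) {
    if (!tryLock(e)) return false;
    e.payload = value;
    unlock(e);
    return true;
}
```

Skipping the update on contention is acceptable for a cache, because a missed write only costs a later re-trace.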

No, we are not running the code in a multi-GPU environment. I will try to use atomics to write a thread safe block and report to you about the result.

Alexander

Hello,

It looks like the synchronization does not fix the error.
Consider the following code

inline __device__ void writeToCache(int base_index, PerRayData prev_data, PerRayData data,
CacheNode* &cachedNodeWrite, float trace_time, bool benchmark, int pos_hash)
{
if(cache_init)
{
//create a key
Key key;
key = makeKey(key, prev_data, false);

    //bucket index (composite node)
    int iind = get_bucket_index(key.hash);
    int bucket_ind = base_index + iind;
        
    CacheNode* node = &node_buffer[bucket_ind];
   
    clock_t time = clock();
           
    if(atomicCAS(&(node->writeLock), 0, 1) == 0)
    {
        node->hash = key.hash;
        node->data = data;
        node->used = true;
        node->traceTime = trace_time; 
        node->hashGen = key.hash_gen;
        node->pos_hash = pos_hash;
        node->timestamp = time;
        atomicExch(&(node->writeLock), 0);
                                 
    }
 }

}

inline __device__ bool getFromCache(int base_index, PerRayData prev_data, PerRayData &data, bool benchmark, int pos_hash)
{
//if the cache is initialized
if(cache_init)
{
//create a key
Key key;
key = makeKey(key, prev_data, false);

    //first take the counter value
    int iind = get_bucket_index(key.hash); ;
    int bucket_ind = base_index + iind;     
  
    //get node
    CacheNode* node = &node_buffer[bucket_ind];
   
   
    if(atomicCAS(&(node->writeLock), 0, 1) == 0)
    {       
        if(node->pos_hash != pos_hash)
        {
            atomicExch(&(node->writeLock), 0);
            return false;
        }              
      
        atomicAdd(&(node->hit), 1);
       
        data = node->data;
        atomicExch(&(node->writeLock), 0);
        return true;              
          
     }
}   
return false;

}

The read and write operations are mutually exclusive.

In this case we also get multiple exceptions.

[2013-12-27 11:26:35.018231] [0x7932d700] [error] An exception occured while tracing: Launching context 1D: OptiX result code: -1, Unknown error (Details: Function “RTresult _rtContextLaunch1D(RTcontext_api*, unsigned int, RTsize)” caught exception: Encountered a CUDA error: Kernel launch returned (700): Launch failed, [6619200])

[2013-12-27 11:26:35.020383] [0x7932d700] [error] An exception occured while tracing: Launching context 1D: OptiX result code: -1, Unknown error (Details: Function “RTresult _rtContextLaunch1D(RTcontext_api*, unsigned int, RTsize)” caught exception: Encountered a CUDA error: driver().cuMemcpyHtoD(dst + dstOffset, static_cast<const char*>(src) + srcOffset, bytes) returned (700): Launch failed, [6750287])

merry Christmas,

Alexander

Also, before these exceptions we get RT_EXCEPTION_INTERNAL_ERROR at launch index (43129).
Why could this happen?

Alexander

… I think that we found some acceptable work-around (solution) for the problem.

Firstly, we use two locks, one for the write access and the second one for the read access.

Secondly, we control the number of threads which read a cache entry.

We found that if more than 5 threads in succession read from the cache entry, the exception
occurs.

Consider the following code:

inline __device__ void writeToCache(int base_index, PerRayData prev_data, PerRayData data,
CacheNode* &cachedNodeWrite, float trace_time, bool benchmark, int pos_hash)
{
if(cache_init)
{
//rtPrintf("planes: %d, %d, %d \n", planex, planey, planez);
//rtPrintf("chosen: %d \n", chosen);
//create a key
Key key;
key = makeKey(key, prev_data, false);

    //bucket index (composite node)
    int iind = get_bucket_index(key.hash);
    int bucket_ind = base_index + iind;
    
    CacheNode *node = &node_buffer[bucket_ind];

    clock_t time = clock();
            
    if(atomicCAS(&(node->writeLock), 0, 1) == 0)
    {
        node->hash = key.hash;
        node->nextOrigin = prev_data.nextOrigin;
        node->nextDirection = prev_data.nextDirection;
        node->data = data;
        node->used = true;
        node->traceTime = trace_time;  
        node->hashGen = key.hash_gen;
        node->pos_hash = pos_hash;
        node->timestamp = time;
        //we unlock the read lock here
        atomicExch(&(node->readLock), 0);
    }
 }

}

inline __device__ bool getFromCache(int base_index, PerRayData prev_data,
PerRayData &data, bool benchmark, int pos_hash, bool check)
{
//if the cache is initialized
if(cache_init)
{
//create a key
Key key;
key = makeKey(key, prev_data, false);

    //first take the counter value
    int iind = get_bucket_index(key.hash); ;
    int bucket_ind = base_index + iind;      
   
    //get node
    CacheNode* node = &node_buffer[bucket_ind];         
    
      
    if(atomicCAS(&(node->readLock), 0, 1) == 0)
    {
        if(node->pos_hash != pos_hash)
        {                 
            atomicExch(&(node->readN), 0);
            atomicExch(&(node->writeLock), 0);            
            return false;                
        }

        //a thread increases the counter of reading threads
        int i = atomicInc(&(node->readN),5);
        node->hit++; 
        data = node->data;
        //next we check if the number of threads is less than 4
        //then we release the read lock
        //else release the write lock
        if(i < 4)
            atomicExch(&(node->readLock), 0);        
        else
        {
            atomicExch(&(node->readN), 0);
            atomicExch(&(node->writeLock), 0);
        }
        return true;
        
    }
   
}
return false;

}
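For readers unfamiliar with the wrapping behaviour of atomicInc used above, here is a host-side C++ emulation of its documented semantics: it stores `(old >= limit) ? 0 : old + 1` and returns the old value. The function name `atomicIncWrap` is hypothetical, and `std::atomic` stands in for the device atomic.

```cpp
#include <atomic>

// Host-side emulation of CUDA's atomicInc(addr, limit):
// stores (old >= limit) ? 0 : old + 1 and returns old.
inline unsigned atomicIncWrap(std::atomic<unsigned>& counter, unsigned limit) {
    unsigned old = counter.load();
    unsigned next;
    do {
        next = (old >= limit) ? 0u : old + 1u;
        // On failure, compare_exchange_weak reloads old and the loop retries.
    } while (!counter.compare_exchange_weak(old, next));
    return old;
}
```

In the code above, the returned values 0 to 3 take the `i < 4` branch and reopen the read lock, while the fifth consecutive reader (returned value 4) resets readN and releases the write lock instead.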

Nevertheless, your feedback could be very important for us.

kind regards,

Alexander