Global memory coherence in compute capability 2.0: does __threadfence() really do what's on the tin?

Hi,

My question in a nutshell: if a block calls `__threadfence()` after writing to an address in global memory, do the other blocks really see that write after the call to `__threadfence()`?

I’m working on a parallel code in which several blocks update, i.e. read and write, some global data. Access to the different regions of the global data is controlled by mutexes implemented with atomic instructions, e.g. as described in Appendix A.1.1 of the “CUDA by Example” book.

So far I had been testing my code on a C1060 with compute capability 1.3, and everything worked fine. However, I recently switched to a GTX 480 with compute capability 2.0, and things started going wrong. Judging by my results, whenever data is handed over from one block to another, not all the writes make it before the read.

In more specific terms, the following happens:

1. Block 1: acquires the mutex protecting a region of the global data.
2. Block 1: reads, processes and writes data to that region of global memory.
3. Block 1: calls `__threadfence()`.
4. Block 1: releases the mutex protecting that region of global data.
5. Block 2: acquires the mutex protecting that same region of global data.
6. Block 2: reads, processes and writes data to that region of global memory.

What goes wrong is that the data read by Block 2 in step 6 should be exactly the data written by Block 1 in step 2, but it is not. I’ve always assumed that this was somehow guaranteed by the `__threadfence()`, but this does not seem to be the case for compute capability 2.0.

I’ve reduced my code to a sample program which reproduces the problem:

/* Headers. */
#include <stdlib.h>
#include <stdio.h>

/* Some constants. */
#define N 100
#define nr_jobs 10000

/* Global variables. */
__device__ int mutex = 0;
__device__ int data_cuda[N];
__device__ int taboo_list[N];
volatile __device__ int job_next = 0;

/**
 * @brief Lock a device mutex.
 *
 * @param m The mutex.
 *
 * Loops until the mutex can be set. Note that only one thread
 * can do this at a time, so to synchronize blocks, only a single thread of
 * each block should call it.
 */
__device__ void cuda_mutex_lock ( int *m ) {
    while ( atomicCAS( m , 0 , 1 ) != 0 );
    }

/**
 * @brief Unlock a device mutex.
 *
 * @param m The mutex.
 *
 * Does not check if the mutex had been locked.
 */
__device__ void cuda_mutex_unlock ( int *m ) {
    atomicExch( m , 0 );
    }

/**
 * @brief Kernel function.
 *
 * @param jobs Pointer to a list of jobs, i.e. the index of their data.
 */
__global__ void kernel ( int *jobs ) {

    int i, threadID;
    volatile __shared__ int jobid, tdata, sum;

    /* Get the threadID. */
    threadID = threadIdx.x;

    /* Main loop. */
    while ( job_next < nr_jobs ) {

        /* Let the first thread grab a job. */
        if ( threadID == 0 ) {

            /* Get the mutex. */
            cuda_mutex_lock( &mutex );

            /* Loop over the remaining jobs. */
            for ( jobid = -1 , i = job_next ; i < nr_jobs ; i++ ) {

                /* Is the data of the ith job free? */
                if ( atomicCAS( &taboo_list[ jobs[i] ] , 0 , 0 ) == 0 ) {

                    /* Move the job to the front of the list. */
                    jobid = jobs[i];
                    jobs[i] = jobs[job_next];
                    jobs[job_next] = jobid;

                    /* Mark the job as taken. */
                    // taboo_list[jobid] += 1;
                    atomicAdd( &taboo_list[jobid] , 1 );

                    /* Move to the next job. */
                    job_next += 1;

                    /* Spread the word... */
                    __threadfence();

                    /* Break. */
                    break;

                    } /* is ith job free? */

                } /* loop over jobs. */

            /* Release the mutex. */
            cuda_mutex_unlock( &mutex );

            } /* first thread grabs a job. */

        /* All threads do the job together. */
        if ( jobid >= 0 ) {

            /* Load the data into tdata. */
            if ( threadID == 0 )
                tdata = data_cuda[jobid];

            /* Do something to waste a bit of time. */
            for ( sum = 0 , i = 0 ; i < 10000 ; i++ )
                if ( i % 32 == threadID )
                    sum += threadID;

            /* Write the data+1 back. */
            if ( threadID == 0 )
                data_cuda[jobid] = tdata + 1;
            __threadfence();

            /* Let the first thread unlock the job data. */
            if ( threadID == 0 )
                // taboo_list[jobid] -= 1;
                atomicSub( &taboo_list[jobid] , 1 );

            } /* all threads do the job. */

        } /* main loop. */

    }

/**
 * @brief The main driver function.
 */
int main ( int argc , char *argv[] ) {

    int i, sum;
    int *jobs, *jobs_cuda, data[N];
    void *data_cuda, *taboo_list;
    dim3 nr_blocks( 1 , 1 );
    dim3 nr_threads( 32 , 1 );

    /* Get the number of blocks from the command line. */
    if ( argc > 1 )
        nr_blocks.x = atoi( argv[1] );

    /* Allocate and fill the local job list. */
    jobs = (int *)malloc( sizeof(int) * nr_jobs );
    for ( i = 0 ; i < nr_jobs ; i++ )
        jobs[i] = rand() % N;

    /* Set the shared device data to zero. */
    cudaGetSymbolAddress( &data_cuda , "data_cuda" );
    cudaGetSymbolAddress( &taboo_list , "taboo_list" );
    if ( cudaMemset( data_cuda , 0 , sizeof(int) * N ) != cudaSuccess )
        printf( "main: cudaMemset on line %i failed with %s.\n" , __LINE__ , cudaGetErrorString(cudaGetLastError()) );
    if ( cudaMemset( taboo_list , 0 , sizeof(int) * N ) != cudaSuccess )
        printf( "main: cudaMemset on line %i failed with %s.\n" , __LINE__ , cudaGetErrorString(cudaGetLastError()) );

    /* Allocate and copy to the CUDA job list. */
    if ( cudaMalloc( &jobs_cuda , sizeof(int) * nr_jobs ) != cudaSuccess )
        printf( "main: cudaMalloc on line %i failed with %s.\n" , __LINE__ , cudaGetErrorString(cudaGetLastError()) );
    if ( cudaMemcpy( jobs_cuda , jobs , sizeof(int) * nr_jobs , cudaMemcpyHostToDevice ) != cudaSuccess )
        printf( "main: cudaMemcpy on line %i failed with %s.\n" , __LINE__ , cudaGetErrorString(cudaGetLastError()) );

    /* Call the kernel. */
    kernel<<< nr_blocks , nr_threads >>>( jobs_cuda );

    /* Copy back the data. */
    if ( cudaMemcpy( data , data_cuda , sizeof(int) * N , cudaMemcpyDeviceToHost ) != cudaSuccess )
        printf( "main: cudaMemcpy on line %i failed with %s.\n" , __LINE__ , cudaGetErrorString(cudaGetLastError()) );

    /* Collect and print the sum. */
    for ( sum = 0 , i = 0 ; i < N ; i++ )
        sum += data[i];
    printf( "main: nr_blocks=%i, sum=%i.\n" , nr_blocks.x , sum );

    /* Goodbye. */
    free( jobs );
    cudaFree( jobs_cuda );
    return 0;

    }

The kernel basically manages a list of jobs which is protected by a mutex. Each job does something to an entry in the global array `data_cuda`. The status of each entry in the global array is tracked in the global `taboo_list`, whose entries can only be set while the mutex is held. An entry is freed whenever its job (just adding one to the data) is done.

If I run the program with a single block, all goes well:

[pedro@e-sci139 cuda_mutex]$ ./a.out 1

main: nr_blocks=1, sum=10000.

i.e. I get the correct number of increments to the global data back. However, if I use a larger number of blocks:

[pedro@e-sci139 cuda_mutex]$ ./a.out 64

main: nr_blocks=64, sum=8590.

some of the updates to the global data get lost.

The obvious fix for my small example is to declare the global data as `volatile`, e.g.

volatile __device__ int data_cuda[N];

This fix, however, is not possible in my “real” code, since the global data there is an array of `struct`s, and I get all kinds of compiler errors if I declare it as `volatile`.
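One possible middle ground, sketched here with a hypothetical `struct item` type standing in for the real data, would be to leave the array declaration alone and cast to a `volatile`-qualified pointer only at the points of access, so that those particular loads and stores cannot be satisfied from stale cached values:

```cuda
/* Hypothetical stand-in for the "array of structs" global data. */
struct item {
    int value;
    };

__device__ struct item items[100];

/* Read a field through a volatile-qualified pointer, forcing an
   actual load instead of a value the compiler may have cached. */
__device__ int item_get ( int i ) {
    return ( (volatile struct item *)items )[i].value;
    }

/* Write a field the same way, forcing the store to be emitted
   before any subsequent __threadfence(). */
__device__ void item_set ( int i , int v ) {
    ( (volatile struct item *)items )[i].value = v;
    }
```

This keeps the `volatile` confined to the few accesses that cross block boundaries, instead of qualifying the whole array.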

So is there something I should know about `__threadfence()` in compute capability 2.0? Or is there a way to explicitly flush the L1 cache, or parts thereof? Or am I completely misunderstanding something here?

Many thanks,

Pedro

I may have just solved my own problem…

Well, actually, I hope this isn’t the solution because it’s a bit inelegant.

According to this thread, the caching behavior can be set at compile time. Using the (undocumented) options

-Xptxas -dlcm=cv

i.e. “cache as volatile”, seems to do what I want it to do.
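For the record, the full compile line then looks something like the following (the file name `mutex_test.cu` is just a placeholder, and since the option is undocumented it may well change between toolkit versions):

```shell
# Compile for compute capability 2.0 with global loads cached "as volatile",
# i.e. never satisfied from a possibly stale L1 line.
nvcc -arch=sm_20 -Xptxas -dlcm=cv mutex_test.cu -o mutex_test

# Re-run the previously failing case with 64 blocks.
./mutex_test 64
```

The obvious downside is that this disables L1 caching of global loads for the whole program, not just for the shared job data.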

Cheers, Pedro