Radon transform on the GPU

Hi,
I’m a beginner in CUDA programming and I’m trying to implement the Radon transform on the GPU. It’s the sum of the intensities of the image along an axis defined by an angle theta from the center of the image. I calculate these indices on the CPU, then I use them in the GPU. The problem is that I’m using a conditional on the value of threadIdx.x in a loop, and I suspect that this is the reason for the launch failure:


cudaSafeCall() Runtime API error in file <radonKernel.cu>, line 231 : unspecified launch failure.


the code:
/*****************************************************************************************
// kernel
__global__ void radon_vrs3( float* img_In, float* radonResults, int* index, int N, int M )
{
// Declare rows of the matrix to be in shared memory for speed
__shared__ float sum[256];

    // Calculate which element this thread reads from memory
    //int index =  M * blockIdx.x + threadIdx.x; for radon_vrs2
    int idx =  M * blockIdx.x + threadIdx.x;
    for (int ii = 0; ii < N*M; ii++)
    {
            if ( idx == index[ii] )   // index is the vector of the correct indices to read according to the value of theta
                    sum[threadIdx.x] = img_In[idx];
    }
    __syncthreads();
    int nTotalThreads = blockDim.x; // Total number of active threads
    while (nTotalThreads > 1)
    {
            int halfPoint = (nTotalThreads >> 1); // divide by two
            // Only the first half of threads will be active.
            if (threadIdx.x < halfPoint)
                    sum[threadIdx.x] += sum[threadIdx.x + halfPoint];
            __syncthreads();

            nTotalThreads = (nTotalThreads >> 1); // divide by two.
    }
    // At this point, thread 0 of each block holds the sum of a row
    // It's time for thread 0 to write its final result.
    if (threadIdx.x == 0)
            radonResults[blockIdx.x] = sum[0];

}
/*********************************************************************************************************

Can anyone help?

Nothing immediately in your device code jumps out as a cause for an ULF, but you might also want to take a look at your host code. If there’s a problem with your allocation of the arrays you pass to your kernel, you can get an ULF. If you call your kernel with impossible grid and block dimensions, you get an ULF as well. I think (though I can’t recall for sure), hitting the watchdog timer can trigger an ULF. (Hopefully someday we’ll get more specific error messages…)

__syncthreads() in a loop looks very suspicious.

Thank you for the tips, but I checked almost everything you said (except “hitting the watchdog timer can trigger an ULF”; I really don’t know what that is). I also made some changes:

The first condition in the device code becomes:

/*************************************************************************

#define threads_per_block 512

#define blocks_per_grid_row 512

// kernel

__global__ void radon_vrs3( float* img_In, float* radonResults, int* index, int N, int M )

{

    // Declare rows of the matrix to be in shared memory for speed

    __shared__ float sum[threads_per_block];

// Calculate which element this thread reads from memory

    //int index =  M * blockIdx.x + threadIdx.x; for radon_vrs2

    int idx1 =  M * blockIdx.x + threadIdx.x;

    int idx = index[idx1];

    sum[threadIdx.x] = img_In[idx];

    __syncthreads();

/***************************************************************************************************

and the problem is coming from this instruction: sum[threadIdx.x] = img_In[idx];

If I replace idx here by idx1, everything goes well, and even the results of the Radon transform are correct. Is it because of the values in the index vector? For example, at theta = 0, the indices that must be read from the image (512x512) are from 130560 to 131071 (512 values in total to be read and summed).

I checked this; no problem there …

Try to debug your code and check what is going on in the program.

What are your kernel launch arguments? If you are launching with more than 256 threads, you’ll exceed your hardcoded shared memory array size and get the kernel crash.

These are the kernel launch arguments:

/*****************************************************************************

#define threads_per_block 512

#define blocks_per_grid_row 512

int dataAmount = threads_per_block * blocks_per_grid_row;

    int blockGridWidth = blocks_per_grid_row; //

    int blockGridHeight = (dataAmount / threads_per_block) / blockGridWidth; //

    dim3 blockGridRows( blockGridWidth, blockGridHeight );//

    dim3 threadBlockRows(threads_per_block , 1);//

    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    radon_vrs3<<<blockGridRows, threadBlockRows>>>(d_A, d_S, index, N, M);

/*********************************************************************

and the changes I’ve just included in the Kernel :

/**********************************************************************

__global__ void radon_vrs3( float* img_In, float* radonResults, int* index, int N, int M )

{

    // Declare rows of the matrix to be in shared memory for speed

    __shared__ float sum[threads_per_block];

// Calculate which element this thread reads from memory

    //int index =  M * blockIdx.x + threadIdx.x; for radon_vrs2

    int idx1 =  M * blockIdx.x + threadIdx.x;

    int idx = index[idx1];

    sum[threadIdx.x] = img_In[idx];

    __syncthreads();

    int nTotalThreads = blockDim.x; // Total number of active threads

    while (nTotalThreads > 1)

    {

            int halfPoint = (nTotalThreads >> 1); // divide by two

            // Only the first half of threads will be active.

            if (threadIdx.x < halfPoint)

                    sum[threadIdx.x] += sum[threadIdx.x + halfPoint];

            __syncthreads();

nTotalThreads = (nTotalThreads >> 1); // divide by two.

    }

    // At this point, thread 0 of each block holds the sum of a row

    // It's time for thread 0 to write its final result.

    if (threadIdx.x == 0)

            radonResults[blockIdx.x] = sum[0];

}

/******************************************************************************************

The problem is in this instruction: sum[threadIdx.x] = img_In[idx];

If I replace idx here with idx1, everything goes well, but those are not the desired indices.

So you need to check the value of idx against the bounds of the index array.

Try running your program through cuda-memcheck to capture the out-of-bounds errors; it will give you the function address and the memory address. If you compile with debugging symbol support, it’ll likely give you source info too.

I wouldn’t recommend the last part. Compiling with debugging for the device will spill shared memory to local memory. If the out-of-bounds access is in the shared buffer, cuda-memcheck won’t detect it when compiled with debugging on.

Hey, could you provide the algorithm used to build the Radon transform?