Instability/deadlock on bigger arrays

Hi all.

I have some code that works on 2D arrays of size 2^N x 2^N. Each thread copies two samples from global memory to shared memory and then processes them many times. I have simplified the code and replaced all the computation with short, meaningless number crunching in order to demonstrate my problem. Here it is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil.h>

__global__ void cuda_kernel (float2* g_idata, float2* g_odata, const int num_iterations) {
    __shared__ float sdatax[512];
    __shared__ float sdatay[512];

    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.x * blockDim.x * 2;
    const unsigned int stride = blockIdx.y * blockDim.x * gridDim.x * 2;
    float2 tmp;
    int i;

    tmp = g_idata[bid + tid + stride];
    sdatax[tid] = tmp.x;
    sdatay[tid] = tmp.y;
    tmp = g_idata[bid + tid + blockDim.x + stride];
    sdatax[tid + blockDim.x] = tmp.x;
    sdatay[tid + blockDim.x] = tmp.y;

    for (i = 0; i < num_iterations; i++) {
        tmp.x = sdatax[tid + blockDim.x];
        tmp.y = sdatay[tid + blockDim.x];
        sdatax[tid] = tmp.x * __cosf (0.5f) / __sinf (0.5f);
        sdatay[tid] += tmp.y;
        __syncthreads ();
    }

    tmp.x = sdatax[tid];
    tmp.y = sdatay[tid];
    g_odata[bid + tid + stride] = tmp;
    tmp.x = sdatax[tid + blockDim.x];
    tmp.y = sdatay[tid + blockDim.x];
    g_odata[bid + tid + blockDim.x + stride] = tmp;
}

int main (int argc, char** argv) {
    CUT_DEVICE_INIT (argc, argv);

    const unsigned int N = 12, len = 1 << N, tests_num = 10000;
    unsigned int i, j;
    const unsigned int mem_size = sizeof (float2) * len * len;

    printf ("Testing %dx%d samples\n", len, len);

    // allocate host memory
    float2* h_idata = (float2*) malloc (mem_size);
    for (i = 0; i < len; ++i) {
        for (j = 0; j < len; ++j) {
            h_idata[i * len + j].x = 1.0;
            h_idata[i * len + j].y = 1.0;
        }
    }

    // allocate device memory
    float2* d_idata;
    CUDA_SAFE_CALL (cudaMalloc ((void**) &d_idata, mem_size));

    // allocate device memory for result
    float2* d_odata;
    CUDA_SAFE_CALL (cudaMalloc ((void**) &d_odata, mem_size));

    // setup execution parameters
    dim3 grid (1, 1, 1);
    dim3 threads (256, 1, 1);
    grid.x = len / 512;
    grid.y = len;

    for (i = 0; i < tests_num; i++) {
        printf ("Iteration #%d\n", i);

        // copy host memory to device
        CUDA_SAFE_CALL (cudaMemcpy (d_idata, h_idata, mem_size,
                                    cudaMemcpyHostToDevice));

        // execute the kernel
        cuda_kernel <<<grid, threads>>> (d_idata, d_odata, 10000);

        // check if kernel execution generated an error
        CUT_CHECK_ERROR ("Kernel execution failed");

        // copy result from device to host
        CUDA_SAFE_CALL (cudaMemcpy (h_idata, d_odata, mem_size,
                                    cudaMemcpyDeviceToHost));
    }

    // cleanup memory
    free (h_idata);
    CUDA_SAFE_CALL (cudaFree (d_idata));
    CUDA_SAFE_CALL (cudaFree (d_odata));

    CUT_EXIT (argc, argv);
}

It works flawlessly if N <= 11 (a 2048x2048 array or smaller), but if I define N = 12, it becomes unstable: it gets through 30-300 of the 10000 iterations and then gets stuck. The number of completed iterations is different every time.

I am on CentOS 4.2 x86_64 with CUDA 2.0 beta; the driver version is 177.13. deviceQuery reports 1 GB of memory, so memory size should not be the problem here.

According to ltrace, the lockup happens inside a cudaThreadSynchronize() call. I can still log in to the machine (a remote Dell Precision 690 with a Tesla board), and top shows 100% CPU load from my CUDA application, but nothing happens. This can last for several hours, after which the machine hangs and reboots on its own. If I try to kill the application, it does not die; instead the machine hangs and reboots within 5-10 minutes.

Running in debug mode does not reveal any additional details.

Any suggestions about what might be wrong with the code above, or with the CUDA setup it runs on, are welcome. Thank you.
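Not part of the original post, but one experiment that may help narrow this down: split the 10000 inner iterations into several shorter kernel launches, synchronizing and checking for an error after each one, so that any single launch stays short and the first failing chunk is identified. The sketch below reuses cuda_kernel and the grid/threads setup from the code above and ping-pongs between the two device buffers; the result is not bit-identical to one long launch (the shared-memory state is reloaded from global memory between chunks), but for this synthetic workload that should not matter.

```cuda
// Hypothetical replacement for the single 10000-iteration launch:
// run shorter launches and surface errors per chunk.
const int total_iters = 10000, chunk = 500;
float2 *src = d_idata, *dst = d_odata;
for (int done = 0; done < total_iters; done += chunk) {
    cuda_kernel <<<grid, threads>>> (src, dst, chunk);
    cudaError_t err = cudaThreadSynchronize ();  // wait for this chunk to finish
    if (err != cudaSuccess) {
        fprintf (stderr, "chunk starting at %d failed: %s\n",
                 done, cudaGetErrorString (err));
        break;
    }
    float2* t = src; src = dst; dst = t;         // ping-pong the buffers
}
```

If every chunk completes at N = 12, the problem is likely tied to per-launch kernel runtime; if some chunk still hangs, that points more toward the driver or setup.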

How are you building this app?

Please generate and attach an nvidia-bug-report.log from your system.

I use a modified Makefile from the SDK.

I attached an archive with the report and Makefile. Thank you.
nv_bug_rep_and_makefile.tar.gz (25.8 KB)