GPU memory addressing problem?

Hi,

I have a very simple test program. It does the following: allocate an array -> initialize the values of the array on the host -> copy the data to the device -> multiply the array by a float value -> copy the data back and compare to the host results (the relevant parts of the code are attached).

Now what happens is the following:

  1. I get incorrect results for different runs, but sometimes also correct results :(

  2. Across runs, it is not always the same array element that is incorrect.

  3. If an array value is incorrect, the value at index i appears to be the value that should be at index j, but there seems to be no correlation between i and j.

Below is the output of one of those runs. I marked a few lines. As you can see, after the first loop the value at index 45534 is incorrect, but appears to be the value that ought to be at index 46046, whereas that value in turn is the one that ought to be at index 48590 …

In the second loop the value at index 78 is incorrect, which however was still correct in the first loop …

Any idea what's wrong?

CUDA 1.1, Linux (64 bit), driver version 169.09, GeForce 8800 GT

loop 0

loop 1

L 1 - re 45534 455340.00 460460.00 (= 10.0*45534.00) ** -> 46046

L 1 - re 45535 455350.00 460470.00 (= 10.0*45535.00)

L 1 - re 45918 459180.00 482060.00 (= 10.0*45918.00)

L 1 - re 45919 459190.00 482070.00 (= 10.0*45919.00)

L 1 - re 46046 460460.00 485900.00 (= 10.0*46046.00) ** -> 48590

L 1 - re 46047 460470.00 485910.00 (= 10.0*46047.00)

L 1 - re 47710 477100.00 494860.00 (= 10.0*47710.00)

L 1 - re 47711 477110.00 494870.00 (= 10.0*47711.00)

L 1 - re 48206 482060.00 455340.00 (= 10.0*48206.00)

L 1 - re 48207 482070.00 455350.00 (= 10.0*48207.00)

L 1 - re 48590 485900.00 477100.00 (= 10.0*48590.00)

L 1 - re 48591 485910.00 477110.00 (= 10.0*48591.00)

L 1 - re 49486 494860.00 496300.00 (= 10.0*49486.00)

L 1 - re 49487 494870.00 496310.00 (= 10.0*49487.00)

L 1 - re 49630 496300.00 508940.00 (= 10.0*49630.00)

L 1 - re 49631 496310.00 508950.00 (= 10.0*49631.00)

L 1 - re 50894 508940.00 515500.00 (= 10.0*50894.00)

loop 2

L 2 - re 78 7800.00 942200.00 (= 10.0*780.00)

L 2 - re 79 7900.00 942300.00 (= 10.0*790.00)

L 2 - re 718 71800.00 699000.00 (= 10.0*7180.00)

L 2 - re 719 71900.00 699100.00 (= 10.0*7190.00)

L 2 - re 1998 199800.00 7800.00 (= 10.0*19980.00)

L 2 - re 1999 199900.00 7900.00 (= 10.0*19990.00)

L 2 - re 2142 214200.00 303800.00 (= 10.0*21420.00)

L 2 - re 2143 214300.00 303900.00 (= 10.0*21430.00)

L 2 - re 3038 303800.00 611000.00 (= 10.0*30380.00)

L 2 - re 3039 303900.00 611100.00 (= 10.0*30390.00)

L 2 - re 6110 611000.00 547000.00 (= 10.0*61100.00)

L 2 - re 6111 611100.00 547100.00 (= 10.0*61110.00)

L 2 - re 6494 649400.00 214200.00 (= 10.0*64940.00)

L 2 - re 6495 649500.00 214300.00 (= 10.0*64950.00)

L 2 - re 6878 687800.00 649400.00 (= 10.0*68780.00)

L 2 - re 6879 687900.00 649500.00 (= 10.0*68790.00)

L 2 - re 9422 942200.00 955000.00 (= 10.0*94220.00)

#include <stdio.h>
#include <math.h>
#include <cufft.h>
#include <cutil.h>

#define NTHREADS 512

__global__ void dd3mul(cufftComplex* A, float f);
void repeated_multiply(int n);

int main(int argc, char** argv)
{
    repeated_multiply(10);
    CUT_EXIT(argc, argv);
}

// Multiply each complex element of A by f; one thread per element.
__global__ void dd3mul(cufftComplex* A, float f)
{
    int bx = blockIdx.x;   // block index
    int tx = threadIdx.x;  // thread index

    unsigned int i = blockDim.x * bx + tx;
    A[i].x *= f;
    // __syncthreads();
    A[i].y *= f;
    // __syncthreads();
}

void repeated_multiply(int n)
{
    CUT_DEVICE_INIT();

    int length = 64*64*64;
    int memsiz = sizeof(cufftComplex) * length;

    cufftComplex* h_data = (cufftComplex*) malloc(memsiz);
    cufftComplex* h_datc = (cufftComplex*) malloc(memsiz);
    cufftComplex* d_data;
    CUDA_SAFE_CALL( cudaMalloc((void**) &d_data, memsiz) );

    // init host arrays
    for(int i=0; i<length; i++) {
        h_data[i].x = (float)(i);
        h_data[i].y = (float)(i*i);
        h_datc[i].x = 0.0;
        h_datc[i].y = 0.0;
    }

    CUDA_SAFE_CALL( cudaMemcpy(d_data, h_data, memsiz, cudaMemcpyHostToDevice) );

    dim3 xthreads(NTHREADS, 1, 1);
    dim3 xgrid(length / NTHREADS, 1, 1);

    float f = 10.;
    for(int j=0; j<n; j++) {

        printf("loop %d\n", j);
        // copy data back to host for checking
        CUDA_SAFE_CALL( cudaMemcpy(h_datc, d_data, memsiz, cudaMemcpyDeviceToHost) );

        int count = 0;
        for(int i=0; i<length; i++) {
            if( fabs(h_data[i].x - h_datc[i].x) > 1.e-6 ) {
                printf("L %d - re %d %.2f %.2f (= %.1f*%.2f)\n",
                       j, i, h_data[i].x, h_datc[i].x, f, h_data[i].x/f);
                count++;
            }
            if(count > 16) break;
        }

        // multiply host array by f
        for(int i=0; i<length; i++) {
            h_data[i].x *= f;
            h_data[i].y *= f;
        }
        // multiply device array by f
        dd3mul<<< xgrid, xthreads >>>(d_data, f);
    }

    free(h_data);
    free(h_datc);
    CUDA_SAFE_CALL( cudaFree(d_data) );
}

Sorry, I did not manage to compile your code.
But as far as I can see, you do the processing

dd3mul<<< xgrid, xthreads >>>(d_data, f);

after you start checking the values.

Are you sure that this is what you meant to do?

Edit: Oh, OK. I think that doesn't matter, sorry: the blocking cudaMemcpy at the start of each iteration waits for the previous kernel to finish, so the check always sees completed results.

Another edit:
Managed to compile the code. For me it worked fine; the output was just the expected:

loop 0
loop 1
loop 2
loop 3
loop 4
loop 5
loop 6
loop 7
loop 8
loop 9

I think your problem is the printf; maybe on your machine it is processed asynchronously.
On what OS are you running the code?

Argh - I don't think it was a problem with printf … (btw: Debian Linux, 64 bit)

The problem went away after a reboot :blink: My Linux box had been running for some weeks without a reboot. My feeling is that it might be a driver problem/bug?

All my tests indicate that the addressing on the card was faulty. The consequence for me is that I will have to develop test routines that are executed every now and then to check for this misbehavior, and stop the calculations if it shows up. Very unfortunate …