memory problem on Tesla C1060

Hi all,

I have a problem with a program that uses CUFFT.

[codebox]#include // original

#include "main.h"

using namespace std;

int main(int argc, char *argv[])

{

    int NX = 64, NY = 64, NZ = 128;   // FFT dimensions

    int NXYZ = NX * NY * NZ;          // total number of elements

cout << " NXYZ " << NXYZ << endl;

cufftComplex data[NXYZ], data_1[NXYZ]; // data variables on the cpu

// GPU memory allocation 3d

    cudaMalloc((void**)&devPtr,   sizeof(cufftComplex) * NXYZ);

    cudaMalloc((void**)&devPtr_1, sizeof(cufftComplex) * NXYZ);

/* creates 3D FFT plan */

    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

    cufftPlan3d(&plan_1, NX, NY, NZ, CUFFT_C2C);

// source data creation or fill the vector

    for (int i = 0; i < NXYZ; i++) {

        // data part 1

            data[i].x = 2.0f; // real part

            data[i].y = 1.0f; // imag part

		// data part 2

            data_1[i].x = 4.0f;

            data_1[i].y = 5.0f;

    }

// transfer to GPU memory

    cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX*NY*NZ, cudaMemcpyHostToDevice);

    cudaMemcpy(devPtr_1, data_1, sizeof(cufftComplex)*NX*NY*NZ, cudaMemcpyHostToDevice);

/* executes FFT processes */

    cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);

    cufftExecC2C(plan_1, devPtr_1, devPtr_1, CUFFT_FORWARD);

/* executes FFT processes (inverse transformation) */

    cufftExecC2C(plan, devPtr, devPtr, CUFFT_INVERSE);

    cufftExecC2C(plan_1, devPtr_1, devPtr_1, CUFFT_INVERSE);

/* transfer results from GPU memory */

    cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX*NY*NZ, cudaMemcpyDeviceToHost);

    cudaMemcpy(data_1, devPtr_1, sizeof(cufftComplex)*NX*NY*NZ, cudaMemcpyDeviceToHost);

/* deletes CUFFT plan */

    cufftDestroy(plan);

    cufftDestroy(plan_1);

/* frees GPU memory */

    cudaFree(devPtr);

    cudaFree(devPtr_1);

    for (int i = 0; i < NXYZ; i += 10000) {

cout << "data " << i << " "<< data[i].x << " "<< data[i].y << " "

              << "data_1 " << i << " "<< data_1[i].x << " "<< data_1[i].y

              << endl;

    }

return 0;

}[/codebox]

The program runs fine on a card and system with the following specs:

CentOS 5.3; GeForce 9800 GT; Code::Blocks 8.02; gcc 4.1.2; CUDA 2.3

But when I run it on another machine with these specs

Ubuntu 9.04; Tesla C1060; CUDA 2.3

the program crashes with a segmentation fault. :blink:

Segmentation fault

the code I am using on this last machine is

[codebox]// identical to the code above, except that main() begins with:

cudaSetDevice(0);[/codebox]

I added the cudaSetDevice(0); call to ensure that I am using the first device (the C1060), not the integrated one (nForce 980a).

As far as I know, the C1060 has 8 times the memory of the 9800 GT.

Any idea what is going on?

These are the contents of main.h:

#include <cuda.h>

#include <cuda_runtime.h>

#include <cufft.h>

cufftHandle plan, plan_1;

cufftComplex *devPtr, *devPtr_1; // pointers for data on the gpu

First dumb question, you’re hardwiring the device number into your program? Why?

And are you absolutely sure that device 0 is your Tesla?

You can check really easily. And in fact this is a good status message in your program no matter what.

int device;

cudaDeviceProp deviceProp;

cudaGetDevice(&device);

CUDA_SAFE_CALL_NO_SYNC(cudaGetDeviceProperties(&deviceProp, device));

fprintf(stderr, "Using device %d: %s\n", device, deviceProp.name);

SPWorley

Thanks for having a look;

To be sure before I post my problem I did the following:

first, I ran deviceQuery to check the devices;

second, I selected cudaSetDevice(1), which is the integrated device; the largest NXYZ that would run on it was far smaller than with cudaSetDevice(0);

third, I removed the manual GPU selection and replaced it with a small piece of code that enumerates the devices on the host and passes to cudaSetDevice() the GPU with the maximum number of multiprocessors.

And it still shows the same bad result!

any ideas?

I tested your code and modified it as follows:

cufftComplex *data, *data_1;

data = (cufftComplex *) malloc(sizeof(cufftComplex) * NXYZ);

assert(data);

data_1 = (cufftComplex *) malloc(sizeof(cufftComplex) * NXYZ);

assert(data_1);

just using dynamic allocation

then the program works without any error.

However if I use

cufftComplex data[524288], data_1[524288];

where NXYZ = NX * NY * NZ = 64 * 64 * 128 = 524288, so sizeof(data) = sizeof(data_1) = 4 MB,

then VC2005 reports a stack overflow.

I think this is a stack-overflow problem.

By the way, how could you compile your original code? It should report an error at

cufftComplex data[NXYZ], data_1[NXYZ];

since NXYZ is not a constant expression.

LSChien

Thanks for your input,

Based on your suggestions I have the program working fine now. :yes:

As for cufftComplex data[524288], data_1[524288];

I define them in the source file, not in the header. I am using gcc as the compiler, and it gives me no errors (gcc accepts variable-length arrays as an extension, which is why the non-constant bound compiles).