cudaMemcpy error: segmentation fault when executed

Why is this not working? I have tried many variations on the same theme!

a is a rank-4 array (declared a(0:kk,0:kk,0:kk,0:kk) in the Fortran program below). cudaMalloc works fine, but cudaMemcpy does not.

[codebox]

#include <stdio.h>
#include <stdlib.h>
#include "cufft.h"
#include "cuda.h"
#include "cutil_inline.h"

extern "C" void testpassing_(cufftReal* a, int* kk)
{
    int kk1 = *kk + 1;
    size_t size = sizeof(cufftReal) * kk1 * kk1 * kk1 * kk1;

    cufftReal *h_r;
    cufftReal *d_r;

    h_r = a;

    cutilSafeCall( cudaMalloc((void **)&h_r, size) );
    cutilSafeCall( cudaMalloc((void **)&d_r, size) );
    cutilSafeCall( cudaMemcpy(d_r, h_r, size, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(h_r, d_r, size, cudaMemcpyDeviceToHost) );
}

[/codebox]

The make file

[codebox]

all:
	gfortran -c -o test_passing_calling.o test_passing_calling.F
	nvcc -arch=sm_11 -c test_passing.cu -I$(CUDA_HOME)/cuda_23/cuda/include -I$(CUDA_SDK_HOME)/C/common/inc
	gfortran -o testpassing test_passing_calling.o test_passing.o -lgfortran -lstdc++ -L$(CUDA_HOME)/cuda_23/cuda/lib64 -lcufft -lcudart -L$(CUDA_SDK_HOME)/lib

[/codebox]

The fortran program

[codebox]

      parameter (kk=2)
      real*4 :: a(0:kk,0:kk,0:kk,0:kk)

      do i=0,kk
        do j=0,kk
          do k=0,kk
            do m=0,kk
              a(i,j,k,m) = rand()
            end do
          end do
        end do
      end do

      do i=0,kk
        do j=0,kk
          do k=0,kk
            write(*,1000) a(i,j,k,0), a(i,j,k,1), a(i,j,k,2)
          end do
          print *
        end do
        print *
        print *
      end do

      call testpassing(a,kk)

      do i=0,kk
        do j=0,kk
          do k=0,kk
            write(*,1000) a(i,j,k,0), a(i,j,k,1), a(i,j,k,2)
          end do
          print *
        end do
        print *
        print *
      end do

 1000 format (3f10.5)
      end

[/codebox]

"cutilSafeCall( cudaMalloc((void **)&h_r, size));"

I'm unfamiliar with Fortran calling conventions, but I'm pretty sure this line is nonsense. Fortran has already allocated space for the array a, and you set the pointer h_r to point at it, so there is no reason to allocate device memory through that (host) pointer. Worse, cudaMalloc overwrites h_r with a device address, so the following cudaMemcpy with cudaMemcpyHostToDevice tries to read from a device pointer as if it were host memory, which is what causes the segmentation fault.

Remove that line and it should work.

If not, manually check whether 'size' has the value you expect inside your testpassing_ function, although I can't see anything that could go wrong there.
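
For reference, here is a minimal sketch of the corrected function with that line removed (same names as the original post; the cudaFree at the end is an extra addition so the device buffer is released on every call):

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include "cufft.h"
#include "cuda.h"
#include "cutil_inline.h"

extern "C" void testpassing_(cufftReal* a, int* kk)
{
    int kk1 = *kk + 1;
    size_t size = sizeof(cufftReal) * kk1 * kk1 * kk1 * kk1;

    cufftReal *h_r = a;   /* host data already allocated by the Fortran caller */
    cufftReal *d_r;       /* device buffer */

    cutilSafeCall( cudaMalloc((void **)&d_r, size) );
    cutilSafeCall( cudaMemcpy(d_r, h_r, size, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(h_r, d_r, size, cudaMemcpyDeviceToHost) );
    cutilSafeCall( cudaFree(d_r) );   /* release the device buffer before returning */
}
[/codebox]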

Thank you very much. Now it works.

[codebox]

#include <stdio.h>
#include <stdlib.h>
#include "cufft.h"
#include "cuda.h"
#include "cutil_inline.h"

extern "C" void testpassing_(cufftReal* a, int* kk)
{
    int i, j;
    int kk1 = *kk + 1;

    j = kk1 * kk1 * kk1 * kk1;             /* number of real elements */
    size_t size = sizeof(cufftReal) * j;   /* size of the real array in bytes */

    cufftReal *h_r = (cufftReal*) malloc(size);
    cufftReal *d_r;
    cufftComplex *h_c = (cufftComplex*) malloc(size*2);
    cufftComplex *d_c;

    h_r = a;                               /* use the array passed in from Fortran */

    cufftHandle plan;

//  for(i=0; i < j; i++){
//      printf("i = %d h_r = %10.5e \n", i, h_r[i]); }

    cutilSafeCall( cudaMalloc((void **)&d_r, size) );
    cutilSafeCall( cudaMalloc((void **)&d_c, size*2) );
    cutilSafeCall( cudaMemcpy(d_r, h_r, size, cudaMemcpyHostToDevice) );

    cufftPlan1d(&plan, j, CUFFT_R2C, 1);   /* transform length in elements, not bytes */
    cufftExecR2C(plan, (cufftReal *)d_r, d_c);

    cutilSafeCall( cudaMemcpy(h_c, d_c, size*2, cudaMemcpyDeviceToHost) );

    for(i=0; i < j; i++){
        printf("  %d     %10.5f   %10.5f  \n", i, h_c[i].x/j, h_c[i].y/j); }
}

[/codebox]
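
A small note on the output (a property of real-to-complex transforms, not something from the post above): for a real input of length j, CUFFT_R2C only produces the non-redundant half of the spectrum, j/2 + 1 complex values, so only those entries of h_c are meaningful. The print loop could stop there:

[codebox]
    /* an R2C transform of length j defines only the first j/2 + 1 complex outputs */
    for(i = 0; i < j/2 + 1; i++){
        printf("  %d     %10.5f   %10.5f  \n", i, h_c[i].x/j, h_c[i].y/j); }
[/codebox]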

But I still have a problem with this type of code.
After the last cudaMemcpy call, the memory used inside the function is not released when it returns. I can see the array values in the calling program, but the memory cannot be cleared; repeated calls keep using more and more memory until the maximum is reached.
How can I solve this problem?
Does the cudaMemcpy modify the array size or address?
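
For reference, cudaMemcpy only copies bytes; it never changes the size or the address of an array. The growth most likely comes from the buffers that the function allocates on every call and never releases. A minimal sketch of the cleanup at the end of the function, assuming the buffers are only needed inside it (names follow the code above):

[codebox]
    cutilSafeCall( cudaMemcpy(h_c, d_c, size*2, cudaMemcpyDeviceToHost) );

    /* release everything that was allocated during this call */
    cufftDestroy(plan);
    cutilSafeCall( cudaFree(d_r) );
    cutilSafeCall( cudaFree(d_c) );
    free(h_c);
    /* h_r points at the Fortran array, so it must not be freed here; the
       separate malloc that was first assigned to h_r can simply be removed */
[/codebox]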

Finally.
I got my program working.
After many, many runs and modifications, I solved all the memory management problems. It is not optimized for speed yet, but I will be looking at that.
The main program is in Fortran; it calls a subroutine, which calls a CUDA function, which calls a cuFFT function, which returns the results to the calling subroutine and finally to the main program. All of this is done on a single-precision card, which is just enough for my case.
Most of the problems I encountered were due to my understanding of memory management. I can see that most of the posts on this forum are also memory related, so I am not alone.


After testing for speed, I am sorry to report that the CUDA version is the slowest of all. I was expecting it to be about 20X faster than FFTW3!

After more investigation, I found that the CUDA version is faster only for large matrices, above about 128x128x128 single-precision complex. But then the non-CUDA parts of the program are proportionally slower, so the complete run is still very slow.
Even for large matrices, cuFFT is only about 7X faster than FFTW3. The card is a GeForce 9600 GSO.
I did notice that the occupancy is low for small problems and gets higher for large ones. It seems that cuFFT is not optimized for high occupancy on small problems, and not all of the multiprocessors are used in those cases.

Can I do something to increase the occupancy?
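
One thing that may be worth trying, if the workload is many small transforms rather than one big one, is to batch them into a single cuFFT call so that each launch has more work to fill the multiprocessors. This is only a sketch of the idea, not something measured on the 9600 GSO: cufftPlan1d takes a batch count as its last argument, and the batched input is laid out contiguously, one transform after another.

[codebox]
#include "cufft.h"
#include "cuda.h"

/* Run 'batch' independent 1D complex-to-complex FFTs of length 'nx' in one call.
   h_data holds batch*nx complex values, one transform after another. */
void fft_batched(cufftComplex *h_data, int nx, int batch)
{
    cufftComplex *d_data;
    size_t bytes = sizeof(cufftComplex) * nx * batch;

    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);          /* one plan covers all transforms */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* in-place forward FFTs */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_data);
}
[/codebox]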