cudaMemcpy and mem allocation

Hello,

This is my first post on this forum. I’m writing a simple CUDA program. I have a number of files I want to process. Each file is very small, but I can have lots of them (from 1GB to 2TB in total, depending on the run). The buffer length (b_size) is 128MB.

I allocate memory on both the host and the device as follows:

res=cudaMallocHost((void**)&c->h_head, b_size);
[…]
res=cudaMalloc((void **)&c->d_head, b_size);
res=cudaMalloc((void **)&c->d_outbuf, b_size);

(Checking the value returned in res, I get no errors.)

I use this allocated memory as a buffer: I load files into it and then use cudaMemcpy to copy it to the device:

cudaMemcpy(c->d_head,
           c->h_head,
           b_size,
           cudaMemcpyHostToDevice);

gpu_kernel(b_size,
           (char*)c->d_head,
           (char*)c->d_outbuf);

My problem comes when I do this in a loop:

while (files available) {
  • load files into buffer
  • copy the buffer to the device
  • launch kernel
}

If I comment out the kernel launch, then with 1GB of tiny files (8-9 loop iterations with the 128MB buffer) each copy to the device takes more or less the same time: approx. 58000 us.

With the kernel launch enabled, the first copy takes about the same 58000 us, but the following ones take only 30-40 us, which is far less.

Can anyone tell me what I am doing wrong?

EDIT:
I’m using Fedora 17 x64 with gcc-3.4.

c->d_head is declared as:
struct c{
    […]
    u8 * h_head;
    u8 * h_out;
    u8 * d_head;
    u8 * b_outbuf;
};
Thanks!!

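Check the return codes of every CUDA call for errors, including the cudaMemcpy calls inside the loop and the kernel launch itself. The most probable explanation is that there is an error in the kernel and none of the following CUDA calls are actually executed. A failed call returns almost immediately, which would explain why the later copies appear to take only 30-40 us.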
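
To make the advice above concrete, here is a minimal sketch of how the loop could be instrumented. It is not your actual code: CUDA_CHECK, copy_kernel, the launch configuration and n_iters are hypothetical names and values I made up for illustration, and the memset only stands in for loading files. The idea is to check every runtime call, call cudaGetLastError() right after the launch, and call cudaDeviceSynchronize() so that a fault raised while the kernel runs is reported in the same iteration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s failed: %s\n",               \
                    __FILE__, __LINE__, #call,                      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical stand-in for the real kernel: copies input to output
// using a grid-stride loop so any launch size covers the whole buffer.
__global__ void copy_kernel(size_t n, const char *in, char *out)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        out[i] = in[i];
}

int main(void)
{
    const size_t b_size = 128u << 20;  // 128MB buffer, as in the question
    const int n_iters = 8;             // assumed number of loop iterations
    char *h_head = NULL, *d_head = NULL, *d_outbuf = NULL;

    CUDA_CHECK(cudaMallocHost((void **)&h_head, b_size));
    CUDA_CHECK(cudaMalloc((void **)&d_head, b_size));
    CUDA_CHECK(cudaMalloc((void **)&d_outbuf, b_size));

    for (int it = 0; it < n_iters; ++it) {
        memset(h_head, it, b_size);  // placeholder for "load files into buffer"

        // Checked copy: a failure here is reported instead of silently skipped.
        CUDA_CHECK(cudaMemcpy(d_head, h_head, b_size, cudaMemcpyHostToDevice));

        copy_kernel<<<256, 256>>>(b_size, d_head, d_outbuf);
        CUDA_CHECK(cudaGetLastError());       // catches launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
    }

    CUDA_CHECK(cudaFree(d_outbuf));
    CUDA_CHECK(cudaFree(d_head));
    CUDA_CHECK(cudaFreeHost(h_head));
    return 0;
}
```

If your kernel is the culprit, a version of the loop checked like this should stop in the first iteration with an error message rather than reporting suspiciously fast copies afterwards. The synchronize call after every launch is there only for debugging; you would probably not want to keep it in the final code.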