Hello,
This is my first attempt at writing to this forum. I'm writing a simple CUDA program. I have a number of files I want to process. Each file is very small, but there can be a lot of them (from 1 GB to 2 TB in total, depending on the case). The buffer length (b_size) is 128 MB.
I allocate memory on both the host and the device like this:
res=cudaMallocHost((void**)&c->h_head, b_size);
[…]
res=cudaMalloc((void **)&c->d_head, b_size);
res=cudaMalloc((void **)&c->d_outbuf, b_size);
(Checking the value returned in res, I get no errors.)
I use this allocated memory as a buffer: I load files into it and then use cudaMemcpy to copy it to the device:
cudaMemcpy(c->d_head, c->h_head, b_size, cudaMemcpyHostToDevice);
gpu_kernel(b_size, (char*)c->d_head, (char*)c->d_outbuf);
My problem comes when I do this in a loop:
while (files available) {
    - load files into the buffer
    - copy the buffer to the device
    - launch the kernel
}
If I comment out the kernel launch and use 1 GB of tiny files (8-9 loop iterations with the 128 MB buffer), every copy to the device takes more or less the same time: approx. 58000 us.
With the kernel launch enabled, the first copy takes approx. 58000 us, but the following ones take only 30-40 us, which is far less.
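For reference, the loop and the measurement described above can be sketched as a self-contained function. This is only a sketch under assumptions, not the code from the post: placeholder_kernel, run_timed_loop, CUDA_CHECK, the memset that stands in for loading files, and the 256x256 launch configuration are all hypothetical names and values chosen for illustration. It times only the host-to-device copy with CUDA events and checks the return value of every runtime call, including the kernel launch.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical stand-in for the real kernel: copies `in` to `out`.
__global__ void placeholder_kernel(size_t n, const char *in, char *out)
{
    size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        out[i] = in[i];
}

// Runs n_iter iterations of load / copy / kernel and prints the copy time.
// Returns the number of iterations that completed without a CUDA error.
int run_timed_loop(int n_iter, size_t b_size)
{
    assert(n_iter >= 0 && b_size > 0);

    char *h_head = NULL, *d_head = NULL, *d_outbuf = NULL;
    CUDA_CHECK(cudaMallocHost((void **)&h_head, b_size));  // pinned host buffer
    CUDA_CHECK(cudaMalloc((void **)&d_head, b_size));
    CUDA_CHECK(cudaMalloc((void **)&d_outbuf, b_size));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int done = 0;
    for (int it = 0; it < n_iter; ++it) {
        memset(h_head, it, b_size);  // stands in for "load files into buffer"

        // Time only the host-to-device copy.
        CUDA_CHECK(cudaEventRecord(start, 0));
        CUDA_CHECK(cudaMemcpy(d_head, h_head, b_size, cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaEventRecord(stop, 0));
        CUDA_CHECK(cudaEventSynchronize(stop));
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

        placeholder_kernel<<<256, 256>>>(b_size, d_head, d_outbuf);
        CUDA_CHECK(cudaGetLastError());       // errors detected at launch
        CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs

        printf("iteration %d: copy took %.0f us\n", it, ms * 1000.0f);
        ++done;
    }

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_outbuf));
    CUDA_CHECK(cudaFree(d_head));
    CUDA_CHECK(cudaFreeHost(h_head));
    return done;
}
```

Calling run_timed_loop(8, 128u * 1024u * 1024u) from main would roughly correspond to the 1 GB case. Kernel launches are asynchronous and their errors are reported by later runtime calls, so checking the result of cudaMemcpy in every iteration would show whether the 30-40 us copies are really transferring data. Moving 128 MB in that time would need a rate of several TB/s.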
Can anyone tell me what I am doing wrong?
EDIT:
I'm using Fedora 17 x64 with gcc-3.4.
c->d_head is declared as:
struct c{
[…]
u8 *h_head;
u8 *h_out;
u8 *d_head;
u8 *d_outbuf;
};
Thanks!