Transfers between device and host: data exchanges initiated by the device

Hi everybody :wave: ,

I have a new “simple” question: I want to write a program that needs data transfers from device to host, initiated by the device. My problem is synchronising the host and the device for these exchanges. I tried with the event feature, but compilation fails because of the use of host code in device code! Does anybody have an idea?

My data:

host ------------> page-locked memory
                          |
device -------------------/

My program behaviour:

(red for device code, green for host code)

device puts data in a buffer --------> when the buffer is full, it is copied into page-locked memory --------> the host is notified and then copies the data elsewhere

(the device code runs in a loop, refilling the buffer after each copy)

Launch a new kernel for each iteration of the loop.
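A minimal sketch of this suggestion, with illustrative names (`fill_buffer`, `N` are not from the original post): one kernel launch per iteration, then a copy of the filled buffer back to pinned host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real per-iteration work: fill the buffer with a value.
__global__ void fill_buffer(int *buf, int n, int iter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buf[idx] = iter;
}

int main()
{
    const int N = 1024;
    int *d_buf, *h_buf;
    cudaMalloc(&d_buf, N * sizeof(int));
    cudaHostAlloc((void **)&h_buf, N * sizeof(int), cudaHostAllocDefault); // pinned

    for (int iter = 0; iter < 10; ++iter)
    {
        fill_buffer<<<(N + 255) / 256, 256>>>(d_buf, N, iter);
        // cudaMemcpy synchronises with the preceding kernel on the same
        // stream, so the host sees the completed buffer here.
        cudaMemcpy(h_buf, d_buf, N * sizeof(int), cudaMemcpyDeviceToHost);
        printf("iteration %d: h_buf[0] = %d\n", iter, h_buf[0]);
    }
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```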


In fact, in my program it takes too long to launch a new kernel; that's why I try to avoid this… :confused:


Ten microseconds is too long? What are you doing???


I have to create many objects and I have to minimize data transfers; this is why I'd like to have just one write from device to host instead of one read and then one write. More precisely, this is a hardware simulator. :)


If you get a Fermi-based card, you could overlap multiple kernels, so that whenever a kernel ends the corresponding buffer is copied.
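A rough sketch of the stream-overlap idea, under assumptions not in the original post (the `fill_chunk` kernel and buffer names are hypothetical): each chunk ping-pongs between two streams, so the async copy of one chunk can overlap the next chunk's kernel. Fermi supports concurrent kernels and copy/compute overlap, and `cudaMemcpyAsync` only overlaps when the host buffer is page-locked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel; stands in for the real work.
__global__ void fill_chunk(int *buf, int n, int chunk)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buf[idx] = chunk;
}

int main()
{
    const int N = 1 << 20, NCHUNKS = 8;
    int *d_buf[2], *h_buf[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i)
    {
        cudaMalloc(&d_buf[i], N * sizeof(int));
        // async copies only overlap with page-locked host memory
        cudaHostAlloc((void **)&h_buf[i], N * sizeof(int), cudaHostAllocDefault);
        cudaStreamCreate(&s[i]);
    }

    for (int chunk = 0; chunk < NCHUNKS; ++chunk)
    {
        int k = chunk % 2;            // ping-pong between the two streams
        cudaStreamSynchronize(s[k]);  // wait until h_buf[k] is free again
        fill_chunk<<<(N + 255) / 256, 256, 0, s[k]>>>(d_buf[k], N, chunk);
        // as soon as this chunk's kernel ends, its buffer is copied,
        // while the other stream's kernel can still be running
        cudaMemcpyAsync(h_buf[k], d_buf[k], N * sizeof(int),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("last value: %d\n", h_buf[(NCHUNKS - 1) % 2][0]);
    return 0;
}
```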


==>Yes, I have a GTX 470 ( :heart: !)… But is there really an overlap when data returned by kernel 1 has to be transmitted to kernel 2?

==>In the documentation it's written that memory transfers have to be “coalesced”, but how do I do this with an int array? I mean that the number of memory transfers would be the size of the array: is there any solution to copy an entire device buffer to page-locked memory with only one command?
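One point on this question: coalescing concerns how the threads of a warp access global memory inside a kernel; it does not apply to host-device copies. A single `cudaMemcpy` already moves the whole buffer in one command. A minimal sketch (`d_buf` / `h_pinned` are illustrative names):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int nelem = 10;
    int *d_buf, *h_pinned;
    cudaMalloc(&d_buf, nelem * sizeof(int));
    cudaMemset(d_buf, 0, nelem * sizeof(int));
    cudaHostAlloc((void **)&h_pinned, nelem * sizeof(int), cudaHostAllocDefault);

    // one command moves the entire device buffer into page-locked host
    // memory; there is no per-element transfer to worry about
    cudaMemcpy(h_pinned, d_buf, nelem * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_pinned[0] = %d\n", h_pinned[0]);
    return 0;
}
```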

==>To test synchronisation between host and device, I tried this:

But it printed nothing (a[0]==0 all the time :blink: )


no idea?

Maybe my subject is boring? :">


Declare a as volatile and insert a __threadfence_system() after setting it from the kernel.


compilation fails:

whereas with __threadfence() and __threadfence_block() no problem occurs :pinch:
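For what it's worth, __threadfence_system() only exists on compute capability 2.0 devices, so nvcc needs an explicit architecture flag; compiling for the default sm_1x target is a likely cause of this error, while __threadfence() still builds. A minimal sketch:

```cuda
// requires a compute capability 2.0 target, e.g.:
//   nvcc -arch=sm_20 test.cu
__global__ void producer(volatile int *flag)
{
    flag[0] = 1;             // write into mapped, page-locked memory
    __threadfence_system();  // make the write visible to the polling host
}
```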


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Device side of the handshake: count up to 1100, writing each new value
// into mapped pinned memory and waiting until the host resets it to 0.
__global__ void kernel2(volatile int *ptr_i)
{
    int i = ptr_i[0];
    if (threadIdx.x == 0)          // only one thread drives the handshake,
    {                              // otherwise all 1024 threads race on ptr_i[0]
        while (i != 1100)
        {
            if (ptr_i[0] == 0)     // host has consumed the previous value
            {
                i++;
                ptr_i[0] = i;
                __threadfence_system();  // make the write visible to the host
                                         // (needs nvcc -arch=sm_20)
            }
        }
    }
    __syncthreads();
}

int main()
{
    volatile int *a;   // pinned, mapped memory allocated on the CPU;
                       // volatile so the host polling loop rereads it
    int *d_a;          // device-side alias of the same memory
    int nelem = 10;
    size_t bytes = nelem * sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&a, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0);

    a[0] = 0;
    kernel2<<<1, 1024>>>(d_a);

    while (a[0] != 1100)
    {
        if (a[0] != 0)
        {
            printf("=> %i \n", a[0]);
            a[0] = 0;              // signal the device to produce the next value
        }
    }
    return 0;
}

I modified some things: 1024 threads (the maximum available according to the programming guide) and less data… What I obtained is in the attached file (ignore the .cu extension => needed to upload the file :D)

:)