I have a new “simple” question: I want to write a program that needs data transfers from device to host, initiated by the host. My problem is synchronising host and device for these exchanges. I tried the event feature, but compilation fails because host code ends up being used in device code! Does anybody have an idea?
My data:
host ------------> page-locked memory
                        |
device -----------------/
My program behaviour:
(red for device code, green for host code)
---> device puts data in a buffer ---> when the buffer is full, it is copied to page-locked memory ---> the host is notified and then copies the data elsewhere
I have to create many objects and I have to minimize data transfers. This is why I’d like to have just one write from device to host instead of one read and then one write. More precisely, this is a hardware simulator. :)
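One way the flow described above might look, as a rough sketch rather than tested code for this simulator: a kernel fills a device buffer, a single cudaMemcpyAsync moves the whole buffer into page-locked memory, and the host learns that the copy has finished by polling an event. Events belong to the host-side runtime API, so cudaEventRecord and cudaEventQuery are called from host code only, never inside a kernel. The names fill_buffer and run_rounds, the buffer size, and the round count are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one element of the device buffer.
// Consecutive threads write consecutive ints, so a warp's accesses can coalesce.
__global__ void fill_buffer(int *buf, int n, int round)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buf[idx] = round * n + idx;
}

// Runs `rounds` fill-and-copy cycles. Returns how many buffers arrived with the
// expected first and last element, or -1 on a CUDA error.
int run_rounds(int n, int rounds)
{
    int *h_buf = NULL;   // page-locked host buffer
    int *d_buf = NULL;   // device buffer filled by the kernel
    size_t bytes = n * sizeof(int);
    cudaStream_t stream;
    cudaEvent_t copied;
    int ok = 0;

    if (cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }
    cudaStreamCreate(&stream);
    cudaEventCreate(&copied);

    for (int r = 0; r < rounds; ++r) {
        fill_buffer<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, r);

        // One call copies the entire buffer; it runs after the kernel because
        // both are queued in the same stream.
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(copied, stream);   // host-side call, marks "copy done"

        // The host could do other work here; this sketch simply polls.
        cudaError_t status;
        while ((status = cudaEventQuery(copied)) == cudaErrorNotReady) { }
        if (status != cudaSuccess) { ok = -1; break; }

        if (h_buf[0] == r * n && h_buf[n - 1] == r * n + n - 1)
            ++ok;
    }

    cudaEventDestroy(copied);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ok;
}

int main()
{
    int ok = run_rounds(1024, 4);
    printf("buffers received intact: %d\n", ok);
    return ok == 4 ? 0 : 1;
}
```

With this layout the host never reads device memory element by element: the device buffer reaches page-locked memory in one transfer per round.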
==> Yes, I have a GTX 470 (!) … But is there really any overlap when the data returned by kernel 1 has to be passed to kernel 2?
==> The documentation says that memory transfers have to be “coalesced”, but how do I do this when handling an int array? I mean that the number of memory transfers will be the size of the array: is there any way to copy an entire device buffer to page-locked memory with a single command?
==> To test synchronisation between host and device, I tried the code below.
But it printed nothing (a[0]==0 all the time :blink: )
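#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Device side: every thread spins on the mapped flag ptr_i[0] and writes the
// next counter value each time it reads 0 there.
__global__
void kernel2(volatile int * ptr_i)
{
    int i = ptr_i[0];
    while (i != 1100)
    {
        if (ptr_i[0] == 0)
        {
            i++;
            ptr_i[0] = i;
        }
    }
    __syncthreads();
}

int main()
{
    int *a;    // Pinned memory allocated on the CPU
    int *d_a;  // Device pointer to the same pinned memory
    int nelem;
    unsigned int flags;
    size_t bytes;

    nelem = 10;
    bytes = nelem * sizeof(int);

    // Allocate mapped (zero-copy) pinned memory and get its device address
    cudaSetDeviceFlags(cudaDeviceMapHost);
    flags = cudaHostAllocMapped;
    cudaHostAlloc((void **)&a, bytes, flags);
    cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0);

    a[0] = 0;
    kernel2<<<1,1024>>>(d_a);

    // Host side: print each value the device publishes, then reset the flag
    while (a[0] != 1100)
    {
        if (a[0] != 0)
        {
            printf("=> %i \n ", a[0]);
            a[0] = 0;
        }
    }
}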
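For comparison, here is a stripped-down sketch of the same flag handshake, with three assumptions of mine that differ from the code above: only one device thread touches the flag, the host pointer is also declared volatile so the polling loop re-reads memory each time, and the kernel calls __threadfence_system() after each write. The names producer and run_handshake are made up, and I have not verified this on a GTX 470; whether a spinning kernel and a polling host make progress together may also depend on the driver and OS.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single producer thread: publishes 1..last through the mapped flag, waiting
// each time until the host has reset the flag to 0.
__global__ void producer(volatile int *flag, int last)
{
    for (int value = 1; value <= last; ++value) {
        while (*flag != 0) { }      // wait for the host to consume the previous value
        *flag = value;
        __threadfence_system();     // order the write with respect to the host
    }
}

// Returns the number of values received in order, or -1 on error.
int run_handshake(int last)
{
    volatile int *flag = NULL;      // host view of the mapped pinned memory
    int *d_flag = NULL;             // device view of the same memory
    int received = 0;
    bool in_order = true;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void **)&flag, sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void **)&d_flag, (void *)flag, 0) != cudaSuccess) {
        cudaFreeHost((void *)flag);
        return -1;
    }

    *flag = 0;
    producer<<<1, 1>>>(d_flag, last);
    if (cudaGetLastError() != cudaSuccess) {   // launch was rejected
        cudaFreeHost((void *)flag);
        return -1;
    }
    // Assumption: some driver models queue the launch; querying the stream is a
    // commonly used way to nudge it into being submitted.
    cudaStreamQuery(0);

    while (received < last) {
        int v = *flag;              // volatile read, not hoisted out of the loop
        if (v != 0) {
            in_order = in_order && (v == received + 1);
            ++received;
            *flag = 0;              // tell the device it may publish the next value
        }
    }

    cudaError_t status = cudaDeviceSynchronize();
    cudaFreeHost((void *)flag);
    return (status == cudaSuccess && in_order) ? received : -1;
}

int main()
{
    int n = run_handshake(16);
    printf("values received: %d\n", n);
    return n == 16 ? 0 : 1;
}
```

Using a single thread avoids the situation where many threads, each with its own private counter, race to write the same flag.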
I modified a few things: 1024 threads (the maximum available according to the programming guide) and less data… What I obtained is in the attached file (never mind the .cu extension; it was needed to upload the file :D)