Hi,
I am trying to take advantage of multiple graphics cards, though my tests are currently running with just one. The outline of the program is:
[codebox]
Allocate memory on the host and load in the raw data.
Spawn a thread for each card; each thread will:
    initialise the card and allocate memory on it
    forall tiles {
        copy a tile onto the card from the host using cudaMemcpy3DAsync
        run the analysis on this tile
        copy the results back
    }
End threads.
Compare the results with those generated from the gold function.
[/codebox]
As I am using streams, it is necessary to use [font=“Courier New”]cudaMallocHost[/font] to allocate the host memory for the raw data and the results. However, it appears that memory allocated this way is only visible to the thread it was allocated in, even though the pointer to it is a global variable. The code all runs fine in emulation mode. I have experimented with moving the allocation into different parts of the code, and any function that uses the memory from a different thread fails: [font=“Courier New”]cudaMemcpy3DAsync[/font] returns an invalid-argument error, and direct access from host code gives a segmentation fault (for example, the comparison against the gold results when the memory is allocated inside a spawned thread).
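To make the failure mode concrete, here is a simplified sketch of the pattern (names, sizes, and the 1-D [font=“Courier New”]cudaMemcpyAsync[/font] stand in for my actual code and the 3-D copy):

[codebox]
#include <cuda_runtime.h>
#include <pthread.h>

#define TILE_BYTES (1 << 20)   /* placeholder tile size */

float *h_raw = NULL;           /* global pointer, pinned in the main thread */

void *worker(void *arg)
{
    int device = *(int *)arg;
    cudaSetDevice(device);     /* each thread gets its own context */

    float *d_tile;
    cudaMalloc((void **)&d_tile, TILE_BYTES);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* This async copy fails with an invalid-argument error when h_raw
       was pinned with cudaMallocHost in a *different* thread. */
    cudaMemcpyAsync(d_tile, h_raw, TILE_BYTES,
                    cudaMemcpyHostToDevice, stream);
    /* ... kernel launch, copy results back ... */
    return NULL;
}

int main(void)
{
    cudaMallocHost((void **)&h_raw, TILE_BYTES);  /* pinned in main thread */
    /* spawn one worker per card with pthread_create, then join ... */
}
[/codebox]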
Is this the expected behaviour? Does [font=“Courier New”]cudaMallocHost[/font] only allow the memory to be used by the thread that called it?
If so, is there a workaround? I realise I could allocate a single pinned tile per thread and copy the data into it from memory allocated with the standard [font=“Courier New”]malloc[/font], but this seems very inefficient.
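By which I mean something like the following (again simplified to a 1-D copy; the per-tile host-to-host memcpy is the part that worries me):

[codebox]
/* Staging workaround (sketch): each thread pins one tile-sized buffer
   of its own and copies into it from the plain malloc'd raw data. */
float *h_stage;
cudaMallocHost((void **)&h_stage, TILE_BYTES);  /* pinned by this thread */

for (int t = 0; t < num_tiles; ++t) {
    memcpy(h_stage, h_raw + t * tile_elems, TILE_BYTES);  /* extra copy */
    cudaMemcpyAsync(d_tile, h_stage, TILE_BYTES,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);  /* can't reuse h_stage until done */
    /* ... run analysis, copy results back ... */
}
[/codebox]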
If it is not the expected behaviour, does anyone have any ideas what I am doing wrong?
Many thanks in advance
Daniel