I know this kind of goes against the purpose of GPU devices, but is it possible, inside each thread block, to have a section of code executed by only one thread?
Example:
[codebox]///// main /////
kernel<<<16,32>>>(params);

///// kernel /////
__global__ void kernel(params)
{
    // do this code section with only one thread
    ...
    ...
    // now every thread executes the following code
    ...
    ...
}[/codebox]
Is this in any way possible with some kind of mutex/lock system? The reason I ask is that I have to set the row pointers of my double **ptr to the correct locations inside the contiguous block of memory I allocated for double *array. When I copy the matrix from the host to the GPU, the memory locations for double **ptr change, so I have to reset them once I'm inside the kernel. However, every thread in the kernel runs the same code, even though the pointers really only need to be set once.
I guess in my head I was thinking that __syncthreads() would generate too much latency, but I suppose it would be mostly negligible. So simple, haha. Thanks!
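In case it helps anyone searching later, the pattern ends up looking roughly like this (a sketch only; rows and cols stand in for my real dimensions, and I haven't benchmarked it):
[codebox]__global__ void kernel(double **ptr, double *array, int rows, int cols)
{
    // One thread per block rebuilds the row pointers so they point
    // into the device copy of the contiguous array.
    if (threadIdx.x == 0) {
        for (int r = 0; r < rows; ++r)
            ptr[r] = array + r * cols;
    }
    __syncthreads(); // everyone waits here until the pointers are valid

    // ... every thread executes the rest of the kernel ...
}[/codebox]
Since __syncthreads() only synchronizes within a block, every block repeats the fixup; they all write identical values, so that should be harmless in practice, but a tiny one-block setup kernel launched beforehand would also work.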
Yes… your approach works against the purpose of using GPUs. With this pattern you introduce an if/else branch, and that will definitely slow things down. Another issue is that using a double pointer leads to non-coalesced memory access in the kernel, which degrades performance. You will also end up doing multiple copies to device memory, since cudaMemcpy cannot transfer a double-pointer structure in one call. Overall this method will not perform well. You can get somewhat better performance by binding the input data in device memory to a texture, which avoids the coalescing issue.
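For reference, texture binding with the old texture-reference API looks roughly like this (a sketch, not tested; doubles have to be fetched as int2 and reassembled, and d_array/n are placeholder names):
[codebox]// Legacy texture-reference API (CUDA 2.x era), sketch only.
texture<int2, 1, cudaReadModeElementType> tex;

__global__ void kernel(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int2 v = tex1Dfetch(tex, i);            // fetch 8 bytes as int2
        double d = __hiloint2double(v.y, v.x);  // reassemble the double
        out[i] = 2.0 * d;                       // ... use the value ...
    }
}

// Host side, before the launch:
//   cudaBindTexture(0, tex, d_array, n * sizeof(double));
[/codebox]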
True, but all the other threads are doing nothing anyway while 'The One' runs the code. Sometimes it's quicker for all threads to do the same calculation than to have one thread write the result to shared memory, synchronize, and then have every thread read it back. I'm not sure in your case with the double pointer; how long the calculation takes and the number of warps per block will also influence the optimum.
(The manual also says that sometimes it's better to recalculate the same thing than to tie up registers, local, or shared memory.)
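A rough illustration of what I mean (a made-up example, not your code):
[codebox]__global__ void kernel(const double *in, double *out, int n)
{
    // Every thread recomputes the same per-block value. A couple of
    // identical ALU instructions are often cheaper than one thread
    // writing to shared memory, a __syncthreads(), and a re-read.
    double scale = 1.0 / (double)(blockIdx.x + 1);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale;
}[/codebox]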
Cheers,
kbam
PS Sometimes I forget the 80:20 rule (get 80% of the benefit for just 20% of the effort) and find that I have put far too much time into trying to make a perfect design.
I'm allocating each of my input arrays as its own contiguous 1D array, then setting the pointers in the corresponding double ** variable to the correct starting locations inside each contiguous array. Also, during my large-size testing, my input will be 400MB+ of data for a single day. Based on the calculations I have to do, I need 4-5 matrices of these large arrays to hold the input, which means a single day's worth of data will occupy over half of one GPU device's memory at a given moment during the kernel call.
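For context, the host-side layout looks roughly like this (a sketch with made-up sizes; error checking omitted):
[codebox]#include <cuda_runtime.h>
#include <stdlib.h>

// Sketch: flatten one rows x cols matrix into a single contiguous
// allocation so it can move to the device with one cudaMemcpy.
void setup(int rows, int cols)
{
    double *h_array = (double *)malloc(rows * cols * sizeof(double));
    double **h_ptr  = (double **)malloc(rows * sizeof(double *));
    for (int r = 0; r < rows; ++r)
        h_ptr[r] = h_array + r * cols;   // host-side row pointers

    double *d_array;
    double **d_ptr;
    cudaMalloc((void **)&d_array, rows * cols * sizeof(double));
    cudaMalloc((void **)&d_ptr, rows * sizeof(double *));

    // One contiguous copy for the data. The pointer table is NOT
    // copied, because host addresses are meaningless on the GPU;
    // it gets rebuilt inside the kernel (see the earlier sketch).
    cudaMemcpy(d_array, h_array, rows * cols * sizeof(double),
               cudaMemcpyHostToDevice);
}[/codebox]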
How large is texture memory? I'm using the Tesla C1060 devices. Sorry, I don't know much about texture memory since I've never programmed with it.
Oh, and using the if/else statement barely increases my performance. It wasn't really worth the trouble, but my kernel calls do finish slightly (< 0.1%) faster now.
The idea that you never want to have a single thread doing something while every other thread in the block waits at a barrier is completely wrong. This is often the best way to accomplish something (e.g., loading a single variable into shared memory).
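For example, the classic one-thread-loads pattern (a generic sketch, not your actual code):
[codebox]__global__ void kernel(const double *g_scale, double *data, int n)
{
    __shared__ double scale;

    // A single thread loads the value into shared memory...
    if (threadIdx.x == 0)
        scale = g_scale[blockIdx.x];
    __syncthreads(); // ...and the barrier makes it visible to all

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= scale;
}[/codebox]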