I have written a CUDA program, but I can't get an ideal speedup: right now it is only about 10 times on a GTX 280. Could you please help me optimize the program? My questions are below.
Coalescing. How do I make accesses to a 1D array always coalesced? Should I use cudaMallocPitch()? I only know the benefit of coalescing, but I have no idea how to achieve it. Could you please give me some ideas? Thanks.
Texture. To what extent is texture memory better than coalesced global memory?
Grid size. How should I set the grid size? Sometimes when I run the program, it exits with an “Unspecified launch failure” error message. In that situation I split the kernel into several launches, and it works well, but I want to know why it works. Some said the reason is the Windows watchdog, but the program has other kernels that run longer than these problem kernels, and they run fine. I am confused.
Large arrays. How should I deal with a large array that can't be loaded into shared memory? Also, is it faster to read 64 float4 values than 256 floats?
Other optimizations. Do you know other optimization methods to speed up the program?
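To make the coalescing and float4 questions concrete, here is a minimal sketch of the access patterns I mean. It is not my real kernel: `copyFloat`, `copyFloat4`, and `runCopies` are hypothetical names, and the array length is assumed to be a multiple of 4. As far as I understand, when thread i touches element i, consecutive threads touch consecutive addresses, which is the pattern that should coalesce. The float4 version moves the same data with a quarter of the threads. The sketch also checks the error status after each launch, which is how I see the launch failure message.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread i reads and writes element i, so consecutive
// threads access consecutive 4-byte words.
__global__ void copyFloat(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Same data viewed as float4: each thread moves 16 bytes at once.
__global__ void copyFloat4(const float4* in, float4* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Returns the launch error if any, otherwise the result of waiting for the kernel.
static cudaError_t checkLaunch()
{
    cudaError_t err = cudaGetLastError();
    return (err == cudaSuccess) ? cudaDeviceSynchronize() : err;
}

// Runs both copies and returns true only if both launches succeed and the
// copied data matches. Assumption: n is a multiple of 4.
bool runCopies(int n)
{
    std::vector<float> host(n), back(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&out, bytes) != cudaSuccess) { cudaFree(in); return false; }
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;

    // One float per thread.
    copyFloat<<<(n + threads - 1) / threads, threads>>>(in, out, n);
    cudaError_t err = checkLaunch();
    bool ok = (err == cudaSuccess);
    if (ok) {
        cudaMemcpy(back.data(), out, bytes, cudaMemcpyDeviceToHost);
        ok = (back == host);
    }

    // One float4 per thread. cudaMalloc pointers are aligned well beyond
    // 16 bytes, so the cast to float4* is safe here.
    if (ok) {
        const int n4 = n / 4;
        cudaMemset(out, 0, bytes);
        copyFloat4<<<(n4 + threads - 1) / threads, threads>>>(
            reinterpret_cast<const float4*>(in),
            reinterpret_cast<float4*>(out), n4);
        err = checkLaunch();
        ok = (err == cudaSuccess);
        if (ok) {
            cudaMemcpy(back.data(), out, bytes, cudaMemcpyDeviceToHost);
            ok = (back == host);
        }
    }

    if (!ok) printf("copy failed (last CUDA status: %s)\n", cudaGetErrorString(err));
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main()
{
    return runCopies(1 << 20) ? 0 : 1;
}
```

What I don't know is whether the float4 version is actually faster on the GTX 280, and whether my real access pattern can be rearranged to look like this.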