Kernel execution takes AGES

Hey everyone.

I have a very basic question. Here is my kernel:

_global__ void method(int A[], int k)


	int threads_per_block = blockDim.x * blockDim.y;

	int tid_within_block = threadIdx.x + threadIdx.y * blockDim.x;

	int grid_id = blockIdx.x + blockIdx.y * gridDim.x;

	int id = grid_id * threads_per_block + tid_within_block;


	int i = id/N;						

	int j = id % N;

	int ik = A[i * N + k];

	int kj = A[k * N + j];

	if (A[id] >  ik + kj)


		A[id] = ik + kj;            // <-- line to be COMMENTED OUT



The above kernel is run in the for loop 1000 times. After each iteration cudaThreadSynchronize() is called.

The code as it is above takes 24 seconds to complete.

HOWEVER, if we comment out line:

A[id] = ik + kj;

it takes only 0.25 seconds!!! (Obviously by commenting out im getting the wrong result, but i was just testing why the kernel is sooo slow by trying to comment out

different bits of the code).

So can anyone tell me, why writing to the array takes so long? And how can i improve it without using shared memory or constant memory. I want to be able to run it fast using just global memory.

Why does a simple line of code take so long. It should take just one clock cycle. Its very weird and frustrating.

Thanks for any suggestions!!!

You have just discovered the dead code optimizer. When you comment out that assignment line, the compiler discovers that all of the code in your kernel has no effect on global memory and eliminates all of it. You are timing how long it takes to launch 1000 kernels in the fast case. (No memory reads happen because they get optimized away.)

As for why your kernel executes so slowly when all the lines are present: Can you explain what blockDim and gridDim you use when you launch? How many elements are in A? I’m trying to estimate how many memory transactions are required to execute this code.

The size of A is:

1000 * 1000.

It is called 1000 times. I use 100 x 100 blocks and each has 10 x 10 threads. I am aware that its not the most optimal way but i thought it would take 1 second to execute the whole thing 1000 times.
I am trying to implement Harish Floyd-Warshall algoirhtm and his implementation for 1000 by 1000 matrix takes 10^3 ms which is insane compared to my 24 seconds.

Thanks again for any feedback!

When you call the kernel 1000 times, is that for k=1…1000? If yes, you should probably collapse these into a single kernel so that tiled memory access will maximize data reuse.

What compute capability are you targeting / what card are you testing this on? Can you use a pitch of 1024 to properly align memory accesses? As seibert already said, think about what memory access pattern your kernel has and how you could optimize that. You can also easily eliminate the expensive % and / operations by reorganizing your thread and block indices, although that will probably not improve the speed of your memory bound kerne.

Yes, you are right, it is 1000 times for k= 1 to k = 1000. What do you mean by collapsing it into a single kernel? Does it kernel will be executed once? Can you elaborate or provide any online material?

(Compute capability is 2.0 and card is NVS 4200M.)

OK, so as written, all the kernel calls combined do 2 billion 32-bit reads and up to 1 billion 32-bit writes (depending on how often that if statement is true), or a total maximum of 12 GB of data transfer. The peak memory bandwidth of a NVS 4200M is 12.8 GB/sec, so with perfect coalescing of memory transactions (i.e. every warp reads data words in consecutive order, roughly speaking) the shortest run time for this program (all 1000 kernel calls) would be 1 second, not counting the overhead of launching each kernel or doing the arithmetic in each thread.

Since you are getting 24 seconds, that would suggest the memory read pattern is degrading performance quite a bit. Certainly the read A[i * N + k] will not be coalesced since every thread will be reading the same column from a different row. The throughput for this line could be very poor. The second read, A[k * N + j], should go much faster, I think. Since N is not divisible by 32, you will sometimes have a warp read split into two memory transactions. I don’t know how bad the throughput could be in the first line, though. Can you run this code in the Visual Profiler?

Another side point: Are you going to have a race condition with threads reading and writing entries in the A array at the same time? It isn’t obvious to me that threads are not going to randomly get old or new values when running the kernel.

I am not getting race condition since the array A is updated for k = 1.
Then the call is synchronized using cudaThreadSynchronize(). Then next iteration is launched. What can i do to my matrix to make the data access more coalesced?
Could you provide any link that explains a bit about it?

Thanks again, you guys have been very helpful.

Look at the inefficient memory read that seibert pointed out. Have a block load that in a coalesced manner into shared memory and use it from there. The matrix transpose example from the SDK has more info on how to do that.

Provided you can work your way around needing global synchronization, you can improve on that by reusing data for different values of k as well.