Suspected wrap size issue with my matrix-related kernel code

For example, I have a 4 * 5 matrix with size = 4 * 5 = 20 and rows = 4.

The first part of my kernel code deals with the matrix and uses indexes from 0 to size - 1 and works just fine for any value of size.

The second part of my code uses fewer indexes from 0 to rows - 1.

The third part of my code uses fewer indexes from 0 to rows - 2 and decreases till it uses indexes from 0 to 1.

As size exceeds the wrap size, that is 32, the second part of my code gives wrong results, and consequently the third part gives wrong results.

I suspect this has something to do with wrap size. What can the issue possibly be here?

possibly failing to declared shared memory as volatile

possibly missing __syncthreads() primives in case you are using shared memory where one thread will read results written by another thread (similar problems could arise when using global memory to store intermedidate results with the intention to share them between threads.

be aware that threads from different warps can be at different stages of code execution at any time. A __syncthreads() will force reconvergence.

cuda-memcheck has a race check tool that can identify shared memory related data races.

Maybe post a complete, compilable code example so people can help with concrete suggestions.

    if(index < sizeOfMatrix)
    {
        int lead = (size - 1) - ((sizeOfMatrix + 1) * ((sizeOfMatrix - 1) - index));
        d_result[index] = d_matrix[lead] / d_matrix[lead - sizeOfMatrix + index];
        __syncthreads();
    }
    for(int i = sizeOfMatrix - 1; i > 0; i--)
    {
    	if(index < i)
    	{
    		int nume = ((sizeOfMatrix + 2) * index) + (i - index);
			int deno = (sizeOfMatrix + 2) * index;
			d_result[index] = d_result[index] - ((d_matrix[nume] / d_matrix[deno]) * d_result[i]);
			__syncthreads();
		}
    }

This is the part of the code I am having an issue with. I didn’t post the whole code as it is too large.

__syncthreads() within conditional code such as an if() clause that excludes some threads from execution is a big “no no!”.

All threads must participate in __synctreads(), or undefined behavior may occur (the worst being a complete deadlock of the kernel, i.e. GPU hangs until timeout)

Is index a function of threadIdx.x/y/z ?

consider splitting your d_result into two separate arrays. One for the intermediate values written in the first if clause, and another array for writing the final result computed in the for loop.

I have a feeling that during execution of the 2nd loop you’re overwriting data in d_result[index] that is still being read from d_result[i]

You could also store these intermediate results in shared memory.

Christian

Your suggestion worked!

I placed the __syncthreads() outside the last “for” loop.

Thanks a lot! I do have a lot to learn…

One last query, is there any way to speed up dependent calculations like this one?

Avoid use of global memory except for storing final results. Shared memory is much faster, for example.

Read input data through the texture cache if possible, especially when elements are being read multiple times or out of order ( look up the use of const restrict keywords, or the __ldg() intrinsic if you do not want to explicitly declare texture objects or texture references)

When reading/writing from global memory, watch out that each warp performs coalesced access (i.e. write elements sequentially in thread order). This can speed up access greatly.

Thanks again! I will try these in my code and let you know the results!

I have some memory specific questions:

  1. Texture memory or Shared memory?

  2. Shared memory or Registers? Should I try to avoid misaligned accesses by transferring values needed to registers?

  3. Does misaligned access in shared memory/any other memory also affect performance?

  4. Can cudaMemcpy be used to transfer input data directly to texture memory? Can texture memory be a replacement for global memory?

  5. Does __lgd() intrinsic make use of L1 cache?

  6. How much should I make use of the registers? Is global-to-shared-to-register better than global-to-register transfer(consider one variable/one element of the 2d matrix in global memory being needed by each thread)?

GPU can’t transfer data directly between memory areas, so global-to-shared is compiled to global-to-register-to-shared. Overall, you may want learn PTX in order to know internal GPU architecture and, optionally, to look into the PTX code generated from your program

L1 cache (including texture cache) and shared memory have similar delays, but different coalescing rules. The coalescing rules are described in the CUDA manual

texture isn’t the special memory, but the special way to access to the same VRAM, involving separate L1 cache (but the same L2 cache), which is programmed using either ldg() or by declaration of texture objects. As far as you are using Kepler or newer architecture, it’s easier to use just ldg()

Use registers as much as you can. They have more space (256KB per SM compared to 64-128 KB of shared mem + L1 cache), much faster (shared mem / L1 cache have 30-100 cycles load-to-use delay), but can’t be indirectly indexed. So use shared/cached memory only when you need indexed access or to exchange/share data between multiple threads.

CUDA doesn’t support misaligned access at all, look at adjancent thread for solution: https://devtalk.nvidia.com/default/topic/1027887/cuda-programming-and-performance/emulating-unaligned-memory-reads-writes-with-aligned-ones/

So is it a good practice to declare local variables in the kernel function to store the necessary indexed and non-indexed values from the global memory and then do the calculations using those variables? If not, what should I do to optimize my code?