Please help with __shared__ memory different usage than in samples

I have a more specific question now. I understand in principle how to rewrite the program to use shared memory.

I read 16×16 blocks from the data and calculate only on the inner 14×14 part. Once the calculation is done, I write the 14×14 part back to global memory; this should speed things up.
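The tiling scheme described above can be sketched roughly like this, assuming a row-major double array with stride `pitch` elements and an illustrative four-point stencil (all names and the stencil itself are made up for the example):

```cuda
#define TILE  16
#define INNER 14   /* TILE minus a one-cell halo on each side */

/* Hypothetical stencil: each interior cell is updated from its
   four neighbours, read from the shared 16x16 tile. */
__global__ void stencil16(const double *d_in, double *d_out, int pitch)
{
    __shared__ double tile[TILE][TILE];

    /* Global coordinates of this thread's element. Adjacent blocks
       overlap by two cells so the 14x14 interiors abut. */
    int gx = blockIdx.x * INNER + threadIdx.x;
    int gy = blockIdx.y * INNER + threadIdx.y;

    /* All 256 threads load one element each. */
    tile[threadIdx.y][threadIdx.x] = d_in[gy * pitch + gx];
    __syncthreads();

    /* Only the inner 14x14 threads compute and write back. */
    if (threadIdx.x >= 1 && threadIdx.x < TILE - 1 &&
        threadIdx.y >= 1 && threadIdx.y < TILE - 1)
    {
        d_out[gy * pitch + gx] =
            0.25 * (tile[threadIdx.y][threadIdx.x - 1] +
                    tile[threadIdx.y][threadIdx.x + 1] +
                    tile[threadIdx.y - 1][threadIdx.x] +
                    tile[threadIdx.y + 1][threadIdx.x]);
    }
}
```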

What I still have no idea about is how to exploit coalesced memory reads. First of all, I have double instead of float. That could still be solved, but the bigger problem is that I have to read 16×16 tiles but write back only 14×14, so the read and write patterns are completely different. If the reads are coalesced, the writes won't be. (Or is it possible on sm_13 to access addresses that are not strictly consecutive, as long as they don't conflict?)

Another question: when using local memory, is there a special type to force a variable into a register (like the `register` keyword for CPUs)?

Laszlo

Memory coalescing works for 64 bit types as well as 32 bit types, so coalesced reading and writing of doubles is no different to floats. Coalesced memory access (either reads or writes) requires that a half warp of threads operate on a given contiguous 16 word global memory segment aligned on a 16 word boundary. But not every thread within the half warp actually has to read or write into the segment. Divergent half warps can still execute coalesced memory access (see Figure 5-1 in the programming guide for an illustration). There is no problem in not having all of the threads in a block executing writes, as long as the addressing scheme follows the alignment requirements.

So to achieve coalesced global memory access, all that has to happen is that threads within a half-warp read from the same 16 word aligned memory segment. If your storage is row major ordered, that means each group of 16 sequential threads must read and write row-wise through your global memory arrays, aligned on 16 word boundaries. If your data is column major ordered, then each group of 16 sequential threads must read and write column-wise, aligned on 16 word boundaries. It also implies that all your storage needs to be padded in the major ordering direction to even 16 word boundaries. You should note that the numbering scheme for threads inside a block and blocks inside a grid is column major order, so in some ways it is more natural to write code for column major ordered storage than row major, but both are trivial to implement.
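The padding in the major ordering direction can be left to the runtime: `cudaMallocPitch` rounds each row up to an aligned stride. A minimal sketch for a row-major double array (the sizes are illustrative):

```cuda
#include <cstdio>

int main(void)
{
    const int nx = 100, ny = 64;  /* 100 doubles per row: not a multiple of 16 */
    double *d_a;
    size_t pitchBytes;

    /* Each row is padded so every row starts on an aligned boundary,
       which keeps row-wise half-warp accesses coalesced. */
    cudaMallocPitch((void **)&d_a, &pitchBytes, nx * sizeof(double), ny);

    /* Kernels should then index rows with the pitch in elements, not nx. */
    size_t pitch = pitchBytes / sizeof(double);
    printf("pitch = %zu elements\n", pitch);

    cudaFree(d_a);
    return 0;
}
```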

In your finite difference code, you will need to think about how to handle overlapping blocks, because that can have major implications for memory coalescing.

The code uses a very specialized grid in a polar coordinate system. (For programming purposes it is the same as a rectangular grid with the upper and lower boundaries periodic, the left a wall, and the right a freestream outlet.)

Solving the Navier-Stokes equations is not the implementation problem. What could have a memory impact is the SOR iteration, where the periodic boundaries come in. In that case, is the memory copy a huge slow-down, or does it coalesce automatically?

P.S.: I have just seen NVIDIA Nexus. Is it worth trying, or the normal SDK is more stable at the moment?

I haven’t implemented successive over relaxation methods in CUDA so I can’t answer that.

I don’t develop in Windows, so I don’t have an opinion on that either.

Okay, so you have misunderstood my question. It was: if I issue a direct cudaMemcpy on the GPU, is it as effective as writing a kernel that copies the data to shared memory with coalesced reads and writes it back to the other address?

And for local memory, is there a type that forces a variable into a register?

Yes, I really did misunderstand it. Completely. cudaMemcpy is pretty optimal; you can assume that there isn't a faster way to copy contiguous blocks of memory. Of course, one potential problem is that only one dimension of your computational domain is contiguous in memory. Whether that is important is completely a matter for your code.
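For the periodic boundary exchange, a device-to-device `cudaMemcpy` is the simplest approach when the periodic direction is the contiguous one. A sketch under that assumption (array layout and names are illustrative):

```cuda
/* Exchange ghost rows for a row-major (ny+2) x pitch array d_u,
   where rows 1..ny are interior and rows 0 and ny+1 are ghosts. */
void exchange_periodic_rows(double *d_u, int pitch, int ny)
{
    /* top ghost row <- last interior row */
    cudaMemcpy(d_u, d_u + (size_t)ny * pitch,
               pitch * sizeof(double), cudaMemcpyDeviceToDevice);

    /* bottom ghost row <- first interior row */
    cudaMemcpy(d_u + (size_t)(ny + 1) * pitch, d_u + pitch,
               pitch * sizeof(double), cudaMemcpyDeviceToDevice);
}
```

If the periodic direction is the non-contiguous one, `cudaMemcpy2D` with the array pitch would be the equivalent strided copy.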

The compiler makes its own determination as to whether a variable can be kept in a register or not. I used to think the `register` storage keyword helped, but I am now pretty sure it doesn't.

I have done some profiling on the code. The interesting part was that some kernels generated only coalesced writes (where every thread writes back just one value) but only uncoalesced reads, even though I haven't optimized them. Is it possible that some kernels work like this by accident?

Best regards,

                                                                                        Laszlo Daroczy

P.S.: If I have to pass only one value to all threads, what’s the fastest way to do it?

I am not sure what you mean by "accident". The GPU always tries to coalesce reads and writes; it is only when a kernel requests memory in ways that can't be serviced in that manner that sequential memory reads are used. So this naïve kernel will have fully coalesced reads:

__global__ void kernel(float *in, float *out)
{
	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
	out[idx] = 5.0f * in[idx];
}

and this version will not:

__global__ void kernel(float *in, float *out)
{
	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
	out[idx] = 5.0f * in[idx+1];
}

If you have a single parameter to pass to all threads, constant memory is the best way to do it. If the constant value needs to be determined at runtime rather than compile time, then use cudaMemcpyToSymbol().
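As a sketch of that pattern, with made-up names (an SOR relaxation factor is used here purely as an example of a runtime-determined scalar):

```cuda
__constant__ double c_omega;   /* visible to every thread, cached on chip */

__global__ void relax(double *u, const double *rhs)
{
    unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
    u[idx] += c_omega * rhs[idx];   /* every thread reads the same constant */
}

void launch(double *d_u, const double *d_rhs, double omega, int n)
{
    /* Upload the runtime value into constant memory before the launch.
       (Assumes n is a multiple of 256 for brevity.) */
    cudaMemcpyToSymbol(c_omega, &omega, sizeof(double));
    relax<<<n / 256, 256>>>(d_u, d_rhs);
}
```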

Thanks very much for all the useful information! Now I think I can start to reorganize the code. I also managed to find a way to convert lightweight branches into arithmetic statements, so I can eliminate the divergent branches as well!

Best regards,

                                            Laszlo D.

So if I understand correctly, on compute 1.1 every member of the half-warp must access contiguous memory locations.

But what does this actually mean for 1.2 (which is what I have)? I will use doubles (so 64-bit words, I guess), in which case the segments are 128 bytes (16 words per segment). But does the "any access pattern" mean that the threads don't need to load consecutive locations, but can load them in any order (though not the same one twice)?

Otherwise, I have problems with the amount of shared memory. I use 16×16 blocks and need nine separate matrices for the calculation, but shared memory cannot be bigger than 16 KB. What can I do in this case? (Because if I switch to 8×8, I risk losing the coalesced reads.)

The requirement isn’t just that the memory is contiguous, it is that each thread’s read must be in sequence within the contiguous segment, i.e. thread 0 must read word zero (or not read), thread 1 must read word 1 (or not read), etc.

It means the ordering requirement within the coalesced block is relaxed, compared with compute 1.0/1.1 devices. If you keep reading, you will also notice that compute 1.2/1.3 adds an additional mode which can do a 16 word read + an extra read to fetch memory for 16 threads, whereas the older hardware can only do either a coalesced 16 word read, or 16 separate fetches.

For coalescing, you only need 16 threads in the major order direction of the arrays reading from contiguous memory. There is nothing to say you can't have non-square blocks, i.e. anything from 16x3 (or 3x16) upwards in 2D should work for a second order discretisation. If you are still struggling with shared memory, then you might have to look at structuring the code differently so that different stages of the spatial calculations are done by different kernel launches.
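To put numbers on the non-square option: nine 16×16 double tiles need 9 × 256 × 8 = 18432 bytes, over the 16 KB limit, while nine 16×6 tiles (a 16×4 block plus one halo row above and below for a second order stencil) need only 9 × 96 × 8 = 6912 bytes, and the 16 threads in x keep the row-wise reads coalesced. A launch sketch with illustrative names:

```cuda
#define BX 16   /* 16 threads in x keeps row-wise accesses coalesced */
#define BY 4    /* shrinking y cuts the shared memory per block */

__global__ void step(const double *in, double *out, int pitch);

void launch_step(const double *d_in, double *d_out,
                 int nx, int ny, int pitch)
{
    dim3 block(BX, BY);
    /* Blocks overlap by the two-cell stencil halo in each direction.
       (Assumes nx, ny are multiples of the interior tile sizes.) */
    dim3 grid(nx / (BX - 2), ny / (BY - 2));
    step<<<grid, block>>>(d_in, d_out, pitch);
}
```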