Evaluating SMem Access Pattern and SMem Efficiency in NVVP

I have an implementation which uses shared memory and NVVP shows information that I interpret as contradictory, but you guys will be able to correct me. The kernel is:

__global__ void Vel_Convert(const float * __restrict__ vrms, const float * __restrict__ time, float *vint, const uint32_t LENGTH)
	{
	extern __shared__ float SM_Vrms2Vint[];		// Shared memory space
	float	*SM_Vrms = &SM_Vrms2Vint[0],		// Velocity portion of the memory
		*SM_Time = &SM_Vrms2Vint[blockDim.x];	// Time portion of the memory

	const uint32_t	GLOBAL_IDX	= blockDim.x * blockIdx.x + threadIdx.x,
			OFFSET		= gridDim.x * blockDim.x;

	for(uint32_t idx = GLOBAL_IDX; idx < LENGTH; idx += OFFSET)    // Fills the shared memory
	    {
	    SM_Vrms[threadIdx.x] = vrms[idx];
	    SM_Time[threadIdx.x] = time[idx];
	    }
	__syncthreads();

	if(threadIdx.x > 0)	// Branch divergence here, thread 0 does different job
	    vint[GLOBAL_IDX] = sqrt(((SM_Time[threadIdx.x] * (SM_Vrms[threadIdx.x] * SM_Vrms[threadIdx.x])) -
		(SM_Time[threadIdx.x - 1] * (SM_Vrms[threadIdx.x - 1] * SM_Vrms[threadIdx.x - 1]))) /	
		(SM_Time[threadIdx.x] - SM_Time[threadIdx.x - 1]));
	else
	    vint[GLOBAL_IDX] = SM_Vrms[threadIdx.x];
	}

The kernel is launched this way:

Vel_Convert <<< gSize, bSize, bSize * 2 * sizeof(float) >>> (rp_dVrms, rp_dTime, rp_dVint, NUM_LINES);

It takes 2 arrays as inputs and 1 as output as well as the length and I allocate shared memory for twice the value of block size and split it for 2 different input data that is used along the kernel.

NVVP reports “No Issues” in the Shared Memory Access Pattern (unguided analysis), while in the “Properties” tab to the right, which displays general information about the kernel, shows the yellow ! in “Efficiency/Shared Efficiency” reported as 39,3%.

When I see “No issues”, then I think there is no bank conflict, while the low efficiency can be associated to conflicts in various threads around the net. I suspect it is because of the way I read the current shared memory position (at threadIdx.x) and the position before it (at threadIdx.x - 1), which is needed for the calculation.

So I seek your advice to know if the low shared efficiency is due to bank conflicts (despite the “no issues” message) and how it can be improved.

shared_efficiency is a metric. There is a definition given for it:

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

It doesn’t necessarily have anything to do with bank conflicts. The warp divergence you’ve already indicated will lower the efficiency.

Thanks for the feedback, Robert.
I first read here, https://stackoverflow.com/questions/50956939/cuda-shared-memory-efficiency-at-50, about the efficiency potentially being due to bank conflict (which was the case in this SO post).

In my particular problem, the number of threads unfortunately is not necessarily a multiple of 32, it can be anything like 240, 300, which is the number of samples processed by each block: a 2D matrix flattened to 1D so that each block has as many threads as the length of a dimension. Since one of the warps of the block will not be full, maybe the profiler is accounting for it too?

As for the “No Issues” indicated by the Shared Memory Access Pattern, is it NVVP’s way of saying it is conflict-free?

2 things:

(1)

This foor loop looks a little odd:

for(uint32_t idx = GLOBAL_IDX; idx < LENGTH; idx += OFFSET)    // Fills the shared memory
	    {
	    SM_Vrms[threadIdx.x] = vrms[idx];
	    SM_Time[threadIdx.x] = time[idx];
	    }
	__syncthreads();

You are iterating over global memory via ‘idx’ and writing over the same shared memory location? I suppose its more of a global memory boundary check than an itertive process?

So what happens to the share memory values that don’t get assigned, these will still be used in your computation down the line…

(2)

With such a neighborly access pattern of SM_Time and SM_Vrms variables, you might as well structure your problem so that you use warp shuffle instructions instead, they are guaranteed the same performance as optimal shared memory access.

@Robert, thanks for your ponderations and apologies, I have just finished a running version and was alarmed by the profiler’s reports. I will do a more thorough inspection and put up a compilable program in case my doubt persists.

@Jimmy, “idx” receives GLOBAL_IDX, which is the typical “blockDim.x * blockIdx.x + threadIdx.x” for unique indices, and then each thread assigns a value from global memory to its own shared memory position. It is a regular grid-strided loop over the global memory and threadIdx.x assures that all threads write their respective value to smem.
Can you elaborate on your second consideration?

Regarding (1), just to make an example of my line of thought:

The first thread in the block (with threadIdx.x=0) could potentially then do:

SM_Vrms[0] = vrms[GLOBAL_IDX];
SM_Vrms[0] = vrms[GLOBAL_IDX + 1OFFSET];
SM_Vrms[0] = vrms[GLOBAL_IDX + 2
OFFSET];

Do you see anything fishy about that?

Jimmy, thanks for pointing this out, now I understand what you meant from the previous comment. And this goes in the direction of what Robert said about not trying to point out problems on incomplete code, and that I normally avoid programming after drinking too much Jack Daniels but eventually break the rule. This would work if I were not using shared memory.

After I fix this logic, I will then see if this pattern of a thread accessing a shared memory element at [threadIdx.x] and then at [threadIdx.x - 1] is the main cause of this drop in efficiency.
Another factor, that I can’t escape from, is that thread 0 will not do the full calculation as there is no element before it. Plus the fact that the block size can be any number and not a multiple of 32.

Suppose I have a matrix of 240 x 250000. I need 240 threads to work cooperatively on each set of 240 samples, so it is actually of no use to launch 256 because 16 won’t do work anyway (and will potentially eat some L1/L2 space that I need for the shared memory). There is certainly some penalty I can’t avoid, but I hope it is not imoral to launch non N*32 threads.

Well the hardware will schedule an integral number of warps and the tail 16 threads will be executing no-ops. Its definitely not immoral to launch 240 threads in a block :D

One could imagine launching a block of say 4x32 (blockDim.x=32, blockDim.y=4) threads, and letting each of the 32 x-threads work on a set of 240 samples, hence each block would cover 4 sets of 240 samples. This would increase the work per thread (each thread process 240/32 elements) and alleviate the need for SMEM synchronization.

Next you might then be able to figure out a clever scheme to remove SMEM altogether by only employing warp shuffle instructions, which is guaranteed to have optimal efficiency.

Jimmy, first of all, thanks for your patience and clarifications.

Yes, the scheduler will launch a full warp, but my assumption is that, if I launched the next N32 myself, I would need to further limit the work in code, that is, test if threadIdx.x <= 240 (240 is our arbitrary value here, it could be anything else), but it wouldn’t prevent the unused thread from taking up extra L1/L2. So maybe you can correct me, as what I think is that threads of a warp not explicitly launched by the kernel will be effectively “empty”, so to speak, and if I explicitly launch them to stick to N32 will require manual filtering and extra work.

I understood your 2D suggestion, but this specific problem (the long assignment) is a kind of moving average, 2 by 2, which means that by the time thread the 1st thread of 2nd block starts, it needs to read values obtained from last thread of previous block. That’s why I had to stick to a 1D setup.

But I managed to fix the kernel and moved all the operations to inside the for() loop, which fixes the problem you pointed out, so the operations will happen on the newest data (just needed an extra __syncthreads). I’m not really familiar with warp shuffle. Let me have a look at it.

After venturing into warp shuffle for a day, I already noticed that divergence is a problem to be considered, not only because of program inefficiency, but because of wrong results.

Is it immoral to use __syncthreads with warp shuffle or it totally misses the point of employing the technique?

warp shuffle using the as-of-CUDA-10-non-deprecated-prototypes (i.e. _sync variants) generally should not require an extra __syncthreads(). The _sync variants will take a mask parameter that will reconverge threads.

Thanks for your clarification, Robert.
I will study and try a bit more the technique.

Can you guys tell me if I should worry about the mask value or warp size when not launching a number of threads not multiple of 32?
If I launch, for example <<< gridSize, 101 >>>, which will yield 3 full warps and 1 with just 5 threads and 27 no explicitly launched, does it change something in this:

const unsigned MASK = 0xffffffff;

V1 = __shfl_up_sync(MASK, V2, 1, 32);

What it does is assign the value of V2 in the previous thread/lane to V1 in the caller thread. However, I am not sure if I have to somehow account for this code running on a warp that has just 5 (or less than 32) threads.

Yes, it matters. For correctness, the mask should match the warp lanes you expect to participate.

In this case, how can I find which block will have the “trailing” threads, so to speak, and how to appropriately determine the mask?
I’ve been through the “classic” guides, such as https://devblogs.nvidia.com/cuda-pro-tip-kepler-shuffle/, https://devblogs.nvidia.com/using-cuda-warp-level-primitives/ and https://devblogs.nvidia.com/faster-parallel-reductions-kepler/.
But as far as I could see, they mostly show examples of perfectly-sized problems (for simplicity).

If you are unsure, one way to find the participating threads is to store __activemask() before any thread divergence occurs.

Thanks, Tera.
I will definitely have a look at it.

In your examples so far, it seems like you intend to launch blocks of a size that is not a multiple of 32. In that case, the last warp (of every block) will be non-full, and it is simple arithmetic to compute how many active threads are in that warp.

If you actually meant that you intend to launch threadblocks of a multiple of 32, then only (possibly) the last threadblock (the highest numbered threadblock, e.g. blockIdx.x) will have a non-full complement of threads, and again it is simple arithmetic to compute how many active threads are in the last warp in that block.

Thanks, Robert
I am working on it at the moment with the leads I got here. Hopefully I have good news in the next days.

I was probably not very clear in discussion of the last threadblock in a grid.

Of course, all threadblocks in a grid start out with the same number of active threads.

However it is common programming practice to create kernels that can handle arbitrary-sized data. In such a case, your kernel may include a grid-stride loop, or simply have a thread “check”:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < N) {

  // kernel body here
}

In that case, inside the body of the if-statement, for the last block only, there may be a non-full warp.