Shared memory using structure instead of array

Hello,

I’m using structs for shared memory (node_dynamic). Any idea where my bug is?
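Roughly, the pattern is something like this (a simplified sketch, not my exact code; the node_dynamic fields shown here are placeholders):

```
struct node_dynamic {
    float pos_x, pos_y;   // placeholder fields, not my real ones
    float vel_x, vel_y;
};

__global__ void my_kernel(const node_dynamic *g_nodes, int n)
{
    __shared__ node_dynamic s_node;                 // one struct for the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s_node = g_nodes[i];                        // every thread writes here
    __syncthreads();
    // ... computation using s_node ...
}
```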

Is i your thread index?

Do you have multiple threads attempting to load different values into the same location in shared memory? That’s what it looks like to me.

https://devblogs.nvidia.com/using-shared-memory-cuda-cc/

Yep, exactly, i is the thread index.
Oh yes, I think each thread overwrites my shared memory all the time. But I would need to assign an array that would ultimately be as large as the big global 2D array, and that is not really possible. Any suggestions to solve this?

It looks like you perhaps need two node_dynamic per thread. I don’t know the size of that structure, but shared memory should readily allow you to hold 48KB, so that would be 48 bytes per thread for a threadblock of 1024 threads, or perhaps 96 bytes per thread for a threadblock of 512 threads. If what I see are all float quantities, that would be 48 bytes per struct, so 96 bytes per thread.

Using that method may reduce your occupancy and limit your performance. Certain GPUs can go to 64KB or even 96KB of shared memory per threadblock.
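As a sketch, assuming a 512-thread block and two structs per thread (BLOCK_SIZE, the placeholder struct, and the indexing are illustrative, not taken from your code):

```
#define BLOCK_SIZE 512

struct node_dynamic { float f[12]; };   // placeholder: 12 floats = 48 bytes

__global__ void my_kernel(const node_dynamic *g_nodes, int n)
{
    // two node_dynamic per thread: 2 * 512 * 48 bytes = 48KB of shared memory
    __shared__ node_dynamic s_nodes[2 * BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    node_dynamic *mine = &s_nodes[2 * threadIdx.x];   // this thread’s private pair

    if (i < n) {
        mine[0] = g_nodes[i];   // each thread fills only its own slots
        // mine[1] = ...;       // second struct for this thread
    }
    __syncthreads();
    // ... computation using mine[0] and mine[1] ...
}
```

That way no two threads ever touch the same shared memory location.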

I was able to implement it, thank you very much!

I’m wondering why there is no performance gain when using shared memory. Do you know why?

Is there re-use of the data in shared memory? What does the CUDA profiler tell you about the bottlenecks in your code before and after the change?

With the exception of the rho parameters, there appears to be very little data reuse in your code. What makes you think shared memory should provide a benefit?
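Generally, shared memory only pays off for values that are read repeatedly. A sketch of staging just the reused values (the g_rho array, its size, and the surrounding kernel are assumptions for illustration, not your code):

```
__global__ void my_kernel(const float *g_rho, int num_rho,
                          const float *g_in, float *g_out, int n)
{
    extern __shared__ float s_rho[];   // sized at launch: num_rho * sizeof(float)

    // cooperatively load the reused rho parameters once per block
    for (int k = threadIdx.x; k < num_rho; k += blockDim.x)
        s_rho[k] = g_rho[k];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x   = g_in[i];           // read once from global memory
        float acc = 0.0f;
        for (int k = 0; k < num_rho; ++k)
            acc += s_rho[k] * x;       // each rho value re-read cheaply from shared
        g_out[i] = acc;
    }
}
// launch: my_kernel<<<grid, block, num_rho * sizeof(float)>>>(g_rho, num_rho, g_in, g_out, n);
```

Data that each thread touches only once gains nothing from the extra copy into shared memory.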

I thought the faster access of shared memory compared to global memory would speed up the “complicated” computational part. Nsight tells me:

--------------------- Without shared memory ---------------------
General for 512 threads per block:
[Warning] Memory is more heavily utilized than Compute: Look at Memory Workload Analysis report section to see where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you’re efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or whether there are values you can (re)compute.

General for 32 threads per block:
[Warning] Compute is more heavily utilized than Memory: Look at Compute Workload Analysis report section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.

--------------------- With shared memory ---------------------
General:
[Warning] This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Scheduler Statistics:
[Warning] Every scheduler is capable of issuing two instructions per cycle, but for this kernel each scheduler only issues an instruction every 14.4 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 7.47 active warps per scheduler, but only an average of 0.10 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps either increase the number of active warps or reduce the time the active warps are stalled.

Warp State Statistics:
In particular, “Stall Short Scoreboard” shows the following warning:
[Warning] On average each warp of this kernel spends 40.2 cycles being stalled waiting for a scoreboard dependency on an MIO operation (not to TEX or L1). This represents about 37.4% of the total average of 107.5 cycles between issuing two instructions. The primary reason for a high number of stalls due to short scoreboards is typically memory operations to shared memory, but other contributors include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX). Consult the Memory Workload Analysis section to verify if there are shared memory operations and reduce bank conflicts, if reported.
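If shared memory bank conflicts are indeed the problem, my understanding is that padding the shared array is the usual fix. A sketch of the idea on a transpose-style access pattern (the 32×32 tile and kernel are illustrative, not my actual code; assumes blockDim = (32, 32) and a width that is a multiple of 32):

```
#define TILE 32

__global__ void transpose(float *out, const float *in, int width)
{
    // the +1 column of padding shifts each row by one bank, so the column
    // accesses below no longer all map to the same shared memory bank
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // transposed block offset
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```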

Do you have any tips to make my implementation faster? :-)