Number of kilobytes transferred to/from shared memory twice the expected

I’m storing two arrays of floats from global memory into shared memory. Then I compute the dot product of these vectors by reading from shared memory. I have left out the reduction step in the code snippet below for simplicity.

I call the kernel with 256 threads per block and 25 blocks. I made the arrays 256 long for simplicity (to make array size equal to # threads per block). Then processing one block will take 8 warps (each 32 threads).

The execution requires #blocks * #warps * #arrays = 25 * 8 * 2 = 400 read requests to shared memory and an equal number of write requests. Performance Analysis with Nsight confirms this (shared tab of memory statistics).

Because I’m using floats I would expect the number of bytes transferred from memory being #read requests * #threads per warp * #bytes per element = 400 * 32 * 4 bytes/float = 51.200 bytes = 50 kilobytes.

However, the shared tab of the memory statistics in Nsight shows 100 kilobytes transferred from shared memory. Similarly, it shows 100 kilobytes transferred to shared memory (expected is also 50 with similar reasoning).

Looking in the CUDA source view shows L1 Transfer Overhead as 2, where I expect 1.

In Nsight performance analysis -> CUDA Launches I see that Shared Memory Configuration Executed is FOUR_BYTE_BANK_SIZE.

My device is a Geforce GT 710 (so compute capability 3.5). I’m running on Windows 7 in vs2017 (VCToolsVersion 14.13). nvcc version is V9.2.148.

__global__ void
dot_product(float * x_g, float * y_g, int x_len, int y_len, float* dp_out) {

	__shared__ __align__(4) float x[256];
	__shared__ __align__(4) float y[256];

	for (int t = threadIdx.x; t < 256; t += blockDim.x) {
		x[t] = x_g[t];
		y[t] = y_g[t];
	}

	__syncthreads();

	//Dot product
	float dp = 0;

	for (int t = threadIdx.x; t < 256; t += blockDim.x) {
		dp += x[t]*y[t];
	}

	dp_out[blockDim.x*blockIdx.x + threadIdx.x] = dp;
	__syncthreads();
}

hmm somehow I remember only the GTX 780Ti had Compute 3.5 and the rest of the Keplers had 3.0.

Maybe this has something to do with it?
“A L1 cache line is 128 bytes and maps to a 128 byte aligned segment in device memory.”

(source, describing Compute 3.5 devices: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm )

A read or write access should access warpSize * sizeof(float) = 128 bytes exactly.

Could it be that the shared memory you get is not aligned with a cache line boundary and hence you get two requests per load or store?

Try using shared memory assigned through the 3rd kernel launch template argument (accessible via extern shared), maybe that behaves differently? Alternatively: Does align(128) work or is that number too big?

Christian

In this thread I demonstrate how to determine the starting offset of a shared variable within the block’s shared memory space:

https://devtalk.nvidia.com/default/topic/1039418/cuda-programming-and-performance/how-does-static-shared-memory-get-laid-out-across-the-banks-/

This technique could be used to verify L1 cache line alignment of your x and y arrays.

I tried:

  • using shared memory assigned through the 3rd kernel launch template argument
  • align(128)

Neither of these gives the expected number of kilobytes transferred…

Printing the starting offset of a shared variable within the block’s shared memory space (as you suggested) gives $00000000, so it seems aligned. I also tried casting the memory location to a unsigned long long (before I received your suggestion) and that gives 16777216, which is divisible by 128.

Could this be a device-specific issue? If so, I could get another device, but which one? (I wouldn’t like to spend much more than USD 100, though).

Code:

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>

__device__ unsigned int __forceinline__ shared_ptr_32bit(void *global)
{
	unsigned long long sharedptr;
	asm(" cvta.to.shared.u64 %0,%1;\n\t"
		: "=l"(sharedptr) : "l" (global));
	return (unsigned int)sharedptr;
}

__global__ void
dot_product(float * x_g, float * y_g, int x_len, int y_len, float* dp_out, unsigned long long int* xy_ptr) {

	//Option 1:
	//__shared__ __align__(128) float x[256];
	//__shared__ __align__(128) float y[256];

	//Option 2:
	extern __shared__ float xy[];
	float* x = &xy[0];
	float* y = &xy[256];

	for (int t = threadIdx.x; t < 256; t += blockDim.x) {
		x[t] = x_g[t];
		y[t] = y_g[t];
	}

	__syncthreads();

	//Dot product
	float dp = 0;

	for (int t = threadIdx.x; t < 256; t += blockDim.x) {
		dp += x[t]*y[t];
	}

	dp_out[blockDim.x*blockIdx.x + threadIdx.x] = dp;

	printf("&xy[0] = $%08x\n", shared_ptr_32bit(&xy[0]));

	if (threadIdx.x == 0)
		*xy_ptr = (unsigned long long int)xy;

	__syncthreads();
}

/**
 * Host main routine
 */
int
main(void)
{
	// Error code to check return values for CUDA calls
	cudaError_t err = cudaSuccess;

	int threadsPerBlock = 256;
	int blocksPerGrid = 25;

	size_t x_size = sizeof(float)*256;
	size_t y_size = sizeof(float)*256;
	size_t dp_size = sizeof(float)*threadsPerBlock*blocksPerGrid;
	size_t xy_ptr_size = sizeof(unsigned long long int);

	// Allocate the host input vectors
	float* h_x = (float*)malloc(x_size);
	float* h_y = (float*)malloc(y_size);
	float* h_dp = (float*)malloc(dp_size);
	unsigned long long int* h_xy_ptr = (unsigned long long int*)malloc(xy_ptr_size);

	// Verify that allocations succeeded
	if (h_x == NULL || h_y == NULL || h_dp == NULL)
	{
		fprintf(stderr, "Failed to allocate host vectors!\n");
		exit(EXIT_FAILURE);
	}

	// Initialize the host input vectors
	for (int t = 0; t < 256; t++) {
		h_x[t] = (float)t / 256;
	}

	for (int t = 0; t < 256; t++) {
		h_y[t] = (float)t / 256;
	}

	// Allocate the device input vectors
	float* d_x = NULL;
	err = cudaMalloc((void **)&d_x, x_size);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to allocate device vector d_x (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	float* d_y = NULL;
	err = cudaMalloc((void **)&d_y, y_size);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to allocate device vector d_y (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	float* d_dp = NULL;
	err = cudaMalloc((void **)&d_dp, dp_size);
	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to allocate device vector d_dp (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	unsigned long long int* d_xy_ptr = NULL;
	err = cudaMalloc((void **)&d_xy_ptr, xy_ptr_size);
	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to allocate device vector d_xy_ptr (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	// Copy the host input data from host memory to the device memory
	printf("Copy input data from the host memory to the CUDA device\n");
	err = cudaMemcpy(d_x, h_x, x_size, cudaMemcpyHostToDevice);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to copy vector h_x from host to device (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	err = cudaMemcpy(d_y, h_y, y_size, cudaMemcpyHostToDevice);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to copy vector h_y from host to device (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	// Launch the CUDA Kernel
	printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
	//Option 1:
	//dot_product << <blocksPerGrid, threadsPerBlock >> > (d_x, d_y, 256, 256, d_dp);

	//Option 2:
	size_t sharedMemorySize = x_size + y_size;
	dot_product << <blocksPerGrid, threadsPerBlock, sharedMemorySize>> > (d_x, d_y, 256, 256, d_dp, d_xy_ptr);

	cudaDeviceSynchronize();

	err = cudaGetLastError();

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	// Copy the device result vector in device memory to the host result vector
	// in host memory.
	printf("Copy output data from the CUDA device to the host memory\n");

	err = cudaMemcpy(h_dp, d_dp, dp_size, cudaMemcpyDeviceToHost);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to copy vector d_dp from device to host (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	err = cudaMemcpy(h_xy_ptr, d_xy_ptr, xy_ptr_size, cudaMemcpyDeviceToHost);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to copy vector d_xy_ptr from device to host (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	// Verify that the result vector is correct
	for (int i = 0; i < 256; ++i)
	{
		fprintf(stderr, "dp[%d]=%f\n", i, h_dp[i]);
	}

	fprintf(stderr, "xy_ptr=%llu\n", *h_xy_ptr);

	// Free device global memory
	err = cudaFree(d_x);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to free device vector d_x (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	err = cudaFree(d_y);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to free device vector d_y (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	err = cudaFree(d_dp);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to free device vector d_dp (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	err = cudaFree(d_xy_ptr);

	if (err != cudaSuccess)
	{
		fprintf(stderr, "Failed to free device vector d_xy_ptr (error code %s)!\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}

	// Free host memory
	free(h_x);
	free(h_y);
	free(h_dp);
        free(h_xy_ptr);

	printf("Done");
	return 0;
}

A nVidia GT 1030 (low profile?) card might be what you are looking for (about 70-90 USD plus tax)
https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=gt+1030&N=-1&isNodeId=1

If its 2GB RAM are too small, consider getting a GT 1050 or 1050 Ti with 4GB RAM.

I do not know if it will have different behavior regarding the number of transactions to the L1 cache.

The L1 cache and shared memory bandwidth is generally huge. Why are you so concerned about saturating this limit?

Christian

I think that you are right that it is device-specific. This link also supports that statement:

The link makes the statement that each shared memory bank has a bandwidth of 64-bits per clock cycle. That statement seems to be independent of the configuration used (4-byte or 8-byte). As there are 32 banks, and also 32 warps, we have to access 8-bytes per thread at onces to use this full bandwidth, for example with float2.

The programming guide also hints at it, but doesn’t say it explicitly:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory-3-0

I will not tweak my algorithm for this right now, since we will be running on a device with compute capability 6.1 eventually.

I ordered the card you suggested (2GB RAM is more than enough) and I’ll let you know if it doesn’t have the issue. It will be better anyway, as it’s compute capability of 6.1 equals the compute capability of the device we will be using eventually.

I don’t think it has the issue, since the programming guide only mentions the 32-bits and says nothing about 64-bits:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory-6-x
which refers to:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x

I am concerned about saturating the limit, since shared memory throughput is the bottleneck. I have a lot of compute operations left unused. I was thinking about fixing it by using half-precision floats for shared memory transfer and then change them into floats for computation. Then I could use up compute resources and do twice as many computations in the same time. I asked the question because I thought I was not seeing something simple, which I would then fix by using half precision floats. And indeed, I should apparently fix it by using float2’s. Both of these solutions make the code more difficult to read, so I don’t like them.

Right now the algorithm takes 6 seconds to process a 4-second signal on my GT 710. In production, we want to process multiple signals in parallel. With the current speed, it needs 6/4=1.5 streaming multiprocessors to process one signal. So for each signal more we do in parallel, we have to add 1.5 streaming multiprocessors. We actually want the delay being lower than 4 seconds. That would mean even more streaming multiprocessors. Many streaming multiprocessors means even a couple of high end cards or dozens of low end cards. That all takes up rack space, so costs. That’s why I’m concerned about taking away as many bottlenecks as possible. In this case I felt this meant saturating this limit.

I realize that the Kepler devices could reach quite a high shared memory throughput.
Some calculations of the #GB’s transferred from shared memory (using f_core * #banks * bank width):

  • Kepler: GT710: 954 MHz * 32 * 64 bits = 227 GB/s (of which I only attained half because I only accessed 32 bits)
  • Pascal: GT1030: 1152 MHz * 32 * 32 bits = 137 GB/s
  • Pascal: QP5000: 1607 MHz * 32 * 32 bits = 191 GB/s

I tested it on a nVidia GT 1030. The number of transactions to the L1 cache on that device are as I expected. NSight also doesn’t show L1 Transfer Overhead. I get about 80% of the throughput from shared memory I expected. That’s in line with the article here (https://arxiv.org/pdf/1509.02308.pdf). They achieve 83.9% of the theoretical throughput. I tried getting more throughput by also using the texture cache (with __ldg()). I thought I could make it faster by using both texture cache and shared memory. I think it doesn’t work since the number of active warps at a time (64) are not enough to cover the latencies. Don’t know how to prove though that’s what’s actually happening, other than by reasoning that the shared memory access followed by a FFMA already takes about 35+24 cycles in latency (http://www.stuffedcow.net/research/cudabmk), which I then need almost all warps for to cover. So I’ll just stick with the implementation I have.

May I ask what happened to this thread? Suddenly this looks like a monologue I had with myself.

There used to be an OP named “jvanprehn” who asked a technical question about shared memory and I responded.

Now the user and all his postings are gone. This looks like a really bad feature of a forum software to also delete a user’s entire posting history when the account is gone.

Christian

Hi Christian,

I am looking at the thread as admin and I see all the comments from “jvanprehn”. Can you send me a private message with a screenshot of your view? I will look into this issue.

Thanks,
Tom

Currently this problem appears to be rectified - I see all postings including the OP

I had the problem for about two days when I followed the link from my notification email into this thread. I was also unable to locate the thread by other means (searching for keywords in the title or by browsing recent topics). I tried both mobile and desktop browsers.

I am glad to hear it is working now. Did this by chance happen on 9/17-18? In the future feel free to ping me directly when you experience issues like this.

Thanks,
Tom

I think it happened on 9/21.