VisualProfiler ver 2.2


In the new 2.2 profiler, how does the “glob mem XXX throughput GB/s” is computed?

I have the following code in cpu:

for ( int i = 0; i < iMaxLoops; i++ )





where iMaxLoops is ~100.

Does the throughput somehow affected by the CPU code? or is it just the time the kernel runs divided by the amount

of data it reads? I’m using 1D textures to fetch the data, could it be that there is some sort of an issue with the

counters with regard to textures?

Also, attached is a screen shot from one of the loop runs. The interesting kernel is “CalculateEngine” (which is what

consumes 90% of the GPU calculation time). How should I read/analyze those numbers? are they high/low?

I guess I should mainly look at the divergant branch and instructions, no?

I’m using a GTX295 and tested it on one of the GPUs of the dual card.

The reason I ask, is that I got great improvement in GPU and the throughput I see is only 2GB/s???

If it wasn’t sad it would have been funny… :(

thanks in advance, any assistance is more then welcomed.


I haven’t tried the 2.2 profiler yet, but I would be very surprised if that 2Gb/s is not for 1 multiprocessor like all other counters. So you would have to multiply by (I believe 30 for gtx295?), so 60 GB/s would be the total performance.

Hi Denis,

Yes the 295 has 30 multiprocessors. I hope you’re correct, I guess 60GB/s out of 112GB/s is fairly good no?

Also, what do you think about the instruction throughput?

Any suggestions as to how to find how to squeeze more performance? or is it not in the scope of the profiler?

Guys from nVidia - I tried to find some references about the numbers and what they say, but I couldn’t (like the point Denis mentioned above, difference between GPU and CPU time, how to look

at the instruction ratio, etc…) is there some more detailed explaination on those counters somewhere?

I’d also be very happy if someone from nVidia could verify that the counters are indeed per MP and I should multiple it by 30.

BTW - I’ve tested the exact same code, with a different data and got a total of 4GB/s as the throughput. So I guess 4x30 is too much to be real value. Maybe it does some

sort of average between the two kernels, the iterations of the loop on the CPU that runs the kernels???

thanks again



Attached is a screen shot of the summary report for the previous run. The CalculateEngineParams show 18GB/s while the interesting kernel CalculateEngine gives 1.5GB/s. I guess 18GB*30 is well above the

bandwidth of the GTX295, no?

any insights??? :)



Ok I think my initial guess about the textures was correct… would be glad for confirmation though.

I ran a code posted here in the past by MisterAnderson42 (posted below)

Take a look at the attached results of this test ran under VisualProfiler 2.2.

All the texture related kernels are showing 0 as all statistics and btw the values in the other columns are reasonable and probably not per MP.

What do you guys think?



#define BLOCK_SIZE 128

texture<float4, 1, cudaReadModeElementType> tex_float4;

texture<float2, 1, cudaReadModeElementType> tex_float2;

texture<float, 1, cudaReadModeElementType> tex_float;

template <class T> __global__ void copy_gmem(T* g_idata, T* g_odata, T c)


		const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

		g_odata[idx] = g_idata[idx];


__global__ void copy_tex_float(float* g_idata, float* g_odata, float c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = tex1Dfetch(tex_float, idx);


__global__ void copy_tex_float2(float2* g_idata, float2* g_odata, float2 c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = tex1Dfetch(tex_float2, idx);


__global__ void copy_tex_float4(float4* g_idata, float4* g_odata, float4 c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = tex1Dfetch(tex_float4, idx);


template <class T> __global__ void write_only(T* g_idata, T* g_odata, T c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = c;


template <class T> __global__ void read_only_gmem(T* g_idata, T* g_odata, T c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	__shared__ T shared[BLOCK_SIZE];

	shared[threadIdx.x] = g_idata[idx];

	*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;


__global__ void read_only_tex_float(float* g_idata, float* g_odata, float c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	__shared__ float shared[BLOCK_SIZE];

	shared[threadIdx.x] = tex1Dfetch(tex_float, idx);

	*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;


__global__ void read_only_tex_float2(float2* g_idata, float2* g_odata, float2 c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	__shared__ float2 shared[BLOCK_SIZE];

	shared[threadIdx.x] = tex1Dfetch(tex_float2, idx);

	*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;


__global__ void read_only_tex_float4(float4* g_idata, float4* g_odata, float4 c)


	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	__shared__ float4 shared[BLOCK_SIZE];

	shared[threadIdx.x] = tex1Dfetch(tex_float4, idx);

	*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;


#define DO_BMARK(kernel, type, value, name, ntransfer) \

	{ \

	kernel<<< grid, threads >>>((type *)d_idata, (type *)d_odata, value); \


	cudaEvent_t start, end; \

	CUDA_SAFE_CALL( cudaEventCreate(&start) ); \

	CUDA_SAFE_CALL( cudaEventCreate(&end) ); \


	CUDA_SAFE_CALL( cudaEventRecord(start, 0) ); \

	for (int i=0; i < nIters; ++i) \

		{ \

		kernel<<< grid, threads >>>((type *)d_idata, (type *)d_odata, value); \

		} \

	CUDA_SAFE_CALL( cudaEventRecord(end, 0) ); \

	CUDA_SAFE_CALL( cudaEventSynchronize(end) ); \


	float runTime; \

	CUDA_SAFE_CALL( cudaEventElapsedTime(&runTime, start, end) ); \

	runTime /= float(nIters); \

	printf("%s - Bandwidth:	%f GiB/s\n", name, (ntransfer * len * sizeof(type)) / (runTime  * 1.0e-3 * 1024*1024*1024)); \

	CUDA_SAFE_CALL( cudaEventDestroy(start) ); \

	CUDA_SAFE_CALL( cudaEventDestroy(end) ); \


void BenchMarkTest()


	cudaSetDevice( 1 );

	int len = 1 << 22;

	int num_threads = BLOCK_SIZE;

	int nIters = 500;

	int V = 104, A = 161, G = 1, C = 13, S = 750;

	int iSize = V * A * G * C * S;

	float4 *d_idata, *d_odata;

	CUDA_SAFE_CALL( cudaMalloc((void**)&d_idata, sizeof(float4)*len) );

	CUDA_SAFE_CALL( cudaMalloc((void**)&d_odata, sizeof(float4)*len) );

	printf( "len[%d], sizeof(float4)*len[%d]\n", len, sizeof(float4)*len );

	CUDA_SAFE_CALL( cudaBindTexture(0, tex_float, d_idata, sizeof(float) * len) );

	CUDA_SAFE_CALL( cudaBindTexture(0, tex_float2, d_idata, sizeof(float2) * len) );

	CUDA_SAFE_CALL( cudaBindTexture(0, tex_float4, d_idata, sizeof(float4) * len) );


	dim3  threads(num_threads, 1, 1);

	//dim3  grid(len/num_threads, 1, 1);

	dim3  grid(len/num_threads, 10, 1);

	printf( "threads[%d], grid[%d]\n", threads.x, grid.x );

	//DO_BMARK(copy_gmem<float>, float, 0.0f, "copy_gmem<float>", 2);

	DO_BMARK(copy_gmem<float>, float, 0.0f, "copy_gmem<float>", 2 );

	DO_BMARK(copy_gmem<float2>, float2, make_float2(0.0f, 0.0f), "copy_gmem<float2>", 2);

	DO_BMARK(copy_gmem<float4>, float4, make_float4(0.0f, 0.0f, 0.0f, 0.0f), "copy_gmem<float4>", 2);



	DO_BMARK(copy_tex_float, float, 0.0f, "copy_tex<float>", 2);

	DO_BMARK(copy_tex_float2, float2, make_float2(0.0f, 0.0f), "copy_tex<float2>", 2);

	DO_BMARK(copy_tex_float4, float4, make_float4(0.0f, 0.0f, 0.0f, 0.0f), "copy_tex<float4>", 2);



	DO_BMARK(write_only<float>, float, 0.0f, "write_only<float>", 1);

	DO_BMARK(write_only<float2>, float2, make_float2(0.0f, 0.0f), "write_only<float2>", 1);

	DO_BMARK(write_only<float4>, float4, make_float4(0.0f, 0.0f, 0.0f, 0.0f), "write_only<float4>", 1);



	DO_BMARK(read_only_gmem<float>, float, 0.0f, "read_only_gmem<float>", 1);

	DO_BMARK(read_only_gmem<float2>, float2, make_float2(0.0f, 0.0f), "read_only_gmem<float2>", 1);

	DO_BMARK(read_only_gmem<float4>, float4, make_float4(0.0f, 0.0f, 0.0f, 0.0f), "read_only_gmem<float4>", 1);



	DO_BMARK(read_only_tex_float, float, 0.0f, "read_only_tex<float>", 1);

	DO_BMARK(read_only_tex_float2, float2, make_float2(0.0f, 0.0f), "read_only_tex<float2>", 1);

	DO_BMARK(read_only_tex_float4, float4, make_float4(0.0f, 0.0f, 0.0f, 0.0f), "read_only_tex<float4>", 1);



Any idea about this behaviour? nVidia guys please ? :)


The releasenotes of the visual profiler has details about the counters (also the fact that 1 MP is measured)

I’ve opened a bug report in the beta site (bug #535482 - “Textures reads seem to be ignored by the Visual profiler in 2.2”)
and indeed it seems to have been a issue with textures. Test code to reproduce it is taken from a earlier
post by Mr Anderson42.

This is the official nVidia answer to the bug:
“Latest Comment update from NVIDIA (4/7/2009 12:16:29 PM):
Eyal, our dev team has reviewed this and stated that this is currently unsupported. It may be considered for future enhancements.”


Are you reading from textures at all?

The profiler does not count texture accesses for memory throughput reporting. The reported number is for the entire GPU, not one multiprocessor. Here’s how it’s calculated:

  • count the 32B, 64B, and 128B memory transactions issued by one TPC (these are the {gld,gst}_{32,64,128}b counters you see;
  • compute the number of bytes moved to/from gmem by that TPC, based on the above counts;
  • count the threadblocks run on that TPC;
  • find the ratio of total threadblocks (computed from grid dimensions) to the threadblocks run on one TPC;
  • extrapolate the bytes moved to/from gmem by the entire kernel, using per TPC byte count and the ratio computed above;
  • divide the total bytes moved by kernel elapsed time.

One more thing - when calculating G is 10^9 in the calculation, not 2^30. Evidently powers of 10 are used for bandwidth numbers, so it was kept consistent so that it can be compared to theoretical numbers.


Yes I’m doing most of the reads from textures in my code since the threads use/share the data they need and

shared memory is not applicable in this case ( for a couple of reasons). In anycase the code makes heavy use

of textures. Seems like the profiler doesnt see it.

BTW - this is something I’ve been thinking of. One of the things i was hoping to see with the new profiler

was the bandwidth. I guessed this is how you count it, but if we divide the total bytes by the kernel time then

we also count other operations in this counter, no?

if 50% of the kernel time was calculations and 50% of the time was fetching data from gmem, for example,

I’d get half the bandwidth of the therotical capability in the profiler results, no?

To me, lazy guy, I’d be happy if the division was done by the actual time that the gmem operations took and not

the overall kernel time. That way I could have known easily if my kernel is good or bad “bandwidth” wise.

I hope this makes sense :)



One more thing to note. When I saw that my app gave such low values in the BW params in the profiler, I went

and checked the code once posted here by Mr Anderson42 (I’ve attached it above).

This simple test indeed seemed to have confirmed the problem.


You’re correct, total elapsed time is used. Unfortunately the actual time that gmem operations were taking place is not easy to measure. We’re looking into some ideas about augmenting machine code for this purpose, but it’s not trivial (plus not always accurate - consider cases where memory accesses are data-dependent). In the meantime, you can commenting out as much computation as possible at the source level for this purpose. Also keep in mind that the compiler excludes any code that doesn’t contribute to output.


Hi Paulius,

Doing it in the hardware would be great… however in the meantime, how about putting the burden on us the programmers?

I could write a code like this maybe:

__global__ mykernel(...)


	 // code here


		//load from gmem



what do you think?