A couple of questions

  1. How do you measure the memory bandwidth of an application? Is there any way to determine this without having to sit and calculate it by hand from the profiler counters?

  2. The branch counter from the profiler shows a non-zero value even though I don't have any branch statements in my kernel. What is the possible reason for this?

No.

But you can automate the profiler counter calculations. It'd be nice if someone wrote a script, but at least you can build a formula in Excel and paste in the counter values. The formula is something like 64 * (CoalescedReads + CoalescedWrites) + 4 * (UncoalescedReads + UncoalescedWrites). A formula for "wasted" bandwidth is 60 * (UncoalescedReads + UncoalescedWrites).
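As a stopgap, here is a minimal C sketch of that calculation (the counter names are placeholders for whatever your profiler output calls them; plug in your own values):

#include <stdio.h>

int main(void){
	/* Placeholder counter values: copy these in from a profiler run. */
	long coalescedReads = 0, coalescedWrites = 0;
	long uncoalescedReads = 0, uncoalescedWrites = 0;

	/* 64 bytes per coalesced transaction, 4 useful bytes per uncoalesced one. */
	long usefulBytes = 64 * (coalescedReads + coalescedWrites)
	                 +  4 * (uncoalescedReads + uncoalescedWrites);
	long wastedBytes = 60 * (uncoalescedReads + uncoalescedWrites);

	printf("useful: %ld bytes, wasted: %ld bytes\n", usefulBytes, wastedBytes);
	return 0;
}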

(Of course, keep in mind that on G200, uncoalesced reads/writes are not reported. So this formula doesn't work there, and you can't calculate bandwidth this way.)

You can calculate bandwidth by doing your own profiling: count up by hand the number of bytes that your kernel reads and writes in each thread, and time the kernel launches.
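For example, here is a minimal self-contained sketch of that approach using CUDA events (the copy kernel and sizes are stand-ins; substitute your own kernel and your hand-counted byte total):

#include <stdio.h>

#include "cuda.h"

/* Stand-in kernel just to have something to time; swap in your own. */
__global__ void copyKernel(float* dst, const float* src){
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	dst[i] = src[i];
}

int main(void){
	const int n = 1 << 20;	/* example size: 1M floats */
	float *src, *dst;
	cudaMalloc((void**)&src, n * sizeof(float));
	cudaMalloc((void**)&dst, n * sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	copyKernel<<<n / 256, 256>>>(dst, src);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);	/* elapsed time in milliseconds */

	/* Hand count: the kernel reads n floats and writes n floats. */
	double bytes = 2.0 * n * sizeof(float);
	printf("Effective bandwidth: %.2f GB/s\n", bytes / (ms * 1.0e6));

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(src);
	cudaFree(dst);
	return 0;
}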

Thanks guys…

Any idea about the second question? Also, there is another one -

  3. When we make a coalesced access, what is the burst size of the data that can be sent? For example, in normal processors, when there is a cache miss, an entire cache line is brought into the cache. So in each coalesced access, how much data comes in? Is it the data required for a half warp, a complete warp, or a fixed size?

We can't exactly help you without seeing your kernel code. A call to a math function or something like that could have branches in it.

Coalesced reads come in several different transaction sizes. On compute 1.0/1.1 hardware, these are 32, 64, and 128 bits in each thread. On compute 1.2 and newer hardware, it is a little different (see the programming guide). And there is no cache. A single memory transaction feeds a half warp (though it is easier to think in terms of warps instead of half warps).
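As a toy illustration (my own example, not from the programming guide): on compute 1.0/1.1 hardware, a half warp coalesces when thread k accesses the k-th word of an aligned segment, and strided access breaks that:

/* Thread k reads word k of an aligned segment: one transaction per half warp. */
__global__ void coalescedCopy(float* out, const float* in){
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	out[i] = in[i];
}

/* With stride > 1 the half warp straddles segments, so on compute 1.0/1.1
   the accesses are serialized into separate transactions. */
__global__ void stridedCopy(float* out, const float* in, int stride){
	int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
	out[i] = in[i];
}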

OK. So if I have to bring in, say, two data elements for each thread, is it better to bring them in together as a float2 rather than as two separate float loads?

And regarding the other question about branch counters: this is a test program that does the element-wise product of two vectors, i.e. C[i] = A[i] * B[i], where A, B and C are three vectors.

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include "cuda.h"

#define THREADS 256

__global__ void dot(float*, float*, float*, int);

void cpuDot(float*, float*, float*, int);

int main(int argc, char** argv){

	int i, vec_size;

	vec_size = 1024;

	float *A = (float *) malloc (vec_size * sizeof(float));
	float *B = (float *) malloc (vec_size * sizeof(float));
	float *Result = (float *) malloc (vec_size * sizeof(float));
	float *cpuResult = (float *) malloc (vec_size * sizeof(float));

	printf("\n Initializing arrays\n");

	for(i=0;i<vec_size;i++){
		B[i] = i%467;
		A[i] = i%3218;
	}

	printf("Declaring arrays on GPU\n");

	float *Ad, *Bd, *Resultd;

	cudaMalloc((void**)&Ad, vec_size*sizeof(float));
	cudaMalloc((void**)&Bd, vec_size*sizeof(float));
	cudaMalloc((void**)&Resultd, vec_size*sizeof(float));

	cudaMemcpy(Ad, A, vec_size*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(Bd, B, vec_size*sizeof(float), cudaMemcpyHostToDevice);

	dim3 block(THREADS, 1);
	dim3 grid(vec_size/THREADS, 1);	/* vec_size is an exact multiple of THREADS */

	printf("Starting multiplication on GPU\n");

	dot<<<grid, block>>>(Ad, Bd, Resultd, vec_size);
	cudaThreadSynchronize();

	cudaMemcpy(Result, Resultd, vec_size*sizeof(float), cudaMemcpyDeviceToHost);

	cpuDot(A, B, cpuResult, vec_size);

	float error = 0.0f;

	/* Absolute differences, so positive and negative errors can't cancel. */
	for(i=0;i<vec_size;i++) error += fabsf(Result[i] - cpuResult[i]);

	for(i=0;i<vec_size;i++){
		printf("%d \t %f \t %f \n", i, Result[i], cpuResult[i]);
	}

	printf("\n Avg Error : \t %f \n", error/vec_size);

	cudaFree(Ad);
	cudaFree(Bd);
	cudaFree(Resultd);

	free(A);
	free(B);
	free(Result);
	free(cpuResult);

	return 0;

}

__global__ void dot(float* r1, float* r2, float* result, int vec_size){

	int tid = blockDim.x * blockIdx.x + threadIdx.x;

	__shared__ float Adata[THREADS];
	__shared__ float Bdata[THREADS];
	__shared__ float product[THREADS];

	Adata[threadIdx.x] = r1[tid];
	Bdata[threadIdx.x] = r2[tid];

	__syncthreads();

	product[threadIdx.x] = Adata[threadIdx.x] * Bdata[threadIdx.x];

	__syncthreads();

	result[tid] = product[threadIdx.x];

}

void cpuDot(float* r1, float* r2, float* cpuResult, int vec_size){

	for(int i = 0; i < vec_size; ++i){
		cpuResult[i] = r1[i] * r2[i];	/* '=' rather than '+=': cpuResult is not zero-initialized */
	}

}

The way I see it, there are no conflicts or branch statements, but the profiler shows its branch counter = 16. How is this possible?