CUDA clock() issue

The data read into j is never used anywhere, so without the “-G” flag the memory read is just optimized away.
Use it to calculate something you write out at the end of the kernel.

Also, the placement of the clock reads in the compiled code is probably not what you expect (the compiler will happily reorder memory reads around them). Use “cuobjdump -sass test” to look at the compiled code and experiment with barriers, dependencies, and asm volatile(“”); / asm volatile(“” ::: “memory”); statements until you get the code you want to time.
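
For example, a minimal sketch of that kind of fencing around a single timed load (the function and variable names are made up just for illustration; depending on what the SASS shows, you may still need real data dependencies or membars):

__device__ unsigned long long time_one_load(const unsigned int *addr, unsigned int *sink)
{
	unsigned int value;
	unsigned long long start, end;

	asm volatile("" ::: "memory");   // compiler barrier: keep earlier accesses above this point
	start = clock64();
	value = *addr;                   // the single access we want to time
	asm volatile("" ::: "memory");   // compiler barrier: keep the load between the two clock reads
	end = clock64();

	*sink = value;                   // consume the value so the load cannot be optimized away
	return end - start;
}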

Hi tera,

Thanks for the reply, I have tried what you proposed, and the kernel code now looks like this:

__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
	register volatile UINT32 i, j = 0;
	register volatile UINT64 start, end;

	
	for (i = 0; i < ELEMENTS; ++i)
	{		
		asm volatile("membar.cta;");
		start = clock64();
			
		j = data[i];
		asm volatile("membar.cta;");		
		end = clock64();		
		
		data_copy[i] = j;
		timer[i] = end - start;
	}
	
}

When I run the code (compiled without -G), I get this:

0 1357
1 539
2 540
3 539
4 540
5 540
6 540
7 539
8 539
9 540
10 540
11 540
12 539
13 540
14 540
15 539
16 540
17 540
18 540
19 540
20 540
21 539
22 540
23 540
24 539
25 540
26 540
27 539
28 540
29 540
30 539

It shouldn’t be like this. I have tried with barriers at different places in the code, but the values stay constant no matter how many barriers I add. I don’t get what is wrong…

These results look sensible to me. What do you think is suspicious?
If you want to discuss this further, please also show the code for how you call the kernel and what it is you are printing out.

Hi tera,

The idea is the following: I want to expose the L1 cache line size. To do that I read values from L1, do something with them and measure the latency. It should be almost constant while reading within a cache line, and then a miss should come; this should repeat every N elements, and that N indicates the cache line size. I need to verify the cache line size for optimizations, and I do not get a pattern or “real latencies”.

The whole code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define STRIDE 1
#define ELEMENTS 512

#define UINT64	unsigned long long
#define UINT32	unsigned int

__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
	register volatile UINT32 i, j = 0;
	register volatile UINT64 start, end;

	
	for (i = 0; i < ELEMENTS; ++i)
	{		
		asm volatile("membar.cta;");
		start = clock64();		
		j = data[i];
		asm volatile("membar.cta;");		
		end = clock64();		
		data_copy[i] = j;
		timer[i] = end - start;
	}

	
}

int main(int argc, char **argv)
{
	UINT64 h_duration[ELEMENTS];
	UINT32 h_data[ELEMENTS];
	UINT32 h_data_copy[ELEMENTS];

	printf("%d\n", ELEMENTS);

	UINT64 *d_duration;
	cudaError_t error_id;
	error_id = cudaMalloc(&d_duration,  ELEMENTS * sizeof(UINT64));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_duration %s\n", cudaGetErrorString(error_id));
	}
	
	UINT32 *d_data;
	error_id = cudaMalloc(&d_data,  ELEMENTS * sizeof(UINT32));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_data %s\n", cudaGetErrorString(error_id));
	}

	UINT32 *d_data_copy;
	error_id = cudaMalloc(&d_data_copy,  ELEMENTS * sizeof(UINT32));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_data_copy %s\n", cudaGetErrorString(error_id));
	}

	for (int i = 0; i < ELEMENTS; ++i) {
		h_data[i] = (i + STRIDE) % ELEMENTS;
	}
	
	cudaMemcpy((void*) d_data, (void*) h_data, ELEMENTS * sizeof(UINT32), cudaMemcpyHostToDevice);
	cudaDeviceSynchronize();

	dim3 Db = dim3(1);
	dim3 Dg = dim3(1,1,1);

	clock_test <<<Dg, Db>>> (d_duration, d_data, d_data_copy);
	cudaDeviceSynchronize();

	cudaMemcpy((void *)h_duration, (void *)d_duration, ELEMENTS * sizeof(UINT64) , cudaMemcpyDeviceToHost);
	cudaMemcpy((void *)h_data_copy, (void *)d_data_copy, ELEMENTS * sizeof(UINT32) , cudaMemcpyDeviceToHost);
	cudaDeviceSynchronize();	

	for (int i = 0; i < ELEMENTS; ++i) {
		printf("%d %llu\n", i, h_duration[i]);
		
		if (h_data[i] != h_data_copy[i]) {
			printf("TEST FAILED !");
			exit(-1);
		}
	}
}

Output:

0 1338
1 533
2 534
3 534
4 532
5 532
6 532
7 531
8 532
9 532
10 532
11 532
12 534
13 532
14 531
15 531
16 531
17 531
18 534
19 534
20 532
21 532
22 532
23 532
24 532
25 534
26 534
27 534
28 532
29 532
30 532
31 534
32 534
33 532
34 531
35 534
36 534
37 532
38 531
39 531
40 531
41 532
42 532
43 532
44 531
45 532
46 534

How the output should be theoretically:

0 670
1 115
2 90
3 91
4 90
5 91
6 92
7 90
8 170
9 90
10 90
11 92
12 91
13 90
14 92
15 90
16 90
17 170
18 90

But it does not happen: there are too many clock cycles in the output, and they are constant… they shouldn’t be, since I am iterating over data that should already be in L1. The output is the clock difference before and after

j = data[i];

, and it shouldn’t be that big. Am I missing something?

What you are seeing is global memory latency on the first access, then L2 latency for the following ones.
You need to explicitly opt in to using the L1 cache by adding “-Xptxas -dlcm=ca” to the nvcc command line.
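
For example, a build line along these lines (the source and output names are just placeholders):

nvcc -O3 -Xptxas -dlcm=ca -o clock_test clock_test.cu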

Are you sure that you really need to measure that yourself instead of reading about it? GPUs are different from CPUs, so if you just use your CPU experience to explore the GPU, this may fail.

my own list of low-level benchmarks:
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Demystifying GPU Microarchitecture through Microbenchmarking | stuffedcow
http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
Understanding the ISA impact on GPU Architecture.
https://hal.inria.fr/file/index/docid/789958/filename/112_Lai.pdf
Dissecting GPU Memory Hierarchy through Microbenchmarking | hgpu.org
Understanding Latency Hiding on GPUs by Vasily Volkov | hgpu.org

full list here:

Hi guys,

So I have tried with “-Xptxas -dlcm=ca”, and indeed with this flag I do get the pattern I was expecting; it seems that by default nvcc is using “-Xptxas -dlcm=cg”. But the latency is still high… I don’t get it… it’s 1 block, 1 thread, precisely so that I can expose the latency, not hide it.

Thx for all the links BulatZiganshin. I think that although GPUs are different, the cache memory should be similar in structure to a CPU’s; access time and speed may differ. If I am wrong please correct me, but with “-Xptxas -dlcm=ca” the pattern confirms the initial theory.

This confirms that the L1 cache line is 32 B, which is exactly 8 vector elements (8 * sizeof(UINT32) = 8 * 4 B = 32 B). The only issue that remains is the latency: should it be that high?

The output with “-Xptxas -dlcm=ca”:

0 1171
1 369
2 370
3 369
4 369
5 370
6 369
7 369
8 515 ---> miss
9 369
10 369
11 369
12 369
13 369
14 369
15 369
16 515 ---> miss
17 369
18 370
19 369
20 370
21 369
22 369
23 369
24 515 ---> miss
25 370
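
As a side note, here is a small host-side sketch of how the line size could be read off such a trace automatically (it reuses the UINT32/UINT64 defines from the code above; the function name and the 4/3 “miss” threshold are arbitrary choices of mine, not anything measured):

/* Treat any access noticeably slower than the fastest hit as a miss; the
   distance between two consecutive misses, times the element size, is the
   inferred line size. Returns -1 if fewer than two misses are found. */
static int infer_line_size(const UINT64 *timer, int n)
{
	UINT64 hit = timer[1];                  /* skip the cold first access */
	for (int i = 2; i < n; ++i)
		if (timer[i] < hit)
			hit = timer[i];         /* fastest observed access = a hit */

	int last_miss = -1;
	for (int i = 1; i < n; ++i) {
		if (timer[i] > hit + hit / 3) { /* > ~1.33x a hit => call it a miss */
			if (last_miss >= 0)
				return (i - last_miss) * (int)sizeof(UINT32);
			last_miss = i;
		}
	}
	return -1;
}

With the trace above this would report (16 - 8) * sizeof(UINT32) = 32 bytes.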

I will go over the documentation and see if I can understand this better.

Thx

Yes, it can be high, since GPUs prioritize throughput over latency and can run 16 threads per core. That said, Maxwell’s L1 cache latency AFAIR is 30-50 cycles, as measured in one of these papers.

Hello,

I’ve been doing some modifications to the kernel code and observed the following behavior. When I declare the “j” variable as a simple register (removing volatile), the timing changes significantly: the latency drops from ~240 to 75 (and stays constant). From what I have read in other topics, if volatile is not used, the compiler is free to optimize reads/writes of the variable by caching the data in a local register. I’m still confused about whether to use volatile or not.

register volatile UINT32 i;
register volatile UINT32 j = 0;
clock_t start, end;

for (int k = 0; k < ITERATIONS; k++)
{
	for (i = 0; i < ELEMENTS; ++i)
	{
		start = clock64();
		asm volatile("ld.cg.u32 %0, [%1];" : "=r"(j) : "l"(&data[j]));
		end = clock64();

		data_copy[i] = j;
		timer[i] = end - start;
	}
}

As the “register” keyword already indicates, you want j to be in a register rather than in memory, so you don’t produce additional memory accesses beyond the one you want to measure. So the “volatile” modifier on it should go.

On a side note, the “register” keyword is just ignored by the compiler. It will place all variables in registers anyway (unless it’s not able to, in which case even the register keyword will not change that).
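
So, as a sketch, the declarations in your snippet could simply become something like this (same names as above; I have also widened start/end, since clock64() returns a 64-bit value and clock_t may be narrower on the device):

	UINT32 i;                        // loop counter
	UINT32 j = 0;                    // loaded value; plain variable, no volatile, no register
	unsigned long long start, end;   // clock64() returns long long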

You should really look at the generated code as displayed by “cuobjdump -sass”. Without checking what you are actually timing, microbenchmarking really is meaningless.
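
For example (the executable name is assumed), dump the SASS to a file and check what actually ends up between the two clock reads:

cuobjdump -sass clock_test > clock_test.sass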

Hi tera,

I see, thx a lot for the info, will do that!