clGetEventProfilingInfo() returns zero end time: cannot get kernel's end time while start is correct

hello,

running a kernel gives strange results.

i have 2 kernels that multiply 2 square matrices. no extensions used.

also, i have a dummy host cpu implementation (so i can compare the results).

i use profiling (the cl command queue is created with CL_QUEUE_PROFILING_ENABLE) and i collect the results after a clFinish() on this queue.

i also use gettimeofday() for host timing and i run the binary under the time shell builtin.

when the matrices aren’t large, results are “normal”, except that the second kernel returns 0 as the end time from clGetEventProfilingInfo(). i checked clGetEventInfo() and it returns 0 (CL_COMPLETE) when queried with CL_EVENT_COMMAND_EXECUTION_STATUS. the first kernel returns expected results from clGetEventProfilingInfo().

#dummy host implementation

$time ./matmul1 -p -t 0 -ha 256 -wa 256 -wb 256

	82msec (0.082 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dummy

real	0m0.086s

user	0m0.092s

sys	0m0.000s

#dev-gpu1 is for the first kernel

$time ./matmul1 -p -t 110 -ha 256 -wa 256 -wb 256

[mul_matrices_f_dev]: 6 msec, Throughput = 55.9241 GFlops, Time = 0.00060 sec, Size = 33554432	Kernel execution time on device: 0.00049 sec

	452msec (0.452 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu1

real	0m0.580s

user	0m0.288s

sys	0m0.288s

#dev-gpu2 is for the second kernel

$time ./matmul1 -p -t 115 -ha 256 -wa 256 -wb 256

[mul_matrices_f_dev2]: 1 msec, Throughput = 335.5443 GFlops, Time = 0.00010 sec, Size = 33554432	Kernel execution time on device: 17157461376.46042 sec

	533msec (0.533 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu2

real	0m0.661s

user	0m0.372s

sys	0m0.288s

timings:

#first line

[mul_matrices_f_dev2]: 1 msec, <-- gettimeofday() elapsed time only for kernel execution

Throughput = 335.5443 GFlops, Time = 0.00010 sec, <-- the same (now in sec instead of msec)

Size = 33554432 Kernel execution time on device: 17157461376.46042 sec <-- clGetEventProfilingInfo() end-start converted to sec

#second line

533msec (0.533 sec) <-- gettimeofday() elapsed time (total -- initialize, load, execute, deinitialize)
  • Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu2 <-- matrix dimensions and target (here we run the second kernel)

and after that you see the time shell command output.

to verify correctness i rerun the tests and dump the input/output matrices to a text file. then i use diff and see no difference.

also, i run the tests in a different order (e.g. the second kernel first, then the first) and the results are the same.

when matrices get “bigger”, the second kernel is speeding like a demon!!!

#here the host implementation is optimized, because we don't want it to run forever!

$time ./matmul1 -p -t 16 -ha 2048 -wa 2048 -wb 2048

[mul_matrices_f_4_host_sse_aligned_pthr]: 91332 msec, Throughput = 1.8810 GFlops, Time = 9.13320 sec, Size = 17179869184

	91332msec (91.332 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on host-pthr+sse2

real	1m31.494s

user	5m40.401s

sys	0m0.280s

#first kernel -- consistent results with previous test

$time ./matmul1 -p -t 110 -ha 2048 -wa 2048 -wb 2048

[mul_matrices_f_dev]: 2208 msec, Throughput = 77.8074 GFlops, Time = 0.22080 sec, Size = 17179869184	Kernel execution time on device: 0.21784 sec

	2954msec (2.954 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on dev-gpu1

real	0m3.239s

user	0m0.496s

sys	0m0.352s

#second kernel -- something must be wrong!

$time ./matmul1 -p -t 115 -ha 2048 -wa 2048 -wb 2048

[mul_matrices_f_dev2]: 31 msec, Throughput = 5541.8933 GFlops, Time = 0.00310 sec, Size = 17179869184	Kernel execution time on device: 17157460771.76913 sec

	644msec (0.644 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on dev-gpu2

real	0m0.927s

user	0m0.580s

sys	0m0.348s

wow! i have always been an nvidia aficionado, but these results are beyond my imagination! 5 petaflops throughput???

as you can see, in both tests the second kernel cannot report an end time. this happens every time.

the results you see are with the “-cl-fast-relaxed-math” option, but even without it the second kernel cannot report an end time.

the throughput calculation is made with the gettimeofday() data. clGetEventProfilingInfo() results are only displayed after the “Kernel execution time on device”.

again, when i dump the results and compare them with diff, they are correct.

also, the second kernel consistently runs faster (e.g. with dimensions 1024x1024, 512x512, etc).

so, i would appreciate any help with the following:

  • any ideas why a kernel cannot report end time?

  • can a geforce 8800gt have this kind of performance?

  • is there something else i should know about profiling? (for example, any tool for linux that would make things clearer?)

i tried to google for “clgeteventprofilinginfo zero end” but nothing came up. also, i searched this forum and couldn’t find anything.

i have an amd phenom quadcore (cpu) and i run debian squeeze.

$cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  260.19.12  Fri Oct  8 11:17:08 PDT 2010

GCC version:  gcc version 4.3.5 (Debian 4.3.5-4)

$lspci|grep VGA

01:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 8800 GT] (rev a2)

$./clview

opencl

	num_platforms: 1

	platform 0 (this: 0x14d0670)

		vendor: NVIDIA Corporation

		profile: FULL_PROFILE

		version: OpenCL 1.0 CUDA 3.2.1

		name: NVIDIA CUDA

		extensions:

			cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll

thanx in advance for any answers

  • Are you checking the error code of the clGetEventProfilingInfo function? Maybe you are giving the function the wrong event… Or try enqueuing the kernel with a blocking call. …Try posting your host code.

  • Sorry, I’m not an expert. I don’t know how much performance can be gained from such a graphics card. It depends on the memory access pattern, register usage… Posting the output of the SDK sample oclDeviceQuery might help ;-)

  • There is openclprof by NVidia (on linux usually in folder /usr/local/cuda/openclprof/bin ), which can report divergent branches, memory throughput,… - it might show why the second kernel is so quick

There is also the occupancy xls calculator and the -cl-nv-verbose compiler option, which are handy.

first of all, thank you very much for your swift answer.

yes, CL_SUCCESS is returned. i don’t think that a blocking enqueue would make any difference. i have the host+device code at the end of the post, although i don’t think it will be very useful, since it’s derived from the oclMatrixMul SDK code (i just made some minor changes). also, i wrapped some functions in a handy library for quick testing. of the kernels, one is basically from the nvidia SDK and the other from the ati stream SDK.

first of all, i made a major mistake! the throughput on the gpu is about 5 teraflops, not petaflops. i have owned many nvidia gfx cards (from the tnt2 onwards) and i got carried away - i have always been fond of nvidia products, so when i saw a big number i went insane!!!

i also post the oclDeviceQuery output at the end of this post. i have the SDK installed on windows, so i ran the binary provided by the SDK (i didn’t compile it myself - i don’t have visual studio installed, and don’t plan to; i only have a windows license provided by the university i study at). the kernels have different memory access patterns. also, the second kernel has 1 synchronization point (barrier) instead of the 2 the first kernel has.

openclprof is very interesting. but, as you know, in linux land things are complicated! since i run debian squeeze 64bit, i don’t want the old drivers provided by the package manager (currently v195.36). i use the sgfxi script to get the latest drivers (v260.xx, which also contain /usr/lib/libOpenCL.so). the problem is that if the SDK blob provided by nvidia contains files incompatible with my system (or files that mess with the ones already installed), i would be in trouble! also, i like to have complete control over what is installed, and where, on my file system. so, if nvidia provided a package without drivers or other binaries (JUST the samples+documentation AND the utilities), i would gladly install it. this is the main reason i don’t like SDKs on linux.

if i use the occupancy xls calculator (i have openoffice 3.2 installed - my gfx card has compute capability 1.1) and insert the 60 registers reported by the ‘-cl-nv-verbose’ kernel compilation, the xls cannot display values in the blue box [ 3.) GPU Occupancy Data is displayed here and in the graphs: ] - they are all zero. the first kernel is fine, though (15 registers).

below, you can see the ‘-cl-nv-verbose’ output.

#first kernel

$ ./cl_prog -g -i ../matmul2.cl --inc ../matmul2.h --options "-cl-nv-verbose -cl-fast-relaxed-math"

OPENCL COMPILATION LOG: ../matmul2.cl on GeForce 8800 GT [GPU]

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode='  --verbose  '

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Control flags for 'cuModuleLoadDataEx_4' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'

: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options='  --verbose  '

ptxas info    : Compiling entry function 'matrixMul' for 'sm_11'

ptxas info    : Used 15 registers, 28+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]

	opencl program 0x21fe210

		file: ../matmul2.cl

		num_kernels: 1

			kernel 0 (this 0x2c16670)

		kernel0: 0x2c16670

#second kernel

$ ./cl_prog -g -i ../matmul3-3.cl --options "-cl-nv-verbose -cl-fast-relaxed-math"

OPENCL COMPILATION LOG: ../matmul3-3.cl on GeForce 8800 GT [GPU]

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode='  --verbose  '

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Control flags for 'cuModuleLoadDataEx_4' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'

: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options='  --verbose  '

ptxas info    : Compiling entry function 'matrixMul3' for 'sm_11'

ptxas info    : Used 60 registers, 24+16 bytes smem, 168 bytes cmem[0], 24 bytes cmem[1]

	opencl program 0x2379210

		file: ../matmul3-3.cl

		num_kernels: 1

			kernel 0 (this 0x2824eb0)

		kernel0: 0x2824eb0

oclDeviceQuery output on windows+msys bash console:

$ ./oclDeviceQuery.exe                                                                                                                                      

oclDeviceQuery.exe Starting...                                                                                                                              

OpenCL SW Info:                                                                                                                                             

CL_PLATFORM_NAME:      NVIDIA CUDA                                                                                                                         

 CL_PLATFORM_VERSION:   OpenCL 1.0 CUDA 3.2.1                                                                                                               

 OpenCL SDK Revision:   5537818                                                                                                                             

OpenCL Device Info:                                                                                                                                         

1 devices found supporting OpenCL:                                                                                                                         

---------------------------------                                                                                                                          

 Device GeForce 8800 GT                                                                                                                                     

 ---------------------------------                                                                                                                          

  CL_DEVICE_NAME:                       GeForce 8800 GT                                                                                                     

  CL_DEVICE_VENDOR:                     NVIDIA Corporation                                                                                                  

  CL_DRIVER_VERSION:                    260.99                                                                                                              

  CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU                                                                                                  

  CL_DEVICE_MAX_COMPUTE_UNITS:          14                                                                                                                  

  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3                                                                                                                   

  CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64                                                                                                      

  CL_DEVICE_MAX_WORK_GROUP_SIZE:        512                                                                                                                 

  CL_DEVICE_MAX_CLOCK_FREQUENCY:        1500 MHz                                                                                                            

  CL_DEVICE_ADDRESS_BITS:               32                                                                                                                  

  CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte                                                                                                           

  CL_DEVICE_GLOBAL_MEM_SIZE:            511 MByte                                                                                                           

  CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no                                                                                                                  

  CL_DEVICE_LOCAL_MEM_TYPE:             local                                                                                                               

  CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte                                                                                                            

  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte                                                                                                            

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE                                                                              

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE                                                                                           

  CL_DEVICE_IMAGE_SUPPORT:              1                                                                                                                   

  CL_DEVICE_MAX_READ_IMAGE_ARGS:        128                                                                                                                 

  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8                                                                                                                   

  CL_DEVICE_SINGLE_FP_CONFIG:           INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma                                                       

CL_DEVICE_IMAGE <dim>                 2D_MAX_WIDTH     4096                                                                                               

                                        2D_MAX_HEIGHT    32768                                                                                              

                                        3D_MAX_WIDTH     2048                                                                                               

                                        3D_MAX_HEIGHT    2048                                                                                               

                                        3D_MAX_DEPTH     2048                                                                                               

CL_DEVICE_EXTENSIONS:                 cl_khr_byte_addressable_store                                                                                       

                                        cl_khr_icd                                                                                                          

                                        cl_khr_gl_sharing                                                                                                   

                                        cl_nv_d3d9_sharing                                                                                                  

                                        cl_nv_compiler_options                                                                                              

                                        cl_nv_device_attribute_query                                                                                        

                                        cl_nv_pragma_unroll                                                                                                 

                                        cl_khr_global_int32_base_atomics                                                                                    

                                        cl_khr_global_int32_extended_atomics                                                                                

CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.1                                                                                                                 

  NUMBER OF MULTIPROCESSORS:            14                                                                                                                  

  NUMBER OF CUDA CORES:                 112                                                                                                                 

  CL_DEVICE_REGISTERS_PER_BLOCK_NV:     8192                                                                                                                

  CL_DEVICE_WARP_SIZE_NV:               32                                                                                                                  

  CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE                                                                                                             

  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE                                                                                                             

  CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE                                                                                                            

  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0                                                                   

oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.2.1, SDK Revision = 5537818, NumDevs = 1, Device = GeForce 8800 GT        

System Info:                                                                                                                                                

Local Time/Date = 19:41:18, 11/10/2010                                                                                                                     

 CPU Arch: 0                                                                                                                                                

 CPU Level: 16                                                                                                                                              

 # of CPU processors: 4                                                                                                                                     

 Windows Build: 2600                                                                                                                                        

 Windows Ver: 5.1                                                                                                                                           

PASSED                                                                                                                                                      

Press <Enter> to Quit...                                                                                                                                    

-----------------------------------------------------------

host+device code:

matmul-dev.c – code based on samples from nvidia gpu computing sdk

(this won’t be very helpful)

/* matmul-dev.c */

#include <stdio.h>

#include <stdlib.h>

#include <stdint.h>

#include <inttypes.h>

#include <string.h>

#include <CL/cl.h>

#include "std_gen.h"

#include "clhlp"

#include "matmul-dev.h"

#include "matmul2.h"

/* OpenCL specific variables */

static struct cl_arch_t clarch;

static struct cl_ctx_t clctx;

static struct cl_cmdqueue_t clcmdq;

static struct cl_prog_t clprg;

/* OpenCL device memory for matrices */

static cl_mem d_a, d_b, d_c;

static unsigned int mem_size_a, mem_size_b, mem_size_c;

static void	init_kernel(const int which);

static void	init_mem(float *a, float *b,

			const int ha, const int wa, const int wb);

static void	deinit_kernel(void);

static void	deinit_mem(void);

static void	print_profiling(const char *fun, cl_event event,

			const int ha, const int wa, const int wb,

			const uint64_t ms, const int n_iter);

static void

print_profiling(const char *fun, cl_event event,

	const int ha, const int wa, const int wb,

	const uint64_t ms, const int n_iter)

{

	uint64_t n_ops;

	double gflops, t;

	t=(double) ms/(n_iter*1000.0);

	n_ops=(uint64_t) 2*wa*ha*wb;

	gflops=1.0e-9*n_ops/t;

	printf("[%s]: %" PRIu64 " msec, "

		"Throughput = %.4f GFlops, Time = %.5f sec, "

		"Size = %" PRIu64 "\t",

		fun, ms, gflops, t, n_ops

	);

	printf("Kernel execution time on device: %.5f sec\n",

		get_execution_time(event)/1000.0

	);

}

void

init_cl(const int typ)

{

	const char thisfun[]="init_cl";

	int res;

	cl_device_type cltyp;

	cltyp=CL_DEVICE_TYPE_ALL;

	switch(typ) {

	case 1:

		cltyp=CL_DEVICE_TYPE_GPU;

	break;

	case 2:

		cltyp=CL_DEVICE_TYPE_CPU;

	break;

	}

	if(!init_cl_arch(&clarch, 0))

		BAILOUT_CL_NO_PLAT(thisfun);

	if(!init_cl_ctx(&clctx, cltyp, clarch.platforms[0], 0))

		BAILOUT_CL_NO_CTX(thisfun);

#ifdef DO_PROFILING

	res=init_cl_cmdqueue(&clcmdq,

		CL_QUEUE_PROFILING_ENABLE, clctx.context, clctx.devices[0]

	);

#else

	res=init_cl_cmdqueue(&clcmdq, 0, clctx.context, clctx.devices[0]);

#endif /* DO_PROFILING */

	if(!res)

		BAILOUT_CL_NO_CMD_QUEUE(thisfun);

}

void

deinit_cl(void)

{

	deinit_cl_cmdqueue(&clcmdq);

	deinit_cl_ctx(&clctx);

	deinit_cl_arch(&clarch);

}

static void

init_kernel(const int which)

{

	const char thisfun[]="init_kernel";

	char header[256];

	switch(which) {

	case 2:

		strcpy(header, "../matmul2.h");

		if(

			!init_cl_prog_from_src_ex(&clprg,

				"../matmul2.cl", "-cl-fast-relaxed-math",

				NULL, (char (*)[256]) header, 1, NULL, 0,

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	break;

	case 3:

		if(

			!init_cl_prog_from_src(&clprg, "../matmul3-3.cl",

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	break;

	default: /* NOTE: only for testing purposes */

		if(

			!init_cl_prog_from_src(&clprg, "../matmul1.cl",

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	}

}

static void

deinit_kernel(void)

{

	deinit_cl_prog(&clprg);

}

static void

init_mem(float *a, float *b, const int ha, const int wa, const int wb)

{

	const char thisfun[]="init_mem";

	cl_int errcode;

	mem_size_a=ha*wa*sizeof(float);

	mem_size_b=wa*wb*sizeof(float);

	mem_size_c=ha*wb*sizeof(float);

	d_c=clCreateBuffer(

		clctx.context, CL_MEM_WRITE_ONLY,

		mem_size_c, NULL, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

	d_a=clCreateBuffer(

		clctx.context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,

		mem_size_a, a, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

	d_b=clCreateBuffer(

		clctx.context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,

		mem_size_b, b, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

}

static void

deinit_mem(void)

{

	const char thisfun[]="deinit_mem";

	cl_int errcode;

	errcode=clReleaseMemObject(d_a);

	errcode|=clReleaseMemObject(d_b);

	errcode|=clReleaseMemObject(d_c);

	ASSERT_CL_SUCC(errcode, thisfun, "clReleaseMemObject()");

}

void

mul_matrices_f_dev_le16(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev_le16";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(1);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=wb;

	local_worksize[1]=ha;

	global_worksize[0]=wb;

	global_worksize[1]=ha;

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, sizeof(int), (void *) &w_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, sizeof(int), (void *) &w_b

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

void

mul_matrices_f_dev2(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev2";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(3);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=BLOCK_SIZE;

	local_worksize[1]=BLOCK_SIZE;

	global_worksize[0]=round_up(BLOCK_SIZE, wb);

	global_worksize[1]=round_up(BLOCK_SIZE, ha);

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		5, sizeof(int), (void *) &w_a

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

void

mul_matrices_f_dev(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(2);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=BLOCK_SIZE;

	local_worksize[1]=BLOCK_SIZE;

	global_worksize[0]=round_up(BLOCK_SIZE, wb);

	global_worksize[1]=round_up(BLOCK_SIZE, ha);

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		5, sizeof(int), (void *) &w_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		6, sizeof(int), (void *) &w_b

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

matmul2.h

/* thread block size */

#define BLOCK_SIZE	16

matmul2.cl – code from nvidia gpu computing sdk

// matmul2.cl

// Multiply two matrices A * B = C

// Device code.

//include "matmul2.h"

#define AS(i, j)	As[(i)*BLOCK_SIZE+(j)]

#define BS(i, j)	Bs[(i)*BLOCK_SIZE+(j)]

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))

void

matrixMul(__global float *C, __global float *A, __global float *B,

	__local float *As, __local float *Bs,

	int wA, int wB)

{

	// 2D thread ID

	int gx=get_global_id(0);

	int gy=get_global_id(1);

	// block index (in NDRange)

	int bx=get_group_id(0);

	int by=get_group_id(1);

	// thread index (in workgroup)

	int tx=get_local_id(0);

	int ty=get_local_id(1);

	// index of the first sub-matrix of A processed by the block

	int aBegin=wA*BLOCK_SIZE*by;

	// index of the last sub-matrix of A processed by the block

	int aEnd=aBegin+wA-1;

	// step size used to iterate through the sub-matrices of A

	int aStep=BLOCK_SIZE;

	// index of the first sub-matrix of B processed by the block

	int bBegin=BLOCK_SIZE*bx;

	// step size used to iterate through the sub-matrices of B

	int bStep=BLOCK_SIZE*wB;

	float4 Csub=(float4) (0.0f, 0.0f, 0.0f, 0.0f);

	// loop over all the sub-matrices of A and B

	// required to compute the block sub-matrix

	for(

		int a=aBegin, b=bBegin;

		a<=aEnd;

		a+=aStep, b+=bStep

	) {

		// load the matrices from global memory

		// to local memory; each thread loads

		// one element of each matrix

		AS(ty, tx)=A[a+wA*ty+tx];

		BS(ty, tx)=B[b+wB*ty+tx];

		// synchronize to make sure the matrices are loaded

		barrier(CLK_LOCAL_MEM_FENCE);

		// multiply the two matrices together;

		// each thread computes one element of the block sub-matrix

		#pragma unroll

		for(int k=0; k<BLOCK_SIZE; k+=4) {

			float4 temp1=(float4) (

				AS(ty, k), AS(ty, k+1),

				AS(ty, k+2), AS(ty, k+3)

			);

			float4 temp2=(float4) (

				BS(k, tx), BS(k+1, tx),

				BS(k+2, tx), BS(k+3, tx)

			);

			Csub+=temp1*temp2;

		}

		// synchronize to make sure that the preceding

		// computation is done before loading two new

		// sub-matrices of A and B in the next iteration

		barrier(CLK_LOCAL_MEM_FENCE);

	}

	// write the block sub-matrix to device memory;

	// each thread writes one element

	//C[wB*BLOCK_SIZE*by+BLOCK_SIZE*bx+wB*ty+tx]=

	//	Csub.x+Csub.y+Csub.z+Csub.w;

	C[gy*wB+gx]=Csub.x+Csub.y+Csub.z+Csub.w;

}

matmul3-3.cl – code from ati stream sdk

// matmul3.cl

// Multiply two matrices A * B = C

// Device code.

#define TILEX		4

#define TILEX_SHIFT	2

#define TILEY		4

#define TILEY_SHIFT	2

// optimized version : Tile : 4x4

// both matrix A and matrix B are cached into local memory blocks

__kernel void matrixMul3(__global float4 *matrixC,

	__global float4 *matrixA, __global float4* matrixB,

	__local float4 *blockA, __local float4 *blockB,

	int widthA)

{

	float4 sum0=(float4) (0);

	float4 sum1=(float4) (0);

	float4 sum2=(float4) (0);

	float4 sum3=(float4) (0);

	int temp=widthA/4;

	// calculate blockwise-MMM for each pair of blocks along

	// the common width of matrixA and matrixB

	for(int i=0; i<(temp/get_local_size(0)); i++) {

		// local data for blockA from matrixA

		// right now considering only square matrices so width of

		// C = width of A

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+2*get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			2*get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+3*get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			3*get_global_size(0)];

		// local data for blockA from matrixB

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+2*get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+2*get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+3*get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+3*get_global_size(0)];

		barrier(CLK_LOCAL_MEM_FENCE);

		// each local thread will read a float4

		for(int j=0; j<get_local_size(0)<<2; j+=4) {

			// block dimensions of A = block dimensions of C

			float4 tempA0=

				blockA[(j>>2)+

				get_local_id(1)*TILEY*get_local_size(0)];

			float4 tempA1=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+1)*get_local_size(0)];

			float4 tempA2=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+2)*get_local_size(0)];

			float4 tempA3=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+3)*get_local_size(0)];

			// block dimensions of B = block dimensions of C

			// should be localId.x * (TILEX / 4)

			float4 tempB0=

				blockB[get_local_id(0)+

				j*get_local_size(0)];

			float4 tempB1=

				blockB[get_local_id(0)+

				(j+1)*get_local_size(0)];

			float4 tempB2=

				blockB[get_local_id(0)+

				(j+2)*get_local_size(0)];

			float4 tempB3=

				blockB[get_local_id(0)+

				(j+3)*get_local_size(0)];

			sum0.x+=tempA0.x*tempB0.x+tempA0.y*tempB1.x+

				tempA0.z*tempB2.x+tempA0.w*tempB3.x;

			sum0.y+=tempA0.x*tempB0.y+tempA0.y*tempB1.y+

				tempA0.z*tempB2.y+tempA0.w*tempB3.y;

			sum0.z+=tempA0.x*tempB0.z+tempA0.y*tempB1.z+

				tempA0.z*tempB2.z+tempA0.w*tempB3.z;

			sum0.w+=tempA0.x*tempB0.w+tempA0.y*tempB1.w+

				tempA0.z*tempB2.w+tempA0.w*tempB3.w;

			sum1.x+=tempA1.x*tempB0.x+tempA1.y*tempB1.x+

				tempA1.z*tempB2.x+tempA1.w*tempB3.x;

			sum1.y+=tempA1.x*tempB0.y+tempA1.y*tempB1.y+

				tempA1.z*tempB2.y+tempA1.w*tempB3.y;

			sum1.z+=tempA1.x*tempB0.z+tempA1.y*tempB1.z+

				tempA1.z*tempB2.z+tempA1.w*tempB3.z;

			sum1.w+=tempA1.x*tempB0.w+tempA1.y*tempB1.w+

				tempA1.z*tempB2.w+tempA1.w*tempB3.w;

			sum2.x+=tempA2.x*tempB0.x+tempA2.y*tempB1.x+

				tempA2.z*tempB2.x+tempA2.w*tempB3.x;

			sum2.y+=tempA2.x*tempB0.y+tempA2.y*tempB1.y+

				tempA2.z*tempB2.y+tempA2.w*tempB3.y;

			sum2.z+=tempA2.x*tempB0.z+tempA2.y*tempB1.z+

				tempA2.z*tempB2.z+tempA2.w*tempB3.z;

			sum2.w+=tempA2.x*tempB0.w+tempA2.y*tempB1.w+

				tempA2.z*tempB2.w+tempA2.w*tempB3.w;

			sum3.x+=tempA3.x*tempB0.x+tempA3.y*tempB1.x+

				tempA3.z*tempB2.x+tempA3.w*tempB3.x;

			sum3.y+=tempA3.x*tempB0.y+tempA3.y*tempB1.y+

				tempA3.z*tempB2.y+tempA3.w*tempB3.y;

			sum3.z+=tempA3.x*tempB0.z+tempA3.y*tempB1.z+

				tempA3.z*tempB2.z+tempA3.w*tempB3.z;

			sum3.w+=tempA3.x*tempB0.w+tempA3.y*tempB1.w+

				tempA3.z*tempB2.w+tempA3.w*tempB3.w;

		}

	}

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)]=

		sum0;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		get_global_size(0)]=

		sum1;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		2*get_global_size(0)]=

		sum2;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		3*get_global_size(0)]=

		sum3;

}

first of all, thank you very much for your swift answer.

yes. CL_SUCCESS is returned. i don’t think that a blocking enqueue would make any difference. i have the host+device code at the end of the post, although i don’t think it will be very useful, since it’s derived from the oclMatrixMul SDK code (i just made some minor changes). also, i wrapped some functions in a handy library for quick testing. the kernels are basically one from the nvidia SDK and one from the ati stream SDK.

first of all, i made a major mistake! the throughput on the gpu comes out at about 5 teraflops. i have owned many nvidia gfx cards (going back to the tnt2) and i got carried away - i have always been fond of nvidia products, so when i saw a big number i went insane!!!

i also post the oclDeviceQuery output at the end of this post. i have the SDK installed on windows, so i ran the binary provided by the SDK (i didn’t compile it myself - i don’t have visual studio installed, and don’t plan to; i only have a windows license provided by the university where i study). the kernels have different memory access patterns. also, the second kernel has 1 synchronization point (barrier) instead of the 2 the first kernel has.

openclprof is very interesting. but, as you know, in linux land things are complicated! since i run debian squeeze 64bit, i don’t want the old drivers provided by the package manager (currently v195.36). i use the sgfxi script to get the latest drivers (v260.xx, which also contain /usr/lib/libOpenCL.so). the problem is that if the SDK blob provided by nvidia contains files that are incompatible with my system (or that mess with the ones already installed), i would be in trouble! also, i like to have complete control over what is installed where on my file system. so, if nvidia provided a package without drivers or other binaries (JUST the samples+documentation AND the utilities), i would gladly install it. this is the main reason i don’t like SDKs on linux.

when i use the occupancy xls calculator (i have openoffice 3.2 installed; my gfx card has compute capability 1.1) and enter the 60 registers reported by ‘-cl-nv-verbose’ kernel compilation, the xls cannot display values in the blue box [ 3.) GPU Occupancy Data is displayed here and in the graphs: ] - they are all zero. the first kernel is fine, though (15 registers).

below, you can see the ‘-cl-nv-verbose’ output.

#first kernel

$ ./cl_prog -g -i ../matmul2.cl --inc ../matmul2.h --options "-cl-nv-verbose -cl-fast-relaxed-math"

OPENCL COMPILATION LOG: ../matmul2.cl on GeForce 8800 GT [GPU]

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode='  --verbose  '

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Control flags for 'cuModuleLoadDataEx_4' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'

: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options='  --verbose  '

ptxas info    : Compiling entry function 'matrixMul' for 'sm_11'

ptxas info    : Used 15 registers, 28+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]

	opencl program 0x21fe210

		file: ../matmul2.cl

		num_kernels: 1

			kernel 0 (this 0x2c16670)

		kernel0: 0x2c16670

#second kernel

$ ./cl_prog -g -i ../matmul3-3.cl --options "-cl-nv-verbose -cl-fast-relaxed-math"

OPENCL COMPILATION LOG: ../matmul3-3.cl on GeForce 8800 GT [GPU]

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode='  --verbose  '

: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'

: Control flags for 'cuModuleLoadDataEx_4' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'

: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options='  --verbose  '

ptxas info    : Compiling entry function 'matrixMul3' for 'sm_11'

ptxas info    : Used 60 registers, 24+16 bytes smem, 168 bytes cmem[0], 24 bytes cmem[1]

	opencl program 0x2379210

		file: ../matmul3-3.cl

		num_kernels: 1

			kernel 0 (this 0x2824eb0)

		kernel0: 0x2824eb0

oclDeviceQuery output on windows+msys bash console:

$ ./oclDeviceQuery.exe                                                                                                                                      

oclDeviceQuery.exe Starting...                                                                                                                              

OpenCL SW Info:                                                                                                                                             

 CL_PLATFORM_NAME:      NVIDIA CUDA

 CL_PLATFORM_VERSION:   OpenCL 1.0 CUDA 3.2.1                                                                                                               

 OpenCL SDK Revision:   5537818                                                                                                                             

OpenCL Device Info:                                                                                                                                         

1 devices found supporting OpenCL:                                                                                                                         

---------------------------------                                                                                                                          

 Device GeForce 8800 GT                                                                                                                                     

 ---------------------------------                                                                                                                          

  CL_DEVICE_NAME:                       GeForce 8800 GT                                                                                                     

  CL_DEVICE_VENDOR:                     NVIDIA Corporation                                                                                                  

  CL_DRIVER_VERSION:                    260.99                                                                                                              

  CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU                                                                                                  

  CL_DEVICE_MAX_COMPUTE_UNITS:          14                                                                                                                  

  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3                                                                                                                   

  CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64                                                                                                      

  CL_DEVICE_MAX_WORK_GROUP_SIZE:        512                                                                                                                 

  CL_DEVICE_MAX_CLOCK_FREQUENCY:        1500 MHz                                                                                                            

  CL_DEVICE_ADDRESS_BITS:               32                                                                                                                  

  CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte                                                                                                           

  CL_DEVICE_GLOBAL_MEM_SIZE:            511 MByte                                                                                                           

  CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no                                                                                                                  

  CL_DEVICE_LOCAL_MEM_TYPE:             local                                                                                                               

  CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte                                                                                                            

  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte                                                                                                            

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE                                                                              

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE                                                                                           

  CL_DEVICE_IMAGE_SUPPORT:              1                                                                                                                   

  CL_DEVICE_MAX_READ_IMAGE_ARGS:        128                                                                                                                 

  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8                                                                                                                   

  CL_DEVICE_SINGLE_FP_CONFIG:           INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma                                                       

  CL_DEVICE_IMAGE <dim>                 2D_MAX_WIDTH     4096

                                        2D_MAX_HEIGHT    32768                                                                                              

                                        3D_MAX_WIDTH     2048                                                                                               

                                        3D_MAX_HEIGHT    2048                                                                                               

                                        3D_MAX_DEPTH     2048                                                                                               

  CL_DEVICE_EXTENSIONS:                 cl_khr_byte_addressable_store

                                        cl_khr_icd                                                                                                          

                                        cl_khr_gl_sharing                                                                                                   

                                        cl_nv_d3d9_sharing                                                                                                  

                                        cl_nv_compiler_options                                                                                              

                                        cl_nv_device_attribute_query                                                                                        

                                        cl_nv_pragma_unroll                                                                                                 

                                        cl_khr_global_int32_base_atomics                                                                                    

                                        cl_khr_global_int32_extended_atomics                                                                                

  CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.1

  NUMBER OF MULTIPROCESSORS:            14                                                                                                                  

  NUMBER OF CUDA CORES:                 112                                                                                                                 

  CL_DEVICE_REGISTERS_PER_BLOCK_NV:     8192                                                                                                                

  CL_DEVICE_WARP_SIZE_NV:               32                                                                                                                  

  CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE                                                                                                             

  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE                                                                                                             

  CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE                                                                                                            

  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0                                                                   

oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.2.1, SDK Revision = 5537818, NumDevs = 1, Device = GeForce 8800 GT        

System Info:                                                                                                                                                

Local Time/Date = 19:41:18, 11/10/2010                                                                                                                     

 CPU Arch: 0                                                                                                                                                

 CPU Level: 16                                                                                                                                              

 # of CPU processors: 4                                                                                                                                     

 Windows Build: 2600                                                                                                                                        

 Windows Ver: 5.1                                                                                                                                           

PASSED                                                                                                                                                      

Press <Enter> to Quit...                                                                                                                                    

-----------------------------------------------------------

host+device code:

matmul-dev.c – code based on samples from nvidia gpu computing sdk

(this won’t be very helpful)

/* matmul-dev.c */

#include <stdio.h>

#include <stdlib.h>

#include <stdint.h>

#include <inttypes.h>

#include <string.h>

#include <CL/cl.h>

#include "std_gen.h"

#include "clhlp"

#include "matmul-dev.h"

#include "matmul2.h"

/* OpenCL specific variables */

static struct cl_arch_t clarch;

static struct cl_ctx_t clctx;

static struct cl_cmdqueue_t clcmdq;

static struct cl_prog_t clprg;

/* OpenCL device memory for matrices */

static cl_mem d_a, d_b, d_c;

static unsigned int mem_size_a, mem_size_b, mem_size_c;

static void	init_kernel(const int which);

static void	init_mem(float *a, float *b,

			const int ha, const int wa, const int wb);

static void	deinit_kernel(void);

static void	deinit_mem(void);

static void	print_profiling(const char *fun, cl_event event,

			const int ha, const int wa, const int wb,

			const uint64_t ms, const int n_iter);

static void

print_profiling(const char *fun, cl_event event,

	const int ha, const int wa, const int wb,

	const uint64_t ms, const int n_iter)

{

	uint64_t n_ops;

	double gflops, t;

	t=(double) ms/(n_iter*1000.0);

	n_ops=(uint64_t) 2*wa*ha*wb;

	gflops=1.0e-9*n_ops/t;

	printf("[%s]: %" PRIu64 " msec, "

		"Throughput = %.4f GFlops, Time = %.5f sec, "

		"Size = %" PRIu64 "\t",

		fun, ms, gflops, t, n_ops

	);

	printf("Kernel execution time on device: %.5f sec\n",

		get_execution_time(event)/1000.0

	);

}

void

init_cl(const int typ)

{

	const char thisfun[]="init_cl";

	int res;

	cl_device_type cltyp;

	cltyp=CL_DEVICE_TYPE_ALL;

	switch(typ) {

	case 1:

		cltyp=CL_DEVICE_TYPE_GPU;

	break;

	case 2:

		cltyp=CL_DEVICE_TYPE_CPU;

	break;

	}

	if(!init_cl_arch(&clarch, 0))

		BAILOUT_CL_NO_PLAT(thisfun);

	if(!init_cl_ctx(&clctx, cltyp, clarch.platforms[0], 0))

		BAILOUT_CL_NO_CTX(thisfun);

#ifdef DO_PROFILING

	res=init_cl_cmdqueue(&clcmdq,

		CL_QUEUE_PROFILING_ENABLE, clctx.context, clctx.devices[0]

	);

#else

	res=init_cl_cmdqueue(&clcmdq, 0, clctx.context, clctx.devices[0]);

#endif /* DO_PROFILING */

	if(!res)

		BAILOUT_CL_NO_CMD_QUEUE(thisfun);

}

void

deinit_cl(void)

{

	deinit_cl_cmdqueue(&clcmdq);

	deinit_cl_ctx(&clctx);

	deinit_cl_arch(&clarch);

}

static void

init_kernel(const int which)

{

	const char thisfun[]="init_kernel";

	char header[256];

	switch(which) {

	case 2:

		strcpy(header, "../matmul2.h");

		if(

			!init_cl_prog_from_src_ex(&clprg,

				"../matmul2.cl", "-cl-fast-relaxed-math",

				NULL, (char (*)[256]) header, 1, NULL, 0,

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	break;

	case 3:

		if(

			!init_cl_prog_from_src(&clprg, "../matmul3-3.cl",

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	break;

	default: /* NOTE: only for testing purposes */

		if(

			!init_cl_prog_from_src(&clprg, "../matmul1.cl",

				clctx.context, 0

			)

		)

			BAILOUT_CL_PROG_ERR(thisfun);

	}

}

static void

deinit_kernel(void)

{

	deinit_cl_prog(&clprg);

}

static void

init_mem(float *a, float *b, const int ha, const int wa, const int wb)

{

	const char thisfun[]="init_mem";

	cl_int errcode;

	mem_size_a=ha*wa*sizeof(float);

	mem_size_b=wa*wb*sizeof(float);

	mem_size_c=ha*wb*sizeof(float);

	d_c=clCreateBuffer(

		clctx.context, CL_MEM_WRITE_ONLY,

		mem_size_c, NULL, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

	d_a=clCreateBuffer(

		clctx.context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,

		mem_size_a, a, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

	d_b=clCreateBuffer(

		clctx.context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,

		mem_size_b, b, &errcode

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clCreateBuffer()");

}

static void

deinit_mem(void)

{

	const char thisfun[]="deinit_mem";

	cl_int errcode;

	errcode=clReleaseMemObject(d_a);

	errcode|=clReleaseMemObject(d_b);

	errcode|=clReleaseMemObject(d_c);

	ASSERT_CL_SUCC(errcode, thisfun, "clReleaseMemObject()");

}

void

mul_matrices_f_dev_le16(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev_le16";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(1);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=wb;

	local_worksize[1]=ha;

	global_worksize[0]=wb;

	global_worksize[1]=ha;

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, sizeof(int), (void *) &w_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, sizeof(int), (void *) &w_b

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

void

mul_matrices_f_dev2(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev2";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(3);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=BLOCK_SIZE;

	local_worksize[1]=BLOCK_SIZE;

	global_worksize[0]=round_up(BLOCK_SIZE, wb);

	global_worksize[1]=round_up(BLOCK_SIZE, ha);

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		5, sizeof(int), (void *) &w_a

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

void

mul_matrices_f_dev(float *res, float *a, float *b,

	const int ha, const int wa, const int wb)

{

	const char thisfun[]="mul_matrices_f_dev";

	int w_a, w_b;

	size_t local_worksize[2], global_worksize[2];

	cl_int errcode;

	cl_event kernel_execution;

#ifdef DO_PROFILING

	int n_iter, j;

	uint64_t ms;

#endif /* DO_PROFILING */

	init_kernel(2);

	init_mem(a, b, ha, wa, wb);

#ifdef DO_PROFILING

	ms=0;

	n_iter=10;

	for(j=-1; j<n_iter; j++) {

		/*

		 * sync queue to host and

		 * start timer first time through loop

		 */

		if(!j) {

			clFinish(clcmdq.queue);

			ms=get_sys_time();

		}

#endif /* DO_PROFILING */

	local_worksize[0]=BLOCK_SIZE;

	local_worksize[1]=BLOCK_SIZE;

	global_worksize[0]=round_up(BLOCK_SIZE, wb);

	global_worksize[1]=round_up(BLOCK_SIZE, ha);

	w_a=wa;

	w_b=wb;

	errcode=clSetKernelArg(clprg.kernel0,

		0, sizeof(cl_mem), (void *) &d_c

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		1, sizeof(cl_mem), (void *) &d_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		2, sizeof(cl_mem), (void *) &d_b

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		3, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		4, BLOCK_SIZE*BLOCK_SIZE*sizeof(float), NULL

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		5, sizeof(int), (void *) &w_a

	);

	errcode|=clSetKernelArg(clprg.kernel0,

		6, sizeof(int), (void *) &w_b

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clSetKernelArg()");

	errcode=clEnqueueNDRangeKernel(clcmdq.queue, clprg.kernel0,

		2, NULL, global_worksize, local_worksize, 0, NULL,

		&kernel_execution

	);

	ASSERT_CL_SUCC(errcode, thisfun, "clEnqueueNDRangeKernel()");

#ifdef DO_PROFILING

	}

#endif /* DO_PROFILING */

	cl_memcpy_to_host(res, d_c, mem_size_c, clcmdq.queue);

#ifdef DO_PROFILING

	clFinish(clcmdq.queue);

	ms=get_sys_time()-ms;

	print_profiling(thisfun, kernel_execution, ha, wa, wb, ms, n_iter);

#endif /* DO_PROFILING */

	deinit_mem();

	deinit_kernel();

}

matmul2.h

/* thread block size */

#define BLOCK_SIZE	16

matmul2.cl – code from nvidia gpu computing sdk

// matmul2.cl

// Multiply two matrices A * B = C

// Device code.

//include "matmul2.h"

#define AS(i, j)	As[(i)*BLOCK_SIZE+(j)]

#define BS(i, j)	Bs[(i)*BLOCK_SIZE+(j)]

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))

void

matrixMul(__global float *C, __global float *A, __global float *B,

	__local float *As, __local float *Bs,

	int wA, int wB)

{

	// 2D thread ID

	int gx=get_global_id(0);

	int gy=get_global_id(1);

	// block index (in NDRange)

	int bx=get_group_id(0);

	int by=get_group_id(1);

	// thread index (in workgroup)

	int tx=get_local_id(0);

	int ty=get_local_id(1);

	// index of the first sub-matrix of A processed by the block

	int aBegin=wA*BLOCK_SIZE*by;

	// index of the last sub-matrix of A processed by the block

	int aEnd=aBegin+wA-1;

	// step size used to iterate through the sub-matrices of A

	int aStep=BLOCK_SIZE;

	// index of the first sub-matrix of B processed by the block

	int bBegin=BLOCK_SIZE*bx;

	// step size used to iterate through the sub-matrices of B

	int bStep=BLOCK_SIZE*wB;

	float4 Csub=(float4) (0.0f, 0.0f, 0.0f, 0.0f);

	// loop over all the sub-matrices of A and B

	// required to compute the block sub-matrix

	for(

		int a=aBegin, b=bBegin;

		a<=aEnd;

		a+=aStep, b+=bStep

	) {

		// load the matrices from global memory

		// to local memory; each thread loads

		// one element of each matrix

		AS(ty, tx)=A[a+wA*ty+tx];

		BS(ty, tx)=B[b+wB*ty+tx];

		// synchronize to make sure the matrices are loaded

		barrier(CLK_LOCAL_MEM_FENCE);

		// multiply the two matrices together;

		// each thread computes one element of the block sub-matrix

		#pragma unroll

		for(int k=0; k<BLOCK_SIZE; k+=4) {

			float4 temp1=(float4) (

				AS(ty, k), AS(ty, k+1),

				AS(ty, k+2), AS(ty, k+3)

			);

			float4 temp2=(float4) (

				BS(k, tx), BS(k+1, tx),

				BS(k+2, tx), BS(k+3, tx)

			);

			Csub+=temp1*temp2;

		}

		// synchronize to make sure that the preceding

		// computation is done before loading two new

		// sub-matrices of A and B in the next iteration

		barrier(CLK_LOCAL_MEM_FENCE);

	}

	// write the block sub-matrix to device memory;

	// each thread writes one element

	//C[wB*BLOCK_SIZE*by+BLOCK_SIZE*bx+wB*ty+tx]=

	//	Csub.x+Csub.y+Csub.z+Csub.w;

	C[gy*wB+gx]=Csub.x+Csub.y+Csub.z+Csub.w;

}

matmul3-3.cl – code from ati stream sdk

// matmul3.cl

// Multiply two matrices A * B = C

// Device code.

#define TILEX		4

#define TILEX_SHIFT	2

#define TILEY		4

#define TILEY_SHIFT	2

// optimized version : Tile : 4x4

// both matrix A and matrix B are cached into local memory blocks

__kernel void matrixMul3(__global float4 *matrixC,

	__global float4 *matrixA, __global float4* matrixB,

	__local float4 *blockA, __local float4 *blockB,

	int widthA)

{

	float4 sum0=(float4) (0);

	float4 sum1=(float4) (0);

	float4 sum2=(float4) (0);

	float4 sum3=(float4) (0);

	int temp=widthA/4;

	// calculate blockwise-MMM for each pair of blocks along

	// the common width of matrixA and matrixB

	for(int i=0; i<(temp/get_local_size(0)); i++) {

		// local data for blockA from matrixA

		// right now considering only square matrices so width of

		// C = width of A

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+2*get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			2*get_global_size(0)];

		blockA[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+3*get_local_size(0)]=

			matrixA[i*get_local_size(0)+get_local_id(0)+

			(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

			3*get_global_size(0)];

		// local data for blockA from matrixB

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+2*get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+2*get_global_size(0)];

		blockB[get_local_id(0)+get_local_size(0)*

			(get_local_id(1)<<TILEY_SHIFT)+3*get_local_size(0)]=

			matrixB[get_global_id(0)+

			((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*

			get_global_size(0)+3*get_global_size(0)];

		barrier(CLK_LOCAL_MEM_FENCE);

		// each local thread will read a float4

		for(int j=0; j<get_local_size(0)<<2; j+=4) {

			// block dimensions of A = block dimensions of C

			float4 tempA0=

				blockA[(j>>2)+

				get_local_id(1)*TILEY*get_local_size(0)];

			float4 tempA1=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+1)*get_local_size(0)];

			float4 tempA2=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+2)*get_local_size(0)];

			float4 tempA3=

				blockA[(j>>2)+

				(get_local_id(1)*TILEY+3)*get_local_size(0)];

			// block dimensions of B = block dimensions of C

			// should be localId.x * (TILEX / 4)

			float4 tempB0=

				blockB[get_local_id(0)+

				j*get_local_size(0)];

			float4 tempB1=

				blockB[get_local_id(0)+

				(j+1)*get_local_size(0)];

			float4 tempB2=

				blockB[get_local_id(0)+

				(j+2)*get_local_size(0)];

			float4 tempB3=

				blockB[get_local_id(0)+

				(j+3)*get_local_size(0)];

			sum0.x+=tempA0.x*tempB0.x+tempA0.y*tempB1.x+

				tempA0.z*tempB2.x+tempA0.w*tempB3.x;

			sum0.y+=tempA0.x*tempB0.y+tempA0.y*tempB1.y+

				tempA0.z*tempB2.y+tempA0.w*tempB3.y;

			sum0.z+=tempA0.x*tempB0.z+tempA0.y*tempB1.z+

				tempA0.z*tempB2.z+tempA0.w*tempB3.z;

			sum0.w+=tempA0.x*tempB0.w+tempA0.y*tempB1.w+

				tempA0.z*tempB2.w+tempA0.w*tempB3.w;

			sum1.x+=tempA1.x*tempB0.x+tempA1.y*tempB1.x+

				tempA1.z*tempB2.x+tempA1.w*tempB3.x;

			sum1.y+=tempA1.x*tempB0.y+tempA1.y*tempB1.y+

				tempA1.z*tempB2.y+tempA1.w*tempB3.y;

			sum1.z+=tempA1.x*tempB0.z+tempA1.y*tempB1.z+

				tempA1.z*tempB2.z+tempA1.w*tempB3.z;

			sum1.w+=tempA1.x*tempB0.w+tempA1.y*tempB1.w+

				tempA1.z*tempB2.w+tempA1.w*tempB3.w;

			sum2.x+=tempA2.x*tempB0.x+tempA2.y*tempB1.x+

				tempA2.z*tempB2.x+tempA2.w*tempB3.x;

			sum2.y+=tempA2.x*tempB0.y+tempA2.y*tempB1.y+

				tempA2.z*tempB2.y+tempA2.w*tempB3.y;

			sum2.z+=tempA2.x*tempB0.z+tempA2.y*tempB1.z+

				tempA2.z*tempB2.z+tempA2.w*tempB3.z;

			sum2.w+=tempA2.x*tempB0.w+tempA2.y*tempB1.w+

				tempA2.z*tempB2.w+tempA2.w*tempB3.w;

			sum3.x+=tempA3.x*tempB0.x+tempA3.y*tempB1.x+

				tempA3.z*tempB2.x+tempA3.w*tempB3.x;

			sum3.y+=tempA3.x*tempB0.y+tempA3.y*tempB1.y+

				tempA3.z*tempB2.y+tempA3.w*tempB3.y;

			sum3.z+=tempA3.x*tempB0.z+tempA3.y*tempB1.z+

				tempA3.z*tempB2.z+tempA3.w*tempB3.z;

			sum3.w+=tempA3.x*tempB0.w+tempA3.y*tempB1.w+

				tempA3.z*tempB2.w+tempA3.w*tempB3.w;

		}

	}

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)]=

		sum0;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		get_global_size(0)]=

		sum1;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		2*get_global_size(0)]=

		sum2;

	matrixC[get_global_id(0)+

		(get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+

		3*get_global_size(0)]=

		sum3;

}

thanks for providing the complete code. Give me a bit more time, I will look through it more carefully and hopefully we will solve the problem :-)

my friend, once again thanx.

i ran the program on windows (on the amd cpu device). the second kernel crashes (segfaults with the messagebox ‘The instruction … referenced memory at … The memory could not be “read”’). the first kernel runs very slowly, but generally, as far as i saw, amd’s opencl cpu implementation is extremely slow (i use mingw on windows).

i also tried the opencl visual profiler (with the 8800gt device), but i get the “could not read profiling output” error (i added a clReleaseEvent() call to the host code, since i read in this forum that the visual profiler is very sensitive to leaks). but i could not get it to work and it crashes frequently, so i am not going to waste any more time on it.

next step will be to compile the oclMatrixMul sdk sample on linux with the second kernel and see what happens.

again, thanx for your help.

Isn’t the bug related to allocating only a small amount of local memory? The NVidia matrix-mul example needs only BLOCK_SIZE * BLOCK_SIZE floats per __local buffer, but the AMD MatrixMul example needs (blockSize * 4) * (blockSize * 4)… The “memory could not be read” error points to the same problem, I think, as does the fact that only larger data shows the bug…

Btw. If you have already installed CUDA on linux, you don’t have to install OpenCL separately; it is part of the CUDA toolkit ;-)

KARBOUS SAVES THE DAY!!! this was the bug! now everything works ok. thanks my friend, you rule!

can you tell me what files the installer installs and where, so i can be sure it doesn’t mess with already-installed files (from the sgfxi script)? i would be very interested in installing the toolkit on linux (for me it’s a top priority, since opencl is part of my diploma thesis).

also, does the toolkit include the nvidia icd driver? will the application be able to “see” the amd cpu as an opencl device if i use the icd driver or do i have to install amd’s icd driver as well (which i don’t plan to do, since i’m afraid it will break things on the linux platform)? sorry for the above questions, but in the forum i couldn’t find anything that is close to my situation (drivers not from the package manager).

again, a very big THANX!

:biggrin: bugs made by others are always easier to find than one’s own… (I’m working on my diploma as well and still struggling. btw, what is your diploma thesis about?)

The toolkit is just the SDK with the Profiler, samples, libs and docs. It has nothing to do with the driver (so it should work with sgfxi). As the release notes of the NVidia SDK say, it is a self-extracting archive; by default it is unpacked to your home directory and to /usr/local.

The ICD is just a file (I think) stored in a special place - /etc/OpenCL/vendors/ - which must be “installed” to let an application choose which OpenCL implementation to use. So it won’t harm your system, and if you ever feel it has influenced the system (I doubt it will), you can always erase that file.
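The registry mechanism can be illustrated without touching the real /etc/OpenCL/vendors (the directory name and the libcuda.so entry are as described in this thread; the throwaway temp directory is only so the example is harmless to run):

```shell
# each vendor drops a one-line .icd file naming its OpenCL library;
# the ICD loader scans the vendors directory and dlopen()s each entry.
# use a throwaway directory so this demo never touches the real registry.
vendors="$(mktemp -d)/OpenCL/vendors"
mkdir -p "$vendors"

echo "libcuda.so" > "$vendors/nvidia.icd"

registered=$(cat "$vendors/nvidia.icd")
echo "registered: $registered"

# "uninstalling" an implementation is just deleting its .icd file
rm "$vendors/nvidia.icd"
```

Nothing else is shared between implementations, which is why adding or removing a vendor's .icd file cannot break the rest of the system.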

just finished with the code. i made some changes and now it runs like a charm. basically, i needed to code a trivial application, so i could build a basic “abstraction” opencl wrapper library around it. i really like opencl’s interface (there are some similarities with opengl, which i used some time ago) and, being a long-time nvidia fanboy myself, i convinced some guys at the lab to give it a shot before choosing the topic of my diploma thesis. some of them even bought nvidia gear for themselves when i showed them the gpu’s tremendous power on simd problems.

well, i just started the struggle!! it involves an image processing pipeline (tasks) over the network. the image processing is done locally (one task) while other tasks are distributed over the network. for the image processing part i plan to use opencl (and maybe opengl). this is the data-parallel part. for other parts, i use pthreads for task-parallelization + socket programming for networking. other students that will follow will convert it to something that resembles a stack of web services (i just need something like a prototype). it’s nothing fancy, but i must get acquainted with opencl in the first place. so, i began with the trivial example of matrix multiplication. i’m an undergraduate student (the diploma thesis is required for graduation).

cool. i shall install it and see what happens!! (for the presentation i am sure some graphs from the profiler will be very useful).

yes, but i still need to install amd’s opencl driver as well (so my amd cpu will be listed as an opencl device along with the nvidia gpu). am i right? the question is where will amd’s libOpenCL.so be installed? will it overwrite nvidia’s one?

initially, i thought that the icd is a “general” libOpenCL.so installed in /usr/lib that dispatches opencl calls to the “real” libOpenCL.so, depending on the chosen context’s device(s) that are accompanied by the appropriate driver (the “real” .so).

The diploma’s topic sounds cool, anyway :smile: . Good luck.

Well, I suppose your amd CPU is working fine, so there is no need to install a driver (unless you want the latest one). However, installing the ICDs is mandatory. The ICDs are text files containing the names of the dynamic OpenCL libs. I’ve got these files in /etc/OpenCL/vendors:

atiocl32.icd (contains libatiocl32.so)

atiocl64.icd (contains libatiocl64.so)

nvidia.icd (contains libcuda.so)
