I have two kernels that multiply two square matrices (no extensions used).
I also have a dummy host CPU implementation, so I can compare the results.
I use profiling (by creating the CL command queue with profiling enabled) and I collect the results after a clFinish() on this queue.
I also use gettimeofday() for host timing, and I run the binary under the time shell function.
When the matrices aren't large, the results are "normal", except that the second kernel returns 0 as the end time from clGetEventProfilingInfo(). I checked with clGetEventInfo(), and it returns 0 (CL_COMPLETE) when queried with CL_EVENT_COMMAND_EXECUTION_STATUS. The first kernel returns the expected results from clGetEventProfilingInfo().
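For concreteness, this is the profiling pattern I'm describing (a minimal sketch, not my exact host code; names are illustrative and error checks are trimmed):

/* Sketch of the profiling setup described above (illustrative). */
cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev,
        CL_QUEUE_PROFILING_ENABLE, &err);   /* profiling enabled */

cl_event ev;
err = clEnqueueNDRangeKernel(q, kernel, 2, NULL,
        global_ws, local_ws, 0, NULL, &ev);
clFinish(q);                                /* wait for completion */

cl_ulong t_start = 0, t_end = 0;            /* nanoseconds */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
        sizeof(t_end), &t_end, NULL);       /* comes back 0 for kernel 2 */
printf("device time: %.5f sec\n", (t_end - t_start) / 1e9);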
#dummy host implementation
$time ./matmul1 -p -t 0 -ha 256 -wa 256 -wb 256
82msec (0.082 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dummy
real 0m0.086s
user 0m0.092s
sys 0m0.000s
#dev-gpu1 is for the first kernel
$time ./matmul1 -p -t 110 -ha 256 -wa 256 -wb 256
[mul_matrices_f_dev]: 6 msec, Throughput = 55.9241 GFlops, Time = 0.00060 sec, Size = 33554432 Kernel execution time on device: 0.00049 sec
452msec (0.452 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu1
real 0m0.580s
user 0m0.288s
sys 0m0.288s
#dev-gpu2 is for the second kernel
$time ./matmul1 -p -t 115 -ha 256 -wa 256 -wb 256
[mul_matrices_f_dev2]: 1 msec, Throughput = 335.5443 GFlops, Time = 0.00010 sec, Size = 33554432 Kernel execution time on device: 17157461376.46042 sec
533msec (0.533 sec) - Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu2
real 0m0.661s
user 0m0.372s
sys 0m0.288s
Timings (reading the first lines of output above):
[mul_matrices_f_dev2]: 1 msec ← gettimeofday() elapsed time for the kernel execution only
Throughput = 335.5443 GFlops, Time = 0.00010 sec ← the same measurement (now in sec instead of msec)
Size = 33554432, Kernel execution time on device: 17157461376.46042 sec ← clGetEventProfilingInfo() end minus start, converted to sec (since end comes back 0, the unsigned cl_ulong subtraction wraps around, hence the absurd value)
Multiplication of A[256, 256] x B[256, 256] = C[256, 256] on dev-gpu2 ← matrix dimensions and target (here we ran the second kernel)
After that you see the output of the time shell command.
To verify correctness, I rerun the tests and dump the input/output matrices to a text file; then I compare with diff and see no difference.
I also run the tests in a different order (e.g. first the second kernel, then the first) and the results are the same.
When the matrices get "bigger", the second kernel speeds along like a demon!!!
#here the host implementation is optimized, because we don't want it to run forever!
$time ./matmul1 -p -t 16 -ha 2048 -wa 2048 -wb 2048
[mul_matrices_f_4_host_sse_aligned_pthr]: 91332 msec, Throughput = 1.8810 GFlops, Time = 9.13320 sec, Size = 17179869184
91332msec (91.332 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on host-pthr+sse2
real 1m31.494s
user 5m40.401s
sys 0m0.280s
#first kernel -- consistent results with previous test
$time ./matmul1 -p -t 110 -ha 2048 -wa 2048 -wb 2048
[mul_matrices_f_dev]: 2208 msec, Throughput = 77.8074 GFlops, Time = 0.22080 sec, Size = 17179869184 Kernel execution time on device: 0.21784 sec
2954msec (2.954 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on dev-gpu1
real 0m3.239s
user 0m0.496s
sys 0m0.352s
#second kernel -- something must be wrong!
$time ./matmul1 -p -t 115 -ha 2048 -wa 2048 -wb 2048
[mul_matrices_f_dev2]: 31 msec, Throughput = 5541.8933 GFlops, Time = 0.00310 sec, Size = 17179869184 Kernel execution time on device: 17157460771.76913 sec
644msec (0.644 sec) - Multiplication of A[2048, 2048] x B[2048, 2048] = C[2048, 2048] on dev-gpu2
real 0m0.927s
user 0m0.580s
sys 0m0.348s
Wow! I have always been an nvidia aficionado, but these results are beyond my imagination! 5 petaflops throughput???
As you can see, in both tests the second kernel cannot report an end time. This happens every time.
The results you see are with the "-cl-fast-relaxed-math" option, but even without this option the second kernel still cannot report an end time.
The throughput calculation is made from the gettimeofday() data; the clGetEventProfilingInfo() result is only what is displayed after "Kernel execution time on device".
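For clarity, this is how the Throughput figure relates to the printed numbers (a sketch with illustrative names, not my exact code):

/* flop count for an N x N matmul (one multiply + one add per term): 2*N^3
 *   2*256^3  = 33554432,  2*2048^3 = 17179869184   (the "Size" field) */
double size   = 2.0 * (double)N * (double)N * (double)N;
double gflops = size / (seconds * 1.0e9);   /* seconds from gettimeofday() */
/* e.g. 33554432 / (0.00010 * 1e9) = 335.5443 GFlops, as printed above */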
Again, when I dump the results and compare them with diff, they are correct.
Also, the second kernel consistently runs faster (e.g. with dimensions 1024x1024, 512x512, etc.).
So, I would appreciate any help with the following:
Any ideas why a kernel cannot report an end time?
Can a GeForce 8800 GT have this kind of performance?
Is there anything else I should know about profiling? (For example, any tool for Linux that would make things clearer?)
I tried to google for "clgeteventprofilinginfo zero end" but nothing came up. I also searched this forum and couldn't find anything.
I have an AMD Phenom quad-core (CPU) and I run Debian squeeze.
Are you checking the error code of the clGetEventProfilingInfo function? Maybe you are passing the wrong event to the function… Or try enqueuing the kernel with a blocking call. …Try posting your host code.
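Something like this (just a sketch with illustrative names; waiting on the event is what I mean by a blocking call):

/* Sketch: always check the status code (illustrative, not your code). */
clWaitForEvents(1, &ev);                  /* block until the event completes */

cl_ulong t_end = 0;
cl_int err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                     sizeof(t_end), &t_end, NULL);
if (err != CL_SUCCESS)                    /* e.g. CL_PROFILING_INFO_NOT_AVAILABLE */
    fprintf(stderr, "profiling query failed: %d\n", err);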
Sorry, I'm not an expert - I don't know how much performance can be gained from such a graphics card. It depends on the memory access pattern, register usage… Posting the output of the SDK sample oclDeviceQuery might help ;-)
There is openclprof by NVidia (on Linux, usually in the folder /usr/local/cuda/openclprof/bin), which can report divergent branches, memory throughput, … - it might show why the second kernel is so quick.
There are also the occupancy xls calculator and the cl-nv-verbose compiler option, which are handy.
First of all, thank you very much for your swift answer.
Yes, CL_SUCCESS is returned. I don't think a blocking enqueue would make any difference. I have the host+device code at the end of the post, although I don't think it will be useful, since it's derived from the oclMatrixMul SDK code (I just made some minor changes). Also, I wrapped some functions in a handy library for quick testing. The kernels are basically the one from the nvidia SDK and the other from the ati stream SDK.
Also, I made a major mistake! The throughput on the gpu is about 5 teraflops, not petaflops. I have owned many nvidia gfx cards (from the TNT2 on) and I got carried away - I have always been fond of nvidia products, so when I saw a big number I went insane!!!
I also post the oclDeviceQuery output at the end of this post. I have the SDK installed on windows, so I ran the binary provided by the SDK (I didn't compile it myself - I don't have Visual Studio installed, and don't plan to; I only have a windows license provided by the university I study at). The kernels have different memory access patterns. Also, the second kernel has one synchronization point (barrier) instead of the two the first kernel has.
openclprof is very interesting. But, as you know, in linux land things are complicated! Since I run debian squeeze 64-bit, I don't like the old drivers provided by the package manager (currently v195.36); I use the sgfxi script to get the latest drivers (v260.xx, which also contain /usr/lib/libOpenCL.so). The problem is that if the SDK blob provided by nvidia contains files that are incompatible with my system (or that mess with the ones already installed), I would be in trouble! Also, I like to have complete control over what is installed and where on my file system. So, if nvidia provided a package without drivers or other binaries (JUST the samples+documentation AND the utilities), I would gladly install it. This is the main reason I don't like SDKs on linux.
If I use the occupancy xls calculator (I have openoffice 3.2 installed - my gfx card has compute capability 1.1) and enter the 60 registers reported by the '-cl-nv-verbose' kernel compilation, the spreadsheet cannot display values in the blue box [ 3.) GPU Occupancy Data is displayed here and in the graphs: ] - they are all zero. The first kernel is fine, though (15 registers).
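A quick sanity check on that zero (my own arithmetic; I'm assuming a 16x16 work-group is entered, i.e. BLOCK_SIZE = 16 as in the nvidia sample - the calculator needs some block size):

    60 registers/thread x 256 threads/group = 15360 registers/group > 8192
    (8192 = CL_DEVICE_REGISTERS_PER_BLOCK_NV on this card, see below)

so not even one such group fits on a multiprocessor, and the occupancy data reads as zero. For the first kernel, 15 x 256 = 3840 <= 8192, which is why it shows sensible values.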
Below you can see the '-cl-nv-verbose' output.
#first kernel
$ ./cl_prog -g -i ../matmul2.cl --inc ../matmul2.h --options "-cl-nv-verbose -cl-fast-relaxed-math"
OPENCL COMPILATION LOG: ../matmul2.cl on GeForce 8800 GT [GPU]
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode=' --verbose '
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Control flags for 'cuModuleLoadDataEx_4' disable search path
: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'
: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options=' --verbose '
ptxas info : Compiling entry function 'matrixMul' for 'sm_11'
ptxas info : Used 15 registers, 28+16 bytes smem, 199 bytes cmem[0], 4 bytes cmem[1]
opencl program 0x21fe210
file: ../matmul2.cl
num_kernels: 1
kernel 0 (this 0x2c16670)
kernel0: 0x2c16670
#second kernel
$ ./cl_prog -g -i ../matmul3-3.cl --options "-cl-nv-verbose -cl-fast-relaxed-math"
OPENCL COMPILATION LOG: ../matmul3-3.cl on GeForce 8800 GT [GPU]
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_11', usage mode=' --verbose '
: Considering profile 'compute_11' for gpu='sm_11' in 'cuModuleLoadDataEx_4'
: Control flags for 'cuModuleLoadDataEx_4' disable search path
: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_11'
: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_11', ocg options=' --verbose '
ptxas info : Compiling entry function 'matrixMul3' for 'sm_11'
ptxas info : Used 60 registers, 24+16 bytes smem, 168 bytes cmem[0], 24 bytes cmem[1]
opencl program 0x2379210
file: ../matmul3-3.cl
num_kernels: 1
kernel 0 (this 0x2824eb0)
kernel0: 0x2824eb0
oclDeviceQuery output on windows+msys bash console:
$ ./oclDeviceQuery.exe
oclDeviceQuery.exe Starting...
OpenCL SW Info:
CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.0 CUDA 3.2.1
OpenCL SDK Revision: 5537818
OpenCL Device Info:
1 devices found supporting OpenCL:
---------------------------------
Device GeForce 8800 GT
---------------------------------
CL_DEVICE_NAME: GeForce 8800 GT
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 260.99
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 14
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1500 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 511 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma
CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 4096
2D_MAX_HEIGHT 32768
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048
CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_d3d9_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.1
NUMBER OF MULTIPROCESSORS: 14
NUMBER OF CUDA CORES: 112
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.2.1, SDK Revision = 5537818, NumDevs = 1, Device = GeForce 8800 GT
System Info:
Local Time/Date = 19:41:18, 11/10/2010
CPU Arch: 0
CPU Level: 16
# of CPU processors: 4
Windows Build: 2600
Windows Ver: 5.1
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
host+device code:
matmul-dev.c – code based on samples from nvidia gpu computing sdk
// matmul2.cl
// Multiply two matrices A * B = C
// Device code.
//include "matmul2.h"

#define AS(i, j) As[(i)*BLOCK_SIZE+(j)]
#define BS(i, j) Bs[(i)*BLOCK_SIZE+(j)]

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void
matrixMul(__global float *C, __global float *A, __global float *B,
          __local float *As, __local float *Bs,
          int wA, int wB)
{
    // 2D thread ID
    int gx=get_global_id(0);
    int gy=get_global_id(1);
    // block index (in NDRange)
    int bx=get_group_id(0);
    int by=get_group_id(1);
    // thread index (in workgroup)
    int tx=get_local_id(0);
    int ty=get_local_id(1);
    // index of the first sub-matrix of A processed by the block
    int aBegin=wA*BLOCK_SIZE*by;
    // index of the last sub-matrix of A processed by the block
    int aEnd=aBegin+wA-1;
    // step size used to iterate through the sub-matrices of A
    int aStep=BLOCK_SIZE;
    // index of the first sub-matrix of B processed by the block
    int bBegin=BLOCK_SIZE*bx;
    // step size used to iterate through the sub-matrices of B
    int bStep=BLOCK_SIZE*wB;
    float4 Csub=(float4) (0.0f, 0.0f, 0.0f, 0.0f);
    // loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for(int a=aBegin, b=bBegin; a<=aEnd; a+=aStep, b+=bStep) {
        // load the matrices from global memory to local memory;
        // each thread loads one element of each matrix
        AS(ty, tx)=A[a+wA*ty+tx];
        BS(ty, tx)=B[b+wB*ty+tx];
        // synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);
        // multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        #pragma unroll
        for(int k=0; k<BLOCK_SIZE; k+=4) {
            float4 temp1=(float4) (AS(ty, k),   AS(ty, k+1),
                                   AS(ty, k+2), AS(ty, k+3));
            float4 temp2=(float4) (BS(k, tx),   BS(k+1, tx),
                                   BS(k+2, tx), BS(k+3, tx));
            Csub+=temp1*temp2;
        }
        // synchronize to make sure that the preceding computation is
        // done before loading two new sub-matrices of A and B in the
        // next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write the block sub-matrix to device memory;
    // each thread writes one element
    //C[wB*BLOCK_SIZE*by+BLOCK_SIZE*bx+wB*ty+tx]=
    //    Csub.x+Csub.y+Csub.z+Csub.w;
    C[gy*wB+gx]=Csub.x+Csub.y+Csub.z+Csub.w;
}
matmul3-3.cl – code from ati stream sdk

// matmul3.cl
// Multiply two matrices A * B = C
// Device code.
#define TILEX 4
#define TILEX_SHIFT 2
#define TILEY 4
#define TILEY_SHIFT 2

// optimized version : Tile : 4x4
// both matrix A and matrix B are cached into local memory blocks
__kernel void matrixMul3(__global float4 *matrixC,
                         __global float4 *matrixA, __global float4 *matrixB,
                         __local float4 *blockA, __local float4 *blockB,
                         int widthA)
{
    float4 sum0=(float4) (0);
    float4 sum1=(float4) (0);
    float4 sum2=(float4) (0);
    float4 sum3=(float4) (0);
    int temp=widthA/4;
    // calculate blockwise-MMM for each pair of blocks along
    // the common width of matrixA and matrixB
    for(int i=0; i<(temp/get_local_size(0)); i++) {
        // local data for blockA from matrixA
        // right now considering only square matrices, so width of
        // C = width of A
        blockA[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)]=
            matrixA[i*get_local_size(0)+get_local_id(0)+
                    (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)];
        blockA[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               get_local_size(0)]=
            matrixA[i*get_local_size(0)+get_local_id(0)+
                    (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
                    get_global_size(0)];
        blockA[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               2*get_local_size(0)]=
            matrixA[i*get_local_size(0)+get_local_id(0)+
                    (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
                    2*get_global_size(0)];
        blockA[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               3*get_local_size(0)]=
            matrixA[i*get_local_size(0)+get_local_id(0)+
                    (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
                    3*get_global_size(0)];
        // local data for blockB from matrixB
        blockB[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)]=
            matrixB[get_global_id(0)+
                    ((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*
                    get_global_size(0)];
        blockB[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               get_local_size(0)]=
            matrixB[get_global_id(0)+
                    ((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*
                    get_global_size(0)+get_global_size(0)];
        blockB[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               2*get_local_size(0)]=
            matrixB[get_global_id(0)+
                    ((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*
                    get_global_size(0)+2*get_global_size(0)];
        blockB[get_local_id(0)+get_local_size(0)*(get_local_id(1)<<TILEY_SHIFT)+
               3*get_local_size(0)]=
            matrixB[get_global_id(0)+
                    ((i*get_local_size(1)+get_local_id(1))<<TILEY_SHIFT)*
                    get_global_size(0)+3*get_global_size(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        // each local thread will read a float4
        for(int j=0; j<get_local_size(0)<<2; j+=4) {
            // block dimensions of A = block dimensions of C
            float4 tempA0=blockA[(j>>2)+ get_local_id(1)*TILEY   *get_local_size(0)];
            float4 tempA1=blockA[(j>>2)+(get_local_id(1)*TILEY+1)*get_local_size(0)];
            float4 tempA2=blockA[(j>>2)+(get_local_id(1)*TILEY+2)*get_local_size(0)];
            float4 tempA3=blockA[(j>>2)+(get_local_id(1)*TILEY+3)*get_local_size(0)];
            // block dimensions of B = block dimensions of C
            // should be localId.x * (TILEX / 4)
            float4 tempB0=blockB[get_local_id(0)+ j   *get_local_size(0)];
            float4 tempB1=blockB[get_local_id(0)+(j+1)*get_local_size(0)];
            float4 tempB2=blockB[get_local_id(0)+(j+2)*get_local_size(0)];
            float4 tempB3=blockB[get_local_id(0)+(j+3)*get_local_size(0)];
            sum0.x+=tempA0.x*tempB0.x+tempA0.y*tempB1.x+tempA0.z*tempB2.x+tempA0.w*tempB3.x;
            sum0.y+=tempA0.x*tempB0.y+tempA0.y*tempB1.y+tempA0.z*tempB2.y+tempA0.w*tempB3.y;
            sum0.z+=tempA0.x*tempB0.z+tempA0.y*tempB1.z+tempA0.z*tempB2.z+tempA0.w*tempB3.z;
            sum0.w+=tempA0.x*tempB0.w+tempA0.y*tempB1.w+tempA0.z*tempB2.w+tempA0.w*tempB3.w;
            sum1.x+=tempA1.x*tempB0.x+tempA1.y*tempB1.x+tempA1.z*tempB2.x+tempA1.w*tempB3.x;
            sum1.y+=tempA1.x*tempB0.y+tempA1.y*tempB1.y+tempA1.z*tempB2.y+tempA1.w*tempB3.y;
            sum1.z+=tempA1.x*tempB0.z+tempA1.y*tempB1.z+tempA1.z*tempB2.z+tempA1.w*tempB3.z;
            sum1.w+=tempA1.x*tempB0.w+tempA1.y*tempB1.w+tempA1.z*tempB2.w+tempA1.w*tempB3.w;
            sum2.x+=tempA2.x*tempB0.x+tempA2.y*tempB1.x+tempA2.z*tempB2.x+tempA2.w*tempB3.x;
            sum2.y+=tempA2.x*tempB0.y+tempA2.y*tempB1.y+tempA2.z*tempB2.y+tempA2.w*tempB3.y;
            sum2.z+=tempA2.x*tempB0.z+tempA2.y*tempB1.z+tempA2.z*tempB2.z+tempA2.w*tempB3.z;
            sum2.w+=tempA2.x*tempB0.w+tempA2.y*tempB1.w+tempA2.z*tempB2.w+tempA2.w*tempB3.w;
            sum3.x+=tempA3.x*tempB0.x+tempA3.y*tempB1.x+tempA3.z*tempB2.x+tempA3.w*tempB3.x;
            sum3.y+=tempA3.x*tempB0.y+tempA3.y*tempB1.y+tempA3.z*tempB2.y+tempA3.w*tempB3.y;
            sum3.z+=tempA3.x*tempB0.z+tempA3.y*tempB1.z+tempA3.z*tempB2.z+tempA3.w*tempB3.z;
            sum3.w+=tempA3.x*tempB0.w+tempA3.y*tempB1.w+tempA3.z*tempB2.w+tempA3.w*tempB3.w;
        }
    }
    matrixC[get_global_id(0)+
            (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)]=sum0;
    matrixC[get_global_id(0)+
            (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
            get_global_size(0)]=sum1;
    matrixC[get_global_id(0)+
            (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
            2*get_global_size(0)]=sum2;
    matrixC[get_global_id(0)+
            (get_global_id(1)<<TILEY_SHIFT)*get_global_size(0)+
            3*get_global_size(0)]=sum3;
}
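For reference, here is roughly how this kernel has to be launched - each work-item computes a 4x4 tile of C (four float4 rows), so the NDRange shrinks by 4 in each dimension (a sketch; the variable names are illustrative, not my exact host code):

/* Sketch: launch geometry implied by matrixMul3 (illustrative names).
 * C is heightC x widthC floats; each work-item writes four float4s. */
size_t global_ws[2] = { widthC / 4, heightC / 4 };
size_t local_ws[2]  = { blockSize,  blockSize };   /* square work-group */
clEnqueueNDRangeKernel(queue, kernel2, 2, NULL, global_ws, local_ws,
                       0, NULL, &ev);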
I ran the program on windows (on the amd cpu device). The second kernel crashes (it segfaults with the message box 'The instruction … referenced memory at … The memory could not be "read"'). The first kernel runs, but very slowly; in general, as far as I can see, amd's opencl cpu implementation is extremely slow (I use mingw on windows).
I also tried the opencl visual profiler (with the 8800gt device), but I get the "could not read profiling output" error (I added a clReleaseEvent() call to the host code, since I read in this forum that the visual profiler is very sensitive to leaks). But I could not get it to work and it crashes frequently, so I'm not going to waste any more time on it.
The next step will be to compile the oclMatrixMul sdk sample on linux with the second kernel and see what happens.
Isn't the bug related to allocating too small an amount of local memory? The NVidia matrix mul example needs only BLOCK_SIZE * BLOCK_SIZE elements, while the AMD matrix mul example needs (blockSize * 4) * (blockSize * 4)… The error 'memory could not be "read"' points to the same problem, I think, as does the fact that only larger data shows the bug…
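In host-code terms, the local-memory kernel arguments would need to be sized something like this (just a sketch with illustrative names; blockSize is the work-group width):

/* Sketch (illustrative): sizing the __local arguments.
 * blockA/blockB in matrixMul3 each hold
 * get_local_size(0) * (get_local_size(1) * 4) float4 elements. */
size_t amd_bytes = blockSize * (blockSize * 4) * sizeof(cl_float4);
clSetKernelArg(kernel2, 3, amd_bytes, NULL);   /* blockA */
clSetKernelArg(kernel2, 4, amd_bytes, NULL);   /* blockB */

/* whereas the NVidia kernel's __local float* tiles need only: */
size_t nv_bytes = BLOCK_SIZE * BLOCK_SIZE * sizeof(cl_float);
clSetKernelArg(kernel1, 3, nv_bytes, NULL);    /* As */
clSetKernelArg(kernel1, 4, nv_bytes, NULL);    /* Bs */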
Btw. if you have already installed CUDA on linux, you don't have to install OpenCL separately - it is a part of the CUDA toolkit ;-)
KARBOUS SAVES THE DAY!!! That was the bug! Now everything works OK. Thanks my friend, you rule!
Can you tell me what files the installer installs and where, so I can be sure it doesn't mess with already-installed files (from the sgfxi script)? I would be very interested in installing the toolkit on linux (for me it's a top priority, since opencl is part of my diploma thesis).
Also, does the toolkit include the nvidia icd driver? Will the application be able to "see" the amd cpu as an opencl device if I use the icd driver, or do I have to install amd's icd driver as well (which I don't plan to do, since I'm afraid it will break things on the linux platform)? Sorry for all the questions, but in the forum I couldn't find anything close to my situation (drivers not from the package manager).
Bugs made by others are always easier to find than your own… (I'm working on my diploma as well and still struggling. Btw. what is your diploma thesis about?)
The toolkit is only the SDK with the Profiler, samples, libs and docs. It has nothing to do with the driver (so it should work with sgfxi). As the release notes of the NVidia SDK say, it is a self-extracting archive; by default it is unpacked to your home directory and to /usr/local.
An ICD is just a file (I think) which is stored in a special place - /etc/OpenCL/vendors/ - and must be "installed" to let an application choose which OpenCL implementation to select. So it won't harm your system, and if you ever feel it has influenced the system (I doubt it), you can always erase that file.
Just finished with the code. I made some changes and now it runs like a charm. Basically, I needed to code a trivial application so I could build a basic "abstraction" opencl wrapper library around it. I really like opencl's interface (there are some similarities with opengl, which I used some time ago), and being a long-time nvidia fanboy myself, I convinced some guys at the lab to give it a shot before choosing the topic of my diploma thesis. Some of them even bought nvidia gear for themselves when I showed them the gpu's tremendous power on simd problems.
Well, I have just started the struggle!! It involves an image processing pipeline (tasks) over the network. The image processing is done locally (one task) while other tasks are distributed over the network. For the image processing part I plan to use opencl (and maybe opengl) - this is the data-parallel part. For the other parts I use pthreads for task parallelization plus socket programming for networking. Students who follow will convert it into something that resembles a stack of web services (I just need something like a prototype). It's nothing fancy, but I must get acquainted with opencl first, so I began with the trivial example of matrix multiplication. I'm an undergraduate student (the diploma thesis is required for graduation).
Cool. I shall install it and see what happens!! (For the presentation, I'm sure some graphs from the profiler will be very useful.)
Yes, but I still need to install amd's opencl driver as well (so my amd cpu will be listed as an opencl device along with the nvidia gpu), am I right? The question is where amd's libOpenCL.so will be installed - will it overwrite nvidia's?
Initially, I thought the icd is a "general" libOpenCL.so installed in /usr/lib that dispatches opencl calls to the "real" libOpenCL.so, depending on the chosen context's device(s) and the appropriate accompanying driver (the "real" .so).
The diploma topic sounds cool, anyway. Good luck.
Well, I suppose your amd CPU is working fine, so there is no need to install a driver (unless you want the latest one). However, installing the ICDs is mandatory. The ICDs are just text files with the names of the dynamic OpenCL libs. I've got these files in /etc/OpenCL/vendors: