Incorrect results from cublasSgemm

I have been playing around with cublas and I have run into a strange problem.

First I initialize A and B so that {A,B}[i] = i+1.
When I call cublasSgemm as:
cublasSgemm('t', 'n', 65536, 1, 1, 1.0, A, 1, B, 1, 0.0, Y, 65536);
I get the correct result.

But if I increase the dimension of A by 1:
cublasSgemm('t', 'n', 65537, 1, 1, 1.0, mat1_d, 1, mat2_d, 1, 0.0, res_d, 65537);
then I get wrong results:
Y[65536] = 1 instead of 65537

I can increase the dimension of A further, and the results Y[i] for i > 65535 are always wrong.

It looks like a short int overflow problem somewhere!
Can anyone tell me what's going on?

Hi Zein,

We’ve been unable to reproduce this locally, so it’s possible it is fixed already in our internal codebase. If you are a registered developer, you have access to CUDA 0.9 and can try with that. If not, it won’t be long before a new public release will be out.


I can confirm that upgrading to CUDA 0.9 has solved this problem.

I think I have a problem as well. I multiply two matrices, V (10×6) and H (10×5).

The multiplication is w += V.transpose * H

The result should have shape 6×5.

The results for the first 5 rows are correct, but the last row is off. If I change the dimensions, I get even bigger problems. Am I doing something wrong, or is this a bug?

I use CUDA 2.1 and store the matrices in column-major order.



#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

#include "cutil.h"

#include <cublas.h>

#include "cutil_inline.h"


int main(int argc, char** argv){

int device;

struct cudaDeviceProp properties;

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )

cutilDeviceInit(argc, argv);


device = cutGetMaxGflopsDeviceId(); /* set before the cudaGetDeviceProperties call below */

cudaSetDevice( device );


cutilSafeCall(cudaGetDeviceProperties(&properties, device));

cublasStatus stat;


int s1=6;int s2=5;int m1=0;int T=10;

unsigned int mem_size_w = sizeof(float)*s1*s2*(m1+1);

float* d_H;

float* d_V;

float* d_w;

size_t d_Hp;

size_t d_Vp;

size_t d_wp;

float* h_w = (float*)malloc(mem_size_w);

float* h_H = (float*)malloc(T*sizeof(float)*s2);

float* h_V = (float*)malloc(T*sizeof(float)*s1);

size_t h_wp=s2*sizeof(float);

size_t h_Hp=T*sizeof(float);

size_t h_Vp=T*sizeof(float);

cutilSafeCall(cudaMallocPitch((void**) &d_H, &d_Hp, T*sizeof(float), s2));

cutilSafeCall(cudaMallocPitch((void**) &d_V, &d_Vp, T*sizeof(float), s1));

cutilSafeCall(cudaMallocPitch((void**) &d_w, &d_wp, s2*sizeof(float), s1*(m1+1)));

for (int i=0;i<T;i++) {

   for (int k=0;k<s1;k++) {



   for (int k=0;k<s2;k++) {




cutilSafeCall(cudaMemcpy2D(d_H, d_Hp, h_H, h_Hp, T*sizeof(float), s2, cudaMemcpyHostToDevice));

cutilSafeCall(cudaMemcpy2D(d_V, d_Vp, h_V, h_Vp, T*sizeof(float), s1, cudaMemcpyHostToDevice));

cudaMemset(d_w, 0, s1*d_wp*(m1+1));

cublasSgemm('t', 'n', s1, s2, T, 1.0f, d_V, d_Vp/sizeof(float), d_H, d_Hp/sizeof(float), 1.0f, d_w, d_wp/sizeof(float));

cutilSafeCall(cudaMemcpy2D(h_w, h_wp, d_w, d_wp, s2*sizeof(float), (m1+1)*s1, cudaMemcpyDeviceToHost));

for (int m=0;m<m1+1;m++){

   for (int i=0;i<s1;i++){

      for (int k=0;k<s2;k++) {

        if (k<s2-1) {printf("%3.8f,", h_w[m*s1*s2+i*s2+k]);} else {printf("%3.8f],\n[", h_w[m*s1*s2+i*s2+k]);}}




cudaFree (d_w);cudaFree (d_H);cudaFree (d_V);



Pitch returns the width of the array in bytes.
Try cudaMemset(d_w, 0, s1*d_wp*(m1+1)*sizeof(float));


Doesn’t cudaMemset also have the length indicated in bytes?

Whoops, you’re right of course. I forgot that d_wp is in fact the pitch returned by the allocation; I was under the impression that it was being recalculated. My mistake :)


I have probably made an error somewhere, since my results are not consistent. If I can pinpoint the problem, I will write more about it.

Yep, I made an error in the pitch of the copy back to the host. It’s all working fine now. Sorry for bothering you.