When will CUDA 8 be released? 1080 can't run with 7.5

I can’t run my program on a 1080 with CUDA 7.5, so is there a release schedule for CUDA 8.0? Thanks

Have you updated the drivers? It seems that CUDA programs should be compiled on the fly from PTX by the driver.

I updated the drivers but did not recompile my program. Do I need to recompile?

Did you include PTX in the binaries? You may also try to compile the CUDA samples with the provided makefiles, just to check that the problem is not on your side.

(I’m just a plain user, not an NVIDIA engineer.)

The PTX should be in the binaries (I am not sure). I am using driver 367.18 with CUDA 7.5 on CentOS 7.2. The CUDA samples run now, but my project still has errors. I have recompiled my code, but my kernel launches always return error code 8, which means the requested device function does not exist or is not compiled for the proper device architecture.
It’s weird.
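
(For reference, this is roughly how I check for the error; dummyKernel is just a placeholder here, not my real kernel.)

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; my real kernels fail the same way.
__global__ void dummyKernel() {}

int main() {
    dummyKernel<<<1, 1>>>();
    // cudaGetLastError reports launch failures such as error code 8
    // (cudaErrorInvalidDeviceFunction) when the binary contains no
    // device code or PTX usable on the installed GPU.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
    cudaDeviceSynchronize();
    return 0;
}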

sounds like you’re not compiling with PTX

@txbob Does this matter? I have recompiled all my code, and it works fine on older architectures. One more question: is it certain that the 1080 can work with CUDA 7.5 for now?

The only way code compiled with CUDA 7.5 could work on a Pascal device is if:

  1. You compile the code with embedded PTX
  2. You have an appropriate driver (such as those that would be suitable for Pascal products) that can JIT-compile your code to run correctly on Pascal.

The above mechanism is how your sample code/projects are able to run successfully on Pascal.

Yes, it matters.
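
As a quick sanity check (just a sketch, nothing specific to your code), you can confirm what compute capability the driver reports for the card, which is what the embedded PTX gets JIT-compiled for:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // A GTX 1080 should report compute capability 6.1 (Pascal); the driver
    // JIT-compiles embedded PTX from an earlier virtual architecture to run on it.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}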

@txbob Thanks for the prompt reply. Any pointers on compiling with embedded PTX? I did a quick Google search but had no luck. Should I wait for CUDA 8 or keep trying to work it out?

It’s described in the nvcc manual:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architectures

Since you are running on Linux, it may be as simple as:

nvcc -arch=sm_52 <your additional compile command line here>
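
If I recall the nvcc docs correctly, that shorthand is roughly equivalent to:

nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 <your additional compile command line here>

i.e. it embeds both SASS for sm_52 and PTX for compute_52, and the PTX is what the driver can JIT-compile for Pascal.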

@txbob I am using CMake, and I am afraid I’ve already done this:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -s")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} --default-stream per-thread")

LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_20)
LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_21)
LIST(APPEND CUDA_OPTS -gencode arch=compute_30,code=sm_30)
LIST(APPEND CUDA_OPTS -gencode arch=compute_35,code=sm_35)
LIST(APPEND CUDA_OPTS -gencode arch=compute_50,code=sm_50)
LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=sm_52)
LIST(APPEND CUDA_OPTS --ptxas-options=-v)

...

       CUDA_ADD_LIBRARY(sharedlib ${project_LIB_SRCS} OPTIONS ${CUDA_OPTS} SHARED)
       TARGET_LINK_LIBRARIES(sharedlib ${project_LIBS})

...

Am I doing something wrong here?

Yes.

None of these:

LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_20)
LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_21)
LIST(APPEND CUDA_OPTS -gencode arch=compute_30,code=sm_30)
LIST(APPEND CUDA_OPTS -gencode arch=compute_35,code=sm_35)
LIST(APPEND CUDA_OPTS -gencode arch=compute_50,code=sm_50)
LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=sm_52)

produce PTX in the final binary.

So add this:

LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=compute_52)

to the end of the above list
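
If you want to confirm the PTX actually made it into the binary, something like this should list an embedded PTX entry (adjust the name to your actual output file; libsharedlib.so is just a guess based on your target name):

cuobjdump --list-ptx libsharedlib.so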

@txbob Thanks a lot, my code works now. But it looks like the code does not run as fast as it should. It’s a multi-threaded server-side program, and the GPU utilization is not stable; sometimes it’s 0% or less than 20%. Is this normal? My program runs much faster on a 980 Ti.

I can’t really comment on code I haven’t seen.

If your program runs faster on a GTX 980 Ti than on a GTX 1080, then I would do two things:

  1. Wait for CUDA 8 RC to come out (should be soon), which I think will be able to compile directly for cc 6.1 (arch=compute_61,code=sm_61), which is what a GTX 1080 is. Then re-test.

  2. If your code produces the correct result but still runs more slowly after step 1, I would file a bug.

It would facilitate the process if you could produce a test version of your code that is not multi-threaded. This should be more-or-less orthogonal to CUDA performance anyway, so presumably you can demonstrate a single-threaded version that also displays the same disparity as the multi-threaded version.
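
If it helps, a bare-bones single-threaded timing harness along these lines (yourKernel and the launch configuration are placeholders for your real workload) is usually enough to show the disparity between the two cards:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void yourKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so JIT/initialization cost is excluded from the timing.
    yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}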