jam1, thanks for the results.
Is that a 768MB GTX460 or a 1024MB variant?
Ceearem
The code didn’t compile for me at first (32-bit Linux PC with Fedora 11, GCC 4.4.1-2 and NVCC 3.2).
I had to make the following changes to some source files, all in src/USER-CUDA:
- Add #include "string.h" to atom_vec_atomic_cuda.cpp and atom_vec_charge_cuda.cpp
- Add #include "stdlib.h" to compute_temp_partial_cuda.cpp and verlet_cuda.cpp
- Add #include "stdlib.h" and #include "string.h" to cuda.cpp
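For what it's worth, here is a throwaway shell sketch of those edits (a hypothetical helper, not part of the package; it assumes GNU sed and that you run it from src/USER-CUDA):

#!/bin/bash
# Prepend the missing C headers if a file does not already have them (safe to re-run).
for f in atom_vec_atomic_cuda.cpp atom_vec_charge_cuda.cpp; do
  grep -q '#include "string.h"' "$f" || sed -i '1i #include "string.h"' "$f"
done
for f in compute_temp_partial_cuda.cpp verlet_cuda.cpp; do
  grep -q '#include "stdlib.h"' "$f" || sed -i '1i #include "stdlib.h"' "$f"
done
for h in string.h stdlib.h; do
  grep -q "#include \"$h\"" cuda.cpp || sed -i "1i #include \"$h\"" cuda.cpp
done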
With these modifications it compiles, but now I get an unspecified launch failure:
# Using device 0: GeForce GTX 460
Cuda error: Cuda_NeighborBuild: neighbor build kernel execution failed in file 'neighbor.cu' in line 255 : unspecified launch failure.
Any idea how to fix this?
Whatever the problem is, that isn’t the root cause:
avidday@cuda:~$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.
avidday@cuda:~$ ./comparison.sh
var1 = string1
var2 = string1
1: var1 and var2 match
avidday@cuda:~$ cat comparison.sh
#!/bin/bash
var1="string1"
var2="string1"
echo "var1 = " $var1
echo "var2 = " $var2
if [[ $var1 == $var2 ]]; then
echo "1: var1 and var2 match"
fi
Hi avidday, hi Gert-Jan
thanks for testing that. I think I will use one of our machines at the university to set up a multi-OS machine with Ubuntu, Fedora and so on (though only the versions that are compatible with CUDA according to the webpage).
I currently have no idea what the problem could be, and I don't want to steal your time, but it already helps me to know that there is a problem.
The segmentation fault could be a bug in the code (can you send me the whole output of the run?). At this particular point I had segmentation faults (unspecified launch failures) with earlier driver versions, which disappeared with the latest ones. Gert, can you post your driver version here?
Thanks
Ceearem
No problem about the time, glad to help you out.
The driver version I use is 260.19.12. I don’t think I can update it quickly, since it is a multi-user machine I’m testing with and I don’t want to break other people’s code and results.
I have that version on some of my machines as well (GTX470 and GTX295) and don't observe your problem, so it probably is not the driver version.
Ceearem
This is not the 460
Device 0: “GeForce GTX 260”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 939327488 bytes
Multiprocessors x Cores/MP = Cores: 27 (MP) x 8 (Cores/MP) = 216 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.24 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
BTW, I had to add stdlib.h and string.h in a few places to get a clean compile. I used the run_test script instead of the single run.
Probably the same includes as I had to add.
If it isn't the driver version, what can it be? Maybe the compute capability (the GTX 460 is 2.1; the Makefile only supports 1.3 and 2.0).
These are the results stored in log.lammps and out.melt before the program stops:
log.lammps:
LAMMPS CUDA (4 Dec 2010)
# 3d Lennard-Jones melt
accelerator cuda gpu/node 4
units lj
atom_style atomic
newton off
lattice fcc 0.8442
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
region box block 0 40 0 40 0 40
create_box 1 box
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
1 by 1 by 1 processor grid
create_atoms 1 box
Created 256000 atoms
mass 1 1.0
velocity all create 3.0 87287
##PAIRSTYLES 4
pair_style lj/cut 2.5
#pair_style lj96/cut 2.5
#pair_style lj/gromacs 2.5 5.0
#pair_style lj/smooth 2.5 5.0
pair_coeff 1 1 1.0 1.0
neighbor 0.3 bin
neigh_modify every 20 delay 0 check no one 10000
##FIXES 2
fix 1 all nve
#fix 1 all nvt temp 3.0 3.0 200 tchain 2
#dump 1 all custom 1 dump.melt.cuda id type x y z vx vy vz fx fy fz
thermo 200
run 2000
out.melt:
LAMMPS CUDA (4 Dec 2010)
# Using LAMMPS_CUDA
# CUDA WARNING: Compile Settings of cuda and cpp code differ!
# CUDA WARNING: Global Precision: cuda 0 cpp 1
# CUDA WARNING: Compile Settings of cuda and cpp code differ!
# CUDA WARNING: arch: cuda 1 cpp 20
# CUDA: Activate GPU
AddStream
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
1 by 1 by 1 processor grid
Created 256000 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 260000 atoms...
# CUDA: Using precision: Global: 4 X: 4 V: 4 F: 4 PPPM: 4
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
The only interesting things here are the warnings about the CUDA architecture settings; could this be it? As far as I know I only used arch=20 where needed in steps (iii) and (iv).
Ah OK, I really should put an exit after that warning. Something went wrong during compilation: some or all of the host object files (built from the .cpp files) were compiled with different settings than the kernel files. You can try deleting all the object files (.o in src/USER-CUDA and everything in src/Obj_serial) and recompiling. My guess is that the kernel files were recompiled with double precision (precision=2) while the host object files were not deleted before building the executable (so they are still precision=1).
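Something like this should do the cleanup (just a sketch; directory names as above, and the final recompile is whatever make invocation or run_test call you normally use):

# Remove all object files built with the old settings...
rm -f src/USER-CUDA/*.o
rm -f src/Obj_serial/*
# ...then recompile in one go with a single, consistent set of precision= and arch= options.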
So now I really need to find all those missing string.h and stdlib.h statements (which is kinda difficult because my compiler is not complaining P-) ).
Ceearem
These are the files requiring additions (in the src directory):
atom_vec_atomic_cuda.cpp
atom_vec_charge_cuda.cpp
compute_temp_partial_cuda.cpp
cuda.cpp
verlet_cuda.cpp
With either or both of
#include <cstring> // was string.h
#include <cstdlib> // was stdlib.h; both variants work, but there are fewer warnings with the cstd… forms
I tried again with your run_test script, but even after adding arch=20 in that script it didn't work; the errors are similar. Maybe the options (precision, arch) aren't handled properly by the Makefile in some way.
See also my post above.
Given this is C++, don’t you mean
#include<cstring>
#include<cstdlib>
string.h and stdlib.h should at least generate warnings in gcc 4, I would have thought. Might even fail to compile in a very recent version.
Yeah, you are right. Basically the standard LAMMPS code includes those "string.h" etc., so I stuck with it. But probably all those includes should be changed.
Ceearem
OK, I'm out. Given all the postings above, I guess I should chime back in once it runs… Please re-post and I am happy to fire it away on Ubuntu 10.04 LTS.
I have now set up an Ubuntu machine (using the Desktop 10.04 LTS download from the webpage), updated the driver (260.19.12), installed the CUDA toolkit (cuda-2.3), installed build-essential and finally Kate as editor (this also installs Qt and a lot of other stuff).
Then I downloaded the LAMMPS package, ran the run_test script, fixed all the missing includes and ran it again. This produced the desired output.
I have updated the repository accordingly, deleted my local copy and downloaded the package fresh.
Going to gpulammps-read-only/src/USER-CUDA/Examples/PHD-Thesis-Tests and running “./run_test” worked.
So from my point of view it should work now on Ubuntu (10.04) systems as well (and hopefully on other Debian-based systems).
I also added ./run_test_20 and ./run_test_21, which set the architecture flag for nvcc to sm_20 and sm_21 (the normal run_test uses sm_13), though I don't see a difference on GF100 GPUs (GTX470, GTX480, C2050).
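For reference, the only relevant difference between the scripts is the architecture they pass through to nvcc when compiling the kernel files, roughly equivalent to (illustrative only, other flags omitted):

nvcc -arch=sm_21 -c neighbor.cu   # run_test uses -arch=sm_13, run_test_20 uses -arch=sm_20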
Regards
Ceearem
I tried (run_test_21 on GTX460) but it still does not work for me. Errors at the end of running the script are:
# Using device 0: GeForce GTX 460
Cuda error: Cuda_NeighborBuild: neighbor build kernel execution failed in file 'neighbor.cu' in line 255 : unspecified launch failure.
# Using device 0: GeForce GTX 460
Cuda error: Cuda_Binning: binning Kernel execution failed in file 'neighbor.cu' in line 122 : unspecified launch failure.
# Using device 0: GeForce GTX 460
Cuda error: Cuda_NeighborBuild: neighbor build kernel execution failed in file 'neighbor.cu' in line 255 : unspecified launch failure.
# Using device 0: GeForce GTX 460
Cuda error: Cuda_Binning: binning Kernel execution failed in file 'neighbor.cu' in line 122 : unspecified launch failure.
The warnings in out.melt-21-s are still:
# CUDA WARNING: Compile Settings of cuda and cpp code differ!
# CUDA WARNING: Global Precision: cuda 0 cpp 1
# CUDA WARNING: Compile Settings of cuda and cpp code differ!
# CUDA WARNING: arch: cuda 1 cpp 20
Hm, Gert-Jan,
maybe it's because you use a 32-bit system (I have never tested a 32-bit one). I am checking now whether there are any hard-coded 64-bit statements in the Makefiles.
Christian
Ah yes, there are several 64-bit statements. You find one in src/MAKE/Makefile.serial at line 57, three in src/USER-CUDA/Install.sh at lines 283, 286 and 360, and one (after installation of the USER-CUDA package) in src/Makefile.package at line 6. If you want, you could try deleting the 64 in those files and compiling again.
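If you want to locate them quickly, a grep along these lines should print the offending lines (it will also match unrelated numbers containing 64, so just look near the line numbers above; src/Makefile.package only exists after the USER-CUDA package has been installed):

grep -n '64' src/MAKE/Makefile.serial src/USER-CUDA/Install.sh src/Makefile.package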
Christian
Some weird patch errors during the run_test_20|21 scripts (on a clean checkout as of 20 minutes ago, with a manually edited path to CUDA):
Hunk #1 FAILED at 36.
Hunk #2 FAILED at 52.
2 out of 2 hunks FAILED -- saving rejects to file ../atom_vec_angle.h.rej
patching file ../atom_vec_atomic.h
patching file ../atom_vec_charge.h
cp: cannot stat `../atom_vec_full.h': No such file or directory
patching file ../atom_vec_full.h
Hunk #1 FAILED at 36.
Hunk #2 FAILED at 52.
2 out of 2 hunks FAILED -- saving rejects to file ../atom_vec_full.h.rej
Anyway, it ran fine (I guess) afterwards; results are attached. I'm also attaching the output of deviceQuery, which should have all the details you need. This is on 64-bit Ubuntu 10.04 LTS; gcc is the system gcc 4.4.3.
Hope this helps.
dom
Thanks for your results. I really appreciate your help. It's interesting to see that the factor between a GTX470 and a GTS450 is closer to the ratio of the number of multiprocessors than to the number of cores. But it's hard to judge whether that is really because the code can't make good use of the superscalar architecture, or whether it's due to other characteristics such as texture performance or bandwidth.
Ceearem