While running the spark-xgboost example we are getting below issue while training xgboost model.
2/11/28 05:17:44 ERROR Executor: Exception in task 3.0 in stage 54.0 (TID 3401)
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.Rabit.shutdown(Rabit.java:83)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.buildDistributedBooster(GpuXGBoost.scala:327)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.$anonfun$trainOnGpuInternal$1(GpuXGBoost.scala:254)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Here is the setting we did while creating the spark session:
Here is the logs generated after creating spark session:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.nvidia#rapids-4-spark_2.12 added as a dependency
com.nvidia#rapids-4-spark-ml_2.12 added as a dependency
ai.rapids#cudf added as a dependency
com.nvidia#xgboost4j_3.0 added as a dependency
com.nvidia#xgboost4j-spark_3.0 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04;1.0
confs: [default]
found com.nvidia#rapids-4-spark_2.12;22.04.0 in central
found ai.rapids#cudf;22.04.0 in central
found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
found com.nvidia#rapids-4-spark-ml_2.12;22.02.0 in central
found com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 in central
found com.typesafe.akka#akka-actor_2.12;2.5.23 in central
found com.typesafe#config;1.3.3 in central
found org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 in central
found com.fasterxml.jackson.core#jackson-databind;2.10.3 in central
found com.fasterxml.jackson.core#jackson-annotations;2.10.3 in central
found com.fasterxml.jackson.core#jackson-core;2.10.3 in central
found org.scalatest#scalatest_2.12;3.0.5 in central
found org.scalactic#scalactic_2.12;3.0.5 in central
found org.scala-lang.modules#scala-xml_2.12;1.0.6 in local-m2-cache
found com.esotericsoftware#kryo;4.0.2 in central
found com.esotericsoftware#reflectasm;1.11.3 in central
found org.ow2.asm#asm;5.0.4 in central
found com.esotericsoftware#minlog;1.3.0 in local-m2-cache
found org.objenesis#objenesis;2.5.1 in local-m2-cache
found org.scala-lang#scala-reflect;2.12.8 in local-m2-cache
found commons-logging#commons-logging;1.2 in local-m2-cache
found com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 in central
:: resolution report :: resolve 500ms :: artifacts dl 20ms
:: modules in use:
ai.rapids#cudf;22.04.0 from central in [default]
com.esotericsoftware#kryo;4.0.2 from central in [default]
com.esotericsoftware#minlog;1.3.0 from local-m2-cache in [default]
com.esotericsoftware#reflectasm;1.11.3 from central in [default]
com.fasterxml.jackson.core#jackson-annotations;2.10.3 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.10.3 from central in [default]
com.fasterxml.jackson.core#jackson-databind;2.10.3 from central in [default]
com.nvidia#rapids-4-spark-ml_2.12;22.02.0 from central in [default]
com.nvidia#rapids-4-spark_2.12;22.04.0 from central in [default]
com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 from central in [default]
com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 from central in [default]
com.typesafe#config;1.3.3 from central in [default]
com.typesafe.akka#akka-actor_2.12;2.5.23 from central in [default]
commons-logging#commons-logging;1.2 from local-m2-cache in [default]
org.objenesis#objenesis;2.5.1 from local-m2-cache in [default]
org.ow2.asm#asm;5.0.4 from central in [default]
org.scala-lang#scala-reflect;2.12.8 from local-m2-cache in [default]
org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 from central in [default]
org.scala-lang.modules#scala-xml_2.12;1.0.6 from local-m2-cache in [default]
org.scalactic#scalactic_2.12;3.0.5 from central in [default]
org.scalatest#scalatest_2.12;3.0.5 from central in [default]
org.slf4j#slf4j-api;1.7.30 from local-m2-cache in [default]
:: evicted modules:
com.nvidia#rapids-4-spark_2.12;22.02.0 by [com.nvidia#rapids-4-spark_2.12;22.04.0] in [default]
ai.rapids#cudf;22.02.0 by [ai.rapids#cudf;22.04.0] in [default]
org.scala-lang#scala-reflect;2.12.4 by [org.scala-lang#scala-reflect;2.12.8] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 25 | 0 | 0 | 3 || 23 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04
confs: [default]
0 artifacts copied, 23 already retrieved (0kB/13ms)
22/12/08 18:24:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/08 18:24:31 WARN ResourceProfile: The executor resource config for resource: gpu was specified but no corresponding task resource request was specified.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator 22.04.0 using cudf 22.04.0.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
22/12/08 18:24:45 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true
I will be sending the logs generated while training the xgboost model shortly in text file
With these configurations, we are able to run our training process, but our prediction step is taking a lot of time to execute. Also, the code is not using all the GPUs. What settings or configurations would you suggest so that the code uses all 4 GPUs that we have?