Jugnu Life :-): January 2016

Using Spark with different version of Hive

Spark by default comes packaged with one version of Hive.

If we want to use something different then we need to tell about version and jars.

See example

Default version is 1.2.0
Our cluster version is 0.14.0

The example would be , see the properties in bold.

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http:/… \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid method name: 'alter_table_with_cascade'

Exception

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid method name: 'alter_table_with_cascade'

Solution

The error indicates the version of Hive metastore is different from jar packaged with Spark
Give the correct Hive jar

Example

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.sql.hive.version=0.14.0 \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

References

http://ben-tech.blogspot.com.au/2016/01/using-hivecontext-to-read-hive-tables.html

Spark java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf when creating Hive client using classpath:

Exception

java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf when creating Hive client using classpath:

Solution

Give mapreduce jars

I am on Pivotal Hadoop but others distribution should be same

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: hdfs

Exception

java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)

Solution

Give Hadoop hdfs jars

Example

I am on Pivotal Hadoop but others distribution should be same

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate

Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate when creating Hive client using classpath

Solution

Related bug https://issues.apache.org/jira/browse/SPARK-11702
Give the Guavus jar path

Example

I am on Pivotal Hadoop but others distribution should be same

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH