
Using Spark with a different version of Hive

Spark comes packaged with one version of Hive by default.

If we want to use a different version, we have to tell Spark the metastore version and point it at the matching Hive jars.

See the example below.

The default packaged version is 1.2.0.
Our cluster's version is 0.14.0.

In the example below, the key properties are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars.

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http:/… \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH
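
Once the shell starts against the external metastore, a quick sanity check confirms the connection. This is a minimal sketch; the table name is a placeholder for something that actually exists in your 0.14.0 metastore.

// sqlContext is a HiveContext in a Hive-enabled Spark build
sqlContext.sql("SHOW DATABASES").show()                 // should list the cluster's Hive databases
sqlContext.table("default.sample_table").printSchema()  // sample_table is a placeholder name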

Spark FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid method name: 'alter_table_with_cascade'

Exception

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid method name: 'alter_table_with_cascade'

Solution

The error indicates that the Hive metastore version running in the cluster is different from the Hive jars packaged with Spark.
Point Spark to the correct Hive jars, as in the example below.

Example

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.sql.hive.version=0.14.0 \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

References

http://ben-tech.blogspot.com.au/2016/01/using-hivecontext-to-read-hive-tables.html

Spark java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf when creating Hive client using classpath:

Exception

java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf when creating Hive client using classpath:

Solution

Add the Hadoop MapReduce jars to the metastore classpath (spark.sql.hive.metastore.jars).

I am on Pivotal Hadoop, but the approach should be the same for other distributions.

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: hdfs

Exception

java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)

Solution

Add the Hadoop HDFS jars to the metastore classpath (spark.sql.hive.metastore.jars).

Example

I am on Pivotal Hadoop, but the approach should be the same for other distributions.

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate

Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate when creating Hive client using classpath

Solution

Related bug: https://issues.apache.org/jira/browse/SPARK-11702
Add the Guava jar to the driver classpath (spark.driver.extraClassPath).

Example

I am on Pivotal Hadoop, but the approach should be the same for other distributions.

export SPARK_HOME=/opt/spark/spark-1.5.2-bin-hadoop2.6
cd /opt/spark/spark-1.5.2-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
HIVE_LIB_DIR=/usr/phd/3.0.0.0-249/hive/lib
GUAVA_CLASSPATH=/usr/phd/3.0.0.0-249/hive/lib/guava-11.0.2.jar
hive_metastore_classpath=$HIVE_CONF_DIR:$HIVE_LIB_DIR/*:/usr/phd/3.0.0.0-249/hadoop/*:/usr/phd/3.0.0.0-249/hadoop-mapreduce/*:/usr/phd/3.0.0.0-249/hadoop-yarn/*:/usr/phd/3.0.0.0-249/hadoop-hdfs/*

SPARK_REPL_OPTS="-XX:MaxPermSize=512m" bin/spark-shell \
--master yarn-client \
--packages "com.databricks:spark-csv_2.10:1.2.0" \
--repositories "http...." \
--files ${SPARK_HOME}/conf/hive-site.xml \
--conf spark.executor.memory=5g \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=10g \
--conf spark.driver.maxResultSize=512m \
--conf spark.executor.instances=2 \
--conf "spark.driver.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -Xms2g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf "spark.yarn.am.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249" \
--conf "spark.executor.extraJavaOptions=-Dstack.name=phd -Dstack.version=3.0.0.0-249 -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=5 -XX:ConcGCThreads=3" \
--conf spark.sql.hive.metastore.version=0.14.0 \
--conf spark.sql.hive.metastore.jars=$hive_metastore_classpath \
--conf spark.driver.extraClassPath=$GUAVA_CLASSPATH

Spark latest nightly master release and docs

Binaries are available at

http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/

You can see the latest docs at

http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/

Prepare Spark api docs for offline browsing

Get the source code from the Spark website

https://spark.apache.org/downloads.html

Run the following command:

mvn scala:doc

This will build scaladocs for all the projects.

Go to the target folder of the required project and see what is needed.

If you want the nightly release code for the latest version, go to the URLs below to get the code and see the docs.

http://people.apache.org/~pwendell/spark-nightly/

https://repository.apache.org/content/repositories/snapshots/org/apache/spark/

Spark Pivot example

Spark 1.6 has Pivot functionality.

Let's try that out.

Create a simple file with the following data

cat /tmp/sample.csv
language,year,earning
net,2012,10000
java,2012,20000
net,2012,5000
net,2013,48000
java,2013,30000

Start the Spark shell with the spark-csv package

bin/spark-shell --packages "com.databricks:spark-csv_2.10:1.2.0"

Load the sample file

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///tmp/sample.csv")
df: org.apache.spark.sql.DataFrame = [language: string, year: int, earning: int]

Run simple pivot

scala> df.groupBy("language").pivot("year","2012","2013").agg(sum("earning")).show
+--------+-----+-----+
|language| 2012| 2013|
+--------+-----+-----+
|    java|20000|30000|
|     net|15000|48000|
+--------+-----+-----+
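
Note: in the released Spark 1.6 API the pivot values are passed as a Seq rather than as separate string arguments, so the equivalent call looks like the sketch below (same data, same result).

scala> df.groupBy("language").pivot("year", Seq(2012, 2013)).sum("earning").show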


References

https://github.com/apache/spark/pull/7841

http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/

Production Implementation of Machine Learning models

This is representative of how we implemented an end-to-end framework to push machine learning models into production.
Working in a large organisation makes it challenging to actually run your code in production and create meaningful business value. Often machine learning models / analytics sit in source control (e.g. git) for a long, long time before actually running and helping customers. This flow can be represented by the figure below, in which Data Scientists / Analysts build something (e.g. an R model) but do not have the means or the authority to test it in the line of fire where the customers actually are.

We built a machine learning pipeline using Spark and H2O Sparkling Water, which give very nice modular APIs for everything from data munging to training and scoring, all in a single code base.
Spark has fundamentally changed the big data world. All the work we used to do in different sets of tools has now been unified into one Swiss army knife.
Our new machine learning pipeline looks as shown below.
  1. Data scientists / Analysts working on a specific use case use Spark + Sparkling Water to create machine learning models.
  2. They commit their model to git.
  3. The code is built in Jenkins to create jar/rpm artifacts, which are stored in Nexus.
  4. Deployment is automated via Chef.
The total time to get a model running in production is now reduced to the time taken to train the new model plus about 5 minutes.
Data scientists have full power to push any new model to production, and to A/B test new ideas, without going through huge bureaucracy. All they have to do is push the new code and follow the standard peer code review process.

An organisation that can quickly try out new things can fail quickly, learn quickly and innovate quickly. This is the kind of culture we are trying to create: a data-driven culture of experimentation.
If you liked reading this, you will also like my upcoming book Apache Oozie Essentials, a use-case-driven Oozie implementation. The book is sprinkled with examples and exercises to help you take your big data learning to the next level, and you will also get to read my memories in the form of bedtime stories from my Bigdata implementation at Commonwealth Bank Australia.

java.lang.NoClassDefFoundError: org/apache/spark/sql/types/AtomicType

15/09/19 12:17:10 INFO DAGScheduler: Job 0 finished: take at CsvRelation.scala:174, took 1.900011 s
Error: application failed with exception
java.lang.NoClassDefFoundError: org/apache/spark/sql/types/AtomicType
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:155)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)


This error is thrown when using spark-csv with Spark 1.3.0 and spark-csv's type inference.
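
One possible workaround, if you have to stay on Spark 1.3.0, is to skip type inference (the code path where the error occurs) and supply the schema explicitly. This is an untested sketch using the Spark 1.3-era load API; the path and column names follow the sample.csv example from the pivot post.

import org.apache.spark.sql.types._

// Untested sketch: avoid spark-csv's inferSchema by giving an explicit schema.
// Path and column names are illustrative.
val schema = StructType(Seq(
  StructField("language", StringType),
  StructField("year", IntegerType),
  StructField("earning", IntegerType)))

val df = sqlContext.load(
  "com.databricks.spark.csv",
  schema,
  Map("path" -> "file:///tmp/sample.csv", "header" -> "true"))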




Initial job has not accepted any resources; check your cluster UI

15/09/19 10:28:44 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This message in the log means that the Hadoop cluster does not have the resources the job is asking for.

I submitted a Spark job on my local test cluster and got this message.

The additional logs also show the following:

15/09/19 10:23:20 INFO yarn.YarnRMClient: Registering the ApplicationMaster
15/09/19 10:23:20 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/09/19 10:23:20 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/09/19 10:23:20 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/09/19 10:23:20 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 5000, initial allocation : 200) intervals
My cluster had only 3 GB of RAM, so YARN could not allocate what Spark was asking for.
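
One way out on a small cluster is to shrink what the job asks for. The sketch below sets smaller executor sizes through SparkConf before creating the context; the values are illustrative, and the same settings can also be passed as --conf flags to spark-shell or spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: ask for one small executor so the request fits a 3 GB cluster.
val conf = new SparkConf()
  .setAppName("small-cluster-job")
  .set("spark.executor.instances", "1")
  .set("spark.executor.memory", "512m")
  .set("spark.executor.cores", "1")
val sc = new SparkContext(conf)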

Insert data into Hive from spark

Using the SchemaRDD / DataFrame API via HiveContext

Assume you're using the latest code, something probably like:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._
existedRdd.toDF().insertInto("hivetable")

or

existedRdd.toDF().registerTempTable("mydata")
hc.sql("insert into table hivetable select xxx from mydata")

Initial job has not accepted any resources; check your cluster UI

 

522640 [Timer-0] WARN org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Check that the Spark worker has been started and is visible on the master page

http://10.0.0.11:8080/

If not, start the worker

Example

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://precise32:7077

How Apache Spark works (short summary)

Why Spark was made

Before Spark came to market, MapReduce was the processing brain on top of Hadoop.

Apache Spark original research paper

The typical flow for MapReduce is:


  • Read data from disk
  • Apply some processing
  • Dump intermediate data to disk
  • Do some more processing
  • Show the final results


Now, if the same processing has to be repeated while changing a variable across the whole data set, MapReduce will again start by reading from disk. If you run the processing 100 times, it will read from disk 100 times * 2 (counting the intermediate data as well).

Spark addresses the typical use cases where the same processing is applied to a dataset with varying inputs.

Two typical use cases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset
to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a
MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through
SQL interfaces such as Pig and Hive.  Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
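
A minimal sketch of the iterative pattern this enables: read the data once, cache it in memory, and then run repeated passes with a changing parameter. The path and the logic are illustrative.

// Read once, cache in memory, then iterate without re-reading from disk.
val nums = sc.textFile("hdfs:///data/numbers.txt").map(_.toDouble).cache()

var threshold = 100.0
for (i <- 1 to 10) {
  val above = nums.filter(_ > threshold).count()   // each pass reuses the cached RDD
  println("iteration " + i + ": " + above + " values above " + threshold)
  threshold = threshold / 2                        // change the parameter and go again
}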

Spark provides two main abstractions for parallel programming:

resilient distributed datasets and
parallel operations on these datasets (invoked by passing a function to apply on a dataset).
These are based on typical functional programming concepts such as map, flatMap and filter.

In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster:

Variables in Spark

  • Broadcast variables: Read Only variable
  • Accumulators: These are variables that workers can only “add” to using an associative operation
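A minimal sketch of both shared-variable types follows; the lookup map and the input path are illustrative.

// Broadcast: a read-only lookup shipped once to every worker.
val codes = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2))
// Accumulator: workers can only add to it; only the driver reads the result.
val errorScore = sc.accumulator(0)

sc.textFile("hdfs://...").foreach { line =>
  if (line.contains("ERROR")) errorScore += codes.value("ERROR")
}
println(errorScore.value)
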
Example Spark code


val file = spark.textFile("hdfs://...")        // build an RDD from a file in HDFS
val errs = file.filter(_.contains("ERROR"))    // keep only the lines containing ERROR
val ones = errs.map(_ => 1)                    // map each error line to a 1
val count = ones.reduce(_+_)                   // sum them to count the error lines

Hive Shark Impala Comparison

Credits
This post is simply a reproduction of the work done at AMPLab. If you want the latest and most detailed read, I suggest you go there.
The big data world is beautiful; research in this field is moving at such a fast pace that BigData is no longer synonymous with long-running queries, it is becoming LiveData everywhere.
The work compares the computation time of Redshift, Hive, Impala and Shark for different types of queries.

[Benchmark charts comparing Redshift, Hive, Impala and Shark for the four query types are omitted here; see the AMPLab benchmark page for the figures.]

The performance of Shark in memory has been consistent across all four query types. It would be interesting to see the comparison with Hive 0.11, which adds a few performance improvements driven by work at Hortonworks.

Spark Standalone mode installation steps

Based on

http://spark-project.org/docs/latest/spark-standalone.html

Download Spark from

http://spark-project.org/downloads/

I used the prebuilt version of Spark for this post. For building from source, please see the instructions on the website.

Download

http://spark-project.org/download/spark-0.7.3-prebuilt-hadoop1.tgz

Extract it to some location say

/home/jagat/Downloads/spark-0.7.3

To run this you need both Scala and Java.

I downloaded them and configured the following:

In /etc/environment file

Add

SCALA_HOME="/home/jagat/Downloads/scala-2.9.3"
PATH=$PATH:$SCALA_HOME/bin

JAVA_HOME="/home/jagat/Downloads/jdk1.7.0_25"
PATH=$PATH:$JAVA_HOME/bin

You may need to change the above paths depending on your system


Go to

/home/jagat/Downloads/spark-0.7.3/conf

Rename the spark-env.sh.template file to spark-env.sh

Add the following values


export SCALA_HOME="/home/jagat/Downloads/scala-2.9.3"
export PATH=$PATH:$SCALA_HOME/bin

export JAVA_HOME="/home/jagat/Downloads/jdk1.7.0_25"
export PATH=$PATH:$JAVA_HOME/bin


Now check your hosts file to make sure your hostname resolves correctly, especially if you are on Ubuntu like me

Go to /etc/hosts

Change the following

jagat@Dell9400:~$ cat /etc/hosts
127.0.0.1    localhost
192.168.0.104    Dell9400

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters


On Ubuntu the hostname maps to the loopback address 127.0.1.1 by default; change it to your actual IP


Now we are ready to go

Go to

/home/jagat/Downloads/spark-0.7.3/

Start Spark Master


./run spark.deploy.master.Master

Check the URL where master started

http://localhost:8080/

It will show that the master has started with

URL: spark://Dell9400:7077

This is the URL which we need in all our applications

Let's start one worker by telling it about the master

./run spark.deploy.worker.Worker spark://Dell9400:7077

This registers the worker with the master.

Now refresh the master page

http://localhost:8080/

You can see that a worker is added on the page

Now

Connecting a Job to the Cluster

To run a job on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor, as in the sketch below.
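
A minimal sketch using the 0.7-era SparkContext constructor; the job name, Spark home and jar path are placeholders for your own application.

// new SparkContext(master, jobName, sparkHome, jars)
// The jar path is a placeholder for the jar that contains your job's classes.
val sc = new SparkContext("spark://Dell9400:7077", "MyJob",
  "/home/jagat/Downloads/spark-0.7.3", Seq("target/my-job.jar"))
println(sc.parallelize(1 to 1000).count())   // quick check that the cluster accepts work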

To run an interactive Spark shell against the cluster, run the following command:

MASTER=spark://IP:PORT ./spark-shell


That's it.

I admit these were very raw steps, but I kept it simple and quick for first-time users.

Happy Sparking :)

ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster

The full error I got:

Spark context available as sc.
13/08/04 20:46:04 ERROR client.Client$ClientActor: Connection to master failed; stopping client
13/08/04 20:46:04 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
13/08/04 20:46:04 ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster
jagat@Dell9400:~/Downloads/spark-0.7.3$


Possible Reasons

1)

I had multiple Scala versions on the system.

Check and make sure you have only one version installed.

2)

Check the /etc/hosts file

The DNS should resolve properly