Building Mesos from source

Install the required software for building

Download the JDK from the Oracle website and set the following env variables

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

Install the software needed for the build

Copy and paste this

sudo apt-get install -y cmake git-core git-svn subversion checkinstall build-essential dh-make debhelper ant ant-optional autoconf automake liblzo2-dev libzip-dev sharutils libfuse-dev reprepro libtool libssl-dev asciidoc xmlto ssh curl

sudo apt-get install -y devscripts

sudo apt-get install -y pkg-config dh-autoreconf python-dev libcurl4-openssl-dev libboost-all-dev libunwind8-dev

Clone the code from the Mesos git repository

git clone <mesos-repository-url> mesos

Go to the directory where the code was cloned

cd mesos

All of the commands below should finish without errors.
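A minimal sketch of the usual autotools build sequence for Mesos 0.13 (assuming a fresh git checkout; run from the source root):

```shell
# Generate the configure script (needed for a git checkout)
./bootstrap

# Build in a separate directory to keep the source tree clean
mkdir build
cd build

# Configure, compile, and run the test suite
../configure
make
make check
```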





Some of the errors I got and solved

checking consistency of all components of python development environment... no
configure: error: in `/home/jagat/development/code/berkley/mesos':
configure: error:
  Could not link test program to Python. Maybe the main Python library has been
  installed in some non-standard library path. If so, pass it to configure,
  via the LDFLAGS environment variable.
  Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
   You probably have to install the development version of the Python package
   for your distribution.  The exact name of this package varies among them.

See `config.log' for more details
jagat@nanak-P570WM:~/development/code/berkley/mesos$ apt-cache search python27

Install python-dev

sudo apt-get install python-dev

checking for curl_global_init in -lcurl... no
configure: error: cannot find libcurl
  You can avoid this with --without-curl, but it will mean executor
  and task resources cannot be downloaded over http.

sudo apt-get install libcurl4-openssl-dev

checking whether -pthread is sufficient with -shared... yes
checking for backtrace in -lunwind... no
configure: error: failed to determine linker flags for using Java (bad JAVA_HOME or missing support for your architecture?)

Download the JDK from the Oracle website

and set the env variables as

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45

export PATH=$PATH:$JAVA_HOME/bin


And at the end it fails with this message

cc1plus: all warnings being treated as errors
make[2]: *** [sched/libmesos_no_3rdparty_la-sched.lo] Error 1
make[2]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make: *** [check-recursive] Error 

Read this

or, as a shortcut, edit the generated Makefile and replace the line

MESOS_CPPFLAGS += -Wall -Werror

with the line below

MESOS_CPPFLAGS += -Wall -Werror -Wno-unused-local-typedefs
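If you prefer to script the edit, a one-line sed can append the flag; a sketch (Makefile.demo stands in for the real generated Makefile):

```shell
# Stand-in for the generated Makefile containing the warning flags line
printf 'MESOS_CPPFLAGS += -Wall -Werror\n' > Makefile.demo

# Append -Wno-unused-local-typedefs to the flags line in place
sed -i 's/MESOS_CPPFLAGS += -Wall -Werror$/MESOS_CPPFLAGS += -Wall -Werror -Wno-unused-local-typedefs/' Makefile.demo

cat Makefile.demo   # MESOS_CPPFLAGS += -Wall -Werror -Wno-unused-local-typedefs
```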

Hadoop: get file size and directory from the ls command

hadoop fs -ls /my/folder  | awk '{print $8}' > only_directory.txt

You can just change the column numbers above to get different information


hadoop fs -ls /my/folder  | awk '{print $6, $8}' > only_date_directory.txt
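To see which column is which, here is a sketch using a made-up line of "hadoop fs -ls" output (the real columns are permissions, replication, owner, group, size, date, time, path):

```shell
# One hypothetical line of `hadoop fs -ls` output
line='-rw-r--r--   3 jagat hadoop   1048576 2013-08-04 20:46 /my/folder/part-00000'

# $8 is the path, $6 the date, $5 the size
echo "$line" | awk '{print $8}'       # prints /my/folder/part-00000
echo "$line" | awk '{print $6, $8}'   # prints 2013-08-04 /my/folder/part-00000
echo "$line" | awk '{print $5, $8}'   # prints 1048576 /my/folder/part-00000
```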

How Apache Spark works (short summary)

Why Spark was made

Before Spark came along, MapReduce was the processing brain on top of Hadoop.

Apache Spark original research paper

The typical flow for MapReduce is:

  • Read data from disk
  • Apply some processing
  • Dump intermediary data to disk
  • Do some more processing
  • Show final results

Now if some processing has to be done incrementally, varying some variable across the whole data set,
MapReduce will again start by reading from disk. If you run the processing 100 times, it will hit the disk 100 times * 2 (counting the intermediary dumps as well).

Spark solves this typical use case, where the same processing is applied to a dataset with varying variable inputs.

Two typical use cases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset
to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a
MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through
SQL interfaces such as Pig and Hive.  Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.

Spark provides two main abstractions for parallel programming:

resilient distributed datasets, and
parallel operations on these datasets (invoked by passing a function to apply on a dataset),
which are based on the typical functional programming concepts of map, flatMap, filter, etc.

In addition, Spark supports two restricted types of shared variables that can be used in functions running
on the cluster.

Variables in Spark

  • Broadcast variables: read-only variables
  • Accumulators: variables that workers can only “add” to using an associative operation

Example Spark code

val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)

Hive Shark Impala Comparison

This post is nothing but a reproduction of the work done at AMPLab. If you want the latest and most detailed read, I suggest you go there.
The big data world is so beautiful; research in this field is moving at such a fast pace that BigData is no longer synonymous with long-running queries, it is becoming LiveData everywhere.
The work compares the computation time of Redshift, Hive, Impala and Shark with different types of queries





The performance of Shark in memory has been consistent across all 4 query types. It would be interesting to see the comparison when Hive 0.11 is used, as it also adds a few performance improvements driven by work at Hortonworks.

Spark Standalone mode installation steps

Based on

Download Spark from

I used the prebuilt version of Spark for this post. For building from source, please see the instructions on the website


Extract it to some location say


To run this you need both Scala and Java

I downloaded them and configured the following things

In /etc/environment file




You might change the above paths depending on your system

Go to


Rename the file to

Add the following values

export SCALA_HOME="/home/jagat/Downloads/scala-2.9.3"

export JAVA_HOME="/home/jagat/Downloads/jdk1.7.0_25"
export PATH=$PATH:$JAVA_HOME/bin

Now check your hosts file to make sure your system DNS resolves correctly, especially if you are on Ubuntu like me

Go to /etc/hosts

Change the following

jagat@Dell9400:~$ cat /etc/hosts localhost Dell9400

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

For Ubuntu, the loopback address for your host is; change it to your exact IP

Now we are ready to go

Go to


Start Spark Master

./run spark.deploy.master.Master

Check the URL where master started


It will give info that the master has started with

URL: spark://Dell9400:7077

This is the URL which we need in all our applications

Let's start one worker by telling it about the master

./run spark.deploy.worker.Worker spark://Dell9400:7077

This registers the worker with the master.

Now refresh the master page


You can see that a worker is added on the page


Connecting a Job to the Cluster

To run a job on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.

To run an interactive Spark shell against the cluster, run the following command:

MASTER=spark://IP:PORT ./spark-shell

That's it.

I admit those were very raw steps, but I kept it simple and quick for first-time users

Happy Sparking :)

ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster

The full error which I got

Spark context available as sc.
13/08/04 20:46:04 ERROR client.Client$ClientActor: Connection to master failed; stopping client
13/08/04 20:46:04 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
13/08/04 20:46:04 ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster

Possible Reasons


I had multiple Scala versions on my system

Check and make sure you have only one version


Check /etc/hosts file

The DNS should resolve properly

YARN vs Mesos

A very good discussion on the same topic is present on Quora

Mesos is a meta-framework scheduler rather than an application scheduler like YARN

Besides the above link, here is some additional (updated) info I found which you might find useful.

There might be many other developments, as the open source community moves very fast and this post may already be quite old by the time you read it.

With changes in the Capacity Scheduler, YARN can now schedule CPU as a resource as well. See JIRA YARN-2 for details.

YARN now has support for cgroups in containers. A very good related blog post

Storm on YARN can now be used directly

Starting with 0.6, Spark on YARN is now officially supported

There is a GSoC project to add security features to Mesos, which it currently lacks and which YARN has via Kerberos. See the wiki on the Mesos security website

Lastly, papers

Google Omega

This paper is based on research done at AMPLab and Google on next-generation schedulers for parallel infrastructures.



It classifies schedulers into the following types:

Monolithic schedulers use a single, centralized scheduling algorithm for all jobs (our existing
scheduler is one of these).

Two-level schedulers have a single active resource manager that offers compute resources to multiple parallel, independent “scheduler frameworks”, as in Mesos and Hadoop-on-Demand (HOD)

The paper classifies YARN as a monolithic scheduler and Mesos as a two-level scheduler.

It is an interesting read and also raises one question for YARN

I quote

It might appear that YARN is a two-level scheduler, too. In YARN, resource requests from per-job
application masters are sent to a single global scheduler in the resource master, which allocates resources on various machines, subject to application-specified constraints. But the application masters provide job-management services, not scheduling, so YARN is effectively a monolithic scheduler architecture.
At the time of writing, YARN only supports one resource type (fixed-sized memory chunks). Our experience suggests that it will eventually need a rich API to the resource master in order to cater for diverse application requirements, including multiple resource dimensions, constraints, and placement choices for failure-tolerance.

Although YARN application masters can request resources on particular machines, it is unclear how they acquire and maintain the state needed to make such placement decisions.

Google seems to be drifting away from YARN, unlike its counterpart Yahoo

Quoting Hortonworks from

Architecturally how does YARN compare with Mesos?
Conceptually YARN and Mesos address similar requirements. They enable organizations to pool and share horizontal compute resources across a multitude of workloads. YARN was architected specifically as an evolution of Hadoop 1.x. YARN thus tightly integrates with HDFS, MapReduce and Hadoop security.

Convert all files to pdf via command line Windows

Convert all files in folder to pdf

This should work for various types of Microsoft Office (Word, PowerPoint, etc.) files

Okay, I am on a Windows system, so I don’t have many of the luxuries which you might have on a Linux system.

But here we go


We need some utility to print files to PDF.

Download this utility to get a virtual PDF printer on your system. Remember to set it as your default printer.


Get list of all files

Go to the folder where your files are

and run the following command

dir /s /b > filenames.txt

This will run the dir command and send its output to a file named filenames.txt


See the command line parameter documentation for your Office product.


For PowerPoint 2010 it is

"c:\program files\microsoft office\office14\POWERPNT.exe" /P "MyFile.pptx"

Now use some tool like Notepad++ to generate the command for all the files you got in step 2
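If you have a Unix-style shell available (for example Git Bash on Windows), sed can generate the batch file for you instead of Notepad++; a sketch with hypothetical file names:

```shell
# Hypothetical contents of filenames.txt, as produced by `dir /s /b`
printf '%s\n' 'C:\docs\a.pptx' 'C:\docs\b.pptx' > filenames.txt

# Wrap every path in the PowerPoint print command, producing a batch file
sed 's|.*|"c:\\program files\\microsoft office\\office14\\POWERPNT.exe" /P "&"|' filenames.txt > print_all.bat

cat print_all.bat
```

Each output line looks like the single-file command shown above, with "&" standing for the path sed matched on that line.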


Mount raw hard disk partition drive on Virtual Box

Know your drive number

To know the drive number of your system go to

Run > compmgmt.msc

Go to Disk Management

You will see Disk 0, Disk 1, Disk 2, etc.

Note the number

Replace this number at the end of the command below. For example, replace 1 with 2 or whatever number you noted.

Know the partition number

C:\Program Files\Oracle\VirtualBox>VBoxManage internalcommands listpartitions -rawdisk \\.\PhysicalDrive1

This will show all the partitions of the chosen drive.

Choose the partition you want to mount

Create a VMDK disk based on this info

C:\Program Files\Oracle\VirtualBox\VBoxManage.exe internalcommands createrawvmdk -filename C:\Jagat\VirtualBox\directDisk.vmdk -rawdisk \\.\PhysicalDrive1 -partitions 7


Mirror a website using WinHTTrack command line

You can use a command like the one below.

Use a proxy if your network needs it (the -P option takes the proxy address)

C:\Program Files\WinHTTrack\httrack.exe -P -O "C:\\Jagat\\WinHTTrack\\ClouderaCDH4RHEL6" --verbose

read status line: An existing connection was forcibly closed by the remote

SVN error while checkout

read status line: An existing connection was forcibly closed by the remote


Try replacing http with https, or vice versa

If this doesn't work, then

update to the latest SVN from the net

Proxy settings for SVN

There is a “servers” file for SVN which is present at the following locations

Win : C:\Documents and Settings\jj\Application Data\Subversion\servers

Linux: /etc/subversion/servers

Specify the proxy settings there

Read details at

Proxy settings for Tortoise SVN

You can go to TortoiseSVN -> Settings -> Network

Possibly transient ZooKeeper exception

[WARN] RecoverableZooKeeper - Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid                                                                                                                                   
 [INFO] RetryCounter - Sleeping 4000ms before retry


Check that you have the correct hbase-site.xml in the /etc/hbase/conf folder

or wherever your HBase configuration settings live

The property to look for is hbase.zookeeper.quorum:

    <value>yournode01,yournode02</value>

Find my Maven configuration file

Use the command

mvn -X

The debug output includes the paths of the settings files Maven is reading.

The following packages have been kept back

If you get this message while updating packages on Ubuntu,

use the command

sudo apt-get dist-upgrade

This will update everything, including packages that were kept back because their dependencies changed.