Building Mesos from source

Install the required software for building

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

Install the software needed for the build

Copy and paste the following:

sudo apt-get install -y cmake git-core git-svn subversion checkinstall build-essential dh-make debhelper ant ant-optional autoconf automake liblzo2-dev libzip-dev sharutils libfuse-dev reprepro libtool libssl-dev asciidoc xmlto ssh curl

sudo apt-get install -y devscripts

sudo apt-get install -y pkg-config dh-autoreconf python-dev libcurl4-openssl-dev libboost-all-dev libunwind8-dev

Clone the code

git clone https://git-wip-us.apache.org/repos/asf/mesos.git mesos

Go to the directory where the code was cloned

cd mesos

All of the commands below should finish without errors.

./bootstrap

./configure

make

-----------------

Some of the errors I got and how I solved them

checking consistency of all components of python development environment... no
configure: error: in `/home/jagat/development/code/berkley/mesos':
configure: error:
Could not link test program to Python. Maybe the main Python library has been
installed in some non-standard library path. If so, pass it to configure,
via the LDFLAGS environment variable.
Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
============================================================================
ERROR!
You probably have to install the development version of the Python package
for your distribution.  The exact name of this package varies among them.
============================================================================

See `config.log' for more details
jagat@nanak-P570WM:~/development/code/berkley/mesos$ apt-cache search python27

Solution

Install python-dev

sudo apt-get install python-dev

checking for curl_global_init in -lcurl... no
configure: error: cannot find libcurl
-------------------------------------------------------------------
You can avoid this with --without-curl, but it will mean executor
and task resources cannot be downloaded over http.
-------------------------------------------------------------------

Solution

sudo apt-get install libcurl4-openssl-dev

checking whether -pthread is sufficient with -shared... yes
checking for backtrace in -lunwind... no
configure: error: failed to determine linker flags for using Java (bad JAVA_HOME or missing support for your architecture?)

Solution

Download the JDK and set the environment variables as

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

----

And in the end it fails with this message

cc1plus: all warnings being treated as errors
make[2]: *** [sched/libmesos_no_3rdparty_la-sched.lo] Error 1
make[2]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make: *** [check-recursive] Error

Solution

Read this http://www.mail-archive.com/dev@mesos.apache.org/msg02267.html

Or, as a shortcut, open mesos/src/Makefile.am and replace

MESOS_CPPFLAGS += -Wall -Werror

with the line below

MESOS_CPPFLAGS += -Wall -Werror -Wno-unused-local-typedefs

Hadoop: get file size and directory from the ls command

hadoop fs -ls /my/folder | awk '{print $8}' > only_directory.txt

You can change the field numbers above to extract different information

Example

hadoop fs -ls /my/folder  | awk '{print $6,$8}' > only_date_directory.txt
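A quick way to see which awk field is which, without touching the cluster, is to run a sample ls-style line through awk (the line below is a made-up example of hadoop fs -ls output):

```shell
# Fields in a `hadoop fs -ls` line: perms, replication, owner, group,
# size, date, time, path. $5 is the size, $8 is the path.
line='-rw-r--r--   3 jagat supergroup   1048576 2013-08-04 20:46 /my/folder/file1'
echo "$line" | awk '{print $5, $8}'   # prints: 1048576 /my/folder/file1
```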

How Apache Spark works (short summary)

Before Spark came to the market, MapReduce was used as the processing brain on top of Hadoop.

Apache Spark original research paper

The typical flow for MapReduce is:

• Apply some processing
• Dump intermediary data on disk
• Do some more processing
• Show final results

Now, if some processing has to be done incrementally by changing some variable across the whole dataset,
MapReduce will again start by reading from disk. If you run the processing 100 times, it will do the disk work 100 times * 2 (counting the intermediary dumps as well).

Spark solves the typical use cases where the same processing is applied to a dataset with varying variable inputs.

Two typical usecases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset
to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a
MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through
SQL interfaces such as Pig and Hive.  Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.

Spark provides two main abstractions for parallel programming:

resilient distributed datasets and
parallel operations on these datasets (invoked by passing a function to apply on a dataset).
These are based on typical functional programming concepts such as map, flatMap, filter, etc.

In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster:

Variables in Spark

• Broadcast variables: Read-only data that is shipped to each worker once and reused across tasks, instead of being sent with every function
• Accumulators: These are variables that workers can only “add” to using an associative operation

Example Spark code

val file = spark.textFile("hdfs://...")      // load the file as an RDD of lines
val errs = file.filter(_.contains("ERROR"))  // keep only the lines containing ERROR
val ones = errs.map(_ => 1)                  // map each matching line to 1
val count = ones.reduce(_ + _)               // sum them up to get the total count
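The same filter/map/reduce idea can be tried locally in the shell, which is handy for sanity-checking the logic (the sample log lines here are made up):

```shell
# Create a tiny sample log, then count the lines containing ERROR --
# the shell analogue of the filter/map/reduce pipeline above.
printf '%s\n' 'INFO start' 'ERROR disk full' 'WARN slow' 'ERROR timeout' > /tmp/sample.log
grep -c 'ERROR' /tmp/sample.log   # prints 2
```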

Hive Shark Impala Comparison

Credits
This post is nothing but a reproduction of the work done at AMPLab. If you want the latest and most detailed read, I suggest you go there.
The Big Data world is so beautiful; research in this field is moving at such a fast pace that BigData is no longer synonymous with long-running queries, it is becoming LiveData everywhere.
The work compares the computation times of Redshift, Hive, Impala and Shark with different types of queries.

The performance of Shark in memory has been consistent across all 4 query types. It would be interesting to see the comparison when Hive 0.11 is used, as it adds a few performance improvements driven by work at Hortonworks.

Spark Standalone mode installation steps

Based on

http://spark-project.org/docs/latest/spark-standalone.html

I used the prebuilt version of Spark for this post. For building from source, please see the instructions on the website.

Extract it to some location say

For running this you need both Scala and Java

In the /etc/environment file

PATH=$PATH:$SCALA_HOME/bin

PATH=$PATH:$JAVA_HOME/bin

You might change the above paths depending on your system

Go to

Rename the spark-env.sh.template file to spark-env.sh

export PATH=$PATH:$SCALA_HOME/bin

export PATH=$PATH:$JAVA_HOME/bin
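The rename-and-edit step can be scripted; below is a self-contained sketch where a scratch directory stands in for the real Spark conf folder (the template contents are assumed):

```shell
# Copy the template to spark-env.sh and append the PATH exports.
# A scratch dir stands in for the real Spark conf directory here.
conf="$(mktemp -d)"
echo '# spark environment template' > "$conf/spark-env.sh.template"
cp "$conf/spark-env.sh.template" "$conf/spark-env.sh"
cat >> "$conf/spark-env.sh" <<'EOF'
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
EOF
grep -c 'export PATH' "$conf/spark-env.sh"   # prints 2
```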

Now check your hosts file to make sure your system DNS resolves correctly, especially if you are on Ubuntu like me

Go to /etc/hosts

Change the following

jagat@Dell9400:~$ cat /etc/hosts

127.0.0.1 localhost
192.168.0.104 Dell9400

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

On Ubuntu, the loopback address for your host is 127.0.1.1; change it to your exact IP.

Now we are ready to go

Go to /home/jagat/Downloads/spark-0.7.3/

Start the Spark master

./run spark.deploy.master.Master

Check the URL where the master started

http://localhost:8080/

It will show that the master started with URL: spark://Dell9400:7077

This is the URL which we need in all our applications.

Let's start one worker by telling it about the master

./run spark.deploy.worker.Worker spark://Dell9400:7077

This registers the worker with the master. Now refresh the master page http://localhost:8080/ and you can see that a worker has been added on the page.

Connecting a job to the cluster

To run a job on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor. To run an interactive Spark shell against the cluster, run the following command:

MASTER=spark://IP:PORT ./spark-shell

That's it. I admit those were very raw steps, but I kept it simple and quick for first-time users. Happy Sparking :)

ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster

Full error which I got

Spark context available as sc.
13/08/04 20:46:04 ERROR client.Client$ClientActor: Connection to master failed; stopping client
13/08/04 20:46:04 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
13/08/04 20:46:04 ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster
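Before digging into fixes, it is worth checking that the hostname the master advertised actually resolves; a small sketch (Dell9400 is the hostname from this post, substitute your own):

```shell
# If the name resolves, getent prints "IP hostname"; on a misconfigured
# Ubuntu box you may see the 127.0.1.1 loopback here instead of the LAN IP.
getent hosts Dell9400 || echo 'Dell9400 does not resolve - fix /etc/hosts'
```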
Possible reasons

1) I had multiple Scala versions on the system. Check and make sure you have only one version.
2) Check the /etc/hosts file. The DNS should resolve properly.

YARN vs Mesos

A very good discussion on the same topic is present on Quora

http://www.quora.com/How-does-YARN-compare-to-Mesos

Mesos is a meta/framework scheduler rather than an application scheduler like YARN.

Besides the above link, here is some additional (updated) info I found which you might find useful. There might be many other things, as the open source community moves very fast, and this post might already be old while you are reading.

• With changes in the Capacity scheduler, YARN can now support CPU as a scheduled resource too. See JIRA YARN-2 for details.
• YARN now has support for cgroups in containers. A very good related blog post
• Storm on YARN can now be used directly
• Starting with 0.6, Spark on YARN is officially supported
• A GSoC project to add security to Mesos, related to adding the security features which Mesos currently lacks and YARN has via Kerberos. Wiki on Mesos security website

Lastly, papers

Google Omega
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
This paper is based on research done at AMPLab and Google on next generation schedulers for parallel infrastructures.

Mesos
http://bnrg.cs.berkeley.edu/~adj/publications/paper-files/nsdi_mesos.pdf

YARN
http://www.socc2013.org/home/program/a5-vavilapalli.pdf

The Omega paper classifies schedulers into the following types:

Monolithic schedulers use a single, centralized scheduling algorithm for all jobs (our existing scheduler is one of these).

Two-level schedulers have a single active resource manager that offers compute resources to multiple parallel, independent “scheduler frameworks”, as in Mesos and Hadoop-on-Demand.

The paper classifies YARN as a monolithic scheduler and Mesos as a two-level scheduler.
It is an interesting read and also raises one question for YARN. I quote:

“It might appear that YARN is a two-level scheduler, too. In YARN, resource requests from per-job application masters are sent to a single global scheduler in the resource master, which allocates resources on various machines, subject to application-specified constraints. But the application masters provide job-management services, not scheduling, so YARN is effectively a monolithic scheduler architecture. At the time of writing, YARN only supports one resource type (fixed-sized memory chunks). Our experience suggests that it will eventually need a rich API to the resource master in order to cater for diverse application requirements, including multiple resource dimensions, constraints, and placement choices for failure-tolerance. Although YARN application masters can request resources on particular machines, it is unclear how they acquire and maintain the state needed to make such placement decisions.”

Google seems to be drifting away from YARN, unlike its counterpart Yahoo.

Quoting Hortonworks on “Architecturally, how does YARN compare with Mesos?”:

“Conceptually YARN and Mesos address similar requirements. They enable organizations to pool and share horizontal compute resources across a multitude of workloads. YARN was architected specifically as an evolution of Hadoop 1.x. YARN thus tightly integrates with HDFS, MapReduce and Hadoop security.”

Convert all files to pdf via command line on Windows

Convert all files in a folder to pdf. This should work for various types of Microsoft Office (Word, PowerPoint, etc.) files.

Okay, I am on a Windows system, so I don't have many of the luxuries which you might have on a Linux system. But here we go.

1) We need some utility to print files to pdf. Download this utility to get a virtual pdf printer on your system. Remember to set it as your default printer.
2) Get a list of all files

Go to the folder where your files are and run the following command

dir /s /b > filenames.txt

This will run the dir command and send its output to a file named filenames.txt

3) See the command line parameters for your office product.

Example: for PowerPoint 2010 it is

"c:\program files\microsoft office\office14\POWERPNT.exe" /P "MyFile.pptx"

Now use some tool like Notepad++ to generate the command for all the files you got in step 2.

Done

Mount a raw hard disk partition drive on VirtualBox

Know your drive number

To know the drive number of your system, go to Run > compmgmt.msc, then go to Disk Management. See the Disk 0, Disk 1, Disk 3, etc. Note the number and replace it at the end of the command below. For example, replace 1 with 2 or whatever number you noted.

Know the partition number

C:\Program Files\Oracle\VirtualBox>VBoxManage internalcommands listpartitions -rawdisk \\.\PhysicalDrive1

This will show all the partitions of the current drive. Choose the partition you want to mount.

Create a VMDK disk based on this info

C:\Program Files\Oracle\VirtualBox\VBoxManage.exe internalcommands createrawvmdk -filename C:\Jagat\VirtualBox\directDisk.vmdk -rawdisk \\.\PhysicalDrive1 -partitions 7

Done

Mirror a website using the WinHTTrack command line

You can use a command like the one below. Use a proxy if your network needs it.

C:\Program Files\WinHTTrack\httrack.exe http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/ -P jaggija:@proxy.com:8080 -O "C:\\Jagat\\WinHTTrack\\ClouderaCDH4RHEL6" --verbose

SVN error while checkout

read status line: An existing connection was forcibly closed by the remote

Solution

Try replacing http with https, or vice versa. If this doesn't work, then update to the latest SVN from the net.

Proxy settings for SVN

There is a “servers” file in SVN which is present at the following location

Win: C:\Documents and Settings\jj\Application Data\Subversion\servers
Linux:
/etc/subversion/servers

Specify the proxy settings there. Read the details at

http://vikashazrati.wordpress.com/2009/01/25/http-proxy-sv/

Proxy settings for Tortoise SVN

You can go to TortoiseSVN -> Settings >> Network

http://stackoverflow.com/questions/111543/tortoisesvn-error-options-of-https-could-not-connect-to-server

Possibly transient ZooKeeper exception

[WARN] RecoverableZooKeeper - Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
[INFO] RetryCounter - Sleeping 4000ms before retry

Solution

Check that you have the correct hbase-site.xml in the /etc/hbase/conf folder

The property to look for is

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>yournode01,yournode02</value>
</property>
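To double-check the value the client will actually read, you can pull it straight out of the XML; a minimal sed sketch (the snippet is written to /tmp only for illustration, the real file lives in /etc/hbase/conf):

```shell
# Extract the text between <value> tags -- a quick, dependency-free check
# of which quorum hosts the client configuration points at.
cat > /tmp/hbase-site-snippet.xml <<'EOF'
<property>
<name>hbase.zookeeper.quorum</name>
<value>yournode01,yournode02</value>
</property>
EOF
sed -n 's:.*<value>\(.*\)</value>.*:\1:p' /tmp/hbase-site-snippet.xml   # prints: yournode01,yournode02
```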

Use command

mvn -X

The following packages have been kept back

While updating packages on Ubuntu, if you get this message

Use command