Building Mesos from source

Install the required software for building

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

Install the software needed for the build

Copy and paste the following:

sudo apt-get install -y cmake git-core git-svn subversion checkinstall build-essential dh-make debhelper ant ant-optional autoconf automake liblzo2-dev libzip-dev sharutils libfuse-dev reprepro libtool libssl-dev asciidoc xmlto ssh curl

sudo apt-get install -y devscripts

sudo apt-get install -y pkg-config dh-autoreconf python-dev libcurl4-openssl-dev libboost-all-dev libunwind8-dev

Clone the code

git clone https://git-wip-us.apache.org/repos/asf/mesos.git mesos

Go to the directory where the code was cloned

cd mesos

All of the commands below should finish without errors.

./bootstrap

./configure

make

-----------------

Some of the errors I got and how I solved them

checking consistency of all components of python development environment... no
configure: error: in `/home/jagat/development/code/berkley/mesos':
configure: error:
Could not link test program to Python. Maybe the main Python library has been
installed in some non-standard library path. If so, pass it to configure,
via the LDFLAGS environment variable.
Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
============================================================================
ERROR!
You probably have to install the development version of the Python package
for your distribution.  The exact name of this package varies among them.
============================================================================

See `config.log' for more details
jagat@nanak-P570WM:~/development/code/berkley/mesos$ apt-cache search python27

Solution

Install python-dev

sudo apt-get install python-dev

checking for curl_global_init in -lcurl... no
configure: error: cannot find libcurl
-------------------------------------------------------------------
You can avoid this with --without-curl, but it will mean executor
and task resources cannot be downloaded over http.
-------------------------------------------------------------------

Solution

sudo apt-get install libcurl4-openssl-dev

checking whether -pthread is sufficient with -shared... yes
checking for backtrace in -lunwind... no
configure: error: failed to determine linker flags for using Java (bad JAVA_HOME or missing support for your architecture?)

Solution

Download the JDK and set the environment variables as

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

----

And in the end it fails with this message

cc1plus: all warnings being treated as errors
make[2]: *** [sched/libmesos_no_3rdparty_la-sched.lo] Error 1
make[2]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/jagat/development/tools/mesos-0.13.0/build/src'
make: *** [check-recursive] Error

Solution

Read this http://www.mail-archive.com/dev@mesos.apache.org/msg02267.html

Or, as a shortcut, open mesos/src/Makefile.am and replace

MESOS_CPPFLAGS += -Wall -Werror

with the line below

MESOS_CPPFLAGS += -Wall -Werror -Wno-unused-local-typedefs

Hadoop: get file size and directory from the ls command

hadoop fs -ls /my/folder | awk '{print $8}' > only_directory.txt

You can change the field numbers above to extract different information

Example

hadoop fs -ls /my/folder  | awk '{print $6,$8}' > only_date_directory.txt
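A quick way to see which awk field is which, without touching the cluster, is to run a sample ls-style line through awk (the line below is a made-up example of hadoop fs -ls output):

```shell
# Fields in a `hadoop fs -ls` line: perms, replication, owner, group,
# size, date, time, path. $5 is the size, $8 is the path.
line='-rw-r--r--   3 jagat supergroup   1048576 2013-08-04 20:46 /my/folder/file1'
echo "$line" | awk '{print $5, $8}'   # prints: 1048576 /my/folder/file1
```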

How Apache Spark works (short summary)

Before Spark came to the market, MapReduce was used as the processing brain on top of Hadoop.

Apache Spark original research paper

The typical flow for MapReduce is:

• Apply some processing
• Dump intermediary data on disk
• Do some more processing
• Show final results

Now, if some processing has to be done incrementally by changing some variable across the whole dataset,
MapReduce will again start by reading from disk. If you run the processing 100 times, it will do the disk work 100 times * 2 (counting the intermediary dumps as well).

Spark solves the typical use cases where the same processing is applied to a dataset with varying variable inputs.

Two typical usecases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset
to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a
MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through
SQL interfaces such as Pig and Hive.  Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.

Spark provides two main abstractions for parallel programming:

resilient distributed datasets and
parallel operations on these datasets (invoked by passing a function to apply on a dataset).
These are based on typical functional programming concepts such as map, flatMap, filter, etc.

In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster:

Variables in Spark

• Broadcast variables: Read-only data that is shipped to each worker once and reused across tasks, instead of being sent with every function
• Accumulators: These are variables that workers can only “add” to using an associative operation

Example Spark code

val file = spark.textFile("hdfs://...")      // load the file as an RDD of lines
val errs = file.filter(_.contains("ERROR"))  // keep only the lines containing ERROR
val ones = errs.map(_ => 1)                  // map each matching line to 1
val count = ones.reduce(_ + _)               // sum them up to get the total count
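The same filter/map/reduce idea can be tried locally in the shell, which is handy for sanity-checking the logic (the sample log lines here are made up):

```shell
# Create a tiny sample log, then count the lines containing ERROR --
# the shell analogue of the filter/map/reduce pipeline above.
printf '%s\n' 'INFO start' 'ERROR disk full' 'WARN slow' 'ERROR timeout' > /tmp/sample.log
grep -c 'ERROR' /tmp/sample.log   # prints 2
```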

Hive Shark Impala Comparison

Credits
This post is nothing but a reproduction of the work done at AMPLab. If you want the latest and most detailed read, I suggest you go there.
The Big Data world is so beautiful; research in this field is moving at such a fast pace that BigData is no longer synonymous with long-running queries, it is becoming LiveData everywhere.
The work compares the computation times of Redshift, Hive, Impala and Shark with different types of queries.

The performance of Shark in memory has been consistent across all 4 query types. It would be interesting to see the comparison when Hive 0.11 is used, as it adds a few performance improvements driven by work at Hortonworks.

Spark Standalone mode installation steps

Based on

http://spark-project.org/docs/latest/spark-standalone.html

I used the prebuilt version of Spark for this post. For building from source, please see the instructions on the website.

Extract it to some location say

For running this you need both Scala and Java

In the /etc/environment file

PATH=$PATH:$SCALA_HOME/bin

PATH=$PATH:$JAVA_HOME/bin

You might change the above paths depending on your system

Go to

Rename the spark-env.sh.template file to spark-env.sh

export PATH=$PATH:$SCALA_HOME/bin

export PATH=$PATH:$JAVA_HOME/bin
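The rename-and-edit step can be scripted; below is a self-contained sketch where a scratch directory stands in for the real Spark conf folder (the template contents are assumed):

```shell
# Copy the template to spark-env.sh and append the PATH exports.
# A scratch dir stands in for the real Spark conf directory here.
conf="$(mktemp -d)"
echo '# spark environment template' > "$conf/spark-env.sh.template"
cp "$conf/spark-env.sh.template" "$conf/spark-env.sh"
cat >> "$conf/spark-env.sh" <<'EOF'
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
EOF
grep -c 'export PATH' "$conf/spark-env.sh"   # prints 2
```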

Now check your hosts file to make sure your system DNS resolves correctly, especially if you are on Ubuntu like me

Go to /etc/hosts

Change the following

jagat@Dell9400:~$ cat /etc/hosts

127.0.0.1 localhost
192.168.0.104 Dell9400

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

On Ubuntu, the loopback address for your host is 127.0.1.1; change it to your exact IP.

Now we are ready to go

Go to /home/jagat/Downloads/spark-0.7.3/

Start the Spark master

./run spark.deploy.master.Master

Check the URL where the master started

http://localhost:8080/

It will show that the master started with URL: spark://Dell9400:7077

This is the URL which we need in all our applications.

Let's start one worker by telling it about the master

./run spark.deploy.worker.Worker spark://Dell9400:7077

This registers the worker with the master. Now refresh the master page http://localhost:8080/ and you can see that a worker has been added on the page.

Connecting a job to the cluster

To run a job on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor. To run an interactive Spark shell against the cluster, run the following command:

MASTER=spark://IP:PORT ./spark-shell

That's it. I admit those were very raw steps, but I kept it simple and quick for first-time users. Happy Sparking :)

ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster

Full error which I got

Spark context available as sc.
13/08/04 20:46:04 ERROR client.Client$ClientActor: Connection to master failed; stopping client
13/08/04 20:46:04 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
13/08/04 20:46:04 ERROR cluster.ClusterScheduler: Exiting due to error from cluster scheduler: Disconnected from Spark cluster
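Before digging into fixes, it is worth checking that the hostname the master advertised actually resolves; a small sketch (Dell9400 is the hostname from this post, substitute your own):

```shell
# If the name resolves, getent prints "IP hostname"; on a misconfigured
# Ubuntu box you may see the 127.0.1.1 loopback here instead of the LAN IP.
getent hosts Dell9400 || echo 'Dell9400 does not resolve - fix /etc/hosts'
```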
Possible reasons

1) I had multiple Scala versions on the system. Check and make sure you have only one version.
2) Check the /etc/hosts file. The DNS should resolve properly.

YARN vs Mesos

A very good discussion on the same topic is present on Quora

http://www.quora.com/How-does-YARN-compare-to-Mesos

Mesos is a meta/framework scheduler rather than an application scheduler like YARN.

Besides the above link, here is some additional (updated) info I found which you might find useful. There might be many other things, as the open source community moves very fast, and this post might already be old while you are reading.

• With changes in the Capacity scheduler, YARN can now support CPU as a scheduled resource too. See JIRA YARN-2 for details.
• YARN now has support for cgroups in containers. A very good related blog post
• Storm on YARN can now be used directly
• Starting with 0.6, Spark on YARN is officially supported
• A GSoC project to add security to Mesos, related to adding the security features which Mesos currently lacks and YARN has via Kerberos. Wiki on Mesos security website

Lastly, papers

Google Omega
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
This paper is based on research done at AMPLab and Google on next generation schedulers for parallel infrastructures.

Mesos
http://bnrg.cs.berkeley.edu/~adj/publications/paper-files/nsdi_mesos.pdf

YARN
http://www.socc2013.org/home/program/a5-vavilapalli.pdf

The Omega paper classifies schedulers into the following types:

Monolithic schedulers use a single, centralized scheduling algorithm for all jobs (our existing scheduler is one of these).

Two-level schedulers have a single active resource manager that offers compute resources to multiple parallel, independent “scheduler frameworks”, as in Mesos and Hadoop-on-Demand.

The paper classifies YARN as a monolithic scheduler and Mesos as a two-level scheduler.
It is an interesting read and also raises one question for YARN. I quote:

“It might appear that YARN is a two-level scheduler, too. In YARN, resource requests from per-job application masters are sent to a single global scheduler in the resource master, which allocates resources on various machines, subject to application-specified constraints. But the application masters provide job-management services, not scheduling, so YARN is effectively a monolithic scheduler architecture. At the time of writing, YARN only supports one resource type (fixed-sized memory chunks). Our experience suggests that it will eventually need a rich API to the resource master in order to cater for diverse application requirements, including multiple resource dimensions, constraints, and placement choices for failure-tolerance. Although YARN application masters can request resources on particular machines, it is unclear how they acquire and maintain the state needed to make such placement decisions.”

Google seems to be drifting away from YARN, unlike its counterpart Yahoo.

Quoting Hortonworks on “Architecturally, how does YARN compare with Mesos?”:

“Conceptually YARN and Mesos address similar requirements. They enable organizations to pool and share horizontal compute resources across a multitude of workloads. YARN was architected specifically as an evolution of Hadoop 1.x. YARN thus tightly integrates with HDFS, MapReduce and Hadoop security.”

Convert all files to pdf via command line on Windows

Convert all files in a folder to pdf. This should work for various types of Microsoft Office (Word, PowerPoint, etc.) files.

Okay, I am on a Windows system, so I don't have many of the luxuries which you might have on a Linux system. But here we go.

1) We need some utility to print files to pdf. Download this utility to get a virtual pdf printer on your system. Remember to set it as your default printer.
2) Get a list of all files

Go to the folder where your files are and run the following command

dir /s /b > filenames.txt

This will run the dir command and send its output to a file named filenames.txt

3) See the command line parameters for your office product.

Example: for PowerPoint 2010 it is

"c:\program files\microsoft office\office14\POWERPNT.exe" /P "MyFile.pptx"

Now use some tool like Notepad++ to generate the command for all the files you got in step 2.

Done

Mount a raw hard disk partition drive on VirtualBox

Know your drive number

To know the drive number of your system, go to Run > compmgmt.msc, then go to Disk Management. See the Disk 0, Disk 1, Disk 3, etc. Note the number and replace it at the end of the command below. For example, replace 1 with 2 or whatever number you noted.

Know the partition number

C:\Program Files\Oracle\VirtualBox>VBoxManage internalcommands listpartitions -rawdisk \\.\PhysicalDrive1

This will show all the partitions of the current drive. Choose the partition you want to mount.

Create a VMDK disk based on this info

C:\Program Files\Oracle\VirtualBox\VBoxManage.exe internalcommands createrawvmdk -filename C:\Jagat\VirtualBox\directDisk.vmdk -rawdisk \\.\PhysicalDrive1 -partitions 7

Done

Mirror a website using the WinHTTrack command line

You can use a command like the one below. Use a proxy if your network needs it.

C:\Program Files\WinHTTrack\httrack.exe http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/ -P jaggija:@proxy.com:8080 -O "C:\\Jagat\\WinHTTrack\\ClouderaCDH4RHEL6" --verbose

SVN error while checkout

read status line: An existing connection was forcibly closed by the remote

Solution

Try replacing http with https, or vice versa. If this doesn't work, then update to the latest SVN from the net.

Proxy settings for SVN

There is a “servers” file in SVN which is present at the following location

Win: C:\Documents and Settings\jj\Application Data\Subversion\servers
Linux:
/etc/subversion/servers

Specify the proxy settings there. Read the details at

http://vikashazrati.wordpress.com/2009/01/25/http-proxy-sv/

Proxy settings for Tortoise SVN

You can go to TortoiseSVN -> Settings >> Network

http://stackoverflow.com/questions/111543/tortoisesvn-error-options-of-https-could-not-connect-to-server

Possibly transient ZooKeeper exception

[WARN] RecoverableZooKeeper - Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
[INFO] RetryCounter - Sleeping 4000ms before retry

Solution

Check that you have the correct hbase-site.xml in the /etc/hbase/conf folder

The property to look for is

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>yournode01,yournode02</value>
</property>
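To double-check the value the client will actually read, you can pull it straight out of the XML; a minimal sed sketch (the snippet is written to /tmp only for illustration, the real file lives in /etc/hbase/conf):

```shell
# Extract the text between <value> tags -- a quick, dependency-free check
# of which quorum hosts the client configuration points at.
cat > /tmp/hbase-site-snippet.xml <<'EOF'
<property>
<name>hbase.zookeeper.quorum</name>
<value>yournode01,yournode02</value>
</property>
EOF
sed -n 's:.*<value>\(.*\)</value>.*:\1:p' /tmp/hbase-site-snippet.xml   # prints: yournode01,yournode02
```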

Use command

mvn -X

The following packages have been kept back

While updating packages on Ubuntu, if you get this message

Use command