Machine learning for large organisations


Machine learning and analytics work in large organisations often has to serve many different objectives to meet business needs. This post shares what I have learned from working with large organisations whose teams are often spread across departmental silos, using a similar set of technologies but pursuing different business goals.

The design of this organisational machine learning pipeline is based on the following goals:

  • Let users work with the set of tools they prefer
  • Have a unified way of executing the models they build
  • Have a quick way of A/B testing the models
  • Be able to follow continuous integration principles

I will share both open source and licensed tools that can be used to create this kind of environment.

At a high level, a typical data science flow is:

  1. Analyse the data
  2. Play with it to create a machine learning model
  3. Export the model in PMML format
  4. Execute the PMML model to generate scores in batch or real time

You may use different terms for scores, batch and real time, but I will keep the above flow for simplicity. With both patterns (buy or open source), one of the goals was to reuse whatever is available as open source and avoid reinventing the wheel.

Two sets of people are involved: those who make PMML files and those who execute (operate) them, with only a small overlap between the two. The people who make the models, and hence the PMML artefacts, are also responsible for running validation pipelines (more on this later).

Analysis and Modelling

Analysts and data scientists use various tools, for example R, Spark and scikit-learn, each accessed via a web front end or the native clients provided for them. The majority of people prefer these three tools, so supporting them (the first goal) makes almost 90% of people happy.

PMML export

Most of the tools support export to PMML, and the end artefact for the analytics team is the PMML file. See the references at the end for links covering the various tools. The only tricky bit is when your model type is not supported by PMML, but that is very rare, since most of the time people work with regressions. See the PMML link in the references for the complete list of supported models.
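
Since a PMML file is plain XML, a quick sanity check on an exported artefact is easy to script. The snippet below is only a rough sketch of that idea: the sample document is a hypothetical, heavily trimmed PMML file, and the set of model element names is a subset I picked for illustration, not the full list from the standard.

```python
import xml.etree.ElementTree as ET

# A subset of the model element names defined by PMML (not the full list).
KNOWN_MODEL_ELEMENTS = {
    "RegressionModel", "GeneralRegressionModel", "TreeModel",
    "NaiveBayesModel", "NeuralNetwork", "ClusteringModel",
    "SupportVectorMachineModel", "MiningModel",
}


def local_name(tag):
    """Strip the XML namespace: '{http://www.dmg.org/PMML-4_2}PMML' -> 'PMML'."""
    return tag.rsplit("}", 1)[-1]


def describe_pmml(pmml_text):
    """Return the model types and data dictionary field names of a PMML document."""
    root = ET.fromstring(pmml_text)
    if local_name(root.tag) != "PMML":
        raise ValueError("not a PMML document")
    # Models are direct children of the PMML root element.
    models = [local_name(child.tag) for child in root
              if local_name(child.tag) in KNOWN_MODEL_ELEMENTS]
    # Data fields live inside the DataDictionary element.
    fields = [elem.get("name") for elem in root.iter()
              if local_name(elem.tag) == "DataField"]
    return {"models": models, "fields": fields}


# Hypothetical, heavily trimmed PMML document used only for illustration.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
    <DataField name="Species" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="classification"/>
</PMML>"""

summary = describe_pmml(SAMPLE_PMML)
print(summary)
```

An empty "models" list is a hint that the export produced something the scoring side may not understand, which is worth catching before the file goes any further.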

Continuous Integration and A/B Testing

Each PMML file is committed to a git repo, which produces a deployable artefact (an rpm). With each A/B test, the team does a new release of its models. Each team has its own project, so it can make changes and deploy according to its own release cycle. Deployment is also controlled per environment (production, development, etc.).
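
How traffic is split between two model releases is up to each team. One common approach, shown here purely as an assumption and not as what any particular team used, is to hash a customer id so that the same customer always lands on the same release. The function name and parameters are hypothetical.

```python
import hashlib


def assign_variant(customer_id, experiment, percent_b=50):
    """Deterministically assign a customer to model release 'A' or 'B'.

    percent_b is the share of customers (0 to 100) routed to release B.
    """
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 4 bytes of the hash into a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "B" if bucket < percent_b else "A"


# The same customer always gets the same answer for a given experiment.
print(assign_variant("customer-42", "churn-model-v2"))
```

Including the experiment name in the hash means a customer who got release B in one test is not automatically in the B group of the next one.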

Scoring engine 

Now the only task left is to run the PMML file. At the moment Spark cannot run a PMML file, it can only produce one. In the open source world, JPMML is the scoring engine that supports running PMML files. It has a REST API (see the open scoring engine in the references), which lets teams trigger models as and when they want. Teams also want to bring the results back into their own systems, and this is where Spring XD comes into the picture: it lets anyone write a few simple lines and do all the end-to-end work.

Sample code for running PMML in Spring XD:

analytic-pmml
     --location=/models/iris-flower-naive-bayes.pmml.xml
     --inputFieldMapping=
       'sepalLength:Sepal.Length,
        sepalWidth:Sepal.Width,
        petalLength:Petal.Length,
        petalWidth:Petal.Width'
     --outputFieldMapping='Predicted_Species:predictedSpecies' | SomeSINK

Many teams want to push results to a database, some want them in HDFS, and others in RabbitMQ. All they change is the sink, and Spring XD does the rest.
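
To make the field mappings in the sample above a bit more concrete, here is a small sketch of what such a mapping amounts to: each pair renames a field of the incoming message to the name the PMML model expects. This is my own illustration of the idea, not Spring XD code, and the function names are hypothetical.

```python
def parse_field_mapping(mapping):
    """Parse 'source:target,source:target' into a dict, ignoring whitespace."""
    pairs = {}
    for item in mapping.split(","):
        item = item.strip()
        if not item:
            continue
        source, target = item.split(":", 1)
        pairs[source.strip()] = target.strip()
    return pairs


def apply_mapping(record, mapping):
    """Rename the mapped keys of a record; unmapped keys are dropped."""
    return {target: record[source] for source, target in mapping.items()}


input_mapping = parse_field_mapping(
    """sepalLength:Sepal.Length,
       sepalWidth:Sepal.Width,
       petalLength:Petal.Length,
       petalWidth:Petal.Width"""
)

# Hypothetical incoming message.
event = {"sepalLength": 5.1, "sepalWidth": 3.5,
         "petalLength": 1.4, "petalWidth": 0.2}
model_input = apply_mapping(event, input_mapping)
print(model_input)
```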

With the REST API integrated into the business units' applications, the modelling and data science pipeline becomes self-service: the business is empowered to build machine learning models, deploy them with CI practices, and run them as and when they want.

The time needed to A/B test is very short, so you can fail fast.

Validation pipelines

Validation pipelines run on the same dataset, but on a small sample and with the same tool that was used to build the model. For example, if I build a model using scikit-learn, I export it as PMML, run my validation in scikit-learn as well, and compare the result with the output the real scoring engine gives.
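
The comparison step can be as simple as the sketch below. It assumes both sides produce one numeric score per record id and that a small floating point difference is acceptable; the scores, ids and tolerance are made up for illustration.

```python
import math


def compare_scores(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return the record ids whose scores are missing or differ between runs."""
    mismatches = []
    for record_id, expected_score in expected.items():
        actual_score = actual.get(record_id)
        if actual_score is None or not math.isclose(
            expected_score, actual_score, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            mismatches.append(record_id)
    return mismatches


# Hypothetical scores for a small validation sample.
modelling_tool_scores = {"r1": 0.9123456, "r2": 0.1000000, "r3": 0.5000000}
scoring_engine_scores = {"r1": 0.9123456, "r2": 0.1000001, "r3": 0.7000000}

mismatched = compare_scores(modelling_tool_scores, scoring_engine_scores,
                            rel_tol=1e-5)
print(mismatched)
```

Comparing with a tolerance rather than exact equality matters here, because the modelling tool and the scoring engine will rarely agree to the last decimal place.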



Thanks for reading. Please do share your comments and best practices.


References and further reading

PMML

R PMML package

Spark PMML export

JPMML

Open scoring engine

Spring XD PMML execution


Operating system level tuning for Hadoop

1)

Transparent Huge Pages (THP) should be disabled

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Add the above to /etc/rc.local so the setting is applied again after a reboot

2)

Mount the disks with noatime to speed up reads

3)

Set the open file limit (ulimit) to a high number, for example 32000

4)

Turn off caching on the controller

5)

Increase the socket listen size

net.core.somaxconn=1024

6)

Disable SELinux by setting the following in /etc/selinux/config

SELINUX=disabled


7)

umask 022

8)

Add the following to /etc/sysctl.conf

vm.swappiness=0
vm.overcommit_memory = 1
vm.overcommit_ratio = 100

9)

Disable IPv6

10)

Make sure host DNS is properly set
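
With many nodes it helps to check these settings with a script instead of by hand. Below is a rough sketch that covers only the THP and sysctl items from the list above. It works on text, so on a real node you would feed it the contents of the THP "enabled" file and of /etc/sysctl.conf; the function names are my own.

```python
def active_thp_setting(sysfs_text):
    """Return the bracketed (active) value from e.g. 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active setting found")


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Values taken from the checklist above.
RECOMMENDED = {
    "net.core.somaxconn": "1024",
    "vm.swappiness": "0",
    "vm.overcommit_memory": "1",
    "vm.overcommit_ratio": "100",
}


def sysctl_deviations(text):
    """Return {key: (found, recommended)} for settings that differ or are missing."""
    found = parse_sysctl(text)
    return {key: (found.get(key), wanted)
            for key, wanted in RECOMMENDED.items()
            if found.get(key) != wanted}


# Hypothetical file contents used for illustration.
sample_sysctl = """# tuning for hadoop
vm.swappiness=0
vm.overcommit_memory = 1
net.core.somaxconn=128
"""
print(active_thp_setting("always madvise [never]"))
print(sysctl_deviations(sample_sysctl))
```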

References

http://www.slideshare.net/vgogate/hadoop-configuration-performance-tuning

Passing and setting Hive conf variables in JDBC and ODBC server

Custom Hive conf values, which are normally set via set commands in the script, can also be passed to the JDBC and ODBC server in the connection URL.

The syntax is

jdbc:hive2://<host>:<port>/dbName;sess_var_list?hive_conf_list#hive_var_list

An example:

jdbc:hive2://foobar:10000/database;auth=noSasl?mapreduce.map.memory.mb=8000;mapreduce.map.java.opts=-Xmx6277m;#foo=bar
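
If the URL is built in code rather than typed by hand, a small helper keeps the three separators (";", "?" and "#") straight. This is just a sketch that follows the syntax shown above: the function name is hypothetical and it does not escape special characters in the values.

```python
def build_hive_jdbc_url(host, port, database, session_vars=None,
                        hive_conf=None, hive_vars=None):
    """Assemble jdbc:hive2://host:port/db;sess_var_list?hive_conf_list#hive_var_list."""
    def join(pairs):
        # Entries within one list are separated by semicolons.
        return ";".join(f"{key}={value}" for key, value in pairs.items())

    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_vars:
        url += ";" + join(session_vars)
    if hive_conf:
        url += "?" + join(hive_conf)
    if hive_vars:
        url += "#" + join(hive_vars)
    return url


url = build_hive_jdbc_url(
    "foobar", 10000, "database",
    session_vars={"auth": "noSasl"},
    hive_conf={"mapreduce.map.memory.mb": 8000,
               "mapreduce.map.java.opts": "-Xmx6277m"},
    hive_vars={"foo": "bar"},
)
print(url)
```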

Hadoop Yarn tuning calculator

Yarn tuning calculator

I just started an Excel sheet where we can plug in a node's number of cores, disks and RAM, and
it will give us the values for various YARN properties.

It is based on Hortonworks suggestions.

You can see it at

https://goo.gl/gzSz27

Hortonworks also has a Python script.

https://github.com/hortonworks/hdp-configuration-utils

But keep in mind that the Python script does not take into account things
which are already running on the machine.

Those we can configure manually in the Excel sheet.
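
For a rough idea of the kind of calculation involved, here is a sketch based on the Hortonworks sizing guidance as I remember it. It is not a copy of the Excel sheet or of the Hortonworks script, and the thresholds in min_container_mb are my recollection of their table, so treat the numbers as assumptions. The reserved_gb argument is the manual part: memory for the OS plus anything else already running on the node.

```python
def min_container_mb(total_ram_gb):
    """Suggested minimum container size (MB) for a node with the given RAM."""
    if total_ram_gb < 4:
        return 256
    if total_ram_gb < 8:
        return 512
    if total_ram_gb < 24:
        return 1024
    return 2048


def yarn_settings(cores, disks, total_ram_gb, reserved_gb):
    """Derive YARN and MapReduce memory settings for one worker node."""
    available_mb = int((total_ram_gb - reserved_gb) * 1024)
    min_mb = min_container_mb(total_ram_gb)
    # Containers are limited by cores, disks and available memory.
    containers = int(min(2 * cores, 1.8 * disks, available_mb / min_mb))
    if containers < 1:
        raise ValueError("not enough memory left for a single container")
    ram_mb = max(min_mb, available_mb // containers)
    return {
        "containers": containers,
        "yarn.nodemanager.resource.memory-mb": containers * ram_mb,
        "yarn.scheduler.minimum-allocation-mb": ram_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_mb,
        "mapreduce.map.memory.mb": ram_mb,
        "mapreduce.reduce.memory.mb": 2 * ram_mb,
        # JVM heap is kept at about 80% of the container size.
        "mapreduce.map.java.opts": f"-Xmx{int(0.8 * ram_mb)}m",
        "mapreduce.reduce.java.opts": f"-Xmx{int(0.8 * 2 * ram_mb)}m",
    }


# Example node: 12 cores, 12 disks, 48 GB RAM, 6 GB reserved for the OS.
settings = yarn_settings(cores=12, disks=12, total_ram_gb=48, reserved_gb=6)
for name, value in settings.items():
    print(name, "=", value)
```

Raising reserved_gb (say, to leave room for HBase on the same node) lowers the container count, which is exactly the adjustment an automated script misses when it does not know what else is running.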