When doing machine learning and analytics in a large organisation, we often have to cater to many different objectives to meet business needs. This post shares what I have learned from working with large organisations whose teams are diverse and often siloed across departments: they use a similar set of technologies but have varying business goals.
The design of this organisational machine learning pipeline is based on the following goals:
- Let users work with the set of tools they prefer
- Have a unified way of executing the models they make
- Have a quick way of A/B testing the models
- Be able to follow continuous integration principles
I will share both the open source and the licensed tools that can be used to create this kind of environment.
At a high level, a typical data science flow is:
- Analyse the data
- Play with it to create a machine learning model
- Export the model in PMML format
- Execute the PMML model to generate scores in batch or real time
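The last two steps of the flow above can be illustrated with a toy sketch. It assumes a hand-written, minimal PMML regression document (the field names `x1`, `x2`, `y` and the coefficients are made up for illustration) and a small helper that only understands a linear `RegressionModel`. A real deployment would hand the file to a full scoring engine instead.

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

# Hypothetical, hand-written model: y = 1.5 + 2.0 * x1 - 0.5 * x2
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Return (intercept, {field: coefficient}) from a linear RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    if table is None:
        raise ValueError("no RegressionModel/RegressionTable found")
    intercept = float(table.get("intercept", "0"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Score a single record, as a real-time call would."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


def score_batch(model, records):
    """Score many records in one go, as a batch job would."""
    return [score(model, r) for r in records]


model = load_linear_model(PMML_DOC)
batch = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 4.0}]
scores = score_batch(model, batch)
print(scores)
```

The same loaded model serves both modes: `score` for one record at a time and `score_batch` for a whole dataset.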
You may use different terms for scores, batch and real time, but I will keep the flow above for simplicity. With both patterns (buy or open source), one of the goals was to reuse whatever is already available as open source and avoid reinventing the wheel.
Two sets of people are involved: those who make PMML files and those who execute (operate) them. There is only a thin overlap between them: the people who make the models, and hence the PMML artefacts, are also responsible for running the validation pipelines (more on this later).
Analysis and Modelling
Analysts and data scientists use various tools, for example R, Spark and scikit-learn. Each is accessed through a web front end or a native client. The majority of people prefer these three tools, so supporting them meets the first goal and keeps almost 90% of people happy.
Most of these tools support export to PMML, so the end artefact for the analytics team is the PMML file. See the references at the end for links to the various tools. The only tricky bit is when your model type is not supported by PMML, but that is very rare; most of the time people work with regressions. The PMML specification has the complete list of supported models.
Continuous Integration and A/B Testing
Each PMML file is committed to a git repo, which produces a deployable artefact (an RPM). With each A/B test, the team makes a new release of its models. Each team has its own project, so it can make changes and deploy on its own release cycle. Releases are also controlled per environment (production, development and so on).
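As an illustration of the kind of check a CI job could run before packaging, here is a minimal sketch that sanity-checks every `.pmml` file in a repository. The specific checks (well-formed XML, a `PMML` root with a `version` attribute, a `DataDictionary` child) and the file names are my assumptions, not a full schema validation.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def check_pmml_file(path):
    """Return a list of problems found in one PMML file (empty means OK)."""
    try:
        root = ET.parse(str(path)).getroot()
    except ET.ParseError as exc:
        return [f"{path}: not well-formed XML ({exc})"]
    problems = []
    if local_name(root.tag) != "PMML":
        problems.append(f"{path}: root element is not PMML")
    if not root.get("version"):
        problems.append(f"{path}: missing version attribute")
    children = {local_name(child.tag) for child in root}
    if "DataDictionary" not in children:
        problems.append(f"{path}: missing DataDictionary")
    return problems


def check_repo(models_dir):
    """Check every .pmml file under models_dir, in a stable order."""
    problems = []
    for path in sorted(Path(models_dir).rglob("*.pmml")):
        problems.extend(check_pmml_file(path))
    return problems


# Demo on a throwaway directory with one good and one truncated file.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "churn.pmml").write_text(
        '<PMML version="4.2"><Header/><DataDictionary/></PMML>'
    )
    Path(repo, "broken.pmml").write_text('<PMML version="4.2"><Header>')
    problems = check_repo(repo)

for line in problems:
    print(line)
```

A build step could fail whenever `check_repo` returns a non-empty list, so a broken model never reaches the RPM.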
The only task left is to run the PMML file. At the moment Spark cannot run a PMML file; it can only produce one. In the open source world, JPMML is the scoring engine that runs PMML files. Its REST API (the Openscoring service listed in the references) lets teams trigger models as and when they want. But teams also want to bring the results back into their own systems, and this is where Spring XD comes into the picture. It lets anyone write a few simple statements that do all the end-to-end work.
Here is sample code for running PMML in Spring XD:
--outputFieldMapping='Predicted_Species:predictedSpecies' | SomeSINK"
Many teams want to push results to a database, many want them in HDFS, and others want them in RabbitMQ.
All they change is the sink, and Spring XD does the rest.
With the REST API integrated into the business units' applications, the modelling and data science pipeline becomes self-service: the business is empowered to make machine learning models, deploy them with CI practices, and run them as and when they want.
The time needed for an A/B test is very short, so you can fail fast.
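To give a feel for the REST integration from an application's side, here is a hedged sketch that builds (but does not send) a scoring request. The endpoint layout `<base>/model/<id>`, the JSON keys `id` and `arguments`, the host, the model id and the input field names are all assumptions modelled on an Openscoring-style service; check your scoring engine's documentation for the actual contract.

```python
import json
import urllib.request


def build_scoring_request(base_url, model_id, record_id, arguments):
    """Build a POST request asking the scoring engine to score one record."""
    # Assumed endpoint layout: <base_url>/model/<model_id>
    url = f"{base_url.rstrip('/')}/model/{model_id}"
    # Assumed payload shape: a record id plus a map of input field values.
    payload = {"id": record_id, "arguments": arguments}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scoring_request(
    "http://localhost:8080/openscoring/",  # hypothetical host and context path
    "iris",                                # hypothetical model id
    "record-001",
    {
        # Hypothetical input field names for an iris model.
        "Sepal_Length": 5.1,
        "Sepal_Width": 3.5,
        "Petal_Length": 1.4,
        "Petal_Width": 0.2,
    },
)
print(request.get_method(), request.full_url)

# Sending is left out on purpose; against a running engine it would be:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to unit test the integration without a running engine.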
Validation pipelines run on the same dataset, but on a small sample, and use whichever tool was used to make the model. For example, I generate a model using scikit-learn and export it as PMML. I then run my validation using scikit-learn as well, and compare its output with what the real scoring engine gives.
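The comparison step can be sketched as follows. The record ids and score values are made-up sample data, and the tolerances are an assumption you would tune for your own models; in practice the first dictionary would come from the modelling tool and the second from the scoring engine.

```python
import math


def compare_scores(reference, engine, rel_tol=1e-6, abs_tol=1e-9):
    """Return (record_id, reference_score, engine_score) for every mismatch."""
    mismatches = []
    for record_id, expected in reference.items():
        actual = engine.get(record_id)
        # A record missing from the engine output counts as a mismatch too.
        if actual is None or not math.isclose(
            expected, actual, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            mismatches.append((record_id, expected, actual))
    return mismatches


# Made-up sample: scores from the modelling tool vs. the scoring engine.
reference_scores = {"r1": 0.8123, "r2": 0.1042, "r3": 0.5550}
engine_scores = {"r1": 0.8123, "r2": 0.1042000001, "r3": 0.5601}

mismatches = compare_scores(reference_scores, engine_scores)
for record_id, expected, actual in mismatches:
    print(f"{record_id}: expected {expected}, engine returned {actual}")
```

Comparing with a tolerance rather than exact equality avoids false alarms from floating point differences between the two runtimes.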
Thanks for reading. Please do share your comments and best practices.
References and further reading
R PMML package
Spark PMML export
Open scoring engine
Spring XD PMML execution