Author: Longhow Lam
Introduction
Databricks has gained popularity over the years. The ability to set up managed Spark clusters gives data scientists an easy way to scale their workflows to the cloud. However, it lacks a good GUI that would make it attractive to a broader group of citizen data scientists and advanced business users. Moreover, it is not easy to push only certain parts of your data science workflow to the cloud, which would make the use of cloud infrastructure more cost effective.
That’s where Dataiku, with its new tight integration with Kubernetes (k8s), comes in. Kubernetes has become very popular in the machine learning world over the last few years. The flexibility to quickly scale up or down makes k8s very handy for training machine learning models and for model deployment. The challenge, however, is that setting up the right infrastructure and machine learning training jobs can still be quite a technical exercise, requiring dedicated data engineers or machine learning engineers.
In this brief write-up you will see how easy it is to work with Kubernetes in Dataiku.
Dataiku and Kubernetes
Dataiku provides a nice user interface in which many of the technicalities of working with k8s have been abstracted away, bringing the power of Kubernetes closer to end users and making it more accessible. Different aspects of the data science workflow in Dataiku can integrate with k8s. One aspect is pushing trained models into production as APIs in containers. An earlier brief write-up on this can be found here.
Another aspect is pushing certain parts of your data science workflow to Kubernetes clusters. The nice thing with Dataiku is that you can run your Dataiku server on premise or in the cloud on a “just big enough” machine to serve your users, push the jobs in your data science workflow that require heavy lifting to containers, and close those containers down when the jobs have finished. This way you can deal with cloud infrastructure in a very cost-efficient way.
Parts of the workflow that can be pushed to container execution:
- Running Spark jobs in containers,
- Running notebook kernels in Dataiku in containers,
- Running a part of the data science workflow in containers.
Let’s focus on the last two points. Suppose we have prepared a data set to build a machine learning model; the data is rather large and we want to try many different models and hyperparameter settings.
On the one hand, you don’t want to swamp the Dataiku server with your hefty model training and negatively affect the performance of other users on the Dataiku server. But on the other hand, you want to claim the full power of a machine for just your job. So here is what you can do: push the job to a disposable (temporary) Kubernetes cluster!
Kubernetes on Google Cloud Platform
Dataiku supports (on-premise) unmanaged clusters as well as managed Kubernetes clusters on the three major cloud providers (Azure, AWS and GCP). My personal favourite is GCP, so let me use that here. It only takes a few easy steps to make use of Kubernetes in Dataiku.
Step 1. Prepare the Dataiku machine
Make sure the server where Dataiku is installed has the ‘gcloud‘, ‘docker‘ and ‘kubectl‘ commands set up. Moreover, the machine must have the appropriate permissions to push images to the Google Container Registry (GCR) service and full control of the GKE service.
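A quick way to verify these prerequisites from the Dataiku machine is a small sanity check along these lines (a minimal sketch; how authentication is set up, for example via a VM service account, depends on your environment):

```python
import shutil
import subprocess

# Tools the Dataiku server needs on its PATH for containerized execution on GKE.
for tool in ["gcloud", "docker", "kubectl"]:
    path = shutil.which(tool)
    if path is None:
        raise RuntimeError(f"'{tool}' not found on PATH; install it before configuring Dataiku")
    print(f"{tool}: {path}")

# Show which account gcloud is authenticated with; this account (or the machine's
# service account) needs push rights on GCR and full control of GKE.
subprocess.run(
    ["gcloud", "auth", "list", "--filter=status:ACTIVE", "--format=value(account)"],
    check=True,
)
```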
Step 2. Set up a Kubernetes cluster
Provided you have been given the rights, you can now start Kubernetes clusters on GCP from the Dataiku interface. In the settings pane you can either create a new cluster or attach your Dataiku session to an existing cluster that is already running on GKE.
When creating a new cluster, you need to specify at least the name of the cluster and the node pools (i.e. machine type and number of nodes). There are many more things you can specify; we’ll leave them at their default values here. A cool thing to do is to attach GPUs to your node pools to speed up the training of deep learning models.
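For reference, what Dataiku sets up here corresponds roughly to a regular GKE cluster creation. Creating a comparable cluster yourself from the command line could look like the sketch below (cluster name, zone, machine type and node count are just example values; within Dataiku you don’t need to do this yourself):

```python
import subprocess

# Roughly the GKE equivalent of the cluster defined in the Dataiku UI.
# All values (name, zone, machine type, node count) are example choices.
subprocess.run(
    [
        "gcloud", "container", "clusters", "create", "dataiku-demo-cluster",
        "--zone", "europe-west4-a",
        "--num-nodes", "3",
        "--machine-type", "n1-standard-4",
        # Optionally attach GPUs to the node pool to speed up deep learning:
        # "--accelerator", "type=nvidia-tesla-t4,count=1",
    ],
    check=True,
)
```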
It then takes a few minutes for the new cluster to start and become available. Once it is up, you can see the cluster in the Google Cloud console, and within the Dataiku software you will see that the cluster is available.
Not only can you see the state of the cluster, you can also manage the cluster from within Dataiku and perform certain actions, as the following screenshot displays.
Step 3. Set up a configuration
The next thing you need to do is set up an execution configuration. With these configurations you can set memory and CPU limits for the containers running on the cluster. Since each execution configuration can have different restrictions, you can use multiple ones to provide differentiated container sizes and quotas to different users.
Let’s create a configuration ‘testconf‘ with no restrictions, as displayed in the figure below.
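Under the hood, such memory and CPU settings correspond to the standard Kubernetes notion of resource requests and limits on the containers Dataiku launches. Purely as an illustration (the field names below are the Kubernetes ones, not necessarily the labels in Dataiku’s UI), a restricted configuration would map onto something like:

```python
# Illustrative only: the Kubernetes resource block that memory/CPU settings in an
# execution configuration roughly translate into. Our 'testconf' leaves everything
# unrestricted; a restricted configuration would carry requests/limits like these.
container_resources = {
    "requests": {"cpu": "1", "memory": "2Gi"},  # guaranteed share per container
    "limits": {"cpu": "2", "memory": "4Gi"},    # hard cap per container
}
```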
Make use of the cluster
Now the fun can begin. We have a k8s cluster and we want to use it to do some heavy lifting for us. Let’s start with a machine learning model.
Training machine learning models
Suppose we have two simple data preparation steps that create the final data set for our machine learning model: a filter and the creation of some additional features. These steps are not that computationally intensive, so we let Dataiku perform them locally, as shown in the flow below.
Now we can use the GUI in Dataiku to create machine learning models as usual: specify the target variable, select the input features, and choose the algorithms and hyperparameters to tune, as the figure below displays.
The important difference with a normal training run on the Dataiku server is the selection of the ‘containerized execution configuration‘, as displayed in the figure above. We select the ‘testconf‘ configuration that we created earlier and click ‘TRAIN’. Dataiku will now push the training to the Kubernetes cluster. You can see it working in the Google Cloud console; moreover, the nice thing is that the progress of the model training is conveniently reported back in Dataiku itself.
Once the training has finished, you will have all the model results and details you would expect if you had trained these models ‘locally’ on the Dataiku server itself: model metrics, variable importance, sub-population analyses, partial dependence plots, the possibility to score new data, etc.
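To give a feel for the kind of work that gets shipped to the cluster, the visual ML task is conceptually running a model and hyperparameter search. A rough, stand-alone scikit-learn sketch of such a search (with a made-up data set and grid; this is not Dataiku’s actual training code) could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up data standing in for the prepared data set from the flow.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# A small hyperparameter grid, similar in spirit to what the visual ML task explores.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,  # exactly the kind of parallel, CPU-hungry work worth pushing to a cluster
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```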
Running Notebooks
Dataiku contains a managed notebook environment: you can create and manage notebooks that interact with the rest of Dataiku. Normally these notebooks run ‘locally’ on the Dataiku server, but now you can run them on the Kubernetes cluster as well.
Create a new Python notebook, select the container configuration ‘testconf‘ and start working in the notebook. You will see that when you execute cells, the execution is now performed on the Kubernetes cluster, as depicted in the figure below.
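The code in such a notebook is written exactly as usual; only the kernel runs in a container on the cluster. A typical first cell (the data set name below is just a placeholder) looks like:

```python
import dataiku

# Read a Dataiku data set into a pandas DataFrame. The kernel executing this cell
# runs in a container on the GKE cluster rather than on the Dataiku server itself.
dataset = dataiku.Dataset("prepared_training_data")  # placeholder data set name
df = dataset.get_dataframe()

print(df.shape)
df.head()
```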
Conclusion
With the release of version 6 of Dataiku, the software makes cloud infrastructure with Kubernetes very accessible to a broad group of people: advanced business analysts and citizen data scientists. That gives them more time to focus on other things in a data science project: the business understanding, storytelling and impact of the problem.
Moreover, I do believe that with end users being able to push work to scalable cloud infrastructure, Dataiku is a serious alternative to #databricks. You now effectively combine the user-friendliness of a GUI for advanced analytics for advanced business users, a code environment for the ‘real data scientist’ [whatever that may mean 🙂], and a cost-efficient way of deploying compute-intensive (Spark) jobs in Dataiku on Kubernetes clusters in the cloud.