Things to think about when you have many data scientists in your team
It is more important than ever for businesses to invest in data science: everyone can sense the tide of AI, and no one wants to be left out. In this new-age party of AI, it is crucial for businesses, especially those with new data science teams, to ponder the ways available to help data scientists do their magic and keep the party alive. In this article, we will focus on the things one has to be careful about even before building a data science team or formulating business use cases.
Before that, let us try to define the journey of a data scientist: “A data scientist’s journey can be defined as a sequence of stages that starts with a business problem and moves through machine learning formulation, building, deployment, and monitoring.”
When we extend the journey of one data scientist to the journeys of many data scientists, unseen issues may arise. One way to look at the journey is to dissect the machine learning problem (when I say machine learning, deep learning is included) and its solution design into several components and inspect closely the issues that may come up in each.
In a typical setup, a data scientist understands the business and its problem(s) and translates them into one (or many) machine learning problems. Once the stakeholders approve, the data scientist reaches the data processing step, where he/she connects to the data, performs various data processing activities (cleaning, manipulation), and moves on to exploring it. Data processing and data exploration tend to go in parallel. In the modeling stage, the data scientist works on the data split strategy, evaluation, tuning, etc. Once the data scientist is comfortable with the modeling stage, he/she has to develop training pipelines (essentially the steps followed in the previous stages, captured so the model can be retrained on fresh data), create an endpoint (how are you exposing the model to the client?), and finally set up monitoring. There can be subsequent steps after monitoring, like model retraining and feedback loops, but for the sake of generalization, let us focus on the stages mentioned.
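The stages above, from data processing through the data split strategy, evaluation, and the training pipeline, can be sketched as a minimal end-to-end loop. This is an illustrative sketch using scikit-learn on a toy dataset; the dataset and model choices are my own assumptions, not a prescription:

```python
# Minimal sketch of the journey: data processing -> split -> model -> evaluate.
# The toy dataset and model are illustrative assumptions, not recommendations.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # stand-in for the real business data

# Data split strategy: hold out a test set for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The "training pipeline": processing + modeling steps captured in one object,
# so the same transformations are replayed whenever the model is retrained.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Capturing the steps as a single pipeline object, rather than loose notebook cells, is what later makes the endpoint and monitoring stages tractable.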
Business Problem Stage:
As mentioned earlier, a lot of businesses want to jump on the AI ship just for the sake of it. With little or no knowledge of data science or AI among business folks (because that is not their area of expertise), it is important for management to include data scientists in the business problem detection stage. Sometimes a business may think that ML is the answer to its problem while the problem is something else altogether, and vice versa. Engaging data scientist(s) at later stages could slow things down. Also, a data scientist may not think of the bigger picture because he/she was never shown the bigger picture. It is equally important (if not more so) for a data scientist to understand how the business is going to consume the ML solutions.
Machine Learning Formulation Stage:
This is the most crucial stage of all. Here, not only is the business problem converted into a machine learning problem, but things like available data sources (internal and external), infrastructure, and timelines should be considered before moving forward. Here I would like to emphasize having a cloud space for the data scientists, for many apparent reasons.
All it takes is cloud storage and a provision to spin up a cluster; besides, many cloud vendors provide ready-to-use Jupyter Notebooks with Python/R kernels. It may also be worthwhile to have a git repository for each use case.
Experimentation Phase (Data Processing, Exploratory Data Analysis (EDA), and Modeling):
This is an iterative cycle of data processing, EDA, and modeling. These are the stages a data scientist has to cross to reach the outcome, and much of a data scientist’s time is spent here. This is also the phase where collaboration needs to be one of the focus areas.
The cloud space helps in these stages as well. With a dedicated space in the cloud, data scientists can work together in one area without having to pass code, data, environments, etc. around. One may argue that sharing is easy with containers such as Docker. However effective Docker containers are, in practice data scientists refrain from using them for day-to-day activities. On the other hand, if there is one dedicated space with a provision to create multiple notebooks, a lot of pain points like sharing and environment management are taken care of.
I have also noticed that when a new data scientist joins the team and wants to go through the work that has been done in the past, he/she is exposed only to the final code (mostly in git). While that exposure is useful, it is never the full picture. Having access to the intermediate steps helps new data scientists see the full picture and may reduce several iterations of work, thereby saving time.
Moreover, teams often neglect the iterations and focus only on the final product. Yet a lot of insights are generated in the intermediate steps a data scientist performs, and these are lost on the data scientist’s local machine. In the cloud, since machine requirements are flexible, the problem of data scientists going back to management asking for better infrastructure (a laptop) is also eliminated.
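One lightweight way to keep those intermediate insights from dying on a laptop is to write each iteration's notes and metrics to a shared, timestamped location. The directory layout and helper name below are my own assumptions for illustration (a local path stands in for a shared cloud bucket or mount); this is a sketch, not a standard:

```python
# Sketch: persist intermediate experiment notes to a shared location so that
# insights from EDA iterations are not lost on a local machine.
# SHARED_ROOT and save_iteration are hypothetical names for illustration.
import json
import time
from pathlib import Path

SHARED_ROOT = Path("/tmp/team-experiments")  # stand-in for a shared cloud mount

def save_iteration(use_case: str, notes: dict) -> Path:
    """Write one iteration's notes/metrics under a timestamped run directory."""
    run_dir = SHARED_ROOT / use_case / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "notes.json"
    out.write_text(json.dumps(notes, indent=2))
    return out

path = save_iteration("churn-model", {"step": "EDA", "insight": "heavy class imbalance"})
print(f"saved iteration notes to {path}")
```

A new team member can then browse past runs chronologically instead of reverse-engineering the final code in git.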
Deployment Phase:
The deployment phase is where you focus on the consumption strategy. I am assuming we are not talking about offline model sharing, as that is outdated now. Here, things like traffic, scaling, distribution, privacy, and model versioning are crucial points to think about. For a data scientist, the worrisome factor could be the training pipeline. While in the previous phase everything lived in notebooks, which are easy to understand, those are not production-ready code. At this stage, having the training code and the pipelines in a container is really helpful.
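The step from notebook to endpoint mostly amounts to serializing the whole pipeline (preprocessing included) into an artifact that a serving process can load. The handler below is a hypothetical stand-in for whatever a managed endpoint would actually call; in practice this code would live inside the container:

```python
# Sketch: turning notebook steps into an artifact an endpoint can load.
# The predict() handler is an illustrative stand-in for a real serving layer.
import pickle

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Serialize the whole pipeline, not just the model, so the exact same
# preprocessing runs in production as in the notebook.
artifact = pickle.dumps(pipeline)

def predict(payload, artifact=artifact):
    """Stand-in for an endpoint handler: load the artifact, return predictions."""
    model = pickle.loads(artifact)
    return model.predict(payload).tolist()

print(predict(X[:2]))  # two sample rows through the "deployed" pipeline
```

Model versioning then becomes a matter of tagging and storing these artifacts, which is exactly what managed services build on top of.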
Historically, this is the stage where data scientists leverage the cloud to deploy a model. I would like to emphasize that it is equally (if not more) important to have a dedicated cloud space from the experimentation phase onward.
Exploring cloud spaces like AWS SageMaker, Azure ML Studio, or GCP AI Platform while thinking about the points mentioned can be super beneficial for you and your data science team.
Thanks
Subarna Rana