Modern software methodologies have some catching up to do to accommodate data science practitioners. It was the early 2000s when The Agile Manifesto marked a step toward a more humane direction for software engineering by placing an emphasis on stakeholder collaboration and rapid feedback.
This evolved into prescriptive frameworks like Scrum and Kanban that made these ideals achievable through a set of simple (albeit difficult) roles and processes. They largely serve to mediate interactions with engineers and have become industry standards because they’re effective.
The Language of Data Science
Over time, new roles have surfaced to address an ever-mounting volume and complexity of data and interactions, but process frameworks haven’t adapted to them. Data scientists in particular face a big challenge in that they are typically considered “similar enough” to engineers to simply adopt their habits. In practice, this doesn’t work.
I remember realising how poorly our language was equipped to express a process that I shared with a particular data scientist. I was sitting on a flimsy plastic chair that he kept beside his desk for these visits. We’d go through his work and, in the capacity of product manager, I would discuss the direction of his research or the relative merits of a machine learning model that he had prototyped.
The terms that we had at our disposal were rooted in engineering practices and gave scant account of designing algorithm products. They served just fine at describing the cadence of writing code to satisfy a given intent and we went through the motions thinking that is all that there was to it.
But we were actually cycling through a much wider variety of distinct phases. I hadn’t fully recognised it because those very useful terms had locked me into seeing a process that was effective enough at something similar enough to feel applicable – he was writing code, we did have a cadence and we were indeed concerned with intent.
The ill fit of this model presented itself in how this work interacted with more traditional software development. It was difficult to find engineering resources to plan and implement his models at the appropriate times. Data science and engineering followed the same processes but were still disconnected and at odds with each other at critical moments.
We resolved this through a dogged commitment to understanding each other’s needs, which for me meant lots of time in that little chair. This allowed us to see patterns in the interaction between product, data science and engineering.
Improving Algorithm Development
The most important thing to plan for is a longer cadence. It’s useful to work in short cycles or “sprints” as Agile prescribes, but don’t be fooled into thinking that a sprint comes anywhere close to a full iteration of algorithm development. That takes much longer, so we can’t ‘rinse and repeat’ in the same way we do in software engineering.
Even large engineering tasks are broken down into smaller chunks that can be evaluated and built in short cycles. Big things are only achieved cumulatively through a series of self-contained and repeatable practices, making one ‘sprint’ much the same as another.
Data science sprints are less alike, because it takes more time to pass through the phases of product iteration. It helps for product managers to plan in larger cycles, because there’s a predictable flow to the demands that each data science phase is likely to place on the engineers.
Early phases, for example, often require very little input. Data scientists can all but disappear for the sake of productivity, leaving engineers free to focus on other things. Once they do emerge, they may need deep input from an engineering team to make architectural choices or to discuss the feasibility of a given model. This ramps up towards putting a model into production, at which point the full focus of at least a portion of an engineering team is required.
Seasons of Data Science
This cycle places a lot of responsibility on product owners to align a gradual shift in the demand for, and availability of, engineering resources. I like to think of it seasonally: hoard those hard-to-find-time-for engineering tasks for the data science ‘winter’ and don’t get caught out by the summer bustle.
This is much easier if everyone knows which phase the data scientists are going through and what that means for them, so make a point of extending your process language to include data science terms: “Oh, we’re evaluating models for production already? Cool, I’ll make space for an architecture chat later this week.”
This gradual increase in the demand that data scientists place on engineers happens because they ultimately need to share the problem. It has an element of resource management and that’s a useful frame, but it really comes down to a slow transition from being a data science problem to becoming an engineering one.
Rather than allocate hours in a resource budget, it’s better to see this as a collaborative flow. When data scientists first want to meet with engineers to discuss an architecture, they’re not consuming dev time, they’re supplying the next objective.
From the engineers’ perspective, it’s an opportunity to guide the problem to a soft landing on their desks. Throughout these cycles, a shared sense of each other’s needs and language will help create common ground for discussing the problem as it changes hands.