Famously, nearly nine out of ten data science projects never make it to production.
These projects run into a fatal roadblock or limitation somewhere along the way.
If only one in ten software development projects made it into production there would be wild-eyed CEOs across the land lunging over desks to grab project managers by the scruff of the neck!
And even when projects do make it into production the time-to-market is very long and the ROI often disappointing.
So what’s going on?
In this blog, I’ll explore nine key reasons I have seen in the enterprise why most data science projects never make it into production.
These will cover the entire range of the problem-space: business, data science and IT.
The biggest and most common issue is that many organisations have a centralised, siloed and separate data science team.
Centralising data scientists deprives them of the contextual understanding they need to do their jobs effectively.
What ends up happening is that organisations treat their data science talent like ‘internal consultants’: they create a backlog of requests from the wider company and assign them to individual team members.
Treating data scientists as a sort of ‘on-demand wizard’ that can just magically execute any task in total isolation is destined to fail. The only thing most data scientists can really do in isolation is data exploration, analysis and developing proofs-of-concept (PoCs).
In the event that a PoC is developed, there is often no thought given to its feasibility because the data scientist will have limited knowledge of the constraints of the operational systems involved.
The result is lots of data analysis as well as the occasional model that never sees the light of day.
Instead, your data scientists need to be decentralised in some way: either working in cross-functional teams or at least in smaller data science teams embedded across the organisation.
As a discipline, data science sits between the business and IT: data scientists take their tools and apply them to an IT process in order to ultimately drive a business outcome.
But doing this well requires maturity in each of these areas and extraordinary alignment between them.
Yet so many businesses have completely misaligned objectives and incentives between data science, IT and business.
Their data scientists are trying to experiment with new technologies but are divorced from business realities, the IT folks are just trying to reduce risk and keep things running, and the business are trying to execute on new ideas and strategies without considering the limits of the data/IT context.
In this scenario, you can have the right people and the right technology...but unless you align the stars nothing will come of it.
Instead, the objectives and incentives of these disparate areas of the business need to be aligned so they are working with, rather than against, each other.
Machine learning is still a cutting-edge technology that, accordingly, comes with a great deal of uncertainty.
Many enterprises are delving into machine learning for the first time and simply don’t know what’s going to happen.
They don’t know if a given project is going to work or not, because they’ve never done it before. In this sense, it is a risky proposition. But one with a potentially massive pay-off.
As such, ML must be approached as an experimental, iterative, open-ended process. It requires commitment, patience and open-mindedness.
The problem is that engineers—whose remit, remember, is risk reduction!—are often the gatekeepers of what gets put into production and what does not. And they are typically not big fans of the high levels of uncertainty involved in data science projects.
They prefer something known and simpler that they can break into smaller steps and easily derisk.
If the remit of your team is not specifically around implementing AI/ML (which is often the case) then you will find that it will be continuously deprioritised in favour of less risky, more certain, better understood projects.
If your data science team is siloed and separate and there is no alignment of objectives or incentives between the business, IT and data, then you will pick the wrong problem to solve.
It’s very easy for an isolated data science team—particularly one that is several degrees removed from the person dealing with the problem the data project is aiming to solve—to get lost down a sidetrack.
What happens then is that a whole lot of time, energy and money gets invested into something you solve really well that...isn’t actually a business problem.
Isolated data scientists get lost in cleaning the data, building the model and improving the model, and you end up with a perfect, wonderfully accurate model...that is completely divorced from what the business needs.
Remember the golden rule: always align your use of data to strategic use cases that can be traced back towards a concrete business goal in order to generate tangible and measurable business value.
It’s very easy to get excited at the prospect of doing artificial intelligence and machine learning and to want to dive into the shiniest and fanciest technologies.
But you run into problems when you start prioritising the technology you use over the result you are trying to get.
I remember one gig where I was on a team of data scientists using convolutional neural networks to try to count pixels on images!
This is like using a laser to fry an egg.
A large part of delivering value in data science is knowing when to cut your losses, and the faster you can get a product or feature in front of users and start testing it, the easier these decisions are to make.
Most problems can be solved to a degree that is ‘good enough’ with something comparatively simple. As Elon Musk is fond of saying: “the best part is no part”.
The least complex solution that can return value should always be your starting point.
Avoid the trap of defaulting to the latest, greatest technology or model. Instead, best practice is to start simple, demonstrate something valuable, then iterate from there.
Prove your way to complexity!
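To make the pixel-counting anecdote concrete, here is a minimal sketch of the "least complex solution" in that case: a threshold and a count, no neural network required. The image and threshold here are illustrative assumptions, not the actual project's data.

```python
# A deliberately simple baseline for "counting pixels": threshold a
# grayscale image (a 2D list of 0-255 intensities) and count matches.

def count_bright_pixels(image, threshold=128):
    """Count pixels at or above `threshold` intensity."""
    return sum(1 for row in image for pixel in row if pixel >= threshold)

# A tiny illustrative "image".
image = [
    [0, 200, 255],
    [130, 10, 90],
    [255, 255, 0],
]

print(count_bright_pixels(image))  # → 5
```

If a few lines like this solve the business problem to a 'good enough' degree, the convolutional neural network can wait until the simple version demonstrably falls short.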
Because data science is so damn exciting, some companies just get the ‘best’ data scientists on board without thinking about how they will fit in with other teams, or about exactly what kind of people they actually need.
Accordingly, many data science job descriptions are just lists of fancy models and methods. This is a great way of hiring an academic, but not a great way of hiring someone who can meaningfully contribute to your business.
By hiring an idealised data scientist you will get someone who can capably increase the accuracy of your model from 96% to 98%, but who doesn’t necessarily know what they need to do to use said model to create business value.
Would you rather have a 98%-accuracy model that is producing nothing, or a 96%-accuracy model that is generating value in production?
What the best data scientists bring to the table is much more than the sum total of their knowledge and experience: it’s the ability to absorb new information on the fly and harness methods they might not have been aware of to solve novel business problems.
Unless your company has some very specific AI-centred use cases, deep knowledge of dense academic subject matter is rarely going to be very useful. Far more value can often be returned by generalists with a solid base of tools and techniques who can dive deep as needed and collaborate on implementation as quickly as possible.
Your data science project will only be as good as the data you put into it.
Lots of companies try to run before they can walk by implementing machine learning models before they have their data quality ducks in a row.
Data scientists need access to high-quality data that they can trust to be up-to-date and usable for the task at hand.
Enterprises must therefore first get the foundation of their data pyramid in place: ensuring that data scientists have self-serve access to high-quality, trustworthy data. Otherwise they have to divert their energy to manually tracking down the people who have the data and checking it’s up-to-date.
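The kind of foundational checks this implies can be automated. Below is a minimal sketch of a data-quality report over a batch of records, flagging missing values, staleness and duplicate IDs; the field names and thresholds are hypothetical, not a prescribed schema.

```python
# A toy data-quality check: completeness, freshness and duplicates.
# Record fields ("id", "value", "updated") are illustrative assumptions.
from datetime import date

def quality_report(rows, max_age_days=7, today=date(2023, 1, 10)):
    """Return basic quality signals for a list of record dicts."""
    missing = sum(1 for r in rows if any(v is None for v in r.values()))
    stale = sum(1 for r in rows
                if (today - r["updated"]).days > max_age_days)
    ids = [r["id"] for r in rows]
    duplicates = len(ids) - len(set(ids))
    return {"missing": missing, "stale": stale, "duplicates": duplicates}

rows = [
    {"id": 1, "value": 42, "updated": date(2023, 1, 9)},
    {"id": 2, "value": None, "updated": date(2023, 1, 1)},
    {"id": 2, "value": 7, "updated": date(2023, 1, 10)},
]

print(quality_report(rows))  # → {'missing': 1, 'stale': 1, 'duplicates': 1}
```

Running checks like these continuously, rather than leaving each data scientist to verify freshness by hand, is a small first step towards the self-serve, trustworthy data described above.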
Similarly, data must be highly discoverable.
Amazingly, it’s not uncommon for data projects to fail because the teams involved simply don't have access to the relevant data, the data are not ready for consumption or the infrastructure isn’t capable of scaling the use case.
The first problem—access—encompasses limitations in simply getting hold of the data in the right way.
Issues we often encounter include data siloed away in a legacy warehouse, the proper tooling not being available to interrogate it at sufficient speed or volume, poor latency, or data scientists struggling to get the right access or authorisation.
If data scientists can’t interrogate their data at speed and scale then it’s very hard to even do any basic exploratory work, never mind building production-ready models.
The second problem—consumability—is all about how suitable the data, once accessed, is for the use cases in question.
Sometimes the data are not prepared with the potential consumer in mind and so are not properly indexed or appropriately transformed. This makes it hard to search through and means that the data are not ready to be used directly in target use cases.
If our data folks cannot use the data they find, they have to spend a lot of time and energy converting it to meet their needs, which creates massive delays, risks creating inherently brittle systems and ruins the flow of data through the organisation.
The third issue—usability—refers to the ability of the infrastructure to support certain use cases.
The design of production infrastructure without giving thought to how it impacts potential data science applications can have huge downstream consequences. Some of the most interesting data techniques (clustering, for example) are effectively impossible at scale with certain architectures.
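As a toy illustration of this constraint: some computations can be restructured as a single streaming pass, so the full dataset never needs to fit in memory, whereas techniques like clustering typically need repeated passes over all the data, which is exactly what makes them hard to retrofit onto infrastructure built for streaming reads. The chunked data below is an illustrative assumption.

```python
# A running mean computed chunk by chunk in one streaming pass.
# Contrast with clustering, which generally needs many full passes.

def streaming_mean(chunks):
    """Average values that arrive in chunks, without loading them all."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6]]
print(streaming_mean(chunks))  # → 3.5
```

Whether a given analysis fits the one-pass mould or not is precisely the kind of question that should inform infrastructure design up front.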
The final problem arises when ML models are operational and producing insights, but there is no way of productionising those insights: i.e. integrating them into actual products and services.
Data scientists—particularly when very siloed and detached from the business—can make the mistake of putting all their energy into creating great models without putting any thought into how the resultant insights will be integrated into useful products and services.
When they don’t have the details of how insights are going to be implemented in the real-world of limited infrastructure and tooling, their work usually isn’t suitable to be integrated further into the business.
This is why it’s critical to create cross-functional teams consisting of data scientists, engineers and business: so a clear roadmap to production is embedded into the way that the data scientists approach the problem.
Getting data science projects into production is tricky business, and few know how to pull it off—much less succeed in generating sustainable business value.
But watch out for these traps and you will massively increase your chances of success.
Check out our eBook The Promise of AI to see why artificial intelligence is so important to delivering a scalable data and business strategy that is founded on more than experience and guesswork.