Databricks is an Enterprise AI cloud data platform, particularly useful for deploying advanced data science projects such as artificial intelligence (AI) and machine learning (ML) at enterprise scale.
The company was founded in 2013 by the creators of Apache Spark, a well-known open source data tool. Since then, both the company and the wider Enterprise AI field have drawn a lot of interest and are growing rapidly: Databricks raised $2.6 billion in Series G and Series H funding in 2021 alone and is valued at $38 billion.
In this blog, I’ll briefly explore how the Databricks platform works, why it’s such a big deal and how to get the most out of it using a data mesh approach.
A massive challenge that enterprises face in working with the vast quantities of siloed data they have across their organisation is how to actually find and bring it all together in a useful way.
Databricks helps clients store, clean and visualise vast amounts of data from disparate sources. It provides a single platform for many common data tasks, ranging from basic ETL through business intelligence all the way to ML and AI.
It simplifies the creation of modern data warehouses, enabling self-service analytics and machine learning across an organisation’s global data with enterprise-grade performance and governance.
The core of the Databricks platform consists of four open source tools that are wrapped into a cohesive, enterprise-friendly package delivered as a service on the cloud.
I’ll explore the four tools, before turning to how Databricks brings them together to create more than the sum of their parts.
Apache Spark
The core of Databricks is Apache Spark, an open source big data processing engine.
Spark was a truly impactful development in the big data industry, allowing for large-scale distributed computing on large datasets in a way that had not previously been possible.
Initially envisioned as an alternative to Google’s MapReduce system, it has substantial open source and academic pedigree, with well-known open source advocates and academics among the founding team, most prominently Ali Ghodsi, Databricks CEO and adjunct professor at the University of California, Berkeley.
Even working with large datasets, the engine is extremely flexible and scalable. It unifies both batch and streaming data, incorporates many different processing models and supports SQL. These characteristics make it much easier to use, highly accessible and extremely expressive.
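As a rough illustration of that unified model, the PySpark sketch below loads a batch dataset and queries it with plain SQL; the same engine also exposes streaming sources via spark.readStream. The file name and columns are purely illustrative, not from any real system.

```python
# Minimal PySpark sketch: one DataFrame API covers batch reads and SQL queries.
# "events.json" and its columns are hypothetical, used only to illustrate the API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Batch: load a JSON file into a distributed DataFrame
events = spark.read.json("events.json")

# SQL: register the DataFrame as a view and query it with plain SQL
events.createOrReplaceTempView("events")
daily_counts = spark.sql(
    "SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date"
)
daily_counts.show()
```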
Delta Lake
Delta Lake is an open source storage layer that runs on top of data lakes to deliver greater reliability, security and performance.
It is fully compatible with Apache Spark APIs and, similarly, runs both streaming and batch operations.
The features of Delta Lake improve both the manageability and performance of working with data in cloud storage objects and enable a "lakehouse" approach that combines the key features of data warehouses and data lakes: delivering the data management capability and high performance found in warehouses with the low-cost, flexible object stores offered by data lakes.
The lakehouse also broadens the set of tools available: the warehouse is suited to standard business intelligence (BI), while the lake is better suited to AI, so you get the best of both AI and BI using a single copy of the data on a single open platform.
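To give a feel for that Spark-API compatibility, here is a minimal sketch: writing and reading a Delta table is just the ordinary Spark writer and reader with format("delta"). The path and columns are illustrative, and the Delta Lake libraries are assumed to be available (as they are on Databricks).

```python
# Minimal Delta Lake sketch: the standard Spark writer/reader with format("delta").
# Assumes a Spark session with the Delta Lake libraries on the classpath
# (as on Databricks); the path and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame(
    [("2021-06-01", "click"), ("2021-06-01", "view")],
    ["event_date", "event_type"],
)

# Writing as Delta adds ACID transactions and schema enforcement on top of object storage
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# The same table can then be read back (or streamed from) with the usual Spark API
events = spark.read.format("delta").load("/tmp/delta/events")
events.show()
```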
MLflow
MLflow is an open source tool that manages the lifecycle of machine learning pipelines and applications.
Data scientists in the enterprise have a complicated job: they need to run multiple experiments, deploying ML models in different ways (integrating them into existing apps or creating new services), training the algorithms and so on, all using a wide variety of tooling.
Bringing all these different aspects together in a way that ensures that the data that goes in is trustworthy and the insights that come out can be used and integrated is a challenge that must be solved if enterprises are to make machine learning anything more than a fun side project.
That is what MLflow does: it productionises machine learning in the enterprise, making the fragmented pipelines and tooling cohesive and cogent and ensuring that the insights that result can be integrated into the wider business.
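A minimal tracking sketch gives a feel for this: each experiment run records its parameters, metrics and resulting model in one place, so runs stay comparable and the resulting model can be handed on. The scikit-learn model and values below are illustrative placeholders, not a recommended setup.

```python
# Minimal MLflow tracking sketch: log the parameters, metrics and model of one run.
# The scikit-learn model and the metric are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)                                # what was tried
    mlflow.log_metric("train_accuracy", model.score(X, y))   # how it performed
    mlflow.sklearn.log_model(model, "model")                 # the artefact itself
```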
Koalas
Koalas helps data scientists to be more productive when working with big data and Apache Spark, in particular.
Data scientists often work in Python, whereas Spark is written in Scala and its native APIs are Scala and Java. In particular, data scientists are fond of a data analysis Python library called pandas, whose API differs from Spark’s own.
Koalas implements the pandas API on top of Spark, allowing data scientists to work with Spark without having to learn a whole new programming language or API.
The tool minimises the learning curve for data scientists who are familiar with pandas and getting started with Spark. This allows them to be productive immediately and significantly lowers the barrier to entry to both Spark and Databricks.
It’s a small, but extremely impactful addition that makes Spark much more accessible.
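To make that concrete, here is a minimal sketch (the CSV path and column names are hypothetical): the code reads like everyday pandas, but the work is executed by Spark across the cluster.

```python
# Minimal Koalas sketch: pandas-style code whose execution is delegated to Spark.
# The CSV path and column names are hypothetical.
import databricks.koalas as ks

kdf = ks.read_csv("/data/orders.csv")               # looks like pandas.read_csv
kdf["total"] = kdf["quantity"] * kdf["unit_price"]

# The groupby/aggregation below runs as a distributed Spark job
top_customers = kdf.groupby("customer_id")["total"].sum().sort_values(ascending=False)
print(top_customers.head(10))
```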
The Databricks platform combines these four open source tools and makes them available as-a-service on the cloud.
The service wrapper brings together all the separate layers that the core tools offer, all accessible via a single SaaS interface. The result is a platform that can offer the whole range of data capabilities:
- Cloud-native: works on any major cloud provider
- Data storage: store a broad range of data including structured, unstructured and streaming
- Governance and management: in-built security controls and governance
- Data science tools: wide range of production-ready data tooling from engineering to BI, AI and ML
Together, these layers create a unified technology platform that gives data scientists everything they need to autonomously draw on whatever environments, tools and infrastructure their work requires.
What makes this such a big deal is that, by providing this cloud-native service wrapper around the core tools, Databricks helps to alleviate one of the biggest enterprise data challenges: fragmentation.
Environments, tools, pipelines, databases, APIs, lakes, warehouses... there are thousands of moving parts in an enterprise data estate. The real business value lies not in getting any particular part running smoothly, but rather in creating a coherent and cohesive web of integrated data capabilities that brings data in one end and produces useful business value out the other.
For example, getting a machine learning tool running in a single environment is pretty straightforward. Getting high-quality, trustworthy data into an automated and well-trained machine learning model, producing valuable insights and then being able to integrate these into product development is another issue entirely.
Enterprises need a cohesive, enterprise-grade solution that helps them reduce complexity and, ultimately, produce high-quality, data-driven products.
By providing a SaaS layer in the cloud, Databricks allows data scientists to autonomously provision the tools and environments that they need to produce valuable products: provisioning clusters as needed, launching compute on-demand, easily defining environments, integrating insights into product development and so on.
Databricks’ powerful foundation of the core open source components combined with the enterprise-centric service wrapper makes the platform as useful and accessible as possible from a business perspective. The platform can play a powerful role in reducing fragmentation, which ultimately improves the outcome of data science and analytics functions in enterprises.
But the platform doesn’t solve all of your problems! It solves the technology problems. You still need to get high-quality, trustworthy data both in and out of the platform.
This is where the data mesh comes in, which can be powerfully combined with Databricks to provide a holistic, end-to-end data solution.
Enterprises looking to deploy AI solutions often buy the fancy tech but struggle with getting the data in and out, with predictably depressing results. They get stuck on the data engineering part, which hamstrings their advanced data science capabilities.
Scaling the adoption of these tools and making sure they work for the enterprise both hinge on the quality and availability of the data that comes in and goes out.
They need to be able to go beyond just trying to batch upload data via ETL and instead find a robust, scalable solution for bridging their data sources and their AI/ML engines.
This is where the data mesh paradigm is very powerful.
The data mesh is an approach to data that emphasises a decentralised domain-centric structure, a product mindset, federated data governance and cross-functional teams with end-to-end accountability.
The result is an intelligent web of data flows between business domains, with each node in the web producing and consuming data for and from other nodes in a way that is meaningful, secure, private and easily consumable. Because each node is responsible and accountable for making data discoverable, the process is very granular, highly scalable and resilient. The data that results is high quality, highly-discoverable data-as-a-product that is designed explicitly with data consumers (e.g. data scientists) in mind.
Working as part of a data mesh, data scientists have all the datasets and data products they need to be highly autonomous and independent in doing their work. It also gives them an architecture and governance structure in which they can make their insights usable for end users such as product owners or developers.
When Databricks is adopted within a data mesh architecture it can fulfil a critical role providing the infrastructure and tooling that bridges the data coming in, on the one hand, and the insights going out, on the other.
It serves to simplify and unify the technology aspects of the data mesh, allowing enterprises to scale their data tools effectively and make them useful as part of a broader data estate.
The combination of the cohesive Databricks platform sitting within the wider data mesh architecture allows businesses to fully unleash the power of both the data they have on hand and the tooling within Databricks.
At Mesh AI, we specialise in bringing a holistic data vision to the enterprise that prioritises unified, scalable approaches that deliver real business value.
We have substantial real-world experience with both the data mesh and Databricks, helping to use these technologies and approaches to unleash the real value of data in the enterprise!
You can read more about how we do this in our other data mesh content here: https://www.mesh-ai.com/content-hub
Interested in seeing our latest blogs as soon as they get released? Sign up for our newsletter using the form below, and also follow us on LinkedIn: