Democratizing SAP Data: Building an Enterprise Lakehouse on Databricks from SAP Sources

SAP runs the operational core of most manufacturing enterprises, yet the people who need that data most can rarely access it. This guide covers the architecture, ingestion strategies, and trade-offs for democratizing SAP data on Databricks.

    Architecture diagram showing SAP data flowing through BDC Connect and Replication Flows into a Databricks medallion lakehouse with Unity Catalog governance across Bronze, Silver, and Gold layers.

    1. What manufacturers keep telling us

    At Syren, we work primarily with manufacturing and supply chain enterprises, and we hear the same story almost everywhere we go. SAP runs the operational core of the business (production planning, materials management, logistics, finance, procurement, plant maintenance), and yet the people who most need that data to do their jobs cannot get to it.

    The plant manager is waiting for a stockout report. The supply chain analyst is trying to model supplier risk against actual purchase order behavior. The finance partner is reconciling a close. The data science team is trying to build a demand forecast that combines SAP shipment history with weather, point-of-sale, and IoT signals from the shop floor.

    Every one of those people ends up either filing a ticket with a central BW team, working from a stale Excel export, or building a shadow data pipeline on the side. None of those is sustainable, and none of them scales to the kind of AI-driven decisioning that manufacturers actually need to compete on now.

    This raises a fair question: SAP has had analytics tools for thirty years: BW, BEx, SAC, Datasphere, and now Business Data Cloud. Why not just use those?

    SAP's analytics stack is genuinely good at what it was built for: structured, semantically rich, governed analysis of SAP data. For SAP-internal financial reporting, planning, and BI, it often is the right tool, and we would never recommend ripping it out.

    But the workloads that define modern manufacturing and supply chain analytics look different. They combine SAP data with significant non-SAP data: MES output, IoT telemetry from the shop floor, supplier risk feeds, weather, point-of-sale, and logistics tracking. They use open-source ML frameworks and foundation models. They span structured, semi-structured, and unstructured data.

    They demand a single platform across the whole data estate, not separate stacks for "SAP" and "everything else." The SAP analytics stack was never designed for this, and SAP itself acknowledged as much when it partnered with Databricks in February 2025 rather than trying to build a competing AI and ML platform.

    So the question isn't "SAP tools or Databricks." It's: where does each set of workloads belong? SAP-internal analytics, planning, and governed financial reporting stay on the SAP stack where they belong. Cross-domain analytics, ML, AI, and the lakehouse that underpins them belong on Databricks. The work of democratizing SAP data is the work of getting trusted, semantically rich SAP data across that boundary cleanly without losing the business context that makes it trustworthy in the first place.

    This is the problem we work on every day. Sitting at the intersection of manufacturing, supply chain, and Databricks gives us at Syren a vantage point that is relatively rare: most SAP integrators don't go deep on the lakehouse side, and most Databricks practitioners don't have manufacturing process knowledge.

    Combining those two perspectives is what lets us build solutions that actually democratize SAP data for the teams that need it, rather than just relocating a silo.

    A lakehouse architecture is the cleanest answer the industry has produced for the cross-domain side of that boundary, and the SAP–Databricks partnership has made that answer dramatically more practical to implement. This article is about how to actually build that lakehouse: what to plan for, what to avoid, and what trade-offs to defend in an architecture review.

    2. What "democratizing SAP data" actually means

    Three things have to be true at once, or you don't have democratization; you just have data movement.

    A lakehouse delivers all three when it's built correctly. Without thoughtful design, you get a fast data lake with SAP data dumped into it, which is the same data silo, just relocated.

    3. The SAP source landscape

    Before designing the Lakehouse, you need an honest inventory of what "SAP" actually means in your estate. This shapes every downstream choice, and it is foundational to any serious SAP data management strategy.

    3.1 SAP ECC (typically ECC 6.0)

    Still the most widely deployed SAP ERP. Runs on Oracle, Db2, SQL Server, or HANA. Mainstream maintenance ends December 2027, which is the single biggest driver of S/4HANA migration projects in flight today. Integration realities:

    3.2 SAP S/4HANA: three deployment models, three integration profiles

    The strategic modernization here is extraction-enabled (and CDC-enabled) CDS views plus the ODP framework, replacing the old RSA7 / BW Delta Queue model.

    3.3 SAP BW and BW/4HANA

    The classic SAP analytical warehouse and its HANA-native successor. Holds modeled, harmonized data (InfoProviders, aDSOs, CompositeProviders) rather than raw transactions. Often a better source for the Lakehouse than raw ECC tables, because the business modeling work is already done. BW Bridge inside Datasphere is the recommended path for BW customers transitioning to the BDC world.

    3.4 SAP HANA as a database

    HANA appears in two roles: as the underlying DB for S/4HANA / BW/4HANA (integrate at the application layer, not here), and as a standalone HANA Cloud / HANA sidecar running calculation views for analytics. JDBC integration to HANA is straightforward but comes with source-load and licensing caveats.

    3.5 SAP Datasphere

    SAP's cloud data fabric. Connects, models, federates, and governs data across SAP and non-SAP sources. Now the engine that powers SAP's recommended integration paths, including Replication Flows.

    3.6 SAP Business Data Cloud (BDC), the 2025/2026 umbrella

    Announced February 2025. BDC is the SaaS that now contains Datasphere, SAP Analytics Cloud, BW (as Bridge), curated SAP data products, Intelligent Applications, and the embedded SAP Databricks. SAP HANA Cloud became a native BDC component in the Sapphire 2026 cycle. For any architecture conversation starting in 2026, BDC is the term to plan around.

    3.7 Line-of-business and edge SAP applications

    Easy to forget but very common in scope: SAP SuccessFactors (HR, separate cloud), SAP Ariba (procurement), SAP Concur (T&E), SAP IBP (integrated business planning), SAP CRM / C/4HANA (customer experience suite), SAP Fieldglass. Each has its own APIs. BDC's curated data products are progressively expanding to cover these.

    Practical takeaway: ECC-on-Oracle, S/4HANA Cloud public edition, BW/4HANA, and SuccessFactors look identical on an architecture slide. They are not. Map every system in scope to this list before scoping ingestion.
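
    One lightweight way to do that mapping is to keep the inventory as data rather than as a slide. The sketch below is illustrative only: the `SourceSystem` type, the system names, and the category labels are hypothetical and simply mirror sections 3.1 to 3.7.

```python
from collections import defaultdict
from dataclasses import dataclass

# Source categories mirroring sections 3.1-3.7 (labels are our own shorthand).
CATEGORIES = {"ECC", "S/4HANA", "BW", "HANA", "Datasphere", "BDC", "LoB"}


@dataclass(frozen=True)
class SourceSystem:
    name: str        # hypothetical system identifier
    category: str    # one of CATEGORIES
    database: str    # underlying database, where relevant
    deployment: str  # e.g. "on-prem", "private cloud", "public cloud"


def group_by_category(systems):
    """Group system names by source category, rejecting unmapped categories."""
    grouped = defaultdict(list)
    for system in systems:
        if system.category not in CATEGORIES:
            raise ValueError(
                f"Unmapped category for {system.name}: {system.category}"
            )
        grouped[system.category].append(system.name)
    return dict(grouped)


# Hypothetical estate: four systems that look alike on a slide.
inventory = [
    SourceSystem("ERP-PROD", "ECC", "Oracle", "on-prem"),
    SourceSystem("S4-FIN", "S/4HANA", "HANA", "public cloud"),
    SourceSystem("BW4-RPT", "BW", "HANA", "on-prem"),
    SourceSystem("SF-HR", "LoB", "n/a", "public cloud"),
]

by_category = group_by_category(inventory)
```

    Failing loudly on an unmapped category is deliberate: a system nobody can classify is usually the one that derails ingestion scoping later.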

    4. The target: what does the Lakehouse look like?

    A defensible Lakehouse from SAP sources follows the medallion pattern, but with SAP-specific considerations at each layer.

    Bronze: landed, semantically intact

    Bronze stores SAP data as close to source as practical, preserving business context. This is not raw KUNNR columns with no documentation. It's source-faithful data with three things attached: the semantic metadata (field descriptions, primary keys, relationships), the governance tags (PersonalData classifications, data residency markers), and the lineage back to the originating SAP object. SAP BDC Connect's automatic semantic metadata sync into Unity Catalog has made this layer dramatically cleaner than it was even a year ago: table and column comments, primary and foreign keys, and PersonalData governance tags flow through automatically.

    Silver: conformed, joined, business-ready

    Silver is where SAP data becomes useful across the enterprise. Customer master joined across SAP and CRM. Financial actuals from S/4HANA joined with planning data from IBP. Purchase orders joined with supplier risk data from a third-party feed. This is the layer most consumers should actually be querying. It's also where the SAP-vs-everything-else boundary dissolves.
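
    To make the purchase-order example concrete, here is a small sketch of a Silver-style join. It uses an in-memory SQLite database purely as a stand-in so the SQL can run anywhere; the table names, column names, and rows are all hypothetical, and on Databricks the same query shape would run against Delta tables.

```python
import sqlite3

silver_conn = sqlite3.connect(":memory:")
silver_conn.executescript("""
CREATE TABLE purchase_orders (
    po_number   TEXT PRIMARY KEY,
    supplier_id TEXT,
    net_value   REAL
);
CREATE TABLE supplier_risk (
    supplier_id TEXT PRIMARY KEY,
    risk_score  REAL
);
INSERT INTO purchase_orders VALUES
    ('4500000001', '0000100001', 12000.0),
    ('4500000002', '0000100002', 3500.0),
    ('4500000003', '0000100001', 800.0);
-- Only one supplier appears in the third-party risk feed.
INSERT INTO supplier_risk VALUES ('0000100001', 0.82);
""")

# LEFT JOIN keeps suppliers that the risk feed does not cover,
# so gaps in the external data stay visible instead of dropping orders.
supplier_exposure = silver_conn.execute("""
SELECT po.supplier_id,
       COUNT(*)          AS po_count,
       SUM(po.net_value) AS total_value,
       r.risk_score
FROM purchase_orders AS po
LEFT JOIN supplier_risk AS r
       ON r.supplier_id = po.supplier_id
GROUP BY po.supplier_id, r.risk_score
ORDER BY po.supplier_id
""").fetchall()
```

    The design choice worth noting is the outer join: a supplier missing from the external feed shows up with an empty risk score rather than silently disappearing from the Silver table.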

    Gold: domain marts and ML features

    Gold is purpose-built: finance close marts, supply chain risk features, customer 360 marts. Owned by domains, not by central IT. This is where democratization shows up as actual business outcomes.

    What sits across all three layers

    5. How SAP data reaches the Lakehouse (Ingestion strategies)

    There are seven materially different paths for SAP data integration. They differ on five axes that matter in a review: compliance with SAP's ODP support note, latency, business semantics preserved, total cost, and operational burden.

    5.1 SAP BDC Connect for Databricks (Delta Sharing): the new strategic path

    What it is: SAP publishes curated data products from BDC. A bi-directional, zero-copy Delta Sharing connector mounts them in your external Databricks Unity Catalog workspace. Generally available on AWS, Azure, and GCP. Secured with mTLS + OAuth (OIDC). Semantic metadata (display names, descriptions, primary/foreign keys, PersonalData governance tags) syncs automatically into Unity Catalog.

    Why it matters for democratization: this is the cleanest path for getting trusted, governed, semantically rich SAP data into the hands of the wider data community without replication. A concrete example: BDC's finance data products natively present the simplified S/4HANA universal journal format (ACDOCA), which means data teams in Databricks don't have to manually reconstruct finance facts by joining legacy ECC tables (BSEG, BKPF, BSIK, etc.) on the Lakehouse side. The business model is preserved at the source.
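
    For a sense of what that saves, the sketch below shows a deliberately simplified version of the legacy reconstruction: joining accounting document headers (BKPF) to line items (BSEG) and signing amounts by the debit/credit indicator. It runs on in-memory SQLite as a stand-in, uses a reduced column set with made-up rows, and omits the client field (MANDT) for brevity. A real reconstruction involves more tables and far more rules.

```python
import sqlite3

fi_conn = sqlite3.connect(":memory:")
fi_conn.executescript("""
-- Document header: company code, document number, fiscal year, posting date.
CREATE TABLE BKPF (
    BUKRS TEXT, BELNR TEXT, GJAHR INTEGER, BUDAT TEXT,
    PRIMARY KEY (BUKRS, BELNR, GJAHR)
);
-- Line items: item number, G/L account, debit/credit flag, local amount.
CREATE TABLE BSEG (
    BUKRS TEXT, BELNR TEXT, GJAHR INTEGER, BUZEI TEXT,
    HKONT TEXT, SHKZG TEXT, DMBTR REAL,
    PRIMARY KEY (BUKRS, BELNR, GJAHR, BUZEI)
);
INSERT INTO BKPF VALUES ('1000', '0100000001', 2025, '2025-03-31');
INSERT INTO BSEG VALUES
    ('1000', '0100000001', 2025, '001', '400000', 'S', 500.0),
    ('1000', '0100000001', 2025, '002', '160000', 'H', 500.0);
""")

# SHKZG = 'S' marks a debit and 'H' a credit; credits are negated so
# that a balanced document sums to zero.
journal_lines = fi_conn.execute("""
SELECT h.BUKRS, h.BELNR, h.GJAHR, h.BUDAT, i.BUZEI, i.HKONT,
       CASE i.SHKZG WHEN 'H' THEN -i.DMBTR ELSE i.DMBTR END AS signed_amount
FROM BKPF AS h
JOIN BSEG AS i
  ON i.BUKRS = h.BUKRS
 AND i.BELNR = h.BELNR
 AND i.GJAHR = h.GJAHR
ORDER BY i.BUZEI
""").fetchall()

document_balance = sum(row[-1] for row in journal_lines)
```

    Every team that rebuilds this join on the Lakehouse side also has to own its correctness. Consuming the finance data product in universal journal form moves that responsibility back to the source.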

    Constraints worth scrutinizing:

    5.2 SAP Databricks (embedded inside BDC)

    What it is: an SAP-managed Databricks workspace that lives inside BDC. Sold by SAP. Governed by Unity Catalog. Pre-wired to the curated data products.

    Strengths: shortest path from SAP data to Databricks SQL, ML, and AI. No connector to configure. Single contract.

    Constraints to handle carefully (this is the area most likely to be questioned in review):

    Most enterprises with an established Databricks footprint end up using SAP Databricks for SAP-specific work and keeping their existing workspace for cross-domain workloads, joined via Delta Sharing.

    5.3 SAP Datasphere Replication Flow

    What it is: a managed replication capability inside Datasphere/BDC that pushes data to a target, including cloud object stores (S3, ADLS, GCS), which Databricks then ingests via Auto Loader / Lakeflow.

    Status: this is now SAP's recommended path for bulk replication following the ODP support-note restrictions. Uses extraction-enabled CDS views, supports delta where the source allows.
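
    On the Databricks side, picking up the replicated files is a standard Auto Loader job. The sketch below is a minimal version under stated assumptions: the Replication Flow is assumed to write Parquet, and every path and table name shown is a placeholder. `start_bronze_stream` needs a Databricks `SparkSession`, so it is defined here but not called.

```python
def autoloader_options(schema_location, file_format="parquet"):
    """Build Auto Loader reader options for files landed in object storage."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader tracks the inferred schema across runs.
        "cloudFiles.schemaLocation": schema_location,
        # Pick up columns added at the source instead of dropping them.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def start_bronze_stream(spark, source_path, target_table,
                        checkpoint_path, options):
    """Incrementally load newly arrived files into a Bronze Delta table.

    Processes whatever is available and then stops, which suits a
    scheduled job rather than an always-on stream.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Placeholder locations; substitute your own bucket, volume, and table.
bronze_options = autoloader_options(
    "s3://example-bucket/_schemas/sap/sales_orders"
)
# On Databricks:
# start_bronze_stream(
#     spark,
#     "s3://example-bucket/sap/replication/sales_orders/",
#     "sap_bronze.erp.sales_orders",
#     "s3://example-bucket/_checkpoints/sap/sales_orders",
#     bronze_options,
# )
```

    The `availableNow` trigger is a judgment call: it keeps compute cost tied to a schedule, which usually matches the cadence at which the Replication Flow delivers files.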

    Constraints:

    5.4 Third-party ingestion tools (Fivetran, Qlik, Informatica, Precisely, BryteFlow, etc.)

    What it is: validated ingestion partners with purpose-built SAP connectors that write directly to Delta Lake / Unity Catalog. Many integrate with Databricks Partner Connect.

    Status post-ODP-restriction: SAP Note 3255746 restricted third-party use of the ODP RFC API. Tools that previously depended on ODP RFC had to adapt. Vendors including Microsoft (Azure SAP CDC connector), Fivetran, and Qlik have built compliant alternatives, typically using ODP OData (compliant but slower than the RFC path) or non-ODP interfaces. Some vendors (e.g., BryteFlow) emphasize SAP-certified extraction methods (ODP + OData) over RFC-based approaches.

    Strengths: mature tooling, parallelized full + delta loads, dashboard monitoring, broad source coverage including pool/cluster tables and BW objects.

    Constraints: subscription cost on top of Databricks cost; ODP-OData performance trade-off; vendor compliance must be re-verified for each tool against the current support note.

    5.5 Direct JDBC to SAP HANA (Spark JDBC, ngdbc.jar)

    What it is: Databricks notebooks read directly from HANA via the SAP HANA JDBC driver, pulling tables and calculation views into Spark.

    Strengths: fast to prototype, supports filter and column pushdown.

    Constraints worth weighing hard:

    Good for: bulk initial loads, small-table reads, prototypes. Bad for: ongoing CDC or production replication.
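
    For the prototype case, a bounded read might look like the sketch below. The host, port, schema, table, and credentials are placeholders (in practice credentials would come from a secret scope), and the explicit subquery is one way to make sure filtering and column pruning happen in HANA. `read_hana` needs a `SparkSession` with the HANA JDBC driver installed, so it is defined but not called.

```python
def pushdown_subquery(table, columns, where):
    """Wrap a column-pruned, filtered SELECT so HANA evaluates it,
    rather than Spark pulling the full table first."""
    column_list = ", ".join(columns)
    return f"(SELECT {column_list} FROM {table} WHERE {where}) src"


def hana_jdbc_options(host, port, user, password, dbtable, fetch_size=10000):
    """Build Spark JDBC reader options for SAP HANA."""
    return {
        "url": f"jdbc:sap://{host}:{port}",
        "driver": "com.sap.db.jdbc.Driver",  # class shipped in ngdbc.jar
        "dbtable": dbtable,
        "user": user,
        "password": password,
        "fetchsize": str(fetch_size),
    }


def read_hana(spark, options):
    """Run the JDBC read; requires a cluster with the HANA driver attached."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details and a hypothetical schema name.
sales_query = pushdown_subquery(
    "SAPHANADB.VBAK",
    ["VBELN", "ERDAT", "NETWR"],
    "ERDAT >= '20250101'",
)
hana_options = hana_jdbc_options(
    "hana.example.internal", 30015, "READ_ONLY_USER", "<secret>", sales_query
)
# On Databricks: sales_df = read_hana(spark, hana_options)
```

    Bounding the read by date keeps the load on the source system predictable, which is the main operational risk with this path.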

    5.6 Hyperscaler-native integration services

    Each cloud has SAP-aware tooling that lands data in object storage for Databricks to pick up:

    Most attractive when an organization is already heavily invested in the hyperscaler's data tooling.

    5.7 SLT and SAP Data Services

    SAP's own real-time and batch replication tools. SLT (SAP LT Replication Server) is well-known for trigger-based CDC; SAP Data Services (BODS) is the classic batch ETL platform. Still valid where already operationally embedded, particularly in BW Bridge scenarios. Increasingly displaced by Replication Flow + BDC Connect for new builds.

    5.8 Side-by-side comparison

    | Method | Real-time / Delta | Preserves SAP business semantics | ODP-Note compliant | Best for | Watch out for |
    | --- | --- | --- | --- | --- | --- |
    | BDC Connect (Delta Sharing) | Live, zero-copy | Yes, full semantic metadata sync | N/A (SAP-governed) | Strategic, governed sharing of curated data products | Requires BDC; curated only; cross-cloud egress |
    | SAP Databricks (embedded) | Live | Yes | N/A | SAP-heavy DS/ML/SQL workloads, all-in-one | No DLT / Lakeflow / Workflows / Streaming; pair with external workspace |
    | Datasphere Replication Flow | Minutes to hours (snapshot/restore) | Yes (CDS) | Yes (SAP-native) | Bulk replication to object storage | Concurrency cap; CU cost |
    | Fivetran / Qlik / Informatica / BryteFlow | CDC available | Partial | Yes via ODP-OData / non-RFC | Mature, monitored ingestion at scale | Tool license cost; verify compliance |
    | Spark JDBC to HANA | No CDC built-in | No (raw tables) | N/A (DB-level) | Bulk loads, prototypes, HANA sidecar | Source load; HANA license; no business logic |
    | Hyperscaler native (AppFlow / ADF SAP CDC / Data Fusion) | Varies; ADF SAP CDC is true CDC | Partial | Yes (compliant paths) | Already standardized on that cloud | Coverage varies by SAP source |
    | SLT / SAP Data Services | Real-time (SLT) | Yes (DB layer) | N/A (SAP-native) | Existing SAP shops, BW Bridge feeds | Aging stack; skills dependency |

    6. Reference architectures for the SAP Lakehouse

    6.1 Greenfield: S/4HANA → BDC → external Databricks Lakehouse

    The cleanest 2026 architecture. SAP curates data products in BDC. BDC Connect for Databricks shares them via Delta Sharing into the customer's Databricks workspace as a Unity Catalog catalog. Non-SAP data (clickstream, IoT, third-party feeds, CRM) lands in Databricks the usual way: via Auto Loader, Lakeflow Connect, or Fivetran. Bronze keeps SAP semantically intact; Silver joins across domains; Gold serves business marts. Unity Catalog governs end-to-end. ML and AI workloads consume both seamlessly.

    6.2 Brownfield: large ECC estate + existing Databricks

    ECC 6.0 still runs core operations. The pragmatic path:

    This is also the architecture that buys the most time for skills transition: BW developers can keep contributing to the BW Bridge side while the Databricks team builds out the cross-domain layers.

    6.3 Hybrid analytical: BW/4HANA as system of analytical record

    Where BW/4HANA holds harmonized, business-trusted data, leave it in place. Use Datasphere to federate or replicate selected InfoProviders / aDSOs to the Lakehouse. Use Databricks to combine those harmonized metrics with non-SAP signals (clickstream, weather, social, IoT) for ML use cases BW was never designed to support.

    7. Cloud-by-cloud: AWS vs Azure vs GCP for an SAP Lakehouse

    Databricks compute, Spark runtime, Delta Lake, Unity Catalog, and the notebook experience are functionally identical across the three clouds. SAP Databricks (the embedded variant) is offered on all three. What differs is the integration layer: identity, storage, networking, billing, native services, and SAP-specific connectors.

    7.1 AWS

    Strengths:

    Watchouts:

    7.2 Azure

    Strengths:

    Watchouts:

    7.3 GCP

    Strengths:

    Watchouts:

    7.4 Cost dimensions that apply everywhere

    DBU rates vary by cloud, region, tier (Standard / Premium / Enterprise), and compute type (All-Purpose / Jobs / SQL / Serverless). Pre-purchased Databricks Commit Units (DBCUs) can save up to roughly 37% on 1–3 year terms. Regardless of cloud, egress between BDC's tenant and a customer-managed Databricks workspace in a different cloud or region is the most missed line item in TCO models.
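
    A back-of-the-envelope model helps keep egress visible next to compute. The sketch below is arithmetic only: every rate and volume in the example call is a placeholder we made up, not a quoted price, and the function name and parameters are our own.

```python
def monthly_cost_estimate(dbus, dbu_rate, commit_discount,
                          egress_gb, egress_rate_per_gb):
    """Rough monthly estimate: discounted compute plus data egress.

    compute = dbus * dbu_rate * (1 - commit_discount)
    egress  = egress_gb * egress_rate_per_gb
    """
    if not 0.0 <= commit_discount < 1.0:
        raise ValueError("commit_discount must be in the range [0, 1)")
    compute = dbus * dbu_rate * (1.0 - commit_discount)
    egress = egress_gb * egress_rate_per_gb
    return {
        "compute": round(compute, 2),
        "egress": round(egress, 2),
        "total": round(compute + egress, 2),
    }


# Placeholder inputs: 10,000 DBUs at 0.50 per DBU with a 37% commit
# discount, plus 2,000 GB of cross-cloud egress at 0.09 per GB.
estimate = monthly_cost_estimate(
    dbus=10_000,
    dbu_rate=0.50,
    commit_discount=0.37,
    egress_gb=2_000,
    egress_rate_per_gb=0.09,
)
```

    With those placeholder inputs, compute comes to 3,150.00 and egress to 180.00, for a total of 3,330.00. The absolute numbers are meaningless; what matters is that egress has its own line and scales with data volume rather than with compute.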

    8. A decision framework

    Five questions, in order. Each prunes the option space.

    1. Do you already have, or will you buy, SAP BDC?

    9. Making it democratic: the work that doesn't show up in architecture diagrams

    The architecture above is a necessary condition for democratization, not a sufficient one. Three pieces of work matter as much as the pipelines:

    10. Open questions and risks worth calling out

    Honest accounting of what's still moving:

    11. Closing thought: the architecture is only as good as the people designing it

    Everything in this article points to a single, uncomfortable truth: there is no template for an SAP Lakehouse. The right architecture for a discrete manufacturer running ECC on Oracle with a BW/4HANA reporting layer looks nothing like the right architecture for a process manufacturer mid-flight on a RISE with SAP migration with SuccessFactors and Ariba already in BDC. Both can be defended in review. Both can fail in production if the design doesn't account for what the business actually does day to day.

    Getting this right requires fluency in three domains that rarely sit in the same head, or even the same team:

    The architectures that scale, that stay within budget, and that actually get used in production are the ones designed by people who can hold all three perspectives at once. That's rare. Most SAP integrators are deep on the first domain and thin on the other two. Most Databricks practitioners are deep on the second and thin on the first and third. The combination is what turns "we moved SAP data to Databricks" into "our supply chain team is using SAP-grounded ML in their weekly S&OP cycle."

    This is the work Syren does. We sit at the intersection of manufacturing and supply chain process knowledge, deep SAP experience, and production Databricks delivery, and we use that combination to design Lakehouse architectures that are scalable, cost-defensible, and actually adopted by the business teams they're built for.

    If you're an enterprise leader looking to unlock real value from your SAP data, whether you're scoping a greenfield BDC + Databricks build, navigating a brownfield ECC estate, or trying to make sense of which of the seven ingestion paths fits your landscape, the Syren team would be glad to talk. Reach out to us, and let's design something worth defending in your next architecture review.
