How to Build High-Accuracy Databricks Genie Spaces

How to Build High-Accuracy AI/BI Genie Spaces with Metadata Engineering and Benchmarking

Add a header to begin generating the table of contents

Summary

Co-authored with Anshuman Dasgupta, India Partner Solutions Architect, Databricks this post shares the exact playbook: data modelling techniques, Genie configurations, and the benchmark-driven feedback loop that got us to 100% accuracy even on the most complex and nuanced benchmarking questions.

Anshuman Dasgupta Co Author

The question we wanted to answer

When we first started working with Databricks AI/BI Genie, we were excited about the idea of democratizing data for non-technical teams, who until then had been relying solely on static dashboards. But we also wanted to push it further: just how good could we make it? Because, without high accuracy, adoption across business units was difficult.

Genie is an out-of-the-box solution; users can ask questions in natural language, and Genie translates them into governed SQL and returns answers. No dashboard to build. No analyst to wait for. Just a question and an answer.

And honestly? It starts strong. When we connected our Supply Chain tables to a Genie Space with basic data modeling already in place and benchmarked it against 15 complex representative business questions, we got 8 out of 15 correct (53%) right out of the gate, a solid foundation that told us exactly where Genie needed more context to go further.

This article documents how we took that same Space to 15/15 (100%), and the systematic playbook we used to close the gap.

Think of Databricks Genie Spaces like a brilliant new analyst on day one; very capable, but needing context, examples, and clear rules to perform at its best”

The Data: Supply Chain Order Analytics

We built around a supply chain purchase orders dataset, a common domain with complex, multi-faceted questions spanning delivery performance, supplier quality, and order volumes.

We structured the data into three purpose-built Gold views, each designed to answer a specific category of business question:

The three Gold views in Unity Catalog; each scoped to a specific analytical domain.

Each view is a denormalised, pre-aggregated Gold layer built on top of an underlying star schema, not the raw schema itself. This turned out to be the single most impactful architectural decision we made. When your data is already denormalised, everything lives in one flat table, and Genie simply writes a SELECT. Less complexity, fewer errors, and higher accuracy. We’ll walk through exactly how we did this in Phase 1, Principle 1 below, but let’s talk about the governance and access for business users first.

Phase 0: Platform access make sure users can get in

Before investing in data modeling and metadata engineering, there are two platform-level prerequisites that will determine whether your business users can access Genie at all.

0.1 Enable Automatic Identity Management (AIM)

When you roll out Genie to business users, the first question is: how do they log in? By default, every user needs to be manually provisioned in Databricks before they can access anything. That works when you're onboarding a handful of data engineers, but not when you're trying to reach hundreds of business users across Marketing, Sales, and Operations who are the main end users of genie.

This is where Automatic Identity Management (AIM) comes in. Without it, when a business user types their email into Databricks to log into Databricks Genie, they'll simply see "Username does not exist." AIM fixes this by connecting Databricks to your identity provider (e.g. Okta, Entra ID, OneLogin), so it automatically looks up the user in your IdP and provisions their account on the spot.

What makes this powerful from a governance perspective is that you're not managing users in two places. Your IT team continues to manage identities and group memberships in your IdP as they already do. Someone joins the Marketing team? Add them to the Marketing group in your IdP, AIM syncs it to Databricks. Someone leaves? Remove them; access is gone. Databricks simply follow your existing identity governance.

0.2 Assign the Consumer Role for Genie-only access

Even with AIM enabled, a newly provisioned user lands in Databricks with no permissions. They can log in, but they can't do anything including access Genie.

This is where the Consumer Role comes in. Most organisations don't want business users navigating the full Databricks workspace, seeing clusters, warehouses, Unity Catalog browsers, or notebook environments. They just want them to open Genie and ask questions. The Consumer Role enables exactly that: users access Genie through Databricks One, a streamlined interface purpose-built for consuming published assets like Genie Spaces and AI/BI Dashboards, without ever seeing the underlying platform complexity.

Think of it as the difference between giving someone a key to the entire building versus a pass to the front reception area. Business users don't need the full Databricks workspace; they get a clean, focused experience through Databricks One where they ask a question and get an answer.

Phase 1: Data modeling; lay the right foundation first

Now that access and governance are sorted, business users can log in through AIM, they land in Databricks One with the Consumer Role, and each group only sees the Genie Spaces shared with them. It's time to focus on what makes those Spaces actually useful.

Before finalizing the Genie space, we spent time modeling the data specifically for natural language querying. This is the highest-leverage step. Get it right, and everything above it works better. Skip it, and no number of instructions will save you.

1.1 Denormalize everything

Our underlying Gold layer is a classic star schema fact_purchase_orders joined to the dimension tables for supplier, material, plant, and date. Genie handles star schemas well and generates correct JOIN logic but we found that pre-joining everything into flat, denormalized views makes Genie's job even easier. Simpler SQL means faster responses, fewer edge cases, and more consistent results across a wider range of question phrasings.

SQL

Both of these work. Genie handles the star schema correctly:

SELECT s.supplier_name, SUM(f.order_amount) 
FROM fact_purchase_orders f 
JOIN dim_supplier s ON f.supplier_key = s.supplier_key 
WHERE s.supplier_country = 'India' 
GROUP BY s.supplier_name

But with a denormalized view, the SQL is simpler and more reliable at scale:

SELECT supplier_name, SUM(order_amount) 
FROM vw_purchase_order_kpis 
WHERE supplier_country = 'India' 
GROUP BY supplier_name

Think of it like giving a new data analyst a pre-joined report with all columns in a single view versus raw database access; both work well, but one helps them focus on answering the question rather than navigating database schemas and relationships.

On join-heavy questions, denormalized views consistently produced more reliable results across varied question phrasings.

1.2 Pre-calculate derived fields

Instead of asking Genie to construct a DATEDIFF calculation from scratch, we made the calculations ahead of time. Columns like delivery_delay_days, delivery_status, delivery_delay_bucket, and is_high_value_order give Genie ready-to-use values; no computation required. Every pre-calculated column is one fewer thing that can go wrong.

1.3 Filter irrelevant data at the view level

Our raw Gold table had 85 columns, including surrogate keys, soft-deleted records, and internal audit flags. We filtered those out in the view definition, removed all surrogate keys, and trimmed from 85 down to 30 focused columns. A smaller, cleaner surface area means Genie picks the right column more reliably, every time.

1.4 Use human-readable column names

po_qty became order_quantity, mat_grp_cd became material_group, and dlvry_dt became actual_delivery_date. Since Databricks Genie maps natural language directly to column names, every descriptive rename is a free accuracy improvement, and when your goal is 100% accuracy, you don't want to leave any room for error. That said, Genie performs impressively well even without these changes; these are refinements that take a good result and make it great.

Phase 2: Metadata engineering; teach Genie your business language

With clean views in place, we annotated everything in Unity Catalog. According to Databricks’ documentation, this is the #1 accuracy driver, and our experience confirms it completely.

2.1 Table-level comments that direct traffic

Each view got a description that explicitly tells Genie when to use it. This acts as a routing layer helping Genie pick the right table when multiple could apply:

SQL

COMMENT ON VIEW vw_purchase_order_kpis IS 
'Core purchase order KPI metrics. Contains total orders, on-time delivery 
rate, average lead time, and SLA compliance. Use this view for questions 
about order volumes, delivery performance, lead times, and SLA. Each row 
represents a combination of fiscal year, region, supplier, and material.';

2.2 Column-level comments with synonyms and examples

Every column got a rich description covering what it means in business terms; synonyms users might actually use, example values, and interpretation guidance.

SQL

COMMENT ON COLUMN vw_purchase_order_kpis.avg_days_to_delivery IS 
'Average number of days from order creation to actual delivery. 
Lower is better. Also called lead time, delivery TAT, or time to deliver. 
Example values: 5.2, 12.8, 30.1';

That “Also called…” pattern is critical. Users say, “lead time”, “TAT”, “delivery time”, “how long does it take”, all pointing to the same column. Without synonyms, Genie may grab the wrong column entirely.

2.3: Genie Space configuration; curate the experience

After creating the Genie Space and adding our views, we tuned the data layer directly in the Genie UI. These settings are scoped to the Space they don’t touch your Unity Catalog metadata.

It’s a simple toggle that eliminates an entire class of failures. Enable it on every categorical column; there's no reason not to.

Phase 3: Knowledge Store; encoding business logic

With solid data modeling and metadata in place, we configured the Knowledge Store. This is where you encode business logic that can’t live in column names alone.

3.1 General Instructions: Be specific, not vague

We wrote 11 specific, actionable instructions.

The difference between good and bad instruction:

One instruction we discovered the hard way: Genie was applying a fiscal year filter to every query, even unprompted. A single explicit instruction — “only filter when a year is mentioned” — fixed 3 benchmark failures in one shot.

3.2 SQL Expressions: define your business metrics precisely

Instead of hoping Genie reinvents your business definitions from scratch, SQL Expressions let you define them once and lock them in permanently. In practice, you don’t need to define every metric focus on your most critical and sensitive ones first: the KPIs that appear in executive reporting, the calculations that vary by team, and the definitions that have caused confusion or inconsistency in the past. Those are the ones worth codifying.

-- Name: on-time delivery rate 
-- Synonyms: OTD, OTIF, on-time rate 
-- Instructions: Use when users ask about delivery performance. Exclude pending orders. 
ROUND(SUM(CASE WHEN delivery_status = 'On Time' THEN 1 ELSE 0 END) 
     * 100.0 / COUNT(*), 2)

3.3 Parameterized Example SQL Queries

We added 8 example queries covering common question patterns. Each has three parts: a title phrased as a user question, correct parameterized SQL, and usage guidance. Parameterized queries are marked as Trusted Assets — giving users visible confidence in the answer.

-- Title: Which suppliers have the most defects? 
SELECT supplier, 
       SUM(defect_count) AS total_defects, 
       ROUND(AVG(rca_closure_rate), 2) AS avg_closure_rate 
 
FROM vw_supplier_quality_analysis 
WHERE fiscal_year = '{{fiscal_year}}' 
GROUP BY supplier 
ORDER BY total_defects DESC 
LIMIT 10

Phase 4: The benchmark loop; measure, fix, repeat

This is where real accuracy gains happened. We defined 15 benchmark questions, then ran them, fixed specific gaps, and ran them again.

4.1 How we generated ground truth, with Claude Opus 4.6

Before running a single benchmark, we needed a reliable ground truth SQL to measure against. We used Claude Opus 4.6, Anthropic’s most capable model, to generate the reference SQL for each of our 15 questions, given the full table schemas and column definitions as context.

This gave us a strong, independently reasoned baseline to compare against. Not SQL we wrote ourselves with our own blind spots, but SQL generated by a state-of-the-art model with full schema awareness.

What happened next was the most interesting part of the entire exercise.

Genie outperformed its own ground truth

We were able to improve accuracy to 12/15 through data modelling, metadata engineering, and Knowledge Store configuration, a solid 80% on the benchmarks. But on further review of the remaining differences, we discovered something unexpected

The final 3 “differences” weren’t failures; they were improvements. Genie’s SQL was logically correct but structured differently from the Claude Opus ground truth. In each case, Genie had actually done something smarter:

Case 1 : The Weighted Average (Question 3)

“What is the on-time delivery rate?”

Our Claude Opus 4.6 ground truth used a simple AVG(on_time_delivery_pct) — valid, but misleading in a supply chain context. A supplier handling 1,000 orders should carry more weight than one handling 10. Genie knew this and returned a weighted average instead, scaling each supplier's rate by order volume. It also added WHERE on_time_delivery_pct IS NOT NULL unprompted, excluding incomplete records that would have skewed the result. Two improvements, zero instructions asking for either.

Case 2 : The Ranking (Question 10)

“Compare APAC and Europe by order volume.”

Our ground truth returned a sorted list of regions with order counts — correct but leaving the interpretation to the user. Genie added a RANK() window function, automatically assigning each region a position. Nobody asked for a ranking, but every business user who sees a list of numbers immediately ranks them mentally anyway. Genie just did it for them, delivering the insight rather than the raw data.

Case 3 : Dynamic Fiscal Year

“What is the monthly trend of total spend?”

Our ground truth hardcoded WHERE fiscal_year = '2024-2025'correct today, broken next year. Genie wrote logic that automatically detects the most recent fiscal year from the data itself, making the query production-ready from day one. Having learned our YYYY-YYYY format and April-to-March Indian fiscal calendar through the Knowledge Store, Genie didn't just answer the question it wrote a query that will keep answering it correctly without anyone ever touching it again.

Why did this happen? Because by this point, Genie had something Claude Opus didn’t have when generating the ground truth: deep context about our specific business domain. It understood our fiscal year format, our regional hierarchy, our supplier naming conventions, and our preferred metric definitions, all encoded through weeks of metadata engineering and Knowledge Store configuration.

A powerful general-purpose model like Claude Opus 4.6 generates excellent SQL from a schema. But a well-configured Genie Space generates business-aware SQL — and that turned out to be the difference.

We updated the ground truth to match Genie’s approach in all 3 cases.

Key insight: When your benchmarking tool starts outperforming the model you used to generate the ground truth, you know the configuration is working. Genie isn’t just answering questions, it’s answering them the way your business would.

4.2 The accuracy pyramid: what moves the needle most

Based on our experience, here’s how each layer contributes to accuracy, from foundation upward:

You can’t shortcut the foundation with more instructions. But with solid data modeling and rich metadata, even a minimal Knowledge Store gets you a long way.

4.3 Eight lessons that changed how we think about Genie

Phase 5: Refinement and knowledge extraction; the finish line is a milestone, not an end

Reaching 15/15 on our initial benchmark was worth celebrating, but it wasn’t the end. Genie Spaces are designed for continuous improvement, and the real payoff comes from building a feedback loop with actual users.

We review the Monitoring tab regularly to see what users are actually asking, where Genie is struggling, and which questions deserve dedicated example queries. When a subject matter expert corrects Genie’s SQL in the chat and clicks “Add as instruction,” that interaction becomes a permanent improvement to the Space. That process, capturing real-world corrections, is the most effective way to improve accuracy over time.

Databricks also includes Knowledge Extraction, which analyzes thumbs-up interactions to propose knowledge snippets. Space authors review and approve these before they enter the Knowledge Store, creating a supervised learning loop that continuously refines Genie’s understanding.

Conclusion

When we started this exercise, we wanted to know how good Genie could get and whether it could safely handle even your most sensitive business questions. The answer turned out to be better than we expected, and better than the best AI model we used to benchmark against. That’s not a small thing. It means that with the right foundation, Genie doesn’t just answer your questions; it understands your business. And that’s exactly what data democratisation is supposed to look like.

Summary

The question we wanted to answer

The Data: Supply Chain Order Analytics

Phase 0: Platform access make sure users can get in

Phase 1: Data modeling; lay the right foundation first

Phase 2: Metadata engineering; teach Genie your business language

Phase 3: Knowledge Store; encoding business logic

Phase 4: The benchmark loop; measure, fix, repeat

Genie outperformed its own ground truth

Phase 5: Refinement and knowledge extraction; the finish line is a milestone, not an end

Conclusion

Related Posts