Here’s a quick thesis and market map I developed as part of an interview process. Prior to business school, I worked with Enterprise clients to develop custom training solutions that brought data strategy to life. These ranged from practitioner-focused “How to use Python” classes back in 2015 to Enterprise-wide “AI Readiness and Scaling” engagements as ChatGPT hit the zeitgeist.
Though AI valuations are exploding, the truth is that AI penetration in Enterprises is still skin-deep, with most usage coming from functions where mistakes are especially forgivable: coding, sales, and customer success. Many of the challenges I heard from clients pre-2023 are still prevalent, and there remains a huge opportunity for startups focused on them. Here’s my thinking on how AI can truly break through to provide Enterprise-wide value, a quick list of some startups doing interesting things, and some questions I’d like to ask these founders in an early call.
Investment Brief – Secure Enterprise AI Data Infrastructure
A. Investment Thesis
As generative AI shifts from novelty to utility, enterprises are quickly realizing that the real competitive advantage lies not in the models themselves but in the data they uniquely own. While foundation models are becoming commoditized, access to clean, contextual, and compliant proprietary data remains the gating factor for delivering useful, trustworthy, and differentiated AI experiences.
Historically, tapping into proprietary enterprise data has been slow, brittle, and engineering-heavy. It has required custom-built connectors, manual schema management, redaction, and ETL pipelines. This is especially true for data trapped in SaaS tools, long-tail APIs, unstructured formats, or siloed legacy systems. Even when technically accessible, this data is often entangled with sensitive information and governed by strict compliance rules, access controls, and privacy obligations—making data activation not just a technical hurdle, but a serious security and governance risk. As a result, much of the data enterprises could be using to drive AI value remains effectively locked away. It’s too fragmented, too risky, or too fragile to operationalize.
The next generation of infrastructure for enterprise AI will focus not on model performance, but on making proprietary data securely usable regardless of its format, origin, or sensitivity. This emerging layer will combine robust data access with embedded security, governance, and context-awareness. It won’t just extract data. It will prepare that data for intelligent systems that can act confidently and compliantly.
B. Market Landscape: Secure Enterprise AI Data Startups
| Startup | Primary Function | Core Layer | Value Prop | Differentiator | Stage / Backers | Integration Depth |
| --- | --- | --- | --- | --- | --- | --- |
| Vectorize | Unstructured data activation for RAG | Data prep + embedding pipeline | Converts enterprise documents into clean, chunked, vectorized form | Auto-tuning for chunking, splitting, and semantic search | Seed / True Ventures, Basis Set | API + CLI, supports LangChain and custom RAG stacks |
| Doti | Chat-based, agentic search across a company’s SaaS tools and databases | Retrieval + context layer | Surfaces relevant context from enterprise tools via embedded agents | Autonomous query routing + context injection for agents | Seed / Not publicly disclosed | Plug-ins for Slack, Notion, Confluence; API and UI interfaces |
| Onyx | Secure agent access to multi-source data | Retrieval + orchestration | Lets agents search enterprise systems securely with hallucination controls | Open-source agentic orchestration stack with built-in guardrails | Seed / Khosla Ventures, First Round Capital | SDK + OSS stack; embeddable into custom agents |
| Private AI | Text redaction + pseudonymization to protect PII in AI workflows | Security + preprocessing layer | Sanitizes sensitive info before model interaction to ensure compliance | Token-level inference engine with developer-first API | Seed / Microsoft M12, Forum Ventures | Lightweight API; integrates into preprocessing pipelines |
| Structify | Structured dataset creation from messy data | Data transformation + structuring | Transforms unstructured sources into tabular, AI-ready datasets | Visual-language modeling for structure inference | Seed / Bain Capital Ventures | No-code UI + export-ready APIs |
| View | Schema-aware metadata layer for AI systems | Governance + context modeling | Abstracts data sources into a unified, queryable context model | Data stays in-place; schema-aware access rules enforced | Seed / Raised in early 2024 | Hybrid search interface + SDK; deploys behind firewall |
C. Diligence Questions for Founders
- What makes your system safe for regulated industries to use in production?
[Shows how deeply they’ve thought about access control, audit logging, and data residency, and whether their architecture will hold up under a real compliance review.]
- What happens when the structure or schema of upstream data changes?
[Look for resilience. The best solutions adapt without manual intervention. This matters most for SaaS APIs and unstructured sources.]
- How do you enforce data-level security or filtering once access is granted?
[Distinguishes between systems that simply move data and those that embed logic for redaction, role-based access, or usage scoping.]
- Why can’t a modern data platform or MLOps stack do this already?
[Tests for wedge defensibility and insight. Answer should show how the product is AI-native and structurally different, not just a close-to-API feature tacked onto Snowflake, dbt, or LangChain.]
- Who installs it, who governs it, and who actually benefits from its use day-to-day?
[This reveals clarity on buyer vs. user vs. enforcer. Great answers distinguish between the install surface (e.g. DevOps, IT), the governance layer (e.g. security, compliance), and the end beneficiary (e.g. analysts, AI teams, product owners).]
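To make the data-level security question above more concrete, here is a minimal Python sketch of what “embedding logic for redaction and role-based access” could look like before a record ever reaches a model. Everything in it is an assumption for illustration: the `ROLE_FIELDS` policy, the `redact` and `scoped_view` helpers, and the sample record are hypothetical and do not describe how any startup listed here works. The email regex is deliberately naive; real PII detection needs far more than a pattern match.

```python
import re

# Hypothetical policy: the fields each role is allowed to see.
ROLE_FIELDS = {
    "analyst": {"account", "region", "notes"},
    "support": {"account", "notes"},
}

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def scoped_view(record: dict, role: str) -> dict:
    """Return only the fields the role may see, with string values redacted."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {
        key: redact(value) if isinstance(value, str) else value
        for key, value in record.items()
        if key in allowed
    }


# Made-up sample record.
record = {
    "account": "Acme Corp",
    "region": "EMEA",
    "notes": "Escalate to jane.doe@acme.example before renewal.",
    "contract_value": 250000,
}

print(scoped_view(record, "support"))
```

In a diligence call, the interesting follow-up is where this logic lives: a product that only moves data leaves a step like this to the customer, while one that enforces it inside the access layer can apply it consistently to every agent and query.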