Here’s a quick thesis+market map document I developed as part of an interview process. Prior to Business School, I worked with Enterprise clients to develop custom training solutions to bring data strategy to life. These ranged from practitioner-focused “How to use Python” classes back in 2015 to Enterprise-wide “AI Readiness and Scaling” engagements as ChatGPT hit the zeitgeist.

Though AI valuations are exploding, AI penetration in Enterprises is still skin-deep, with most usage coming from functions where mistakes are especially forgivable: coding, sales, and customer success. Many of the challenges I heard from clients pre-2023 are still prevalent, and there remains a huge opportunity for startups focused on them. Here’s my thinking on how AI can truly break through to provide Enterprise-wide value, a quick list of some startups doing interesting things, and some questions I’d like to ask these founders in an early call.

Investment Brief – Secure Enterprise AI Data Infrastructure

A. Investment Thesis

As generative AI shifts from novelty to utility, enterprises are quickly realizing that the real competitive advantage lies not in the models themselves but in the data they uniquely own. While foundation models are becoming commoditized, access to clean, contextual, and compliant proprietary data remains the gating factor for delivering useful, trustworthy, and differentiated AI experiences.

Historically, tapping into proprietary enterprise data has been slow, brittle, and engineering-heavy. It has required custom-built connectors, manual schema management, redaction, and ETL pipelines. This is especially true for data trapped in SaaS tools, long-tail APIs, unstructured formats, or siloed legacy systems. Even when technically accessible, this data is often entangled with sensitive information and governed by strict compliance rules, access controls, and privacy obligations—making data activation not just a technical hurdle, but a serious security and governance risk. As a result, much of the data enterprises could be using to drive AI value remains effectively locked away. It’s too fragmented, too risky, or too fragile to operationalize.

The next generation of infrastructure for enterprise AI will focus not on model performance, but on making proprietary data securely usable regardless of its format, origin, or sensitivity. This emerging layer will combine robust data access with embedded security, governance, and context-awareness. It won’t just extract data. It will prepare it for intelligent systems which act confidently and compliantly.
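To make "securely usable" concrete, here is a minimal Python sketch of what such a layer might do to a single record before it reaches a model: drop fields the caller's role is not allowed to see, then redact obvious PII from the remaining text. Everything in it is an illustrative assumption rather than any vendor's approach: the `ROLE_FIELDS` policy, the `prepare_record` helper, the field names, and the two regex patterns are hypothetical, and a production system would rely on a dedicated PII detector rather than hand-written patterns.

```python
import re

# Hypothetical patterns for illustration only; real PII detection is far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical policy: which fields each role may pass to a model.
ROLE_FIELDS = {
    "analyst": {"ticket_id", "summary"},
    "support_lead": {"ticket_id", "summary", "region"},
}


def redact(text: str) -> str:
    """Replace emails and US-style SSNs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def prepare_record(record: dict, role: str) -> dict:
    """Keep only fields the role may see, redacting any string values."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    prepared = {}
    for field, value in record.items():
        if field not in allowed:
            continue
        prepared[field] = redact(value) if isinstance(value, str) else value
    return prepared


record = {
    "ticket_id": 7,
    "summary": "Customer jane@example.com reported SSN 123-45-6789 exposed",
    "customer_email": "jane@example.com",
    "region": "EMEA",
}
safe_for_model = prepare_record(record, "analyst")
```

The design choice worth noticing is the default: a role with no policy entry gets an empty record, so access fails closed rather than open.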

B. Market Landscape: Secure Enterprise AI Data Startups

Startup	Primary Function	Core Layer	Value Prop	Differentiator	Stage / Backers	Integration Depth
Vectorize	Unstructured data activation for RAG	Data prep + embedding pipeline	Converts enterprise documents into clean, chunked, vectorized form	Auto-tuning for chunking, splitting, and semantic search	Seed / True Ventures, Basis Set	API + CLI, supports LangChain and custom RAG stacks
Doti	Chat-based, agentic search across a company’s SaaS tools and databases	Retrieval + context layer	Surfaces relevant context from enterprise tools via embedded agents	Autonomous query routing + context injection for agents	Seed / Not publicly disclosed	Plug-ins for Slack, Notion, Confluence; API and UI interfaces
Onyx	Secure agent access to multi-source data	Retrieval + orchestration	Lets agents search enterprise systems securely with hallucination controls	Open-source agentic orchestration stack with built-in guardrails	Seed / Khosla Ventures, First Round Capital	SDK + OSS stack; embeddable into custom agents
Private AI	Text redaction + pseudonymization to protect PII in AI workflows	Security + preprocessing layer	Sanitizes sensitive info before model interaction to ensure compliance	Token-level inference engine with developer-first API	Seed / Microsoft M12, Forum Ventures	Lightweight API; integrates into preprocessing pipelines
Structify	Structured dataset creation from messy data	Data transformation + structuring	Transforms unstructured sources into tabular, AI-ready datasets	Visual-language modeling for structure inference	Seed / Bain Capital Ventures	No-code UI + export-ready APIs
View	Schema-aware metadata layer for AI systems	Governance + context modeling	Abstracts data sources into a unified, queryable context model	Data stays in-place; schema-aware access rules enforced	Seed / Raised in early 2024	Hybrid search interface + SDK; deploys behind firewall

C. Diligence Questions for Founders

What makes your system safe for regulated industries to use in production?
[Reveals how deeply they’ve thought about access control, audit logging, and data residency, and whether their architecture will hold up under a real compliance review.]

What happens when the structure or schema of upstream data changes?
[Look for resilience. The best solutions adapt without manual intervention. This is especially important for SaaS APIs and unstructured sources.]

How do you enforce data-level security or filtering once access is granted?
[Distinguishes between systems that simply move data and those that embed logic for redaction, role-based access, or usage scoping.]

Why can’t a modern data platform or MLOps stack do this already?
[Tests for wedge defensibility and insight. The answer should show how the product is AI-native and structurally different, not just a close-to-API feature tacked onto Snowflake, dbt, or LangChain.]

Who installs it, who governs it, and who actually benefits from its use day-to-day?
[This reveals clarity on buyer vs. user vs. enforcer. Great answers distinguish between the install surface (e.g. DevOps, IT), the governance layer (e.g. security, compliance), and the end beneficiary (e.g. analysts, AI teams, product owners).]
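
For the schema-change question above, it helps to have a picture of the minimum a founder should be able to describe. The Python sketch below is a hypothetical illustration, not any listed startup's method: it compares an incoming record against an expected schema and reports missing, added, and retyped fields. The `diff_schema` name and the example fields are assumptions made up for this example.

```python
def diff_schema(expected: dict[str, type], record: dict) -> dict[str, list[str]]:
    """Report how a record deviates from the expected field names and types."""
    missing = sorted(name for name in expected if name not in record)
    added = sorted(name for name in record if name not in expected)
    retyped = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "added": added, "retyped": retyped}


# Hypothetical upstream change: "amount" renamed to "total", "id" now a string.
expected = {"id": int, "name": str, "amount": float}
incoming = {"id": "42", "name": "Acme", "total": 9.5}
drift = diff_schema(expected, incoming)
```

Detection like this is only the floor. The stronger answers go further and explain how the system remaps or quarantines drifted fields without a human in the loop.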

Shane Wilson's Blog