Proprietary Data Is Not What You Think It Is
“We have proprietary data.”
If you’ve spent any time in tech, you’ve heard this line hundreds of times. It shows up in pitch decks, board meetings, investor memos, and analyst reports. It has become shorthand for defensibility, for moat, for “you can’t compete with us.”
The problem is that most people who say it never explain what, exactly, makes their data proprietary or why that should matter. And the people evaluating these statements often treat proprietary data as binary: you either have it or you don’t, and if you have it, it must be valuable.
That framing is incomplete. And in the age of AI, it’s getting more incomplete every day.
Proprietary According to Whom?
What makes data proprietary? The question gets asked less often than it should be.
Is something proprietary simply because no one else has it? What if a competitor has something functionally similar, but sourced differently? What if data is broadly accessible but has been fundamentally transformed, cleaned, enriched, and integrated into a product that actually drives decisions? Is that proprietary?
Many investors would say no to that last one. And that’s where the evaluation often breaks down.
I’ve spent over 20 years building, scaling, and selling data businesses. I’ve leveraged truly proprietary data as well as data that was not “proprietary” in the way an investor would define it on a napkin. What made data most valuable was how we collected it, how we enriched it, what decisions it informed, and how deeply it was embedded in our customers’ workflows.
The binary framing misses all of that.
The Gap Between Exclusivity and Value
There’s a pattern that plays out often. An investor or data buyer sees a company with a unique dataset and gets excited. The dataset is hard to collect. It took years to build. No one else has exactly the same thing. On paper, it looks like a moat.
But then you ask: Who is buying this? What decision does it inform? And the answers are often vague. The data is interesting in isolation, but the link to real customer value is weak.
The flip side is equally common. A company is building something powerful on top of data that isn’t technically exclusive. Multiple vendors might have similar underlying sources. But this company packages it better, updates it faster, integrates it more tightly, and actually solves a problem that customers pay for month after month. It doesn’t pass a strict “proprietary” test, yet the business outperforms companies with objectively more unique data.
I saw this play out recently in the commercial real estate data space. Multiple companies sell overlapping property-level datasets to overlapping buyer segments. And yet several of them generate strong revenue. The market is large, the use cases vary, and execution matters far more than exclusivity.
Data Businesses and the VC Model
Auren Hoffman, who has built and invested in more data companies than almost anyone, has made a point worth internalizing: most data businesses probably shouldn’t be venture funded. They can be excellent businesses with strong recurring revenue and real profitability. But they rarely exhibit the explosive, nonlinear growth that venture capital requires.
Data companies tend to grow steadily over long time horizons. The sales cycles are long. Category creation is often required. The zero-to-one phase is slow and capital-intensive in ways that strain VC timelines. These are characteristics that align better with private equity than with a VC model built around rapid scaling and 10x returns.
That mismatch matters because it shapes how data gets evaluated. If the expectation is hypergrowth, the data itself starts to carry more weight than it should. Ownership becomes a proxy for trajectory. In reality, owning data is one thing. Capturing value from it at scale is something else entirely.
The Identity Question: Data or Software?
Dan Entrup explores a related question in his newsletter It’s Pronounced Data: are you a software company or a data company? The piece highlights the strategic tension between these two identities and the risk of trying to be both without committing to either.
It’s a useful framing, and I’d push it even further. The best data businesses I’ve seen don’t neatly fit into either category. They’re effectively SaaS platforms with a data component. The data becomes valuable because it’s embedded in a workflow. Customers don’t pay for raw tables. They pay for outcomes: insights they can act on, efficiency they can measure, decisions they can make faster or better.
If your business model depends on customers licensing a dataset that they download and use offline, your competitive position is fragile regardless of how proprietary the data is. This is especially true as the “customer” increasingly isn’t a human analyst pulling data into a spreadsheet but an AI agent consuming it programmatically. If your data powers a product that customers use every day, and if that usage generates more data that makes the product better, you have something much harder to displace. The identity question matters less than the integration question.
Not All Proprietary Data Is Defensible
Abraham Thomas dives deep into what defensibility actually means for data in Data and Defensibility. His core insight is one that many investors miss: unique data is neither necessary nor sufficient for defensibility.
For a data advantage to function as a moat, Thomas argues it must be meaningful (the data actually matters to outcomes), rivalrous in value (your having it diminishes a competitor’s position), and without functional substitutes (no one can approximate it cheaply). Most datasets do not satisfy all three criteria.
Thomas also draws an important distinction between genuine control of valuable data and other types of advantages like network effects or feedback loops. Simply having data, even truly exclusive data, does not prevent competitors from building around you if the value delta is small.
There are cases where genuinely proprietary datasets serve niche, declining, or low-monetization use cases. The data is exclusive but the market doesn’t care enough to pay real money for it. On the other side, broadly accessible data sources can become enormously valuable when enriched, integrated, and embedded into workflows that solve urgent problems.
AI Is Accelerating the Decay
The pace of data commoditization is accelerating. Foundation models and AI agents have drastically lowered the cost of collecting, synthesizing, and approximating datasets that used to require years of specialized effort. What once took a team of engineers and domain experts can now be replicated, or closely approximated, with tools that didn’t exist two years ago.
The recent software market selloff is partly a response to this reality. When Anthropic released new capabilities for Claude Cowork in late January, the market didn’t just sell legal tech stocks. The Goldman Sachs Software Index fell 30% from its October 2025 highs. Notably, the damage wasn’t limited to pure software plays. Data-driven information services companies that had long been considered defensible, companies like RELX and Thomson Reuters with decades of accumulated proprietary content, got hit just as hard.
The reflexive response from many software incumbents has been to fall back on their position as systems of record. Salesforce is the system of record for customer data. Workday for employee data. The argument is that if you’re where the data lives, you’re still safe.
But as Zain Hoda argued in a thread that circulated widely among builders and investors, this assumption may be fragile. Hoda’s point is that the data inside most systems of record is actually quite small, and that AI agents, once given API access, can pull a complete copy in seconds. At that point, the agent becomes the primary interaction layer and the system of record becomes a write endpoint. The data doesn’t move, but the value shifts to whatever layer the user is actually interacting with.
This doesn’t mean systems of record disappear overnight. But it does suggest that “we own the data” is a weaker claim than it used to be when the value can be disintermediated from a different layer. The moat isn’t the data sitting in the database. It’s whether you control the layer where decisions actually get made.
For these reasons, I’m increasingly skeptical of the proprietary data moat, at least in the traditional sense. True defensibility now comes from embedding data into tools that customers use every day, creating workflow-level lock-in, and building feedback loops where usage generates better data and better outcomes. Rows and columns in a database are not enough.
That said, the flip side is equally important: in the AI era, companies that possess genuinely proprietary data, data that is truly unique, deeply contextual, and cannot be approximated by a foundation model, will have an enormous advantage. AI makes commodity data less valuable, but it makes truly differentiated data more valuable. The companies that can feed proprietary signals into AI-powered products will build compounding advantages that are extremely difficult to replicate. The bar for what counts as “proprietary” is just much higher than most people think.
How to Actually Evaluate Data
If not binary, then how? A more useful framework evaluates data across several dimensions.
Utility. Does the data inform real decisions with real economic consequences? Or is it merely interesting?
Uniqueness in practice. This is different from uniqueness in theory. Is the data genuinely differentiated in a way that matters to buyers, or could a smart competitor approximate 80% of the value with publicly available inputs?
Enrichment potential. Can the data be combined with other sources to multiply its value? Some of the best data products are not single datasets but the intersection of several.
Scarcity. How hard is it to replicate at scale? Over time? This includes the underlying collection methodology, not just the output.
There are secondary factors: update frequency, historical depth, geographic or vertical coverage, and how painful it would be for a customer to source alternatives. But the four dimensions above are where I’d start.
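The framework above can be made concrete as a simple scoring rubric. The sketch below is illustrative only: the dimension names come from the framework, but the 0–5 scale, the weights (utility weighted heaviest, per the argument that exclusivity without utility is worthless), and the two example assets are all assumptions of mine, not anything from a real evaluation model.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    """Scores (0-5) on the four primary dimensions from the framework above."""
    utility: int      # informs real decisions with real economic consequences?
    uniqueness: int   # differentiated in practice, not just in theory
    enrichment: int   # value multiplier when combined with other sources
    scarcity: int     # hard to replicate the collection methodology at scale

# Illustrative weights (an assumption): utility dominates, because exclusive
# data that informs no decision is worth little.
WEIGHTS = {"utility": 0.40, "uniqueness": 0.25, "enrichment": 0.20, "scarcity": 0.15}

def moat_score(asset: DataAsset) -> float:
    """Weighted score on a 0-5 scale."""
    return round(
        asset.utility * WEIGHTS["utility"]
        + asset.uniqueness * WEIGHTS["uniqueness"]
        + asset.enrichment * WEIGHTS["enrichment"]
        + asset.scarcity * WEIGHTS["scarcity"],
        2,
    )

# An exclusive-but-niche dataset vs. commodity data embedded in a workflow:
niche = DataAsset(utility=1, uniqueness=5, enrichment=1, scarcity=4)
embedded = DataAsset(utility=5, uniqueness=2, enrichment=4, scarcity=2)
print(moat_score(niche), moat_score(embedded))  # the "less proprietary" asset scores higher
```

The point of the toy model is the ordering, not the numbers: under any weighting that favors utility and enrichment, the workflow-embedded asset outscores the exclusive-but-niche one, which is exactly the pattern described in the commercial real estate example.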
The Critical Distinction
Proprietary data is not inherently valuable. Its worth depends on how it connects to use cases, how deeply it’s embedded in customer workflows, and whether it can be defended in an environment where AI is making replication and approximation cheaper every month.
The investors and operators who understand this distinction, between raw datasets and actual data advantage, will build and back better companies. The ones who keep treating “proprietary data” as a magic phrase on a pitch deck will keep being surprised when it doesn’t translate into durable value.