Selling data is kind of a cool business

When we were doing diligence on Sentiment Investor (acq #4) I knew I was going to have to rebuild the thing from scratch. Spending $1,400 a month just on Firebase hurts my soul, and the whole architecture was super Google-d out (i.e. expensive).

For context, Sentiment Investor scrapes social media sites (and soon news sites!) and runs natural language processing on the content. The idea is that Main Street can sometimes move markets, so what Main Street is talking about is something Wall Street should know. This is certainly true, almost obvious, in crypto land. We've seen it this year in equities too, in the stuff loosely placed under the umbrella of "Meme Stocks".

One question I didn't get a sufficient answer to in diligence was just how much data we were buying. From the looks of it, it was maybe a few million rows of social media data I could quickly crunch through and port to a more scalable architecture. Once again, I was wrong (actually this should be the name of this blog).

What we bought was a fuck ton more data than I could have imagined. Perhaps too much data.

This is over 60 million social media posts. 🤯

60 million rows. And counting. The data migration is still going. It's been running on 3 servers at my house and still can't get through it all quickly. I should also note that just getting the data out of Google is going to be super expensive.

But 60 million rows is non-trivial. This is a pretty meaningful sample size. Is much of this stuff in the database garbage? Yes! But there's some signal in the noise and that's this product's job.

Normally as a startup you come to the table with something small and specific. But 60 million rows is a good number to mention on a sales call to hedge funds who would otherwise have to go do this themselves. That's a much better sales pitch than I thought I'd be starting with when we purchased this one.

Data as a business is kind of cool. There's a whole world called "Alternative Data". Google it. It's fascinating. It touches not only machine learning data sets but also social media data like we're doing. Scraping, cleaning, and organizing data is a job at every company over a certain size. And for hedge funds or others that trade on equities, better data can mean higher alpha, and that's the only thing that matters in that world (other than perhaps spinning, but I digress).

Is data the new oil? I'm not totally convinced. In some cases, it's everything. In other cases it's a commodity. There are a ton of cool startups coming up with ways to artificially generate data sets for machine learning that produce results as good as real world data. That's amazing, but it means real world data is no longer a moat. It can even be a liability, depending on what kind of data it is (PII for example, a la GDPR, CCPA etc).

In the specific case of Sentiment Investor, I think we have a large audience of businesses who actively need this data for their own experiments. They can use the machine learning we do so they don't have to process everything themselves, or they can just grab the entire database and do whatever they want. It's a scrape once, sell twice kind of business.

Other than the (massive amount of) data, the primary value of this tiny co is its current customer pipeline. We're buying access to financial institutions that would otherwise be hard to get into. I could justify the entire purchase price just based on the pipeline.

Alright, back to the data migration. Happy Friday!

P.S. we bought another co already (#5!), just haven't had the time to write about it yet. It's another YC company :)

Next year we're going to start raising funds on a deal-by-deal basis. Our track record so far is pretty good:

toybox - exited for profit (great IRR) (YC Company)
screenshot api - 6x
sheet.best - 3.5x
sentimentinvestor.com - new
workclout.com - new (YC Company)

If that sounds interesting to someone you know, I'd love to connect with them!

✌️,

Andrew