Exploring the latest in data science with Sarah Catanzaro from Amplify Partners

The below is a full transcript of a YouTube session / podcast episode I recorded with Sarah Catanzaro, partner at Amplify Partners, in Q1 2020, shortly before the coronavirus pandemic hit. You can view the video or listen to the podcast on YouTube, Apple Podcasts, Stitcher, or wherever you get your podcasts.

Trailer

Erasmus Elsner 0:07 
What’s up everybody? Welcome to another episode of the show where I talk to successful startup founders and their investors about the companies that they built and invest in. The goal, as always, is to give you a sense of what it’s like to be in their shoes, the paths their businesses take, and how they got to where they are today, to learn from the many successes and mistakes. My guest today is Sarah Catanzaro, Partner at Amplify Partners.

Sarah Catanzaro 0:29 
Thanks for having me.

Erasmus Elsner 0:30 
At Amplify Partners, Sarah invests in technical founders who are solving really difficult data science problems using machine learning, natural language processing, and good old artificial intelligence. Before making her way into venture, Sarah was actually working on such hard data science problems herself: trying to predict the behavior of Somali pirates as a research director at the Center for Advanced Defense Studies, and parsing and classifying thousands of startup funding data points as Head of Data at Mattermark.

In this session, we’ll talk about her journey from being a data scientist to becoming a partner at a fast-growing venture firm. We’ll hear her take on some recent trends in data science. And we’ll take a deep dive into some of her most recent investments at Amplify, including the investments in OctoML, InterVenn Biosciences, and, most recently, Maze Design. But let’s hear it from Sarah herself, and let’s dig right in.

Interview

Today, my guest is Sarah Catanzaro from Amplify Partners. Welcome, Sarah. You ready? Take it from the top.

Sarah Catanzaro 2:01 
I am, thanks for having me.

Erasmus Elsner 2:03 
Before we begin: we were actually, and I’m dating myself here a little bit, studying together in Paris back in the mid-2000s at Sciences Po, where I did, of all things, an Erasmus exchange, and you did an independent research stint during your undergrad at Stanford. After Stanford, you took a bit of an unusual path towards venture, in that you first took an academic position as a research director at the Center for Advanced Defense Studies, where I think you studied, among other things, the behavior of Somali pirates using advanced statistical models. Why don’t you take us back to those days? What was Sarah like coming out of college?

Sarah Catanzaro 2:45 
Yeah, so I think one thing that was interesting about that time is that it was kind of peak economic crisis. Initially, I was thinking about using statistics to go into investment banking, or consulting, or finance, or something like that. And then the economy fell apart. So as a sophomore, junior in college, I basically thought to myself: what am I really interested in? And for me, having grown up in New York during 9/11, I was very interested in counterterrorism and counterinsurgency. So I thought to myself: how can I, in fact, apply these computational methods, these statistical methods, to reduce the fatality of conflict, to make war less lethal? Which is really what brought me to the Center for Advanced Defense Studies. Our research really hinged upon two key questions. One has to do with the nature of warfare itself: you want to know how to fight an optimal conflict, where you have the best outcome with minimal casualties. But how can you learn about conflict? It’s not like we’re going to engage in more wars so that we can get better at war; that would make no sense. So simulation ends up playing a rather big part, and one of our goals was to figure out how we can use these models, these agent-based simulations, to better understand how to optimize conflict. Now, another key question, which I think I still see today, is: given incomplete and uncertain information, how do you leverage statistics and computational models to extract the most insight? When dealing with insurgencies, when dealing with asymmetric warfare, even when dealing with state actors, you’re never going to have perfect information. But given the intelligence that you have, you want to make the most out of it. And frankly, I think that sort of problem statement is something that appears not only in military science, but also in science and investing and all of these other domains, for sure.

Erasmus Elsner 5:02 
So you spent some more time in the intelligence industry, working later at Palantir. Fast forward to 2014: you joined Mattermark as their Head of Data. For the listeners out there, Mattermark is a startup which provides a data platform for venture capitalists. I found this YouTube video of yours from your Mattermark days, where you explained a little bit about the ways that you worked with venture data there, including how you scraped the data, how you cleaned and classified it, and how you really built this large Mattermark database. I feel like this position at Mattermark was really the ideal preparation for you: not only did you apply NLP and ML models, but you also worked on VC funding data. So take us back to these Mattermark days, and how this has really shaped you and prepared you for your role as an investor at Amplify.

Sarah Catanzaro 6:09 
So I think that there was something really interesting about my role at Mattermark. At the time, we weren’t the only company, the only platform, that was trying to collect information on private companies. In fact, there were others, ranging from PitchBook to Crunchbase to Bloomberg. But those other platforms were relying very heavily on human intelligence, whether it was crowdsourced intelligence or having these armies of analysts that could call companies to collect and verify their data. I had a data team that ranged from five to ten people, and so human intelligence in that context was not really going to scale. I had to figure out: how can we automate these processes of discovering more about companies? And, going back to my days at C4ADS: given the information that we have, what other intelligence can we produce? I think one of the challenges that I found at Mattermark, and that investors see today, is that startup data is just incredibly noisy. You have companies that announce their funding maybe a year after they actually signed a term sheet; you have companies that pivot and therefore have multiple names associated with them, so you have to resolve those entities. There are all of these types of challenges that make it really, really hard to take a computational approach to investing. But I think that the information we collect, and how we synthesize it, can in fact inform the other things that we just do as humans, whether it’s interacting with founding teams or drawing conclusions based on our own operational experiences. The latter was probably most impactful for me. So at the time that I started at Mattermark, there was no data team. But I lucked out in that there was an engineer on the engineering team who had a background in ML, and together we identified these ways in which we could start to use machine intelligence and automation to real strategic advantage.
So in the course of my time at Mattermark, we ended up, in fact, deploying deep learning to production for features ranging from semantic search, to NLP to extract key facts from articles on startup funding, to even simpler things like industry classification or business model classification. And all of that was automated. This sounds dreamy and grand, but it was really hard. We would have issues associated with our ML pipelines almost every week; the models were really tough to monitor and very, very challenging to debug. A lot of the model serving and deployment we needed to figure out on our own, since it was fairly new. I think I, as an investor now, can in many ways just go out, look at the tools that exist in the world, and invest in the platforms and tools that I wish I had back then as the data science lead at Mattermark.

Erasmus Elsner 9:25 
Yeah, makes a lot of sense. So I’ve worked a lot with venture capital funding data during my PhD, and I know how messy it is. But one question I get asked all the time is: how can you use this data to basically predict outcomes? The more pessimistic take is that this is just pattern matching, and that you fit existing biases into the model when you use this data. What’s your general take on using data to sort of guide investments?

Sarah Catanzaro 9:53 
So I think the key word that you just used there was using data to guide investments. Given the limitations of startup data, I actually don’t think we can automate investing; I don’t think you can use data and machine intelligence to pick the right companies. You’ll need some mix, some synthesis, of human intelligence, data, and machine intelligence and automation. I think there are ways in which we can use data, for example, to identify industries that are burgeoning or otherwise ripe for disruption. I think you can use data to perhaps get a sense of what companies are growing very quickly in terms of headcount, and perhaps that’s a positive signal that you want to pay attention to. But ultimately, these insights that you produce need to be used in service of some other human decision. And I don’t think that you can automate that.

Erasmus Elsner 10:51 
In 2017, you joined Amplify Partners. Amplify Partners is a relatively young venture firm, having been started in 2012. I think your first fund was $50 million, the second $125 million in 2015, and you’re now investing out of the third vintage, a $200 million fund raised in 2018. The fund’s thesis is around distributed computing infrastructure and data analytics companies, and with investments in companies such as Fastly and Datadog, you have already seen some large exits in this vertical. But this is all press and desk research; maybe give us a little bit of your personal flavor of Amplify.

Sarah Catanzaro 11:40 
Yeah, absolutely. So I think you got the facts right: Amplify is the first investor, an early-stage investor, in companies in machine intelligence, data management, and distributed systems. So, roughly speaking, ML and data science tools and platforms, as well as vertical applications and APIs, infrastructure, and developer and designer tools. But that’s what we do, not necessarily why we do it. And I think why we do it comes down to two things: one is passion, the other is need. So I’m at Amplify because I love data science and ML; it’s what I’ve been doing throughout my entire career, and it’s a community that I want to support. I don’t really care that much about mattress startups; I don’t really care that much about underutilized resources and marketplace models. What I really care about is data science and ML, and everybody on the Amplify team is aligned in that we’re investing in these areas not just because we think that there’s a market opportunity, but because it’s what’s most interesting to us. And if we’re going to be spending a lot of time with founders, we want our passions to be aligned with theirs. Now, the other thing that I mentioned was the need. I think what we saw in the past seven to eight years is that there’s been kind of a bifurcation in the software world between those companies that are focused on technical innovation and those companies that are focused on business model innovation. So in the former category, you have companies that are leveraging ML and new approaches to distributed systems to create new product categories. And then you have some companies that are taking existing products, perhaps with a new distribution channel or selling to a new market segment, that are really more focused on the business model.
I think what we realized is that technical founders didn’t really have a great partner. They didn’t really have someone who could say: don’t give me the 30-second elevator pitch; I know that there’s no way that you can distill what you’re doing into a five-minute demo day presentation. We’re going to work with you to help you craft your message; we’re going to work with you to help you understand whether you should take a top-down or bottom-up strategy, and how you should think about resolving technical debt and data acquisition debt. We can focus on these challenges that are really specific to technical founders and technical products. And in that way, I think we can fulfill a key need that exists in the market as well. So that’s really what Amplify is about: at the end of the day, we are the first partner for technical founders.

Erasmus Elsner 14:39 
So it’s technical founders solving hard technical problems, if I can summarize it a little bit like this. Let’s maybe segue into a more general discussion about data science, and I want to kick this off with something that you’ve discussed before, which is the divide between big data and artificial intelligence. If we think about the typical artificial intelligence implementation in a corporation today, this typically starts off with an engineering team being hired to sort of create this data lake, probably using a Hadoop instance. Then we bring in a data science team, which is meant to implement a machine learning model, training a neural network. And in the end, this is meant to be productized. But there are a lot of cracks along the way: the data could be messy, the model could be overfitted, or the business objective could be ill-defined at the beginning. From your personal experience, talk to us a little bit about how you think about this divide between big data and artificial intelligence.

Sarah Catanzaro 15:42 
Absolutely. So I think in the age of big data, we kind of recognized that data could be a strategic asset. We didn’t necessarily know how it would be a strategic asset, but organizations came to view data as valuable, and therefore we developed a lot of new technologies to collect and store data. This is around the time where you have Hadoop, HDFS, even things like Spark. So we can collect data, we can potentially transform it, and we can store it. Now fast forward a couple of years, and we have this AI hype cycle. So nowadays, companies are hiring a bunch of data scientists and research scientists to build algorithms, to develop neural network models of the data. But there’s a lot that happens in between. As you alluded to, there’s certainly a data preparation step. But even before that, there is an access and discovery step, and I find that many, if not most, companies don’t even necessarily know what data they have, where it is, or how to access it. Once you have some sort of data catalog or data lineage system, the next question becomes: is this data high quality? I think by now, enough VCs and other thought leaders have broadcast the message that with ML you have this garbage-in, garbage-out problem, and that is very much the case. So how do we take the data that we’ve collected and ensure that it is high quality enough to produce strong output? Now, frankly, my belief is that with higher-quality data, we can actually use very simple models to get good results. In many cases, a linear model will do; you don’t need deep learning. Even as you progress within this journey, you start to see challenges around feature management, feature engineering, and model serving, and it continues from there.
Now, one of the things that is just most mind-boggling to me today is that for those organizations that are deploying ML into production, where they’ve solved some of these problems around data access, data discovery, data preparation, and feature engineering, it’s kind of like the wild wild west in production. When we deploy software, we have tests; we do monitoring, tracing, logging; we have all of these fail-safes. In the case of ML, we deploy models into production knowing that the world is dynamic, data will change, and therefore the model will probably behave in unexpected ways. And yet none of that exists. So frankly, what I hope to see in the next few years is more emphasis on monitoring and debugging models, and kind of performance management within these production environments. I think on the research side, some of the new work on things like interpretable ML is finally enabling this.
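Sarah’s point that clean data often makes simple models sufficient can be illustrated with a minimal sketch. This is purely an illustration with made-up numbers, not anything from Mattermark or Amplify: on clean, roughly linear data, a closed-form ordinary least-squares fit recovers the trend with no deep learning at all.

```python
# A closed-form ordinary least-squares fit for a single feature, in pure
# Python. Illustrates the "simple model on high-quality data" point; the
# data below is made up for the example.

def fit_linear(xs, ys):
    """Return (slope, intercept) via OLS: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Clean data generated by y = 2x + 1 exactly; OLS recovers it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_linear(xs, ys)
```

On noisy or mislabeled data the same fit degrades quickly, which is the garbage-in, garbage-out problem in miniature.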

Erasmus Elsner 19:07 
Yeah, that’s a great segue into interpretable machine learning (ML). People want to know what training data forms the basis, and also what the model is; only then will they trust these models and their outcomes, in particular with respect to autonomous vehicles and healthcare. I think it’s really interesting, but talk to us a little bit more about this interpretable ML.

Sarah Catanzaro 19:35 
So I think interpretable ML means different things to different people, and my hope, again, is that in the next few years we will start to better understand these definitions and which interpretability approaches best align with each definition. In some cases, for example in credit scoring or in making decisions about crime and punishment, what we really care about is fairness guarantees: we want to know that our model is not leveraging protected attributes and therefore not perpetuating social bias. So that’s one category of interpretability: how do I ensure that my model is not using certain attributes? Yet another category is uncertainty estimation. If we have these systems like autonomous vehicles and medical diagnostics that are powered by AI, then in order to orchestrate human decision making and machine-driven decision making, we need to better understand how certain the model is in its prediction. So if, for example, I have a diagnostic and it says this person is not responding to a certain cancer treatment, I don’t know if the model is highly confident that that is the right answer, or if it’s just chance. And without that sort of insight, it’s really hard for me as a physician to act on that information. So I think uncertainty estimation is really a critical area of ML research. Yet another area that I think is worthwhile to touch upon, since it relates to monitoring and debugging, is error analysis: I want to better understand on what subsets of data my model will fail. There are other aspects of interpretability that we could cover, but I think for each of these, we really need to think about what the key question is, and therefore what the right technical approach is.
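The uncertainty estimation Sarah describes can be sketched with one common, model-agnostic technique: the bootstrap. This is a generic illustration with made-up numbers, not any particular diagnostic system; the idea is to resample the data, recompute the statistic (here, a mean score) on each resample, and treat the spread of the estimates as a measure of confidence.

```python
# Bootstrap uncertainty estimation in pure Python. A wide spread of
# resampled estimates signals low confidence; a narrow spread signals
# high confidence. The "scores" below are invented for illustration.
import random
import statistics

def bootstrap_estimate(samples, stat=statistics.mean, n_boot=1000, seed=0):
    """Resample with replacement and return (center, spread) of `stat`."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        estimates.append(stat(resample))
    return statistics.mean(estimates), statistics.stdev(estimates)

# Made-up model scores for one cohort; mean is about 0.63.
scores = [0.62, 0.58, 0.71, 0.65, 0.60, 0.68, 0.63, 0.59]
center, spread = bootstrap_estimate(scores)
```

A downstream decision rule could then abstain and defer to a human whenever `spread` exceeds a chosen threshold, which is exactly the human/machine orchestration she mentions.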

Erasmus Elsner 21:47 
Yeah, for sure. So let’s move on, maybe, to some of your most recent public investments. I want to kick this off with a company called OctoML, where Amplify Partners participated in the $3.9 million seed round, which was led by Madrona Venture Partners. I tried to wrap my head around what OctoML is doing. In the description, it says the company aims to automate machine learning optimization with the open-source Apache TVM software stack. I wasn’t familiar with TVM, so I found this 2018 paper titled “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”. TVM is already in production at companies such as Facebook, Amazon, and Microsoft. The paper came out of the University of Washington, which OctoML is a spin-out of, and one of the authors is also one of the founders, it seems. So, long story short, my understanding of what they’re doing is that they’re basically bringing machine learning to hardware devices, in particular to low-power CPUs and mobile GPUs. Since this is a really highly technical play, maybe talk to us a little bit about what they’re doing over there in layman’s terms.

Sarah Catanzaro 23:05 
Yeah, absolutely. So I guess, to put OctoML in context, let’s go back and think about five, six years ago. Everybody acknowledges that Nvidia played a very critical role in enabling AI to come out of hibernation, to come out of its winter. This was possible not just because Nvidia provided GPUs, but because Nvidia provided CUDA. And I think CUDA is perhaps the unsung hero of the ML world. Without something like CUDA, it would be very difficult for ML algorithm developers, for software developers, to actually interface with the underlying hardware. But CUDA is for Nvidia, and now you see all of these new AI chips coming to market. In fact, yesterday I was looking at the MLPerf benchmarks, and there are just dozens of chip manufacturers now entering the market. But it’s not easy to marry those new specialized hardware back ends with the software itself. What OctoML does, what TVM does, is really enable you to take any ML model from any framework and deploy it on any hardware back end. And this is really critical both to the software developers, those that are trying to optimize things like large transformer models, which are really hard to get to work in a memory-efficient, low-latency way, and to the hardware vendors, who want to support new operations and iterate on their hardware platform faster, but don’t really have a compiler stack that enables that. So the key takeaway is just: with OctoML, you can take any model and deploy it optimally on any hardware back end. Now, as it relates to how I sourced it: I do spend a lot of time looking at the ML research community, because frankly, I think a lot of innovation is started there. So I actually got to know the team at OctoML while they were still at UW. I think, from my perspective, I saw all of these opportunities for ML at the edge.
Frankly, though, I think one of the most exciting AI products is just Alexa; for the first time, my 96-year-old grandmother can listen to music on demand. And I think there’s a lot more opportunity for AI at the edge, but it’s really hard right now. So I was really impressed by how this team at UW was working to enable this new AI market. Fortunately, there was a little bit of right place, right time: I built a relationship with the team through the UW research lab, and they shared this vision, they shared this view about the commercial opportunity, and decided to take the plunge.

Erasmus Elsner 26:23 
Were they actively raising, or were they on your radar and you sort of said: if you want to make this into a company, we will look into this? What was it like?

Sarah Catanzaro 26:33 
Yes. So I fundamentally believe that no VC should ever force or try to compel a founder to start a company. Someone who is taking on the responsibility of a founder needs to make that decision themselves. So the most that I can really do is say: I see an opportunity here; if you do too, and if we’re aligned, I think Amplify could be a great partner. But it is not my role, nor do I think it is my business, to convince anybody to start something before they’re ready. So with OctoML, frankly, I knew some of the team members for nearly two years before we made the investment. The nice thing about being at a specialized fund, where all I really focus on is ML, data science, and data management, is that I can play the long game; I can invest in these relationships and wait for two years if that’s what it takes.

Erasmus Elsner 27:31 
So let’s move on to the next investment, InterVenn Biosciences, which is even more complicated. Amplify here participated in the $9.4 million round led by Genoa Ventures, with the participation of True Ventures and BoostVC. The company is active in the field of clinical proteomics, and I’ve learned that proteomics is about cataloguing and sort of categorizing proteins, and then glycomics is about doing the same for carbs. From what I understand, it can help with better biomarkers and target discovery. In the Medium article related to this investment, you mentioned that most companies using AI for biomarker discovery are focused on genomics, despite the fact that the genome stays relatively stable over the lifetime of an individual. And so my question in the beginning was really: why is this a data play? But it turns out that an enormous amount of data is required to do this analysis, and historically it would take a month to comb through this data. But maybe give us more flavor on InterVenn Biosciences.

Sarah Catanzaro 28:44 
Yeah, so like you mentioned, your genetic code doesn’t really change over time. So if I’m trying to think about diagnostic products that use genetics, you can tell me whether I am at risk for breast cancer, but that risk score is not really going to change that much. In comparison, your proteins change in response to the environment, and so what that means is that your proteins are in fact a much more relevant diagnostic and prognostic biomarker, which is why InterVenn Biosciences is focused on proteomics. They’re looking at something called post-translational modifications of proteins: your proteins undergo a process called glycosylation, and by examining this process, again, you have more insight into how the body is changing over time in response to the environment. Now, it might sound like a lot of science and not much data. But in the past, to examine these glycosylation patterns, we would use a machine called a mass spec, which the InterVenn team does use. You would have to have people look at the output of the mass spec, which looks kind of like this very jagged line chart, and actually hand-quantify each of the peaks. And so this could take a really, really long time. So just one of the applications of machine learning within the context of InterVenn is automating this peak-picking process with computer vision. Like I said, that is just one way in which the team is actually leveraging machine learning. Another is certainly in the diagnostic context of determining, amongst the population, who has, in their case, ovarian cancer and who does not.
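The peak quantification Sarah describes can be sketched in its most naive form. To be clear, this is a toy illustration of the idea, not InterVenn’s actual computer-vision approach: scan the jagged intensity trace for local maxima that rise above a noise threshold, which is roughly what an analyst does by hand.

```python
# Naive peak picking on a 1-D intensity trace, in pure Python. A toy
# stand-in for manual mass-spec peak quantification; real pipelines use
# far more robust signal processing and computer-vision models.

def pick_peaks(trace, threshold):
    """Return (index, height) for each local maximum above `threshold`."""
    peaks = []
    for i in range(1, len(trace) - 1):
        if trace[i] > threshold and trace[i - 1] < trace[i] >= trace[i + 1]:
            peaks.append((i, trace[i]))
    return peaks

# Made-up jagged trace with two clear peaks above the noise floor.
trace = [0.1, 0.3, 2.5, 0.4, 0.2, 0.5, 3.1, 0.6, 0.1]
peaks = pick_peaks(trace, threshold=1.0)
# peaks -> [(2, 2.5), (6, 3.1)]
```

Even this toy version shows why automation pays off: the same scan runs over millions of points instantly, where hand quantification took weeks.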

Erasmus Elsner 30:50 
With respect to this investment: Genoa Ventures is more of a specialized biosciences venture firm. With BoostVC and True Ventures, I was sort of surprised to see them in this round. Were you brought on for the data science piece? Maybe give us some flavor on how this round came about.

Sarah Catanzaro 31:06 
There are very, very big differences between biotech investors and tech investors: in the way that they think about outcomes, in the way that they structure investments, even in the way that they behave in the event of an IPO. And frankly, given those differences, I think that in this category of bio-IT, or tech-bio, it’s really important to have a syndicate that includes both perspectives. In doing any sort of deal at the intersection of the life sciences and the computational sciences, we’re always going to look for an investment partner who has more expertise on the biotech side, and certainly Genoa is one of the best partners we could ask for in that respect. I think the other thing is that in the case of a tech-bio investment, the way that you mitigate risk is also different. While we understand how to mitigate platform risk and technology risk, we’re never going to be the best at mitigating biology risk, science risk. And that’s just another example of why having these partnerships, these syndicates, is so critical. So what’s nice with the InterVenn round is that we have a diverse set of perspectives, and we can provide the founding team with as much guidance as possible on how each type of investor thinks about the future of the company, so that they can make the right choices as they move forward.

Erasmus Elsner 32:44 
Yeah, for sure. So let’s talk about another investment of yours, one that is easier to understand: Maze Design, an investment in the design space where you led the $2 million seed round, which was announced in early 2020. With companies such as Figma, InVision, and Sketch making design work much more collaborative, we’ve not seen the same for the testing phase of designs, and this is what Maze Design is really doing. I’ve learned that it costs about $3,000 to $5,000 to run usability tests, and what Maze Design is really providing is a research platform that makes this much faster and less cost-prohibitive. The company seems to have a lot of traction, with 10,000 users already. Talk to us a little bit about how you came about this investment, and what’s the mission?

Sarah Catanzaro 33:36 
Yeah, absolutely. Like you mentioned, the user testing and usability research process right now is incredibly tedious: you have designers, product managers, and UX researchers who may spend hours watching videos. Now, the alternative is to release a product, see how the market responds, and iterate from there. But that’s really costly. So what Maze is enabling is a more data-driven approach to product development and product design, such that companies can iterate earlier in the design process and get the benefits of user testing and usability research without the pain of just sitting there and watching hours and hours of video. Like you, I think that seems like a no-brainer. But what’s really interesting about Maze from an investment perspective is that one of the things we fundamentally believe is that data science, data, is not just going to change the world and the workflows of people who are called data scientists or data analysts or ML developers; we believe that people in every role should be able to leverage data. And I think in that respect, we’re very aligned with the founders of Maze, who believe that product designers and product managers need more data, they need more high-quality data, so that they can make the right decisions.

Erasmus Elsner 35:14 
Moving on, the last investment I want to discuss is a company called Bayes. They only have a landing page at this point; my understanding is it’s a no-code or low-code data analytics platform. What is it that they’re building, concretely? I think they went through Y Combinator, and you backed them in a seed or pre-seed round?

Sarah Catanzaro 35:32 
What I can say is what the Bayes team sees, and what we’ve observed: frankly, analysis is not intuition. Analysis is a skill that you have to cultivate over time. It involves a mix of defining research questions and projects, of leveraging statistics, of understanding data visualization. And so the idea that anybody should be able to do analysis with the existing tools that they have is just false. I volunteer for a couple of nonprofits doing data science and analysis work, and I see time and time again that these organizations are trying to make data-driven decisions, but analysis is hard and they don’t have the right tools. And that’s really where Bayes comes in: for someone who does not have a background in data analysis, they make it easy to just get the insights. And frankly, I would suggest that everybody sign up for the beta, because the experience is just magical. The experience of going from data set to an analytical product, it just feels like magic.

Erasmus Elsner 36:53 
So where can people find out more about you? You’re on Twitter, and then you have the newsletter?

Sarah Catanzaro 36:58 
Yeah, absolutely. So I do author a weekly newsletter, where we highlight interesting research papers, interesting open-source projects, and other pieces of content. It’s pretty concise; there are three of each category, and we release it week over week. I’m relatively active on Twitter, although perhaps not relative to other VCs. If you want to know more about my interests, ranging from abstract expressionism to military science, you can check out the Amplify website too, which is just amplifypartners.com.

Erasmus Elsner 37:42 
Thank you, Sarah, for taking all this time today and telling us a little bit more about what you’re doing over at Amplify. I’m looking forward to following your journey. Thank you.