The journey from Chaos Monkey to Gremlin with co-founder and CTO Matthew Fornaciari

The below is a full (unedited) transcript of a YouTube session / podcast episode I recorded with Matthew Fornaciari, co-founder of Gremlin, in Q3 2019. You can view the video or listen to the podcast on YouTube, Apple Podcasts, Stitcher, or wherever you get your podcasts.


Erasmus Elsner 0:00 
Hi, and welcome to another episode of Sand Hill Road, the show where I talk to successful startup founders and venture capitalists about the companies that they built and invested in. And the goal, like always, is to give you a sense of what it’s like to be in their shoes, how their businesses tick, and sometimes take a bit of a technical deep dive. Today, I’m super excited to be joined by Matthew Fornaciari, who is the CTO and co-founder of Gremlin, which is a pioneering startup in the space of chaos engineering. To really understand what Gremlin does, we first have to understand a little bit what chaos engineering is all about. So I personally first heard about chaos engineering when I read Antonio García Martínez’s book, Chaos Monkeys. What is the chaos monkey? So imagine you have a chimpanzee rampaging through your cloud data infrastructure, wreaking havoc left and right. And this is exactly the kind of software that Netflix developed in 2010, when they moved to the cloud. The aim was really to test the system’s stability by enforcing failures through the pseudo-random termination of instances and services. The Chaos Monkey resiliency tool, which was later open sourced by Netflix, really became the precursor of a whole range of resiliency tools known as the Simian Army. But even more so, it became the precursor of a whole new discipline of cloud computing systems architecture, known as chaos engineering. And the goal of this new discipline is really to experiment with software systems that are in production, in order to build confidence and resilience into the system’s capabilities. And Gremlin is really one of the first startups to offer chaos engineering on a SaaS or, as they call it, failure as a service basis. So the company has raised almost $27 million so far from some of the best names in the Valley, including Amplify Partners, Index Ventures, and Redpoint. 
And before raising the seed round back in 2016, Matthew and his co-founder, Kolton Andrus, actually worked on this chaos engineering problem space at some of the largest tech companies out there, including Netflix, which is obviously home to the open source Chaos Monkey project, the largest cloud provider, Amazon, and Salesforce. The company then went on to raise a $7.5 million Series A back in 2017, led by Mike Volpi at Index. And last year, the company was able to close an $18 million Series B, which was led by Tomasz Tunguz at Redpoint. So I’m super excited to have Matthew with me today. So let’s jump right in.


I’m really excited to be joined today by Matthew. He’s the co-founder and CTO of Gremlin, which is a pioneering startup in the chaos engineering space. Before we dive into the product and into the company, I want to spend just a few minutes talking a little bit about your founder journey. So Matt, you’re part of this very rare breed of founders who had the luxury, I would like to say, of having worked on this chaos engineering problem space for quite a while, for some years, with your co-founder, Kolton, at Amazon, where you were part of the fatals team. And then Kolton moved to Netflix, which obviously has pioneered the space with the Chaos Monkey. Tell me, at what point in this corporate environment did you basically catch the entrepreneurship bug and think that you would take the leap out into the cold and hard world of entrepreneurship?

Matthew Fornaciari 3:34 
Totally, it’s very important, the cold and hard world. Yeah, no, you’re very correct in that, you know, we were lucky to have already tried this out at some of the larger corporations. You know, we wrote this at Amazon and Netflix, and I did a little bit of work on it over at Salesforce. And honestly, you know, the whole idea of diving into the entrepreneurial world was a lot of just a conversation between me and Kolton being like, hey, I think we could in fact build this in a generic way for everyone. I do, in fact, think everyone could benefit from this practice, you know, and chaos engineering is, in fact, a practice. It’s very much the same as you would write unit tests or regression tests; this is very much something you should build into your, you know, development lifecycle. And that’s sort of what we did at Amazon, and, you know, we figured, eventually, once people kind of catch up to the juggernauts of Amazon and Netflix and, you know, Google, they’re going to need this practice as well. So we decided to take a little bit of a leap. It’s never easy, I guess, I’ll say. You know, it was definitely leaving a very cushy job for the both of us. But you know, you get the bug a little bit and you’ve just got to take a chance.

Erasmus Elsner 4:50 
So I listened to a podcast where Kolton talked about this decision of bootstrapping versus taking VC, and so you were also part of a very rare breed of founders in that you actually raised the seed pre-product, transitioning right out of the corporate world into a VC-backed startup. And I think there was this discussion with Kolton and a venture capitalist at a conference where he basically talked us through it. And I think Kolton has five kids. So obviously, there, the bootstrapping route would have been much tougher. So I’m just wondering what the conversations were like at this stage?

Matthew Fornaciari 5:27 
When you’ve got five kids, it’s a little harder to bootstrap. I, luckily, am not in that boat. But, you know, there are actually always trade-offs, right? Like, that’s part of the industry, right, you have particular trade-offs, whether that be technology-wise or whatnot. But, you know, we could have gone the bootstrapping route and just sort of tried to do this on our own. I don’t think we would have gotten this far, frankly. Really, what you get by raising money is you get a network with it as well. And that’s super helpful, especially in the early days, with respect to, you know, getting recommendations, getting introduced to people, figuring out who you should and who you shouldn’t be talking to, even. You know, we’ve been very lucky in that we’ve raised money from Amplify, Index, you know, Redpoint. They’ve all been fantastic in terms of increasing our network, increasing the people that we are able to talk to, increasing the number of people that we can bounce ideas off of. And so yeah, we could have absolutely bootstrapped, and, you know, what you trade is really ownership of the company. And, frankly, I’m willing to trade a little bit of equity in the company for some ideas from some people that have been there and done that. You know, it really helps to be able to bounce ideas off people that have seen this before. Yeah, for sure. We’re both first-time founders, right. So

Erasmus Elsner 6:59 
Absolutely, the first time at the rodeo is always the hardest. So let’s talk about the first days. So you’ve raised the seed round in 2016 from Amplify, you’ve got the money in the bank. So how were the first days? I imagine that you just sit down together and really work on the product 100% of your time, or maybe you were already hiring people, testing the market. Walk me a little bit through these early days of getting right out of the gate with a seed round, suddenly put in this position of being an entrepreneur.

Matthew Fornaciari 7:36 
Yeah, totally. I mean, first days are POC, they’re MVP, they’re get things out the door, you know, make something workable. And, you know, I love that you say, just sit down together. I wish that had been the case. You know, I’m in San Francisco, he’s in San Jose, he’s got five kids, he’s got a family. I’m not trying to go to San Jose every day, he’s not trying to go to San Francisco every day. So it was actually really remote from the beginning, which has actually sort of seeded the culture for our company, where, you know, we’re actually 52% remote right now, which is, you know, we don’t like to discriminate based on location. But the early days were a lot of just Kolton and myself going back and forth on design patterns, you know, trying to figure out how do we actually make this work and, you know, try to espouse our three core product principles, which are safety, security, and simplicity, in our original product. And that started with building a CLI. Let’s just start there, start easy, build a, you know, a command line interface. Cool. Now, let’s build an API. We’ll talk about the technologies involved with that later on down the road, when we actually get to it; we’re just building the CLI right now. And then, you know, build the UI on top of that, and neither of us are designers or UI engineers, so the UI was a little rough at best. But, you know, you do what you can to get by and to really be able to get out there and start to sell. And actually, one of the funnier things, I think, about us in our early days, our early sales, is we both ride motorcycles. So you would see the two of us roll up to the company and be like, all right, cool, you want to buy this? Believe me, it was definitely interesting. Founder-led sales are always crazy, you know, outside of that. 
So in terms of hiring, we hired a couple of people. We hired somebody in Germany, you know, as one of our first hires. Turns out the time difference is really difficult, and he ended up opting out. We hired somebody in Canada who also opted out after being with us for a little bit. But actually, our first hire, and still one of our better ones, is a guy by the name of Phil, who we found off AngelList.

Erasmus Elsner 9:52 
It’s interesting where you go in your first days, you know. And so I imagine at this point, you’re used to this really sophisticated cloud infrastructure from Salesforce, Netflix, Amazon. And obviously, these large companies are light years ahead in terms of running Kubernetes clusters or Hadoop instances. So how do you pitch it to companies that might not have a sophisticated enough infrastructure for the value of such a chaos engineering system to kick in? And how were these really early sounding rounds, before you had validation of the product?

Matthew Fornaciari 10:28 
Yeah, I mean, you know, you ask, how did we know? Like, you never really know in the early days. You know, it’s a lot of trial and error. And I think a lot of that has been, you know, honing our messaging. Anytime that you’re creating an entire category, there’s a lot of education that goes into it. And so we worked a lot with, you know, what resonates with people? What are people actually looking to do? How do they need to prove value? Like, how do they set up the chain, right? You know, one of the biggest pushbacks we got in the early days was, you know, we’d be like, cool, well, you do this controlled chaos, and they’d be like, oh, we’ve got plenty of chaos as it is, why would we ever do it on purpose? Right? You know, and so a lot of it became, well, would you rather do it at three in the morning, or would you rather do it at three in the afternoon, when you’ve got the caffeine coursing through your veins and that sort of thing? So all the messaging actually evolved over time. And, you know, it really helps that Kolton and I were both SREs at Amazon back in the day. That really gives you that feeling of what people are going through, and allows you to build up that grassroots appeal. But honestly, you know, we’ve got three kind of qualifying questions: Do you measure downtime? Is that downtime associated with a dollar value? Does somebody own that? And until you can answer those three questions, you may not quite be ready for chaos engineering. It takes a concerted effort.

Erasmus Elsner 12:09 
So maybe let’s dig a little bit into the product itself. So chaos engineering, it sounds like a fun exercise in a way: you break things, you break them again, until they stop working, and then you fix them, and you break them again. But obviously, you want to break things carefully, and make sure that you can revert to the prior state. So walk us a little bit through the architecture of Gremlin.

Matthew Fornaciari 12:30 
So the way we built this out is a little bit different than the way, you know, we built things at Amazon and Netflix, etc. You know, the idea is that everything is very locked down. We build out a compiled binary, it’s very safe. You know, safety, security, simplicity, I mentioned these, these are our core tenets for building things out. So the way we do it is actually by interacting with OS-level operations, right? So we actually go and interact with, you know, tools that you already have on your Linux box and use those to basically impose the impact. But for every single impact that we impose, we have a rollback, right? And that is really, I think, the thing that was neglected a lot, you know, especially as Netflix introduced chaos engineering as, like, a random, just throw stuff out there and see what happens. I think that’s a bit of a misnomer. You know, you really want to do it in a very controlled and careful way. And so, yeah, that’s sort of what we do. We built out, you know, a compiled binary built in Rust. You know, the memory and CPU footprint are tiny. It’s an agent that sits on the host. And it interacts with, you know, it’s a,

Erasmus Elsner 13:43 
It’s a Debian package, the agent that sits on the host. Yeah,

Matthew Fornaciari 13:49 
Yeah, we’re actually extending, you know, what’s available to Windows and AIX. So we’re actually building out more support based on OS. You know, you get definitely a huge difference with Windows, but even AIX, which is, you know, a Unix-like system, is a little different. So we’re working on expanding our footprint there and making it available to everyone. But the idea is, if you’re going to build something that can break stuff, you want to be able to build, you know, the reverse and apply it, so you can do these things with confidence.
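The "every impact has a rollback" idea described above can be illustrated with a minimal Python sketch. This is not Gremlin’s implementation (the agent is described as a compiled Rust binary driving OS-level tools); the `attack` helper and the in-memory `host` dictionary are hypothetical stand-ins chosen so the example runs anywhere without touching a real machine.

```python
from contextlib import contextmanager


@contextmanager
def attack(host, key, injected):
    """Impose an impact on `host` and always restore the prior value on exit.

    `host` is a plain dict standing in for real machine state (an assumption
    made for illustration only).
    """
    original = host[key]
    host[key] = injected      # impose the impact
    try:
        yield host
    finally:
        host[key] = original  # rollback runs even if the experiment raises


# Hypothetical host state: 5 ms of baseline network latency.
host = {"latency_ms": 5}

with attack(host, "latency_ms", 500):
    during = host["latency_ms"]  # value observed while the impact is active

after = host["latency_ms"]       # value observed after the rollback
print(during, after)
```

Putting the revert step in a `finally` clause means the rollback is tied to the impact itself rather than to the experiment finishing cleanly, which is the property the answer above emphasizes.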

Erasmus Elsner 14:25 
So the way I understand it, in the beginning you were building this core Gremlin system, and then you built this UX, basically this control plane, on top of it. How was this process in terms of development? What was the first feature set that you built? And then, basically, how did it evolve over time?

Matthew Fornaciari 14:45 
We’re very engineering-centric in the sense that we build out the atomic building blocks first, right? So we built out the CLI first, and then we built an API that it communicates with and that can control everything. Then we built the UX on top of that, but everything is API first. You know, as engineers coming from the Amazon and Netflix days, we build the API as sort of the, you know, the Word of God sort of thing, where everything goes through there, whether it be UI or not. But we also believe very strongly in simplicity. And if you don’t make things easy, it turns out engineers won’t use it. So the UX is really, you know, sort of layered on top to combine a lot of the API calls to make things easier. Well, we’ve actually seen a fair amount of API adoption, which to me is amazing. You know, we’ve actually had a couple of customers white-label our site just to make it a little bit easier for their engineers or whatnot. Yeah, that’s sort of the idea. And then, in terms of what we built out for the product, attacks are the atomic building block of what you get for chaos engineering. We’re actually going to be releasing something in the near future called scenarios, which adds a lot of metadata around that, where, you know, you can specify a hypothesis, an outcome, those sorts of things. So you can actually track your progress over time for a particular experiment. We build the smallest building blocks first, and then we layer things on top.

Erasmus Elsner 16:11 
It makes a lot of sense. So let’s talk a little bit about Docker containers and the Kubernetes container orchestration system and how Gremlin fits into this. I think I read that Gremlin can also be run in a container and is really well integrated into that infrastructure. I could imagine that for a lot of these customers, where the value kicks in, that’s exactly the kind of cloud infrastructure that they’re running.

Matthew Fornaciari 16:38 
So the way we actually attack containers is by attaching sidecars to these containers that are running and then, you know, being able to splice their network or share their disk space or storage, something along those lines. But the way it works, frankly, at sort of a higher level, is people don’t really actually understand containers and Kubernetes just yet. You know, especially Kubernetes. Kubernetes is supposed to be the be-all and end-all for all container management, you know, the silver bullet. But it actually has a lot of very interesting quirks. And, you know, making sure that it’s actually doing what you expect it to do, when you expect it to, is very important. So that’s sort of what we’re trying to enable people to do: be able to make sure that what they expect to be happening is actually happening, right? So if, you know, you expect a pod to die and new ones to just spin up, make sure that actually happens, right? But with Kubernetes in general, I mean, I’ll be honest, we’re a little lax on our support. And right now we’re building support, you know, in the coming months, for particular replica sets, pods, namespaces, services, to make it much easier to actually integrate with Kubernetes natively. I would say that it’s still a very new technology that requires a lot of experimentation by people that are migrating to it, and we want to be able to make them comfortable with that.
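The "expect a pod to die and new ones to spin up, then make sure that actually happens" check can be sketched as a toy simulation in Python. Everything here is hypothetical: `ToyReplicaSet` is an in-memory stand-in written for this example, not the Kubernetes API and not Gremlin’s container support, and a real experiment would observe an actual cluster instead.

```python
class ToyReplicaSet:
    """In-memory stand-in for a controller that keeps `desired` pods alive."""

    def __init__(self, desired):
        self.desired = desired
        self.pods = [f"pod-{i}" for i in range(desired)]
        self._next_id = desired

    def kill(self, name):
        """Simulate the attack: remove one running pod."""
        self.pods.remove(name)

    def reconcile(self):
        """Simulate the controller loop: start pods until the count is restored."""
        while len(self.pods) < self.desired:
            self.pods.append(f"pod-{self._next_id}")
            self._next_id += 1


rs = ToyReplicaSet(desired=3)
rs.kill("pod-1")          # impose the failure
degraded = len(rs.pods)   # replica count right after the kill
rs.reconcile()            # give the (simulated) controller a chance to react
recovered = len(rs.pods)  # replica count after reconciliation

# The expectation under test: the replica count returns to the desired value.
print(degraded, recovered, recovered == rs.desired)
```

The point of the sketch is the shape of the check: state the expected end state up front, inject the failure, then compare what actually happened against the expectation.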

Erasmus Elsner 18:15 
So let’s think about the product. What I was wondering was, what are the kinds of customers that it is built for? And you mentioned this a little with the three qualifying questions. On your website, you have these ecommerce examples of how the downtime minutes really translate into dollars lost in revenue. So would you say that that’s really ecommerce? Or?

Matthew Fornaciari 18:40 
Yeah, it’s absolutely a good question. I mean, frankly, it’s anybody who wants to make money on the internet, that’s my opinion. You know, it’s anybody who has a significant footprint on the internet. But, you know, those three questions obviously help a lot. And for ecommerce, it makes a lot of sense to them: you know, every second that we’re down, we lose X amount of money. But honestly, the target market is anyone. You know, we very much believe that this should be a part of the development lifecycle, the same way you build, you know, unit tests and integration tests. Like regression testing, resilience testing should be a part of your development lifecycle. So, you know, what happens when CPU and memory are pegged? What happens when I can’t talk to this particular service, etc.? Those should be things that engineers think about, especially in this, you know, new age of microservices everywhere.

Erasmus Elsner 19:31 
And the way that you sell is basically through trials and these experimentation sessions that you do with clients, where you basically get together with the engineering teams? Or how’s that process?

Matthew Fornaciari 19:50 
Yeah, no, so we run game days. The way we do it right now is our success team will actually sit down with a potential customer and run a game day. You know, I’ve been fortunate enough to fly out and sit with a couple of these teams and see them actually be like, well, this is what we expect to happen. Oh shit, that didn’t happen even remotely, right? You know, very much eye-opening sorts of things around, like, well, this is our expectation, we’ve never actually tested it before. You know, we expect to be able to fail over if us-east-1 goes down. Oh, that didn’t happen even remotely, right? You know, we basically ate all the traffic, right? So it’s very interesting to see people be able to specify what their expected outcome is and then test it. It harkens back to my, you know, early days of dealing with the scientific method and whatnot. You know, I was definitely one of those nerdy kids in high school who was very data driven and was very, like, cool, you say this is going to happen, let’s test it out. So it’s very interesting to see, you know, some of these companies, and they’re big companies, you know, and it’s not just ecommerce. You know, we’ve got airlines, ecommerce, fintech. We’ve actually started to get into the medical space; you know, when medical systems fail, ooh, even a bigger problem, right? So, you know, everybody has a footprint, and everybody’s interested in making sure they’re a little bit more resilient.
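The game-day pattern of stating an expected outcome and then testing it can be sketched in a few lines of Python. This is an illustrative toy, not Gremlin’s product or its scenarios feature: the `Experiment` class, the `route` function, and the region names are all hypothetical, and the "traffic router" is simulated in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str              # what the team believes will happen
    expected: str                # the outcome that would confirm it
    observe: Callable[[], str]   # measures what actually happens

    def run(self) -> bool:
        """Return True only if the observed outcome matches the expectation."""
        return self.observe() == self.expected


def route(regions, failover_enabled):
    """Toy traffic router: use the primary region, optionally fail over."""
    primary, *backups = regions
    if primary["healthy"]:
        return primary["name"]
    if failover_enabled:
        for region in backups:
            if region["healthy"]:
                return region["name"]
    return "error"


# Simulated failure injection: the primary region is marked unhealthy.
regions = [
    {"name": "primary", "healthy": False},
    {"name": "backup", "healthy": True},
]

hypothesis = "Traffic fails over to the backup when the primary is down"
configured = Experiment(hypothesis, "backup", lambda: route(regions, True))
untested = Experiment(hypothesis, "backup", lambda: route(regions, False))

results = (configured.run(), untested.run())
print(results)
```

The second experiment models the situation described above, where a team assumes failover works but has never exercised it: the hypothesis is identical, and only running it reveals the gap.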

Erasmus Elsner 21:15 
And for the last part, I want to talk a little bit about scaling and failures along the way. And as a failure as a service company, I have to ask you, really, what were some of the moments where you thought, well, this is just not working out, and we’re going to hit rock bottom? And where you had self-doubts, maybe about yourself, about the company? I’m not sure whether there were any such moments. From the outside, it looks like you’re hitting all your milestones, but there might have been some moments where you had some uncertainty, and I want to dig a little bit into those.

Matthew Fornaciari 21:43 
Yeah, no, never, never, never ever would have doubt. No, starting a company, that’s super easy. You know, you never think about it twice. No, there are a lot of times where you’re like, God, I don’t know, maybe this is not the right thing to do. Maybe we should go back to a nice cushy, you know, corporate job and whatnot. But yeah, I don’t think there’s any particular moment where I thought it wasn’t going to work. You know, I’ve been, I would say, particularly fortunate in that Kolton and I get along really well. And, you know, we’ve butted heads on a couple of things, but I’ve heard of some really horrendous founder stories, and we’ve been very fortunate that we tend to be on the same page for a lot of things. And I think that helps a lot. I don’t know, in the early days, you know, we were told we might be a little bit early to market. And, well, you know, we basically told them, like, nah, we’re going to kill it, don’t worry. But, you know, there’s a lot of education that comes with that. And so there were a couple of times where I’ve been like, maybe the world isn’t ready for this just yet. You know, as we’ve gone along, I think the world has kind of caught up. We talked a little bit about it a couple of minutes ago, but it does take a bit of education. And it takes a bit of people catching up to where, you know, the juggernauts are, Amazon, you know, Google,

Erasmus Elsner 23:04 
Building a little bit ahead of the market, actually. So now that you’ve reached scale, I think you have almost 50 employees or even more, I’m wondering, what is it that keeps you awake at night at this point?

Matthew Fornaciari 23:17 
Yeah, that’s a great question. It’s funny, I ask everybody I interview, you know, what keeps you awake at night? What gets you up in the morning? Those are my two, don’t tell anyone, but those are my two cultural questions. But what keeps me up at night? Now it’s twofold. One is very technical. One is very much, you know, are we ahead of the game in terms of security, in terms of safety? You know, if we were to screw up anywhere in terms of safety or security, we lose our customers’ trust, and our customers’ trust is really, you know, obviously with a lot of companies that’s sort of your bread and butter. But particularly with chaos engineering, you can cause an outage. You can cause an outage for your customer in production, and that reflects poorly on their brand. And that’s something that really keeps me up at night: how do we make sure that we can make this as foolproof as possible when people start to experiment a bit more broadly? Right? And a lot of it is education, a lot of it is building things into the product, you know, that sort of thing. So that’s one thing. And the other thing is just, I don’t know how many founders you’ve talked to about this, but culture. You know, culture as you grow and build the company, especially now that we’re in a growth phase, it’s paramount. You know, it’s incredibly important in terms of continuing to attract the right talent. And I actually tell a lot of people this as well, but, you know, we just hit our three-year mark in January, end of January. And, you know, I wrote a nice little note for the team and I was kind of like, cool. 
While we’ve built, you know, a fantastic product, and we’ve built an amazing sales and marketing engine, really what I’m most proud of is this team, you know, and being able to have everybody be just thrilled with coming to work every day and working on something that they really care about and that they’re really passionate about. And every single time you hire somebody new, you change that culture just a tiny bit. You know, keeping it close to the vest and as close to what you want, you lose the ability to do that after a while, right? You kind of set the groundwork, set the, you know, the cultural values and whatnot, and then you kind of see it grow from there. So,

Erasmus Elsner 25:45 
And I imagine it’s especially challenging given that you’re a remote company, right?

Matthew Fornaciari 25:50 
Well, it definitely doesn’t make it easier, I’ll say that. Yeah, we’re 52% remote right now. And, you know, we fly everybody out every now and then for different meetings. But, you know, three years ago, it was eight of us in one Airbnb at re:Invent, and we could very much control what people were thinking and how we talked about things and whatnot. And now we’re actually about to creep up to, like, 65, and, like, 52% of that is remote. Yeah, it’s a little bit more difficult to espouse those values. So we’re trying to do training and bringing people out when they come on board. But yeah, it’s difficult.

Erasmus Elsner 26:32 
So thank you so much, Matt, for giving me the time. And where can people find you and learn more about you and Gremlin, and learn more about chaos engineering in general? I saw that you had a conference organised recently?

Matthew Fornaciari 26:44 
Yeah, totally. So we’re hosting a conference later on, like you just said: Chaos Conf. So Chaos Conf, if you want to learn anything about chaos engineering or reliability in general, is a particularly interesting, sort of SRE-centric conference that’s coming up later this year in San Francisco. And we’re thrilled to have it grow about eightfold this year from last year, our inaugural year. So it’ll be amazing. We’ve got fantastic sponsors and whatnot. So it’ll be awesome. We’re really looking forward to it. Beyond that, you know, I’m on Twitter, barely, but I’m on Twitter at @callmeforni. Other than that, we’ve launched Gremlin Free to sort of democratize the practice, and we’re trying to launch a bunch of different tutorials and whatnot. So gremlin.com/free is super helpful if you want to learn about chaos engineering.

Erasmus Elsner 27:42 
Wonderful. I’m so excited and looking forward to following your journey. Thank you so much. I really appreciate you taking the time as well. Thank you. So this is it for today. I hope you found it useful. Gremlin is a super exciting company, I think, and I’m really looking forward to following their journey. And if you want to hear more about what I’m up to, you can always subscribe to my newsletter on sandhillroad.io or just subscribe to the channel and tune in next time. It’s up to you. Cheers, guys.