From SIEM to Detection as Code

Show notes

00:00 - Introduction to Jack Naglieri and Panther
01:23 - Discussing the possibility of keeping pace with modern threats
02:08 - Common reasons organizations seek out new SIEM solutions
03:29 - The concept of detection-as-code and its benefits
05:50 - Challenges of monitoring diverse cloud environments and SaaS tools
07:17 - The importance of identity in correlating security data
08:46 - Best practices for collecting and organizing security data
10:51 - The essentialist approach to security monitoring and alert management
12:46 - Team structures for handling security alerts and investigations
13:48 - Recommendations for fine-tuning SIEM alerts
15:35 - The role of SIEMs in threat remediation and prevention
17:07 - Learning from patterns in security events
19:20 - Views on AI and machine learning in threat detection
19:59 - Log retention strategies and considerations
21:12 - Addressing unknown threats and improving visibility
22:55 - Using SIEMs for retroactive incident analysis
23:31 - Current challenges facing security teams
24:38 - Tips for transitioning to detection-as-code
25:45 - Cost considerations when adopting new SIEM solutions
27:21 - Jack's key tip: Focus on intentionality in security operations

Key Takeaways:

Detection-as-code offers improved governance, collaboration, and scalability
Start with a clear understanding of critical threats to your organization
Balance comprehensive monitoring with intentional, focused alerts
Consider cloud-native SIEM solutions for cost-effectiveness and scalability
Regularly review and update security playbooks and runbooks

Resources Mentioned:

Panther: Cloud-native SIEM platform
Detection at Scale: Jack Naglieri's podcast and blog

Show transcript

S1: .829 Welcome to Access Control, a podcast providing practical security advice for fast-growing organizations, advice from people who've been there. In each episode, we'll interview a leader in their field and learn best practices and practical tips for securing your org. For this episode, I'll be chatting with Jack Naglieri. Jack is a seasoned security practitioner turned entrepreneur. He founded Panther in 2018; Panther is a cloud SIEM backed by detection-as-code, taking on traditional security information and event management software. Along with being Panther's CTO, Jack also hosts a brilliant podcast and blog on Substack, both called Detection at Scale. Similar to Access Control, the Detection at Scale podcast interviews security practitioners and shares how they manage and respond to threats. For today's episode, I plan to dive into Jack's years of insights from operating and building SIEMs, his learnings from supporting fast-growing companies as they get insights from their logs, and his playbook for responding to threats. Hi, Jack. Thanks for joining me today.

S2: .935 Hello, hello.

S1: .414 So to kick things off, detecting threats is a never-ending arms race between attackers and defenders. With the rise of advanced persistent threats and AI-written malware, do you believe it's even possible for detection and response to keep pace with the speed and sophistication of today's attacks?

S2: .012 Of course. I have to be optimistic, right? I can't say, "No, we're doomed. There's nothing that we can do." Of course, there are things we can do, and we love to think about monitoring and detection and controls as a defense-in-depth approach. So it's all about redundancy. It's all about checkpoints. It's about auditing. It's about having visibility into everything at different layers and having a really strong playbook to understand what we're detecting and why.

S1: .026 And then when people come in initially, are they coming in looking for protection, or have they had an incident and then discovered Panther? Or do they have a traditional SIEM that they're looking to replace? What's the main entry point for people deploying a SIEM?

S2: .473 I think a lot of teams struggle with just the scalability side. They have a ton of data. Data volumes are going to continue going up forever, and data is the new gold. So we really need a strong foundation to make sure that we're structuring our data. We're doing security data lakes and we're sort of step-functioning the SIEM from a monolithic system to a more pipeline oriented system, something that's cloud native, something that allows you to do more sophisticated analyses like real-time pipelining and all these other different types of things that we're going into. And now with the advent of gen AI, what can we use our SIEM data for to help understand and improve the comprehension of log data? It's really about starting in the new wave of cloud native, building a SIEM that allows you to automate as much as possible, and to sustain the costs and improve your capabilities over time. And you really need a strong data platform for those things to really work.

S1: .799 So I guess people have that foundation of a strong data platform, and they may have standard things that they detect for. And I know from my introduction that Panther is different because it's detection-as-code. Can you talk about, in that context, why detection-as-code is better than other solutions out there?

S2: .438 I don't want to say it's better. I think it's more effective, and it takes a lot of the practice that we're seeing in the field and puts it into a product that multiple people are contributing to. So detection-as-code is about, one, automation. Two, it's really about governance as well. And then three, it's about composability and reuse and collaboration, being able to, again, start with intention: what are the things that we need to detect, and why? What's our overarching playbook? What do we do after the fact? How do we declare these things as code and really set them in stone? Because if our SIEM goes down tomorrow, or we blow it away, we want to be able to retain the core logic somewhere. And I think in the past, that's been spread around a lot of different places. So declaring things as code, both from an infrastructure perspective and now from a detection logic perspective, brings everything together into one nicely created package, and it makes it easier to manage your program over time, right?

S2: .410 When everything's as code, when everything is structured, you can get great metrics on it. You can see, am I growing? Am I getting better over time? Are things getting worse over time? Are we getting better at detecting the things that are core for our organization to stay secure? And detection-as-code is a framework for all those things. This isn't something we invented, by the way. I mean, this had been happening in teams. When I was a practitioner, I worked a lot with other peer companies. So when I was at Airbnb, we worked with Netflix a lot, and they'd contributed some code into our open-source project and things like that. But companies like Netflix, Twilio, even Airbnb when I was there, were really focused on this idea of a SOCless type of SOC: not having analysts whose whole job is to look at every event manually, but using automation in both detection and response, and in all the other things that precede that around data prep and scalability. It's really just about, what can we be doing to run with a leaner team at higher scale? And how do we maintain that over a 5 to 10 year period? That was really the context for starting Panther. It was taking those concepts and pulling them forward.
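
To make the detection-as-code idea concrete, here is a minimal sketch of what a rule declared as code might look like. It is loosely modeled on the Python rule style Jack describes, not any vendor's exact API; the function names and CloudTrail field paths are illustrative assumptions.

```python
# Minimal detection-as-code sketch (illustrative, not any vendor's exact API):
# one rule, versioned in a repo next to its metadata and tests, instead of
# living as an ad-hoc query inside a SIEM console. Field names follow the
# CloudTrail schema.

def rule(event: dict) -> bool:
    """Fire when an IAM user is created by something other than our IaC role."""
    actor_arn = event.get("userIdentity", {}).get("arn", "")
    return event.get("eventName") == "CreateUser" and "terraform" not in actor_arn

def title(event: dict) -> str:
    actor_arn = event.get("userIdentity", {}).get("arn", "unknown actor")
    return f"IAM user created outside of IaC by {actor_arn}"

def severity(_event: dict) -> str:
    return "HIGH"
```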

S1: .058 And I guess it follows the infrastructure-as-code benefits that we already have. You can stand up one region, but if you want to stand up regions around the world, you take your infrastructure with you, and you probably want the same response and detection standardized across all those environments, because otherwise, if you don't have a standardized system, you're probably going to miss something.

S2: .943 Yeah. In enterprise, there's a lot of different pockets of architecture and infrastructure and products that all have slightly different playbooks and threat models. So detection-as-code is like the base of that, and because you're writing things as code, you get the composability and the modularity and all these reuse concepts. So you can say like, "Well, my baseline threat model are these 10 things, but for these other environments, I'm going to layer on additional logic that's unique for those." And it gives you that structure to be efficient in your creation and management of your program, which is great.
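
The layering Jack describes can be as simple as a baseline check shared by every environment plus an overlay that tightens it for specific accounts. This is a hedged sketch; the module layout, field names, and account ID are placeholders.

```python
# Illustrative layering: a baseline check shared everywhere, with an overlay
# that tightens it for production accounts only.

# --- baseline shared by every environment ---
def is_console_login_without_mfa(event: dict) -> bool:
    return (
        event.get("eventName") == "ConsoleLogin"
        and event.get("additionalEventData", {}).get("MFAUsed") == "No"
    )

# --- production overlay: reuse the baseline, add environment-specific logic ---
PROD_ACCOUNT_IDS = {"111111111111"}  # placeholder account ID

def rule(event: dict) -> bool:
    if not is_console_login_without_mfa(event):
        return False
    # In production, any non-MFA console login is worth an alert.
    return event.get("recipientAccountId") in PROD_ACCOUNT_IDS
```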

S1: .117 Probably another shift is we've gone from on-premise to cloud. Amazon has the shared responsibility model. But then also nowadays, you don't even necessarily run or own your database, for example. You could be using PlanetScale or Snowflake. And so you could build a majority of your stack just using off-the-shelf SaaS products. So how do you think about detecting and alerting on this range of tools that another party is running for you?

S2: .234 Well, I think the access to data has shifted slightly. It's become more abstract, right? It's not so much about looking at the packets anymore as it is hitting an API and getting some logs back. So I think we've had to adapt in that way. And also the breadth of different logs has gone up so much since 10 years ago, when everything was behind a firewall, and obviously Teleport's played an interesting part in that transition as well, right? Access control and zero-trust and all these things have really changed the way that we do all of our jobs in security. Yeah. I think it's just a different layer of abstraction now. Getting really good at correlating and connecting the dots across sources has been one of the key challenges that we've tackled as well.

S1: .370 Yeah. And can you tell me how you tackle that problem when people get started with Panther?

S2: .153 It starts with identity. You have to have the source of truth in your identity provider first, and then you need a way to connect the dots together. It's tricky when you go into cloud because you're not always using the same usernames as you are in your identity provider, and you have to have entity resolution, all these transformations and things, but you can make it work on a variety of different attributes. It really just depends on the detection logic that you want to use. Sometimes you want to merge on other things that aren't identity-related but still represent a user moving through your systems, like an IP or something. Or maybe it's a derived combination of fields that also helps improve the types of detections you want to do.
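
A hedged sketch of that entity resolution step: collapse the different usernames, ARNs, and emails seen in each log source to one canonical identity from the identity provider, so detections can join on a single key. The mapping table and identifiers are invented for illustration.

```python
from typing import Optional

# Toy entity resolution: map per-source identifiers to one canonical identity
# (the IdP email), so events from different systems can be joined on one key.

IDENTITY_MAP = {
    "jane@example.com": {
        "jane",                                  # Okta shortname
        "jdoe",                                  # GitHub handle
        "arn:aws:iam::111111111111:user/jane",   # AWS principal ARN
    },
}

def resolve_identity(raw_actor: str) -> Optional[str]:
    """Return the canonical IdP identity for a raw username/ARN, if known."""
    for canonical, aliases in IDENTITY_MAP.items():
        if raw_actor == canonical or raw_actor in aliases:
            return canonical
    return None  # unresolved: fall back to other attributes, e.g. source IP

# Events from Okta and CloudTrail now collapse to the same identity.
assert resolve_identity("jdoe") == resolve_identity(
    "arn:aws:iam::111111111111:user/jane"
)
```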

S1: .466 I mean, I know there are lots of best practices for Amazon, always to have-- [inaudible] will create more CloudTrail events than using a shared login, and we've talked a lot about how shared logins have always been a problem in an organization.

S2: .100 What? What do you mean? [laughter] We actually just use one username and password.

S1: .916 Perfect.

S2: .362 Just put it in a text file somewhere.

S1: .331 Yeah, hopefully people don't do [crosstalk]--

S2: .194 That was a joke. [laughter]

S1: .264 And then on the data point, there are lots of different silos of audit logs and data. How do you think about collecting them, and even knowing where all these silos are, and capturing them into one system? What are some approaches that you've seen be successful, whether you're refreshing your SIEM or starting from scratch? How do you think about planning for all of the known entities that you should start capturing?

S2: .209 It's funny you asked this question. We're writing a blog on it right now that explains the thought process behind it, but I'll give you pieces of the SparkNotes. Everything has to start with intention. I think, just generally, that's a good life principle. If you're going to go into some new venture or whatever it is, what is your intention in doing that? And in SIEM, costs can balloon so much so quickly, and you can get barraged with alerts and all these things that are just noise. The noise doesn't come from bad rules. It comes from a lack of intention, in my opinion. If your intention is really focused, then, for example, you say, and this is something we talked about in the blog, what's the worst thing that could happen to the company? What's the most severe breach that could occur? Let's start there. Okay, what do we need to do to both add great controls to prevent that and have layers of monitoring so we're looking at every step of the kill chain in multiple dimensions? We want total coverage and eyes on it, and we should feel really confident that if anything gets close to happening here, we're going to know about it immediately.

S2: .565 When you approach it from that mindset versus the mindset of, "I want 100% coverage of my kill chain and I want 1,000 rules enabled," there's no intention with that model. It's just turn everything on and hope for the best. And then what ends up happening is the people who are triaging those alerts have that threat model in their head of, "Well, I know this thing would be really bad, or this atomic thing could lead to our production data getting leaked or something." But it's not explicit enough, and it results in us going down rabbit holes that don't really lead anywhere. I used to see this a lot as a practitioner. We would do threat intel matching on any log, right? Any IP that goes through our system, and our system was gigantic. And then what ends up happening is it's not related to a particular part of the kill chain. It's just looking at any match. And then we waste a whole day doing a forensic image on a machine and learn that it's literally just someone downloading something from uTorrent who made a few connections to an IP that was on a known-bad list. So again, you can waste so much time on these things. It just really isn't great for anyone. So we prefer the essentialism mindset when it comes to it. Start with the things that matter the most and work backwards from there.
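
One way to encode that intentionality, sketched under assumed field names and placeholder data: instead of matching every IP in every log against a threat intel feed, scope the match to traffic that actually maps to the threat model, such as outbound connections from production subnets.

```python
import ipaddress

# Scoped threat-intel matching (sketch): only alert on known-bad IPs when the
# connection is outbound from a production subnet, i.e. a plausible
# command-and-control or exfiltration step, rather than on any match anywhere.
# Field names (srcAddr/dstAddr) and the lists below are placeholders.

KNOWN_BAD_IPS = {"203.0.113.50"}                        # placeholder intel entries
PROD_SUBNETS = [ipaddress.ip_network("10.20.0.0/16")]   # placeholder ranges

def rule(event: dict) -> bool:
    src = event.get("srcAddr", "")
    dst = event.get("dstAddr", "")
    try:
        src_ip = ipaddress.ip_address(src)
    except ValueError:
        return False
    outbound_from_prod = any(src_ip in net for net in PROD_SUBNETS)
    return outbound_from_prod and dst in KNOWN_BAD_IPS
```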

S1: .840 Yeah, and definitely it probably stops alert fatigue, which I can imagine could be a big problem with your SIEM if everything's always on fire, everything's really critical.

S2: .355 Of course, yeah. And I think that alert fatigue is a byproduct of not thinking about what you do with the signals that you create. Going back to intentionality, it's intentionality through every step. What is the thing? Why is it important? How would you detect it? And what do you do when you detect it? And how can you reduce the amount of human steps along each part of that playbook?

S1: .060 Based upon the severity, who do you see becoming ultimately responsible for the resolution of the alerts coming in?

S2: .799 It varies. Every team is different. I've heard CISOs say, "We have a single team that does detection and response." I've heard people say, "We have a split team where we have a set of detection engineers," and their whole job is to instrument systems, get data, write rules, make it high fidelity. And then the other side of the house is for investigation, digging deep into what happened, and then sort of rinsing the learnings and providing it back to the other team. So it really just varies on the culture of the leader in that team, but we've seen both ways.

S1: .700 And then as people are getting up to speed, they're firing up detection-as-code, and it's fine-tuned and repeatable. What are some other recommendations you have for really dialing in the alerts you're getting from your SIEM? At what point should you, let's say, wake up a team? At what point can you review it on a Friday? How do you think about detection and response? This probably also depends on the team, but primarily for startups and smaller organizations.

S2: .243 I think when you take the intentional approach and you have a great playbook for the parts of the threat model that are important to you, then it'll become really apparent when things are not okay. You'll start to see the signs. And if you have enough layering and logging and alerting in depth, then you should see multiple signals that all point in the same direction. But again, you need a really great way of grouping those things together logically. And I think the idea of starting with a playbook or a threat model really helps with that. It helps connect and group those steps in the kill chain so they're actually all relevant to one particular thing, versus looking at each of them in isolation when there's seemingly no relevance to them. I think it becomes really obvious when things are bad, but it really varies. Sometimes there's only one signal that you get. I mean, you can't know everything ahead of time, which is, I think, why a lot of people end up just turning everything on. But again, there's no intentionality with that, so it doesn't really help. It just creates more signal that's not really useful.

S1: .919 At Teleport, we often share a stat that it takes 60 minutes from the initial--

S2: .287 [crosstalk]--

S1: .428 --attack factor to get privilege escalation. I think this is from the Verizon data breach report. And there's lots of claims that there's AI-written malware, there's advanced persistent threats, people always getting access to your infrastructure. So I guess, we can assume zero days people getting access, going to be much quicker. How do you think the future of SIEMs will sort of self-remediate or sort of stop actors? To your point of automation, how can automation sort of stop these attacks before 60 minutes? Because not everyone's going to be able to get woken up at 4:30 in the morning and stop that attack. How do you think about the future?

S2: .768 Well, I don't think it's the SIEM's responsibility to do remediation. I think that would be a weird expectation. The SIEM's job is to correlate. It's to correlate and declare the things that you need to monitor, and it's supposed to bring together all the data. It's not supposed to prevent, in my opinion. I think that the verticalized tooling around different parts of the environment should be the preventative and contextualized detection controls. And the SIEM should be connecting all the dots and saying, "Well, you had this alert in Okta, and then you had this alert in GuardDuty, and then you had this alert here, and here's some additional logic that we did on top of that to bring it all together." But the SIEM should not be the primary brain when it comes to the particular vector that was exploited, in my opinion. I think you can get more value from going to the source. CrowdStrike's a great example of this. They have this amazing network effect where they're analyzing in the context of all of the sensors ever deployed in CrowdStrike. The depth of insight that we can get from that is really great, and we shouldn't try to replicate that in the SIEM. It just doesn't really work well. I think of it more as: individual controls should be really good at prevention, whereas the SIEM should be telling you the story of what's happening.
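
A minimal sketch of that connect-the-dots role: group alerts from different sources by resolved identity inside a short window and only escalate when independent signals line up. The alert shape, sources, and thresholds are assumptions for illustration.

```python
from collections import defaultdict
from datetime import timedelta

# Toy correlation: escalate when one identity trips alerts from at least two
# different sources (e.g. Okta and GuardDuty) within a one-hour window.

WINDOW = timedelta(hours=1)

def correlate(alerts: list) -> list:
    """alerts: [{'identity': str, 'source': str, 'time': datetime, ...}, ...]"""
    by_identity = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        by_identity[alert["identity"]].append(alert)

    incidents = []
    for identity, items in by_identity.items():
        for i, first in enumerate(items):
            window = [a for a in items[i:] if a["time"] - first["time"] <= WINDOW]
            if len({a["source"] for a in window}) >= 2:
                incidents.append({"identity": identity, "alerts": window})
                break  # one incident per identity is enough for the sketch
    return incidents
```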

S1: .218 So talking about the story you alluded to, it's often a pattern of events: it's the Okta alert, it's the GuardDuty alert, it's someone running a command out to a weird IP address. What else can we learn from these patterns of events that you see?

S2: .997 You can learn the common sequences in the kill chain, right? You can walk away from it knowing, "Okay, we've had a few incidents, and they've followed this path." You could even use that to create new playbooks. You could say, "Okay, well, what if this was a novel path? Would they start one column further to the left of the kill chain or something?" You can just learn from the patterns that you're seeing in the wild. And because people are looking for new vectors all the time, it helps you stay up to date, right? Versus looking at things that are antiquated, either due to technology or new controls or whatever else it may be. I was actually diving really deep into S3 breaches, and I wrote a blog about that on Substack. And one of the techniques I learned through doing that research was that people just set up bucket replication to their own buckets, and I'd never thought about that before. Or some people were talking about [inaudible] through S3 server access log requests, and I was like, "Whoa, that's pretty clever." So there's just such a variety of things. And I think this is why logging in depth and monitoring in depth is so important, because attackers could use a multitude of different vectors that are known or unknown. So hopefully you detect some of them, and then you fill in the gaps once you have indicators, right? And then you could say, "Okay, well, we know that this particular thing happened. Let's take some indicators and look in the rest of our corpus of logs and try to answer the parts that are blurry for us." And then you can create new controls based on that, and you can create new monitoring based on that.
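
For the bucket-replication technique mentioned above, a hedged detection sketch: flag CloudTrail PutBucketReplication calls whose destination bucket is not on an allow-list of buckets you own. The requestParameters paths are simplified assumptions about the record shape and should be adjusted to the events you actually see.

```python
# Sketch: alert when S3 replication is configured to a bucket we do not own.
# The exact nesting of requestParameters varies; treat these paths as
# illustrative and adjust to the records in your own environment.

OWNED_BUCKET_ARNS = {"arn:aws:s3:::corp-logs-archive"}  # placeholder allow-list

def rule(event: dict) -> bool:
    if event.get("eventName") != "PutBucketReplication":
        return False
    rules = (
        event.get("requestParameters", {})
        .get("ReplicationConfiguration", {})
        .get("Rule", [])
    )
    if isinstance(rules, dict):  # a single rule may arrive as a dict
        rules = [rules]
    destinations = {r.get("Destination", {}).get("Bucket", "") for r in rules}
    return any(dest and dest not in OWNED_BUCKET_ARNS for dest in destinations)
```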

S1: .179 Yeah. If I think of a specific example, Log4j was a classic example. You start to see people trying to attack your servers with a specific string that [crosstalk] out overnight. And not everyone knows; you may not have any Java apps, but that one open-source app you run could have Log4j in it, and then you're like, "Oh, we never even knew this had Log4j in it," because you didn't get a full [inaudible] of all the software that you run.

S2: .134 Yeah. Exactly.
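
The Log4j example maps naturally to a simple regex detection over HTTP or application logs. The pattern below catches the common ${jndi:...} form but is deliberately rough (obfuscated variants will slip past it), and the event field names are assumptions.

```python
import re

# Rough Log4Shell-style detection: look for JNDI lookup strings in request
# fields attackers commonly inject into (URI, User-Agent, arbitrary headers).
JNDI_PATTERN = re.compile(r"\$\{jndi:(ldap|ldaps|rmi|dns)://", re.IGNORECASE)

def rule(event: dict) -> bool:
    candidate_fields = [
        event.get("requestUri", ""),
        event.get("userAgent", ""),
        " ".join(str(v) for v in event.get("headers", {}).values()),
    ]
    return any(JNDI_PATTERN.search(field) for field in candidate_fields)
```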

S1: .608 And then, across all these patterns, it seems like there are some new tools around that are really good at pattern matching: large language models, various transformers. What's your view on the future of pattern matching for all of these events?

S2: .652 I'm still diving really deep into it right now, right? Vector databases, I think, are really fascinating. But I also want to make sure that it's the right technology for the job versus just saying, "We're going to shove everything into an LLM and just pray for the best." [laughter] Because it's expensive, right? You can rack up a lot of cost very quickly if you're just trying to be experimental. So I think that a lot of the tech that's come out with the models and vector databases are really promising for some of the stuff that we do in SIEM all the time.

S1: .022 And then for people who are sending all of their events in, how long do you tell people that they should keep all of their logs, whether it's in the SIEM or whether it's in their own glacier bucket?

S2: .491 I think it varies, but generally, for proactive detection, those logs should live in your hot storage or your core database. And then for the things you want just for IR purposes, I recommend putting them in cold storage and rehydrating them in the event of an incident. That's things like VPC Flow Logs, ALB and HTTP-style logs, network logs in general, the things that are just so massive. We see this pattern where people will filter them out or route them somewhere else.
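
One hedged way to implement that routing on AWS: land the high-volume logs in an S3 bucket and let a lifecycle policy push them to Glacier-class storage, keeping only the proactive-detection sources hot. The bucket name, prefix, and day counts are placeholders to adjust for your own compliance requirements.

```python
import boto3

s3 = boto3.client("s3")

# Keep bulky IR-only logs cheap: transition to Glacier after 30 days and
# expire after a year (placeholder values; tune to compliance and appetite).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-vpc-flow-logs",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "ir-cold-storage",
                "Filter": {"Prefix": "AWSLogs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```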

S1: .641 Do you think they should be deleted at some point in the future, like--

S2: .581 Depends.

S1: .843 --five years? It just depends upon the--

S2: .813 It depends. It's fully based on the compliance that the company has to meet and their appetite for incidents. But I think for IR, a year is pretty good. I feel like if you haven't found it within a year, I don't know. [laughter]

S1: .160 Yeah, you could be in trouble. And there's also the risk that SIEMs only capture the known knowns; there's potential exposure from things that aren't monitored. How do you think about helping your customers get visibility into other systems or tools or services which may not even be captured within the SIEM?

S2: .002 It all goes back to the playbook, right? Are you able to see the things that you care about? And if not, why? If it's a logging problem, there's probably a tool that will let you get it in. And then when it comes to unknown unknowns, I think that, again, the verticalized solutions are going to be better in those cases because they have a more interesting data set that's highly honed. So I'll keep using CrowdStrike as an example. They can deploy models into their agents that we can't really do in our SIEM. And also, when it comes to SIEM, people are really focused on privacy, so using training data across customers-- I don't know, I don't think it's effective, and I don't think that people would like that from a privacy perspective. So my preference is always around, what can we do locally within that customer account? What can we learn from that that we could apply? It's less about trying to auto-magically detect things and more about, how can we best stitch together the stories? And I think that might reveal new patterns, versus trying to find the breach vector at the exact moment. It's more like, how can we bubble up all the signals that we do know about? And I think we can get pretty far with that.

S1: .081 And is there any, let's say, case for incidents? Often, large incidents happen, and one of the reasons is that there was no monitoring or alerting on that system. Is there any way you can use a SIEM retroactively? Let's say you have these things in cold storage and you say, "Okay, we want to learn more about this incident," but it was six months ago. Are there any tools or techniques where you can use a SIEM retroactively to inform that discovery or pattern matching?

S2: .252 Yeah, for sure. And I think this is actually a cool area for AI. And we're seeing some companies pop up that do this exact thing where their whole job is just to go through your data set and run related queries and just answer questions, and sort of go through the web of questions and answers. SIEMs can operate historically. They can operate in real time. It really just varies on the technology that it's powered by and the data that you have available. But yeah, people do this all the time.
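
In practice, that retroactive work is mostly SQL over the security data lake once the relevant logs are rehydrated. The sketch below assumes an Athena- or Snowflake-style engine; the table, columns, indicators, and date range are placeholders.

```python
# Retroactive IOC sweep (sketch): once archived logs are rehydrated into the
# data lake, hunting is mostly SQL. Table and column names are placeholders
# for whatever your lake actually uses.

IOC_IPS = ("203.0.113.50", "198.51.100.23")  # indicators from the incident

QUERY = f"""
SELECT eventTime, srcAddr, dstAddr, bytes
FROM vpc_flow_logs_rehydrated
WHERE dstAddr IN {IOC_IPS}
  AND eventTime BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-07-01'
ORDER BY eventTime
"""

# Hand QUERY to whichever engine fronts your cold storage; the point is that
# the logs only need to exist, not to have been alerted on at the time.
print(QUERY)
```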

S1: .841 And then when you work with your customers, what are some of the biggest challenges going into this year that they're facing?

S2: .566 It's actually kind of shocking, because I think the industry as a whole is probably a lot less mature than people think, which is kind of a scary sentence. But I think it's just doing the basics, ramping into detection-as-code. It's still a relatively new thing, and just making sure that we have the data, right? Those are the primitive things that we help everybody with. And some people are warming up to this idea of detection-as-code, but it's going to be a long road. People are just making that change now from the world where everything's in Splunk, everything's a Splunk query, to things being declared as code. They're pushed through a repo. They're using a programming language like Python. And we're operating on top of a data lake instead of some monolithic thing. So just making that transition is probably the challenge, and implementing all these things that we're talking about. It's a big change.

S1: .100 Getting the basics covered. Do you have any recommendations for-- let's say someone's within an organization, they're using Splunk, and they're interested in the concepts of infrastructure-as-code. What are some tips to convince their boss that this is a better method of detection?

S2: .439 I think there are a few things. The part about governance is huge: being able to manage it all in a repo, declare it, have it in one place, having the collaboration elements there, protecting against regressions. All of that is a very easily sellable benefit. And then it comes down to the freedom of expression, being able to more easily declare these playbooks and these correlations, and being able to connect dots in a much more powerful and robust way. That in itself is a huge save. I mean, I didn't even mention the scale and cost side. Everyone's very focused on dollars and cents, obviously. So being able to say, "Oh, this is going to be 10x cheaper at scale," that's probably one of the most compelling things to start with, and then, "Oh, by the way, you also get these benefits of detection-as-code and a lot more of the strong foundations that you need for doing AI and really engineering the future of security operations."
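
The regression-protection argument becomes tangible once rules live in a repo with tests that run in CI before any change merges. A hedged pytest-style sketch, with the rule under test inlined to keep it self-contained:

```python
# Sketch of the governance win: detection logic lives in a repo, and CI runs
# unit tests against known-good and known-bad sample events before any change
# merges. Event shape and rule are illustrative.

def rule(event: dict) -> bool:
    return (
        event.get("eventName") == "ConsoleLogin"
        and event.get("additionalEventData", {}).get("MFAUsed") == "No"
    )

def test_fires_on_login_without_mfa():
    event = {"eventName": "ConsoleLogin",
             "additionalEventData": {"MFAUsed": "No"}}
    assert rule(event) is True

def test_ignores_login_with_mfa():
    event = {"eventName": "ConsoleLogin",
             "additionalEventData": {"MFAUsed": "Yes"}}
    assert rule(event) is False
```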

S1: .152 Kind of another follow-up. If people are transitioning, do you see them shift a few test systems to start off with, or is there a migration path that you have for people moving from legacy to newer systems?

S2: .074 People don't like multiple systems. They just want to run one.

S1: .773 Just run one. Yeah.

S2: .451 Yeah. Especially now when cost considerations are a huge reason for buying software, right? Before it was just more of, is this something that could help us? And now it's like, how is this helping us? Show me. It's become just so much more value-oriented and cost-oriented.

S1: .775 Yeah. What about the cost benefit of deploying? I know organizations that already have a SIEM can see cost savings, but if someone is bringing one into an organization for the first time, it's a new cost on top of the other logging that they already have. Any tips on helping people sell the extra cost of investing in a robust SIEM tool?

S2: .559 I mean, everyone has a SIEM, right? It's not a new thing. It's not a nice-to-have, it's a necessity, right? At a certain stage, you need one, or else you can't answer questions for your audits or really do any monitoring without going to systems individually, which is a terrible world. [laughter] I think the side-by-side comparison helps: this is how much it costs in a cloud-native system that does all the things we're referring to, compresses the data, structures it, routes it into cold storage, leaves things in hot storage strategically. There's a lot of easy spreadsheet math that [laughter] you can use to sell. Yeah.

S1: .474 Nice. Well, to close out the podcast, we like to close with one practical tip that an organization can do this quarter to secure their infrastructure. What's the one tip you can give people today?

S2: .116 I'm going to keep harping on the intentionality piece, because I think it's just so important. Do you have a very clear sense of what the worst possible breach could be for your organization? And do you have multiple ways of detecting and seeing it? And do you have a great runbook for how you'd respond to it? And have you done the tabletop to confirm that you know what you're doing there? That's the one tip. I think if you're doing that really well and then you sort of stack-rank the playbooks and the list of bad things that could happen, and work backwards and start measuring the effectiveness of those signals, you're going to be in a great place when you do that.

S1: .238 Great. Awesome. Thanks, Jack.

S2: .001 Thanks for having me on. Really appreciate it.

S1: .300 This podcast is brought to you by Teleport. Teleport is the easiest, most secure way to access all your infrastructure. The open-source Teleport Access Plane consolidates connectivity, authentication, authorization, and auditing into a single platform. By consolidating all aspects of infrastructure access, Teleport reduces attack surface area, cuts operational overhead, easily enforces compliance, and improves engineering productivity. Learn more at goteleport.com or find us on GitHub, github.com/gravitational/teleport.
