Editor’s Note: On February 17, 2021, HaystackID shared an educational webcast designed to inform and update legal and data discovery professionals on how organizations should be prepared to respond to cyber-related incidents after a data breach. While the full recorded presentation is available for on-demand viewing, provided for your convenience is a transcript of the presentation as well as a copy (PDF) of the presentation slides.
Only a Matter of Time? From Data Breach Discovery to Defensible Incident Responses
In this expert presentation, cybersecurity incident response, legal discovery, and privacy experts share how organizations should be prepared to respond to a cyber-related incident while also gaining insight into cutting-edge data discovery technologies and proven document review services to support the detection, identification, review, and notification processes required by law after sensitive data-related breaches and disclosure.
+ Incident Response: Overview of Services and When to Engage to Use Them
+ Identifying Sensitive Data Post Breach via Impact Assessment: Establishing Scope and Reducing Risk
+ Post Breach Discovery: HaystackID ReviewRight Protect™ Document Review Services and Technology Enablers to Support Isolation, Collation and Organization of PII/PHI Relating to Breached Entities
+ Notification and Disclosure Reporting: Obligations, Best Practices, and Methods for Streamlining Entity Notification and Disclosure
+ Michael Sarlo, EnCE, CBE, CCLO, RCA, CCPA – Michael is the Chief Innovation Officer and President of Global Investigation Services for HaystackID
+ John Wilson, ACE, AME, CBE – As CISO and President of Forensics at HaystackID, John is a certified forensic examiner, licensed private investigator, and IT veteran with more than two decades of experience.
+ Jenny Hamilton, JD – As the Deputy General Counsel for Global Discovery and Privacy at HaystackID, Jenny is the former head of John Deere’s Global Evidence Team.
+ John Brewer – As Lead Data Scientist, John serves as the Head of Advanced Technology Services for HaystackID.
Hello, and I hope you’re having a great week. My name is Rob Robinson. On behalf of the entire team at HaystackID, I’d like to thank you for attending today’s presentation and discussion titled, “Only a Matter of Time? From Data Breach Discovery to Defensible Incident Responses”. Today’s webcast is part of Haystack’s monthly educational series conducted on the BrightTALK network and designed to ensure listeners are proactively prepared to achieve their cybersecurity, computer forensics, eDiscovery, and legal review objectives. Our expert presenters for today’s webcast include four of the industry’s foremost subject matter experts on data and legal discovery, with extensive experience supporting all types of audits and investigations.
The first introduction I’d like to make is that of Michael Sarlo. Mike is the Chief Innovation Officer and President of Global Investigations for HaystackID. In this role, Michael facilitates innovation and operations related to cybersecurity, digital forensics, and eDiscovery, both in the US and abroad. Michael is also a leader in developing and designing processes, protocols, and services to support cybersecurity-centric post-data breach discovery and reviews.
Secondly, I’d like to introduce John Brewer. As Chief Data Scientist for HaystackID, Mr. Brewer serves as the Head of Advanced Technology Services for the company, driving the development and application of data analytics, tools, and technologies to identify, organize, and prioritize information insight to support data-driven decisions.
Next, I’d like to introduce John Wilson. As Chief Information Security Officer and President of Forensics at HaystackID, Mr. Wilson is one of the foremost privacy, security, and computer forensic experts in the field of eDiscovery. He is also a certified forensic examiner, licensed private investigator, and information technology veteran with more than two decades of experience working with the US government in both public and private companies.
Last, but certainly not least, I’d like to introduce Jenny Hamilton. Now, Jenny joined HaystackID in 2020 as the Deputy General Counsel for Global Discovery and Privacy. She joined HaystackID after 14 years of leading the development of John Deere’s eDiscovery operations, where she headed the Global Evidence Team.
Welcome, everybody. We will record today’s presentation for future viewing, and a copy of the presentation materials will be available for all attendees. You can access these materials directly beneath the presentation viewing window on your screen by selecting the Attachments tab on the far left of the toolbar beneath the viewing window. Also, the recorded webcast will be available for viewing directly from the HaystackID website and from the BrightTALK network after completing today’s live presentation.
And at this time, I’d like to turn the mic over to our expert presenters, led by Mike Sarlo, for their comments and considerations on post-data breach discovery and review.
Thanks so much, Rob, and thanks, everybody, for joining today’s webcast, “Only a Matter of Time?” And I can tell you, for most organizations, that’s the case. You’re going to experience a breach at some point within the next two years, and there may be sensitive data that’s been compromised. So, we’re going to start out from an agenda standpoint, talking about security incident statistics, in general. We’ve got some new ones for 2020. We’re going to talk about ransomware in general, give a little history and just explain at a very high level how ransomware works. We’re going to then start to get into some nitty-gritty and talk about a simulation of what a breach looks like and the signs you’re about to get hit, and then what you should do. We’re then going to move over into effective incident response plan design. From there, we’re going to get into really talking about post-breach discovery, which is something we find is missing from most organizations’ incident response plan designs, and how that flows into our ReviewRight Protect service offering.
So, right away, here’s a great chart. Between data exposures and data breaches, we had close to 300 million individuals in the United States affected by reported breaches, and there’s a big caveat here, as many companies may not be reporting a breach to the regulators or authorities that track these things. These statistics are coming from the Identity Theft Resource Center, which is www.idtheftcenter.org, a really great resource on all things data breach, with some really good statistics and reports. You can see that 2016 and 2017 were certainly watershed years, and so was 2018. We had over 2 million individuals impacted by a data exposure or a data breach. Things were a little bit better in 2019 as far as sensitive data exposure, and in 2020 we were down to about 300,000, across almost 1,100 reported large-scale breaches.
That said, the cost of a breach isn’t really decreasing much. It’s been fairly static. It’s actually going up year-over-year. In 2020, the average cost of a breach in the United States was about $8.64 million, up from $8.19 million. Globally, the average cost of a breach was $3.86 million. These costs include loss of business, a lot of chaos, loss of employees, and loss of service lines; they include all the escalation, reporting, and notification of the breach; you have law firms and service providers; and there are post-response activities outside of the initial notification, which include credit monitoring and things like that, and then also brand management.
So, certainly these are not small events for any organization, and this number is going to go up based on the size and scale of the breach. So, in 2020, the cost per stolen record was about $146, and that was down from an all-time high of $158 in 2016. That said, healthcare certainly continues to be the most costly type of data breach at about $430 per stolen record, and oftentimes we find that healthcare breaches are some of the more complicated ones due to the string of channel partners and different vendors involved in any type of healthcare organization. The obligations that the parties have to each other can be robust, and it is certainly easy for a web of breaches to start to occur once an attacker makes their way into any of the organizations that may be supporting each other.
So, starting out with the history of what ransomware really is. I’m sure everybody knows, but typically speaking, it’s a payload that somehow makes its way onto a device and network through RDP compromise, email phishing, attachments, thumb drives, all sorts of ways that a ransom payload can make its way onto your infrastructure. It usually involves some type of remote code execution. There’s a privilege escalation or credential access on a machine or series of machines. It starts unpacking, it preps the environment, the payload is executed, and then things start to become encrypted. [Smart] cryptos and/or, if a hacker has fully breached the network as well (threat actors would be the right word), then they start to delete known local backup files, and then you get those nice little ransom pop-ups that usually tell you that you need to upload some Bitcoin to unlock the file, or, more commonly now, to contact an email address to negotiate with the attackers. Ransomware first appeared as a thing in about 2005, and went from 64-bit to 660-bit encryption in 2006. Fast forward to 2009, and we’re starting to see more malware-type attacks evolving into scams to basically get people to pay money, 40 bucks, to help decrypt files. In 2010, we started to see about 10,000 different payloads out in the wild. These are different variants of ransomware, and this coincides with the birth of Bitcoin, and then screen-locking ransomware that basically would lock you out of your computer.
Things started to progress as far as an industry from 2010 until about 2013, when we started to see the CryptoLocker ransomware hit the world, and that resulted in revenues – this is a business for these threat actors – of about $300 million in 100 days. In 2015, we had about 4 million ransom payloads in the wild, and in 2016, we started to see ransomware becoming about a $1 billion a year industry. This is when a hospital actually paid about $17,000 in ransom. And in 2017, we all remember the WannaCry ransomware, which received global news coverage, and that spread to about 34 countries and resulted in about $4 billion of losses for businesses, so really major. In 2019, I can tell you that the statistics are that about 62% of organizations claimed to have been victimized by ransomware at some point from 2017 to 2019. So, you can see how those numbers of reported breaches versus what organizations actually experienced are a little bit out of whack.
And from a payment standpoint for enterprises in 2020, we see the cost of a ransom going up, and I can tell you more organizations are paying them. There are more mechanisms, as far as insurance providers, to actually facilitate payment of ransoms in Bitcoin, and the average ransom payout in Q4 of 2020 was about $233,000, up about 31% from Q2. So, companies are paying more, threat actors are asking for more, and that’s just one piece of the puzzle, and we’re going to talk about that as we move more into the webinar.
I’m going to kick it over to my colleague, John Brewer, who is our Chief Data Scientist. We all work together very closely with our incident response teams and data teams for clients working on incident response issues and post-breach discovery.
Thank you very much, Mike. So, a lot of the questions that we get in our field are centered around the actual incident itself. So, we’re going to cover that for the moment. In this case, I’m going to be talking about ransomware more as a specific example so that we all have a common ground to talk from, but note that most of what I’m going to say is going to work with, or at least be true for, most types of network penetrations that could exfiltrate or destroy data.
So, let’s talk about right before you get hit. There are a couple of signs to look for. Most large enterprise networks are already monitoring for these kinds of metrics, but not all medium-sized businesses, and I mean, to be honest, not even all enterprises are looking out for these sorts of things. First off, partial MFA logins. Almost everybody has multi-factor authentication deployed for their users at this point. If you don’t, you should get on that pretty quick, but one of the key metrics to look for is when you have users who are repeatedly passing their first check, i.e., their password, but they’re failing the second part of the multi-factor authentication.
Now, Microsoft and Google and WatchGuard and Duo all have these little apps that you put on your phone now. In fact, most security organizations are saying the app is the preferred way to do multi-factor authentication, but one of the dangers there is that people get into the muscle-memory habit where they just see an MFA prompt pop up on their phone and they approve it, without necessarily thinking about what that implies. So, once an attacker has a user’s password, and they’re banging away at it, they’re just waiting for a user to slip up and let them in. So, those partial MFAs are one of the most important things to look for that we’ve seen out there.
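As a rough illustration of the partial-MFA signal described here, the following Python sketch counts, per user, the login attempts where the password check passed but the second factor failed. The event fields (`user`, `password_ok`, `mfa_ok`) and the threshold of three are assumptions made for this example; they are not taken from any particular MFA product's log format.

```python
from collections import Counter

def partial_mfa_users(events, threshold=3):
    """Return users with at least `threshold` attempts where the
    password was accepted but the MFA step failed.

    `events` is a list of dicts with hypothetical keys:
    "user", "password_ok", "mfa_ok".
    """
    counts = Counter(
        e["user"]
        for e in events
        if e["password_ok"] and not e["mfa_ok"]
    )
    return sorted(user for user, n in counts.items() if n >= threshold)

# Hypothetical sample data for illustration only.
sample_events = [
    {"user": "alice", "password_ok": True, "mfa_ok": False},
    {"user": "alice", "password_ok": True, "mfa_ok": False},
    {"user": "alice", "password_ok": True, "mfa_ok": False},
    {"user": "bob", "password_ok": True, "mfa_ok": True},
    {"user": "carol", "password_ok": False, "mfa_ok": False},
]
```

In this sample, only `alice` would be reported: her password is repeatedly accepted while the second factor keeps failing, which is the pattern to investigate.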
Brute-force attacks: we are probably all used to the steady rain of just random hits from IP addresses in China and Iran and Russia, and other places, just pounding off of our firewalls day and night. Cloudflare has built an entire industry out of dealing with these, but you should have an idea of what the normal pattern for those is for your particular organization, and if there’s a very abrupt shift in it, that can indicate that somebody has managed to do something that might be getting them into your network. Usually not through the brute-force attack, but either somebody who was trying to brute-force you found a way in and stopped, or somebody who found a way into your network started brute-forcing you in order to cover up the other activities that they’re taking by just flooding your security logs.
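One simple way to notice an abrupt shift from the normal brute-force pattern is to compare the current count of failed attempts against a baseline. The sketch below uses a z-score with a threshold of 3, which is an assumption chosen for illustration and not something the speakers specify. It flags sudden drops as well as sudden spikes, since both cases are described above.

```python
from statistics import mean, stdev

def is_abrupt_shift(baseline_counts, current_count, z_threshold=3.0):
    """Return True when `current_count` is far from the baseline.

    `baseline_counts` holds at least two historical counts (for example,
    failed logins per hour); the threshold is an illustrative assumption.
    """
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a shift.
        return current_count != mu
    return abs(current_count - mu) / sigma > z_threshold
```

A real deployment would likely account for time-of-day and day-of-week patterns; this version only captures the basic idea of knowing what normal looks like.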
Phishing emails, if you’re monitoring phishing emails recently, you know that these tend to go out in salvos these days, where the phishers will line up as many emails as they can from the same organization and fire off those emails all at once. The goal there being that they don’t want IT to have an opportunity to realize that these emails are going out to everybody, and to send out a warning to users not to open that email or whatnot. They know they only have a couple of minutes to a couple of hours to grab an unwary user, and it’s important to remember any user, no matter how well trained, no matter how technologically savvy, can be a victim of these phishing emails. It only takes a moment of inattentiveness to deploy the payload for a lot of these.
Jump boxes, which is the colloquial term for RDP boxes or SSH boxes that are open to the internet that people connect to, usually into a DMZ of some kind, and move into the rest of the network. If those start spinning up unexpectedly, like they start pulling a lot more CPU and memory than they normally do, that’s a sign that somebody might be preparing to launch an attack from one of those to the rest of the network.
SMB, Kerberos, LDAP requests, really, this is any kind of unexpected traffic coming from appliances, and by appliances, I mean routers, NASs, shared disk drives, even smart switches and thermostats for companies that have those deployed. As a general rule, if your Cisco router is trying to access the accounting database, something has probably gone wrong, and that’s one of the things to look out for, because that can usually mean that an intelligent attacker is using that as a jumping off point to your network.
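The appliance check described here can be sketched as a filter over network flow records: any flow whose source is a known appliance and whose destination port is a directory or file-sharing service gets reported. The flow tuple layout and the appliance inventory are hypothetical; the port numbers are the standard ones for Kerberos (88), LDAP (389), SMB (445), and LDAPS (636).

```python
# Standard well-known ports for the protocols named in the talk.
WATCHED_PORTS = {88: "Kerberos", 389: "LDAP", 445: "SMB", 636: "LDAPS"}

def unexpected_appliance_requests(flows, appliance_ips):
    """Return (source, destination, protocol) for watched-protocol
    traffic that originates from an appliance.

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples, a
    simplified stand-in for real flow logs.
    """
    hits = []
    for src, dst, dst_port in flows:
        if src in appliance_ips and dst_port in WATCHED_PORTS:
            hits.append((src, dst, WATCHED_PORTS[dst_port]))
    return hits
```

Some appliances, such as a NAS joined to a directory, may legitimately make a few of these requests, so an allowlist of expected appliance-to-server pairs would be a sensible refinement.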
Broadcast traffic from point to site VPN connections. Especially in the age of COVID, we’ve had a lot more people using VPN, but VPN users are usually very specific in what they do. They’re getting in to access file shares, to access internal websites, or sometimes to RDP into individual machines. If you start seeing traffic that isn’t going to specific sites, even if it’s being blocked by the firewall, that can indicate that somebody has managed to get onto your VPN, who’s not supposed to be there, and just starting to scan around to see what they have access to. Even if you think that you’re secured, that’s something that you really need to examine and react to.
This last one is actually getting more and more common, which is any outbound traffic from your site that isn’t going over HTTPS. Almost everything runs over HTTPS now. You have, obviously, normal web traffic and web applications, but desktop applications like Slack and Teams and other communication and collaboration applications, and most desktop applications like your Visual Studios and your Offices, are going to be pulling their downloads and uploads and communications through HTTPS. Google is even doing DNS over HTTPS these days. So, any time that you see a machine that’s connecting out on something other than port 443, especially if it’s like 6666 to 6670, the IRC ports, or anything that’s in the 9000 series, which indicates dark web traffic, you definitely want to take a look at whatever machine is originating that traffic. So, those are just the latest metrics that we’ve been seeing in how people can tell that somebody’s inside even if they haven’t started making mischief yet.
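To make the outbound-traffic check concrete, here is a minimal Python sketch that flags any connection not going to port 443 and labels the two ranges called out in the talk. The IRC range is taken here as 6666 to 6670, which is an assumption about the intended numbers, and the connection tuples are a hypothetical simplification of firewall or flow logs.

```python
IRC_PORTS = range(6666, 6671)   # assumed reading of the range mentioned in the talk
NINE_THOUSANDS = range(9000, 10000)

def classify_outbound(port):
    """Label an outbound destination port for triage."""
    if port == 443:
        return "https"
    if port in IRC_PORTS:
        return "irc-range"
    if port in NINE_THOUSANDS:
        return "9000-series"
    return "other-non-https"

def flag_outbound(connections):
    """Return (host, port, label) for every non-HTTPS outbound connection.

    `connections` is an iterable of (host, dst_port) pairs.
    """
    return [
        (host, port, classify_outbound(port))
        for host, port in connections
        if port != 443
    ]
```

In practice, legitimate non-HTTPS traffic such as DNS, NTP, or outbound mail would need to be allowlisted before the remaining hits are worth an analyst's time.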
The next step, of course, is OK, we’ve been hit. You know that you’ve been hit; sometimes, in the case of ransomware, somebody’s screen pops up and says, hey, your data has been encrypted, please email this address to begin negotiations for getting the key to decrypt your data. Once this has happened, now we’re actually in incident mode. You know that you’ve been hit, but you’re not sure how much data has gotten out. Frequently, the people who discover the impact don’t know what their legal exposures or responsibilities are. So, I’m just going to cover the technology side, and then my colleague is going to cover more of the legal ramifications of that.
But from an IT perspective, you stop the leak. This is one of those circumstances where social mores go out the door. I’m usually not a proponent of interrupting employees when they’re on vacation, or otherwise indisposed. I believe in work/life balance, but this is one of those exceptions, especially in the opening minutes of a breach. If you don’t have the person who can secure the network, if there’s any question of whether or not the people who are currently on duty can handle this, pick up the phone, use the personal line, get in contact with the person who can seal the leak, because minutes matter when you first discover these.
Change passwords on any accounts that might be compromised. Again, it is much better to change a password on an account that doesn’t need to have its password rotated than to accidentally leave something open that you didn’t realize the attackers had access to.
Halt all systems that automatically rotate or delete logs. This is going to be important for the investigation later, but it’s something that a lot of our customers miss. Essentially, once we get to the more formalized incident response later on, ephemeral logs and things that go away after a day or two can be purged from the system, and those are often the ones that we’re looking for, especially high tempo, high transactional logs like network packet logs, things like that on routers. Having a plan in place to halt those and archive those ahead of time is really important, but even if you don’t have that plan, making sure that you hold on to all the state information you have at the moment of the attack is super important for later on.
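As one possible way to hold on to ephemeral logs, the sketch below copies every file under a log directory into a separate archive directory and records a SHA-256 hash for each copy, so that later investigators can show the files were not altered. The directory layout is hypothetical, and the sketch assumes the archive directory is not located inside the log directory. It does not stop rotation itself; that step depends on the logging system in use.

```python
import hashlib
import shutil
from pathlib import Path

def preserve_logs(log_dir, archive_dir):
    """Copy all files under `log_dir` into `archive_dir`.

    Returns a manifest mapping each relative path (POSIX style) to the
    SHA-256 hex digest of the archived copy.
    """
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    manifest = {}
    for src in sorted(p for p in log_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(log_dir)
        dest = archive_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also tries to keep file timestamps
        manifest[rel.as_posix()] = hashlib.sha256(dest.read_bytes()).hexdigest()
    return manifest
```

Storing the manifest somewhere the attacker could not have reached, along with the time the copy was taken, makes the archive more useful in the later investigation.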
And then finally, secure all backups. I cannot count the number of times that I’ve had or that I’ve come into organizations that thought that they had their backups secured, that they were offsite, that they were inaccessible to the attacker, and then went back and discovered that their backups had been encrypted by some ransomware or otherwise compromised. It’s also important to note that some lurking attackers will actually sit dormant for a period of time specifically so that they can get themselves into the most recent backup, so that once the backups are restored, the attackers still have access.
So, let’s talk about the information that we need to collect and the questions that are going to be coming up immediately after an attack. First off, obviously, who was exposed? This is going to be the biggest question. Was just employee data taken? Was customer data taken? Was vendor data taken? Are there other people’s information we’re responsible for? The classic case in this… or the classic example of this, rather, is credit report information for people who aren’t necessarily your customers, aren’t your vendors, but nonetheless, it’s personally identifying financial information that you had in your custody, and if that escaped from your organization, that’s going to come up later on in the incident response investigation.
When did the attacker get in? Now, this is something that may have multiple answers. You’ll have the time that you found out that the attacker was inside, and you may be able to or you may not be able to establish a time when they originally got in, but at the very least we want to find the earliest time that we can confirm that they were in the system.
And then what time were they locked out again? We’re going to be creating timelines about what’s happening later on in the incident response and knowing exactly what time the attacker was locked out, whether the credentials that they were using were changed, or the time that we actually went and terminated all of their active sessions, because those can also be two different things.
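The timeline questions above (earliest confirmed access, credential change, session termination) can be captured in a small structure. This Python sketch assumes events have already been gathered as ISO 8601 timestamps with one of three hypothetical labels; the labels and the input shape are illustrative only.

```python
from datetime import datetime

def intrusion_window(events):
    """Summarize key timeline points from (iso_timestamp, kind) pairs.

    Hypothetical kinds: "attacker_activity", "credentials_changed",
    "sessions_terminated". Missing kinds are reported as None.
    """
    parsed = [(datetime.fromisoformat(ts), kind) for ts, kind in events]

    def times(kind):
        return [t for t, k in parsed if k == kind]

    activity = times("attacker_activity")
    creds = times("credentials_changed")
    sessions = times("sessions_terminated")
    return {
        # Earliest moment we can confirm the attacker was in the system.
        "earliest_confirmed_access": min(activity) if activity else None,
        # Last credential change and last session kill are tracked
        # separately because they can be two different times.
        "credentials_changed": max(creds) if creds else None,
        "sessions_terminated": max(sessions) if sessions else None,
    }
```

Keeping the credential change and the session termination as separate fields reflects the point that an attacker may remain connected after a password has been rotated.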
Was anything altered? Was any data changed? Was anything installed? Were any new accounts created, i.e., are there any new administrators floating around the system, and were permissions changed? That’s especially important in cloud environments. Sometimes attackers can get in relatively early in the attack, they’ll do things like set S3 buckets or other cloud storage to publicly available so that even once they’ve been kicked out of the main system, they can come back and collect all of these databases or other data stores that are just floating out there now public without the knowledge of the data owner.
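For the cloud-permission question, one narrow check is whether a bucket policy document grants access to everyone. The sketch below inspects a policy in the standard AWS JSON policy format and reports whether any `Allow` statement has a wildcard principal. It is a simplified illustration: it ignores `Condition` blocks, ACLs, and account-level public access settings, all of which matter in a real review, and the bucket name in the usage example is hypothetical.

```python
import json

def allows_public_access(policy_json):
    """Return True if any Allow statement names a wildcard principal.

    Simplified check over an AWS-style JSON policy document; conditions
    and ACLs are deliberately not evaluated here.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may appear unwrapped
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            return True
        if isinstance(principal, dict):
            aws = principal.get("AWS")
            if aws == "*" or (isinstance(aws, list) and "*" in aws):
                return True
    return False
```

Comparing the result against a snapshot of the policies from before the incident would help show whether a permission was changed during the attack.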
And is there a persistent threat? Essentially was somebody added, or was some software, rather, added to the environment that will allow the attackers to get back in after you’ve closed the doors?
What did they have access to? And this is one of those things where we want a best-case scenario and a worst-case scenario. Were they restricted to an individual box? Did they only have access to one set of files or have access to one relatively unprivileged user account, or were they able to escalate up to somebody who had vast permissions over large sections of the infrastructure or even permissions into adjacent organizations, like partner organizations, vendors, customer organizations? This is a classic incident when MSPs, managed service providers, get hacked, because the attackers not only get access to the MSP itself, but generally speaking, the MSPs have privileged access on all of their customers’ systems. Finding out what the best- and worst-case scenarios are, given what we know about the attacker’s access, is going to be a question that’s going to get asked a lot in the hours and days after an attack is discovered.
And then finally, what did we do? Document everything you do in response to an attack, especially in the first few hours or days. Just get it written down, keep your emails, keep handwritten notes if you can. I know that in the moment, a lot of IT workers and business leaders, and frequently even the legal team, will stop for a moment to say, OK, how is this going to look later on? Will I be criticized for this decision? I can tell you that having been through a lot of these, almost any good faith action that you take to secure your own data, to secure your customer data, and really any other data that you’re responsible for, to eliminate the attacker and to take an action that you believe is going to impede or halt the attack is going to reflect well on you in any subsequent investigation. I mean, if it involves unplugging a router at a field office, if it involves pulling the breakers in a data center – I know of at least one incident where a really quick-thinking junior employee actually pulled the fire suppression system in the data center to kill the entire floor as their data was being encrypted. So, when you pull the fire suppression system in a data center, one of the first things that does is it cuts power to the entire floor, and in that case it probably saved their company an enormous amount of time and effort, because it usually takes about a day to recover from a fire suppression system like that. It can take weeks or months to recover from a large ransomware attack.
And then finally, don’t delete anything that isn’t an immediate threat. That is the one thing that you can do that will raise awkward questions later, especially if you don’t know what you were deleting or what was in it. When we get into conversations with insurance companies and that phase of analysis, anything that you did to delete data is going to be much, much harder to document and to explain when we get to that part of the incident response. So, don’t delete anything that you don’t know is a threat to the system.
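The advice to document every response action can be supported by something as simple as an append-only list of timestamped entries. The following Python sketch is one hypothetical shape for such a record; the field names are assumptions made for illustration.

```python
from datetime import datetime, timezone

def record_action(log, actor_role, action, reason):
    """Append a timestamped entry describing a response action.

    `log` is a plain list that is only ever appended to; times are
    recorded in UTC so entries from different sites line up.
    """
    entry = {
        "time_utc": datetime.now(timezone.utc).isoformat(),
        "actor_role": actor_role,
        "action": action,
        "reason": reason,
    }
    log.append(entry)
    return entry
```

For example, `record_action(log, "IT lead", "Unplugged router at field office", "Contain spread")` would capture who acted, what they did, and why, which is the information that tends to be asked for later.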
And with that, I’m actually going to throw over to my colleague for effective IR plan design.
Thanks, John. So, no discussion about an IR plan is complete without going through the key players, the key workstreams, response and notification. So, let’s start with key players.
I think it’s 101 that you have already established a scalable bench of in-house and external resources, and we’ve listed a few here. Obviously, there’s the internal IT lead, the legal lead, outside counsel, and forensics firms, and there can be other key players identified depending on the matter, the company, and the type of data. So, we’ve just listed a few, but what’s important to define specifically is who is taking the lead in certain areas and what their level of decision-making authority is. You may have a legal lead, but is it the general counsel or the CEO who’s actually making the ultimate decisions that are going to drive your communication plan, and what are the criteria for when legal should take over? So, if it’s a high-risk incident, think about privilege concerns, and whether hiring authority should shift to outside counsel to manage them.
The other point about defining your key players is keeping the core team as small as you can get away with and understanding you can always have an expanded team. The core team and expanded team should be defined by role, so that as people change (depending on the organization, people may move in and out of jobs very quickly), you’re not scrambling to figure out why somebody who is now clearly in a different department, or a different role in IT, was involved. So, having that listed by role is important, but so is defining your backup plan. If that person is on vacation, to John’s point, you should go ahead and try to reach out to them. If they are unreachable, then what is your backup? For any kind of plan of this nature, I really like RACI charts, or RASCI charts, where you understand at a glance who is ultimately responsible for that workstream, for that area, and who is assisting, who should be consulted, who should be informed, but again, the most important point here is to keep it as small and simple as you can get away with.
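A RACI chart keyed by role, with a named backup, can be represented as a small lookup table. The workstreams and role names in this Python sketch are hypothetical examples, not a recommended staffing model, and the fallback order (responsible, then backup, then accountable) is an assumption made for illustration.

```python
# Hypothetical chart: every entry is a role, never a person's name.
RACI_CHART = {
    "containment": {
        "responsible": "IT lead",
        "backup": "Deputy IT lead",
        "accountable": "CISO",
        "consulted": ["Forensics firm"],
        "informed": ["General counsel"],
    },
    "breach_notification": {
        "responsible": "Legal lead",
        "backup": "Outside counsel",
        "accountable": "General counsel",
        "consulted": ["Privacy analyst"],
        "informed": ["CEO"],
    },
}

def who_to_call(chart, workstream, unreachable=()):
    """Return the first reachable role for a workstream.

    Tries the responsible role, then its backup, then falls back to
    the accountable role.
    """
    entry = chart[workstream]
    for role in (entry["responsible"], entry["backup"]):
        if role not in unreachable:
            return role
    return entry["accountable"]
```

Because the chart lists roles and not individuals, it stays valid when people change jobs; only the separate role-to-person contact list needs updating.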
In terms of key workstreams, focus your efforts on defining the workstreams for the highest-risk events, and on when you want legal counsel to participate in assessing risk. Obviously, it will be for the higher-risk events, trying to get a legal perspective on how that can drive or change the workflow and strategy, which would include areas of data privacy, which are very highly regulated, breach notifications, complying with insurance policies, and working hand in hand with your risk management team. And then, again, how you’re going to communicate becomes critical when all these workstreams need to be run in parallel.
For a communication plan, the central theme here is to keep it as short as possible. If you have somebody who has to step into a role who has not been fully briefed or participated in tabletop exercises and read the plan, then memorizing a five-to-eight-page plan is not ideal. So, you want to keep it very short and concise, the fewer players the better. And as much as can be done in advance, the better, because here efficiency is essential, so if you can have one role wear multiple hats, then that may be ideal. And the example that I’m going to use in the next slide as well is eDiscovery counsel. Having done this role myself, the eDiscovery counsel often has master service level agreements with firms who can do both the forensics and the document review and has a working knowledge, sometimes a very deep institutional knowledge, about what the risk is and how to triage.
So, that brings us to data mapping. So, in that way, again, eDiscovery counsel can be your friend. I’ve tried to be a friend to many, and I have had experiences with high-stakes triage on complex matters, working under short deadlines. Again, they bring that institutional knowledge of company data sources and where the risk may lie, and of how to do large document reviews and identify specific hot topics, hot-button issues that need to be communicated effectively to whoever is running your legal strategy or running point on the IT side. And also, the nice thing about eDiscovery counsel is that their team usually has access to and experience with the O365 Security & Compliance Center, so again, you are being as efficient as you can be with the different subject areas that you have.
Privacy analysts – I don’t want to overlook their role. They have done a lot of work. If you do work in various countries and you have a privacy compliance team, they probably already mapped a lot of high-risk data sources in terms of privacy, and that can get you a lot of basic important information very quickly.
Just to chime in on the data mapping, it becomes really important to leverage your internal knowledge to understand where your PII, your PHI, any of your sensitive data is stored. Any existing knowledge within the organization as to those locations is going to really help you move faster down the line towards resolving the breach, handling the breach notifications, and understanding what entities and parties you may have to notify. You may be able to get to that information a lot quicker if you understand your data map and you pay attention there.
That’s absolutely correct, John. It is, again, essential that you can identify areas that can be run in parallel, and so as much as you can triage to the people who have the most knowledge about those data sources and understand the privacy risk, the better off you are while you have your IT teams and maybe your other legal counsel and folks talking to risk management and talking to the C-suite or the board doing what they can do best.
And so, to cap off my portion, let’s talk about response and notification. Speaking of triage, again, as the work is done in parallel, to the extent that you already have relationships that you can trust, and you have agreements in place with law firms and vendors alike who can perform multiple roles, such as digital forensics and document review, that is ideal. You don’t need 100 tools in a toolkit and 100 different toolkits; you need a small group of players who have as much of this capability as possible under one roof.
With that, I will turn it over to talk about post-breach discovery. Mike, do you want to talk about the post-breach discovery workflow?
Yes, so post-breach discovery is certainly something that we find that many organizations don’t have as a part of their incident response plan. They don’t see beyond sealing the leak and closing any unauthorized access. Tabletop exercises can easily encapsulate the workflows or various needs that start to happen from legal streams, as far as allowing an organization to comply with various obligations from a disclosure standpoint, be it individuals or organizations, business partners, organizations who are customers in various capacities.
Here is the workflow, so to speak, and the clock starts ticking for most organizations as soon as they become aware of a breach. Oftentimes you have notification requirements baked into contracts and things like that, or you’re regulated, depending on what industry you’re operating in, or just depending on what purposes the sensitive data at issue was collected for. And really, what we’re starting with is certainly our Preserved Impacted Sources, and this goes back to that sensitive data map. You want to know where your sensitive data is, rather than just looking at a massive 20 terabyte share and having to figure out, ‘OK, how do we get to where the PHI/PII is right away?’ before even using tools that can assist you onsite or offsite in various capacities. Knowing where that data is, is going to streamline this process insomuch as being able to say, ‘here are the data sources that contain sensitive data that we know were not hit, and here are the ones that were, and let’s move them along this process as quickly and efficiently as possible with outside counsel and a vendor like Haystack who does incident response, digital forensics, eDiscovery and document review’.
And we’re starting with these data sources, then we’re asking ourselves, ‘is there sensitive data?’ There may be interview processes that are occurring with business unit leaders based on the data sources. We may have deployed toolkits onsite that are going to allow for kind of rapid assessment to inform us if there’s any type of sensitive data in a mixed data population.
In this chart here, data sources really mean structured data and unstructured data. You may have SQL databases, you may have web-based data, you may have databases with many dependencies. This is where it starts to become complex. You need to think beyond the typical work product, the documents, and look at all the interdependencies of a modern organization that relies on different types of databases and web forms to interact with its customers or to run its business. That is important because handling those data sources and getting them into an actionable format to go through this process is a process in itself.
Once we have it, we are basically using a variety of tools to process this data. There are tools that are specifically attuned to more incident response workflows. But at the end of the day, this is going to go through somewhat of a modified eDiscovery workflow, which many of you on this webinar are very aware of, and you know everything about eDiscovery. And we’re looking to continuously build out high-level reporting at every stage. This is very important for outside counsel and for the organization itself to be able to assess risk, to control spend and to contain risk, so that they’re making decisions that are informed by actual reports and statistical sampling. Oftentimes the goal is eliminating populations of data, and that is certainly something that is pervasive throughout this entire workflow end-to-end.
Once data is processed, you’re either using some type of machine learning or AI, search-based workflows, regex, or a combination of all of them. HaystackID uses everything. We have specific workflows in Relativity. We are also partnered with other platforms that provide entire workflows that are specific to breach response and disclosure, which work well too, and everything is kind of dependent on the situation, sometimes on the granularity of reporting that’s required and on scalability. We’re trying to get an impact assessment, really something that is kind of the aggregate of everything, including where we’re starting our extraction and why we picked those data sources. That impact assessment is going to be all of those outputs from the artificial intelligence and search, and just our information gathering.
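To make the regex part of that search-based pass concrete, here is a minimal sketch in Python. The pattern set, the names `PATTERNS` and `scan_text`, and the sample sentence are all assumptions made for illustration; they are not HaystackID’s actual patterns, and a real matter would need far broader, validated expressions plus human review of the hits.

```python
import re

# Illustrative patterns only (assumed for this sketch, not a production set).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_text(text):
    """Return {pattern_name: [matches]} for every pattern that hits in text."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


sample = "Contact Jane at jane.doe@example.com or 555-867-5309. SSN 123-45-6789."
result = scan_text(sample)
print(result)
```

Patterns like these over-match by design, which is why the false-positive sampling discussed later in the session matters.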
And at that point, we get a pretty large team that starts going through these documents. Usually it’s a mixture of attorneys and paralegals; it certainly depends on the client’s preference. And the goal here is to start to get really all the sensitive data and a collation around individual entities and organizations whose data has been compromised: health data, financial data, socials, all that stuff. That gets you to a disclosure list or series of disclosure lists, and the regulations vary state to state as far as how you need to disclose, in what format, and within what timeframes. Certainly, globally as well, that’s going to be a much different process and that’s all just going to depend on your data privacy laws as well.
And just to give… we’ll quickly give a high-level rundown of some of the modern techniques and tooling and artificial intelligence and search that is being used from just across the spectrum to achieve these outcomes. John Brewer.
Thank you very much, Mike. So, as Mike was saying, for most processes we have, or for most systems I suppose I should say, we have relatively mature workflows that carry those through the run-up to the review process, I guess I should say. Although with almost any system, with almost any organization, especially any organization that has more than a few people working at it, there will be some sort of application or other system that we’ve not encountered before, or that we haven’t encountered in the way that customer is using it, whether it’s just an obscure but commonly available system or something homegrown. For that reason, we do keep a full toolset of various data science and analysis techniques in our toolbox, so that we can just rise to the challenge in any particular circumstance.
Now, back in the early part of the decade, well, I guess, now technically last decade, we were using techniques that were being pioneered by companies like Google and Facebook and other kind of social media companies that were funding large amounts of research and gave us tools like Word2Vec and its various progeny and Template Matching systems, which are kind of more traditional AI approaches that you would have been familiar with in years gone by. And we’ve stayed up with the modern literature.
Now, obviously, things that got published in the last year or so are difficult to operationalize, but in the last few years, the things that we are starting to or are in actuality pushing into our environments to help with data reviews are things like augmented transcripts and generative pre-trained transformer, or GPT, models. Triumvirate Cognitive Models, these are the kinds of models that allow us to leverage multiple different vendors who do similar things and combine the results from their systems to help us provide the best possible result for our customers. Sentiment Detection, which actually is a technology that was originally used for things like Yelp reviews and analyzing huge amounts of Twitter data, but it turns out that it has great applications in discovering PII and especially PHI, just because of the connotations and the, well, sentiment frankly that you tend to get around that data in unstructured format. And then Non-Entity Key Phrases, which in this context is usually just a fancy way of saying that we recognize that titles and names can refer to the same entity, for instance, that the Queen of England also refers to Elizabeth Windsor and that that is the same person.
So, we use all of these toolsets sometimes and we work with our customers and we work with their counsel and whatnot to make sure that we’re matching the most effective form of artificial intelligence or data science techniques to bring out the specific information and to reduce the problem of sorting through, as Mike pointed out, potentially terabytes and terabytes of unstructured data to determine how extensive a particular incident might have been in terms of data exfiltration.
Mike, do you want to go over the data breach assessment reporting?
Thanks, John. This is just a stock sensitive data breach assessment graphic, probably not actually what we would deliver to you. But the key elements of an impact assessment report are always going to be a count of sensitive data by type: PII, health data, financial data, and much more granular in most circumstances. We would be looking to construct taxonomies that are specific to the organization. They have their own patient IDs, other types of customer data types, the things that don’t fit legally into any model. We would work to come up with methods to locate those and to report on those. Certain types of confidential information. Certainly, looking for unique person names across the dataset, so summarizing every instance of “Michael D. Sarlo” that appears 50 times in the dataset to a single entity that’s been affected, and likewise for organizations. Being able to map a good quantity of those and where that entity overlaps with some type of PII, PHI, or other sensitive data type. Those are usually really easy documents to go for, where you have a person’s name and a social right in the same document.
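The counts and overlaps described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input shape (`doc_hits`, a mapping from document ID to detected `(data_type, value)` pairs), the type label `person_name`, and the function name `summarize` are hypothetical and do not describe the actual report format.

```python
from collections import Counter


def summarize(doc_hits):
    """Summarize detections for an impact assessment.

    doc_hits: {doc_id: [(data_type, value), ...]} (assumed input shape).
    Returns (counts_by_type, unique_names, docs_with_name_and_sensitive).
    """
    counts_by_type = Counter()
    unique_names = set()
    docs_with_name_and_sensitive = []
    for doc_id, hits in doc_hits.items():
        types = {data_type for data_type, _ in hits}
        counts_by_type.update(data_type for data_type, _ in hits)
        # Collapse repeated mentions of the same name into one entity.
        unique_names.update(
            value.lower() for data_type, value in hits if data_type == "person_name"
        )
        # A name plus any other sensitive type in one document is an easy win.
        if "person_name" in types and types - {"person_name"}:
            docs_with_name_and_sensitive.append(doc_id)
    return counts_by_type, unique_names, docs_with_name_and_sensitive


doc_hits = {
    "DOC-1": [("person_name", "Michael D. Sarlo"), ("ssn", "123-45-6789")],
    "DOC-2": [("person_name", "Michael D. Sarlo")],
    "DOC-3": [("health", "diagnosis code"), ("ssn", "987-65-4321")],
}
counts, names, overlap_docs = summarize(doc_hits)
```

Lower-casing is the only normalization applied to names here; the entity normalization techniques covered later in the session go much further.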
And this can get, you know, along the lines of just certain types of exception reporting. And as we move into the actual extraction phase, we’re doing different types of data analysis here, using statistical sampling to, basically, remove false positives from the dataset, while also using statistical sampling to sample populations that don’t hit on anything, because that’s oftentimes where your risk is: the things you can’t see using a machine or using search. And there are interesting technologies out there for photos and W2s and things like that, which can streamline the type of material that doesn’t necessarily have text or doesn’t OCR very easily.
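As one way to picture sampling the population that did not hit on anything, the sketch below sizes and draws a simple random sample in Python. It uses the standard sample-size formula for estimating a proportion with a finite population correction; the 95% confidence level, 5% margin of error, the function names, and the document IDs are assumptions for illustration, not parameters stated in the presentation.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                # finite population correction
    return math.ceil(n)


def draw_sample(doc_ids, n, seed=0):
    """Draw a reproducible simple random sample of n document IDs."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), n)


# Hypothetical null set: documents with no search or model hits.
null_set = [f"DOC-{i:06d}" for i in range(100_000)]
n = sample_size(len(null_set))
sample = draw_sample(null_set, n)
```

A fixed seed keeps the draw reproducible, which helps when the sampling has to be explained later as part of a defensible record.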
All of this is going to get us, really, to a large-scale, very fast-moving effort that I wouldn’t call a document review. The popular term in a subset of the industry would be more of a data extraction or entity extraction, but we’ll call it review since we have so many eDiscovery folks on the line today. And ReviewRight Protect is kind of a bundle of services and technologies. It has a functional first core that turns on our ReviewRight reviewer recruitment, testing and placement platform. It uses its own artificial intelligence to basically allow our teams in our document review division to recruit, test, measure and place the best possible reviewer on any given matter. And in this case, we’re looking for reviewers using our testing platform who may have specific subject matter expertise or domain experience in relation to the entity that was breached, but oftentimes, we’re just looking for somebody who can code documents quickly, efficiently, and accurately. And that applies, certainly, to breach extraction or any review, and what we’re doing is trying to always get a handful of reviewers who are above that Mendoza Line. These can move so quickly that being able to scale nationally is critical. We’ve been a leader in remote review for the past seven years and have done well over a thousand remote reviews, while our competitors were trying to catch up to remote as a result of the coronavirus, so we’re well positioned to scale into the hundreds of individuals to support these types of projects, multiple projects at once. Those numbers can be critical, because sometimes you have 10 days to go through this data and to get a disclosure list, just depending on what your obligations are. It’s especially true with healthcare data.
And certainly, where we like to kick off any extraction is with a gauge analysis. Again, standard for any review we’re doing and also for any extraction. What we do here is we get a statistical sample of the population that’s in scope, or a pre-seeded set of documents selected by HaystackID Review Management and/or outside counsel, and we basically put all of the reviewers on the same set of documents at the outset of a project. This is a great way to help make sure that reviewers are fully attuned to the dataset and the issues at hand, that they’re getting what they’re doing, and that they understand what types of PII, PHI, and sensitive data we’re looking for. And it allows us to correct anybody who is off, or sometimes there’s even a disconnect with outside counsel, when we start to see that they’re all coding documents or doing extraction differently.
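The presentation does not say how the results of this exercise are scored. One simple possibility, sketched below in Python as an assumption rather than a description of the actual method, is to compare each reviewer’s coding against the majority call on the shared set and flag anyone whose agreement is low. The reviewer names, labels, and the function `agreement_with_majority` are hypothetical.

```python
from collections import Counter


def agreement_with_majority(codings):
    """Fraction of shared documents where each reviewer matches the majority label.

    codings: {reviewer: {doc_id: label}}; every reviewer coded the same documents.
    """
    doc_ids = list(next(iter(codings.values())).keys())
    majority = {}
    for doc_id in doc_ids:
        votes = Counter(labels[doc_id] for labels in codings.values())
        majority[doc_id] = votes.most_common(1)[0][0]
    return {
        reviewer: sum(labels[d] == majority[d] for d in doc_ids) / len(doc_ids)
        for reviewer, labels in codings.items()
    }


codings = {
    "reviewer_a": {"d1": "PII", "d2": "none", "d3": "PII", "d4": "none"},
    "reviewer_b": {"d1": "PII", "d2": "none", "d3": "PII", "d4": "PII"},
    "reviewer_c": {"d1": "PII", "d2": "PII", "d3": "none", "d4": "none"},
}
scores = agreement_with_majority(codings)
```

With an odd number of reviewers and two labels there are no ties; a real scoring scheme would need a tie-breaking rule, and would likely compare against counsel’s answer key rather than a simple majority.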
And really, for all these matters, the name of the game here is how you can reduce the population you’re looking at. And that’s something we’re always trying to do for any HaystackID matter that comes within our walls, be it a Second Request or a data breach. Wherever we can use any type of data analysis to reduce the scope of the document review universe, that’s always going to save our clients money. Going back to those original stats, about 150 bucks per document from a disclosure standpoint, 450 if you’re a healthcare agency or a healthcare provider.
And some of these are along the lines of what you can use on any matter, but certainly the easiest one out of the gate is going to be deduplication. We’re always going to recommend item-level deduplication, email threading, near-dupe analysis, and domain analysis, insomuch as trying to find domains that we just know don’t contain sensitive data or they shouldn’t [turbo blast] things like that. Search term analysis is important too as we start to get into the data, and we start to see more false positives and we can identify where those false positives are coming from. Sometimes regex [is hit finally], being able to combine those with a set of terms to start to remove those false positives, and again statistical sampling. There’s a lot of statistical sampling. You need to be defensible, and building a living, breathing report of these populations is a major piece of the workflow, so you can put a nice bow on it if and when you ever start to get investigated or probed, things like that, about how the incident was handled. This is the right way to do it.
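Item-level deduplication is commonly done by hashing each item and keeping the first copy of every distinct hash. The Python sketch below shows that idea only; the choice of SHA-256 over the raw bytes, the function name `deduplicate`, and the sample items are assumptions, and processing platforms typically hash a normalized set of fields rather than raw content.

```python
import hashlib


def deduplicate(items):
    """Keep the first item seen for each distinct content hash.

    items: iterable of (item_id, content_bytes).
    Returns (unique_ids, duplicates), where duplicates maps each dropped
    item_id to the item_id of the copy that was kept.
    """
    first_seen = {}   # digest -> item_id that was kept
    duplicates = {}   # dropped item_id -> kept item_id
    for item_id, content in items:
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            duplicates[item_id] = first_seen[digest]
        else:
            first_seen[digest] = item_id
    return list(first_seen.values()), duplicates


items = [
    ("msg-001", b"Q3 payroll export attached."),
    ("msg-002", b"Lunch on Friday?"),
    ("msg-003", b"Q3 payroll export attached."),
]
unique_ids, duplicates = deduplicate(items)
```

Keeping the map of dropped items to their surviving copy means that extraction results can later be propagated back to every duplicate location if reporting requires it.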
And our workflow is pretty much this: we’re identifying PHI/PII, and it has to be extracted from the documents and entered into a form, or we’re using review platforms that specifically allow point-and-click collation. That data is being classified and associated with entities, hopefully a single Mike Sarlo. If you have a piece of me somewhere, like my address, and my social in another document, those are going to be associated together as a single entity object as a part of this process. We’re then focusing on identifying anything where we can mass extract sensitive data. You get big Excels, big databases, things like that, which can be handled from a GUI standpoint or handled on disk. We identify those all en masse, move them separate and aside from the workflow, and we will work with our data science teams, using different tech platforms to normalize that into the overall format of the disclosure list.
And then QC, and there’s iterative QC happening on a daily basis. We’re also using artificial intelligence techniques to clean up the entity list, and also using kind of classical de-dupe methods where you’re dealing with tens of thousands, if not millions, of individual entities, and oftentimes, it can be in the millions.
And reporting here on a day-to-day basis is always going to be customizable. Again, specific teams are going to be looking for things to be reported on in different ways, and our ReviewRight Reporting modules allow for complete customization of reporting output during the course of any extraction or review. Here, we’ve really broken things up by PII categories, with pace and things like that overlaid there. So, these are great because you start to understand what type of risk is emerging on a day-to-day basis, if not hourly, using dashboards as a review begins, which may change your strategy as far as how you plan to handle something like disclosure.
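The idea of associating fragments found in different documents with one entity can be illustrated with a small union-find grouping in Python: any two extracted records that share an identical field value end up in the same group. The field names, sample values, and the function `link_records` are hypothetical, and exact matching on a single shared value is a deliberate simplification; linking on a common name alone would over-merge in practice.

```python
def link_records(records):
    """Group record indexes that share at least one identical (field, value) pair.

    records: list of dicts of extracted fields, one dict per document hit.
    Returns a sorted list of groups, each a list of record indexes.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # (field, value) -> index of the first record that carried it
    for idx, record in enumerate(records):
        for pair in record.items():
            if pair in owner:
                parent[find(idx)] = find(owner[pair])
            else:
                owner[pair] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


records = [
    {"name": "michael sarlo", "address": "1 main st"},   # from document A
    {"address": "1 main st", "ssn": "123-45-6789"},      # from document B
    {"name": "jane roe", "ssn": "987-65-4321"},          # from document C
]
groups = link_records(records)
```

Here the first two records are joined through the shared address, so the name from one document and the social from another land on a single entity.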
And finally, our last slide here and definitely if you guys have any questions, please feel free to enter them into the Question box, we’re kind of right on time here to answer questions. All this rolls up into, really, normalization, getting the mess of humans touching data or a machine touching data, and they miss things, or they grab random characters, and getting it into a format that can go to be used to, basically, send out, most of the time, mass mailers, ‘you’ve been breached, here’s a free identity theft card for the next year, like dotcom’.
I’ll kick it back to John Brewer to go over some of these techniques and then we’ll close it out with Q&A.
Sure, thank you very much, Mike.
So, as Mike was saying, we have a variety of different tools that we use to normalize and deduplicate entities once we have the grand list of everything that could be. So, again, we use classical deduplication techniques like SoundEx, nicknames, common abbreviations, techniques that have been around since, I think, the 1980s, tried and true and can do it in your sleep, very low-cost processing techniques. And obviously, all of this is feeding into human review. We never have a machine just making that decision unilaterally. All of these technologies support the human review aspect of this and that’s important to emphasize any time that we’re working with using AI in a legal matter.
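For readers unfamiliar with Soundex, the Python sketch below implements the standard American Soundex code and pairs it with a tiny nickname table to build a matching key. The nickname table, the `match_key` function, and the spelling variant “Sarlow” are made up for illustration; real nickname and abbreviation lists are far larger, and matches produced this way would still go to human review.

```python
# Standard American Soundex digit assignments.
CODES = {}
for digit, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
        if len(result) == 4:
            break
    return result.ljust(4, "0")


# Hypothetical nickname table for illustration only.
NICKNAMES = {"mike": "michael", "bob": "robert", "liz": "elizabeth"}


def match_key(first, last):
    """Candidate-match key: canonical first name plus Soundex of the last name."""
    first = first.lower()
    return (NICKNAMES.get(first, first), soundex(last))
```

For example, `match_key("Mike", "Sarlo")` and `match_key("Michael", "Sarlow")` produce the same key, so the two mentions would be surfaced to a reviewer as a likely single entity.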
Now, modern techniques, again, we have Template Matching, bringing data together and bringing entities together where we have particular patterns in the data that we’re looking for, that associate a customer or, I guess, any kind of entity with a particular piece of information that might trigger a notification requirement. And we do use Machine Learning Models, both static models that are done by vendors and the internal ability to train models for specific cases where we have large but unique datasets. These models produce confidence and accuracy scores, which again we’re not using the AI to replace the human reviewers, we’re using it to augment the human reviewers, to basically mark in the terrain, these are the most valuable places to dig, these are the places where we should be spending our review dollars in order to get the best results.
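A minimal Python sketch of using model scores to direct review effort, rather than to make the call, might look like the following. The 0.5 threshold, the document IDs and scores, and the function name `prioritize` are assumptions for illustration; nothing here reflects how any particular vendor model reports its scores.

```python
def prioritize(scored_docs, review_threshold=0.5):
    """Split documents into a ranked review queue and a low-score pool.

    scored_docs: list of (doc_id, score), where score is a model's confidence
    that the document contains sensitive data. Nothing is discarded: the
    low-score pool is intended for statistical sampling by human reviewers.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    review_queue = [doc_id for doc_id, score in ranked if score >= review_threshold]
    sample_pool = [doc_id for doc_id, score in ranked if score < review_threshold]
    return review_queue, sample_pool


scored_docs = [("DOC-1", 0.12), ("DOC-2", 0.97), ("DOC-3", 0.64), ("DOC-4", 0.05)]
review_queue, sample_pool = prioritize(scored_docs)
```

The highest-scoring documents reach reviewers first, which is the “most valuable places to dig” idea, while the remainder stays available for sampling.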
And so, in that way, we’ve really distinguished ourselves within the industry by deploying these technologies, and not only deploying them, but operationalizing them and having them ready to go within days of an event happening, so that the answers that the end-user, their counsel, and the people who don’t even yet know they’ve been exposed are waiting for arrive as quickly as possible.
It looks like no questions, and we appreciate everybody joining.
Please do feel free to reach out to us. Everybody is on the website, happy to answer any questions and dig down into this more. We work with organizations to consult on incident response plans and on smoke testing processes. We can work with you to simulate these post-breach steps. And certainly, if you’re an organization that is expecting a breach or wants to be more prepared, it’s really important to understand the stresses that come beyond just your typical incident response plans as you start to get into these types of exercises more on the disclosure side.
I just want to thank everybody again and thanks so much for joining. I’ll kick it over to Rob Robinson to close us out.
Thank you, Michael, and thank you to everybody on the team for the excellent information and insight today. We truly appreciate it. And we also want to thank all of you who took the time out of your schedule to attend today’s webcast. We know how valuable your time is, and we’re grateful you shared it with us today.
We also hope you have an opportunity to attend our next monthly webcast and that is scheduled for one month from today on March 17th at 12 p.m. Eastern Time, and it will be on the topic of Remote Security and Data in Legal Reviews. In this expert presentation, enterprise security experts, data and legal document review authorities and some industry-leading technology innovators will share how organizations are preparing and responding to the increasing security and privacy challenges in today’s remote world. We hope you can attend.
Thank you again for attending today. Stay warm, safe, and healthy. And this concludes today’s webcast.
2021.02.17 - HaystackID - Its Only A Matter of Time - Webcast PPT