[Webcast Transcript] Handling Non-Traditional Data Sources in eDiscovery
Editor’s Note: On July 15, 2020, HaystackID shared an overview and explanation on the handling of non-traditional data sources in eDiscovery as part of our monthly educational webcast series hosted on both the BrightTalk Network and the HaystackID website. While the full recorded presentation is available for on-demand viewing, provided below is a full transcript of the presentation as well as a PDF version of the accompanying slides for your review and use.
Handling Non-Traditional Data Sources in eDiscovery
You’ve heard about Alexa potentially becoming a witness in a criminal trial, but what about all the other new and unusual sources of data that may be useful as you prepare to commence or defend against litigation? With data originating from Slack to talking refrigerators and from robots, doorbells, thermometers, and other IoT devices, what does the potential deluge of data look like, and how are you going to collect, process, review, and produce it?
Join our panel of experts as we explore some of the non-traditional data sources that litigants are faced with today and what they may face in the future.
+ Non-Traditional ESI Fundamentals and Considerations
+ Audio and Video Discovery
+ Cloud Data, Collaboration Suites
+ Short Message Collaboration
+ The Expanding Internet of Things (IoT)
+ Michael Sarlo, EnCE, CBE, CCLO, RCA, CCPA: Michael is a Partner and Executive Vice President of eDiscovery and Digital Forensics for HaystackID. In this role, Michael facilitates all operations related to electronic discovery, digital forensics, and litigation strategy both in the US and abroad.
+ John Wilson, ACE, AME, CBE: As Chief Information Security Officer and President of Forensics at HaystackID, John is a certified forensic examiner, licensed private investigator, and information technology veteran with more than two decades of experience working with the US Government and both public and private companies.
Good morning or afternoon depending on your location. I hope you’re having a great week. My name is Rob Robinson, and on behalf of the entire team at HaystackID, I’d like to thank you for attending today’s webcast titled Handling Non-Traditional Data Sources in eDiscovery. Today’s webcast, kindly hosted with The Masters Conference, is part of HaystackID’s monthly series of educational presentations conducted on the BrightTALK network, and designed to ensure listeners are proactively prepared to achieve their computer forensics, eDiscovery, and legal review objectives during investigations and litigation, and our expert presenters for today’s webcast include two of the industry’s foremost subject matter experts and authorities on cybersecurity, digital forensics, and eDiscovery.
Our first presenter is Michael Sarlo. Michael is a Partner and Executive Vice President of eDiscovery and Digital Forensics for HaystackID, and in his role, he facilitates all operations related to eDiscovery, digital forensics, and litigation strategy, both in the US and abroad, for HaystackID.
Our second presenter is John Wilson. John is the Chief Information Security Officer and President of Forensics at HaystackID. He is a certified forensic examiner, licensed private investigator, and information technology veteran with more than two decades of experience working with the US government, and both public and private companies.
Today’s presentation will be recorded for future viewing and a copy of the presentation materials will be available for all attendees, and you can access these materials directly beneath the presentation viewing window on your screen by selecting the Attachments tab on the far left of the toolbar beneath the viewing window. At this time, I’d like again to thank The Masters Conference for the opportunity to present today and turn the mic over to our expert presenters, led by Michael Sarlo, for their comments and considerations on the handling of non-traditional data types in eDiscovery. Mike?
Thanks so much, Rob, and thank you for that intro, and hi everybody. I’m really looking forward to our presentation today. This is Mike Sarlo speaking. We’re going to kick off today with a housekeeping note: if anybody has any questions throughout the webcast, please just feel free to pipe in. There should be a text box there and we will be watching your feedback and questions live. Things that make sense to be answered live will be answered live, and otherwise we’ll wait until the end and cover them there if that’s more appropriate based on the question. So, please don’t be shy. Obviously, on a webinar, nobody can see each other. It’s always easier for us when you guys ask questions, and hopefully we can answer all of them.
So, today’s webcast, Handling Non-Traditional Data Sources in eDiscovery Investigations and Compliance. I think eDiscovery in this context, this presentation, means a lot of different things. Today’s agenda, we’re going to talk about more of the fundamentals and considerations. We see so many alternative data types, new data types, new data types reborn, old data types reborn, I guess, in our practice, John and I, and most of them don’t have oftentimes repeatable methods for collection, and oftentimes when they do, that repeatable method may be out of date based on the application being up to date, or frontend/backend, or new features. We’re going to go into just some basics, where I still think it is very important, which is audio and video discovery, and the burden for this type of discovery from a cost and computational standpoint has been much reduced in the past few years, and for that reason, I think everybody needs to be more aware of it.
We’re going to then talk about cloud data and collaboration suites, and some cutting-edge issues in digital forensics, too. Even for mobile phones, where everybody’s like, oh, we’ve been dealing with mobile phones for years, it’s not really an alternative data type, the changes I think fall into the same gamut, and there are really robust changes occurring in app architecture and things like that as well. Then we’ll jump into a brief discussion about short message collaboration tools, and finish off with the IoT, the Internet of Things, and where we see things going with the technology explosion that’s quickly on the horizon with the advent of increased access to 5G and what that means for eDiscovery.
So, any time that we encounter a new data type, we like to break it down, and really, this is the process that we think about. I’ve been doing this for years, John is probably twice as long as me, and these are the questions we’re always thinking about when we get asked the question, sometimes you’re supposed to appear to be the expert and you’ve never handled it, but you can build expertise immediately and build a sensibility and a process, which really follows along the EDRM, thinking about it at a data type level and assessing repeatable outcomes across each of these sections, and where you can build those then, and where you can measure as well.
Somebody just asked a question. Scott Cohen is not presenting today. He had an emergency and could not make it, just so everybody is aware.
So, John, why don’t you run down this list, and give some feedback here?
Yes, absolutely. Thank you. So, when you start talking about new things or different things, your non-traditional data and how that all presents, and how you fit that into the EDRM, it follows that standard paradigm. The first thing is identification. How do you define the scope, not just of the data itself, but what the scope is for your project, what the scope is relative to what you’re looking for? What’s involved in a matter? That identification phase becomes important because that’s where you can figure out things like, hey, we’ve got TikTok video chats, or we’ve got WeChats or WhatsApps, all sorts of different newer avenues and channels of data and information that you have to scope through, but you might be able to figure out early on in a case, during that identification phase, that, hey, we don’t really care about audio chats, or, hey, audio is a key component and most of our custodians are doing audio or video chats with other people involved in the matter of the case, and so those audio and videos can become very important. So, figuring those kinds of things out.
And then next you have to move on to the control aspect. When you’re talking about the control aspect, you have to figure out who has custody, who has control of the data. Was the data created on a personal device that’s BYOD in the organization, or is it a personal device that is not a BYOD organization; they also have a business device? Figuring out the who has that aspect of it, it’s important. Even in the US, especially now you have California CCPA, you have the New York SHIELD requirements that are coming into play this year. There’s a lot of various challenges in the US that are coming to discovery in general. So, making sure—
Let me also add something to that, too, just on the control and just on that gamut. It’s really important in very large enterprises, where they may be headquartered in the EU, or they have robust global business operations, oftentimes files that appear to be on a singular network share, they’re actually on geographically dispersed block-level shares, and you end up with data points that sometimes they appear to be in the US, they’re actually maybe saved in the EU, and this can have pretty profound GDPR impacts. So, I always try to –any time I have a client that has global operations, we’re very aware of this, and oftentimes legal is very disconnected from this, or they just don’t understand it, and it’s something we have to take very extreme caution with. So, that control element is also somewhat where does the data sit as well, and then also, and we’ll get into this as we get into the cloud, who really owns the data and how much [purview] do you have to where that data actually lives, and what are your obligations there?
Go ahead, John.
Yes, and just you continuing exactly in that vein, when you start talking about the international borders things, it’s not only do you own the data and control the data, but where does it reside and who can be attached to it, because the company can own the data. It can be the company’s formula, it can be the company’s information, but if it was done in countries that have privacy concerns, where they place the privacy of the individual over the organization, I’ve worked cases where we weren’t allowed to actually access the formula, the calculated prices for a product, and that was the key issue at stake in the case, and the individual said, well, I created the formula, it’s my information and it’s private, and I’m not going to allow you to use it. So, making sure you’re understanding all of that, all of that comes into play because it’s control, it’s ownership, and it’s the right to access.
Once you’ve then established who has the boundaries around the controls and who owns it and who has rights to it, then now you can start talking about, OK, what can we preserve? What is preservation? What does preservation mean? Some sources of data have very difficult challenges to preserving that data, how do you get that data preserved, and how do you make that data accessible and usable in a matter? And then as we were talking about, as Mike was talking about at the start of this, a lot of it’s how do you come up with that defensible pattern to make that data accessible and usable and defensible. When you have challenges like apps that update and change how you access the data, and what parts of the data you can access, on a monthly basis, some of these newer apps are changing constantly. They constantly are adding encryption, they’re adding additional protections, they’re adding new storage mechanisms that are all architected to prevent third-party access to that information. So there are real challenges with a lot of these non-traditional data sources as to how you’re going to preserve them, and a lot of times it is a bit of testing. We’ll have to actually go out and get similar devices and run similar applications, and create test data sets to validate a new mechanism when they have no published APIs, no published information as to how you access the data stream or what it looks like, and how that data source can actually be utilized.
And John and I, we oftentimes start a two-pronged approach to triaging any new client’s environment and data sources that are in scope, and what we’ve really learned now is the email, the network shares, things like that, that oftentimes is the easy stuff, or sometimes it can be difficult depending on the type of implementation or auto-delete settings. It’s the custom applications and really what I call application discovery. Even the most rudimentary organizations oftentimes are running their business on custom technology or backbone standardized technology like NetSuite where they’ve customized it enough that you may have data that’s purging from a system as you speak, that’s highly relevant. It’s very important to structure a questionnaire around these alternative data types early on, because they always exist now, and with discovery, that is still something we see all the time in different matters, and it’s why people call us for our digital forensics expertise. It’s incredibly important to understand those data points early on, as there can be a lot of challenges. You might be dealing with staff who just aren’t very accessible or don’t understand the urgency, or you might be dealing with several different SMEs that maintain individual backend components of a frontend web application, many different databases that could equal one part, or I think of cases where we even had certain types of mechanical equipment readings that were intertwined with more typical digital data points to create a dashboard. How do you collect that? How do you preserve that?
And oftentimes now what we’re finding is a collection plan may be iterative, you may need to be looking at multiple data points, going into backups, and it becomes a question of how do you preserve that when you have an ongoing preservation objective and requirement, which is oftentimes difficult to put in place and can be very costly when you’re dealing with a lot of these alternative data types.
And somebody has asked a question, if we’ll be covering some of the self-disappearing social media comms. We’ll definitely touch on that.
Go ahead, John.
Yes, absolutely. Thank you. Yes, so as Mike was just saying, the challenges of collection, as you’ve now identified what’s in preservation, is it on legal hold, is it not on legal hold, what do you need to do about the retention policies, now you’ve got to move on to the actual collection of it, so that it can then be produced or utilized, or processed in the case. A lot of times, as Mike was just saying, it’s going to require some really specific specialized procedures, and sometimes there are numerous iterations, because we’ll have to go through and collect, and that gets that first pass, but then it takes two months to collect some data sets. If they’re large, or depending on how they’re distributed, you may have to go back and re-sweep because you need to be complete as of a certain date. So, paying attention to how that all works and how that’s all balanced, it’s nowhere near as linear as the more traditional data sets. You can go out, you can pull email with a finite date, hey, I want all emails through date X, and that’s pretty easy and straightforward. When you start getting into a lot of these newer systems, these non-traditional data types, it gets much more complicated because you may have communication platforms that have gone through three upgrades in the last six months, and each one of those upgrades created a new storage repository. So to get to those archives, you’ve got to go to the three different storage repositories instead of just the one, because these apps are just moving so fast and changing and evolving so rapidly, they’re not following those traditional application platforms where they’re making sure there’s a full upgrade path and everything moves forward. A lot of times they are just truncating. 
They’re saying, OK, yes, that data is there, and then now we’re going to put all the new data as of today into this bucket, and then the next update comes out and they say, well, yes, we made a bunch of changes, and we couldn’t really get it all massaged into place, so we’re just going to put all the new data into this bucket, and so you’ve got to have some pretty good understanding there as to how you run through those iterations, how you develop the processes and procedures to ensure you have a defensible collection of all of that information.
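Editor's Note: To make the re-sweep idea above concrete, here is a minimal Python sketch of merging messages from several storage buckets up to a cutoff date. The record layout (`id`, `created`) and the repository names are hypothetical and are not taken from any real platform's export format; a real collection would follow whatever structure the application actually exposes.

```python
from datetime import datetime, timezone

def sweep(repositories, cutoff):
    """Merge messages from several storage buckets into one collection.

    `repositories` maps a (hypothetical) bucket name to a list of message
    dicts with an "id" and a timezone-aware "created" datetime.
    Messages created after `cutoff` are left out, and a message that
    appears in more than one bucket is kept only once.
    """
    seen = {}
    for repo_name, messages in repositories.items():
        for msg in messages:
            if msg["created"] > cutoff:
                continue  # outside the "complete as of" date
            # Keep the first copy encountered and record where it came from.
            seen.setdefault(msg["id"], {**msg, "source_repo": repo_name})
    return sorted(seen.values(), key=lambda m: m["created"])

if __name__ == "__main__":
    utc = timezone.utc
    repos = {
        "bucket_v1": [{"id": "a", "created": datetime(2020, 1, 5, tzinfo=utc)}],
        "bucket_v2": [
            {"id": "a", "created": datetime(2020, 1, 5, tzinfo=utc)},
            {"id": "b", "created": datetime(2020, 3, 1, tzinfo=utc)},
        ],
    }
    for row in sweep(repos, datetime(2020, 6, 30, tzinfo=utc)):
        print(row["id"], row["source_repo"])
```

Recording the source bucket for each message is one way to document which repository a given item was pulled from when the same sweep has to be repeated later.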
And oftentimes, this is really truly the case with alternative data types, especially web-based, and really, I would say 90% of alternative data types are web-based in one way or the other. You’re collecting and processing at the same time in order to have a positive review and production outcome, and then as far as showing evidence to be used in a format that makes sense, in a deposition, or in the courtroom, or before a regulator. The way that this is done is the collection and what we oftentimes think of processing. It’s completely customized to get you a processed output, oftentimes at point of collection, and that’s where we spend… well, we did spend an incredible amount of time at HaystackID to get a repeatable workflow across our development teams, our web teams, and our processing teams that really goes deep through the processing lifecycle at points of collection to get really flattened content that’s defensible with all available metadata. So, we’re always thinking of that component of it.
As we get into analysis, we think of forensic analysis, say, a trade secret case where you have issues where somebody plugged in a flash drive and copied a bunch of files, which we see 70% of the time on different cases in that domain. With alternative data types, analysis can also be mission-critical in a different way as it relates to reducing the scope of the overall breadth of a data repository. We deal with Slack all the time, and our ability for very large enterprises to analyze 70,000 public channels, maybe in a company with 200,000 or 300,000 employees, to see who is communicating when, about what, has been a huge component of many of our matters, as we aim to reduce the open nature of these new-age communication platforms, like Teams and Slack, where they come in scope and the collection mechanism typically has been to go out and collect everything, and then cull it, and then you don’t really have a review solution. But there are ways to analyze these types of repositories, get the data into tabular format, and it’s more of a structured data analytics exercise at that point to identify the boundaries of responsiveness. We’ve had great success there on different matters. You may have five or 10 custodians or a C-suite, they’re not in Slack, but the organization knows it’s there, it’s heavily used, they have access to everything, how do you limit that? That’s always something that we’re thinking of early on, and that’s a heavily documented process, and that analysis oftentimes plays back into the way that we try to understand a new system or any system to come to a production and collection outcome.
Go ahead, John.
Yes, and as Mike was just saying, we’re actually using the analytics way earlier than ever before. We’re bringing the analytics in at the very beginning of things because, again, you have these just massive repositories of disparate data that’s not easily reviewed, not easily searched, and we’ve been developing processes to actually get in and use the analytics so we can then hone that scope in so that instead of having to review 70,000 channels with 50 million messages, we can get it down to the 20 channels that were talking about the specific matter, and then determining from there, the 15 individuals that were actually the key respondents within that particular matter. Just as an example, but definitely bringing the analytics in much earlier in the process is really a key to dealing with non-traditional data sets.
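Editor's Note: As an illustration of bringing analytics in early, the following Python sketch ranks channels by how often they mention matter-related terms and lists which named custodians posted those hits. It is only a sketch under assumed inputs: the message fields (`channel`, `user`, `text`) and the search terms are hypothetical, and this is not the presenters' actual workflow or tooling.

```python
from collections import Counter, defaultdict

def rank_channels(messages, terms, custodians):
    """Return one row per channel that mentions any term, busiest first.

    `messages` is an iterable of dicts with hypothetical keys
    "channel", "user", and "text".
    """
    terms = [t.lower() for t in terms]
    custodians = set(custodians)
    hits = Counter()
    posters = defaultdict(set)
    for msg in messages:
        text = msg["text"].lower()
        if any(term in text for term in terms):
            hits[msg["channel"]] += 1
            posters[msg["channel"]].add(msg["user"])
    return [
        {
            "channel": channel,
            "hits": count,
            # Custodians of interest who posted a matching message.
            "custodians_present": sorted(posters[channel] & custodians),
            "distinct_posters": len(posters[channel]),
        }
        for channel, count in hits.most_common()
    ]

if __name__ == "__main__":
    sample = [
        {"channel": "pricing", "user": "ann", "text": "New pricing formula draft"},
        {"channel": "pricing", "user": "bob", "text": "Formula looks fine"},
        {"channel": "general", "user": "cat", "text": "Lunch at noon?"},
    ]
    for row in rank_channels(sample, ["formula"], ["ann"]):
        print(row)
```

The output is tabular by design, so it can be reviewed, sorted, and documented as part of the scoping decision before any full collection takes place.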
So, really the rule of thumb here, as you get into non-traditional data types, is you’ve got to think about it end-to-end. You can’t deal with it as you go, as the data moves through the different stages of the matter, and it’s so much more important for tech support and litigation support, and support staff just in general, to be really in sync with what’s coming in the legal pipeline, just so that you can get ahead of these challenges.
And then just to breeze through some other issues, oftentimes certain companies may not even be aware of the existence of these alternative data types, or they may not be fully aware of whether the data is volatile, and it’s always important, as you get into alternative data types, to move beyond legal and to force them to bridge the gap with whoever the SME is in their organization, or set of SMEs, as it relates to managing these types of repositories. Volume can be totally out of bounds, oftentimes. We’re going to talk briefly later on about data volume growth over the past couple of years and going forward, but the volume of information being stored in these platforms cheaply – storage and computational bandwidth for large enterprises are as cheap as they’ve ever been – is incredibly daunting. To do that analysis, or even to be able to do a collection, you may need to be on more new-age databases or using big data analysis engines just to parse the data. We’ve seen situations with very large Slack workspaces where Slack’s own compliance export features can’t export the actual workspace. It’s just too large and it breaks. So, we’ve had to spin up Hadoop clusters to analyze that data, and then most recently we converted to Elasticsearch at the backend to be able to do queries faster to get to the type of analysis that we’re discussing here.
And then meaning, context. How many times have you seen just that garbled wall of text, something that doesn’t look that way when you access it on a website, after a vendor went out and collected it for you? It’s really important, I think, to always get to that production-ready image or copy of whatever data you’re trying to preserve as soon as you’re preserving it, so you can fully understand the limitations and mitigate those downstream.
We talked a lot about volatility already, but really what this means is that there may not be a mechanism to preserve that data. We often see this with very old legacy systems or things where sometimes the best way is manual, that it’s too cumbersome, or you could have very business-critical systems that doing the type of queries and extractions could shut down an entire enterprise, because they have a backbone weakness with these systems. We’ve seen this in some very large corporations here, especially if you get into organizations that may be responsible for infrastructure in the United States. It’s a big issue.
And then the overall complexity. Data is frequently stored in complicated data structures and formats, and I can tell you that Relativity processing doesn’t handle it for you whatsoever, so you’re oftentimes left thinking about custom development.
So, we’re going to jump over to the most basic thing, which in my mind oftentimes gets left behind these days: audio and video discovery. Just some things we see: 50% of a corporation’s customer interactions are oftentimes being tracked and preserved in audio, and oftentimes these systems are very ephemeral. They save recordings for a certain amount of time for quality assurance. In other scenarios, they’re being tracked much longer for compliance purposes. We see robust volumes here as well, and lots of challenges in audio discovery oftentimes hinge on the quality of conversion of data. Metadata is often tracked in databases that may also require some type of custom collection, or you need to go out and grab the audio and then grab the metadata. That metadata can be incredibly important as we get into modern-day audio and video transcription systems that can learn faster about who’s speaking and how they speak based on knowing who the caller is, or who’s being recorded.
The challenge really is that we see so many different formats. You have your codecs that can be custom, and there are so many compression and decompression libraries, which is what codec is short for, that can be bolted into different types of call center products or security footage systems. Some of them may be open source, others may be open source with a few lines of code changed, which may require you to do more manual conversions to standardize data. That’s something you always need to think about: how do you standardize from a diverse array of formats into something that can be managed by the audio/video discovery tools we typically see in eDiscovery? I think Veritone and Nexidia are some of the old-school players, they’ve been around the block for a while. Authenticity.ai is really the new kid on the block. The difference between some of these legacy engines and the newer engines is really more of a multi-engine approach to translation and transcription; it’s more AI-based, so many different engines working together to give you a better outcome around conversion. The goal here is always to get you text from audio, and really good text, and Authenticity does a great job of identifying speakers and actors. From a transcription standpoint, we really like it at HaystackID because it can give us a translated native back, which is great for second requests and things like that, where you sometimes have these requirements.
From a review and production standpoint, that’s the biggest thing, right? You have so much text and sometimes it’s really only 90% accurate. Sampling is always going to help you gauge your initial transcription quality. It’s really important to just, if you have the time, take what you get back from any of these engines, give it a look and a listen, and make sure it’s close to what you expect. If it’s not, you should be going back to these providers and asking them to work through it some more with different settings.
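Editor's Note: One common way to put a number on transcription quality for a sample is word error rate, the word-level edit distance between a human-checked reference and the engine's output, divided by the reference length. The Python sketch below is a generic illustration of that measure plus a reproducible random sample; it is not a description of how any of the engines named in this webcast score themselves, and the function names are ours.

```python
import random

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the engine
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def pick_sample(recording_ids, k, seed=0):
    """Pick up to k recordings at random; a fixed seed keeps it repeatable."""
    ids = list(recording_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))

if __name__ == "__main__":
    checked = "please hold while I transfer your call"
    engine = "please hold while I transfer you call"
    print(f"WER: {word_error_rate(checked, engine):.2f}")
    print(pick_sample(range(1000), 5))
```

A fixed seed is used so the same sample can be regenerated if the sampling step ever has to be explained or repeated.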
Fuzzy search is going to be very important when you’re dealing with audio/video, as far as just establishing a defensible workflow insomuch as you’re searching it. It’s never that cost-effective to have a bunch of reviewers listen to audio, but it’s sometimes required if you need to do redactions. All these platforms have very good audio/video to transcription redaction capability as well.
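Editor's Note: To show why fuzzy search matters for machine transcripts, this minimal Python sketch slides a window over a transcript and scores each window against a search phrase using the standard library's `difflib.SequenceMatcher`. The 0.8 threshold and the sample sentence are arbitrary assumptions for illustration; review platforms implement fuzzy matching in their own ways.

```python
from difflib import SequenceMatcher

def fuzzy_hits(transcript, phrase, threshold=0.8):
    """Find word windows in `transcript` that approximately match `phrase`.

    Returns (word_index, matched_text, similarity) tuples for every window
    whose similarity ratio meets the threshold.
    """
    words = transcript.lower().split()
    phrase_words = phrase.lower().split()
    size = len(phrase_words)
    target = " ".join(phrase_words)
    hits = []
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        score = SequenceMatcher(None, target, window).ratio()
        if score >= threshold:
            hits.append((i, window, round(score, 2)))
    return hits

if __name__ == "__main__":
    # "prising" stands in for a mis-transcribed "pricing".
    text = "we discussed the prising formula on tuesday"
    for hit in fuzzy_hits(text, "pricing formula"):
        print(hit)
```

An exact keyword search for "pricing formula" would miss the mis-heard phrase in this example, which is the gap a similarity threshold is meant to cover.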
And then for all this stuff, it’s always great to engage opposing parties and establish a dialogue regarding overall burden, and I encourage everybody, for every alternative data type, to really think about burden, because sometimes if you can get to just preservation, you may be able to reduce your burden and not have to go through collection and processing.
So, data in the cloud, I’m going to kick it back to John, and go ahead, buddy.
Yes, so first we’re going to talk about just the data volumes that are out there, as we go out and collect data all the time. The world’s data is predicted to grow from 33 zettabytes, and that’s a number from 2018, to 175 zettabytes by 2025. There isn’t a specific growth trajectory that’s been updated for this year yet by any of the major sources, but by all accounts, 2019’s growth was significantly higher than predicted. So, that 175 zettabytes is actually probably going to be small. 90% of the world’s data has been created in just the last two years alone, and that’s a trend that’s also expected to keep continuing. Data volumes and data growth, everything is being recorded, everything is being measured, and data volumes are just growing exponentially at a rapid pace.
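Editor's Note: For readers who want the growth rate implied by those two figures, the short Python calculation below derives the compound annual rate from 33 zettabytes in 2018 to 175 zettabytes in 2025. Only the two volumes and the two years come from the presentation; the function name is ours, and the result is an arithmetic consequence of those numbers, not a separately sourced statistic.

```python
def implied_annual_growth(start_volume, end_volume, years):
    """Compound annual growth rate that takes start_volume to end_volume."""
    return (end_volume / start_volume) ** (1 / years) - 1

if __name__ == "__main__":
    rate = implied_annual_growth(33, 175, 2025 - 2018)
    print(f"Implied growth: {rate:.1%} per year")
```

The two figures work out to roughly 27 percent growth per year, compounding over the seven-year span.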
But when you start talking about these non-traditional data sources, there are some really interesting things here. Over 8 million people use voice control every month, so talking about Alexa and Siri and Cortana, and that number comes from June 2019. I suspect that number is substantially higher today. Every day, 95 million photos and videos are shared on Instagram, almost a billion text messages are sent every hour, and everybody knows about email, but 156 million emails are sent out every minute, and then lastly is the spam. Every second, there are almost 2 billion spam messages sent. So, keep all of that in the context of what we’re challenged with when you’re talking about preserving all of these different data sources, how that volume has really grown, and what that impact is going to be to us. You have to also realize that this ‘every second’ stat was, up until two years ago, primarily email, but now those spams are coming in texts. They’re coming in the chat applications and messaging platforms. They are coming in voicemails and phone calls. So, all of those data sources are growing exponentially.
And obviously we deal a lot with short message communications like text and Slack and Teams where we’re seeing this massive new-age growth, but the email is still there. So, when we think of the cloud, we’re talking about Amazon AWS, Microsoft Azure, and Google’s cloud as well, and all the applications we all use live somewhere in those worlds 90% of the time, and that’s what I see as the cloud, so to speak. We also have private clouds, which can be common, and which can be major data center providers who offer you flexible compute and storage, very similar to the way you purchase that in AWS and in Azure. One of the big challenges arises as we get beyond custom applications that live behind the client’s firewall on their server, what we call more application discovery, and into applications that now are in a client’s tenant, which is usually the way we associate an instance with more of a cloud infrastructure for a corporation, somewhere in the cloud, or when you have a hosting provider, let’s say a platform like JIRA, where you can sometimes purchase their platform, hosted and managed by them through AWS. The question becomes, where does the data actually reside, and this can be important for some clients who really want a clear track on the segregation of data. At HaystackID we have different military clients or military subcontractors, so ITAR and GDPR considerations where segregation goes beyond the directory structure. It’s down to the disk oftentimes, and this can be important as we get into more complex data privacy issues that are arising as it relates to data processing, and everybody does have an ethical obligation. Gordon Partners v. Blumenthal is a case where cloud data became very important, and the organization didn’t necessarily search that and provide it, and really the outcome was that you have an ethical obligation to extract, analyze, and search data that may not be actually in your control.
I think courtrooms have become more aware of the cloud and I think everybody’s more aware of just the cloud as it relates to technology and burden. Again, it’s important to have a discussion and be able to work with somebody where you quantify that, and throughout all of this, and somebody asked a question about chain of custody, I guess this is a good segue. Chain of custody with alternative data types is why I’m not a huge fan of prepackaged out-of-the-box tools that go out and collect certain alternative data types, because the platforms change so much that you get a tool that works one day, and it doesn’t work the other day. When you get called into a deposition or you need to function as an expert witness, not knowing the way that an API behaves – and we’re going to talk about APIs – and how that gives you an outcome as you are collecting data can really get a lot of egg on your face as an expert, and it’s so important to be able to really understand that as a technologist and expert witness. With these platforms, chain of custody oftentimes starts at the point of sale, so to speak, when you are connecting to the platform and pinging it with an API, and really it becomes defensible based on the credentials of the operator performing that collection, insomuch as being able to document start date, end date, and the queries that were in scope to get to that data source. We keep all of our code; we keep screenshots if we have a GUI-based application as we click into an environment. Basically, we will self-certify ourselves and that becomes the chain of custody documentation that backs these up when there isn’t a physical handoff, really, from an individual, so to speak.
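Editor's Note: The documentation described above (operator, start date, end date, queries in scope) can be captured as a simple record at the moment of collection. The Python sketch below is one hypothetical shape for such a record, with a SHA-256 hash of the collected payload added so the export can be verified later. The field names, the example operator, and the example query string are assumptions for illustration, not HaystackID's actual forms or procedures.

```python
import hashlib
import json
from datetime import datetime, timezone

def custody_record(operator, source, queries, started, ended, payload):
    """Build a self-certification record for an API-based collection.

    `payload` is the collected export as bytes; its SHA-256 hash lets a
    later reviewer confirm the export has not changed.
    """
    return {
        "operator": operator,
        "source": source,
        "queries": list(queries),
        "started_utc": started.astimezone(timezone.utc).isoformat(),
        "ended_utc": ended.astimezone(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }

if __name__ == "__main__":
    start = datetime(2020, 7, 15, 14, 0, tzinfo=timezone.utc)
    end = datetime(2020, 7, 15, 14, 30, tzinfo=timezone.utc)
    record = custody_record(
        operator="J. Examiner",             # hypothetical operator
        source="example-workspace",         # hypothetical data source
        queries=["channel:pricing after:2020-01-01"],  # hypothetical query
        started=start,
        ended=end,
        payload=b'{"messages": []}',
    )
    print(json.dumps(record, indent=2))
```

Keeping the record as plain JSON means it can be stored alongside the code and screenshots that were retained for the same collection.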
And then certainly self-preservation never is proper. I’ve seen it over and over again, and I’ve seen it even when clients direct me when I’m dealing with an executive who says, ‘all of my emails are in this folder’, you let the law firm know, and they say, ‘oh, OK’, and then four months later they get a motion to compel because all the emails weren’t in that folder. Incredibly important. I think there’s a lot of rudimentary tools on the market that claim to preserve data, but they’re not defensible, they don’t meet the data forensic standard.
Now, for some clients, small cases, this might be OK where you know the data that is responsive, that is there, and the parties know it’s there and they need a copy of it, but I wouldn’t go further than that.
In those scenarios, it’s usually something that you’ve already negotiated with the opposing side that, ‘hey, this is what we’re going to give you from the system’.
Yes, and as we get more into the cloud, we really think about – we talk about Azure, AWS, and Google’s cloud, and the productivity suites that sit on top of them. The success story that is Office 365, on a daily basis as far as new features, changes, and new capabilities, has been amazing for organizations. If you take a look at Microsoft’s stock, they’re doing really well as a result of the pandemic, as everybody shifts to working from home, and these tools are so much more critical as far as just allowing all of us to do our jobs, all of us to collaborate.
If organizations were moving to the cloud before the pandemic, they’re moving even faster, just massive growth in cloud computing, and these approaches… and one thing we should say is just because something lives in the cloud – and it goes back to that identification phase – it’s very common for organizations to be operating in what we call more of a hybrid approach, which is physical infrastructure that is fully linked with the cloud, and sometimes the lines that demarcate that can be very blurred. Cloud infrastructure, oftentimes, will appear as if it’s physical on-prem infrastructure. You really need to understand that stuff.
As we deal with all these repositories, there are certain limitations to search and getting data out of them. Office 365 has done a pretty good job making SharePoint files, basic SharePoint data, and emails available through the eDiscovery and Compliance Center, where it used to be very difficult to get that and you would always need a specialized tool to get the proper metadata. We still don’t really like running search terms in any of these platforms, and we will always caution against it. The big thing here is you just don’t know the way that Microsoft or Google is indexing files. You don’t have a good sense of the completeness of indexes. If any of you have ever used Windows search or your own Outlook search, you would know it’s not that great, so nothing is ever going to really fully replace dtSearch syntax, although date culling is always fine. Microsoft is getting there, a little bit better with Teams, being able to get data out. It doesn’t have the same look and feel and it doesn’t have great APIs to capture and flatten it the same way as we would maybe Slack.
A lot of considerations there… go ahead, John.
Talking about the Office 365 platform, with the pandemic and the move to remote and everything else, the Microsoft Whiteboard is a great application and example. There is no export mechanism for it. If you put an attachment into the whiteboard, then it goes to your SharePoint, but the rest of that data lives in the Whiteboard application and there’s no export methodology for that currently. I know that it’s being worked on. Making sure you’re addressing those non-traditional data types like the Whiteboard application, Post-it Note applications, all those sorts of things. How do you make sense of that data? How do you collect that data, preserve that data?
Same vein too, another one is OneNote, which you can get data out of, but really you don’t get it all out, and trying to produce it can be even more of a nightmare. Oftentimes, you need to sync a cloud notebook down to a local machine and start to try printing things. These are the types of data types we see clients are always turning a blind eye to. They show up as an exception in a multi-terabyte data lake collection. Same thing for all of your Mac data types, right, like Keynote slides: none of the processing tools handle these correctly, insomuch as you get a piece of a file or an output that’s many different files. Some of the old data types are still treated as alternative data types in eDiscovery. The industry hasn’t done a great job actually handling these still, and there aren’t a lot of good methods for getting those types of data types ready for discovery that aren’t manual.
Likewise, with G Suite, definitely not as robust from a compliance standpoint as Microsoft. Although, surprisingly, they have a much larger market share still, although that’s rapidly declining with large enterprises and it’s cheaper. For start-up companies, that could be important, especially, coding firms, things like that. Wherever we see G Suite or wherever we see Atlassian, we always see G Suite, those two platforms talk quite a bit. I put it here, because this is a great example of an industry, like specific suites of collaboration tools, which is oftentimes where software development or engineering is where we see the gamut of Atlassian tools, but even in medical, for pharmaceutical companies or for medical device companies, they all have their own industry specific collaboration tools that handle different things, like eCTD filings for big pharma.
Really, it all comes down to what APIs are available. Again, being able to understand what you can get and what you can’t get and what’s enough, and also knowing when you need to convert from a format, let’s say, something out of G Suite to something that an eDiscovery tool is going to handle properly, and to really understand where the separation of metadata, that often you’re getting raw file content and capture out of the cloud, but your metadata is always separate. How do you link that up, and making sure that’s a definable process?
All of this usually, we hope, is backboned by an API, which is an Application Programming Interface, which really is the way we hit most web-based platforms, if they support an API. Sometimes these aren’t very well documented and you may think there are certain capabilities there, and there aren’t. Oftentimes too, which is common with cloud suites and applications, there are various levels of licensure. You have your basic, your pro, your enterprise, and your enterprise grid. Taking Slack as an example, you don’t get access to the eDiscovery API until you’re on Enterprise Grid, and really only the largest enterprises tend to be on that type of licensure. You then have to maybe use a different API that doesn’t give you the same results, or maybe you can get to the same results, but there’s some more jerry-rigging to get a repeatable outcome.
When we think about using APIs for web, we always have to think about how do we authenticate and what level. Admin, usually, we’re always asking for. The two major modes of collections are data plus metadata, so if you think Box or Dropbox, you get your files and you get your metadata. More and more it’s becoming more difficult to figure out who has access to what, which is constantly changing with the way that the organizations have capabilities to partition that. Do I have access to a file that’s shared to me in one platform? Maybe not.
Most APIs have a hierarchical model. We start at the top and we move down. That’s always a great way to think about them and the way to record your types and your output.
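The top-down way of thinking described above can be sketched as a simple recursive walk. This is a minimal illustration over an in-memory structure; the hierarchy (organization, workspace, channel) and the `name`/`children` keys are hypothetical stand-ins for whatever a real platform's API returns at each level.

```python
from typing import Iterator


def walk(node: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, node) top-down for every node in a nested hierarchy."""
    here = path + (node["name"],)
    yield here, node
    for child in node.get("children", []):
        yield from walk(child, here)


# Hypothetical hierarchy: organization -> workspace -> channel.
org = {
    "name": "org",
    "children": [
        {"name": "workspace-a", "children": [{"name": "general"}, {"name": "legal"}]},
        {"name": "workspace-b", "children": [{"name": "random"}]},
    ],
}

# Recording the full path for every item keeps the output traceable to its source.
paths = ["/".join(p) for p, _ in walk(org)]
for p in paths:
    print(p)
```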
I’ll kick it over to John to talk about chat data and the considerations and challenges.
When we start talking about chat data, this is where we get into some of the data sources that are very ephemeral. Some of them keep data forever, some of them have very defined retention policies, but it’s about getting into those and figuring out what those retention policies are. Skype data, which is becoming Teams, is available through Office 365, but there’s a lot of settings that can affect it. There are the corporate settings that might have set retention policies, where it gets stored, is it getting archived or journaled into another solution within the organization, or is it… then you have your end-user settings, which can make it where messages are being saved locally on the system all the time or making it where messages are not saved locally on the system ever. When you’re going into it, ‘hey, so if I have Skype repositories, how are most of my users configured? How is our server configured? Where is that data going to be stored and is it going to be ephemeral? Are we going to be missing data, or do we have to go to the journal and pull everything from the journal and then try to reconstruct message chains and chats?’ versus some of the other platforms, and you start getting into Bloomberg, that’s all XML-based, so it’s very structured and there’s very strict retentions and most of the users don’t really have the ability to have much control over that.
But the challenges with Bloomberg, for instance, where everything is in xml, all the attachments are serialized and come in a ZIP, actually a gzip, and so now you’ve got to figure out how to get the chat messages sorted out of the mail messages, because all these ZIPs, all the attachments all come in one bucket, and then you’ve got to then try to map all that information, find the pieces that you’re looking for and bring all of that together.
Domino, Lotus, Sametime, still out there quite a bit. A lot of larger enterprises still have it. Sametime logging can be a real challenge in most enterprises, because Sametime logging is generally not turned on in most organizations, because it creates a huge impact on storage and resources, bandwidth, and performance issues within the Domino platform, and so a lot of organizations have it turned off, but then users may be storing them locally and then they get stored out to a general profile location.
Like we were handling a big project where they did actually allow their users to turn on archiving, and then eventually turned on archiving for all of their users. The problem was all of the chats were being stored to a common location, so then identifying who the user was, was a significant challenge and identifying what messages that user may have participated in was a significant challenge, and bringing all of that information back together.
Then we start getting into the newer chat platforms and then getting into discovery of that, and converting things into the Relativity Short Message Format, and there are some challenges there. There’s only a very limited amount of information in fields. The short message format has character limitations, so there are quite a few challenges in pulling chat messages and making them discoverable and reviewable in a platform, whether it’s Relativity or another platform, when you’re pulling from these disparate sources like the various chat platforms.
I know somebody asked the question about what are we doing about ephemeral chats and ephemeral messaging systems, you have the Telegram and those sorts of things that take the messages, send the message and then kind of make sure that the message doesn’t live very long. Sometimes they don’t live at all, they get sent and it’s gone. Some of them, they live for a very short period on the device. But there’s a lot of challenges with that.
So, understanding, is the chat platform in use, ephemeral? Does it have ephemeral options? Things like WhatsApp, WhatsApp is working on what they call one-way encryption. There’s going to be an encryption key that can only be accessed through the original device that sent the message with the sender—
That’s a great segue into some of the issues with WhatsApp that we see too, John.
Yes, go ahead.
We have a pretty robust US investigations practice. Everybody uses WhatsApp. Nobody really uses standard SMS/MMS comms. With the way the internet is much more heavily commoditized and access to the internet is much more heavily commoditized in Europe, and with Android 8 and above, all of the typical mobile phone collection tools are no longer able to decrypt WhatsApp messages, and WhatsApp has become much better at their encryption. I would say they’re just as much an encryption company these days as a communication platform. Really, this can pose incredible challenges where, really, the only tool now is you would still want to collect the device and have a copy of that encrypted data, but we now have to do a secondary collection over-the-air using an OAuth token with a different tool that then syncs down a copy of the database through the wire from the phone itself, which is kind of crazy. You’re going through the internet – well, you’re going from a forensic workstation through the internet and back to a phone in your possession.
So, it’s important to really spend some time with custodians when they show up for an interview or anything like that to really understand what platforms they’re using, and don’t assume that’s the be-all and end-all; always check what’s there because you could be missing massive swathes of data. This is currently an Android issue. It’s coming to iOS very quickly. I will stop there with WhatsApp.
With a lot of the other ephemeral messaging, I would like to say that we’re getting – at some point, we will be at the end of forensics. There’s such a major push for right to privacy in the United States, other countries do have that, and there’s so much money to be made on privacy. As much as you regulate more, there’s just more money to be made, oftentimes. You get a lot of special interest groups that really compete and push things at both sides of the aisle, but we’re getting to that point where this type of data, it’s just not going to be able to be extracted.
Certainly, I’m sure Big Brother is always watching. We saw that even with WhatsApp, with the [Khashoggi] case where they used WhatsApp to track him and locate him, the Saudis. These types of vulnerabilities do exist and they can come into play. We see it… I dealt with… we had concerns about this in different pieces of witnesses, we were dealing with the Special Counsel and their investigation, who are our clients, and there were concerns there as well, and these things do pop up with high profile individuals. You need to be aware of them.
One last quick comment, Mike, sorry. Talking about the ephemeral messages, we’re actually at a point in the forensic processes now where we’re almost to where we have to do jailbreaks in order to access a lot of these ephemeral messaging systems when they’re critical and leveraging exploits in order to actually access that information and so there’s a very fine line there, making sure, legally, you can do what you need to do and you can do the actions that you’re asking or wanting to do depending on the relevancy and requirements of those data sources.
When the stakes are high, you’re oftentimes getting more into cybersecurity experts who are the best in their field, who are really more hackers to get to very targeted pieces of data on systems and platforms. We have folks like that that we work with here at HaystackID as well. More and more of that work is becoming a requirement. Sometimes knowing the capability there puts you at risk as well, to being able to have to offer it.
Slack, the world is Slack now. I would say 50% of my largest cases these days have some form of Slack, much more common than Teams, although Teams is taking a massive market share away from them. Microsoft has, basically, turned on Teams as if it was the mail application in Windows 95, meaning admins can no longer turn it off. What this does is it causes employees to start using it as a part of the operating system.
Here, looking at what we get out of these platforms. Slack. This is what a Slack compliance export looks like, with a ton of JSON files. You always hear this word. JSON is a common internet transport format for XML-style data. This doesn’t work very well with eDiscovery; it looks nothing like a website. We have tools that are able to work with the compliance export, which is a native capability at certain levels of Slack licensure, to flatten it to make it look more like this colorful image here with emojis and high fidelity graphics. We usually break up content by a 24-hour period. Though, in some cases, we may be breaking up content by a 24-hour period, but then creating a txt file that spans maybe several days in the event that we want to massage the data to make it more efficacious for the application of technology, so TAR or analytics, and that’s going to be the new thing: how do you get these short message streams into more of an analytics workflow? I think HaystackID (my group) is leading – I hope that we’re leading that in certain capacities.
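To make the 24-hour breakup described above concrete, here is a minimal sketch that buckets messages by UTC calendar day and flattens each bucket to plain text. It assumes each message carries a `ts` epoch-seconds string plus `user` and `text` keys, as Slack export JSON commonly does; the function names are hypothetical and real exports carry many more fields (edits, reactions, threads) that this ignores.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_day(messages):
    """Group messages into 24-hour (UTC calendar day) buckets, oldest first."""
    buckets = defaultdict(list)
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        day = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc).date()
        buckets[day.isoformat()].append(msg)
    return dict(buckets)


def render_day(day, messages):
    """Flatten one day's messages into a plain-text block for review."""
    lines = [f"=== {day} ==="]
    for msg in messages:
        stamp = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        lines.append(f"[{stamp:%H:%M:%S}] {msg.get('user', 'unknown')}: {msg.get('text', '')}")
    return "\n".join(lines)


# Hypothetical sample spanning two days, deliberately out of order.
sample = [
    {"ts": "1594771200.000100", "user": "U1", "text": "first"},
    {"ts": "1594857600.000200", "user": "U2", "text": "next day"},
    {"ts": "1594774800.000300", "user": "U2", "text": "reply"},
]

grouped = group_by_day(sample)
for day, msgs in grouped.items():
    print(render_day(day, msgs))
```

Widening the bucket from one day to several, as mentioned for analytics workflows, would only change how the bucket key is computed.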
This is not easy to work with and all this little text, when you have a five terabyte Slack workspace of just text like this, the computational challenge of analyzing it is incredible. Some of the pros – and we’re starting to get cut on time, but we will try to squeeze everything – the built-in capability, typically, will collect the entire organization. Also, keep in mind, that organizations can have multiple Slack workspaces. I see this get missed so many times where attorneys who are doing interviews or end clients don’t realize that you have many different Slack workspaces that can oftentimes appear as their own unique instance, and unless you have more of a super admin capability, you may not know they’re there. We also see situations where you could have 90% of the workspaces in an enterprise control, and then maybe you find two workspaces that are outside of the regular enterprise that might be on lower level licensure or a beta, a space like that. So, important to understand that Slack for enterprise, you can have multiple workspaces, and then within those, multiple private-public channels, all that you would expect when dealing with Slack.
Sorry, Mike, just a couple of things. When you’re talking about the Slack messages like that, there’s a couple of other factors you really have to consider and that has to do with the average lifespan of an email chain is five or six hours. When you start getting into Slack, the lifecycle of a chain can be days and weeks and months, so understanding that those conversations just continue to flow can be important.
But also, as you saw in the example that was on-screen a minute ago in the JSON, there was a message that was deleted at the request of the user who sent it. If the receiving person didn’t delete it, that message may still exist in their chain, but it may not exist in the source chain, and so you’ve got to understand that there can be multiple places you have to look for the same messages. You’ve got to look at the recipients, you’ve got to look at the senders, and all the people that were involved in that communication chain or part of the group.
Just looking at this JSON file, let’s just go back for a second, a great callout, John. Who is “U345442”? That’s a username. That lives in a completely separate table than what’s in this file, so you have many different files and tables that all link to each other to create something that is reviewable.
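The lookup described above, where a user ID in a message file has to be resolved against a separate table, can be sketched like this. It assumes a users table with `id`, `name`, and `real_name` keys, as Slack exports typically provide in a separate users file; the sample values and function names are hypothetical.

```python
import json

# Hypothetical contents of the separate users table.
users_json = '[{"id": "U345442", "name": "jdoe", "real_name": "Jane Doe"}]'


def build_user_lookup(users):
    """Map each user ID to the best available display name."""
    return {u["id"]: u.get("real_name") or u.get("name") or u["id"] for u in users}


def resolve_users(messages, lookup):
    """Return copies of messages with a human-readable `user_display` added.

    IDs missing from the table are flagged rather than silently dropped.
    """
    resolved = []
    for msg in messages:
        uid = msg.get("user", "")
        resolved.append({**msg, "user_display": lookup.get(uid, f"unresolved:{uid}")})
    return resolved


lookup = build_user_lookup(json.loads(users_json))
messages = [
    {"user": "U345442", "text": "see attached"},
    {"user": "U999999", "text": "who is this?"},
]
for msg in resolve_users(messages, lookup):
    print(f"{msg['user_display']}: {msg['text']}")
```

Flagging unresolved IDs explicitly matters for the custodian-level collections discussed later, where the users table may be incomplete.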
Just to kind of breeze through the built-in exports: the pro is that you get everything. Depending on your licensure, you will get all historical edits and deletions, which are not available through an OAuth acquisition method, which we’re going to talk about, which is more at a custodian level and can be very useful for certain cases. Really, all your attachment data is posted to a link, usually, and then you can access that, and that’s not really encrypted, it’s randomized, so that’s always something very interesting where you just have an organization’s data sitting outside in a randomized link from Slack, and that all then needs to be downloaded and then reintegrated and attached back to however you’re parsing and creating reviewable content.
So, we attach attachments every 24 hours; keep in mind, though, you’re not going to get links. More and more we’re developing tools that allow us to – if we see a G Suite link or an Office 365 link, we can scan your data, locate those, download those files and reattach them. I’m sharing my secrets, but that’s it. It’s kind of a new thing we’ve really tied up here as well. We’re seeing this so much more. People don’t actually drag and drop attachments anymore, they just share links, since everything is web-based. It’s something to be aware of, because some organizations are not attaching anything anymore, they’re all links, so your data collection can be extremely deficient and, as a result, so can your production.
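The link-scanning step mentioned above might start with something like the following. This is a minimal sketch, not the tooling the speakers use: the host list (`docs.google.com`, `drive.google.com`, `*.sharepoint.com`) is an assumed, incomplete set of G Suite and Office 365 link patterns, and actually downloading the linked files would need authenticated access that is out of scope here.

```python
import re

# Assumed, non-exhaustive hosts for shared cloud documents.
LINK_PATTERN = re.compile(
    r"https?://(?:docs\.google\.com|drive\.google\.com|[\w-]+\.sharepoint\.com)/[^\s<>|]+",
    re.IGNORECASE,
)


def find_cloud_links(messages):
    """Return every cloud-document link found in message text, with its message timestamp."""
    hits = []
    for msg in messages:
        for url in LINK_PATTERN.findall(msg.get("text", "")):
            hits.append({"ts": msg.get("ts"), "url": url})
    return hits


# Hypothetical messages; chat platforms often wrap links as <url|label>.
chat = [
    {"ts": "1", "text": "see <https://docs.google.com/document/d/abc123/edit|the draft>"},
    {"ts": "2", "text": "final is at https://contoso.sharepoint.com/sites/legal/file.docx thanks"},
    {"ts": "3", "text": "no links here"},
]

for hit in find_cloud_links(chat):
    print(hit["ts"], hit["url"])
```

The resulting list is what a follow-up download and reattachment step would work from.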
Export cons: the scope of exports varies by plan. All these different platforms have different licensure levels. It does collect the entire organization indiscriminately, so you’re getting the golden jewels of the organization out to a third party vendor sometimes, and some organizations may not like that. We, sometimes, will operate inside of an organization’s network to cull this data to meet their security requirements. Large instances can have extremely unwieldy export sizes. With the method we call OAuth collection, you can access our app via the unofficial Slack store and we can, basically, authenticate at a user level. We have a specific link for that user; it gives us access to the public channels that they can see and their private channels, and there’s a lot more than public and private channels, but basically that’s where they all fall. It also gives us their private and direct messages. It isn’t going to give you anything that’s been deleted, and that can be incredibly problematic, and sometimes it makes it incredibly difficult to resolve names of users for incredibly large collections, where users’ permission levels may have changed or we can’t discern who a user is because we only have one post from them and we can’t, oftentimes, access that API using an OAuth method. It is much more targeted and precise, so rather than going out and getting a 10-terabyte workspace, it might just be five custodians.
Where this becomes a process is tracking all the custodians and making sure they’re clicking on the right links and actually reading their emails. I try to schedule screen-shares with everybody to go through this as well, and that makes things a little bit easier. For Enterprise Grid Slack clients, we also have an eDiscovery API application, which allows us to do full-blown targeted collections and analysis of data points to see who is talking to who. But again, very expensive and most enterprises are not going to have that. This is often a very good method. We were the first organization I know of that actually had a Slack collection tool before Slack even offered the compliance capability.
We’ve talked about Teams. We already kind of went over it, I will skip over this. There is export capability under the compliance capability in Office 365. It’s the main competitor to Slack. Not as much integration with chatbots and things like that yet, but we will see more and more of this. Microsoft keeps changing their platform all the time where we see blends of Teams and also video chat are really the same platform. They’ve gotten rid of Skype or they are getting rid of it, Skype for Business. Now, you’re getting this team capability with really good chat capability and also really good video conference and collaboration capability, which is going after Zoom and all the other tools out there, like GoToMeeting.
We’re going to finish up with Internet of Things, which you don’t encounter as much in civil litigation, it just isn’t necessarily the case, although I have had cases where we want to see whether somebody used a badge to enter a room via a security system. More and more we’re dealing with video where we might need to identify a person’s face. There’s really good technology out there that’s not necessarily in our industry that we use for stuff like that, it’s more in the security industry now. We’re getting more into Minority Report.
The framework that John and I presented earlier is the most critical element when you deal with IoT, because you’re dealing with pieces of data that is stored everywhere. Really, data is usually not readily accessible. You have devices that need to be analyzed. There’s network forensics that, oftentimes – when you think about network forensics and how does an Alexa communicate with your lightbulbs. My Alexa just turned on; it thinks I’m talking to it. I’ve got to turn that off. Really looking at networking equipment, and then also the application services that IoT might be working with. It’s always a science project and it’s so important to apply a framework that creates that Daubert sense of repeatability, which is always the forensics standard, it’s the Daubert requirement, and it’s based on repeatability, and really having to document that now down to the version of everything you’re dealing with.
John, I don’t know if you have anything to add here as well.
Especially when you start talking about IoT devices, some devices store the data locally, some devices store some of the data locally, some store none of it locally, and then you have all the interstitial places, the network communications, the data storage that may be internal to the company on a server, the data storage that may be out on a cloud server, a service that’s connected to that device, and that may be data that’s stored outside your control, so understanding that IoT data – when you start talking about these IoT things, there’s no standard, there’s no, ‘this is how everybody should do it’, they’re all different, they all function differently and that data can be all over the place. It’s so many more devices than people realize in today’s day and age, because you’ve got video cameras, you’ve got the equipment that comes… your biometric systems that connect into the office, the coffee pots, the automation systems that turn the lights on in the office in the morning. There’s just an unbelievable array of things that can be relevant or not relevant to a matter. But figuring out what that data is, where it is, and if it’s relevant to a matter can be quite the challenge, and then figuring out where all of that data is stored and what you can or can’t get access to.
When you start getting into the new Internet of Things, the way they communicate may not necessarily be through a wire. They’re over-the-air. Where we’re about to see a massive explosion, which is going to be really the equivalent of when we went from having DOS computing to Windows GUI-based computing, is with 5G and augmented reality, which I don’t think anybody really is fully aware of how much that’s probably going to change our lives, going back to the Google Glass that came out many years ago and got taken off the market. There are companies that have been working on products that are going to transform the way we interact with our environment, some of the biggest private companies in Silicon Valley that nobody knows about that are working on this technology, and all of it really comes down to 5G capability and faster transfer rates and things like that.
We’re getting really close on time. We’ve got less than a minute left. I just want to thank you all for joining today. Reach out to John or I if you have any questions. I think our contact info is here. It’s firstname.lastname@example.org and email@example.com, and I will kick it over to Rob Robinson to close us out.
Real quick, I will try to jump in with the answers to – there were two questions. “How easy is it to link the attachments to the main export in Slack Exports?”
It’s not easy. It is doable. There are JSON IDs and then the ZIP files that have to all be brought back together. You’ve got to run through some parsing and things to make all that happen.
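As a rough sketch of the parsing described above, the core of relinking is matching the file IDs referenced in the message JSON against what was actually downloaded. The `files` list with an `id` per entry is an assumption about the export layout, and the `downloaded` mapping from file ID to local path is hypothetical; a real workflow would build it while unpacking the ZIP files.

```python
def link_attachments(messages, downloaded):
    """Pair each file referenced in a message with its downloaded copy.

    Returns (linked, missing): parent/child links that resolved, and
    (message ts, file id) pairs for files that were never retrieved.
    """
    linked, missing = [], []
    for msg in messages:
        for f in msg.get("files", []):
            path = downloaded.get(f["id"])
            if path is None:
                missing.append((msg["ts"], f["id"]))
            else:
                linked.append({"parent_ts": msg["ts"], "file_id": f["id"], "path": path})
    return linked, missing


# Hypothetical export messages and download results.
export_messages = [
    {"ts": "100.1", "text": "contract", "files": [{"id": "F1", "name": "contract.pdf"}]},
    {"ts": "100.2", "text": "two more", "files": [{"id": "F2"}, {"id": "F3"}]},
    {"ts": "100.3", "text": "no files"},
]
downloaded = {"F1": "files/F1/contract.pdf", "F2": "files/F2/notes.docx"}

linked, missing = link_attachments(export_messages, downloaded)
print(len(linked), "linked;", len(missing), "missing")
```

Keeping the `missing` list is what lets you document which attachments could not be brought back together with their parent messages.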
“Can you address how to access ephemeral systems or ephemeral messaging of opposing parties?”
That’s very difficult. Usually, the ephemeral systems are going to require some sort of access to the device or the system and usually require permissions, credentials to make those things happen. They can be very challenging. Not to say they can’t be done. There are different things, again—
You’re going to be bringing in a third-party expert to handle that. You’re going to be bringing in an independent third party expert, that’s the way you access those.
Thank you all. Thanks, John. Go ahead, Rob.
Thank you so much Mike and John. That’s excellent information and insight and we just want to thank everybody today for attending, especially The Masters Conference for their support. We know how valuable everybody’s time is and appreciate you sharing it with us. We also hope that next month, you will have an opportunity to attend our monthly webcast, it’s scheduled for 19 August and it will be on the topic of eDiscovery Case Law, and we will have some experts, technologists, and authorities share a little bit about significant eDiscovery cases in the first half of 2020, and any key case law developments, so we hope you can join us.
Again, thank you everybody for attending and we hope you have a great rest of the day and this concludes today’s webcast. Thank you.