[Webcast Transcript] Getting Things Done with GAI

Editor’s Note: During this webcast, industry leaders discussed using generative AI (GAI) in the legal and cybersecurity fields. The speakers, John Brewer, Anya Korolyov, Chris Wall, and Bernie Gabin, all experts in AI and data science, discussed the safe and responsible use of GAI, the information governance implications, and the potential privacy issues.

During the presentation, the speakers emphasized the importance of understanding how data is used, especially in AI services and ensuring that sensitive or personal information is not inadvertently exposed. They also discussed the legal implications of using GAI, including copyright and liability issues. The speakers stressed that while GAI is a powerful tool, users should not rely upon GAI entirely and should always validate the information it provides.

Access the presentation’s on-demand version to learn how GAI is transforming products and services for legal professionals.


Expert Panelists

+ John Brewer
Chief Artificial Intelligence Officer and Chief Data Scientist, HaystackID

+ Bernie Gabin, Ph.D.
Senior Data Scientist, HaystackID

+ Anya Korolyov
Vice President, Cyber Incident Response and Advanced Technologies Group, HaystackID

+ Christopher Wall
DPO and Special Counsel for Global Privacy and Forensics, HaystackID


Transcript

Moderator 
Hello everyone and welcome to today’s webinar. We have a great session lined up for you today. Before we get started, there are just a few general housekeeping points to cover. First and foremost, please use the online question tool to post any questions that you have and we will share them with our speakers. Second, if you experience any technical difficulties today, please use the same question tool and a member of our admin team will be on hand to support you. And finally, just to note, this session is being recorded and we’ll be sharing a copy of the recording with you via email in the coming days. So without further ado, I’d like to hand it over to our speakers to get us started.

John Brewer

Thank you very much. Hello and welcome to another HaystackID webcast. I hope you’ve been having a fantastic week. My name is John Brewer and I’ll be your expert moderator and lead for today’s presentation and discussion titled “Getting Things Done with Generative AI.” This webcast is part of HaystackID’s ongoing educational series designed to help you stay ahead of the curve in achieving your cybersecurity information governance and eDiscovery objectives. Today’s webcast is being recorded for future on-demand viewing. After today’s live presentations, we will make the recording and a complete presentation transcript available on the HaystackID website.

Our presenting experts, Dr. Bernard Gabin, Anya Korolyov, Chris Wall, and I have combined decades of experience exploring, understanding, and applying AI in data discovery and privacy compliance. Additionally, we’re all actively involved in current industry and company GAI initiatives, ranging from establishing industry standards for usage to developing applications that leverage GAI.

In our discussion today, we’ll be exploring how generative AI is transforming products and services for legal professionals. But first, a little background on our experts.

So beginning with me, I’m John Brewer. I’m the Chief Artificial Intelligence Officer at HaystackID. I’ve been with HaystackID in one capacity or another since 2015. My background is in computer science, specifically data and artificial intelligence. I’ve been working in that field since the late 1990s, and I’ve been heading up the AI program specifically at HaystackID since August of 2023 and the data science group since March of 2021. So moving on. Anya Korolyov, do you want to introduce yourself?

Anya Korolyov

Thank you, John. Hi everybody. I’m Anya Korolyov. I’ve been with HaystackID for about seven years and currently, I’m the Vice President of Cyber Incident Response at HaystackID. I’ve mostly been working in eDiscovery in the antitrust area up until moving to cyber incident response. I’m an attorney. I have some practice under my belt, and I’m also a Relativity Master with a technical background, which gives me quite a unique perspective on some of these topics.

John Brewer

Thank you very much, Anya. Chris, could you introduce yourself?

Chris Wall

Yeah, thanks, John. Hey, my name is Chris Wall and I’m HaystackID’s Data Protection Officer and In-House Counsel. I chair our Privacy Advisory practice. And my job at HaystackID is to help our clients and HaystackID navigate their way through the privacy and data protection thicket, whether that’s part of cyber investigations or information governance exercises or traditional discovery or anything that involves data, frankly. My job is to make sure that how we use data, how we store data, how we move data, or even how we get rid of data is done in a compliant way and in a defensible way.

John Brewer

Okay, thank you very much, Chris. And Bernie, why don’t you introduce yourself to our attendees?

Bernie Gabin

Hi, I’m Bernie Gabin. I’m probably the most junior person at HaystackID here. I’ve been with the company coming up on two years. My background actually originated in physics, but I’ve gone from that into brain-computer interfaces and then into AI development and had a somewhat interesting path getting here. I was brought into the company to help with our data science department and to help with our AI initiatives and moving forward as we develop new tools there.

John Brewer

Fantastic. Thank you very much, Bernie. So the first thing that I think we want to cover today is the safe uses of public generative AI. Now this is the part of generative AI that most of you would be familiar with from ChatGPT and other tools like that. Anything that’s freely available or available for a small subscription fee online. These are the technologies that tend to be making headlines in various news outlets whenever you hear about somebody getting action taken against them for having hallucinated case citations and things like that. What we want to discuss now is some actual safe uses that you can adopt with confidence.

Chris Wall

Practicing safe AI, John?

John Brewer

I’m sorry, Chris.

Chris Wall

How to practice safe AI, maybe that’s what we should call it for our discussion.

John Brewer

How to practice safe AI. There you go. So I’ll start because I actually use generative AI in my day-to-day work. Unsurprisingly, I write a lot of computer code in my job, and one of the tools I use is something called GitHub Copilot, not to be confused with Microsoft Copilot, which just gives me suggestions as I’m writing computer code to say, okay, I think that you’re about to write this, and gives me the opportunity to hit space to accept its recommendation or reject it. It’s that kind of very tight human-in-the-loop generative AI approach that we see. And sometimes you’ll see this being used in other applications, especially grammar tools or social media posts, anytime the window is running ahead of you and asking you whether or not you want to accept its changes. And that’s one of the safest applications that I’ve seen of generative AI so far, because it is very tightly looped: the user sees a small, comprehensible piece of text and the system asks, do you want to accept this or not? That is the part of generative AI in this kind of public domain that I’ve had the most experience with, and I think it’s a very safe way to experience and use generative AI. Does anybody else have an example they want to give of how they’ve used this in a safe and responsible way?

Chris Wall

Sure. Hey, look, I tend to look at the world through a privacy regulatory and compliance lens. So when I look at some of the potential safe uses for GAI, I think GAI has a lot of promise in the privacy context. And that may seem a little counterintuitive, I guess, because GAI, and AI in general, is often billed as representing an Orwellian privacy apocalypse. And in the public sphere anyway, there’s a very healthy, rightfully so I think, a very healthy tension between GAI and data protection. But it doesn’t have to be that way, and rather than an apocalypse, I think GAI can in a lot of ways be our salvation.

For example, when we look at the greatest risk to personal privacy, to our personal information, any place really in the real world, that risk tends to come during the transfer or movement or sharing of personal information where access is given to one entity or moved from one location to another. And at each stop along the way, maybe the data is used in a different way and a lot of times the personal information contained in that collection of data is less important than the aggregation of that data overall. And that happens a lot.

Think about marketing metrics or anything like that: GAI can be a fantastic tool to help anonymize or de-identify sensitive personal data, your data or my data. And that de-identification piece ensures that no one who’s looking from a macro or from a micro perspective can pick out any individual or that individual’s personal information from that data and do anything bad with it. So from a research standpoint, that means organizations can share data sets for analytics and useful metrics without any risk. Well, a lot less risk to an individual’s privacy. But that’s not all, because when we look at sharing personal information for research, we just have to look at some emerging techniques such as secure multi-party computation, SMPC, which takes that anonymization function to another level.
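Editor’s Note: To make the de-identification point concrete, here is a minimal, hypothetical sketch of rule-based redaction applied to a record before it is shared or pasted into a GAI service. The patterns and the sample record are invented for illustration; production pseudonymization is far more thorough and typically pairs patterns like these with named-entity models (or GAI itself) to catch identifiers, such as names, that simple rules miss.

```python
import re

# Illustrative patterns only; a real pipeline would cover many more identifier types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def de_identify(text: str) -> str:
    """Replace common personal identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Contact Jane Doe at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."
print(de_identify(record))
# Contact Jane Doe at [EMAIL] or [PHONE], SSN [SSN].
```

Note that the name survives the pattern pass, which is exactly the gap that model-based de-identification is meant to close.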

Look, SMPC is a cryptographic technique that’s used to enable multiple parties to jointly analyze, and maybe we’re going to talk about this a little bit, Bernie, I hope, to jointly analyze all of that data, and it might contain personal information from multiple private individuals. They take data points like name, address, gender, and political affiliation while keeping all of the private information secret or private. So it presents a huge amount of opportunity for us: contrary to popular belief, as opposed to destroying personal privacy, it can really help us augment it.
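Editor’s Note: Secure multi-party computation covers a family of protocols; the sketch below shows one of the simplest building blocks, additive secret sharing, in which several parties learn a joint total without any of them revealing its own figure. The organization names and numbers are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split a private value into n random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three organizations each hold a private figure they will not disclose.
private_values = {"org_a": 42, "org_b": 17, "org_c": 90}
n = len(private_values)

# Each organization splits its value and hands one share to every party,
# so no single party ever sees anyone else's raw number.
shares_by_party = [[] for _ in range(n)]
for value in private_values.values():
    for party, share in enumerate(make_shares(value, n)):
        shares_by_party[party].append(share)

# Each party publishes only the sum of the shares it holds...
partial_sums = [sum(shares) % PRIME for shares in shares_by_party]

# ...and adding the partial sums reveals the joint total, and nothing more.
print(sum(partial_sums) % PRIME)  # 149, with no individual value exposed
```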

John Brewer

No, I think that’s a great insight, and I know that we’re going to be discussing a little bit more about the public-private divide in AI in a few minutes. Bernie, do you have any examples you wanted to call out of good uses of public generative AI?

Bernie Gabin

Well, similar to what you were saying with the Copilot stuff, there are common uses. I’m sure you guys have seen ads for things like Grammarly, which are supposed to help you complete emails or write text. Again, it’s a very human-in-the-loop, closed-circuit type of tool, basically Copilot except instead of code you’re writing emails. There are also some tools that have been seen around, and this goes to the RAG stuff I’ll be talking about later, which help summarize emails or summarize websites. Those tools can be very helpful for getting the gist of something quickly, but you have to be careful with them because you need to verify and validate. When you have a human-in-the-loop like that, to what Chris was saying, you need to be sure that you’re checking; don’t take anything the AI gives you just at face value. It may make a great suggestion, it may make a really dumb one, so you have to keep your eyes open.

John Brewer

Yeah, absolutely. I think that makes a lot of sense and reiterates the point that we’re making there. So let’s move on to the next part that we were discussing here, because I know that Chris had a great lead-in there. Specifically, we want to talk about how to access public and private generative AI and what the information governance implications are.

Chris Wall

Yeah, John, if you get me started talking about privacy, you may not turn me off here.

John Brewer

You have about seven minutes.

Chris Wall

I’ll take that long, I promise. So I mentioned before there is a very real tension, at least in the public perception, between GAI and our rights to determine how our personal information is used. And that’s especially true in the legal industry. And I’m a lawyer, and for me, I see GAI simply as the next innovation in the legal space. And I’m old enough to remember the big move from putting tape recordings into the dictation pool, moving to speech-to-text software, and I remember the move from WordPerfect 5.1 to Word, not to mention the innovations to discovery, like moving from the review of literal bankers’ boxes full of paper photocopies to doing that review in about a third of the time using technology-assisted review and structured data analytics. So GAI, I think, is just another tool, but using GAI in the legal workplace, or any workplace for that matter, especially in these early phases of its development, can be really, really hard.

And I think we just have to look at the recent realization earlier this month, for example, that your privacy may not be all that protected when we’re using some of the AI tools out there. And I mentioned specifically Microsoft Azure, and the publicity around that is a good illustration because it’s so public. So Azure, of course, is Microsoft’s cloud platform, and Microsoft is heavily invested in OpenAI, upon which ChatGPT is based, and Microsoft has incorporated OpenAI into Azure. Maybe you’ve heard their marketing tagline, “Build your own copilot and generative AI applications,” which is what a lot of people do when they use Azure. As a matter of fact, almost 56% of organizations worldwide use Azure for their cloud services, and that includes about 85% of Fortune 500 companies. So that’s pretty wide acceptance out there. You can probably understand the concern with the realization that under standard Microsoft Azure terms, Microsoft may have the right to review your Azure OpenAI prompts or the AI responses to those prompts. So obviously that presents privacy and confidentiality issues.

What makes it a little bit more confusing is that if you scour Microsoft Azure’s marketing materials and website, their documentation makes it clear that the info that you input, your GAI prompts, and the responses that you get from the GAI are not available outside of your organization and are not used to train OpenAI models. I mean, that’s a relief, right? In other words, your Azure environment is a walled garden of sorts. Your roses, your orchids, your specialized breeds of flowers within that garden are yours and yours alone, except when they’re not. Microsoft has an abuse monitoring loophole in their user agreement in case they think that you’re growing maybe a special kind of weed in your garden, for instance. So under that loophole, Microsoft will keep your prompts and your responses for a month, and if they think there’s potential abuse, Microsoft’s human reviewers can access those prompts and those responses.

So of course big Microsoft clients, those big customers, are able to opt out of that oversight loophole. But if you’re not a big player, then you really, really, really need to be mindful of what you input as a prompt because not everybody can opt out. Personal information, client information, health information, grades, financial information, and names are all potentially reviewable. So interestingly, Microsoft Copilot for Office 365, which I think we’re going to talk about here, is automatically opted out, and Office 365 data isn’t collected or stored by OpenAI. So Microsoft itself has opted out.

So John, sorry for that long lead-up in the long cautionary tale, but I’d say there are three takeaways from that little Microsoft revelation. First, you need to know how the entity that is hosting or has access to your data is using your data, especially if they’re providing AI services. So interestingly, just on Monday of this week, I received an inquiry from one of our education clients at HaystackID and they asked us to confirm that HaystackID’s AI does not use their data to train or to inform our AI models, and that access to their data is restricted to just their case team.

And it was refreshing for me, look, I’m the privacy guy, to see that kind of awareness, late as it may be, and maybe it was prompted by this Microsoft issue, but it was even more refreshing for me to be able to simply point to our data processing agreement and the explicit language that we have in there. And that’s language that does not include any Azure-esque loopholes. So we only process or apply or access your data, our client’s data, including prompts and responses, at our client’s direction. No one else has access, and you should expect the same of any provider, frankly, whether they’re applying AI or not, but especially if they are.

So that leads me to point two. The second point is that if your GAI vendor has not articulated this in your agreement, make sure that it’s in there. You want to make sure that any prompts that contain sensitive or personal information don’t inadvertently expose confidential data or personal data to third parties or compromise the security of individuals or your organization. So you have to be mindful, and if you have the opportunity to opt out, unless you have a good reason not to, you should.

So then, finally, the third point here. If you cannot opt out, maybe because you’re too small to be able to opt out of Azure’s GAI, for instance, be mindful of what you include in your prompts and consider that your walled garden, to go back to that metaphor, is fenced in by chain link. So keep in mind everything that you input when you’re using GAI. I think that’s a good rule of thumb anyway, but especially if you’re not aware of, or you’re not able to opt out of, your vendor’s use of that data elsewhere.

John Brewer

Yeah, absolutely.

Chris Wall

Within my seven minutes there, John? I hope I stayed within my seven.

John Brewer

Just about dead on actually. So yeah, one thing that I will throw out there is that a lot of the agreements that we’re seeing out there with the big models, the Microsofts, the Googles, the Amazons of the world, do say explicitly that they will not use your input to train their models. However, the issue that cropped up with Microsoft last week in the news demonstrates that that’s not really good enough, that even if they’re not using it to train their models, any human review, at least in a legal context, is potentially a major concern.

Chris Wall

It violates that confidentiality that you expect, especially in the legal arena, but elsewhere in the workplace as well. You should have an expectation of privacy and confidentiality for the data that you are paying Microsoft, or Google, or whoever you’re using there, to host or process for you.

John Brewer

And I will say, and I think Bernie can back me up on this, Chris is absolutely right. HaystackID has never used anybody’s data for training or other AI purposes without getting explicit and very unsubtle clearance to do so. And if any organization pressures you and says, oh, well we can’t do this without being able to train on your data, there are ways. It isn’t particularly easy, but you can work around it and you should absolutely expect anybody who’s taking this technology-

Chris Wall

Absolutely. If there are some follow-up questions about why not and if not, maybe look elsewhere for your provider.

John Brewer

Yeah, absolutely.

Bernie Gabin

I’ll also stress at this point that especially if you’re using a publicly accessible system or playing around with it, remember: if the product is free, you are the product. A lot of these places have free front ends because you’re beta testing it for them. They’re harvesting your inputs and they’re harvesting your outputs because that’s how they’re training, that’s how they’re growing. The biggest LLMs are all from various companies that have been scraping the internet for people’s content for years, and if it looks like it’s free, it’s not; you’re paying for it with your data. So always be careful about that and always think about it before you type anything into that prompt.

John Brewer

I think that’s a great point. I think we should expand on that because, Bernie, one of the questions that we’ve been getting a lot is for an explanation of how to use models, especially publicly accessible systems that people actually have access to, to do effective things and to avoid hallucinations. I was hoping that you could talk a little bit about retrieval augmented generation, RAG, which is the buzzword that people have been hearing but don’t necessarily have a clear understanding of, both how it’s used in an automated fashion and how you can use it in your day-to-day operations.

Bernie Gabin

So yeah, RAG is retrieval augmented generation. Sometimes I’ve seen it as retrieval-assisted generation, but I believe the industry is standardizing on augmented instead of assisted, and it is probably one of the most common under-the-hood techniques that a lot of these things implement. The standard, basic way that you can think of most of these systems working is the top three bubbles there. You type your prompt into something like ChatGPT, you send it off, it goes and processes it in the LLM, and then returns a response. RAG adds the steps that are below there. Specifically, you’ll type in a prompt. It will send that off to whatever RAG system is running, which will query some sort of data store, which will add context to your prompt before sending it to the LLM.

A very simple example of this, I’m sure you guys have seen ads for things that will summarize websites for you, summarize 10 websites, or something like that. What they’re really doing under the hood is, you’ll ask some question, tell me about, I don’t know, what happened in the news today, and you’ll have 10 tabs open. It will scan those 10 tabs, add them to the prompt, and then send it off for the result to be sent back.
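Editor’s Note: Below is a minimal, hypothetical sketch of the retrieve-then-augment step Bernie describes: find relevant text, prepend it to the user’s question, and hand the combined prompt to the model. The in-memory data store, the keyword scoring, and the call_llm() placeholder are assumptions for illustration, not any particular product’s implementation.

```python
def retrieve(question: str, data_store: dict[str, str], top_k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval over a small in-memory data store."""
    q_words = set(question.lower().split())
    ranked = sorted(
        data_store.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in ranked[:top_k]]

def build_augmented_prompt(question: str, passages: list[str]) -> str:
    """Wrap the retrieved passages around the user's question."""
    context = "\n\n".join(f"[Source {i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the sources below, and cite the "
        f"source number you relied on.\n\n{context}\n\nQuestion: {question}"
    )

# Hypothetical data store; in a real system this would be a search index,
# a vector database, or the ten open browser tabs in Bernie's example.
data_store = {
    "doc1": "The morning news covered a new space telescope image release.",
    "doc2": "A local bakery won an award for its blueberry pie recipe.",
}

question = "What was in the news today about the space telescope?"
prompt = build_augmented_prompt(question, retrieve(question, data_store))
# response = call_llm(prompt)  # call_llm() stands in for whatever LLM client you use
print(prompt)
```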

Now if you go to the next slide, actually. We’ll summarize it there. You can think of LLMs as having a layered memory system. They have long-term memory. But before I continue, I want to stress something. AIs are not people, AIs are not biological, AIs do not think like us. However, it is useful to use human thought processes as a mnemonic and as a way of thinking about it, because it puts things in context and that’s easier for people to get. So in terms of long-term memory for AIs, when these LLMs are trained, they have a huge amount of data that’s baked into them, and this information can be retrieved directly. That’s that standard flow: just throw the prompt in and it will return an answer. You can think about this as the stuff you learned back in school. I’m pretty sure if I asked anybody on this call, what is pi, they would be able to give me an answer. If I ask you, where did you learn that? You probably wouldn’t be able to tell me, because it’s something you picked up at some point. The LLM can have a lot of information that’s baked into it but is not necessarily associated with a specific source or with specific origins.

Chris Wall

Can I ask one of the questions that somebody’s put in the chat there, which is: if we don’t have citations and we don’t know where certain of the training materials for our LLM come from, how can we trust its output? And maybe you can cover that as you go along here.

Bernie Gabin

Well, yeah, that’s part of what RAG directly addresses. Let me get to that in one second. I’m building up to it. The short-term memory is the context window. It’s the thing you are currently talking with the LLM about. One of the things that people find freaky about them is that they can have conversations with you where they will reference things you said before or be able to carry on a thread for multiple questions and answers back and forth. That’s the short-term memory. That’s the current conversation you’re having and all of that is being fed into the context window that the LLM has. Basically it’s feeding itself the conversation and remembering the conversation as it goes. RAG tries to add a medium layer to this, a medium memory where it is information that you are adding in. It’s the stuff you read in the paper this morning or the stuff you saw on TV yesterday that you can add to the conversation so that you have context for the things we’re talking about, but it’s not stuff that we mentioned immediately. And the way it does this is by augmenting the prompt. So actually, can we go back one slide real quick?
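Editor’s Note: A toy sketch of the “short-term memory” described above, under the assumption that the client simply replays the running conversation to the model on every turn; ask_llm() is a placeholder rather than a real API, and a real client would also trim the history once it outgrew the context window.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a real model call; echoes the tail of the prompt."""
    return f"<model reply given context ending with: {prompt[-60:]!r}>"

history: list[str] = []

def chat(user_message: str) -> str:
    history.append(f"User: {user_message}")
    # The whole transcript so far becomes this turn's context window.
    prompt = "\n".join(history) + "\nAssistant:"
    reply = ask_llm(prompt)
    history.append(f"Assistant: {reply}")
    return reply

chat("My flight lands in Geneva on Friday.")
print(chat("Which city did I say I was flying to?"))  # the earlier turn rides along in the prompt
```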

John Brewer

Back, yeah.

Bernie Gabin

Yeah. Just back to my little graphic. Again, using this example to belabor the point, you put the prompt in and it can go search or something. So in the case of pi, and this is going back to, I don’t know the citation for my information, if I put in what is pi, it could go off and search a data store. Now, the data store could be something like the internet. You could just Google it. It could be a curated data set, or it could be a private data set, which obviously is very applicable to a lot of stuff we do. It gets responses to that, adds them to the prompt, adds the context for the question I’m asking, allows the LLM to process it, and then presents a response. It also allows the LLM to cite where in that data store it got the information by directly going to, oh, okay, according to this source, pi was invented by Pythagoras or whatever. That being said, if your data store does not have the right answer, it cannot provide you with the right answer, an important point. I saw this question come up earlier: use your head. AI is not a replacement for human intelligence. You have to think about the answers you’re being given. You have to think about the stuff that it is providing you. If it smells like a fish and it looks like a fish, it’s probably a fish, so make sure you’re checking what’s going on there.

Chris Wall

Yeah, so Bernie, to your point, if I get a response about who invented pi and I get this response, maybe it’s Pythagoras and the history behind the development of pi, and I’m actually looking for the best blueberry pie or cherry pie recipe out there, I’m going to have to use my head to recognize that, hey, that’s probably not the correct answer.

Bernie Gabin

I’ve heard that Pythagoras was really into rhubarb, but that’s not really what you’re going for here. So yeah, you want to make sure that you’re thinking through this stuff if something comes back and you’re not sure about it. I think somebody said something about copyrighted material; again, be careful about that. It is very possible, we know, that some of these LLMs have copyrighted material baked into them, and that goes to that long-term memory that’s baked in, and it is reproducing this stuff without even knowing where it is coming from. This also goes to your data store. So from a privacy perspective as well as from an ethics perspective, you have to make sure the stuff that you are providing for the data store is okay to be feeding into the LLM. If you don’t want copyrighted material, do not put copyrighted material into your data store, because that is something that will be able to be reproduced.

If you do not want this information being shared with whoever is hosting said LLM, be it Azure or Google or whoever, do not put that into your data store. Now, generally for us in the public who are using these, we don’t have access under the hood. We’re hiring Grammarly or one of these other companies to do this for us. Read the fine print, and take a look at their privacy documentation. You can get information from them about what they’re doing. If you’re using this in any company or corporate context, make sure it lines up with what your company is comfortable doing. And if you’re providing your own information, make sure you look at the privacy statements about how this is going to be added. So can we go to the next slide?

We touched on a few of these things already, but there are four big issues with LLMs that come up a lot, and RAG helps address all of them, which is why it’s proliferated to the point it has.

Domain knowledge. This is if you’re asking very specific questions about niche topics or if you’re trying to query a specific data set. Can you please summarize all my emails from this morning? Can you tell me which email had the thing? It’ll have no idea. That’s not baked into the long-term memory of the LLM. It’s not in the model. It has no way of referencing that. Chris mentioned citations. It’s very important that we be able to cite where our information came from. You can’t necessarily do that if you’re just going straight to the base LLM.

There’s also timeliness. It takes time to train these models. I think GPT-3 is up to 2022. Anything that happened after that, it will not be able to reference. And then, of course, hallucinations, which, right, we’ve all heard horror stories about. The biggest issue with hallucinations is that it’s very difficult to get an LLM or an AI system to just say, I don’t know. They are designed to replicate patterns, and they will come up with something that sounds very convincing as your answer. For example, Pythagoras liking rhubarb. The way that RAG addresses all of these things is you’re literally adding domain knowledge. By selecting or by providing the data store, you can add the information on the topic that you want to query about.

John Brewer

This is pretty abstract. Do we have an example that we could give to [get into the specifics]?

Bernie Gabin

If you go to the next slide, I did provide an example. For the example that I did, I was just using ChatGPT because it’s readily accessible; anybody can try this. I wanted to know about the James Webb Space Telescope, and this goes specifically to the issue of timeliness of the information, where it’s been trained up to 2022. JWST has done a lot of work in the interim years that it cannot reproduce. So when I ask it, can you tell me about what JWST has done? Anything after the date its training finished will not be in there. I then conducted a manual RAG by going and finding some articles about JWST. In this case, I found five short articles about various discoveries it had made. I threw them in, as you can see in my little model on the side there. Here are some relevant articles. Put the text of the articles in and then say, can you please now tell me about JWST? I know the font is small here, but if you read the new summary, it’s much more up-to-date, it’s much more on-topic, and it allows me to directly access that stuff. I could then, for this example, say, could you cite that? Where did you find out about this specific thing? And it could go back and reference the specific article that I provided that had that information.

This is the basic principle of RAG. Now, again, it’s usually automated and there’s usually something that they’re providing, some special sauce about how they’re accessing the data, how the search is being conducted, how the prompt is being engineered to access all that information. But in a bare-bones model, this is the simple way that a lot of these things work under the hood.

John Brewer

So essentially, instead of you going off and grabbing some articles and saying, “Hey, please use these as your knowledge base to tell me about this subject,” you’d have some application sitting underneath that goes off and queries the database, or something?

Bernie Gabin

Exactly, yeah. And if it’s a publicly open thing, it could literally just go to the internet and pull up information. If it’s a private or closed system, it could access a specific database of my own information or my company’s information. And yeah, that’s basically how it’s done. An example you see of this is help chatbots. If you’re on a website and it says, I’ve got a little chatbot that will help you navigate our help text, it’s referencing their corpus of help documents as it goes through and tries to give you pointers on how to fix whatever your problem is.

An obvious concern that should be pointed out is that there are non-technical, non-legal, but ethical concerns about just searching the internet and having an AI regurgitate stuff to you. It’s a bit outside the scope of what we’re talking about right here, but that is something to consider with certain of these products if it is just taking… In this case, I grabbed a bunch of articles that were freely available online. However, there are concerns about that. So depending on how you want to use this or how you want to use the output, you have to be aware of it and, again, think about what you’re doing. Don’t just blindly take the output and run with it. Make sure that you’re validating. Make sure that it follows whatever guidelines you or your company need to use to make sure you can use that output.
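Editor’s Note: A hypothetical sketch of the “application sitting underneath” a help chatbot: given a user’s question, it scores every document in a private corpus and returns the best matches to paste into the LLM prompt as context. Real systems typically use embedding vectors and a vector database; cosine similarity over simple word counts and an invented help corpus stand in here.

```python
import math
import re
from collections import Counter

help_corpus = {  # hypothetical internal help documents
    "reset-password": "To reset your password, open Settings and choose Security.",
    "export-report": "Reports can be exported to CSV from the Reports dashboard.",
    "invite-user": "Admins can invite a new user from the Team page.",
}

def vectorize(text: str) -> Counter:
    """Bag-of-words vector; real systems would use learned embeddings instead."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_matches(question: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    q_vec = vectorize(question)
    ranked = sorted(corpus.items(), key=lambda kv: cosine(q_vec, vectorize(kv[1])), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

print(top_matches("How do I reset my password?", help_corpus))
# ['reset-password', ...] -- these documents become the context the chatbot hands to the LLM
```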

John Brewer

So if you were doing this by hand with, for instance, ChatGPT, perhaps don’t copy and paste a confidential company document into ChatGPT and ask it to summarize it.

Bernie Gabin

Yes.

Chris Wall

Nobody would ever do that.

John Brewer

I’m actually curious if we have any thoughts on privacy and generative AI from Anya and/or Chris. Anya, we haven’t heard from you very much yet. Why don’t you give us your thoughts on privacy in the generative AI sphere?

Anya Korolyov

Well, it’s a two-way street. On the one hand, the more information you have in more places, coming from somebody who deals with incident responses every day, the higher the risk for that information to get out there. So as exciting as it is to use it and to build something that you can reuse for different purposes, again, you’re creating more footprints with that information. And from the other perspective, to what you were just saying, John, you have to be very careful what you put in, beyond just copyright law or whatever it is that you were inputting into the model. You have to make sure that, A, you actually do have the right to put that information into the model, and that your customers, your third parties, whoever you are dealing with, have given consent for that information to be used.

Chris Wall

So I want to come back…Go ahead, Anya.

Anya Korolyov

No, go ahead Chris. Go ahead.

Chris Wall

I was just going to piggyback on what you said about what the model’s comprised of, because there’s what the model is made up of, how it’s trained, and then there’s how it’s used, because many on here are probably familiar with the EU AI Act, for instance, which recently took effect.

John Brewer

You want to tell us for people in the audience, just very quickly, Chris.

Chris Wall

Yeah, so the EU AI Act, it’s law now in the EU about how AI can be used. That’s effectively what it is. It’s fully anticipated to have what’s called the Brussels effect, where there will be follow-on legislation around the world, maybe we’ll get it here in the US at some point, that regulates how AI can be used. And look, well before the AI Act, even under the GDPR, both the EU and the UK GDPRs, you couldn’t make decisions based solely on automated processing like AI, including profiling, if that decision has legal effects on a person or if the decision significantly affects, and that’s a very, very broad and vague term, those individuals. But there are some very practical ways that AI can help us use AI and comply with laws like the AI Act more wisely and more ethically, frankly. It can help us develop a framework. It can help us identify bias in those LLMs. It’s asking AI to police itself in some ways, but that’s using the tools in a smart way.

Biometrics is a big deal these days, and we can use GAI to secure authentication of fingerprints, retinal scans, palm prints, gait, and body movements in general through federated learning or secure enclaves, so that we distribute those individual characteristics and nobody has access to them. That makes that authentication far more secure, both in its use and in its development. Sorry, I cut you off there, Anya. I didn’t mean to cut you off.

Anya Korolyov

No, no, you summarized what I was going to say, and you actually segued me into the next point of discussion, which is, just outside of privacy, what are the courts and the laws doing with this?

John Brewer

Did you want to talk a little bit about recent rulings or decisions?

Anya Korolyov

Yeah, and there aren’t that many recent rulings and decisions. Again, it’s very much a developing area. Chris just mentioned there are not that many such laws. It’s mostly regulations and guidance and industry standards at the moment that we’re dealing with. And just like the public, the courts have that same tension and perception about AI. It’s something new, it’s something they haven’t seen. It’s an unknown, and they have to fit it into existing laws. So the areas where they can fit it into existing laws are, again, regulatory. But in the U.S., we don’t have as much. Hopefully, we will soon. We have the ethics that Bernie alluded to a little bit earlier: are we creating bias? Is it transparent what the process is and what decisions are being made in these models, and is the prompt creating bias in some way?

The other big area, of course, is privacy, which we talked about quite a bit. And with that comes liability too. Who is responsible for any harm that might come out of the use of AI, whether it’s to an individual or a group of individuals? It’s a gray area at the moment. The most obvious one, and I think Bernie also touched on it a little bit, is intellectual property. Who owns the rights to the output? Does the user even know what was put into it and what the copyrights were? To go back to the use of our pie recipe, if I asked for a blueberry pie and a rhubarb pie and then I said, well, combine them because I want to be the first person to create an original pie, does my prompt give me the copyright because I provided the correct prompt that gave me something unique? Or am I using something that somebody already owns the rights to, and I’m just mish-mashing it and I have no rights to it?

There are some recent court decisions, and they mostly come down on the negative side of the use of AI. The harm that it’s causing or the privacy that it’s encroaching on is not being looked upon very favorably. But I do think that we’re at the very beginning of the developing stages of this as far as the public perception and the court perception go, and I do think those laws will change, and I do think that laws will come around that will regulate this a little bit more.

Chris Wall

So Anya, interestingly, the U.S. Copyright Office has said that where AI assisted with the creation, the work can be copyrighted, but works that were wholly created by AI cannot be copyrighted. I think I got that right. So that’s not a lot of guidance. Where do you draw that line?

Anya Korolyov

Yeah, it’s…

Chris Wall

Absolutely. Intellectual property space is ripe for jurisprudence to develop in a lot of different ways.

Anya Korolyov

In that instance, it’s a very exciting time to be in the jurisprudence and close to creating the regulations and the laws. But it’s also a scary place to be.

Chris Wall

Can I harken back to something that Bernie pointed out before? And look, if there’s one takeaway from this webinar that I hope everybody gets, it’s that you can’t 100% rely on generative AI. A lot of you may have heard about that case in New York, Avianca. That’s a 2022 case where this guy sued the airline Avianca alleging that a cart struck his knee during an international flight. And during the motion practice, Mr. Mata’s lawyers set off this cautionary chain of events. After they submitted their pleading, Avianca’s attorneys went and tried to look at all the citations in that pleading and they couldn’t find them; they just weren’t there. So they went to the court, and the court looked for those cited cases, and the court couldn’t find them. And they were probably so hard to find because Mr. Mata’s attorneys used ChatGPT to draft the pleading and ChatGPT made them up. So the lawyers and their firm were sanctioned, obviously. And lest we think that that case out of the Southern District of New York is an outlier, you’ve got Park v. Kim out of the Second Circuit, and there the plaintiff’s attorney used ChatGPT again to craft an appeal and submitted at least one, if not more, non-existent cases to support her argument. So the attorney was referred to the grievance panel and sanctioned, obviously.

And then there’s another case out of Colorado where a lawyer used ChatGPT to find case law to support his position, but it was, again, pure fiction from ChatGPT, and that lawyer was suspended. So look, the opportunity is there to use it, and I think it solves that blank page problem that we might have as lawyers. Come on, I don’t know where to start with this pleading, or I don’t know where to start with this document or this piece of work. But you can’t rely on it entirely. And in the courts, if I can just take one more minute here, in the courts, unsurprisingly, there was a knee-jerk reaction from the bench, with some courts moving to bar the use of generative AI altogether in the court.

We have a judge in the Northern District of Texas who requires counsel to file a certificate that attests that no portion of any filing that they’ve put before the court has used AI, whether it’s ChatGPT or Harvey or Bard, to generate any part of their pleading. There’s a judge in the Northern District of Illinois who requires you to declare or disclose what AI you used and for what portions of your pleading it was used. I just don’t know how that’s going to work going forward. I mentioned before, I remember when we used tapes for dictation to generate our documents. We’re in the nascent stages of GAI today, so I don’t know how courts are going to enforce it going forward, any more than we could bar or enforce the use of word processing programs or speech-to-text software or even the use of first-year associates to draft your documents. I don’t know. I don’t know how courts can govern that. It’s simply another tool, and we have to carefully monitor how we use it.

Anya Korolyov

In the legal community, the defensibility concept is not going away just because we have a new tool to play with and to help us. Everything we do still has to be defensible and has to be within reason, not just reliance on some magical program that will answer all of our world problems.

John Brewer

Unquestionably. And I think it’s important to remember that AI of any flavor is a tool. And ultimately the person using that tool is responsible for the output of that tool, whether it’s a hammer or a large language model. But yeah, I think that Chris, you actually bring up a great point that one of the things that is going to have to happen in the legal space especially is that we’re going to need more vocabulary than we currently have to discuss the differences between creating documents wholesale using generative AI and having it just predict the next word or correct your grammar. Because underneath the hood, those are both generative AI applications that we’re seeing in production today, and now that Microsoft Copilot is getting out there and is in Office in a large number of organizations, it’s actually going to become very difficult, I think, in the next months and years to say conclusively, I can certify that at no point did generative AI come in contact with this document.

Bernie Gabin

Becoming a lot like spell check. It’s just going to be integrated into what you’re doing and you won’t even necessarily think about it.

Chris Wall

You’re right, Bernie. I think that’s what Microsoft envisions by incorporating it into Azure: to make it as commonplace and used on an everyday basis as possible. Just like all of us think about spell check and predictive text when we use our mobile devices, it’s second nature for us to use those tools. That’s where we’re going with this. Go ahead, John.

John Brewer

For our last few minutes, I just want to field the questions that are still in the queue here, because I think some of them are going to have some discussion associated with them. Let’s see. We have a question that is, “Can you use the ChatGPT team capability for sensitive data since it does not train the LLM?” Bernie, you have a strong opinion on that. Can you clarify?

Bernie Gabin

Unless you have a specific contract and an internal deal with whichever company you are providing it to, be it GPT or whatever, do not be sticking sensitive data in there… Because even though it’s not part of the training set and you’re not training the long-term memory, that data is still getting sent to them; you still don’t know who’s reviewing it, you don’t know who has access to it. Do not put it in. It may not be baked into the current LLM, but it may be baked into Version 2. So do not put sensitive data in unless you have a specific contract or knowledge of how that data is going to be used and where that data is going.

Chris Wall

How do you think Bernie, you got that in writing?

John Brewer

RAG can do [that]. The follow-up question is, should you?

Bernie Gabin

Should you?

John Brewer

If you could stick to just the technical capabilities there. Can you?

Chris Wall

Keep in mind that probably the apt definition of the cloud, which is where all this AI sits, well, most of the AI sits, is that it’s somebody else’s computer. How wise is it for you to take confidential or sensitive information and put it onto somebody else’s computer? Now, I recognize we use the cloud for a lot of sensitive information. I’m not saying don’t do that, but I think you have to be extra careful when you’re literally putting sensitive information, confidential information, into the cloud, unless, to Bernie’s point, you know exactly what controls are in place, what security is in place, and what walls are around that garden where you just planted that sensitive personal information, or sensitive information overall.

John Brewer

I think we actually have a question here that digs into the nature of PII. So I’d be interested to hear what people think about this. “What is the difference in privacy if I Google a name versus asking ChatGPT about it?”

Chris Wall

Depends on your jurisdiction, I think. So in some jurisdictions…

John Brewer

The jurisdiction that the system is in?

Chris Wall

Yeah, so what is the difference in privacy? So the reason I mentioned different jurisdictions is some places, some jurisdictions, you have a privacy right even if your personal information is in the public domain. In other jurisdictions, you don’t have that same privacy right, so if your private information is out there in the public, your privacy rights are lessened to some degree.

John Brewer

I think the specific, oh, okay, we got the clarification. The jurisdiction is Switzerland.

Chris Wall

All right, well I was admitted to practice in Switzerland.

John Brewer

Talk about rights.

Chris Wall

Thanks, Joel. Thanks, Joel.

John Brewer

I was going to say, hold on. I need to get my chart of how the EU’s overlapping.

Bernie Gabin

Overlaps. Oh, boy.

Chris Wall

Switzerland’s not part of the EU, but we’ll go with Switzerland. What is the difference in privacy if I Google a name, in other words, you take Chris Wall and Google my name or you ask ChatGPT about the same name.

Bernie Gabin

Yeah.

Chris Wall

Well look in theory, well, I don’t know what ChatGPT, I don’t know what OpenAI is drawing from in their LLM and what Google is drawing from, but they’re going to be drawing from different data sets, I would guess.

John Brewer

I think from my perspective, the big distinction that you’re going to see is that ChatGPT will answer you and it will tell you about a Chris Wall even if it has to fabricate one.

Chris Wall

That’s right. He’s the country music singer, right?

John Brewer

Yes.

Chris Wall

It’s not me, by the way.

John Brewer

I remember, I think about a year ago, we asked it about every member of our team, and I think I was a film producer. I think Anya was an Olympic athlete.

Chris Wall

Did I win a Nobel Prize?

John Brewer

Yeah. Yeah. I forget what your thing was, but yeah, the point is that it would give us an answer, which raises an interesting question. Is it a privacy violation if it answers about you incorrectly?

Chris Wall

So Joel, I just want to take that question. I’m not sure what the root of that is. What is the difference in privacy, in terms of my ability as an individual to exercise my privacy rights, for data that might appear in a Google search as opposed to an OpenAI search? There, depending on where you live or what rights are applicable to you, you might have the same rights for deletion or access or portability, for instance, for either one of those, effectively as search queries and search responses. And in terms of what you can do with those results, I think that might be the same. If you take action on a ChatGPT result or a Google result, your responsibility or your duty in the use of that personal information might be identical. I mean, again, I’m not [going] to practice in Switzerland, but I think your rights would probably be similar for you seeing that information coming out of either. It’s a great question though, Joel.

John Brewer

Fantastic. So I’m afraid that we’ve really come to the end of our time here. So I wanted to thank our expert panel for sharing their insights and information. And I also want to thank everyone who took the time out of their busy schedules to attend today’s webcast. We truly value your time and appreciate the interest in our educational series.

Don’t miss our April 24 webcast focusing on navigating the future of mobile forensics. That session will be led by our Chief Information Security Officer and President of Forensics, John Wilson. He will be exploring innovative forensic methodologies for handling mobile data. You can learn more about this webinar, register for upcoming webcasts, and explore our extensive library of on-demand webcasts at our website, HaystackID.com.

Once again, thank you all for attending today’s webcast and I hope you have a great day.


Expert Panelists’ Bios

+ John Brewer
Chief Artificial Intelligence Officer and Chief Data Scientist, HaystackID

John Brewer, HaystackID’s Chief Data Scientist, focuses on bringing the latest advancements in internet technologies to the eDiscovery and incident response markets. Having worked with HaystackID since 2015, John has been a software engineer and information technologist for more than two decades and has worked for dozens of Fortune 500 companies in technology leadership roles ranging from eDiscovery and data migration to information stewardship.

Prior to joining HaystackID, John worked at BackOffice Associates (now Syniti), helping to advocate for the importance of data to enterprise businesses. He was instrumental in helping bring a data-driven approach to dozens of the world’s largest firms, and he developed practical, hands-on experience with both the business and technical implications of Big Data just as it was coming to the forefront of information technology. He left BackOffice Associates in 2014 to start his own venture, Deep Core Data, LLC, where he continued to provide technical expertise to data-centric firms.


+ Bernie Gabin, Ph.D.
Senior Data Scientist, HaystackID

Dr. Bernie Gabin is currently the Senior Data Scientist on the HaystackID Data Science team. In this role, he works closely with the company’s Chief Data Scientist, John Brewer, to apply data-driven metrics to improve our procedures and develop custom AI/ML-empowered solutions for our clients.

Prior to joining HaystackID, Bernie received his Ph.D. in physics from Brandeis University. His doctoral work in brain-computer interface systems and machine learning/artificial intelligence led him to work on AI/ML-focused projects for the US Patent Office, Northrop Grumman, and the National Security Agency. At HaystackID, he brings his expertise in signal processing, AI design, and data modeling to create novel data-driven solutions.


+ Anya Korolyov
Vice President, Cyber Incident Response and Advanced Technologies Group, HaystackID

Anya Korolyov, the Vice President of Cyber Incident Response and Advanced Technologies Group at HaystackID, has 18 years of experience in the legal industry as a licensed attorney, including 15 years of experience in eDiscovery, focusing on data mining, complex integrated workflows, and document review. In her role at HaystackID, Anya works on developing and implementing the strategic direction of Cyber Incident Response. She is one of the industry’s leading experts on Data Breach Incident Response, Notification, and Reporting, with a solid understanding of machine learning, custom object development, regular expressions manipulation, and other technical specialties.


+ Christopher Wall
DPO and Special Counsel for Global Privacy and Forensics, HaystackID

Chris Wall is DPO and Special Counsel for Global Privacy & Forensics at HaystackID. In his Special Counsel role, Chris helps HaystackID clients navigate the cross-border privacy and data protection landscape and advises clients on technical privacy and data protection issues associated with cyber investigations, data analytics, and discovery.

Chris began his legal career as an antitrust lawyer before leaving traditional legal practice to join the technology consulting ranks in 2002. Prior to joining HaystackID, Chris worked at several global consulting firms, where he led cross-border cybersecurity, forensic, structured data, and traditional discovery investigations.


About HaystackID®

HaystackID solves complex data challenges related to legal, compliance, regulatory, and cyber events. Core offerings include Global Advisory, Data Discovery Intelligence, HaystackID Core® Platform, and AI-enhanced Global Managed Review powered by its proprietary platform, ReviewRight®. Repeatedly recognized as one of the world’s most trusted legal industry providers by prestigious publishers such as Chambers, Gartner, IDC, and Legaltech News, HaystackID implements innovative cyber discovery, enterprise solutions, and legal and compliance offerings to leading companies and legal practices around the world. HaystackID offers highly curated and customized offerings while prioritizing security, privacy, and integrity. For more information about how HaystackID can help solve unique legal enterprise needs, please visit HaystackID.com.


*Assisted by GAI and LLM technologies.

Source: HaystackID