Your Data, Your Models, Your Intelligence. All Private. (AI Demo)
Summary:
Join Kwasi Ankomah as he delves into the critical aspects of data privacy, model output privacy, and ownership in the AI domain. With over 15 years of experience in AI and machine learning, Kwasi brings a wealth of knowledge from his time in the banking sector, government roles, and tech startups. This talk covers the importance of transparency, record-keeping, and communication in AI projects, and the challenges of navigating data privacy, model accuracy, and ethical AI development. Discover how to mitigate risks in using commercial and open-source AI models, ensuring your data remains private and your models remain accurate and ethical.
Transcript:
Kwasi Ankomah (0:17)
Hello everyone. Welcome back. My name is Kwasi Ankomah and I'm the head of AI at Formula.Monks. We are a technology consultancy aimed at providing AI methodologies, custom builds, and also AI consulting services. So today I'm here to talk about your data, your models, your intelligence, all private. So the key of this talk is to talk to you a little bit about some of the challenges in data privacy, model output privacy, and ownership in the model space as it currently stands. Try to click. Okay.
(1:04)
So as we know, AI is constantly evolving. It's moving very fast with the advent of foundational models and large language models, and with the great speed at which things are being done. So before I dive into the meat of this topic, I'll give you a little bit of background about myself. I've been in AI and ML for about 15 years now. I'm an applied statistician. I spent my time in the early days working with banks, with the Bank of England, with the Financial Conduct Authority and with the UK government, as well as working in consultancies and most lately working for a Silicon Valley startup that worked on accelerators, essentially the GPUs, and also the software that goes on top of them. So what I'm going to talk to you about today is the fact that algorithms at the moment are a little bit like children, in that when you raise children, you need to give them good values and good guidance. And that's what we need to do with algorithms at the moment, because we're in a key stage now where we can really guide algorithms to be what we want in terms of alignment. So I want us to work together to do this. I think this isn't just a problem for data scientists, machine learning engineers, developers or researchers. I think it's a problem for everyone. I think the value chain to make this work is going to need policy makers and everyone in there to work together to make this a better set of models and outputs. So I'm going to talk to you about three key items that we're going to look at when working with AI in general.
(2:50)
The first is to be transparent. This is a key one in terms of working with AI solutions. A lot of people at the moment aren't transparent with the data and the methodology of the model. So essentially you end up with a black box approach, which people don't like. And of course, the more highly regulated the industry, think of government, healthcare, finance, the worse that is. You've got to be able to have an output that you can run and rerun and get the same result. The second is, keep a record. This is, again, a very boring but necessary part of machine learning. Everyone wants to talk about the fancy outputs, but you really need to be able to keep a record of what you've done. And a really good way of doing this is with machine learning ops. So you've got ways to have model registries, to have input logs, and the more emergent area of LLM ops, where you're able to look at what's going on through a large language model: the prompts, the databases, and any context that's passed through, right through to the output. So I think this is a really key one as well. And the last one is communicate, right? So communicate with your stakeholders, your policy makers, and make sure you are talking to them about what you're doing and what your roadmap is. I think these are really key things to kind of set the agenda going.
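To make the record-keeping point concrete, here is a minimal sketch of logging every large language model call with its prompt, retrieved context, model version, and output to an append-only file so a run can be audited and reproduced. The function and file names (log_llm_call, llm_calls.jsonl) are illustrative assumptions, not a specific LLM ops product.

```python
# Minimal sketch: append one structured record per LLM call for auditing/reproducibility.
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("llm_calls.jsonl")  # append-only audit log, one JSON record per call

def log_llm_call(model_name: str, model_version: str, prompt: str,
                 context: list[str], output: str) -> str:
    """Record the full lineage of one model call and return its id."""
    record = {
        "call_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_name": model_name,
        "model_version": model_version,   # pin the exact version for reproducibility
        "prompt": prompt,
        "context": context,               # e.g. retrieved passages passed to the model
        "output": output,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["call_id"]

# Usage: wrap whatever client you use and log before returning the answer.
call_id = log_llm_call(
    model_name="gpt-4",
    model_version="2023-06-13",
    prompt="Summarise the attached policy document.",
    context=["...retrieved passage 1...", "...retrieved passage 2..."],
    output="The policy covers ...",
)
print("logged call", call_id)
```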
(4:16)
So some of the things that we're going to talk about are the key issues when we look at privacy, right? So we've got things like limited or no ownership, accuracy of outputs, claims substantiation, data privacy, bias, and ethical checks. So I actually ran a poll before I did this talk to ask people what was the most relevant one to them, and, no surprises, the accuracy of outputs was almost 50% of what people cared about, which makes sense. You've got to make sure your model's accurate. But of course after that, it went to data privacy and security, because people are really, really worried about losing PII or your citizens' or your clients' data. So here we're going to talk about some of the challenges and the risks and how to mitigate them. So, limited or no ownership. This is a really, really key problem at the moment, especially with generative AI. It's who owns the output, right? No one likes doing it, but if you look at the T's and C's of many commercial large language models or foundation models, you may not have as much control of the output as you think. And that's really important, especially if you are in something where you are using that output downstream. You have to be really careful about who owns it, because this is going to end up in quite high profile legal battles. So I'll give you an example here. The company behind Stable Diffusion, which makes the diffusion computer vision models, is being sued by Getty Images because a lot of the data that went into that model was Getty Images' data, right? And so you're going to see more of this going forward as these models grow and people rush to get as much data in as possible to make these models as performant as possible. So it's really key that people have that kind of ownership as well.
(6:17)
Now, what can you do in terms of the risk mitigation here? I think checking contracts, making sure that you have the usage rights, and, of course, where possible, using open source models that are licensed for commercial use. And there's a couple of good reasons why this works for enterprises. One of them is that you can see how the model was trained, so you can see exactly what the audit trail and the lineage was to get your output. And the second is that once you have that license for commercial use, you should have the ownership of it as well. Now, the most important thing, I think, is the accuracy of the output of the machine learning model. Machine learning models are not perfect. We know that you only have to do a quick prompt on the latest GPT to see that sometimes it definitely gives you the wrong answer. Now, there are lots of things that can mitigate this. And I think one of the key things is monitoring the output.
(7:23)
Now, I always believe, especially in highly regulated industries, that having a human in the loop is really, really effective. I think where you have a process, be it a healthcare prediction, a financial prediction, an output around casework, having someone review it when you are on the borderline of a decision is super important. So I think that is probably the single best way to mitigate against poor accuracy. The second is to redact sensitive data, right? So wherever possible, take out sensitive data. You don't want your sensitive data to be in there and you don't want any protected characteristics to be in there, either. And then of course, where you can, use other models. So this would be something like a technique called "LLM as a Judge." And this is where you have a human-written answer as your reference, you have the output that the model comes up with, and you compare them using some sort of statistical methodology. And you use that to quickly and automatically validate the output of your model.
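As one concrete reading of the reference-based validation described here, the sketch below scores a model's answer against a human-written reference with a simple token-level F1 and routes low scores to a human reviewer. The scoring choice and threshold are assumptions; in practice the comparison is often done with embedding similarity or by prompting a second model to act as the judge.

```python
# Minimal sketch: compare a model answer to a human reference and flag low-overlap answers.
import re

def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between a human reference answer and the model output."""
    ref = re.findall(r"\w+", reference.lower())
    cand = re.findall(r"\w+", candidate.lower())
    if not ref or not cand:
        return 0.0
    common = sum(min(ref.count(t), cand.count(t)) for t in set(cand))
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def validate_output(reference: str, candidate: str, threshold: float = 0.6) -> bool:
    """True if the output passes automatically; anything below goes to human review."""
    return token_f1(reference, candidate) >= threshold

reference = "Patients should consult a clinician before changing their medication."
candidate = "Always consult a clinician before changing any medication."
print(validate_output(reference, candidate))  # True for this pair (F1 is roughly 0.7)
```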
(8:35)
This is becoming really popular. Of course it doesn't fix all the issues, but it does mean that if you do get an answer that could be labeled dangerous or negative, you can find it quickly and block it before it does any kind of reputational damage to your firm or the government. Now this one, claims substantiation, is I think a nuanced risk for this type of model. And the reason why it's nuanced is because it comes in where the model is making a claim about something that it shouldn't be, right? So let's take an example. You know, "88% of dentists recommend this toothpaste," right? They can say that on the ad, but if you look at the ad, there's always an asterisk, which basically says, "We've surveyed three people." So it's not the best sample to actually get accurate data from. And models are also capable of doing this, right? And there have been some really good examples, especially in the pharmaceutical industry, where they've tried to use chatbots to talk to patients to recommend them treatments, and the model has basically come up with some really outlandish claims, saying that certain treatments are going to cure cancers and things like that. And you absolutely don't want your model to be making those sorts of claims. So again, being able to keep your model within guardrails, ensuring that you review it for inconsistencies, and sometimes explicitly fine-tuning the model on things such as advertising claims and how to respond to users with such questions, is super key. 'Cause I think this particular one is one that catches a lot of people out.
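A minimal sketch of a claims guardrail along these lines: scan a generated answer for unsubstantiated claim patterns (statistics, cure language, endorsements) before it reaches the user, and block or route matches to review. The patterns and the handling here are illustrative assumptions, not a production rule set.

```python
# Minimal sketch: pattern-based screening of generated answers for unsubstantiated claims.
import re

CLAIM_PATTERNS = [
    r"\b\d{1,3}\s?%\s+of\s+\w+",                          # "88% of dentists ..."
    r"\bcures?\b|\bguaranteed to\b",                       # cure / guarantee language
    r"\brecommended by (doctors|dentists|experts)\b",      # unverified endorsements
]

def screen_claims(answer: str) -> tuple[bool, list[str]]:
    """Return (allowed, matched_patterns); anything matched goes to human review."""
    hits = [p for p in CLAIM_PATTERNS if re.search(p, answer, flags=re.IGNORECASE)]
    return (len(hits) == 0, hits)

allowed, hits = screen_claims("This treatment is guaranteed to cure most cancers.")
if not allowed:
    print("Blocked for review; matched:", hits)
```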
(10:26)
And the next one is the rights of publicity. So can you portray a person's name or likeness, and does that infringe their rights? And we're seeing an example of this right now, because a lot of actors in Hollywood are on strike, right? And they're on strike for this very reason: they're worried that AI models are going to infringe their likeness and they won't be able to do anything about it. So the reason why they're standing up now is because they know that they need to make a stand, because if not, I think the precedent is going to be set for whether people need to obtain this permission explicitly. And so here, you basically have the chance to avoid using such images of famous people. Now, not all companies are doing this, right? And you can see from the way that actors have reacted that they see a future where AI companies may not necessarily do this. So they've taken the action themselves to take themselves away. And you are going to see this more and more, with companies blocking certain crawlers from actually getting their data, right? So they'll make sure that instead of waiting for policy, they'll actually take defensive action to make sure that can't be done.
(11:43)
And then, of course, one of the biggest concerns is data privacy and cybersecurity, right? So here, there are so many things that can go wrong when you're trying to work with PII data, sensitive data, classified data. You know, the risk of people using commercial tools without the enterprise's knowledge is going up and up. And the reason for this is quite simply that a lot of these tools speed up the workflow of what the person's trying to do, right? So if someone is trying to write an email or they're told to write a brief, they can go and put it into Anthropic's Claude or GPT-4 and just get the thing written, right? And that's a really great thing for them because it saves them work. So you are always going to get people who are going to use that. Now, the responsibility of the enterprise is to make sure that they can mitigate this as much as possible. So when using things like OpenAI, it's about making sure that you try not to upload any internal or confidential info, right? And in the demo I'm going to do later, we'll talk a little bit about redaction, and that's one of the things that we built into our tool. And then again, be aware that some tools actually make their outcomes public. So again, if you look in the fine print of ChatGPT, when you don't pay for it, that data then gets used to fine-tune the model, right? So essentially there were some really great, well, not great if you are the company, right? But really good examples of this happening with, I believe, Samsung in Korea, where some of their plans about a product were coming up as the output of the next version of ChatGPT, because someone had used it in the first iteration and it got used in the retraining data. And when you asked about that specific thing, you then suddenly got all this great internal knowledge from Samsung.
(13:39)
Now, of course, they learned their lesson the hard way, and I think a lot of corporations are very, very wary of this. And there are some great things that you can do to mitigate against this. One of the greatest ones I talk about is redaction, right? Make sure that you are redacting PII data if you are dealing with commercial LLMs. And the second way is to use open source LLMs, to ensure that you own the weights and there isn't a third party who owns that output and can use it to benefit their model. And then of course, one that's super important is biases and ethical checks, right? You don't want your model making or perpetuating harmful biases or stereotypes around gender, race, or anything like that, right? And there have been countless examples of this. I don't have to name them. You know, Microsoft released a chatbot, I think that was the famous one, and within like a week it was spouting complete nonsense and essentially perpetuating stereotypes from the data that it was trained on. So I think some of the things that you can do to address this start in the data set, right? I believe that the group of people who build models needs to be as diverse as the data set that goes into the models. So it's not just about having a very diverse data set, which you should, and making sure that it's well curated and well cleaned, but you also need a team that is actually representative of the world around you, right? Researchers, policy makers who can be in the room to make sure that bad decisions won't be made when you are looking at the data that goes into these models. And people really need to understand the pre-processing checks that go into these models.
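To illustrate the kind of pre-processing check mentioned here, the sketch below reports how a sensitive attribute is distributed in a training data set and flags heavy imbalance for review. The file path, column name, and threshold are hypothetical.

```python
# Minimal sketch: check how a sensitive attribute is distributed before training.
from collections import Counter
import csv

def attribute_balance(path: str, column: str, max_share: float = 0.7) -> dict[str, float]:
    """Return each value's share of the column and warn if any single value dominates."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    total = sum(counts.values())
    shares = {value: count / total for value, count in counts.items()}
    for value, share in shares.items():
        if share > max_share:
            print(f"Warning: '{value}' is {share:.0%} of '{column}'; consider re-curating or rebalancing.")
    return shares

# Usage (hypothetical file and column):
# print(attribute_balance("training_data.csv", "gender"))
```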
(15:25)
So, a quick anecdote about some of these models: the single biggest factor for the improvement of large language models and foundational models now is not necessarily the architecture or the parameter count, it's more around the quality of the data. And that's actually what's driving big performance gains, especially in open source: better curated, better balanced data sets are allowing more tasks to be done with a more accurate result. And you're going to see more of that, I think, going into the future. So I'm going to talk a little bit about the risks of using commercial LLMs specifically, because I think a lot of companies now have commercial LLMs running. OpenAI did their dev day on Monday and said, "92% of Fortune 100 companies use OpenAI." And I think that tells you enough, right? That tells you that if enough people are using such a tool that goes out to a third party, then you really need to know about the risks of using them and also how to mitigate them, right?
(16:30)
So we've talked a little bit about quality control. Language models struggle with computational tasks. There's a lot being done to address that, but there have been a couple of horror stories where people have put in statistical or mathematical information, expecting the model to do the maths for them, only to get an error that gets propagated further down the line. So you've got store opening hours that don't make any sense. You've got things that really, any human would look at and say, "How did you come up with that conclusion?" So again, there really needs to be some quality control in here. Now there are also contractual risks, right? The more that folks use these kinds of models, the more the AI companies are protecting themselves, essentially via legal terms. So sharing clients' data might actually violate the terms of usage of the model itself. And a lot of people don't realize that the downstream effect of them outputting, say, something that's been proofread, or some generated output, to their client actually exposes them to some sort of legal liability.
(17:48)
An interesting update on this is that just recently, this has been such a problem that OpenAI guaranteed legal support for companies who were using OpenAI, because I think they've had so many issues with people getting sued over the outputs. But I imagine no one in this room wants to leave it until the stage where you are being brought up in front of a judge to protect yourself against these sorts of things. And then we've talked a little bit about privacy risks as well. There's a lot here about data deletion rights. So a lot of the data retention and the data privacy policy needs to be addressed for large language models. Because you tend not to see what the model was trained on, you tend also not to see how long they can retain your data for. So I think this is something that people should be very wary of as well. And IP, I think, is also a really, really key one. So in open source licensing of models, there are different types of model licenses. You have some open source models where you can use them for research purposes only. An example of this would be, I believe, the original Llama model, which was only available for research use. And now what you have is more and more open source models that can be used for commercial use as well. But again, it's a lot to expect a user to know. And as these models become easier to use, you are at risk of having a user who just goes and starts playing with a model without really understanding whether he or she can actually use it, even when it's open source. So I think, again, this kind of education for users is really key.
(19:35)
Now, what we did at Formula.Monks is that we looked at this in a few ways, right? We looked at the actual process of AI. So we run these things called spark workshops where we talk to organizations about this: we talk to 'em about data readiness, we talk to 'em about compliance, we talk to 'em about models, and we talk to 'em about how an AI project is going to affect the end to end of a business, right? Because for me, it's about solving the business problem and not just putting a model in. And so when we were looking at how we could do document intelligence, one of the things that we thought about was, well, let's think about redacting the data if it's going off to a commercial platform, and let's think about giving the user options to change to more open source models as well as using commercial models, right? The models that run on these platforms are highly interchangeable. So it may be GPT-4 today, it may be Llama 2 tomorrow, but in about six months it'll be something else. And so we built this platform so that these models were hot swappable and so that users had high privacy, in the sense that we didn't keep any of the file names, we didn't keep any of the files. And essentially it was a way of us putting privacy at the center of the document intelligence experience. So I'm just going to run this video and talk you through how the system works and some of the features that we decided to put in to help with that. So let's see if it works. Ah, there you go.
(21:09)
So here, what you see me doing is uploading a file, and what's happening on the right hand side is that once we upload the file, we actually redact it. So we actually redact the PII data from the file, and you'll see how this works. So you'll see that Gartner appears there in purple, and that's because it gets swapped for a kind of placeholder, you'll see organization_xyz, and there you go. And that's what's actually sent to the model. We don't send any kind of private data; we redact it and then just send the placeholder. And this is, again, just to make sure that when we're giving our clients the tools, we're able to give them the tools in as safe a way as possible. We know that people are going to use large language models, so we are just trying to make sure that they use them as safely as possible. Now, of course, this works just like any kind of document intelligence platform would: we can chat with enterprise documents, all that good stuff, and cite the sources. But for us, privacy was the key element of this. So with the files on the right hand side, each user has their own database, right? So this means that the databases are also kind of sectioned off, which means your data isn't being contaminated by other users.
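A minimal sketch of the placeholder-style redaction shown in the demo (e.g. "Gartner" becoming an organization placeholder before anything is sent to a commercial model), using spaCy's small English NER model as a stand-in entity detector; the tool's actual detector and placeholder format are not specified in the talk, and the mapping here stays local so answers can be de-redacted afterwards.

```python
# Minimal sketch: swap detected entities for placeholders before sending text to a commercial model.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this spaCy model has been downloaded

PLACEHOLDERS = {"ORG": "organization", "PERSON": "person", "GPE": "location"}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected entities with placeholders; return redacted text and the local mapping."""
    doc = nlp(text)
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    redacted = text
    for ent in doc.ents:
        if ent.label_ in PLACEHOLDERS and ent.text not in mapping.values():
            counters[ent.label_] = counters.get(ent.label_, 0) + 1
            placeholder = f"{PLACEHOLDERS[ent.label_]}_{counters[ent.label_]:03d}"
            mapping[placeholder] = ent.text
            redacted = redacted.replace(ent.text, placeholder)
    return redacted, mapping

safe_text, mapping = redact("Gartner predicts strong growth, according to Jane Smith.")
print(safe_text)   # e.g. "organization_001 predicts strong growth, according to person_001."
print(mapping)     # kept locally; never sent to the commercial model
```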
(22:41)
Of course you can do that in terms of different parts of an enterprise. The concept of a user can be one person, it can be a group of people, it depends, right? So again, I'm going through here and I'm making sure, like with all good solutions, that I'm checking my sources. So we always add citations, like a lot of tools do now, and we allow the user to quickly verify what's going on. Now for me, the interesting part is the AI model. So here you'll see that we are using Azure GPT-4 with a 32K context length. And while that is good for some use cases, it's not good for all of them. Now the key for me here was giving the users the option to switch to open source models for their use case, right? And when it's an open source model, it means that we are not sending the data anywhere, right? We have the model weights and it's all enclosed in the ecosystem, versus here, where I'm using a commercial model and I'm redacting to minimize the risk. So here we're using Llama, right? We look at the Llama model, we start a new chat and we upload a different file. The reason for this is we've had a lot of users who said, "We just can't have our data going to any commercial LLM." So we said, "All right, we'll give you the option to switch to an open source one." So here we're using Llama, and the cool thing here is that you'll notice that we aren't redacting. And the reason we're not redacting is because we only redact when it's going out to a commercial model. When we have our own open source model and we have the model weights, there's no need for us to redact because the data is not going anywhere. It's just staying with us locally. So this was a really cool thing: we wanted to give, again, the users the choice.
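A minimal sketch of the hot-swappable backend idea described here: one interface, two implementations, with redaction applied only on the path to a commercial API and raw text sent to a locally hosted open-source model. The client objects and their complete() calls are hypothetical stand-ins, not the platform's real code.

```python
# Minimal sketch: swap commercial and local model backends behind one interface.
from abc import ABC, abstractmethod

class ChatBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class CommercialBackend(ChatBackend):
    """Calls a third-party API, so PII is redacted first (see the redact() sketch above)."""
    def __init__(self, api_client, redact_fn):
        self.api_client = api_client
        self.redact_fn = redact_fn
    def generate(self, prompt: str) -> str:
        safe_prompt, _mapping = self.redact_fn(prompt)   # placeholders leave, PII stays local
        return self.api_client.complete(safe_prompt)     # hypothetical client call

class LocalBackend(ChatBackend):
    """Runs an open-source model in-house, so no redaction is required."""
    def __init__(self, local_model):
        self.local_model = local_model
    def generate(self, prompt: str) -> str:
        return self.local_model.complete(prompt)         # hypothetical local inference call

def answer(backend: ChatBackend, prompt: str) -> str:
    # Swapping GPT-4 for Llama 2 (or whatever comes next) is a one-line change at the call site.
    return backend.generate(prompt)
```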
(24:29)
And there were some users who actually really wanted, even in the open source case, to have the model able to redact as well. So here, what you'll see in the next piece of the demonstration is that we change models again. We use a model called Mistral, which is a new LLM, around 7 billion parameters, so it's on the smaller side of things. And in this part of the demonstration, the user can toggle which documents he or she wants to work with, right? So you can really narrow the scope. And again, this was all done in the name of privacy. We didn't want data being available to the model that the users didn't want. So when we actually upload this white paper, we will tick it. But more importantly, in the backend, Mistral 7B is a model where we've asked it to also do redaction. So we've said, even though it's an open source model, we want you to redact. And there might be some use cases where people want that, because it is kind of a belt and braces approach. It's kind of like, we're using open source, but we're also going to redact just in case. And so here you can see that I'm chatting with the different documents, but when I ask about an entity, a person, something that could be sensitive, it knows to redact it. And you'll see there again, you can see that it's redacting the organizations and the individuals even on the open source model. So we wanted to give as much flexibility to the user as possible. So this is, for me, an example of how organizations can use AI in a safe and responsible way. My thinking is that people are always going to use LLMs, they're always going to use commercial tools. It's about making sure that they use them in a way that is safe and responsible, and making sure that your citizens' and your companies' and your enterprises' data doesn't end up in a third party tool such as OpenAI or Claude or anything else. So that was the bulk of the demo. And I have a little bit of time for questions if there are any.
(26:52)
Yes?
[Attendee] So when you say like, tentative databases of... So you mentioned like separate databases, like-
Yes, so there's Postgres, there are instances of Postgres with PG Vector.
[Attendee] Gotcha, gotcha. Okay, so this is a fully internal...
Yeah, I reckon, yeah. So this is deployed using PG Vector as the database. So again, we had options where people wanted it to be fully on-prem. We had it where people wanted to use a bit of a cloud service. For me, the key is choice, right? We're trying to make sure that where people want absolute isolation, they have absolute isolation, right? And there are reasons for that, so yeah. Any more questions? Nope, people are ready for lunch. Awesome, awesome. Thank you very much.
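A minimal sketch of the per-user isolation described in this exchange: each user (or group) gets its own Postgres schema with a pgvector table, so retrieval can never cross users. Connection details, schema naming, and embedding dimension are assumptions, not the deployed system's configuration.

```python
# Minimal sketch: one isolated Postgres schema + pgvector table per user.
import psycopg2

EMBEDDING_DIM = 1536  # depends on the embedding model in use

def ensure_user_store(conn, user_id: str) -> str:
    """Create an isolated schema and vector table for one user if they do not exist."""
    schema = f"user_{user_id}"  # in real code, validate/whitelist user_id before building names
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cur.execute(f'CREATE SCHEMA IF NOT EXISTS "{schema}";')
        cur.execute(
            f'CREATE TABLE IF NOT EXISTS "{schema}".chunks ('
            " id BIGSERIAL PRIMARY KEY,"
            " content TEXT NOT NULL,"
            f" embedding vector({EMBEDDING_DIM})"
            ");"
        )
    conn.commit()
    return schema

if __name__ == "__main__":
    # Hypothetical connection details; queries always target the caller's own schema.
    conn = psycopg2.connect("dbname=docintel user=app password=app host=localhost")
    schema = ensure_user_store(conn, "alice")
    with conn.cursor() as cur:
        cur.execute(f'SELECT count(*) FROM "{schema}".chunks;')
        print(cur.fetchone()[0])
```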