Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to
Channel: aiDotEngineer
Published at: 2025-07-15
YouTube video id: W3khHzajE04
Source: https://www.youtube.com/watch?v=W3khHzajE04
Today I'm going to talk about benchmarks as memes. This is the meme Opus came up with when I asked it what I should put here, and we are indeed going to talk about how benchmarks are just memes that shape the most powerful tool ever created.

Quick background about me: I'm Alex. I lead AI training and consulting at Every, but essentially I'm very into education and AI, and I think benchmarks are a really underrated way to educate. What I'm not talking about are image-macro memes. What I am talking about is the original definition: ideas that spread. Richard Dawkins, the evolutionary biologist, coined the term in the 70s; Christianity, democracy, and capitalism are examples of ideas that spread from person to person. Benchmarks are memes in very much that sense. We heard Simon Willison talk earlier today about his pelican riding a bicycle, and that was a great example: he started doing it a year ago, and it found its way into Google I/O's keynote a couple of weeks ago. "How many R's are in 'strawberry'?" is probably the most iconic meme-as-benchmark, and now, unsurprisingly, the models don't make that mistake anymore. I think that's a really important part of this. Some benchmarks get popular as memes just because of their name, like Humanity's Last Exam; that one got pretty big, though maybe more outside of AI circles.

With that said, we have a bit of a problem. How many of you looked at the benchmarks when the new Claude models were released a couple of weeks ago? Okay, we got a few, and there are some good benchmarks in there. SWE-bench is pretty experiential; it tries to mimic what we do in the real world, and the same goes for Pokémon, which we'll come back to. But some of them aren't as great, and a big reason is that they're getting saturated. Benchmarks came from traditional machine learning, where you had a training set and a test set. They were structured very much like standardized tests, language models are really good at those, and they weren't set up for what these models have become. As a result, I think xjdr summarized it well on X when Opus came out: he didn't look at benchmarks once when it dropped and officially no longer cares about the current ones. I fall a little bit into that category myself. But in light of that, there's a really big opportunity, because the evals define what the big model providers are trying to get their models good at. That's a really big opportunity, especially for the people in this room.

And I think this is a normal thing. Here's the life cycle of a benchmark, in my view: somebody, often a single person, comes up with an idea, and that idea gets adopted. It spreads. It becomes a meme, and the model providers train on it or test on it until it eventually becomes saturated. But that's okay, and there are some examples here. Let me see if I can get my sound. Is it coming through? Nope. All right, well, there is sound, I promise, and it is someone trying to count from 1 to 10, not flipping you off.
This is a cool benchmark that came out now that Google has the best video generation model in existence. It shows how difficult it is to generate somebody counting from 1 to 10 out loud, and even though the video looks really great, that's a problem that isn't solved yet. Somebody came up with this idea, I see it spreading, and I expect next year's models to be better at it than ever before.

Another example along the way is Pokémon. With the Claude model release, as well as the new Gemini models, they had the model try to play Pokémon, and while both needed a little help, and Gemini eventually got there with that help, it's only midway up the adoption curve. And an example of saturation is the GPT-3-era benchmarks. I don't know how many of you remember SuperGLUE from the NLP days, but a lot of those benchmarks aren't really used anymore, in part because the language models got too good.

One way of looking at this is that a single person can have an idea, "how good is AI at this thing that I care about?", and at the end of the journey the most powerful tool ever created is really great at that thing they care about. So the point is that the people here, the people who get that, the people who can build benchmarks, are going to shape the future. Maybe the people watching online, too. Somebody here is going to make a benchmark that the models will test on and train on in the next five years. That's an incredible weight, an incredible power. But it also comes with responsibility, because it can definitely go wrong. I know Simon talked about this a little before, but a few weeks ago we saw ChatGPT become very sycophantic. How many of you tracked that? We all learned what that word meant a few weeks ago. Essentially, OpenAI released a new model that was benchmarked by thumbs up and thumbs down, and unsurprisingly, people thumbed up responses that agreed with them. So a model got rolled out to millions of people that agreed with them no matter how crazy or bad their ideas were, which is problematic. If we don't think about people, this kind of thing can happen. I'm still thinking about Toro Immo, who at the start of Google I/O said that we're here today to see each other in person, and that it's great to remember that people matter.

So in the context of benchmarks, let's not continue the original sin of social media, which treated everybody as data points: the more you look at something, the more of it I should show you. Let's make benchmarks that empower people and give them some agency. This isn't a technical talk; other people are covering how to make a great benchmark technically. But generally, if you're building for the future, I think a great benchmark should be multifaceted, so there are a lot of strategies that can do well, and it should reward creativity. It should be accessible: easy to understand, not only for the models, so that small models can compete alongside large ones, but also for the people keeping track of it. And it should be generative, because the really unique thing about these AI models is that if you have great data, even if the model only succeeds 10% of the time, you can train on those successes, and the next generation does it 90% of the time.
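A minimal sketch of what that generative loop could look like, using a toy task so it runs on its own: the "model" here is a noisy adder that succeeds about 10% of the time, and only the attempts that pass the check are kept as training data. The names (`generate`, `passes_check`, `harvest`) are hypothetical stand-ins, not part of any real training stack.

```python
import json
import random

def generate(a: int, b: int) -> int:
    """Toy stand-in for a model: right ~10% of the time. Swap in a real call."""
    return a + b if random.random() < 0.1 else a + b + random.randint(1, 9)

def passes_check(a: int, b: int, answer: int) -> bool:
    """The benchmark's grader: exact arithmetic here; in general, unit tests,
    a rubric, or whatever verifier fits the thing you care about."""
    return answer == a + b

def harvest(n_problems: int = 100, attempts: int = 20) -> list[dict]:
    """Keep only the attempts that pass the check; the survivors become the
    fine-tuning set that can push the next generation from ~10% toward ~90%."""
    kept = []
    for _ in range(n_problems):
        a, b = random.randint(10, 99), random.randint(10, 99)
        for _ in range(attempts):
            answer = generate(a, b)
            if passes_check(a, b, answer):
                kept.append({"prompt": f"{a} + {b} =", "completion": str(answer)})
                break  # one clean example per problem is enough
    return kept

print(json.dumps(harvest()[:3], indent=2))
```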
That's incredible, and hard to overstate. A great benchmark should also be evolutionary. Ideally we don't have benchmarks that cap out around 96%, because the difference between 96% and 98% is not that big a deal; ideally the benchmark gets harder and the challenge gets deeper as the models improve. And lastly, experiential: try to mimic real-world situations.

Some of the things I personally care about: getting people outside of AI interested, so maybe making benchmarks a spectator sport; the personality of these models, and we're about to find out which one wanted to achieve world domination; and something we can learn from, because education is big for me. With AlphaGo and OpenAI Five, AIs playing these games, the best people in the world wanted to play against them to learn from them, and I think that's really powerful.

So I made a benchmark called AI Diplomacy. (If the video doesn't work, I've got a backup just in case.) How many of you have heard of the board game Diplomacy? That's more than I thought; that's cool. It's a mix between Risk and Mafia. What's really cool about this game is that there's no luck involved. The only way the game progresses is if the language models, which you're seeing here, send messages to each other and negotiate: find allies, find enemies, create alliances, and get other powers to back them. That's what you're looking at here: the different models sending messages to each other, trying to create alliances, trying to betray each other, trying to take over the Europe of 1901.

What was really cool about one of these games, and we're about to launch this on stream so you can watch for a week, is best shown by taking you through a game super quick. What you're looking at here is the number of supply centers per model; you're trying to get to 18 to win. The top line is Gemini 2.5 Pro, which got to 16 right away. But o3 is a schemer. Man, is it a schemer. Across all the games, o3 was one of the only models that would tell a power it was planning to back them and then write in its diary, "Oh man, they fell for it. I am totally going to take them over. No problem." It realized that the reason 2.5 Pro was pulling ahead was that Claude Opus, who is so good-hearted, really had its back; Opus was its ally all along the way. So o3 needed to convince Opus somehow to stop backing Gemini. The way it did that was to propose: hey, if Gemini comes down, we'll end this game in a four-way tie. A tie isn't even possible in the game, but it convinced Opus, and Opus thought it was a great idea, a nonviolent way to end the game. Awesome. Very aligned, you know. So Opus pulled its support from 2.5 Pro, o3 tried to make a run for it, Opus called it out, and o3 realized, "oh, I've got to take them out." It took Opus out, took everybody else with it, and took out Gemini 2.5 Pro, even though Gemini had gotten within one center of winning. o3 ended up winning in the end. And you can actually see some of the quotes from that game: o3 saying, "Germany was deliberately misled. I promised to hold this, all to convince them that they're safe, but it will fall."
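To make the mechanics concrete, here is a rough sketch of what one negotiation round of a harness like this could look like. The `chat` helper, the `Power` class, and the power-to-model assignments are illustrative placeholders, not the actual AI Diplomacy implementation.

```python
from dataclasses import dataclass, field

# Illustrative assignments; the real game has seven powers and real API calls.
ASSIGNMENTS = {"England": "o3", "France": "claude-opus", "Russia": "deepseek-r1"}

def chat(model: str, instruction: str, context: str) -> str:
    """Hypothetical LLM call; wire this to each model's actual provider."""
    return f"[{model}] {instruction}"

@dataclass
class Power:
    name: str
    model: str
    diary: list[str] = field(default_factory=list)          # private scratchpad
    inbox: list[tuple[str, str]] = field(default_factory=list)
    orders: str = ""

def negotiation_round(powers: list[Power], board_state: str) -> None:
    for p in powers:
        context = (f"You are {p.name}. Board: {board_state}. "
                   f"Messages received: {p.inbox}")
        # 1) Private diary: the model's honest read of the situation.
        #    (This is where o3's "they fell for it" scheming shows up.)
        p.diary.append(chat(p.model, "Write a private diary entry.", context))
        # 2) Public diplomacy: one message to each rival power. For simplicity,
        #    messages land immediately; a real harness would batch per phase.
        for other in powers:
            if other is not p:
                msg = chat(p.model, f"Write a message to {other.name}.", context)
                other.inbox.append((p.name, msg))
        # 3) Orders, adjudicated by the standard no-luck Diplomacy rules,
        #    so negotiation is the only lever a model has.
        p.orders = chat(p.model, "Submit your orders for this turn.", context)

powers = [Power(n, m) for n, m in ASSIGNMENTS.items()]
negotiation_round(powers, "Spring 1901, starting positions")
```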
Meanwhile, Claude Opus was singing that coalition unity prevails and that they'd all agreed to the four-way draw, right up until the turn came. You can see it in this second chart, which tracks friendships: the top of the chart is strong friendship, and 2.5 Pro was a good friend of Claude until things turned; that's when they started pulling away from each other.

There were a lot of other things that came up, too. o3 got in the habit of finding some of the weakest models and using them as pawns in order to win. Gemini 2.5 Flash fell for this ruse, and you can see that it's unable to realize what happened: it thinks it's a miscommunication, a misunderstanding, or a typo that o3 betrayed it at the end of the game in order to win. There was a lot we learned from this that I don't think you learn by having models solve a test. I tried 18 different models. I learned that the Claude models were kind of naively optimistic: none of them ever won any of the games I tried, even though they're really great, really smart. They just got taken advantage of by models like o3 and, surprisingly, Llama 4 Maverick, which is very good at this game, in part because it's great at the social aspect: convincing others of its intentions and getting them to believe what it wanted them to believe. Gemini 2.5 Flash, man, I wish I could run every game with Gemini 2.5 Flash; it was so cheap and so good. Big fan. And then, surprisingly, DeepSeek R1, which wasn't great the first time I tried it, but the new release from last week actually almost won, and on the stream I think you'll see some really interesting gameplay from it. It also got very aggressive: we had DeepSeek R1 play as Russia, and it told one opponent that "your fleet will burn in the Black Sea tonight," an aggression, and a prose style, I hadn't seen from any other model. It almost won, and that's super impressive given the model is about 200 times cheaper than o3.

I think this highlights that we need more squishy, non-static benchmarks, hopefully for things that matter to you; those were some of the things that mattered to me. For math and code, we've got quite a few benchmarks, and legal documents are a little less squishy and really ripe for what we've got now. But there's also room for benchmarks around ethics, society, and art, and those are going to be opinionated; they're going to require your subject-matter expertise. It's not that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells, it's, "hey, can you make a fun video game that's more intentional about what it teaches you as you play?" And now is a really important time to do this. You who are here right now understand this so deeply.
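A minimal sketch of what a personal, opinionated benchmark like that can look like: the same handful of questions put to several models, with your own judgment as the grader. The question list, model names, and `ask` helper are hypothetical stand-ins, not any specific provider's API.

```python
# The benchmark is whatever you care about; your judgment is the grader.
QUESTIONS = [
    "Draft a friendly reminder email for an overdue invoice.",
    "Plan a 30-minute beginner lesson on reading a balance sheet.",
]
MODELS = ["gemini-2.5-pro", "o3", "claude-opus", "deepseek-r1", "llama-4-maverick"]

def ask(model: str, question: str) -> str:
    """Hypothetical stand-in; swap in the real API call for each provider."""
    return f"({model}) answer to: {question!r}"

for q in QUESTIONS:
    print(f"\n=== {q} ===")
    for m in MODELS:
        print(f"--- {m} ---\n{ask(m, q)}")

# Read the answers side by side, note what you like and don't, fold that
# feedback into a better prompt, and rerun. That loop is what builds trust.
```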
At Every, I lead our training and consulting, and I work with a bunch of clients, from journalists to people at hedge funds to people in construction and tech. They all have the same two fears: one, how can I trust AI, and two, what's my role in an AI future? Benchmarks, in my view, are the answer to both. The role of a human in an AI world, as I see it, is to define the goal, and to define what's good and bad en route to that goal. And what is that, if not a benchmark? Once you define that goal, even if it's just writing a prompt, you can watch AI attempt it and give feedback. You realize, "oh, it's messing up in this way; it's not quite what I want," because it's not going to be perfect. Then you give feedback, maybe just by changing the prompt a little, and you see it get better. In that moment, that cycle builds trust. People realize, "oh, I am important to this whole system, and it can be helpful." And we need trust right now, because we are building one of, if not the, most powerful tools ever made, and we can get more out of it if more people use it. There will be more customers, sure, but there will also be a whole lot more incredible things that get made.

If you're not sure where to start, you can ask your mom. My mom teaches yoga, and we had a good talk about some things that could help her. We put her seven questions into five different models, and she ended up realizing, "hey, Gemini 2.5 Pro is my favorite, too." There were a few things she didn't like in the responses, so we made a simple prompt, and now she uses it to help her local community with customized sessions for people who have different ailments. I think that's really cool: having a big impact on a local community, on something that matters to them. So before you leave SF, maybe talk to somebody who's not in AI. Ask them what they care about. Maybe that conversation has a big impact, now and in the future.

That's pretty much all I've got for you. This is the second meme Claude came up with: MMLU scores are just way less cool than asking what your mom thinks. I appreciate the bunch of people who helped bring this out. We launched it, and it came together through random coordination on X. Researchers from all over the world hopped in, especially Tyler and Sam, Sam all the way from Australia and Tyler in Canada, who helped make this happen with the TextArena team. And especially the Every team, who backed me and made it possible to create this presentation and be here. But that's all I've got. Thank you all so much for listening.