I’ve been away for a couple of weeks, and the AI landscape shifted again while I wasn’t looking. Pretty typical at this point. I spent a solid two hours this morning running a proper comparison between Google’s Bard (which I’m fairly certain was serving me Gemini Pro responses, not Ultra), GPT-4, and Claude 2. Same prompts, same context, copy-pasted across all three. Here’s what I found.
The Setup
I fed each model around 6-8K tokens of context to kick things off — enough to give them something real to work with, not just a toy prompt. The goal was to see how each one handled a substantive, multi-turn conversation where nuance actually matters.
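For the record, the "same prompts, same context" part was literal copy-paste into three browser tabs. But if you wanted to script the same comparison, a minimal sketch might look like the following. I'm assuming the OpenAI Python SDK and tiktoken here; the context file, the example prompt, and the ask_gemini / ask_claude stubs are placeholders I'm making up for illustration, not code I actually ran.

```python
# Minimal sketch of a "same prompt, same context" comparison harness.
# Assumes: pip install openai tiktoken, and OPENAI_API_KEY in the environment.
# The Gemini and Claude helpers are hypothetical stubs; I used the web UIs.

import tiktoken
from openai import OpenAI

CONTEXT = open("project_context.txt").read()  # the ~6-8K tokens of shared context
PROMPT = "Given the context above, critique the onboarding flow and propose three alternatives."

# Sanity-check the context size with GPT-4's tokenizer. Other models tokenize
# differently, but this gives a reasonable ballpark for "6-8K tokens".
enc = tiktoken.encoding_for_model("gpt-4")
print(f"context is ~{len(enc.encode(CONTEXT))} tokens")

def ask_gpt4(context: str, prompt: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def ask_gemini(context: str, prompt: str) -> str:
    # Hypothetical placeholder: swap in Google's SDK or paste into Bard by hand.
    raise NotImplementedError

def ask_claude(context: str, prompt: str) -> str:
    # Hypothetical placeholder: swap in Anthropic's SDK or paste into claude.ai.
    raise NotImplementedError

for name, ask in [("gpt-4", ask_gpt4), ("gemini-pro", ask_gemini), ("claude-2", ask_claude)]:
    try:
        print(f"\n=== {name} ===\n{ask(CONTEXT, PROMPT)}")
    except NotImplementedError:
        print(f"\n=== {name} === (run the same context + prompt by hand)")
```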
I’ve written before about how the real differentiator between these models isn’t the party tricks — it’s what happens when you push past the first response and start actually WORKING with them. That’s where the gaps show up.
Gemini Pro Surprised Me
I’ll be honest — I went in expecting to be underwhelmed by Bard, and I wasn’t. Gemini Pro is genuinely good. Good enough that I’m going to start routing a lot more of my daily workflow through it.
The integration story alone is pretty compelling. Export straight into Google Docs. The ability to pull context from your Drive or reference email exchanges. If you’re already living in the Google ecosystem — and most of us are, whether we like it or not — that’s a meaningful productivity advantage. It’s not just about the model’s raw intelligence. It’s about where the output GOES and what you can do with it next.
Gemini Pro clearly outperformed GPT-3.5 and Claude 2 in my testing. That’s not a controversial take at this point, but it’s worth stating plainly. The gap between “free tier” and “paid tier” AI is widening, and Gemini Pro is now firmly in the upper bracket.
But Then I Pasted Into GPT-4
Here’s where it got interesting. When I started running the same prompts through GPT-4, something shifted — not just in the outputs, but in what I WANTED to ask.
GPT-4 was hitting nuances that Gemini Pro missed. It was grasping my intent at a deeper level — enough so that I realized my prompts had been shaped by Bard’s limitations. If I’d started with GPT-4, my entire user journey would have been different. More advanced. More efficient. I would have asked better questions because the model was meeting me at a higher level of abstraction.
This is something that doesn’t show up in benchmarks, and I think it’s one of the most underappreciated dynamics in AI right now. The model you use shapes the conversation you have, which shapes the work you produce. It’s not just about getting a better answer to the same question — it’s about arriving at better QUESTIONS entirely.
GPT-4 (via ChatGPT’s built-in image generation) also offered actual visual mockups and images as part of its responses, which added a tangible dimension that a text-only reply can’t match. When you’re working through design concepts or trying to communicate something spatial, that’s not a gimmick. It’s genuinely useful.
The Creativity Question
One area where Bard actually impressed me more was creativity — specifically around design thinking and visual concepts. There was a looseness to its suggestions that felt less constrained than GPT-4’s more structured approach. I’d be curious to see if others have had the same experience. My sample size is one morning, and I know it.
I think there might be something to the idea that different models have different “creative signatures.” GPT-4 is precise and thorough. Gemini Pro is a bit more willing to take swings. Neither approach is universally better — it depends on what you’re building and what stage you’re at.
Where This Leaves Claude
I’ve had a Claude 2 subscription for a while now, largely because of its longer context window. In theory, being able to feed more text into a model should be a significant advantage. In practice, I’m not sure the bigger window is worth the tradeoff in output quality.
I’m likely dropping my Claude subscription. That’s not a permanent verdict on Anthropic — they’re doing important work, and I wouldn’t be surprised if Claude 3 changes the equation entirely. But right now, in December 2023, the value proposition just isn’t there for me when I’m comparing it head-to-head against GPT-4 and a rapidly improving Gemini.
The Bottom Line
I’m keeping my GPT-4 subscription — that’s the clearest takeaway from this exercise. When you need depth, nuance, and the ability to push a conversation into genuinely productive territory, it’s still the benchmark.
But I’m going to use Bard/Gemini a LOT more. The Google integration alone makes it worth incorporating into daily workflows, and the response quality has crossed the threshold from “interesting experiment” to “actually useful tool.”
Here’s what I think matters most: we’re entering a phase where the right move isn’t picking one model and going all-in. It’s understanding what each one does well and routing your work accordingly. GPT-4 for depth and nuance. Gemini for integration and creative exploration. And keeping an eye on everything else, because this whole landscape could look completely different in three months.
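If that routing idea sounds abstract, here’s the shape of it in a few lines of Python. The task categories and model choices are just my own heuristics from this one morning, nothing more:

```python
# Toy illustration of "route the task to the model that's best at it".
# Categories and assignments are personal heuristics, not benchmark results.

ROUTES = {
    "deep_reasoning": "gpt-4",          # nuance, long multi-turn work
    "docs_and_email": "gemini-pro",     # anything that lands in the Google ecosystem
    "creative_exploration": "gemini-pro",
}
DEFAULT = "gpt-4"

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, DEFAULT)

print(pick_model("docs_and_email"))   # gemini-pro
print(pick_model("deep_reasoning"))   # gpt-4
```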
The competition is working. These models are getting better fast. That’s pretty great for all of us.