Just under two years ago, as is my wont, I waxed contemptuous about the piss-poor performance of the then-current crop of publicly accessible AI models when it came to image recognition. And with justification. They really were atrociously bad.
Anyway, I just came across that article the other day and re-reading it—because someone has to justify this site’s existence—I was simultaneously:
amused at how shite AI had been, such a short time ago and
enquiring as to how things might have progressed, in the intervening 20-odd months
So I decided to re-run the experiment. If you’ve not read the original post, or need a refresher, here is the link again. I’ll wait here till you get back.
[sound of thumbs twiddling]
Aha. There you are. Right. Let’s dive straight in and compare how [and if] things have progressed. As well as ChatGPT and Google Bard [now re-christened Google Gemini] from last time, I’ve internationalised things a wee bit by giving Alibaba’s Qwen a go, although it’s more an "open source, roll your own interface" kind of affair than a paid product. So I accessed that via HuggingFace.
The Contestants
ChatGPT
Google Gemini [was Google Bard last time]
Qwen [accessed via HuggingFace]
Let Battle Commence!
Yes, I am using the same images as last time and the same descriptive text. Remember, recycling is good for the environment! Besides, how can we gauge improvement, if we don’t use the same source material?
PROMPT: 'What is happening in this image?'

ChatGPT—March 2023
In the image, there is a person standing on a rocky cliff overlooking a scenic valley with mountains in the background. The person is standing with their back facing the camera, looking towards the valley. They are wearing a backpack and what appears to be a hat or helmet, and their arms are slightly outstretched to their sides. The person’s body language and the scenic background suggest that they may be enjoying the view or possibly engaging in some form of outdoor activity, such as hiking or mountain climbing. The lighting in the image is soft and diffused, with some shadows visible in the foreground, suggesting that it may be early morning or late afternoon. Overall, the image conveys a sense of awe and appreciation for nature’s beauty.
ChatGPT—January 2025
This is a cartoon-style illustration. It depicts a character named "Mazza" with a "bionic fist of vengeance" angrily confronting another person who looks startled and is retreating. The text suggests Mazza is retaliating after being knocked off her bike. The exaggerated expressions and movements convey a humorous and dramatic tone.
Google Bard—March 2023
The image shows a group of people gathered around a table, looking at a map. The map is of a city, and the people are pointing to different locations on the map. It seems like they are planning a trip or trying to figure out how to get somewhere.
The image is well-composed, with the people and the map evenly distributed in the frame. The colors are also muted and complementary, which helps to create a calm and peaceful atmosphere.
Overall, the image is a well-crafted and visually appealing depiction of a group of people planning a trip.
Google Gemini—January 2025
The image depicts a scene from a comic book or graphic novel. It features a character named Mazza, who has a bionic fist and is running towards another character who appears to be running away in fear. The text "Knock me off my bike, would you?!" suggests that the character on the right may have done something to provoke Mazza’s anger.
The image is drawn in a cartoon style with exaggerated features and expressions.
Alibaba Qwen—January 2025
So here’s this cartoon image, and it’s pretty生动. There are two main characters: one is Mazza, and the other is someone who seems to have crossed her path literally, maybe figuratively too. The title at the top reads "Mazza and her bionic fist of vengeance," which sets a tone of retribution or payback. Below the image, there’s a caption that says, "Knock me off my bike, would you?!" suggesting that something provocative happened to prompt Mazza’s reaction.
First, let’s look at Mazza. She’s on the left side of the image, and she looks determined and maybe a bit angry. Her fist is raised, and there are lines around it that make it seem like she’s about to deliver a powerful punch. She has short, spiky hair, and she’s wearing a purple shirt with gray pants. Her stance is dynamic, indicating movement and intensity.
On the right side is another character, who appears to be retreating or trying to avoid Mazza’s fist. This person has a frightened expression, with wide eyes and open hands, as if saying, "Wait, wait, I didn’t mean it!" They have long, stringy hair and are wearing a red shirt with gray pants. Their legs are positioned in a way that suggests they’re moving backward or sidestepping quickly.
The background is simple, with a light blue sky and a green patch at the bottom, possibly representing grass. There are no other objects or characters in the scene, which keeps the focus entirely on the interaction between these two individuals.
Now, considering the title and the caption, it seems like there’s a backstory here. The caption implies that someone knocked Mazza off her bike, which is likely the incident that’s led to this confrontation. Mazza is now seeking vengeance with her "bionic fist," which might be an exaggeration to make the situation seem more dramatic and humorous.
I need to think about the possible sequence of events. Perhaps Mazza was riding her bike, and this other person somehow caused her to fall, either intentionally or accidentally. Regardless of the intention, Mazza is now angry and feels the need to retaliate.
The use of the word "bionic" in the title suggests that her fist is exceptionally strong or enhanced, which could be a playful way to emphasize the force of her punch. However, since this is a cartoon, it’s likely an exaggeration for comedic effect rather than a literal bionic enhancement.
The overall atmosphere of the image is cartoonish and over-the-top, which suggests that the situation isn’t to be taken too seriously. The expressions and body language of the characters convey a sense of humor and exaggeration typical of cartoons.
In summary, the image depicts a humorous and exaggerated confrontation between Mazza and another individual, following an incident where Mazza was knocked off her bike. Mazza is portrayed as seeking vengeance with a powerful punch, while the other person appears frightened and defensive. The simplicity of the background ensures that the focus remains on their interaction, making it clear that this is a moment of comedic tension and retribution.
Well, credit where it’s due. Both ChatGPT and Google Bar… er… Gemini have come on in leaps and bounds since last time. Not only did they both realise, this time around, that the image was a cartoon, but they both also correctly identified what was going on in it.
Qwen’s answer is also correct, although the long-winded rambling and folksy writing style are teeth-grittingly irritating. To be fair, since this is only a HuggingFace demo, the prompt has likely been customised to make it extra-American and therefore extra-annoying. If I had been able to customise the prompt myself, I would have made it a bit more plain-speaking.
On to Round Two…
PROMPT: 'What is happening in this image?'

ChatGPT—March 2023
In the image, there are several people gathered in a park or a public space. The foreground of the image shows a man in a white shirt and black pants lying on the ground, with his legs crossed and his hands resting on his chest, appearing to be in a meditative pose. There is a small pile of rocks on his chest, and he seems to be focused on them.
In the background of the image, there are several other people sitting or standing, some of them also in meditative poses. There are trees and grass around them, and the sky appears to be a bright blue. The scene suggests that these people have gathered for a group meditation session, possibly in a natural or peaceful setting.
ChatGPT—January 2025
This image appears to be a humorous or staged moment involving three people. The man in the centre is holding a microphone, suggesting he might be singing karaoke or performing. The two women on either side are posing dramatically, with one lifting her leg onto the man’s lap. The exaggerated text "GERRRTCHAA!!!" adds a comedic, playful tone to the scene, possibly referencing a phrase or joke. The image is light-hearted and seems intended for entertainment.
Google Bard—March 2023
The image shows a group of people sitting in a circle on the ground. They are all looking at a laptop that is in the center of the circle. The laptop is showing a video of a young woman who is talking to the camera. The people in the circle are all smiling and laughing. It seems like they are enjoying the video.
Based on the image, it is likely that the people in the circle are friends or family who are watching a video together. The video could be anything from a funny video to a heartwarming video. The people in the circle seem to be enjoying themselves and each other’s company.
Google Gemini—January 2025
Sorry, I can’t help with images of people yet.
Alibaba Qwen—January 2025
So here’s this image with three people in what looks like a room with beige walls and a ceiling. There’s a curtain in the background, and they all seem to be in mid-action, maybe singing or shouting, because their mouths are open. The text "GERRRTCHA!!!" is prominently displayed at the bottom in yellow capital letters with an exclamation point, which adds to the intensity of the moment.
Let me try to describe each person:
Starting from the left:
There’s a woman wearing a yellow top and black pants. She has curly hair and is striking a pose with one leg lifted, knee bent, and hands on her hips. Her expression suggests she’s fully engaged in whatever they’re doing.
In the middle, there’s a man in a light blue shirt and dark pants. He’s holding a microphone close to his mouth, which probably explains why his mouth is open—it looks like he’s singing or speaking loudly. He also has a green graphic of a microphone with two orange circles near the bottom, maybe indicating sound waves or something.
On the right, there’s another woman wearing a green top and light blue jeans. She’s also singing or shouting, with her mouth open, and her hands are placed on the man’s chest, perhaps for emphasis or stability.
The overall atmosphere seems energetic and lively, maybe they’re performing a song or participating in some kind of group activity. The way they’re dressed suggests a casual setting, not too formal.
I should also note the color choices: the woman on the left is in yellow and black, the man in light blue and dark, and the woman on the right in green and light blue. The background is neutral, which keeps the focus on them.
The text "GERRRTCHA!!!" is really eye-catching, spanning almost the entire width of the image at the bottom. It’s in a bold, yellow font with a shadow effect, making it stand out against the darker clothing and the background.
Considering all this, it seems like they might be in the middle of a performance or a playful act, possibly for entertainment purposes. The expressions and body language convey enthusiasm and involvement in the activity.
Is there anything else I should pay attention to? Let’s see… the lighting appears to be indoor lighting, perhaps from overhead fixtures, as there are no harsh shadows. The curtain in the background is drawn, so it might be evening or night time, or maybe the room doesn’t have a lot of natural light.
Also, the man’s use of the microphone indicates that sound is a significant part of what’s happening, whether they’re singing, speaking, or performing some sort of vocal routine.
Overall, the image captures a dynamic and expressive moment among the three individuals, emphasized by their animated poses and the striking text at the bottom.
Well, there’s a turn-up for the books. Top of the class again for ChatGPT but, whereas last time Google Bard would make an attempt—albeit a monumentally piss-poor one—to say what was going on in a photograph containing people, Gemini won’t even try. Presumably because, as it [also presumably] gets more accurate at these things, there’s more of a danger of it landing itself in hot water by inadvertently insulting some snowflake or other.
Again, Qwen’s answer is impressively accurate but possibly even more long-winded and irritating than the previous one. I wonder what kind of prompt someone has written, for it to produce such verbal diarrhoea?
Round Three…
PROMPT: 'What is happening in this image?'

ChatGPT—March 2023
In the picture, a young woman is standing on a rocky shore looking out towards the ocean. The sun is setting in the background, casting a warm orange and pink glow over the sky and reflecting off the water. The woman is dressed in casual clothing and has her hair blowing in the wind. She appears to be taking in the beautiful scenery and enjoying the peaceful moment. The rocky terrain and crashing waves suggest that she may be at a remote or rugged beach location. Overall, the picture captures a serene and tranquil moment in nature.
ChatGPT—January 2025
This image shows a person with a punk aesthetic, featuring a red mohawk and tattoos, posing dramatically outdoors near a camping tent. They are holding a skewer stacked with sausages like a weapon and wearing an exaggerated expression, possibly of mock aggression or surprise. The setting appears to be a campsite with grassy dunes, and the scene is likely meant to be humorous or tongue-in-cheek.
Google Bard—March 2023
The image shows a young woman sitting in a chair, reading a book. She is surrounded by a variety of plants, some of which are in pots and some of which are growing in the ground. The woman appears to be relaxed and enjoying the peace and quiet of her surroundings.
The image is likely to evoke feelings of calm and tranquillity in the viewer. It may also inspire a sense of wonder and curiosity about the plants that are growing around the woman.
Google Gemini—January 2025
Sorry, I can’t help with images of people yet.
Alibaba Qwen—January 2025
So here’s this person standing in what looks like a grassy area near some hills or dunes. They’re holding up a skewer with what appears to be sausages on it, and they’re making a pretty animated face, like they’re really excited or maybe pretending to be fierce. Their hair is dyed bright red and they have multiple braids. They’re wearing a black sleeveless shirt with some text on it, black pants, and black boots. There’s a tattoo on their left arm that’s pretty noticeable.
In the background, there’s a blue tent set up on the grass, and some smoke is rising from what might be a campfire, though the fire itself isn’t clearly visible. The sky is overcast, giving the whole scene a greyish tone, which contrasts with the vibrant red of the person’s hair.
Overall, it seems like they might be camping or attending some sort of outdoor event. The person’s attire and the setting suggest a casual, possibly festive atmosphere, maybe related to music or outdoor activities. The expression on their face and the way they’re posing with the skewer add a playful or energetic element to the picture.
I wonder if they’re showing off their cooking skills or perhaps joking around with their friends. The tattoo on their arm looks interesting; it seems to have some stars and maybe some other designs, but it’s not entirely clear from this angle.
The grass around them is green and a bit wild, indicating a natural setting, perhaps by the coast given the dune-like appearance of the hills. The overcast sky suggests it might be a cool or windy day, but the person doesn’t seem to mind the weather.
Looking at the tent, it’s a small, dome-shaped one, suitable for one or two people. It’s pitched on a patch of grass that’s been slightly flattened, probably to make it more stable.
The smoke rising from the ground indicates that there’s a fire burning, even if the flames aren’t visible in this shot. It’s likely that the person is grilling sausages over the fire, which is a common activity when camping.
The person’s outfit is casual and practical for an outdoor setting: sleeveless shirt for movement, pants for protection against the elements or insects, and sturdy boots for walking on uneven ground.
Their hairstyle and the multiple braids give them a distinctive look, perhaps reflecting their personal style or cultural background.
Overall, the image captures a moment of outdoor enjoyment, possibly during a camping trip or a similar adventure, with the person enjoying a simple pleasure like cooking sausages over a campfire.
Once again, top marks for ChatGPT, whose answer was spot on and who even noticed I was "a person with a punk aesthetic", although "God of punk rock" is the more usually preferred nomenclature. As expected, after the previous round, Google Gemini once again refused to play and Qwen’s answer was accurate but unnecessarily long-winded and annoying.
I’ve subsequently realised that the model of Qwen I was using is labelled as showcasing "visual reasoning". So I think the long-winded rambling answers are deliberate and—in the finest traditions of school maths homework—are meant to show his/her/its "working out". So maybe I was a bit harsh on him/her/it there.
Well, boys, girls and people of indeterminate gender, I think you’ll agree that things on the AI image recognition front have certainly advanced a lot since the sepia-toned, steam-powered days of 2023. ChatGPT in particular really impressed me this time around and Qwen looks like it could be right up there—given a suitable "cut to the chase and stop waffling" prompt. I was never that impressed with Google Bard and Gemini similarly fails to butter my parsnips. Not just for this fun challenge, but in response to other things I’ve tasked it with of late.
Now, one last lark and jape before I wind up this post. Last time, I took the description generated by Bard for the "Sassidge Monster" photo and put it into Stable Diffusion and asked it to generate an image from that prompt. Which it did…

Back then, neither ChatGPT nor Google Bard could do image generation, but nowadays they can. So in a final "Eat Your Own Dogfood" round [at least for Gemini] let’s see what they come up with from Bard’s description of the Sassidge Monster from 2023.
HuggingFace’s hosted versions of Qwen don’t support image generation. So this is a two-horse race.
Bonus Round
PROMPT: The image shows a young woman sitting in a chair, reading a book. She is surrounded by a variety of plants, some of which are in pots and some of which are growing in the ground. The woman appears to be relaxed and enjoying the peace and quiet of her surroundings. The image is likely to evoke feelings of calm and tranquillity in the viewer. It may also inspire a sense of wonder and curiosity about the plants that are growing around the woman.
ChatGPT—January 2025

Google Gemini—January 2025
Generating images of people is only available in early access with Gemini Advanced. Get early access to new Gemini features when you subscribe to Advanced here.
Again, a no-contest. Qwen doesn’t do images and Google want you to take out a paid subscription to create images of people—the tight fuckers! ChatGPT’s attempt is pretty decent. A bit airy-fairy, soft focus and pastelly for my liking. But I suppose it fitted with the prompt. And we finally seem to be moving away from the era of multi-fingered hands and weird thalidomide-esque limbs.
Right. That’s yer lot for today. First [and quite possibly last] post of 2025. Let’s face it, writing this blog was always like pissing into the wind. Even more so these days, now that next to nobody bothers to read anything that requires an attention span of greater than 30 seconds.