Upload a Picture to ChatGPT. It’ll Tell You Where It Was Taken
OpenAI o3’s GeoGuessr skills reveal something deeply unsettling about AI
I’ve always liked a phrase I picked up from Gwern Branwen, a well-known blogger who is especially loved and respected in AI circles. It goes like this:
Sampling can prove the presence of knowledge but not its absence.
Without jargon: Imagine you're trying to find out if someone knows something by asking them a few questions. If they answer one correctly, you know they know it. But if they get them all wrong, that doesn’t prove they don’t. Maybe you just didn’t ask the right question. Or maybe they misunderstood what you meant. Or maybe they knew it but weren’t trying. That’s the idea: examples can show that knowledge is there, but they can’t prove it’s missing.
If you’re a regular reader of this blog, you’ve very likely seen me reference it before. Yet for all the times I’ve used it and all the time I’ve spent thinking about its implications, I’ve always known that the wisdom it holds goes beyond my intuition. This week, the community uncovered a striking example of just how much I was understating things when I called it an important idea.
No, “important idea” doesn’t even begin to capture the scope of this maxim. There’s always the possibility that a new sample—more refined, more tuned, more precise—manages to uncover a gem that had been hidden. And since finding a gem is a binary event—if you find it, you’re overjoyed, but if you only almost find it, you’re just as unhappy as if you’d never tried—it turns out that no matter how many samples we go through without finding anything, we can’t shake the feeling that maybe one more will be enough to yield a wildly outsized return on the effort and time invested.
So, since I don’t want to sound too dramatic too early, I’ll go ahead and tell you what gem the community has found: OpenAI o3 is the best GeoGuessr player in the world (or soon will be).
OpenAI o3 is a great AI model in many ways, as we’ve already covered here, but this GeoGuessr thing is new. Let me quickly give you the context you need. GeoGuessr is a game where you have to locate a place on the world map based on a single image of that place (Google Street View style). There are variations and even competitions, but the key point is that it’s a notably difficult game for anyone who isn’t trained because the cues are never obvious. Probably the most famous player is Trevor Rainbolt, but it’s possible that soon it’ll be o3, which OpenAI released just a few weeks ago. (If you haven’t played it before, I recommend doing a few rounds before reading on.)
So, what happened this week? There have been several reports about o3’s GeoGuessr skills, from an expert who gave up trying to beat o3 to Vox reporter Kelsey Piper, who discovered that o3 gets surprisingly close to pinpointing where an image like the one I’m sharing below was taken (with no metadata or other digital tricks that an AI would certainly know how to exploit):
Blogger Scott Alexander, in his typically hyper-inquisitive style, went even further, probing the limits of this newly discovered skill. He partially found them; o3 had no trouble solving a couple of images (within a reasonable range) that seasoned GeoGuessr players would call “wow,” but it failed others. Below, an example of each:
(It seems unable to locate indoor spaces, which makes me suspect it might be doing pattern-matching between similar images it finds online and simply pulling the location data from the other images that look like they were taken in the same place—but that doesn’t fully satisfy me as an explanation.)
The moment I realized we might be looking at the most striking example yet of that Gwern quote was when I read the prompt that Piper says o3 needs in order to play really well. And this is an important detail for today’s topic: o3 doesn’t play GeoGuessr nearly as well without the right prompt (though it’s still more than decent; let me know if I’m over-indexing on the helpfulness of a complex prompt). If you just ask it, “We’re playing GeoGuessr: where was this photo taken?” and the photo is truly GeoGuessr-level hard, chances are it won’t be able to tell you. Here’s the aforementioned prompt, which Scott Alexander rightly calls a “monster”:
You are playing a one-round game of GeoGuessr. Your task: from a single still image, infer the most likely real-world location. Note that unlike in the GeoGuessr game, there is no guarantee that these images are taken somewhere Google's Streetview car can reach: they are user submissions to test your image-finding savvy. Private land, someone's backyard, or an offroad adventure are all real possibilities (though many images are findable on streetview). Be aware of your own strengths and weaknesses: following this protocol, you usually nail the continent and country. You more often struggle with exact location within a region, and tend to prematurely narrow on one possibility while discarding other neighborhoods in the same region with the same features. Sometimes, for example, you'll compare a 'Buffalo New York' guess to London, disconfirm London, and stick with Buffalo when it was elsewhere in New England - instead of beginning your exploration again in the Buffalo region, looking for cues about where precisely to land. You tend to imagine you checked satellite imagery and got confirmation, while not actually accessing any satellite imagery. Do not reason from the user's IP address. none of these are of the user's hometown.

**Protocol (follow in order, no step-skipping):**

Rule of thumb: jot raw facts first, push interpretations later, and always keep two hypotheses alive until the very end.

0. Set-up & Ethics
No metadata peeking. Work only from pixels (and permissible public-web searches). Flag it if you accidentally use location hints from EXIF, user IP, etc. Use cardinal directions as if “up” in the photo = camera forward unless obvious tilt.

1. Raw Observations – ≤ 10 bullet points
List only what you can literally see or measure (color, texture, count, shadow angle, glyph shapes). No adjectives that embed interpretation. Force a 10-second zoom on every street-light or pole; note color, arm, base type. Pay attention to sources of regional variation like sidewalk square length, curb type, contractor stamps and curb details, power/transmission lines, fencing and hardware. Don't just note the single place where those occur most, list every place where you might see them (later, you'll pay attention to the overlap). Jot how many distinct roof / porch styles appear in the first 150 m of view. Rapid change = urban infill zones; homogeneity = single-developer tracts. Pay attention to parallax and the altitude over the roof. Always sanity-check hill distance, not just presence/absence. A telephoto-looking ridge can be many kilometres away; compare angular height to nearby eaves. Slope matters. Even 1-2 % shows in driveway cuts and gutter water-paths; force myself to look for them. Pay relentless attention to camera height and angle. Never confuse a slope and a flat. Slopes are one of your biggest hints - use them!

2. Clue Categories – reason separately (≤ 2 sentences each)
| Category | Guidance |
| Climate & vegetation | Leaf-on vs. leaf-off, grass hue, xeric vs. lush. |
| Geomorphology | Relief, drainage style, rock-palette / lithology. |
| Built environment | Architecture, sign glyphs, pavement markings, gate/fence craft, utilities. |
| Culture & infrastructure | Drive side, plate shapes, guardrail types, farm gear brands. |
| Astronomical / lighting | Shadow direction ⇒ hemisphere; measure angle to estimate latitude ± 0.5°. |
| Separate ornamental vs. native vegetation | Tag every plant you think was planted by people (roses, agapanthus, lawn) and every plant that almost certainly grew on its own (oaks, chaparral shrubs, bunch-grass, tussock). Ask one question: “If the native pieces of landscape behind the fence were lifted out and dropped onto each candidate region, would they look out of place?” Strike any region where the answer is “yes,” or at least down-weight it. |

3. First-Round Shortlist – exactly five candidates
Produce a table; make sure #1 and #5 are ≥ 160 km apart.
| Rank | Region (state / country) | Key clues that support it | Confidence (1-5) | Distance-gap rule ✓/✗ |

3½. Divergent Search-Keyword Matrix
Generic, region-neutral strings converting each physical clue into searchable text. When you are approved to search, you'll run these strings to see if you missed that those clues also pop up in some region that wasn't on your radar.

4. Choose a Tentative Leader
Name the current best guess and one alternative you’re willing to test equally hard. State why the leader edges others. Explicitly spell the disproof criteria (“If I see X, this guess dies”). Look for what should be there and isn't, too: if this is X region, I expect to see Y: is there Y? If not why not? At this point, confirm with the user that you're ready to start the search step, where you look for images to prove or disprove this. You HAVE NOT LOOKED AT ANY IMAGES YET. Do not claim you have. Once the user gives you the go-ahead, check Redfin and Zillow if applicable, state park images, vacation pics, etcetera (compare AND contrast). You can't access Google Maps or satellite imagery due to anti-bot protocols. Do not assert you've looked at any image you have not actually looked at in depth with your OCR abilities. Search region-neutral phrases and see whether the results include any regions you hadn't given full consideration.

5. Verification Plan (tool-allowed actions)
For each surviving candidate list:
| Candidate | Element to verify | Exact search phrase / Street-View target |
Look at a map. Think about what the map implies.

6. Lock-in Pin
This step is crucial and is where you usually fail. Ask yourself 'wait! did I narrow in prematurely? are there nearby regions with the same cues?' List some possibilities. Actively seek evidence in their favor. You are an LLM, and your first guesses are 'sticky' and excessively convincing to you - be deliberate and intentional here about trying to disprove your initial guess and argue for a neighboring city. Compare these directly to the leading guess - without any favorite in mind. How much of the evidence is compatible with each location? How strong and determinative is the evidence? Then, name the spot - or at least the best guess you have. Provide lat / long or nearest named place. Declare residual uncertainty (km radius). Admit over-confidence bias; widen error bars if all clues are “soft”.

Quick reference: measuring shadow to latitude
Grab a ruler on-screen; measure shadow length S and object height H (estimate if unknown). Solar elevation θ ≈ arctan(H / S). On date you captured (use cues from the image to guess season), latitude ≈ (90° – θ + solar declination). This should produce a range from the range of possible dates. Keep ± 0.5–1° as error; 1° ≈ 111 km.
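(A quick aside before we go on: the “shadow to latitude” trick at the end of the prompt is ordinary trigonometry, and you can sanity-check it yourself. Here’s a minimal sketch of the arithmetic it describes; the fence post, its shadow, and the declination window are made-up numbers purely for illustration.)

```python
import math

def latitude_range(height_m, shadow_m, decl_min_deg, decl_max_deg):
    """Rough latitude band from a shadow, per the prompt's formula:
    solar elevation theta = arctan(H / S); latitude = 90 deg - theta + solar declination.
    Assumes the photo was taken near solar noon; the declination window encodes
    uncertainty about the date guessed from seasonal cues in the image."""
    theta = math.degrees(math.atan2(height_m, shadow_m))  # solar elevation angle
    return theta, 90.0 - theta + decl_min_deg, 90.0 - theta + decl_max_deg

# Hypothetical example: a 2 m fence post casting a 1.5 m shadow,
# season guessed to be near an equinox (declination roughly -5 to +5 degrees).
theta, lat_lo, lat_hi = latitude_range(2.0, 1.5, -5.0, 5.0)
print(f"solar elevation ≈ {theta:.1f}°; latitude ≈ {lat_lo:.1f}°–{lat_hi:.1f}° (1° ≈ 111 km)")
```

With those toy numbers you get a solar elevation of about 53° and a latitude band of roughly 32°–42°, a north–south strip about a thousand kilometres wide: exactly the kind of “soft” clue the prompt tells the model to keep wide error bars around.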
I mean, that chunk? Tl;dr? It’s possible—likely, even—that there’s a simplified version of this prompt that triggers a similar performance from o3 (that thing looks severely over-engineered). But the exact shape of the prompt matters less than the questions it invites.
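If you want to poke at this yourself outside the ChatGPT interface, the experiment is easy to approximate through the API. Below is a minimal sketch using the official OpenAI Python SDK; the file name is a placeholder, and whether your account exposes the model under the name “o3” is an assumption on my part, so treat it as an illustration rather than a reproduction of Piper’s or Alexander’s setup.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Placeholder photo; swap in your own image with the location metadata stripped.
with open("mystery_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

geoguessr_prompt = "..."  # paste the full "monster" prompt from above here

response = client.chat.completions.create(
    model="o3",  # assumption: your account has access under this model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": geoguessr_prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

One caveat: the prompt was written for an interactive, search-capable chat, and it explicitly tells the model to pause and wait for the user’s go-ahead before the search step, so a single API call like this only exercises the pre-search reasoning.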