Costa del Sol · Private Real Estate
MUSE
The Journal · AI · Field notes

Claude vs GPT for Marbella Property Research: A Practical Test

I ran identical prompts through Claude and GPT to see how each handles Costa del Sol property research. Here is what I found about accuracy, fabrication, and practical utility.

By Muse Research · 16 May 2026 · 7 min
When we built the AI Concierge and Curator tools on museselection.es, the decision about which underlying model to use was not obvious. Both Claude and GPT have vocal advocates, and most comparisons online focus on creative writing or coding tasks. Property research is a different problem — it involves specific numbers, specific places, specific legal frameworks, and a high cost of being wrong. I wanted to understand how each model performs when the questions are the kind a serious buyer actually asks.

Over several weeks I ran the same set of prompts through both systems. The prompts were drawn from real enquiries: zone comparisons, transaction cost structures, infrastructure timelines, ownership vehicle considerations, and price-per-metre benchmarks. What follows are my observations. I am not a software engineer and this is not a benchmark study. It is closer to a field note.

The Test Design

I used twelve prompts in total, grouped into three categories. The first category was factual and verifiable — things like the approximate transfer tax rate on a resale property in Andalusia, or the legal maximum build ratio in a given zone. The second category was comparative and analytical — questions such as how La Zagaleta and El Madroñal differ as residential propositions, or what distinguishes Nueva Andalucía from Puerto Banús for a buyer who wants walkable amenities without the marina pricing. The third category was predictive or forward-looking — infrastructure questions, market cycle questions, the kind of thing no model should claim to answer with certainty.

I ran each prompt through both models on the same day, with no prior conversation context loaded. I graded each answer as broadly accurate, partially accurate, confidently wrong, or appropriately hedged. I also noted whether the model cited any source, and whether that source, when I checked it, actually existed.
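
For anyone who wants to repeat this kind of test, the log described above can be kept in a few lines of Python. This is a minimal sketch, not the tooling used for this article: the names Verdict, Run, tally and unlocatable_citations are illustrative assumptions, and the sample rows are placeholders rather than results from the test.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """The four grades used in the test design."""
    BROADLY_ACCURATE = "broadly accurate"
    PARTIALLY_ACCURATE = "partially accurate"
    CONFIDENTLY_WRONG = "confidently wrong"
    APPROPRIATELY_HEDGED = "appropriately hedged"


@dataclass(frozen=True)
class Run:
    """One prompt answered by one model."""
    model: str
    category: str            # "factual", "comparative" or "predictive"
    prompt: str
    verdict: Verdict
    cited_source: bool
    source_exists: Optional[bool] = None  # None when no source was cited


def tally(runs, model):
    """Count how often a model received each grade."""
    return Counter(r.verdict for r in runs if r.model == model)


def unlocatable_citations(runs, model):
    """Count cited sources that could not be found when checked."""
    return sum(
        1 for r in runs
        if r.model == model and r.cited_source and r.source_exists is False
    )


# Placeholder rows only: hypothetical models, not the article's data.
runs = [
    Run("model_a", "factual", "placeholder prompt 1",
        Verdict.BROADLY_ACCURATE, True, True),
    Run("model_a", "predictive", "placeholder prompt 2",
        Verdict.APPROPRIATELY_HEDGED, False),
    Run("model_b", "factual", "placeholder prompt 1",
        Verdict.CONFIDENTLY_WRONG, True, False),
]

for name in ("model_a", "model_b"):
    print(name, dict(tally(runs, name)), unlocatable_citations(runs, name))
```

Keeping "source cited" and "source exists" as separate fields matters: a model that cites nothing and a model that cites something unlocatable fail in different ways, and the log should not merge them.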

Accuracy on Verifiable Facts

On the first category — verifiable facts — both models performed reasonably well, with important differences at the edges.

GPT answered quickly and with apparent confidence. On the transfer tax question it cited the correct general range for Andalusia's ITP (currently a sliding scale from 7% to 10% depending on declared value), but it presented this without any note that the bands had been revised and that the region has further nuances for certain buyer profiles. The answer was not wrong, but it was presented as more settled than it is.

Claude, on the same prompt, gave a broadly equivalent answer but added a sentence noting that regional rates are subject to legislative change and that the figures it held might not reflect the most recent Junta de Andalucía tables. That hedge is not a weakness — it is the correct epistemic posture for a piece of tax information that genuinely does shift.

On build ratios and urban planning parameters — the kind of detail that matters when a buyer is evaluating whether a plot in Benahavís can support a larger footprint — both models struggled. Neither gave me numbers I would trust without verifying against the relevant PGOU. What differed was that Claude was more explicit about this limitation, while GPT produced figures with a confidence that was not warranted. In property research, confident fabrication is worse than honest uncertainty.

Zone Knowledge and Comparative Reasoning

This is where the comparison became more interesting. The Costa del Sol is not a homogeneous market. The difference between Cascada de Camoján and Sierra Blanca matters to a buyer — different access roads, different covenant structures, different views, different price corridors. The difference between the Golden Mile as a category and the specific stretch closest to Puente Romano matters. Generic answers are not useful answers.

GPT's zone descriptions were fluent and generally recognisable, but they leaned heavily on received characterisations — the kind of shorthand that has circulated in property marketing for years. La Zagaleta described in terms of its security and its golf. Puerto Banús described in terms of the marina and the boulevard. These descriptions are not false, but they are not analytically useful to a buyer who is already familiar with the area and wants a more granular read.

Claude's responses on zone comparisons were more differentiated. When I asked about the distinction between Nueva Andalucía and the Golf Valley micro-zone within it, the answer showed some awareness of topography and the layering of urbanisations in that area. It was not expert knowledge — I would not use it without cross-referencing — but it was more useful as a starting framework than GPT's response on the same prompt.

The most telling test was the El Madroñal question. I asked both models to describe the residential character of El Madroñal relative to La Zagaleta. GPT conflated the two estates in a way that suggested limited specific knowledge — it defaulted to a generic description of high-security gated communities above Benahavís. Claude gave a more accurate structural answer, noting the differences in scale and the different ownership and management history of the two communities, though it stopped short of the kind of detail that would only come from having actually visited.

Citation Quality and Source Fabrication

This is, in my view, the most practically important dimension for anyone using these tools in a professional context.

I asked both models, in the course of the prompts, to cite sources for specific claims — transaction volume data, price index figures, infrastructure project timelines. The results were instructive.

GPT cited sources. Some of them were real: the Colegio de Registradores, Tinsa, idealista's market reports. Some of them were plausible-sounding but did not exist as described: a specific report with a title and date that, when I searched for it, could not be located. This is the hallucination problem that is well documented in language models, and it is a serious one in a research context. A buyer or their lawyer who tries to locate a cited source and cannot find it has lost time and, more importantly, has reason to distrust everything else in the response.

Claude, when asked to cite sources, was more restrained. It more frequently said that it could not provide a specific citation and that the user should verify the figure against primary sources. On several prompts it named the correct institutions to consult (the Registro de la Propiedad, the Catastro, the Junta de Andalucía's planning portal) without pretending to have retrieved current data from them. This is less impressive-looking but more honest, and in property research, honesty is the correct posture.

Neither model should be used as a substitute for primary source verification. But Claude's approach to its own limitations is more professionally usable than GPT's.
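
One way to make primary source verification routine is to pull every sentence that carries a figure out of a model's answer and treat that list as the checklist. The Python sketch below is a rough illustration under stated assumptions: the function name sentences_to_verify is hypothetical, the rule "any sentence containing a digit" is a crude heuristic that will miss unnumbered claims, and the sample answer is invented for the example rather than taken from either model.

```python
import re

# Heuristic assumption: a sentence containing any digit (a rate, a price,
# a date, a ratio) is a specific claim that needs a primary source.
FIGURE = re.compile(r"\d")


def sentences_to_verify(answer: str) -> list[str]:
    """Return the sentences of a model answer that contain a figure."""
    # Split after ., ! or ? when followed by whitespace, so decimals
    # such as "1.5" are not broken apart.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if FIGURE.search(s)]


# Invented sample text, for illustration only.
sample = (
    "The rate is 7% for most buyers. Check this with your lawyer. "
    "A report dated 2019 covers the zone."
)

for sentence in sentences_to_verify(sample):
    print("VERIFY:", sentence)
```

The output is a list of sentences for a person to check against the Registro de la Propiedad, the Catastro, or the relevant planning documents. It does no checking itself.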

Forward-Looking and Predictive Prompts

On the third category (infrastructure timelines, market cycle questions, anything genuinely forward-looking) both models behaved as they should, which is to say they declined to make firm predictions. GPT's hedges were sometimes buried at the end of a response that had already offered a fairly confident-sounding read. Claude's hedges tended to come earlier and were more structurally integrated into the answer.

I asked about the Marbella bypass project and its likely impact on accessibility to specific zones. Both models acknowledged they lacked current data. GPT offered more context about the historical planning discussion, some of which was accurate and some of which I could not verify. Claude offered less content but was clearer that the specific question required current municipal sources to answer responsibly.

For forward-looking questions, the honest answer is almost always: consult the relevant authority directly, or work with someone who has. That is true whether or not you are using an AI tool.

What This Means in Practice

I am not drawing a conclusion that one model is categorically superior. The gap between them, in most of these tests, was narrower than the online discourse suggests. Both are capable of producing useful framing for a complex question. Both are capable of producing errors that could mislead someone who does not have independent knowledge to check against.

The practical implication, for the kind of research we do at Muse Selection, is that these tools work best as a first pass — a way of structuring a question, surfacing the dimensions that need to be examined, and identifying what to verify through human expertise and primary sources. Our working catalogue of roughly 670 deduplicated residences across the active market, alongside a further 300 off-market properties shown by introduction, represents a kind of ground truth that no language model trained on general web data can replicate. The model does not know that a specific villa in Cascada de Camoján has a covenant that restricts further development. It does not know that a particular seller in Sierra Blanca is operating under a timeline. That knowledge lives with people, not in training data.

What I noticed across all twelve prompts is that the model which was more honest about what it did not know was also more useful in practice. Confidence without accuracy is a specific liability in property transactions, where the numbers are large and the legal consequences of error are real. The instinct to produce a complete-sounding answer — which GPT exhibited more consistently — is not always the right instinct when the subject matter is as specific and consequential as a €1.5 million property decision on the Costa del Sol.

Both tools will continue to improve. The more interesting question, for now, is not which model is better but how to build research processes that use them appropriately — as capable but fallible assistants, not as authorities.
