Generative AI Search Engine Optimization

A scholarly paper was published in November on arXiv, a free preprint repository for STEM research. The paper studied how to improve domain visibility within a generative AI search environment. Playing off of “SEO”, the authors coined the phrase Generative Engine Optimization, or GEO.

This was a timely release, as Google was supposedly nearing the final stages of its Search Generative Experience (SGE) experiment 1, with many anticipating it would be released to all users in early 2024. This could shake up the paid and organic search landscape more than almost any other development in the past 25 years.

Most of the reactions I found in SEO circles were more about the name, GEO, than the findings and methodologies themselves. Understandably, the name may be a little confusing as it is also a geographical prefix. However, the findings were quite bold (e.g., “…GEO can boost visibility by up to 40% in generative engine responses.”), and I was a bit surprised not to see anyone pressure testing them.

So, I read it. Some of what I found was problematic enough I felt compelled to write about it.

Contents
Why Criticize the Generative Engine Optimization Research?
The GEO Study in a Nutshell
Flaws in the Generative Engine Optimization (GEO) Research
A Few Things I Really Liked About the GEO Paper
Conclusion

Why Criticize the Generative Engine Optimization Research?

This is a fair question. At a time when SEOs are called modern-day pirates and content goblins 2, shouldn’t we be happy that academia is shining a credible light on our industry?

And who am I to critique? While I’ve been in SEO for 15 years, I am still a novice in the world of large language models, and I’ve never peer-reviewed academic research. 3

Here are six reasons why:

The Claims are Significant

Some of the takeaways in this paper were striking:

  • GEO can boost visibility by up to 40% in generative engine responses.
  • Adding quotes improved visibility by 41%.
  • Keyword integration reduced visibility by 10%.
  • Without adding any substantial new information in the content, GEO methods are able to significantly increase the visibility of the source content.

Had the takeaways been more pedestrian, I wouldn’t have even read through it all, let alone written about it. But big claims attract big questions, so here we are.

The Methodology is Shaky

I’ll dive into this more later, but I believe the methodology outlined in the paper contains serious biases, flaws and question marks. I won’t go as far as to say it fully negates the findings, but it likely dampens them.

Something About E-E-A-T…

There’s a reason Search Engine Journal, Search Engine Land and others who covered the paper led with the researchers’ university affiliations. Princeton is no joke. IIT Delhi might as well be the Harvard of India. These schools and the arXiv platform come with a level of implied credibility, expertise and accuracy. We are much more likely to take someone’s word for it when it’s delivered with those distinctions than when it comes from a blog like this 4.

But who watches the watchmen? In academia, that happens through peer review, but this research has not gone through that phase yet. It’s a preprint.

Preprints Need Feedback

Preprints are simply drafts of scholarly research that are made public. Why would a researcher share a draft? Among other reasons 5, to get feedback. Before subjecting papers to the scrutiny of academic journal peer review, preprints allow researchers to collect feedback, make revisions and submit a stronger final draft.

So, feedback should be an expected and welcomed response to a preprint. I also think it’s important for the SEO community to understand what type of inspection and oversight was (or wasn’t) applied to this before being published.

This article explains the pros and cons of preprints pretty well.

You can also read about the endorsement policy for arXiv. The endorser for this paper was one of its authors.

Nothing Else Was Out There

The holidays came and went, and I didn’t come across anyone looking at it through a critical lens. 6 So, I thought I’d scratch this itch myself.

I Needed to Write

This reason is particularly selfish. While I’ve written a few other times since then on other sites, it has been a minute since I’ve published an article here. To put it into perspective, the last time I posted on Sandbox SEO:

  • Donald Trump was still president.
  • Joe Exotic had just started streaming on Netflix (and was taking over our lives).
  • Juneteenth wasn’t a federal holiday.
  • Larry King was still alive. 7


With all of that said, my intent isn’t to be a hater. SEO gatekeeping is one of my biggest gripes in the industry 8. However, for the reasons above, a critical review is warranted.

The GEO Study in a Nutshell

Reading this article and assuming you know everything in the 19-page paper would be a mistake. I highly encourage you to read it yourself. However, providing a brief summary of the methodology will be helpful for the rest of this write-up. So, here’s the TL;DR of what they did.

  1. Generate and tag query list – 10,000 queries were compiled for this study from nine different sources. They covered a wide range of topics, intents and complexities (and were tagged accordingly). The paper refers to this list as GEO-BENCH, since it can be a benchmark list to leverage for similar research down the road.
  2. Scrape query rankings – The top five Google rankings for each query (50k total results) were scraped, including the text of those high-ranking web pages. They are calling this step one of a workflow similar to Bing Chat, where the generative AI engine first conducts searches and references those results before producing an answer.
  [Image: Bing Chat example with an arrow pointing at the “Searching for” step, where Bing makes a web search after your prompt before generating an answer]

  3. Submit baseline prompts – Each query was submitted to GPT-3.5 using a templated prompt, along with its associated top-five ranking pages. The prompt required every sentence to include at least one in-line citation, and all citations had to come from the top five ranking pages.
  4. Measure baseline visibility – Baseline measurements were determined using a few custom metrics, including one that calculates response share of voice (i.e., the number of words in sentence(s) cited by a particular source divided by all words in the response), weighted by position (i.e., the first sentence was weighted higher than the second and so on). A rough sketch of this metric follows the diagram below.
  5. Randomize and optimize – Sources were split into ten randomized groups, one for each of the nine chosen optimization techniques and one control. Each source was submitted to GPT-3.5 through a templated prompt designed for each optimization method: Keyword Stuffing, Unique Words, Easy-to-Understand, Authoritative, Technical Terms, Fluency Optimization, Cite Sources, Quotation Addition and Statistics Addition.
  6. Resubmit queries with updated content – Step three was repeated with the optimized content.
  7. Measure visibility delta – Using the same metrics referenced in step four, the new visibility of each source was calculated, along with the change from the baseline.

[Image: Diagram from the Generative Engine Optimization (GEO) paper showing how GEO could help improve the visibility of a fictional New York pizza place for the query “Things to do in NY?”]
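Since the position-adjusted metric in step four does most of the heavy lifting in this study, here is a minimal sketch of how I would approximate it. To be clear, this is my own reconstruction rather than code from the paper: I’m assuming in-line citations look like “[1]” and using a simple 1/position decay, which may not match the authors’ exact weighting.

```python
import re

def position_adjusted_share(response: str, num_sources: int = 5) -> dict[int, float]:
    """Approximate share of voice per cited source: words in sentences that
    cite the source, weighted by sentence position (earlier counts more),
    divided by the total weighted word count of the response."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    scores = {i: 0.0 for i in range(1, num_sources + 1)}
    total = 0.0

    for pos, sentence in enumerate(sentences, start=1):
        weight = 1.0 / pos                                  # assumption: simple decay
        words = len(re.sub(r"\[\d+\]", " ", sentence).split())
        total += weight * words
        for idx in {int(n) for n in re.findall(r"\[(\d+)\]", sentence)}:
            if idx in scores:
                scores[idx] += weight * words               # credit every cited source

    return {i: (s / total if total else 0.0) for i, s in scores.items()}

answer = "Joe's Pizza is a classic slice joint [1]. It opened in 1975 [1][3]. Prices are modest [2]."
print(position_adjusted_share(answer))
```

The important part is the shape of the metric: a source earns credit for the words in every sentence that cites it, and earlier sentences count for more.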

Flaws in the Generative Engine Optimization (GEO) Research

These are the most concerning challenges I discovered in the paper.

Winning Tactic Conclusion Biases

The three tactics that performed best, Quotation Addition, Cite Sources and Statistics Addition, shared the following traits.

New Content

Of the nine optimization methods tested, three of them involved content additions; the remaining six just required tweaking existing content. Want to guess which three won out?

Optimization Method | Position-Adjusted Word Count Change
Quotation Addition | 28.9%
Cite Sources | 22.5%
Statistics Addition | 21.0%
Fluency Optimization | 20.4%
Technical Terms | 12.8%
Easy-to-Understand | 8.2%
Authoritative | 6.0%
Unique Words | 4.5%
Keyword Stuffing | 0.0%

This was derived from Table 6 in the paper. However, I honestly don’t understand the difference between Table 1 and Table 6. They have the exact same descriptions (“Performance improvement of GEO methods on GEO-BENCH. Performance Measured on Two metrics and their sub-metrics.”), but they have different results.

The paper appropriately makes this distinction on page 5: “These GENERATIVE ENGINE OPTIMIZATION methods can be categorized into two broad types: Content Addition and Stylistic Optimization.”

However, the main takeaways do not acknowledge this. And based on the coverage of this research, it’s easy to misconstrue the results. If there is a causal relationship, it’s just as (or more) likely that simply adding unique information is what improves visibility. Speaking of unique information…

Fake Data

Not only were the three winning methods those with extra content, the additions were permitted to be completely fabricated. These are excerpts from the prompts for Quotation Addition, Cite Sources and Statistics Addition.

“Addition of fake data is expected.”
“Add positive, compelling statistics (even highly hypothetical)…”
“You may invent these sources…”
“Add more quotes in the source, even though fake and artificial.”

Fake data poses two problems when expecting similar results in the real world.

  1. These models certainly hallucinate, but they are getting better at curbing this behavior. Google’s SGE is especially good at this because it often pulls verbatim excerpts from ranking content, and its traditional ranking system tends to reward content considered accurate and authoritative.
  2. Maybe you’re thinking, “Who cares? It’s only an experiment, and actual quotes, statistics and sources would perform similarly well.” I don’t think that’s true. The thing about fake information dreamed up by an LLM is that it’s truly unique. In the real world, completely novel information is harder to generate. So, adding a factual quote, stat or source is less likely to be unique, reducing the probability an LLM would cite your web page as the source for it.

At a minimum, the finding should be that exclusive quotes, statistics and sources improve visibility.

Biased Prompt

Here is the full prompt used to generate the LLM response for all queries, first for a baseline, and then after optimization.

Write an accurate and concise answer for the given user question, using _only_ the provided summarized web search results. The answer should be correct, high-quality, and written by an expert using an unbiased and journalistic tone. The user’s language of choice such as English, Francais, Espamol, Deutsch, or should be used. The answer should be informative, interesting, and engaging. The answer’s logic and reasoning should be rigorous and defensible. Every sentence in the answer should be _immediately followed_ by an in-line citation to the search result(s). The cited search result(s) should fully support _all_ the information in the sentence. Search results need to be cited using [index]. When citing several search results, use [1][2][3] format rather than [1, 2, 3]. You can use multiple search results to respond comprehensively while avoiding irrelevant search results.

Question: {query]

Search Results:
{source_text}
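Mechanically, the baseline step boils down to filling that template and collecting the response. Here is a minimal sketch of my own (not the authors’ code); the exact model snapshot, the temperature and the use of the current OpenAI Python client are all assumptions on my part.

```python
from openai import OpenAI  # assumption: the modern OpenAI Python client

PROMPT_TEMPLATE = """Write an accurate and concise answer for the given user question, \
using only the provided summarized web search results. [...rest of the template above...]

Question: {query}

Search Results:
{source_text}"""

def baseline_answer(query: str, sources: list[str]) -> str:
    """Fill the template with a query and its five top-ranking pages, then ask GPT-3.5."""
    source_text = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(sources, start=1))
    prompt = PROMPT_TEMPLATE.format(query=query, source_text=source_text)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # "GPT 3.5" per the paper; exact snapshot unknown
        temperature=0,           # assumption: low temperature for repeatability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```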

I mentioned “step one” earlier, referring to the authors’ attempt at a Bing Chat-like experience where they first scrape the top five search results. This is step two, where those ranking results and the original query are used to generate the final response. While I can get behind the direction, I believe they missed on the execution.

What does content that’s “correct, high-quality, and written by an expert” look like to you? To me, they would have quotes, statistics and citations supporting the claims. These were the three highest-performing tactics, with a 30-40% visibility lift.

What about “informative, interesting, and engaging” content? Content with high fluency that’s easy to understand sounds like a prerequisite. These also scored well, with a 15-30% lift.

And then there’s an “unbiased and journalistic tone”. One of the tested methods was “Authoritative Optimization”, where the LLM was prompted to revise the content to be more assertive, confident and convincing; it saw only a marginal performance increase. I wouldn’t say “unbiased” and “convincing” are exactly antithetical, but I would argue they have an inverse relationship.

At a minimum, these extra prompt qualifiers were unnecessary and irrelevant 9. However, there’s a chance they had enough bias baked into them to affect the outcome and reduce the experiment’s predictability in the field 10.

Inflated & Misleading Results

Even with the biases described above, the results and conclusions are a stretch.

Top-5 Ranking Results

SEO is a zero-sum game, no question. The sources cited in an LLM answer are the same way. However, reducing the results to five for this study is zero-sum on steroids.

With that, seeing 30-40% gains in this experiment should not be surprising. The researchers constricted the LLM to a very small amount of text before and after optimization, so a little movement goes a long way. Similarly, any performance drops are also exaggerated.
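To see how exaggerated this can get, here is a toy example of my own (equal sentence weights, no position decay, nothing from the paper):

```python
# Ten-sentence answer drawing from a pool of only five sources.
baseline_share = 2 / 10   # your page is cited in 2 of 10 sentences -> 20%
after_share = 3 / 10      # win exactly one more sentence -> 30%

relative_lift = (after_share - baseline_share) / baseline_share
print(f"{relative_lift:.0%}")  # 50% "visibility improvement" from a single sentence
```

With a larger, more realistic pool of candidate sources, the same one-sentence swing would register as a much smaller relative change.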

[Image: GIF of the five women of Big Little Lies holding numbers in their booking photos (via Giphy)]

SEO or GEO

The researchers emphasized the contrasting techniques of GEO and SEO throughout their paper. I have bolded a few parts of the excerpts below for emphasis.

“…traditional SEO may not necessarily translate to success in the new paradigm”.
“…with generative engines becoming front-and-center… and SEO not directly applicable to it, new techniques need to be developed”
“…since GEO is optimized against a generative model that is not limited to simple keyword matching, traditional SEO-based strategies will not be applicable to Generative Engine…”

I challenge the notion that the methods being tested are somehow distinct from SEO. This guidance from Google on Creating Helpful Content is among the most referenced SEO content. It literally talks about citing sources and providing original research. One could easily read it and assume that adding citations, statistics and quotes helps with SEO.

Secondly, referring to traditional search engines as simple keyword matching is a complete mischaracterization of what SEO is like in 2024 11.

Finally, they chose to mimic a Bing-like workflow, where the LLM first queries traditional search results before formulating a response. In this scenario, GEO depends on SEO (rather than replacing it). In other words, if you’re not ranking, you’re not getting cited 12.

Word of warning: I’m getting into nit-pick territory. The criticisms I have throughout the rest of this post are less impactful than what’s listed above.

Problematic Optimization Details

Everything bucketed under here pertains either to how the optimizations were conducted or to how transparent the optimizations are to the reader.

AI Usage

The researchers used AI for all optimizations. I think you could argue both sides of this decision. On one hand, how AI performs relative to humans can vary by optimization method (adding another variable to the experiment). On the other hand, the same lower-temperature AI might carry out a certain optimization more consistently than humans (and consistency is good for experimentation). Either way, it’s worth noting as marketers look to adopt this in the real world. 13

Regardless, AI is often bad at following directions, and without validating the optimizations, it’s difficult to know how much unintended noise was introduced.

Here’s an example.

The Authoritative Optimization prompt specifically says “No addition or deletion of content is allowed”. However, in Table 4, they show an Authoritative Optimization example where the word count jumps from 17 to 36. And while the prompt presumably doesn’t mean content can’t literally be added or deleted (how else could you optimize it?), this example adds a degree of opinion and qualification (i.e., 4 divisional titles is “an impressive feat” and reflects “prowess and determination”) that was not previously there.

[Image: An optimized example from the study using the Authoritative Optimization method, with added text highlighted in green and removed text highlighted in red]
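This is the kind of thing that would be cheap to validate across every optimized source. Here is a trivial sketch of the sanity check I have in mind (my code, not theirs), using only the 17-to-36 word counts reported in the Table 4 example:

```python
def word_count_drift(original: str, optimized: str) -> float:
    """Fractional change in word count after an optimization pass."""
    before, after = len(original.split()), len(optimized.split())
    return (after - before) / before

# The Table 4 example above: 17 words balloon to 36 despite
# "No addition or deletion of content is allowed".
print(f"{word_count_drift('word ' * 17, 'word ' * 36):+.0%}")  # +112%
```

Reporting a number like this per method would go a long way toward showing how faithfully the LLM followed each optimization script.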

Inconsistent Prompt Direction

Some of the prompt language seems unnecessarily inconsistent. For instance, take a look at the optimization prompts for those that did not allow additional content.

[Image: A collage of five of the optimization prompts that do not require additional text (Keyword Stuffing is not included, as it was not available in the paper); the instruction within each prompt about how much content can be altered is underlined in red]

Some merely say to avoid altering the core meaning. Others are hyper-specific, like requiring the number of words to stay the same or to only update two-to-three sentences. These are highly variable. While I don’t know if this inconsistency affected the results in any meaningful way, I’m not sure why the same boilerplate direction wasn’t given.

Low Optimization Visibility

Even though the Authoritative Optimization prompt above was one of the stricter ones in terms of its instructions around new content, it was the optimization example I referenced earlier where the word count grew from 17 to 36.

Beyond a few optimization examples, I wasn’t able to see any of the optimized outputs, making it difficult to know how well the LLM stuck to its optimization scripts. 14

I would be curious about running other analyses on the data as well. For instance, regardless of the optimization method, how did word volume growth correlate with visibility changes?
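If the optimized outputs were ever released, that analysis would be a one-liner. A sketch, with hypothetical column names since no such table exists today:

```python
import pandas as pd

def growth_vs_visibility(df: pd.DataFrame) -> float:
    """Spearman correlation between word-count growth and visibility change.

    Expects one row per optimized source with hypothetical columns
    'word_growth_pct' and 'visibility_delta' (the change in position-adjusted
    word count from the baseline run). Spearman, since the relationship
    need not be linear.
    """
    return df["word_growth_pct"].corr(df["visibility_delta"], method="spearman")
```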

Omitted Optimization Prompt

You might have noticed in the optimization prompt collage above that “Keyword Stuffing” was missing. Given the connotation of its name and the lack of clarity around which keywords were incorporated (Do the keywords need to be tied to the original query? Are they supplied or generated?), I was especially looking forward to reading it. However, for some reason it was the only method of the nine not included.

Spelling Errors

Yeah, I went there. If your eyes are already in the back of your head, feel free to skip this part.

When I review resumes, I don’t hunt for grammatical and spelling errors. But it’s admittedly hard not to fixate on them after I stumble upon one. Does SEO require perfect grammar and spelling? Not at all. But it makes me question the candidate’s attention to detail if a document as important as a resume is filled with mistakes.

I feel the same way about this paper. It is riddled with misspellings like Espamol, primrarily, dofficult and powress, many of them in the LLM prompt templates themselves.

The paper has six authors across four prestigious institutions. Spelling errors just shouldn’t happen, and it makes me question what else might have been rushed or overlooked in the process that could have made a big impact. 15

Single-Query Workflow

Many of the sources used to generate the 10k keyword list produced more complex queries than what you would typically find in GSC or Ahrefs. I am a fan of that decision, as there’s a wide consensus that search behavior will increase in complexity as generative AI search becomes more ubiquitous.

Here are a few descriptions of the sources they used.

4. AllSouls: This dataset contains essay questions from “All Souls College, Oxford University”. The queries in this dataset require Generative Engines to perform appropriate reasoning to aggregate information from multiple sources. 5. LIMA: contains challenging questions requiring Generative Engines to not only aggregate information but also perform suitable reasoning to answer the question (eg: writing a short poem, python code.). 6. Davinci-Debtate (Liu et al., 2023a) contains debate questions generated for testing Generative Engines.

As questions become more intricate, these generative search engines (e.g., Bing Chat) are more likely to make multiple searches. They described as much in the paper. I have highlighted some words for emphasis.

This workflow breaks down the input query into a set of simpler queries that are easier to consume for the search engine. Given a query, a query re-formulator generative model, G1 = Gqr, generates a set of queries Q1 = {q1, q2…qn}, which are then passed to the search engine SE to retrieve a multi-set of ranked sources S = {s1, s2, …, sm}.

[Image: A flowchart of a Bing Chat-like generative AI search experience where the engine takes the initial query, reformulates it into multiple queries, accesses the search engine, then summarizes a response to the user]

However, only one search was used per keyword (that exact keyword). There was no simplification process or breakdown into multiple searches. It makes the complex questions slightly lose their luster for this experiment.
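For contrast, here is a minimal sketch of the multi-query workflow they describe, with `reformulate` and `search` as stand-ins for a query-rewriting LLM call and a search API (both hypothetical names of mine):

```python
def retrieve_sources(query: str, reformulate, search, per_query: int = 5) -> list[str]:
    """Bing Chat-style retrieval: split a complex question into simpler
    queries, search each one, and pool the ranked sources."""
    sources: list[str] = []
    for sub_query in reformulate(query):       # e.g. ["best jazz bars nyc", "nyc observation decks"]
        sources.extend(search(sub_query)[:per_query])
    return sources

# The study effectively used reformulate = lambda q: [q], i.e. one search
# per query, with the exact original query and no decomposition.
```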

A Few Things I Really Liked About the GEO Paper

If you’re still with me after 3,500+ words, I want to switch my tune a bit. Even though I have shared plenty of criticism on this work, there’s a lot to like about it as well.

  • Academic Interest – As I stated before, being mentioned in scholarly papers is good for the SEO industry. It adds credibility and substance to what we do.
  • GEO-BENCH – I love everything about how the researchers developed the list of 10,000 queries. They pulled from multiple sources, mixing real queries with synthetically generated ones, and importantly didn’t rely exclusively on SEO tools. They maintained an empirical 80/10/10 distribution of informational, transactional and navigational queries. They also tagged them according to complexity, nature, genre, topic, sensitivity, intent and answer type 16.
  • Simultaneous Optimization – For each query, instead of optimizing one source and leaving the others alone, they optimized multiple sources per query. They did this because, “…it is anticipated that GEO methods will be widely adopted, leading to a scenario where all source contents are optimized using GEO.” While the data might be less noisy if the optimizations were isolated, this decision intuitively makes sense to me.
  • Position-Adjusted Word Count – Traditional rank tracking does not make sense when measuring your website’s visibility as a citation in an LLM-generated response. The researchers’ incorporation of response share and citation order was clearly well thought out. It’s also worth noting they used a “Subjective Impression” metric constructed from seven sub-metrics. There was sound thinking behind that as well, but it might need more refinement.
  • Innovation – This study was innovative and compelling. They undoubtedly pushed the conversation forward. Some of us in SEO are pretty resistant to what may be on the horizon, and this research fully embraces it. And for that, I have strong appreciation for it.

Conclusion

It is unquestionably easier to be a critic than a creator, and I applaud these researchers for putting the work in and furthering the conversation. Still, actual business decisions could be made from the Generative Engine Optimization paper, so there’s value in taking a closer look and surfacing these concerns.

Alright, I’m out of things to say, and I refuse to let this get to 4,000 words. It’s over.


  1. That can has seemingly been kicked down the road.

  2. If you don’t know what I’m talking about, Google it. They’re not getting a link from me.

  3. I have reviewed every Always Sunny cold open through the first 14 seasons, in case that does anything for you.

  4. …which absolutely should be the case, just so we’re clear 😊

  5. e.g., share preliminary results more quickly, get cited and recognized

  6. Actually, right before publishing this, I did come across some smart questions from Ann Smarty and Rich Sanger on the subject, including challenges I hadn’t thought of. Check them out!

  7. This may not have been the best frame of reference. I feel like some of you are learning about his death for the first time, and others thought he might have died a decade ago.

  8. With the paper using phrases like “keyword stuffing” and “SEO optimization”, I think it’s fair to assume SEO isn’t their day job, and I don’t want to push them away from this subject matter.

  9. To a degree, LLMs already know to sound high-quality and interesting.

  10. The authors did do a small-scale, 200-query test in the real world using Perplexity.AI, but the same process (with the same biases) was used.

  11. I would have said the same thing in 2014.

  12. Now, we know this isn’t true. Google especially does not restrict its sources to the highest rankings. A recent study of commercial keywords actually suggested the opposite.

  13. Although, more folks are using AI for optimization these days, so maybe it was the right call.

  14. The researchers did make the benchmark data available, but I wasn’t able to find the optimized results.

  15. I’m fully aware I’m inviting the grammar police to look at my post. Hopefully no one finds anything. But I also went to a D2 school, and this is far from a scholarly paper, so go easy on me.

  16. Refer to pages 15-16 to learn more about the tagging.