Google AI Overviews and AI Mode: implications for knowledge and library services
Google offers AI Overviews for some queries. These aim to provide a snapshot of key information, with links to make exploring the web easier.
AI Mode is an optional search feature that uses AI to provide conversational, detailed answers to complex queries instead of a list of links.
Members of the Current and Emerging Technology in Knowledge and Library Services Community of Practice have previously tested the efficacy of EBSCO’s AI Insights and Natural Language Search. Given the prevalence of Google Search for finding relevant information for evidence searches, it seemed sensible to evaluate Google’s generative AI tools.
Methodology
To explore these issues, 13 volunteer testers evaluated Google AI Overviews and AI Mode using a structured test script. This asked testers to run the same basic query on the 10 Year Health Plan for England, and then two queries of their own design: one for a question they might receive from a healthcare professional and one for a question they might receive from a patient. Testers were asked to check each summary against the results of the same query in trusted sources of information. Finally, they were asked to evaluate each summary against the Google Search results for the same query.
Test script returns were analysed by a human reviewer as well as by Microsoft Copilot. The latter was used to check how good a job Copilot did of analysing the feedback and whether it agreed with the reviewer’s analysis.
For the Copilot analysis, all returns were loaded into Copilot. The prompts used were specific to the question being analysed, for example: ‘Across the 13 documents please summarise the key themes and conclusions from the responses to the section heading entitled “Checking for reproducibility”. Start from this heading and stop at “Checking for accuracy”.’
These are the main highlights of the analysis. A comparison between the reviewer’s and Copilot’s analyses concludes the post.
Accuracy and completeness
Across policy and clinical topics, AI Overviews and AI Mode were generally effective at identifying headline themes. For example, summaries of the 10 Year Health Plan consistently highlighted the three strategic shifts: hospital to community, analogue to digital, and sickness to prevention. However, the same summaries frequently omitted significant detail, nuance, and context.
While the information presented was usually accurate, it was partial. The additional detail supplied varied widely across the summaries the tools produced: some elements of the 10 Year Health Plan, such as workforce planning, financial frameworks, operating models, and implementation detail, were included while others were not. This creates a risk that users interpret AI summaries as comprehensive when they are not.
Reproducibility
One of the clearest findings was the lack of reproducibility. Minor changes in wording produced different emphases, formats, and source selections. Even when the same question was re-run, summaries varied in tone, level of detail, and supporting references.
This variability undermines the use of AI Overviews in professional contexts. From a KLS perspective, it reinforces the need to capture and document exact queries, outputs, and timestamps if AI-generated content is referenced or adapted.
Source selection and transparency
Testers repeatedly questioned how sources were chosen and used. AI summaries often mixed authoritative UK sources (NHS England, NICE, GOV.UK) with US-based organisations, news articles, private providers, and less authoritative websites. Some cited sources appeared not to contribute substantively to the summary.
This may have been partly due to the nature of the prompting. For example, the basic query about the 10 Year Health Plan, ‘what are the key parts of the NHS 10-year plan?’, referenced data from previous 10-year plans and included law firms and news websites in its listed sources. Adding context, such as the year ‘2025’ or the instruction ‘use only UK authoritative sources’, may have helped eliminate some of the noise and less relevant sources. However, the test script did not ask testers to add context to their queries, so this aspect is not covered by the analysis and would need to be evaluated separately.
The open-web bias was particularly apparent when compared with subscription resources such as BMJ Best Practice. AI Overviews cannot access paywalled evidence, which leads to systematic gaps in clinical depth and authority: precisely where KLS services add most value.
Clinical queries
For clinician-style queries, AI Overviews and AI Mode produced broadly sensible summaries but lacked the specificity required for safe practice. Compared with BMJ Best Practice, AI outputs were shorter, more generic, and often patient-oriented. Key omissions included differential diagnoses, thresholds for action, contraindications, and escalation criteria.
While hallucinations were rare, even small omissions or inaccuracies were judged clinically significant. Google AI summaries may help frame a topic but cannot replace validated clinical resources.
Patient queries
For patient-facing questions, AI summaries were often clear, accessible, and practically oriented. In some cases, they complemented trusted patient resources by offering plain-English explanations or gave more information than the trusted sources provided.
However, the summaries sometimes missed some of the context of the query. For example, for the query ‘My son has been in contact with someone with meningitis what should I do?’, the summary focused on diagnosis and treatment and did not cover what to do after contact with someone who has meningitis. In other cases, obvious sources were not referenced: a query about autism, for example, did not cite the National Autistic Society.
How do the tools compare to the results returned by Google Search?
This comparison examined whether AI-generated summaries reflected the key themes, emphasis, and sources evident in standard ranked search results. While high-level themes were generally consistent, AI-generated outputs applied a different relevance filter.
AI Overviews tended to synthesise information selectively rather than reflect the breadth and weighting of conventional search results, sometimes omitting contextual detail or alternative perspectives. This preference for narrative coherence over comprehensive source representation may be suitable for exploratory or non-specialist use but presents risks of oversimplification in evidence-based or clinical contexts.
One tester wrote, ‘It’s not clear how “deeply” it reads the sources it cites. Although everything in the summary seems accurate, there are sometimes important pieces of info that would be pertinent if I were reading through the sources myself for a user. For example, there is useful information in the RCN guidance on safe staffing that doesn’t appear in the summary even though the AI mode apparently consulted this document when writing its answer.’
How do the reviewer’s and Copilot’s analyses differ?
The reviewer was able to assess what matters in practice, particularly in healthcare and knowledge services contexts, distinguish between majority and minority views, avoid over-generalising from isolated comments, and use cautious language to reflect uncertainty, risk, and limitations.
Copilot was effective at identifying patterns across large volumes of feedback but less reliable in judging their significance. It frequently elevated single or minority observations into ‘key themes’ and did not consistently indicate how representative they were. It sometimes split closely related points into multiple themes, amplifying their apparent importance, and its use of confident, declarative language could overstate findings relative to the underlying evidence.
Copilot also showed inconsistent control of scope and detail. Some summaries were superficial, while others included excessive or tangential information, occasionally drawing in material from outside the section being analysed. The reasoning behind its conclusions was not always transparent.
The contrast demonstrates that AI can support synthesis, but expert human interpretation remains essential for trustworthy analysis.
What this means for KLS practice
The testing suggests that Google AI Overviews and AI Mode are best understood as orientation tools, not authoritative sources. For KLS, this has several implications:
- Treat AI summaries as orientation tools, not answers
- AI Overviews can help you rapidly scope a topic or understand how information may be presented to users, but they should never replace authoritative sources or critical appraisal
- Expect variability and document carefully
- The same query may produce different summaries, sources, or emphasis depending on wording, browser, or timing. If AI-generated content informs your work, record the exact query, date, and output viewed
- Check the sources
- It’s not clear how sources have been selected: in testing, a mix of authoritative and non-authoritative sources was used. Give context in your prompt to help ensure the summary draws on appropriate sources
- Add the detail AI typically misses
- Be alert to content that may be included in the sources referenced but not in the summary
- Support users’ critical appraisal skills
- Many users may not check sources or recognise limitations. Use AI examples in training and enquiries to highlight why source evaluation and context still matter
- Use AI outputs to anticipate user needs
- Reviewing AI summaries can help you understand what users may already have seen before contacting your service. This can help you correct misconceptions and add value to the enquiry more effectively
- Reinforce the professional role of KLS
- These tools underline, rather than diminish, the importance of knowledge specialists in validation, synthesis, and contextualisation. Human expertise remains essential for safe, evidence-based decision support
Far from replacing knowledge services, these tools make the human in the loop more, not less, important.
Page last reviewed: 19 February 2026
Next review due: 19 February 2028