Parallel Session 1 – Artificial Intelligence for/in/and Research Libraries I
Moderator: TBC
Location: Room 1129
1.1) The Blue Marble: A Human-AI Knowledge Synthesis Platform for Environmental Research
Presenter: Lidia Uziel, University of California, Santa Barbara, United States of America
Researchers worldwide are increasingly overwhelmed by the sheer volume of information available when searching for literature, particularly in interdisciplinary areas like Critical Zone (CZ) science. This field combines geoscience, hydrology, ecology, soil science, chemistry, biology, and social sciences to understand the processes shaping the environment and their responses to natural and human-induced changes. While synthesis tools and AI platforms are available, they often fail to provide the depth and cross-disciplinary connections required to address complex environmental challenges. To help overcome this, the University of California, Santa Barbara (UCSB) developed the Blue Marble Platform to manage the immense body of peer-reviewed literature—over 3 million papers published annually, with 500,000 from the U.S. alone—critical to research on pressing issues such as climate change, drought, and wildfires.
The Blue Marble Platform, a collaboration between UCSB researchers and the UCSB Library, is a human-driven, machine-aided Knowledge Platform designed to contextualize scientific literature and connect related ideas across environmental fields. Its dynamic, updatable structure integrates three key components: human-led scientific reviews, science visualization, and machine learning. These elements work together within a strong governance structure to ensure the platform evolves to meet the research community’s needs. The Knowledge Landscape, a visual framework central to the platform, organizes key concepts, hypotheses, and methods. Expert panels oversee its development, while machine learning continuously updates it with newly published research, ensuring it remains current. Unlike existing AI tools such as Connected Papers, the Blue Marble incorporates expert guidance to curate and present research, addressing the gaps that purely AI-driven approaches often leave behind.
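As a rough illustration of how such a landscape might be kept current, the sketch below models Knowledge Landscape entries as nodes holding concepts, hypotheses, or methods and links newly published papers to them by keyword overlap. The data model, field names, and matching rule are illustrative assumptions only, not the platform's actual design, which combines expert curation with machine learning.

```python
from dataclasses import dataclass, field

@dataclass
class LandscapeNode:
    """A concept, hypothesis, or method in the Knowledge Landscape (hypothetical model)."""
    label: str
    kind: str                       # "concept" | "hypothesis" | "method"
    keywords: set
    papers: list = field(default_factory=list)

def link_new_paper(nodes, title, abstract, min_overlap=2):
    """Attach a newly published paper to every node whose keywords overlap its text.
    A production system would combine expert curation with a trained classifier or
    embeddings; simple keyword overlap stands in for that step here."""
    tokens = set((title + " " + abstract).lower().split())
    linked = []
    for node in nodes:
        if len(node.keywords & tokens) >= min_overlap:
            node.papers.append(title)
            linked.append(node.label)
    return linked

landscape = [
    LandscapeNode("soil moisture dynamics", "concept", {"soil", "moisture", "infiltration"}),
    LandscapeNode("wildfire recovery", "hypothesis", {"wildfire", "post-fire", "vegetation"}),
]
print(link_new_paper(landscape,
                     "Post-fire soil moisture and vegetation recovery",
                     "We measure infiltration after wildfire in a burned catchment."))
```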
The platform’s development includes constructing the Knowledge Landscape, designing an intuitive user interface (UI), training the AI engine to automate updates, and deploying scalable computing infrastructure. Equally important is the creation of a Community-Led Governance Structure that emphasizes broad and equitable participation. This structure will feature an administrative home, a representative governing board, and subcommittees focused on technology, membership, sustainability, and equity. A membership model based on an “ability to pay” ensures institutions of all financial capacities can participate. Any surplus funds will support under-resourced institutions and the platform’s future development.
The Blue Marble Project’s ultimate goal is to provide an advanced, collaborative, and equitable tool for environmental researchers, enabling them to navigate, synthesize, and apply the vast body of scientific literature more effectively. By combining human expertise with AI-powered tools, the platform ensures a contextualized and relevant approach to interdisciplinary research. This presentation will outline the project’s goals, architecture, and governance, alongside strategies for community engagement and the roadmap for the pilot phase. Through these efforts, the Blue Marble will empower researchers to address today’s critical environmental challenges with greater clarity and collaboration.
1.2) Leveraging Pre-Trained Zero-Shot Large Language Models for Topic-Based Corpus Creation
Presenters: Anda Baklāne and Valdis Saulespurēns, National Library of Latvia, Latvia
The paper explores genre- and topic-based corpora creation at the National Library of Latvia (NLL). It includes examples of corpora developed at the library, with a specific case study showcasing the use of GPT-4o to compile a Corpus of Latvian Music Texts from the collection of digitized Latvian periodicals.
Providing access to collections as data has become a standard service for libraries managing extensive digital collections. On-demand corpus creation serves two key purposes: first, it facilitates negotiated access to copyrighted collections; second, it enables pre-processing of data to better meet researchers’ needs. Handling large datasets, such as extensive periodicals spanning many decades, can be technically burdensome. Therefore, pre-selecting data based on specific criteria such as genre, period, or topic is often a more practical approach to creating manageable datasets.
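As a minimal sketch of such pre-selection, the snippet below filters a hypothetical full-text export of digitized periodicals by publication period and topical keywords; the file layout, column names, and example keywords are assumptions, not the NLL's actual data model.

```python
import csv

def preselect(path, start_year, end_year, keywords):
    """Yield articles from a (hypothetical) full-text export of digitized periodicals
    whose publication year falls within the requested period and whose OCR text
    contains at least one topical keyword. Column names are assumptions."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            year = int(row["year"])
            text = row["ocr_text"].lower()
            if start_year <= year <= end_year and any(k in text for k in keywords):
                yield row

# e.g. candidate articles for a music corpus, 1920-2009:
# candidates = list(preselect("periodicals.csv", 1920, 2009, {"mūzika", "koncerts", "opera"}))
```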
While automating selections based on periods is relatively straightforward, topic-based selections pose more significant challenges. Keyword-based searches within full-text materials rarely yield exclusively relevant entries, with false-positive rates often reaching as high as 80%. Eliminating such inaccuracies typically requires extensive manual exploration, editing, and refinement of keywords.
Advancements in language models have shown promising potential for understanding text and retrieving relevant results based on precise instructions; these capabilities can also be leveraged for corpus creation. During preliminary research, the authors identified several current state-of-the-art models that provided acceptable performance for the Latvian language. For the task at hand, the better-known and better-supported GPT-4o was selected. Although non-commercial models would be preferable for research purposes, currently available fully open models do not provide satisfactory results for Latvian.
Extracting data for the Corpus of Latvian Music Texts has been one of the NLL’s most challenging corpus creation projects. Initially conceived as a project to compile a corpus of Latvian musical criticism, it has evolved into a broader endeavor that widens the definition of music-related texts to include various forms of musical journalism as well as critical and theoretical commentary. The corpus spans a timeline from the earliest music periodicals published in modern type (beginning in the 1920s) to newspapers issued throughout the 2000s. Key challenges include inconsistent orthography, Gothic typefaces, OCR errors, multilingual content, and limited data availability.
To explore the results of the GPT-4o model, a workflow was devised to query the pre-selected corpus of articles featuring keywords related to music but still containing a significant number of false positives. The workflow was run in batch mode via the API. Several types of prompts were used for the exploration, including queries in Latvian and English, instructions based on keywords mimicking the workflow previously used for data extraction for the corpus, and other types of questions. As LLMs are prone to various anomalies and hallucinations, a hybrid approach was used to verify results and minimize false positives.
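The sketch below shows one way such a classification pass could look with the OpenAI API: each pre-selected article is sent to GPT-4o with a yes/no relevance prompt, and articles on which the prompt variants disagree are routed to manual review. The prompts, truncation limit, and triage rule are illustrative assumptions, not the NLL's actual workflow.

```python
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompts in English and Latvian; not the NLL's actual instructions.
PROMPTS = {
    "en": "Answer YES if the following Latvian newspaper article is primarily about music "
          "(criticism, journalism, theory, or concert life); otherwise answer NO.",
    "lv": "Atbildi ar YES, ja šis raksts galvenokārt ir par mūziku; pretējā gadījumā atbildi ar NO.",
}

def classify(text, prompt):
    """Ask GPT-4o for a YES/NO relevance verdict on one pre-selected article."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text[:8000]},  # truncate very long OCR text
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def triage(text):
    """Hybrid check: include an article only when both prompt variants agree;
    route disagreements to manual review (one plausible reading of the hybrid approach)."""
    verdicts = [classify(text, p) for p in PROMPTS.values()]
    if all(verdicts):
        return "include"
    if not any(verdicts):
        return "exclude"
    return "manual_review"
```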
While large language models demonstrate encouraging results for information extraction and processing, several challenges persist. These include establishing proper methodologies to ensure trustworthy and verifiable outcomes and addressing issues related to scalability and the costs associated with implementing these methods for library automation and research purposes.
1.3) The ‘Retro Tool’: AI-Powered Automated Metadata Creation in Practice at KB National Library of the Netherlands
Presenter: Marie Buesink, KB National Library, Netherlands
The KB National Library collects all works published in and about the Netherlands following the establishment of the Repository of Dutch Publications (‘Depot van Nederlandse Publicaties’) in 1974. However, the library has a gargantuan backlog of approximately 260,000 uncatalogued publications due to factors such as COVID-19, the recent switch to a new cataloguing system, and the exponential growth of physical and digital publications to be included in the Repository, all compounded by a shortage of trained cataloguing personnel. The upcoming move to the new book repository in early 2027 poses a challenging deadline for reducing this backlog. These dire circumstances call for smarter, automated cataloguing solutions that nevertheless uphold the KB’s metadata quality standards.
This is where the so-called ‘Retro Tool’ comes in. This AI-powered cataloguing tool was initially developed as a pilot with a third party to manage the 30,000-40,000 uncatalogued publications in our ‘retro collection,’ comprising the missing pieces in the Repository. The Retro Tool supports cataloguers in filling this lacuna in the Repository with the assistance of OCR (Optical Character Recognition) and an LLM (Large Language Model) on two fronts. Firstly, it streamlines the removal of duplicates by showing weighted matching percentages per data field. Secondly, the tool reduces manual data entry by automatically generating MARC21 title descriptions for the remaining books under human supervision. This, in turn, alleviates work pressure among cataloguers and shortens the time-to-market for end users. The pilot resulted in a fully functional cataloguing tool that is currently in use by a small number of cataloguers. This presentation will detail the Retro Tool workflow through a live demonstration, accounts of cataloguers’ hands-on experiences with the tool, and a discussion of the challenges surrounding the responsible use of AI during the pilot.
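As a rough illustration of the duplicate-detection idea, the sketch below computes per-field similarity percentages and a weighted overall score for a candidate book against an existing record. The fields, weights, and similarity measure are assumptions for illustration, not the Retro Tool's actual implementation.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the actual Retro Tool weighting is not described here.
FIELD_WEIGHTS = {"title": 0.5, "author": 0.25, "publisher": 0.15, "year": 0.10}

def field_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized field values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_report(candidate, existing):
    """Per-field matching percentages plus a weighted overall score, mirroring the idea
    of showing cataloguers weighted matching percentages per data field."""
    per_field = {
        f: round(100 * field_similarity(candidate.get(f, ""), existing.get(f, "")), 1)
        for f in FIELD_WEIGHTS
    }
    overall = round(sum(FIELD_WEIGHTS[f] * per_field[f] for f in FIELD_WEIGHTS), 1)
    return {"fields": per_field, "overall": overall}

book = {"title": "De ontdekking van de hemel", "author": "Harry Mulisch",
        "publisher": "De Bezige Bij", "year": "1992"}
record = {"title": "De ontdekking van de hemel", "author": "Mulisch, Harry",
          "publisher": "Bezige Bij", "year": "1992"}
print(match_report(book, record))  # exact title/year match, partial author/publisher match
```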
The Retro Tool is evidently only the tip of the iceberg when it comes to effectively employing AI technologies in the library sector. There is ample opportunity to scale up the functionalities of the Retro Tool to accommodate a variety of workflows, some of which are already being explored with promising results. Still, the incorporation of artificial intelligence in such tools, no matter how small, warrants careful consideration in terms of possible ethical implications. For instance, the KB condemns the use of commercial AI on its collections because these models are often trained on illegally harvested data. This underscores the notion that any AI-driven solution employed in the library sector—including the Retro Tool—should ideally build upon ethically responsible models. However, this can be challenging considering the limited availability of (advanced) open source alternatives.
The Retro Tool signals a new era in cataloguing and will continue to play a pivotal role in automated metadata creation at the KB National Library. Notwithstanding the tension between technical advancement and ethical responsibility that continues to shape these developments, AI could be the key to keeping pace in a rapidly evolving library sector. As such, libraries should not be afraid to take the first step.