Session 3: Find and Seek: building and preserving collections

Date: Wednesday 5th July – 14.30-16.00

Location: CEU 101

Chair: Heli Kautonen, Suomalaisen Kirjallisuuden Seura – Finnish Literature Society, Finland

3.1: Building and Processing Corpora from Archived Web Content

Presenter: Gyula Kalcsó, National Széchényi Library, Hungary

The National Széchényi Library started its web archiving project in 2017, and it became a permanent service in 2020. The main harvest types are domain harvests (the .hu domain and related materials), thematic harvests, and harvests of major political, cultural and sports events. On 21 February 2022 the library started an event-based weekly harvest related to the Russian-Ukrainian conflict, covering 75 news portals in Hungary and neighbouring countries. News collection is primarily based on the categories and tags used on the portals (currently 445 seed URLs). The collection is not public; however, the library has created a SolrWayback-based public search interface (https://ukrajnapublic.webharvest.oszk.hu/solrwayback/). Through this service the full text of the news is not available for copyright reasons, but full-text search is offered, and metadata and textual context can also be displayed.

Such thematic web research collections can provide an excellent basis and data set for various social science and humanities research. One of these uses is to build text corpora from the web content harvested, which can be analysed using natural language processing tools to produce rich data visualisations. The linguistically analysed texts could be a source for further research: discourse analysis, sentiment analysis and other analyses of interest to humanities and social science research. The Centre for Digital Humanities of the National Széchényi Library has built an experimental corpus of news material on the war in Ukraine to develop a methodology for building a corpus of texts from the web archive material.

The procedure is to extract the HTML from the WARC files using a Python library called warcio, then clean up the HTML code by removing the boilerplate with a Python tool called jusText to obtain raw text. The resulting plain text is processed (split into sentences and tokens, POS-tagged, lemmatized, morphologically analysed and disambiguated) using EMTSV, a natural language processing toolchain developed for Hungarian. The parsed text is then used in several ways, e.g. to create datasets that can be fed into Power BI software, which can be used to produce various interactive reports, graphs and other data visualisations (available online on the Digital Humanities Centre platform).
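The first two steps of this procedure (HTML extraction and boilerplate removal) could be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the function names is_html_response and iter_clean_texts are hypothetical, and the choice of jusText's "Hungarian" stoplist and the filtering on the text/html content type are assumptions. The linguistic analysis with EMTSV is not shown.

```python
from typing import Iterator, Optional, Tuple


def is_html_response(rec_type: str, content_type: Optional[str]) -> bool:
    """Return True for WARC 'response' records whose HTTP payload is HTML."""
    if rec_type != "response" or not content_type:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    return content_type.split(";")[0].strip().lower() == "text/html"


def iter_clean_texts(
    warc_path: str, language: str = "Hungarian"
) -> Iterator[Tuple[str, str]]:
    """Yield (url, cleaned_text) pairs for each HTML page in a WARC file.

    Hypothetical helper: the stoplist language is an assumption.
    """
    # Third-party imports are kept local so that the helper above can be
    # used even where warcio and jusText are not installed.
    import justext
    from warcio.archiveiterator import ArchiveIterator

    stoplist = justext.get_stoplist(language)
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            http_headers = record.http_headers
            content_type = (
                http_headers.get_header("Content-Type") if http_headers else None
            )
            if not is_html_response(record.rec_type, content_type):
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            if not html.strip():
                continue  # skip empty payloads
            # jusText labels each paragraph as boilerplate or content.
            paragraphs = justext.justext(html, stoplist)
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            if text:
                yield url, text
```

Each yielded text could then be passed to the NLP toolchain for sentence splitting, tokenisation and morphological analysis; keeping the URL alongside the text makes it possible to trace derived data back to the archived page.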

A similar method can be used to build a text corpus from any thematic collection. Although the texts themselves (the corpus) cannot be made public due to copyright restrictions, the publication of data sets that can be extracted from the texts opens up new possibilities for research in the humanities and social sciences. The long-term goal is to classify the material of the entire web archive thematically and to make the data that can be extracted from the resulting thematic sub-collections available to researchers.

3.2: Research Software Preservation: Libraries’ role and contribution to a key pillar of Open Science

Presenters: Roberto di Cosmo and Sabrina Granger, Software Heritage and Inria, France

The awareness and recognition of software’s role in all scientific communities is growing as a result of the open science movement and as a response to the reproducibility challenges. Alongside publications and data, software is a pillar of the academic ecosystem. The preservation of the source code over the long term is one of the cornerstones of a reliable knowledge base with the potential to increase the transparency, reproducibility, and accountability of scientific research.

Libraries play an important role when it comes to documenting, describing and linking academic resources. In Software Heritage (SWH), the universal source code library, libraries have at their disposal a trustworthy gateway for linking research outputs to software source code.

In this presentation, the speakers will explain the SWH infrastructure. They will also illustrate partnerships with different institutions, including academic libraries, scholarly repositories, publishers and the National Library of France, which aim to share knowledge and expertise and to collaborate on archiving the academic software corpus.

SWH is an open, non-profit, and multi-stakeholder initiative that builds the universal archive of all publicly available source code, launched by Inria in partnership with UNESCO in 2015. SWH collects, preserves, and makes accessible the source code and development history of publicly available software over the long term, therefore also fulfilling the core missions of libraries.

The curation and preservation of software require different functionalities and services than the processing of publications or research data. Moreover, SWH’s approach differs from the legal deposit of software and databases, managed by the National Library of France (BnF). In the presentation, the speakers will demonstrate methods to build the full collection of code and metadata describing the software in a library of living source code. They will also explain how SWH provides practical solutions for researchers, helping them publish articles that are “link rot proof”.

Furthermore, librarians and academic support staff require training to understand how to better archive and curate research software in their own collections, or how to better reference outputs that are archived in Software Heritage.

In 2023, the Alexandria source code library provides access to more than 13 billion unique files from forges and platforms around the world. The inclusion of SWH in the academic ecosystem also involves institutional incentives: the use of SWH is now part of the recommendations of the French National Research Agency (ANR) and the International Neuroinformatics Coordinating Facility (INCF). The support provided by the French Ministry of Higher Education and Research and the creation of a network of ambassadors also contribute to promoting SWH to its audiences.

The recognition of the importance of software in research remains a long-term task, which calls for technical as well as political and methodological solutions. The support of researchers as well as the development of descriptive standards for software will play a major role in giving software the place it deserves. From many points of view, to quote C. Borgman, software heritage is definitively the next frontier for libraries.

3.3: National Infrastructures Supporting Discoverability and Approachability of Research-Based Information – A Case from the Finna Services

Presenters: Veera Mujunen and Riitta Peltonen, The National Library of Finland, Finland

This presentation will provide a case example of how the National Library of Finland (NatLibFi), in collaboration with the Finnish library network, uses the Finna infrastructure to promote research-based information in Finnish OA publications and make it more discoverable and approachable to researchers and students as well as the general public.

Finna is a group of different services developed and maintained at NatLibFi. The national search site Finna.fi combines all the content coming from over 450 Finnish libraries, archives, museums and publication repositories. Finna is also a platform service for content providers to build their collection search services or web libraries. NatLibFi is responsible for the Finna platform development whereas the other organisations are responsible for providing the content.

The speakers will present how they created a new contract model in order to include repositories from new types of organisations into the Finna index. The new model enabled the inclusion of OA repositories of the Finnish state research institutes. This is an example of how NatLibFi as a node organisation needed to take up new responsibilities to enable progress.

The presentation will elaborate on how these publications are made available both in the national Finna.fi portal and in the search services of other libraries. Most Finnish library websites are built on top of the Finna platform, and one of the key features is that Finna-based web libraries can add databases from the Finna index into their own search service. This enables making the publications available where the readers are, whether they are students, scholars or a wider audience.

Speakers will discuss how Finna services and the network of libraries collaborated to make these publications more approachable to the general public. Approachability in this context means that in addition to making the publications available and accessible, they are also presented in places and ways that make it easy for the audience to start exploring them. Marketing is also part of the process. This means taking up new responsibilities for all parties in the collaboration.

When operating in this kind of networked environment of shared responsibilities, one major challenge is how to innovate and push something new to happen. The participating organisations’ collective decisions, contributions and in-depth expertise in their content offering are necessary for progress, but structures and facilitation are also needed. The presentation will outline examples of initiatives that Finna took as a node organisation and in which co-creation methods were essential to push the collaboration into new areas.

Finna services are taking steps towards something new that can help in bringing research-based information a step or two closer to the general public and make it more approachable for them. In the current time of misinformation, this is more crucial than ever.

“You don’t have to burn books to destroy a culture. Just get people to stop reading them” – Ray Bradbury on culture. This goes for scientific information too.