Documents anciens et reconnaissance automatique des écritures manuscrites

Ancient documents and automatic recognition of handwriting

conference on HTR to be held on June 23 and 24, 2022 at the École nationale des chartes, Paris

23-24 Jun 2022 Paris (France)



"Ancient documents and automatic recognition of handwriting", 23 and 24 June 2022, ENC, Paris.

This scientific event will be held on 23 and 24 June 2022 at the École nationale des chartes (65, rue de Richelieu, 75002 Paris), room Delisle. It will take place in a hybrid format, with a live broadcast on YouTube:

- 23 June: https://www.youtube.com/watch?v=dE1XUXiuitU

- 24 June: https://www.youtube.com/watch?v=YORfV0yIsQg

Today, many projects include an automatic text acquisition step in their production or data exploitation pipelines. Several transcription platforms and different HTR engines are now available. The integration of this technology into increasingly efficient processing chains has led to the automation of tasks, which calls into question the place of researchers in the text production process. This new, data-intensive practice makes it urgent to gather and harmonise corpora in order to build training sets, and to make them available in order to improve the quality of HTR results.

Thus, the École nationale des chartes, in partnership with the LabEx Hastec and the LAMOP, through the CREMMALab project supported by the DIM MAP, is organising on 23 and 24 June 2022 two days of conferences combining philological and technical questions on the use of HTR for ancient documents. We will take this opportunity to review HTR through its tools, its results and the new practices induced by its use in publishing and exploitation projects. We hope that this event will also be an opportunity to bring together a growing international community of researchers to discuss the use of HTR in their scientific projects.

The proposal is based on a desire to treat the theme of this conference from a technical point of view while linking it to the scientific problem of building and/or exploiting corpora. It questions the practical aspects of using this technology (development of HTR engines, transcription interfaces, user interfaces for using and training models, etc.) and raises its methodological issues and its impact on research data.

If you wish to attend these days in person, please register at the following link: https://dahtr.sciencesconf.org/registration

Programme - Day 1

June 23rd	Talks
9:15-9:30	Welcome of the participants
9:30-9:45	Opening speech with the presentation of the CREMMA and CREMMALAB projects Elsa Marguin-Hamon, Director of Research and International Relations, École nationale des chartes
9:45-10:15	CremmaLab projects: Transcription guidelines and HTR models for French medieval manuscripts Jean-Baptiste Camps, associate professor, École nationale des chartes, CJM Ariane Pinche, postdoctoral researcher, École nationale des chartes, CJM
	Summary: Text acquisition is the first step in most of our research undertakings, whether text editing, linguistic, philological and historical studies, or large-scale corpus processing. To produce high-quality textual corpora, it is crucial to be able to share the data we produce freely, while guaranteeing their interoperability, and ultimately to offer reusable models to the scientific community. To meet these needs, and more specifically those of medievalists, the CREMMALAB project offers methodological reflections on corpus transcription protocols in order to optimise HTR models, through the writing of a transcription guide and the release of HTR models. We will present the first results of this work through the processing of two large corpora: a corpus of chivalric romances and a corpus of hagiographic texts, considered diachronically (13th-15th centuries).
10:15-10:45	HTR fine-tuning for medieval manuscripts models: strategies and evaluation Sergio Torres Aguilar, postdoctoral researcher, École nationale des chartes, CJM Vincent Jolivet, head of digital projects, École nationale des chartes
	Summary: In this presentation we explore several practical questions about HTR modelling in order to determine at what point a model becomes robust enough, and generalises broadly enough, to serve as a pre-trained base for a new specialised model. To this end, we use several HTR ground-truth documents from medieval cartularies and registers ranging from the 12th to the 15th century, and we evaluate two aspects: (1) the creation of robust models, by trying to calculate the learning break-point and the minimum amount of ground truth necessary to achieve good generalisation performance from a limited collection of documents, and (2) the process of fine-tuning in order to quickly specialise a robust model, used here as a pre-trained base, on a type of source other than those used during training.
10:45-11:15	Break
11:15-11:45	Une cursive du 17e siècle [A 17th-century cursive hand] Élodie Paupe, doctoral assistant, University of Neuchâtel, and project officer for the AAEB
	Summary: The "Crimes et châtiments" project aims to digitise and transcribe the criminal proceedings of the former Bishopric of Basel (1461-1797). In the pilot phase currently under way, an HTR model is being developed on a series of witchcraft trials, most of whose documents are written in French cursive by the provost Henri Farine, active between 1580 and 1618. After presenting the particularities of this hand and the corpus, we will report on our experience with the two infrastructures used (Transkribus and eScriptorium) and with the use of binarisation methods on handwritten documents. To conclude, we will present the performance of the "Farine" model on contemporary documents in other hands, as well as the lines of development being pursued.
11:45-12:15	Un modèle ouvert pour la reconnaissance automatique des manuscrits du théâtre espagnol du Siècle d’Or [An open model for the automatic recognition of Spanish Golden Age theatre manuscripts] Álvaro Cuéllar, PhD student, University of Kentucky
	Summary: The ETSO project, Estilometría aplicada al Teatro del Siglo de Oro (Cuéllar and Vega García-Luengos 2017-2022) (https://etso.es/), sets out to collect as many Spanish Golden Age plays as possible and to analyse them with stylometric techniques. A significant number of these texts survive only in manuscript witnesses, which required an automatic transcription process using Transkribus. Training the model "Spanish Golden Age Manuscripts (Spelling Modernization) 1.0" required 3,250,116 words; the model can automatically modernise the text and achieves a Character Error Rate (CER) of 10.54% on the validation set. Thanks to this model, we were able to transcribe some 400 manuscripts of Golden Age plays. Among all these texts, one attracted particular attention: La francesa Laura. This anonymous play was stylometrically aligned with the whole corpus of the playwright Lope de Vega (1562-1635).
12:15-14:00	Lunch break
14:00-14:30	New Developments in Kraken and eScriptorium Benjamin Kiessling, research engineer, PSL Peter Stokes, director of studies, EPHE
	Summary: Recent releases of Kraken (v4) and eScriptorium introduce a number of new features that improve user experience and performance. The presentation will introduce the most important ones, such as the new training library, binary datasets, and new layer types for Kraken, and annotation and text search for eScriptorium, as well as the integration of both into Biblissima+. We will discuss how these affect the use of the software in a variety of contexts, such as institutional and individual use, differences in dataset and target corpus size, etc. In addition, we will look briefly at subsystems in development, such as a new algorithm for trainable reading order.
14:30-15:00	De Transkribus à eScriptorium : retour(s) d’expérience sur l’usage d’outils d’HTR appliqués à un corpus d’imprimés espagnols du XIXe siècle [From Transkribus to eScriptorium: lessons learned from using HTR tools on a corpus of 19th-century Spanish printed documents] Élina Leblanc, postdoctoral researcher, Spanish unit, Faculty of Humanities, University of Geneva Pauline Jacsont, research associate, Spanish unit, Faculty of Humanities, University of Geneva
	Summary: In this paper, we will present the editorial pipeline developed for the project Démêler le cordel, which aims to build a digital library dedicated to the collection of 19th-century Spanish printed ephemera of the Bibliothèque universitaire de Genève. A distinctive feature of our pipeline is that it relied on two HTR tools, Transkribus and eScriptorium, whose uses at different stages of a project we will analyse. First, we will describe the collection of printed documents, emphasising its specific features and the issues it raises for automatic transcription. We will then look back on our experience with each of the HTR tools used, the reasons that led us to move from one to the other, and the difficulties encountered. To conclude, we will present how the HTR predictions are used on our website, developed with TEI-Publisher.
15:00-15:30	Lettres en lumières Florian Fizaine, PhD student, Archives départementales de la Côte-d’Or Édouard Bouyé, director of the Archives départementales de la Côte-d’Or
	Summary: As part of the "Lettres en lumières" project led by the Archives départementales de la Côte-d’Or in partnership with the Laboratoire d’étude de l’apprentissage et du développement (LEAD, Université de Bourgogne), we are developing an HTR tool that uses Mask R-CNN, an instance segmentation algorithm used notably in medical imaging, for line segmentation, and transformer networks, which have amply demonstrated their effectiveness in natural language understanding, for transcription. We began this work on the 18th-century registers of the Estates of Burgundy; the training data are obtained thanks to the participation of volunteer transcribers.
15:30-16:00	Break
16:00-16:30	Les archives inquisitoriales (Portugal) sous HTR : le projet TraPrInq (Transcribing the court records of the Portuguese Inquisition, 1536-1821) [The inquisitorial archives (Portugal) under HTR: the TraPrInq project] Hervé Baudry, researcher at CHAM-Centro de Humanidades (Universidade Nova de Lisboa), head of the TraPrInq project
	Summary: The TraPrInq project aims to create an HTR model. Part of the Portuguese inquisitorial archives (Arquivo Nacional da Torre do Tombo, Tribunal do Santo Ofício, 1536-1821) consists of trial records, more than 40,000 in number. Nearly half of this sub-fonds has been digitised. The generic model being developed on the Transkribus platform by a team of about ten palaeographers will enable large-scale transcription of the documents. This paper first gives a progress report after the first five months of activity: particularities of the corpus, working method, obstacles encountered and solutions adopted, first results (training data). In addition, since it seems premature to draw up a general assessment, it describes the approach adopted and how it has evolved, and reflects on the technical and human aspects of the means deployed and the objectives to be reached.
16:30-17:00	Segmentation Mode for Archival Documents with Highly Complex Layout Daniel Stökl Ben Ezra, director of studies, EPHE Marina Rustow, professor, Princeton University Devorah Witty, software developer, The Research Software Company
	Summary: Using eScriptorium together with kraken as an infrastructure, we developed a simple but highly efficient procedure for reducing the amount of human labor necessary for creating large amounts of segmentation ground truth for documents with highly complex layouts, i.e., documents comprising regions with lines at eight different angles. Our specific project deals with medieval documents in Hebrew script in Judeo‑Arabic, Aramaic and Hebrew from the Cairo Genizah, including letters, legal documents, lists, notes and accounts. There are about 40,000 documentary texts from the Genizah, of which only about 5,000 have been transcribed. Therefore, our current aim is to create enough data to be able to train a global segmentation model with a very large number of classes, so that it can segment complex layouts in a single step.
17:00-17:30	SegmOnto – A Controlled Vocabulary to Describe Historical Textual Sources Simon Gabay, maître-assistant (senior lecturer), University of Geneva Ariane Pinche, postdoctoral researcher, École nationale des chartes, CJM
	Summary: Our initiative aims to design a controlled vocabulary for describing the layout of textual sources: SegmOnto. Following a codicological approach rather than a semantic one, it is designed as a generic typology that covers as many cases as possible rather than answering specific needs. Systematising layout description has a twofold objective: on the one hand, it facilitates the exchange of annotated data and therefore the training of better models for image segmentation (a crucial preliminary step for text recognition); on the other hand, it allows the development of a shared post-processing workflow and pipeline for transforming ALTO or PAGE files into DH standard formats such as RDF or TEI.
17:30-17:45	Conclusion of the day
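
Several of the talks above report model quality as a Character Error Rate (CER). As a rough illustration only, the sketch below computes CER as the character-level edit distance between a reference transcription and an HTR prediction, divided by the length of the reference. This is the common textbook definition and not necessarily the exact computation used by Transkribus, Kraken or any project in the programme; the function names and the sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference` (dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: two substituted characters in a 17-character reference.
cer = character_error_rate("la francesa Laura", "la franceza Lavra")
print(f"CER = {cer:.2%}")
```

Because the denominator is the reference length, a prediction with many spurious characters can yield a CER above 100%; tools may also normalise whitespace or Unicode before comparing, which this sketch does not do.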

Programme - Day 2


June 24th	Talks
9:15-9:30	Welcome of the participants
9:30-10:00	FoNDUE - A Lightweight HTR Infrastructure for Geneva Simon Gabay, maître-assistant (senior lecturer), University of Geneva
	Summary: Recognising text on an image is becoming increasingly important for scholars working with textual sources. Because institutions have to address the needs of their members, the University of Geneva has decided to offer a free and user-friendly solution based on eScriptorium. The specificity of our instance is that it relies only on local infrastructure to minimise its cost and to offer additional services, such as training models directly from the command line. It therefore promotes a double empowerment: that of the institution, which does not depend on external private solutions, and that of scholars, who gain new digital skills. In addition to a theoretical reflection on this empowerment, we offer initial feedback on how to deploy an efficient HPC-based instance of eScriptorium.
10:00-10:30	From HTR to Critical Edition: A Semi-Automatic Pipeline Daniel Stoekl Ben Ezra, director of studies, EPHE Hayim Lapin, professor, University of Maryland, College Park Bronson Brown-Devost, post-doctoral researcher, Scripta Qumranica Electronica Pawel Jablonski, PhD student, EPHE
	Summary: This paper describes a pipeline for the creation of critical editions of literary texts from manually corrected HTR results of distinct manuscripts, as prepared in the Sofer Mahir project. The Sofer Mahir project produces manually corrected transcriptions of 16 large medieval Hebrew codices of all six main works of Tannaitic Rabbinic literature, redacted in the third or perhaps fourth century CE in Galilee. These works comprise Mishnah (~200k tokens), Tosefta (~300k tokens), Mekhilta deRabbi Yishmael (~80k tokens), Sifra (~120k tokens), Sifre Numbers (~60k tokens) and Sifre Deuteronomy (~60k tokens). Each work is extant in between 3 witnesses (Mishnah and Tosefta) and 5 witnesses (all others).
10:30-11:00	Break
11:00-11:30	Analyse, Reconnaissance et Indexation des manuscrits CHAM [Analysis, recognition and indexing of Cham manuscripts] Anne-Valérie Schweyer, CNRS researcher, Centre Asie du Sud-Est (CASE-EHESS-INALCO) Jean-Christophe Burie, professor, Université de La Rochelle Tien Nam Nguyen, PhD student, Université de La Rochelle
	Summary: Old Cham was the vehicular language used in inscriptions engraved throughout central Vietnam from the 6th to the 17th century. Old Cham was then replaced by Middle Cham, the language of a rich collection of manuscripts written between the 17th and 19th centuries in south-central Vietnam and in Cambodia. To prevent the disappearance of these alphasyllabic scripts, the CHAMDOC project, a multidisciplinary project bringing together researchers in the humanities and social sciences and in computer science, aims to design innovative methods and tools based on artificial intelligence to extract, recognise, transliterate and index Cham characters. We will present work in progress and first results.
11:30-12:00	Expérimentations pour l’analyse automatique de sources chinoises anciennes [Experiments in the automatic analysis of ancient Chinese sources] Marie Bizais-Lillig, associate professor, University of Strasbourg Chahan Vidal-Gorène, PhD student, École nationale des chartes and EPHE
	Summary: In this presentation, we report on an experiment in the automated transcription of woodblock-printed texts from imperial China, starting from a very small dataset (50 images). Although particularly legible, these very dense documents pose a double challenge for HTR, both in the reading order of the content and in the very large number of different characters to be recognised, a variety that cannot be fully represented in training. We will first examine the transcription choices made and their impact on the models' ability to learn effectively in a one-shot learning setting, and then address the reading order of the output and the different approaches implemented with and without machine learning.
12:00-14:00	Lunch break
14:00-14:30	Sharing HTR datasets with standardized metadata: the HTR-United initiative Alix Chagué, PhD student, EPHE, Université de Montréal, Inria Thibault Clérice, head of the TNAH master's programme, École nationale des chartes, CJM
	Summary: Since some scholars adopted Ocropy in the mid-2010s, production of HTR or OCR ground truth has seen impressive and steady growth. However, few projects share their gold datasets, and when they do, the datasets are scattered across many different hosting options (GitHub, Zenodo, GitLab, institutional repositories, etc.), which makes them very hard to find. When they are "discovered", their description often lacks details that are crucial for reuse. The HTR-United initiative is an answer to this problem: with a standardized metadata schema, a curated catalogue and tools that help dataset owners through every step, owners can now easily publish their datasets and make them findable.
14:30-15:00	EpiSearch. Recognising Ancient Inscriptions in Epigraphic Manuscripts Federico Boschetti, researcher; Institute for Computational Linguistics “A. Zampolli” – CNR, Pisa / VeDPH, Ca’ Foscari University of Venice Tatiana Tommasi, MA student; Ca’ Foscari University of Venice
	Summary: The project focuses on epigraphic codices as a proof of concept for putting digital tools to the test, thus defining new ways of integrating large epigraphic collections. As a sample, we use the epigraphic manuscript composed by the learned ecclesiastical antiquarian Giovanni Antonio Astori (Venice, 1672-1743) and preserved in the Marciana National Library in Venice: Marc. lat. XIV, 200 (4336). In the first part of our talk, we analyse the life of the author and the characteristics of his manuscript. In the second part, we focus on the following tasks: a) evaluating the accuracy of eScriptorium on epigraphic manuscripts with training sets of different sizes, in order to estimate the best trade-off between the human effort to prepare the training sets and the human effort to correct the results; b) mapping legacy manual transcriptions onto the manuscript facsimile; c) improving the layout analysis for epigraphic manuscripts.
15:00-15:30	HTR of Handwritten Paleographic Greek Text as a Function of Chronology Paraskevi Platanou, postgraduate student, Athens University of Economics and Business
	Summary: Today classicists have access to a large number of digital tools which, in turn, open up possibilities for further study and new research goals. In this paper, we explore the idea that old Greek handwriting can be made machine-readable and that, consequently, researchers can study the target material quickly and efficiently. The overall aim of this paper is to assess HTR for old Greek manuscripts. To this end, we study and use images of Greek manuscripts from the Bodleian Library, University of Oxford. By manually transcribing images, we have created, and present here, a new dataset for Handwritten Paleographic Greek Text Recognition. The dataset instances are organised by the century to which the manuscript, and hence the image, belongs. In this way, HTR performance can reveal century-specific challenges in Handwritten Paleographic Greek Text Recognition.
15:30-16:00	Break
16:00-16:30	Reconnaissance et extraction d’informations dans des tableaux manuscrits historiques : vers une compréhension des recensements de Paris de l’entre-deux-guerres [Recognition and information extraction in historical handwritten tables: towards an understanding of the interwar Paris censuses] Thomas Constum, PhD student, LITIS EA4108, University of Rouen Normandy
	Summary: The POPP project, Projet d’Océrisation des Recensements de la Population Parisienne (S. Brée et al., 2022), aims to build a large database from the nominative censuses of interwar Paris, each consisting of about 100,000 single handwritten pages in tabular form. To date we have processed the censuses of 1926, 1931 and 1936, which represents a total of about 9 million individuals. This corpus is an essential source of information for historians, demographers, economists and sociologists. The aim of our paper is to describe a complete system for extracting information from historical population censuses. POPP is a project that has brought together researchers in computer vision, pattern recognition and historical demography.
16:30-17:00	Retour d’expériences sur l’utilisation comparée de plusieurs dispositifs de transcription numérique d’archives de fouilles archéologiques [Lessons from the comparative use of several digital transcription systems for archaeological excavation archives] Christophe Tufféry, research engineer at the Institut national de recherches archéologiques préventives, PhD student at CY Cergy Paris Université, in partnership with the Institut national du patrimoine.
	Summary: As part of a doctoral thesis begun in 2019, we propose a historiographical and epistemological study of the effects of digital technology on archaeology and on archaeologists over the last fifty years, a period during which archaeology saw its methods changed by the gradual introduction of microcomputing, starting in the field. This research draws on our experience as an archaeologist since the late 1970s and on our work at Inrap since 2010. We have used several excavation archives, including those of a site where we were a volunteer excavator between 1980 and 1988. We digitised two excavation notebooks. We then transcribed them digitally with three different and complementary technical solutions, including eScriptorium, each of which has technical and methodological advantages and limitations. We were then able to exploit the transcription results with various digital methods and tools.
17:00-17:15	Conclusion of the day
