CuReD: Deep Learning Optical Character Recognition for Cuneiform Text Editions and Legacy Materials

Shai Gordin, Morris Alper, Avital Romach, Luis Sáenz, Naama Yochai, Roey Lalazar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Cuneiform documents, the earliest known form of writing, are prolific textual sources of the ancient past. Experts publish editions of these texts in transliteration using specialized typesetting, but most remain inaccessible for computational analysis in traditional printed books or legacy materials. Off-the-shelf OCR systems are insufficient for digitization without adaptation. We present CuReD (Cuneiform Recognition-Documents), a deep learning-based human-in-the-loop OCR pipeline for digitizing scanned transliterations of cuneiform texts. CuReD has a character error rate of 9% on clean data and 11% on representative scans. We digitized a challenging sample of transliterated cuneiform documents, as well as lexical index cards from the University of Pennsylvania Museum, demonstrating the feasibility of our platform for enabling computational analysis and bolstering machine-readable cuneiform text datasets. Our results provide the first human-in-the-loop pipeline and interface for digitizing transliterated cuneiform sources and legacy materials, enabling the enrichment of digital sources of these low-resource languages.
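
For readers unfamiliar with the metric, character error rate (CER) is commonly defined as the edit distance between the OCR output and the reference text, divided by the reference length. The sketch below illustrates that common definition only; the abstract does not say how CuReD's figures were computed, and the function names and the NFC normalization step are assumptions made for this example.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, on NFC-normalized strings.

    Normalizing is an assumption of this sketch: it keeps a letter with a
    diacritic (common in transliterations) counted as one character.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 9% corresponds to roughly nine character edits per hundred reference characters.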

Original language: English
Title of host publication: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop
Editors: John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Publisher: Association for Computational Linguistics (ACL)
Pages: 130-140
Number of pages: 11
ISBN (Electronic): 9798891761445
State: Published - 2024
Event: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024 - Hybrid, Bangkok, Thailand
Duration: 15 Aug 2024 → …

Publication series

Name: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop

Conference

Conference: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024
Country/Territory: Thailand
City: Hybrid, Bangkok
Period: 15/08/24 → …
