CuReD: Deep Learning Optical Character Recognition for Cuneiform Text Editions and Legacy Materials

Shai Gordin, Morris Alper, Avital Romach, Luis Sáenz, Naama Yochai, Roey Lalazar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Cuneiform documents, the earliest known form of writing, are prolific textual sources of the ancient past. Experts publish editions of these texts in transliteration using specialized typesetting, but most remain inaccessible for computational analysis in traditional printed books or legacy materials. Off-the-shelf OCR systems are insufficient for digitization without adaptation. We present CuReD (Cuneiform Recognition-Documents), a deep learning-based human-in-the-loop OCR pipeline for digitizing scanned transliterations of cuneiform texts. CuReD has a character error rate of 9% on clean data and 11% on representative scans. We digitized a challenging sample of transliterated cuneiform documents, as well as lexical index cards from the University of Pennsylvania Museum, demonstrating the feasibility of our platform for enabling computational analysis and bolstering machine-readable cuneiform text datasets. Our results provide the first human-in-the-loop pipeline and interface for digitizing transliterated cuneiform sources and legacy materials, enabling the enrichment of digital sources of these low-resource languages.
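For readers unfamiliar with the metric, the character error rate (CER) quoted above is conventionally the character-level edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below illustrates that standard definition only. This record does not describe how CuReD's evaluation was implemented, so the function names, the NFC normalization step, and the sample strings are all assumptions made for illustration.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by reference length (hypothetical helper)."""
    # Assumption: normalize to NFC so that a diacritic such as the caron
    # in "š" counts as one character whether or not it was typed as a
    # base letter plus combining mark.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Illustrative strings, not data from the paper: the OCR output drops
# one diacritic, giving 1 error over 9 reference characters.
print(round(character_error_rate("šar-ru-um", "sar-ru-um"), 3))  # 0.111
```

Under this definition, a CER of 9% corresponds to roughly nine character edits per hundred reference characters. Transliteration relies heavily on diacritics and sub- or superscripts, so how such characters are normalized before counting can noticeably shift the figure.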

Original language: English
Title of host publication: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop
Editors: John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Publisher: Association for Computational Linguistics (ACL)
Pages: 130-140
Number of pages: 11
ISBN (Electronic): 9798891761445
Publication status: Published - 2024
Event: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024 - Hybrid, Bangkok, Thailand
Duration: 15 August 2024 → …

Publication series

Name: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop

Conference

Conference: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024
Country/Territory: Thailand
City: Hybrid, Bangkok
Duration: 15/08/24 → …
