Machine Learning for Ancient Languages ACL 2024 Workshop
15 August 2024
Florian Kessler will present a paper entitled “Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT” at the ML4AL@ACL workshop on August 15, 2024, in Bangkok.
For the automatic processing of Classical Chinese texts it is highly desirable to normalize variant characters, i.e. characters with different visual forms that are being used to represent the same morpheme, into a single form. However, there are some variant characters that are used interchangeably by some writers but deliberately employed to distinguish between different meanings by others. Hence, in order to avoid losing information in the normalization processes by conflating meaningful distinctions between variants, an intelligent normalization system that takes context into account is needed. Towards the goal of developing such a system, in this study, we describe how a dataset with usage samples of variant characters can be extracted from a corpus of paired editions of multiple texts. Using the dataset, we conduct two experiments, testing whether models can be trained with contextual word embeddings to predict variant characters. The results of the experiments show that while this is often possible for single texts, most conventions learned do not transfer well between documents.
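To make the dataset-extraction idea concrete, the following is a minimal sketch of how usage samples might be collected from two editions of the same text. It is not the method used in the paper: the function name `extract_variant_samples`, the `window` parameter, and the use of Python's standard `difflib` for character alignment are all assumptions made for illustration. It also treats every one-to-one character substitution as a candidate variant, which a real pipeline would need to filter, since editions can differ for reasons other than variant characters.

```python
from difflib import SequenceMatcher


def extract_variant_samples(edition_a, edition_b, window=5):
    """Collect one-to-one character substitutions between two editions.

    Hypothetical helper: each sample records the character used in each
    edition plus a few characters of context taken from edition_a.
    """
    samples = []
    # autojunk=False keeps frequent characters from being ignored in long texts
    matcher = SequenceMatcher(None, edition_a, edition_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # Only equal-length replacements are treated as candidate variants;
        # insertions, deletions, and uneven replacements are skipped.
        if tag != "replace" or (a1 - a0) != (b1 - b0):
            continue
        for offset in range(a1 - a0):
            i = a0 + offset
            j = b0 + offset
            samples.append({
                "char_a": edition_a[i],
                "char_b": edition_b[j],
                "left": edition_a[max(0, i - window):i],
                "right": edition_a[i + 1:i + 1 + window],
            })
    return samples


# Two renderings of the same line that differ in a single character.
samples = extract_variant_samples("學而時習之不亦說乎", "學而時習之不亦悅乎")
for s in samples:
    print(s["left"], s["char_a"], "/", s["char_b"], s["right"])
```

Samples of this shape (a character pair plus its surrounding context) are the kind of input a context-aware classifier could be trained on, for example by feeding the context through a contextual embedding model and predicting which form an edition uses.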
If you cannot make it to Bangkok, online participation is also possible. You can find the entire program here.