Workshop Report “Non-Latin Scripts in Multilingual Environments: research data and digital humanities in area studies”

This is the English version of the text published in DHd-Blog – Digital Humanities im deutschsprachigen Raum: https://dhd-blog.org/?p=10669.

Report written by Esther Asef and Dr. Cosima Wagner with contributions by Martin Lee, translation by Sean Nowak

What are the requirements and demands with regard to the design of a (national) research-data infrastructure from a humanities point of view? This question is currently being thoroughly discussed – not least at the DHd Alliance’s initiative – and documented in position papers.[i]

In this context, a workshop organized by the BMBF research project FDM_OAS-Orient[ii] on 3 July 2018 at the Campus Library of Freie Universität Berlin addressed matters of research data and digital humanities in area studies and, more specifically, non-Latin scripts (NLS) in multilingual environments. Twenty-seven researchers, IT and data experts, and librarians came together from all over Germany[iii] to discuss challenges and demands relating to the creation, processing, analysis, archiving and re-use of NLS research data in general and, in particular, with regard to the development of a national research-data infrastructure (NFDI).

The main objective of the workshop was to survey the different problems in dealing with non-Latin-script data and to consolidate the corresponding demands of research projects. The workshop offered the opportunity to share experience concerning the suitability of software for NLS and to jointly outline possible solutions.

The workshop started with a brief introduction to the policy context of the requirements that research-data management has to meet (such as the DFG (German Research Foundation) requirements for externally funded projects, or a letter from the DFG and the RfII (German Council for Scientific Information Infrastructures) to academic communities asking them to position themselves on the topic of research-data management). Participants then discussed what precisely falls under the term “research data” in their respective fields of work and how NLS enter into the picture. The discussion was based on examples of project data provided by the participants. In the afternoon, the “Pro Action Café” method was used: participants divided into four groups, each of which gathered around a table to delve deeper into one of four topics, namely, 1) infrastructure, 2) digital tools, 3) technical requirements, and 4) teaching and professional training. The groups took three cycles to formulate challenges, demands, and next steps.

The following paragraphs summarize some important discussion findings and results of the joint work:

The general understanding at the workshop was that, in the respective academic communities, the discussion around “research data in the humanities and social sciences” has not yet begun or has only just begun, and that position papers such as the paper of the Association of German Historians are not yet available and are certainly not yet being implemented. Nor is there a common understanding of what is meant by the very term “research data” in the various communities.[iv] However, NLS are definitely an overarching factor in the respective requirements for subject-specific data management.

Using project examples, the participants explored the difficulties of managing heterogeneous data (text corpora, archive materials, audiovisual data, metadata, etc.) in general and of using NLS in particular. The following difficulties were mentioned: As a rule, software, information systems, and information infrastructure support NLS only to a limited extent or not at all. First, there are limitations in rendering different scripts, especially when different writing directions are involved (left-to-right, right-to-left, top-to-bottom). Second, discovery and retrieval are problematic because search algorithms are not optimized for non-Latin scripts: they lack functionality such as mapping between different character systems, handling of transcriptions, recognition of variants, and tokenization. Another problem noted is that of complex or very rare characters in East Asian studies, oriental studies and ancient studies that are not (yet) defined in Unicode. Even in cases where Unicode is used, applications are often restricted to the Basic Multilingual Plane, i.e., the plane of Unicode in which most characters of modern languages are encoded. However, the “rare” CJK characters, which are important for research based on historical Asian sources, are contained in the Supplementary Ideographic Plane and can only be represented if the application supports that plane. To date, tools and systems require time-consuming and expensive extensions before they can be used with non-Latin scripts. Such adaptations are seldom published, however, and therefore cannot be discovered or reused.
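The restriction to the Basic Multilingual Plane can be illustrated with a short Python sketch. It is not taken from any project presented at the workshop; the function name and the sample string are assumptions made for illustration. It lists the characters of a text that lie outside the BMP and shows why software that counts 16-bit code units instead of characters tends to mishandle them.

```python
def supplementary_chars(text):
    """Return (character, code point, plane) for every character outside the BMP."""
    found = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:  # the BMP ends at U+FFFF
            # The plane number is the code point divided by 0x10000.
            found.append((ch, f"U+{cp:04X}", cp >> 16))
    return found


# Illustrative sample: two common CJK characters plus U+20000, the first code
# point of CJK Unified Ideographs Extension B (plane 2, the Supplementary
# Ideographic Plane).
sample = "漢字\U00020000"

print(supplementary_chars(sample))

# Python counts code points, while UTF-16 needs a surrogate pair (two code
# units) for every character outside the BMP.
print(len(sample))                           # number of code points
print(len(sample.encode("utf-16-le")) // 2)  # number of UTF-16 code units
```

A check of this kind could be run over a corpus before choosing a database, font, or search system, in order to see which supplementary-plane characters that system would have to support.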

Other experience discussed had to do with semantic curation technologies such as pattern recognition, deep learning, OCR and HTR, which in many areas have not yet been developed for NLS or are not as well developed as for Latin script. Print quality, scan quality and the mixing of different scripts in one document further diminish OCR accuracy. Workshop participants stressed the necessity of raising awareness of these diverse challenges among the IT staff involved in the relevant areas so that they can optimize existing tools, infrastructures and search engines, and actively share the extensions and optimizations they develop. Stronger networking among digital-humanities projects would facilitate the exchange of experience and thus make research processes more efficient. In this context, it was discussed whether and how the FIDs (Specialised Information Services; the heads of the CrossAsia FID and of the Middle East, North Africa and Islamic Studies FID were present) could serve as hubs of such networking or as central stores for findings and tools for DH projects / NLS.

Participants also addressed the lack of standardization in the application of metadata, the use and development of interfaces, and the definition of exchange formats, which still contributes to the lack of data visibility and reusability. Consequently, participants consider uniform metadata elements for multilingual materials and for linguistic specifics (e.g., the transcription system used) just as useful for describing content as standardized vocabularies and taxonomies. It was pointed out, however, that national standards (in all regions, including the regions where the data originate) should be aligned with international standards (such as ISO and Unicode).

In fact, the area studies representatives among the participants (researchers in Egyptology, Ancient Near Eastern Archaeology, Japanese Studies, Jewish Studies, Sinology), who frequently collaborate with partners in the relevant regions and use data from those regions, identified the international interoperability of their research (data) as a basic prerequisite for a functioning academic discourse with the specialist community. Multilingual authority data and semantic data linking would improve access to and re-use of non-Latin-script research data and, in the case of research projects with partners in the regions, would make such access and re-use possible in the first place. It was suggested that, in the future, the metadata should document what software, packages, workflows, and metadata scheme were used so as to make the data understandable and re-usable for others. A corresponding guideline could be compiled within the NLS network and disseminated to the respective disciplinary communities for further refinement.
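What such documentation in the metadata might look like can be sketched in Python. This is only an illustration: the field names, the scheme identifier, and the software entries are hypothetical placeholders and do not come from any existing metadata standard or from the workshop itself. The language tags follow BCP 47, with "ja-Latn" marking Japanese written in Latin script.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Title:
    text: str
    lang: str                 # BCP 47 language tag, e.g. "ja" or "ja-Latn"
    transcription: str = ""   # name of the transcription system, if any


@dataclass
class DatasetRecord:
    # All field names are hypothetical and serve only as an example.
    titles: list
    metadata_scheme: str
    software: list = field(default_factory=list)
    workflow: str = ""


record = DatasetRecord(
    titles=[
        Title("源氏物語", "ja"),
        Title("Genji monogatari", "ja-Latn", "Hepburn"),
    ],
    metadata_scheme="hypothetical-project-schema-1.0",  # placeholder identifier
    software=["Python 3.11", "lxml 5.2"],               # placeholder entries
    workflow="Scans transcribed manually, then encoded as XML",
)

# ensure_ascii=False keeps the original-script title readable in the output
# instead of replacing it with \uXXXX escape sequences.
print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

Recording the original-script form and the transcribed form side by side, together with the name of the transcription system, lets a later user see which form is the source and how the other was derived.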

As the participants’ short presentations of project data showed, research data in area studies are very heterogeneous owing to the multidisciplinary approaches involved (philological, empirical, ethnological, media-studies, historical, etc.). Formats include digital copies (of archive materials, for instance), texts, images, films, audio and video game data, dynamic data and databases. Participants from the archaeological disciplines in particular emphasized that many of their digitally recorded research objects are the only remaining testimony of cultural assets that have been damaged or destroyed and must be permanently secured as sources (beyond the 10 years indicated in the DFG recommendations for good academic practice[v]). Long-term archiving is therefore viewed as an important factor for high-quality research-data management in the relevant disciplines.

Another question taken up was how to preserve applications built on research data. Beyond generating static data, research in area studies, like research in other disciplines of the humanities and social sciences, produces data applications such as digital editions, not least thanks to the increase in digital-humanities projects. However, such dynamic data cannot usually be stored in repositories at the researchers’ own universities; new solutions for preserving them are therefore needed. Here, reference was made to the statement of principles produced by the data centers working group of the DHd Alliance (Digital Humanities in the German-speaking Area), which lists some existing data centers relevant for publishing dynamic data and outlines recommendations and prospects for the future.[vi]

As concerns research-data repositories, it was observed that the majority of them have not yet been adapted for NLS. It is usually possible to publish data, but important information cannot be represented within the metadata formats in place. The most serious obstacle to finding research data in non-Latin scripts, however, is that the search algorithms of the common search engines and discovery systems used for repositories have not been adapted to multilingual content. The data thus remain invisible despite publication, which makes such (institutional) repositories unattractive to researchers.
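One small part of such an adaptation, matching transliterated titles regardless of their diacritics, can be sketched with Python's standard unicodedata module. This is a minimal illustration under stated assumptions, not a description of how any particular discovery system works: the function name and the sample titles are made up for the example, and mapping between scripts (for instance between an original-script title and its romanization) is not covered.

```python
import unicodedata


def search_key(text):
    """Build a lookup key that ignores case and diacritics on Latin letters."""
    out = []
    base_is_latin = False
    # NFKD splits letters such as "ō" into a base letter plus a combining mark.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # drop the diacritic of a Latin base letter
        else:
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
        out.append(ch)
    # Recompose whatever was kept, so that marks in other scripts
    # (for example Japanese voicing marks) stay attached to their letters.
    return unicodedata.normalize("NFC", "".join(out)).casefold()


# Illustrative titles in scholarly transliteration.
titles = ["Tōkyō monogatari", "Ṣaḥīḥ al-Bukhārī"]
index = {search_key(title): title for title in titles}

# A query typed without diacritics still finds the record.
print(index.get(search_key("tokyo monogatari")))
print(index.get(search_key("Sahih al-Bukhari")))
```

Diacritics are removed only after Latin base letters because in many other scripts the combining marks carry meaning, and stripping them everywhere would merge distinct words.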

Finally, teaching and professional training were also discussed as important levers for changing research-data-management practices. It was found that students’ methodological competence in the field of digital tools is still almost non-existent and that research-data management has so far rarely played a role in final and doctoral theses or in other research and teaching activities. Participants also mentioned the necessity of dealing specifically with NLS when teaching information literacy and search strategies.

It was pointed out that, when skills in the field of “digital tools” / DH projects are needed, it is currently mainly the personal commitment of temporarily employed academic staff that is relied on. It is they who, in addition to fulfilling their actual qualification tasks in research projects, familiarize themselves with digital basics such as programming languages and digital methods so that these can be taught or applied competently. Another factor observed is that there are no “safe spaces” with restricted access in which research data and digital tools could be made available to students for practice. Participants also called for training programs tailored to academics in the various disciplines; training should be institutionalized and expanded continuously as a targeted way of disseminating DH skills both generally and, more specifically, with a focus on DH tools for NLS. Furthermore, researchers expressed a need for information and advice on new ways of publishing research results (enhanced publication, open access, etc.).

The workshop ended with reflections on the extent to which IT skills could be introduced as an integral part of Bachelor’s and Master’s degree curricula in the humanities and social sciences.

Conclusion

Workshop participants agreed that one important next step would be strengthening the networking activities among themselves in order to promote knowledge sharing on the specifics of the management of non-Latin-script data as well as a closer cooperation in developing solutions. They decided that there should be further workshops, and they agreed on the creation of a mailing list, through which experience with software and self-developed code, or subject-specific curation guides could be shared. The list “nicht-lateinische-schriften” has been set up, to which anyone interested in the NLS network can subscribe at this page: https://lists.fu-berlin.de/listinfo/nicht-lateinische-schriften

As another way of making existing solutions visible, participants discussed the creation of a curated website, where information on digital tools and on software developments and extensions for working with NLS would be collected.[vii] A central information point for DH tools and applications would not only heighten the visibility of new research opportunities; it would also greatly reduce the time that projects spend in their planning phases searching for suitable software and for non-Latin-script solutions, and it would avoid redundant developments.

Another thing to consider is the creation of guidelines for research-software development that would reflect the specific technical challenges that need to be addressed in the context of NLS. These could be supplemented with generic solutions (e.g., code segments) made available for further use.

Such tool collections and manuals could be developed by the respective disciplinary groups in the NLS network and disseminated by institutions, disciplinary communities and specialist information services – e.g., on the portals provided by the participating FIDs.

Finally, there should be a discussion in the disciplinary communities about accrediting production and publication of software, extensions, metadata schemes, mapping tables, etc., as academic achievement.

Two working groups emerged from the workshop, one dealing with the formulation of a template for disciplinary communities’ position papers and one that has developed a joint project proposal within the framework of the recently announced BMBF funding line for research projects covering development and testing of curation criteria and quality standards for research data as part of digital change in the German academic system.

The next steps will be a further workshop in early summer of 2019 and a topical issue “Digital Humanities/ Forschungsdatenmanagement und nicht-lateinische Schriften” of the open-access journal 027.7 Zeitschrift für Bibliothekskultur.

Workshop led by: Martin Lee
(Co-) Moderators: Esther Asef, Dr. Andreas Gräff, Dr. Cosima Wagner

Contacts for anyone interested in the NLS network:
Freie Universität Berlin
Campusbibliothek
E-mail: fdm@campusbib.fu-berlin.de
Website: https://www.fu-berlin.de/sites/campusbib/bibliothek/Forschungsdatenmanagement
Non-Latin-scripts mailing list: https://lists.fu-berlin.de/listinfo/nicht-lateinische-schriften

Languages/ Scripts covered (as of July 2018):
Akkadian
Ancient Egyptian (all language stages except Coptic)
    Hieratic
    Abnormal or cursive hieratic
    Cursive hieroglyphs
    Hieroglyphs
Arabic
Bengali
Chinese (hant / traditional and hans / simplified)
German
English
French
Hattic
Hebrew
Hittite
Hindi
Hurrian
Japanese
Yiddish
Korean
Luwian
Manchu
/chem. special characters
Nepali
Palaic
Persian
Russian
Sanskrit
Turkic

Notes

[i] For an overview of available research-data-management position papers in the humanities, see: https://forschungsinfrastrukturen.de/doku.php/positionspapiere

[ii] Grant reference: 16FDM022. Duration: 1 April 2017 to 30 September 2018. Funding measure “Erforschung des Managements von Forschungsdaten in ihrem Lebenszyklus an Hochschulen und außeruniversitären Forschungseinrichtungen.” https://www.bmbf.de/foerderungen/bekanntmachung-1233.html

Project page on the Campus Library website: https://www.fu-berlin.de/sites/campusbib/bibliothek/Forschungsdatenmanagement/16fdm022.html

[iii] Participants came from: Berlin, Essen, Erlangen, Frankfurt am Main, Halle, Hamburg, Heidelberg, Leipzig, Mainz, Potsdam, Tübingen, Würzburg.

[iv] For a critical analysis of the term “research data” from a humanities perspective, see: Fabian Cremer, Lisa Klaffki & Timo Steyer (2018): “Der Chimäre auf der Spur: Forschungsdaten in den Geisteswissenschaften.” o-bib. Das offene Bibliotheksjournal / edited by VDB, 5(2), 142-162 https://doi.org/10.5282/o-bib/2018H2S142-162

[v] Deutsche Forschungsgemeinschaft (2013): Sicherung guter wissenschaftlicher Praxis. Available online at http://www.dfg.de/download/pdf/dfg_im_profil/reden_stellungnahmen/download/empfehlung_wiss_praxis_1310.pdf, last accessed on 2 August 2018.

[vi] See: DHd AG Datenzentren: Geisteswissenschaftliche Datenzentren im deutschsprachigen Raum – Grundsatzpapier zur Sicherung der langfristigen Verfügbarkeit von Forschungsdaten (Version 1.0). Zenodo, 3 February 2018. Link: http://doi.org/10.5281/zenodo.1134760

[vii] An exemplary collection is available on the Freie Universität Berlin Campus Library website: https://www.fu-berlin.de/sites/campusbib/bibliothek/Forschungsdatenmanagement/tools-os/index.html
