Unlocking history: AI project turns Ottoman archives into modern Turkish

A multidisciplinary team of computer scientists, historians, and linguists streamlines the translation of Ottoman manuscripts into modern Turkish by using artificial intelligence.

Ottoman Turkish was a language written using a Turkish form of the Arabic script between the 13th and 20th centuries, containing a great deal of Arabic and Persian expressions. / Photo: AA Archive

State archives, libraries and private collections hold millions of documents from the Ottoman period, including books, journals, newspapers, notebooks, records and other material written in Ottoman Turkish — leaving a centuries-old historical heritage waiting to be uncovered.

A new initiative, known as “Artificial Intelligence-Assisted Ottoman-Turkish End-to-End Translation”, aims to address this need, sparing readers the significant time it takes to learn Ottoman Turkish from scratch.

Osmanlica.com, an initiative that began as a doctoral thesis project by Dr Ishak Dolek under the supervision of Associate Professor Dr Atakan Kurt of the Istanbul University-Cerrahpasa Computer Engineering Department, has achieved a 96 percent success rate in Ottoman Optical Character Recognition (OCR), which can be considered the first step in transferring Ottoman sources into Modern Turkish.

“We possess a vast archive involving approximately a hundred million pages from the Ottoman era. However, the challenge lies in the fact that people cannot read and comprehend these archives due to their language being different from modern Turkish,” Atakan Kurt tells TRT World.

“This stands as one of the foremost challenges confronting our people,” he says.

Language revolution

Ottoman Turkish was a language written using a Turkish form of the Arabic script between the 13th and 20th centuries, containing a great deal of Arabic and Persian expressions.

In 1928, five years after the Republic of Türkiye was founded, the country experienced a language revolution. It rapidly shifted from the Arabic alphabet to a Turkish alphabet based on the Roman script, which is still in use today. A substantial number of foreign elements were also removed from the language during this period.

Kurt says that the European Union has used computer programmes to convert its historical manuscripts, written since the Middle Ages, into editable text.

“Because in Europe there is no big difference between the languages of the Middle Ages and the languages of today, they just convert these printed and manuscript texts — old newspapers, books, letters, manuscripts — from image files to editable texts, and share them,” he noted.

Three-step solution

When it comes to Ottoman Turkish, Kurt says that they faced two additional problems.

“Firstly, the alphabet in our texts is different from the one we use today. Secondly, the language is also different. Even if we translate the letters, people do not understand the language used about one or two centuries ago. Even the language used fifty years ago is almost incomprehensible nowadays.”

“In other words, the language used at that time is like a foreign language now. That is why we also have to translate the language of the documents into modern Turkish.”

In Osmanlica.com, Ottoman documents are converted to Modern Turkish in three steps. Firstly, Ottoman OCR (Optical Character Recognition), ie, converting image to editable text; secondly, Ottoman-Turkish alphabet transliteration; and thirdly, translation of Ottoman Turkish into Modern Turkish.
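The three steps described above can be pictured as a simple chain in which each stage feeds the next. The sketch below is only an illustration of that structure, not Osmanlica.com's actual software: the names `run_pipeline`, `toy_ocr`, `toy_transliterate` and `toy_translate` are hypothetical, and the stages are one-word lookup tables standing in for the deep learning models the article mentions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    ottoman_text: str    # step 1 output: editable text in Arabic script
    latin_text: str      # step 2 output: the same words in Latin letters
    modern_turkish: str  # step 3 output: Modern Turkish wording


def run_pipeline(
    image: bytes,
    ocr: Callable[[bytes], str],
    transliterate: Callable[[str], str],
    translate: Callable[[str], str],
) -> PipelineResult:
    """Chain the three stages; each one consumes the previous output."""
    ottoman_text = ocr(image)
    latin_text = transliterate(ottoman_text)
    modern_turkish = translate(latin_text)
    return PipelineResult(ottoman_text, latin_text, modern_turkish)


# Toy stand-ins (assumptions for illustration only).
def toy_ocr(image: bytes) -> str:
    # Pretend the "image" bytes already contain the recognised text.
    return image.decode("utf-8")


def toy_transliterate(text: str) -> str:
    # One-entry table: Arabic-script spelling -> Latin-script spelling.
    return {"مكتب": "mekteb"}.get(text, text)


def toy_translate(text: str) -> str:
    # One-entry table: Ottoman-era word -> Modern Turkish word ("school").
    return {"mekteb": "okul"}.get(text, text)


result = run_pipeline(
    "مكتب".encode("utf-8"), toy_ocr, toy_transliterate, toy_translate
)
print(result)
```

Keeping the intermediate outputs in the result reflects the point Kurt makes: transliterating the letters and translating the language are separate problems, so each stage can be checked on its own.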

Each of these three steps is a technically complex problem demanding heavy resources in Natural Language Processing (NLP, ie, a computer’s ability to use and understand spoken and/or written language akin to a human) and Deep Learning (an AI method that teaches computers to process data in a way similar to the human brain).

In order to achieve this, Atakan Kurt and his partner Ishak Dolek have established a company, called Mina Arge, and developed the OCR project as the first step.

After the successful completion of the OCR project, the company is currently developing the second stage, the Ottoman-Turkish alphabet transliteration, with the support of KOSGEB, the Small and Medium Enterprises Development Organisation, and TUBITAK, the Scientific and Technological Research Council of Türkiye.

Interdisciplinary study

The company, which has already achieved 75 percent accuracy in alphabet transliteration, continues its research and development activities with a group of computer scientists, linguists and historians, aiming to reach an accuracy rate of 95 percent in this application.

“To conduct these studies effectively, it requires more than just one PhD student; you need two distinct groups collaborating. One group comprises computer scientists, while the other consists of experts in history and language. This constitutes an interdisciplinary study,” Kurt noted.

Adile Ozgunay, one of the historians employed as an expert on the project, said she has been working on Ottoman Turkish for about 11 years. “I had the chance to closely observe how much effort and time the field needs. For the past two years, we've been pouring our efforts and faith into this project.”

Ozgunay said that “graduate and PhD students engaged with the Ottoman Archives spend a considerable amount of time in translation and transliteration as part of their academic study. This project will allow the researchers to devote more of their time to their research and much less to translation.”

“The most significant project of the century”

“I estimate that there are over a hundred million Ottoman archives located abroad. Even institutions like the University of Toronto in Canada possess at least a thousand books written in Ottoman Turkish. Additionally, numerous Ottoman documents can be found in the Balkans, the Middle East, and even some countries in Africa,” Kurt said.

Ozgunay also stated that the rapid adoption of AI in the social sciences allows academics to expand into interdisciplinary fields. Scholars have begun integrating other technologies, such as mapping and relationship analysis, into their research, she added.

In addition to benefiting academics, the project will also help people who are unable to read Ottoman Turkish, but wish to read documents such as property deeds, letters from their ancestors, or handwritten notes on the back of a photograph, she noted.

“I believe that once we achieve all stages of this project, it will stand out as the most significant project of the century within the realm of social sciences in Türkiye,” Kurt emphasised.
