Kalevalaic poetry as a digital corpus

by Jukka Saarinen,
the Finnish Literature Society, Helsinki

The Kalevala, the epic compiled by Elias Lönnrot, is based on the ancient Baltic-Finnic poetry in the trochaic tetrameter subsequently known in Finland as Kalevalaic poetry. It is not, however, a single genre in itself since the metre has been widely cultivated in various genres: epic, lyric songs, wedding songs, incantations, riddles, proverbs, and so on.

Kalevalaic poetry played a major role in the establishment of a specifically Finnish culture and the Finnish nation. Lönnrot’s Kalevala was a significant attempt to create a national history and mythology founded on tradition. Some oral poetry in the ancient metre had been collected and published even before the Kalevala, but the volume of material was greatly augmented as the result of the work of Lönnrot and later collectors. The bulk of the poetry has been preserved in the collections of the Finnish Literature Society since around the middle of the 19th century. Ever since the Kalevala was published, and even before, scholars have been proposing the publication of the texts as they were actually noted down from the original informants. The aim initially was to verify the authenticity of the Kalevala or to check whether the collation was correct. From the late 19th century onwards people began to realise that, rather than the Kalevala, the original texts were focal sources for the study of ancient poetry and through it of archaic Finnish-Karelian culture, history and religion. The outcome was the birth of Finnish folkloristics. The observations on the way Kalevalaic poetry changed in travelling from one locality to another were fundamental in the establishment of the geographical-historical or Finnish school. Projects were launched aiming at the scientific publication of the original texts, and by the end of the century two volumes entitled “Kalevalan toisintoja” (“Les variants de Kalevala”) had been published, one edited by Julius Krohn (1888) and the other by A. A. Borenius (1895). The editorial principles were, however, found to be too complex and the project was soon aborted. A new start was made at the very beginning of the 20th century, the aim being to publish a complete collection of all poetry in the Kalevalaic metre. The first volumes, containing Viena Karelian epic and edited by A. R. Niemi, appeared in 1908 and served as a model for the volumes dedicated to other regions published over the next 40 years. The result was an extensive collection of Kalevalaic poetry preserved in archives: Suomen Kansan Vanhat Runot (“The Ancient Poems of the Finnish People”), commonly known for short as SKVR. The series is in 14 regional sections (33 volumes, 1908–1948) with the addition in 1997 of a volume containing texts by four leading collectors not previously published. The complete SKVR runs to over 27,000 pages and more than 86,000 items; including the different versions of the same text by one informant, the number of items totals well over 100,000.

Although the volumes of SKVR were published over a long period of time and by many editors, in a process that was at times laborious and far from smooth,1 they all observed the same underlying editorial principles and layout, with only minor variations.

The main editorial principles were as follows:

1. The texts are arranged into subcategories under the main generic headings (epic, lyric poetry, occasional poetry, incantations), and finally specific “poems” (a particular song, incantation, rhyme, etc.). The variants of each specific poem are arranged geographically.

2. Every effort was made to publish every available text in the Kalevalaic metre; texts that were regarded as not “genuine archaic poetry”, i.e. learnt from books, were in some cases excluded.

3. The utmost precision was observed in publishing collectors’ notes. The texts are published as such, preserving all their special features, diacritic characters, additions and corrections. Any additions by the editors are placed in square brackets [ ].

4. Each text is accompanied by background information using a standard format, such as the source locality, collector and performer (the “metadata”, to use the modern terminology).

Thanks to the standard editorial principles and the accuracy of the copying (the texts were, for example, proofread several times, comparing them with the original texts), it has been possible for scholars to use SKVR as a source without having to refer to the original notes. SKVR further assigns each text a number of its own that can be used as such as a source reference.

The Finnish Literature Society began digitalising SKVR in 1998. The conversion of the texts into digital format by scanning and OCR (Optical Character Recognition) was assigned to an Estonian team as Estonia already had experience of such work. The task was to produce a version that was character-by-character identical with SKVR; this is now being processed to form a serviceable text corpus in the Folklore Archives of the Finnish Literature Society. The leader of the project at the Eesti Kirjandusmuuseum in Tartu was Arvo Krikmann and the assignment took two years to complete.

The Folklore Archives also contain a large volume (over 60,000 texts) of Kalevalaic poetry that is not included in SKVR, most of it collected after the regional volumes had already been published. This material is being digitised in Tartu in the course of 2001 and will be appended to the SKVR corpus.

The objective of the original SKVR was to produce source material that was readily accessible to scholars of archaic Finnish poetry. Researchers no longer needed to rely on the archives and to handle vulnerable manuscripts that were difficult to read and sometimes in poor condition. The individual researcher could borrow or buy the whole series at a time and work on it in his or her own study. Similar reasons once again prompted the digitalisation of SKVR, because manual browsing through the entire, vast series is time-consuming and laborious. Nor can researchers today necessarily afford to buy the whole series, and some of the volumes are sold out. Now, researchers can access the entire opus on screen and conduct more efficient searches.

The digital text is no more than a character string. In order to facilitate various search functions, to view texts in a logical format or to exchange data with other computers, the text must be converted into a compatible form. This process is called structurising and the method used to do it markup. Texts can be marked up by, for example, inserting tags to indicate the beginning and end of a given type of data.
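As a minimal sketch of what inserting tags means in practice, the following Python fragment wraps plain lines in start and end tags. The function name mark_up is an illustrative assumption and not part of the SKVR project's tooling; the tag name v follows the sample item shown later in this article.

```python
from xml.sax.saxutils import escape


def mark_up(lines, tag):
    """Wrap each line in <tag>...</tag>, escaping XML-special characters."""
    return [f"<{tag}>{escape(line)}</{tag}>" for line in lines]


# Two verses as a bare character string, with no structure of their own.
raw_verses = ["elä immeiseen tähän", "ennee nosta muhkii!"]

marked = mark_up(raw_verses, "v")
print("\n".join(marked))
```

Once the beginning and end of each verse are marked in this way, a program can tell verses apart from prose or metadata without relying on typography.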

In the printed SKVR, items are marked by typographical means. The human brain can, by combining visual and semantic elements, distinguish between the metadata, the original manuscript text, and the name of the collector or informant.

The images produced by scanning have been converted into text by OCR and the typographical information has been preserved by saving it as a Word file. The preservation of features is, however, not an end in itself but a means of enabling the computer to add structurising tags to the text. The first stage – structurising and marking up the typographical features of SKVR – is being carried out in Estonia. The structure and markup will then be verified and supplemented in Finland.

The potential of common word processors for marking up a corpus of text is limited and in most cases program-specific. In transferring from one program to another, typographical signs such as type of text, line changes, division into paragraphs, etc., no longer have universal application. Special characters are a major problem. A researcher working on a text over a long period of time must always use a particular program and even version of a program, and information will probably be lost when a text is transferred from one system to another.

Digitised texts must therefore be saved in some standardised format that is not program-specific and must be marked up in some standardised way. The SKVR project therefore uses XML (Extensible Markup Language), a subset of SGML (Standard Generalized Markup Language). XML is a platform-independent way of marking up, saving and transferring data in text form.2

XML is a universal, growing standard. Being platform-independent and well-supported, it is a safe way of storing data for the future. Another reason for choosing XML is its flexibility. The elements, their attributes, the relationships between the elements and the entities can be defined by creating a document type definition (DTD) that can then be used to check that text created by scanning, OCR and mechanical structurising is structurally correct (all the vital elements are present and in the right places and there is nothing “extra” in the text). Many standard DTDs exist, such as TEI (Text Encoding Initiative) for coding digital texts for purposes of research.3 The use of a standard DTD would enhance the corpus by improving its comparability and transferability to other corpora and by providing ready software for accessing texts. The uniform principles observed in the editing of SKVR and the accuracy and precision of the work have proved to be fundamental assets in converting the work to digital format. The uniform structure and standard markup methods mean that the initial structurising of the texts can be done by computer, thus appreciably reducing the time spent doing this. In view of this, we have decided that we will initially use a document type definition of our own. It is possible and even probable that we may at some point in the future transfer and convert to some standard DTD, but this will require additional manual marking up and text analysis – no mean undertaking in a corpus of over 100,000 text units.
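The kind of check a DTD makes possible can be illustrated with a small Python sketch. This is a simplified stand-in rather than true DTD validation: Python's standard ElementTree parser does not validate against a DTD, and the content model below is a hypothetical one, loosely modelled on the sample item shown later in this article and not the project's actual definition. It checks only which elements are present, not their order.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model (an assumption, not the real SKVR DTD).
REQUIRED_CHILDREN = {
    "item": ["meta", "text"],
    "meta": ["id", "loc", "col", "sgn", "tmp"],
}
ALLOWED_CHILDREN = {
    "item": {"meta", "text", "refs"},
    "meta": {"id", "loc", "col", "sgn", "tmp", "inf"},
    "text": {"l", "v"},
}


def check_item(xml_string):
    """Return a list of structural problems; an empty list means none found."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    if root.tag != "item":
        problems.append(f"unexpected root element <{root.tag}>")
    for parent in root.iter():
        child_tags = [child.tag for child in parent]
        # Vital elements must be present.
        for required in REQUIRED_CHILDREN.get(parent.tag, []):
            if required not in child_tags:
                problems.append(f"<{parent.tag}> lacks <{required}>")
        # Nothing "extra" may appear; unlisted elements may have no children.
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for tag in child_tags:
            if tag not in allowed:
                problems.append(f"<{parent.tag}> contains unexpected <{tag}>")
    return problems
```

Run over every item produced by mechanical structurising, a check of this kind would flag the items that need manual attention.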

All the 400 or more special characters appearing in SKVR are being preserved. This will ensure that the digital SKVR can be used as such as source material for research, in preparing publications and in other demanding assignments. The special characters are being coded according to the Unicode system, which so far runs to over 90,000 characters.4
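To illustrate what coding a character according to Unicode involves, the short Python sketch below lists the code point and official name of each non-ASCII character in a place name, and then writes the same string using XML numeric character references. The helper name describe is an assumption made for this example; the sample covers only an ordinary Finnish letter, not the rarer phonetic characters of SKVR.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


sample = "R\u00e4\u00e4kkyl\u00e4"  # the parish name Rääkkylä
for ch, code, name in describe(sample):
    print(ch, code, name)

# The same string written with XML numeric character references,
# which keeps the file pure ASCII without losing any character.
ascii_safe = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_safe)
```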

The SKVR volumes comprise 1) source references and an introduction outlining the editing principles, 2) a table of contents that at the same time constitutes the volume’s motif index, and 3) the main body of the volume, the actual texts. Apart from the numbered texts, the main body contains almost nothing but headings for poetic genre, subgenre and individual poem type, and these are also listed in the contents. Some volumes also have an index of informants.

The numbered texts are, in the digitised SKVR, in a corpus of their own separate from the rest of the text. The introductions and indexes have also been digitalised, but they are used separately, parallel to the text corpus. Each numbered text in the corpus constitutes a single item that always has the same, recurring structure. Each item is divided hierarchically into elements and has three main constituents: metadata, text and references.

The metadata consists of the text number, topographical data, the name of the collector, the archive reference, the date of collection, and other information such as, in many cases, the name of the informant. The structure is the same from one unit to another. The metadata is formatted by the editors of the publication, either according to the original source or sometimes by inference. Unfortunately the source is not always clearly stated.

Although the metadata supplied by SKVR is for the most part very consistent, there is some variation in, for example, the orthography of people’s names and place names (abbreviations), and the writing of dates (far from the “yyyy-mm-dd” format). One particular obstacle to systematic searching and indexing is posed by the informants’ names, for it is impossible for the computer to distinguish between these and other-information elements. A more systematic, revised version of the metadata may prove necessary for computerised data retrieval.
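A revised version of the metadata could, for example, expand abbreviated years into full ones. The Python sketch below is only an illustration under stated assumptions: the function name expand_year and its century_base parameter are hypothetical, and the century is deliberately left for the caller to supply, because an abbreviated year such as the one in the sample item later in this article does not state it.

```python
import re


def expand_year(tmp, century_base=None):
    """Turn a printed year such as '—02.' into a four-digit integer.

    Returns None when the year cannot be determined, for instance when
    only two digits are given and no century_base is supplied.
    """
    match = re.search(r"\d+", tmp)
    if match is None:
        return None
    digits = match.group()
    if len(digits) == 4:
        return int(digits)
    if len(digits) == 2 and century_base is not None:
        return century_base + int(digits)
    return None


print(expand_year("—02.", century_base=1900))
print(expand_year("—02."))
```

Returning None rather than guessing keeps uncertain cases visible, so that they can be resolved from the sources.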

The elements of the text unit are the lines of poetry, episodes in prose, titles/headings, and editorial comments.

The reference unit contains footnotes: both editorial comments and commentaries and notes on corrections, additions and deletions in the manuscripts.
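How a program might connect a footnote marker in a verse with its note can be sketched as follows. The sketch assumes that markers are written as #1, #2 and so on, and that each note begins on its own line with the same marker, as in the sample item given further on in this article; the function names are hypothetical, and notes running over several lines are not handled.

```python
import re


def parse_refs(refs_text):
    """Map footnote numbers to note texts, one '#n note' per line."""
    notes = {}
    for line in refs_text.strip().splitlines():
        match = re.match(r"\s*#(\d+)\s+(.*)", line)
        if match:
            notes[int(match.group(1))] = match.group(2).strip()
    return notes


def markers_in(verse):
    """Return the footnote numbers referred to in a verse."""
    return [int(number) for number in re.findall(r"#(\d+)", verse)]


notes = parse_refs("#1 kk:ssa pieni alkukirjain.\n#2 r. (kantoihin).")
for number in markers_in("Puuhun#1 muhkat, muahan mahkat,"):
    print(number, notes[number])
```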

The following is an example of variant VII4 2168 as printed 5 and as XML text:

<item nro="ib21680">
<meta>
<id>2168.</id>
<loc>Rääkkylä.</loc>
<col>Hyvärinen,</col>
<sgn>A. n. 323.</sgn>
<tmp>—02.</tmp>
<inf>
Rasivaara. Loviisa Asikainen, 76 v. Kuullut lapsuudessaan
Kiteen Potoskavaarassa.
</inf>
</meta>
<text>
<l>Päivännäkemättömällä kun paineltiin,
sanottiin:</l>
<v>Puuhun#1 muhkat, muahan mahkat,</v>
<v>kantoin#2 ves’näräpät;</v>
<v>elä immeiseen tähän</v>
<v>ennee nosta muhkii!</v>
</text>
<refs>
#1 kk:ssa pieni alkukirjain.
#2 r. (kantoihin).
</refs>
</item>

Note that in the printed publication the metadata does not include title data. The XML tags are ITEM = text item, META = metadata, ID = SKVR number, LOC = locality data, COL = collector, SGN = archive number, TMP = time (year) of collection, INF = other information (here: informant’s home village, name and age and information on learning of poem), TEXT = text copied from manuscript, L = prose commentary, V = verse, REFS = footnotes.
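The item above can be read by any XML parser. The following Python sketch, which uses only the standard library, extracts the metadata and the verses. It is an illustration and not the project's own software; the variable names are assumptions, and the attribute value is written with straight quotation marks, as well-formed XML requires.

```python
import xml.etree.ElementTree as ET

ITEM_XML = """<item nro="ib21680">
<meta>
<id>2168.</id>
<loc>Rääkkylä.</loc>
<col>Hyvärinen,</col>
<sgn>A. n. 323.</sgn>
<tmp>—02.</tmp>
<inf>
Rasivaara. Loviisa Asikainen, 76 v. Kuullut lapsuudessaan
Kiteen Potoskavaarassa.
</inf>
</meta>
<text>
<l>Päivännäkemättömällä kun paineltiin,
sanottiin:</l>
<v>Puuhun#1 muhkat, muahan mahkat,</v>
<v>kantoin#2 ves’näräpät;</v>
<v>elä immeiseen tähän</v>
<v>ennee nosta muhkii!</v>
</text>
<refs>
#1 kk:ssa pieni alkukirjain.
#2 r. (kantoihin).
</refs>
</item>"""

item = ET.fromstring(ITEM_XML)

# Metadata: one dictionary entry per child of <meta>, whitespace normalised.
metadata = {child.tag: " ".join(child.text.split()) for child in item.find("meta")}

# Verses only, leaving out the prose commentary in <l>.
verses = [verse.text for verse in item.find("text").findall("v")]

print(item.get("nro"), metadata["loc"], metadata["col"])
for verse in verses:
    print(verse)
```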

XML is one way of saving structurised data as a text file, but by itself it does not do anything. XML texts are, with their tags and entities, difficult to read as such. A program is needed to interpret them, and several are available. The latest versions of the common Internet browsers, for example (Microsoft Internet Explorer 5 and later), are capable of showing XML text in a readable format and can, with an additional stylesheet, present text almost exactly as it looks in the original. It is thus possible to browse texts in just the same way as original texts, to copy text and to make simple word searches.

The biggest advantage of digital text is naturally that it permits efficient, comprehensive searches of an entire corpus; for this a database is required. XML permits the transfer of data to many different database programs. In the database each type of element constitutes a search field of its own, so a search can be targeted at a particular element or limited so as to find several character strings within the same element (such as a verse). A database accessible to researchers via the Internet is to be established in the course of the present project.
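A search limited to one type of element can be sketched with an in-memory SQLite database in Python. The table layout and the function find_in_element are assumptions made for this illustration, not the design of the planned database; the rows are taken from the sample item in this article, with the footnote markers left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element (item_id TEXT, tag TEXT, content TEXT)")
rows = [
    ("2168", "loc", "Rääkkylä."),
    ("2168", "l", "Päivännäkemättömällä kun paineltiin, sanottiin:"),
    ("2168", "v", "Puuhun muhkat, muahan mahkat,"),
    ("2168", "v", "kantoin ves'näräpät;"),
    ("2168", "v", "elä immeiseen tähän"),
    ("2168", "v", "ennee nosta muhkii!"),
]
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", rows)


def find_in_element(tag, *needles):
    """Find elements of one type that contain every search string.

    At least one search string must be given.
    """
    clauses = " AND ".join("content LIKE ?" for _ in needles)
    sql = f"SELECT item_id, content FROM element WHERE tag = ? AND {clauses}"
    params = [tag] + [f"%{needle}%" for needle in needles]
    return conn.execute(sql, params).fetchall()


# Verses in which both character strings occur together.
hits = find_in_element("v", "muhkat", "mahkat")
print(hits)
```

Because the tag is part of the query, the same string occurring in the metadata or in a prose passage is not returned when only verses are searched.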

The index to each volume of SKVR is also a thematic index. The thematic classification is approximately the same for all the volumes, but there are some differences of detail, naming practices, etc., and there are many things missing from the tables of contents. The Finnish Literature Society is drawing up a comprehensive, standard thematic index of all the volumes that will, when complete, be of service to the digital SKVR as well.

The structure defined in the DTD consists only of the logical structure; it does not concern the content of texts. The structure has neither elements nor attributes describing themes, motifs, poetic devices or linguistic features. These can, however, be added later: researchers can obtain a copy of the corpus or some section of it to which they can add the structures they need, and they can then use this copy as a basis for analyses. The basic corpus thus remains untouched, permitting as many potential uses as possible.
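The principle of annotating a copy while leaving the basic corpus untouched can be shown in a few lines of Python. The attribute name words, which here simply records the number of words in each verse, is a hypothetical example of a feature a researcher might add.

```python
import copy
import xml.etree.ElementTree as ET

base = ET.fromstring(
    "<item><text>"
    "<v>elä immeiseen tähän</v>"
    "<v>ennee nosta muhkii!</v>"
    "</text></item>"
)

# Work on a deep copy so that the basic corpus stays exactly as it was.
working = copy.deepcopy(base)
for verse in working.iter("v"):
    verse.set("words", str(len(verse.text.split())))

print(ET.tostring(working, encoding="unicode"))
print(ET.tostring(base, encoding="unicode"))
```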

The digital corpus will be as close a copy of the published SKVR as possible. It should be noted that it does not seek to be a copy of the original sources, the manuscripts in the archives, because the latter texts have a different structure corresponding to a different DTD. The printed SKVR adheres faithfully to the information in the sources, but isolating the texts and text units from their contexts and compiling the metadata have caused some confusion and errors. No attempt has been made to rectify these in the digital corpus: the aim has not been to produce a new edition requiring source criticism of its own.

Source criticism is important whatever the mode of a work: printed or digital. It often necessitates study of the original sources, which is possible in the archives, through microfilm copies or, in the future – hopefully – browsing facsimile copies of the original manuscripts on the net.6 The digital SKVR will greatly facilitate the use of the material and searches. The more comprehensive search functions may raise new ideas for research. Even so, researchers will still have to be thoroughly familiar with their materials and be critical of the sources. Computers are a useful tool, but they cannot solve problems or draw conclusions.

Notes:

1 Hautala, Jouko 1957: Vicissitudes in publishing the Ancient Poetry of the Finnish People. Studia Fennica VII:5. Helsinki.
2 Further details from the website of the developer of the standard, The World Wide Web Consortium, http://www.w3.org/XML
3 See http://etext.virginia.edu/TEI.html
4 See the Unicode Home Page, http://www.unicode.org/
5 Translation:
“2168. [Parish] Rääkkylä. [collector] Hyvärinen, A. [archive number] n. 323. [year] –02.
[Place] Rasivaara. [informant] Loviisa Asikainen, [age] 76 years. Heard this in her childhood at Potoskavaara of Kitee.
When [a tumour] was pressed with something that has not seen the daylight, it was said:
To a tree the lumps, to the ground the bumps,
to tree stumps the water blisters;
do not raise lumps
to this person any more!
1 in the manuscript a small initial letter. – 2 r. (to tree stumps).”
6 Some examples can be viewed via the MUISTI database, http://www.lib.helsinki.fi/memory/haku_e.html. Search e.g. author=“berg, o”. The Digitised Archive Material in Cultural Studies project of the Academy of Finland is, in the course of 2000–2001, digitising in facsimile format all the collection manuscripts of Elias Lönnrot. This material will in due course be accessible to researchers as a companion to the SKVR corpus.
