-------------------------------------------------------------------------------
Wilhelm Streitberg: Gotisch-Griechisch-Deutsches Wörterbuch (1910)
-------------------------------------------------------------------------------
Preliminary XML encoding of the headwords and German glosses
-------------------------------------------------------------------------------

All elements and attributes are described in the RELAX NG schema file. Below is a prose description of the schema, along with a few general remarks. Note that attributes in the http://www.wulfila.be/archive/2006/DB/dictionary namespace (using prefix db:*) are not part of the dictionary text and may be ignored; they contain data from the original database.

(1) The schema borrows a lot from TEI. Some elements are equivalent, or may have the same name. Future conversion to TEI (either as the primary file or as a secondary, derived format) should be fairly easy. Unlike TEI, however, the ad hoc schema is restrictive and specific: it describes the current transcription of Streitberg's dictionary – nothing more, nothing less.

(2) The document root has a block with a few notes on the file, followed by 3500+ blocks. Each corresponds exactly to one entry (boldfaced lemma) in Streitberg's dictionary. As far as we know, the list is complete. Entries are uniquely identified by a key that is composed of the page number and position on the page.

(3) An has roughly the same structure as a structured entry in a TEI encoded dictionary. It has information on form and grammar, followed by either a element or one or more elements (homographs), optionally followed by related sub-entries or notes. The embedded homographs and related entries have a similar content model, but do not allow further nesting of entry-level elements.

(4) Whitespace between top-level children of is not important, but all elements are in text order, which means that the original text can be restored by putting a space after each top-level element.
Within top-level elements, whitespace is important (xml:space="preserve"), with the exception of the and elements.

(5) The element contains mixed content (text intermingled with a small set of inline elements), optionally followed by a sequence of one or more embedded elements. These may be nested indefinitely. It mirrors the structure of complex entries having a general translation followed by numbered subdivisions. Streitberg has up to four nested levels (labeled A. I. 1. a). This content model cannot be expressed in a DTD or W3C schema, where elements are either fully mixed, or not mixed at all.

(6) Inline elements cover quotations, translations, mentioned foreign words, cross-references, manuscript sigla, grammatical labels and unspecified highlighting. There is also a semantically neutral span-element that allows grouping of text and inline elements. Language is always indicated by an xml:lang language identifier; the default for
is Gothic (got).

(7) Every text node is emendable: typographical errors can be marked (and optionally corrected) with a element. So far, I have found eight obvious errors. More generally, a element is available to signal encoding problems, include comments on the content of the entry, etc.

(8) The goal is an editorial view of the data, as described in the TEI Guidelines. It should be trivial to construct the original text from the XML encoding, ideally by simply stripping all tags from the file. In the current encoding, the following items are not expressed as textual content:

- Parentheses are implied around the content of and .
- Numbered and blocks store their label in an attribute.

In both cases, the text can easily be restored by a stylesheet. More problematic are:

- Separators between elements, usually one or two m-dash(es); Streitberg did not always use them consistently. They could be entered as text nodes in between sense-blocks, but this complicates the editing process; densely tagged mixed content blocks quickly become unwieldy. Maybe this is an example where exact reproduction of the typographic form is less important? To be discussed.
- Final devoicing (‘Auslautverhärtung’) is signaled in a boolean attribute, but not yet properly encoded. Streitberg generally puts the devoiced letter in italics and appends the stem consonant in parentheses, e.g. "hlaiFs (b)". He is not always consistent, so this has to be tagged manually (all 128 occurrences can be selected with the XPath expression //form[@auslautverhärtung]).

(9) Known issues with the current encoding:

- In related sub-entries, the grammatical label usually (but not always) comes before the actual form (e.g. “Adv. abraba”). This is not reflected in the current encoding. Simply switching elements is not enough: there are some complications, e.g. “Komp. airis Adv.”.
- About 20 to 30 entries have multiple forms or alternative reconstructions. This clashes with the current structure.
  An obvious solution would be to make the entire entry mixed content, allowing , , etc. as inline elements; on the other hand, having a few fixed fields really simplifies editing and querying. Problematic alternatives are currently stored in or elements, not unlike a TEI container.
- The distinction between quotation/mentioned and translation/mentioned is unclear. It is a matter of interpretation that could be avoided by simply tagging foreign words and phrases according to language. The schema documentation has more information on this.

(10) Finally, a few conventions are not (yet) expressed in XML:

- Double underscore indicates ellipsis, as in “entweder __ oder”.
- Tilde replaces dash whenever it is used as an iconic reference to the headword, as in “~ sik” (lemma ‘skaman’) or “sa ~da” (lemma ‘ufarhimina-kunds’). This corresponds to TEI en .
- The language identifier is missing from the headword. Since every is Gothic, the identifier is redundant – at least in this encoding stage. It can easily be supplied later on.

Tom De Herdt, 2006-11-30 – last modified 2026-03-25.
https://www.wulfila.be/
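
As a rough illustration of two remarks above (db:* attributes may be ignored; forms with final devoicing carry the auslautverhärtung attribute), here is a minimal Python sketch. It is not part of the original toolchain. The sample document is invented: the element names root and item, the attribute db:id and the attribute value "true" are placeholders, not names taken from the schema. Only the namespace URI, the element name form and the attribute name auslautverhärtung come from these notes. Because the notes do not say whether the file uses a default namespace, form is matched by local name.

```python
import xml.etree.ElementTree as ET

# Namespace of the database attributes, as given in the introductory notes.
DB_NS = "http://www.wulfila.be/archive/2006/DB/dictionary"


def local_name(tag):
    """Return a tag name without its '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def strip_db_attributes(root):
    """Delete every attribute in the db: namespace; return how many were removed."""
    prefix = "{" + DB_NS + "}"
    removed = 0
    for elem in root.iter():
        # Copy the matching names first so the dict is not modified while iterating.
        for name in [n for n in elem.attrib if n.startswith(prefix)]:
            del elem.attrib[name]
            removed += 1
    return removed


def devoiced_forms(root):
    """Return all 'form' elements that carry the auslautverhärtung attribute."""
    return [
        elem
        for elem in root.iter()
        if isinstance(elem.tag, str)
        and local_name(elem.tag) == "form"
        and "auslautverhärtung" in elem.attrib
    ]


# Invented sample; 'root', 'item' and 'db:id' are placeholders (assumptions).
SAMPLE = """<root xmlns:db="http://www.wulfila.be/archive/2006/DB/dictionary">
  <item db:id="1"><form auslautverhärtung="true">hlaifs</form></item>
  <item db:id="2"><form>skaman</form></item>
</root>"""

root = ET.fromstring(SAMPLE)
removed = strip_db_attributes(root)
forms = devoiced_forms(root)
print(removed, [f.text for f in forms])
```

On the real file, ET.parse(path).getroot() would replace the inline sample; devoiced_forms is meant to select the same elements as the XPath expression //form[@auslautverhärtung] quoted in (8), assuming form is not in a namespace the expression would miss.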