-------------------------------------------------------------------------------
Wilhelm Streitberg: Gotisch-Griechisch-Deutsches Wörterbuch (1910)
-------------------------------------------------------------------------------
Preliminary XML encoding of the headwords and German glosses
-------------------------------------------------------------------------------

All elements and attributes are described in the RELAX NG schema file. Below is 
a prose description of the schema, along with a few general remarks. Note that 
attributes in the http://www.wulfila.be/archive/2006/DB/dictionary namespace 
(using prefix db:*) are not part of the dictionary text and may be ignored; 
they contain data from the original database.
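
Since the db:* attributes may be ignored, a consumer could drop them before 
further processing. The following is a minimal Python sketch, not part of the 
distribution. The function name strip_db_attributes and the sample attribute 
names (db:id, db:raw, key) are invented for illustration; only the namespace 
URI is taken from the description above.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given in this README.
DB_NS = "http://www.wulfila.be/archive/2006/DB/dictionary"


def strip_db_attributes(root):
    """Remove every attribute in the db: namespace, in place.

    Returns the number of attributes removed.
    """
    prefix = "{" + DB_NS + "}"  # ElementTree stores names as {uri}local
    removed = 0
    for elem in root.iter():
        for name in [a for a in elem.attrib if a.startswith(prefix)]:
            del elem.attrib[name]
            removed += 1
    return removed


# Hypothetical sample; the attribute names are assumptions, not schema facts.
SAMPLE = (
    '<dictionary xmlns:db="' + DB_NS + '">'
    '<entry db:id="42" key="p1-1"><form db:raw="x">aba</form></entry>'
    "</dictionary>"
)

root = ET.fromstring(SAMPLE)
removed = strip_db_attributes(root)
print(removed, root.find("entry").attrib)
```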

(1) The schema borrows a lot from TEI. Some elements are equivalent, or may 
have the same name. Future conversion to TEI (either as the primary file or as 
a secondary, derived format) should be fairly easy. Contrary to TEI, however, 
the ad hoc schema is restrictive and specific: it describes the current 
transcription of Streitberg's dictionary – nothing more, nothing less.

(2) The document root <dictionary> has a <meta> block with a few notes on the 
file, followed by 3500+ <entry> blocks. Each <entry> corresponds exactly to one 
entry (boldfaced lemma) in Streitberg's dictionary. As far as we know, the list 
is complete. Entries are uniquely identified by a key that is composed of the 
page number and position on the page.

(3) An <entry> has roughly the same structure as a structured entry in a TEI 
encoded dictionary. It has information on form and grammar, followed by either 
a <sense> element or one or more <hom> elements (homographs), optionally 
followed by related sub-entries or notes. The embedded homographs and related 
entries have a similar content model, but do not allow further nesting of 
entry-level elements.

(4) Whitespace between top-level children of <entry> is not important, but 
all elements are in text order, which means that the original text can be 
restored by putting a space after each top-level element. Within top-level 
elements, whitespace is important (xml:space="preserve"), with the exception of 
the <hom> and <related> elements.
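
The restoration rule in (4) can be sketched in Python as follows. This is an 
illustration under assumptions, not the project's own tooling: the sample entry 
and its content are made up, and the sketch ignores the special whitespace 
handling that <hom> and <related> would need.

```python
import xml.etree.ElementTree as ET


def entry_text(entry):
    """Join the text of each top-level child of <entry> with single spaces.

    Whitespace between the top-level children (their tails) is ignored;
    whitespace inside each child is kept as-is.
    """
    parts = ["".join(child.itertext()) for child in entry]
    return " ".join(part for part in parts if part)


# Made-up sample entry; element content is illustrative only.
SAMPLE = """<entry>
  <form>aba</form>
  <grammar>Mn</grammar>
  <sense>Mann, Ehemann</sense>
</entry>"""

restored = entry_text(ET.fromstring(SAMPLE))
print(restored)
```

The sketch puts a space between top-level elements rather than after each one, 
so the result carries no trailing space.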

(5) The <sense> element contains mixed content (text intermingled with a small 
set of inline elements), optionally followed by a sequence of one or more 
embedded <sense> elements. These may be nested indefinitely. It mirrors the 
structure of complex entries having a general translation followed by numbered 
subdivisions. Streitberg has up to four nested levels (labeled A. I. 1. a).
This content model cannot be expressed in a DTD or W3C schema, where elements
are either fully mixed, or not mixed at all.
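
To illustrate the content model in (5), here is a small Python sketch that 
walks a <sense> recursively and reports each level. The attribute name "n" for 
the label, the inline element <hi> and the placeholder text are assumptions 
made for this example; the schema file is the authority on the real names.

```python
import xml.etree.ElementTree as ET


def own_text(sense):
    """Return the mixed content of a <sense> up to its first nested <sense>."""
    parts = [sense.text or ""]
    for child in sense:
        if child.tag == "sense":
            break  # nested senses follow the mixed content
        parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts).strip()


def walk_senses(sense, depth=0):
    """Yield (depth, label, text) for a <sense> and all nested <sense> blocks."""
    yield depth, sense.get("n", ""), own_text(sense)  # "n" is an assumed name
    for child in sense.findall("sense"):
        yield from walk_senses(child, depth + 1)


# Placeholder content, not taken from Streitberg.
SAMPLE = (
    "<sense>general <hi>gloss</hi>"
    '<sense n="A.">sub A<sense n="I.">sub A.I</sense></sense>'
    '<sense n="B.">sub B</sense>'
    "</sense>"
)

levels = list(walk_senses(ET.fromstring(SAMPLE)))
for depth, label, text in levels:
    print("  " * depth + (label + " " if label else "") + text)
```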

(6) Inline elements cover quotations, translations, mentioned foreign words, 
cross-references, manuscript sigla, grammatical labels and unspecified 
highlighting. There is also a semantically neutral span-element that allows 
grouping of text and inline elements. Language is always indicated by an 
xml:lang language identifier; the default for <form> is Gothic (got).

(7) Every text node is emendable: typographical errors can be marked (and 
optionally corrected) with a <sic> element. So far, I have found eight obvious 
errors. More generally, a <note> element is available to signal encoding 
problems, include comments on the content of the entry, etc. 

(8) The goal is an editorial view of the data, as described in the TEI 
Guidelines. It should be trivial to construct the original text from the XML 
encoding, ideally by simply stripping all tags from the file. In the current 
encoding, the following items are not expressed as textual content:
  - Parentheses are implied around the content of <formInfo> and <grammarInfo>.
  - Numbered <hom> and <sense> blocks store their label in an attribute.
In both cases, the text can easily be restored by a stylesheet.
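
The two restorations above could look roughly like this. The sketch uses Python 
instead of a stylesheet, and it assumes that the label attribute is called "n" 
and that label and text are separated by a single space; both are guesses made 
for illustration.

```python
import xml.etree.ElementTree as ET

# Elements whose content is implicitly parenthesized, per this README.
PARENTHESIZED = {"formInfo", "grammarInfo"}


def render(elem):
    """Return the text of elem with implied parentheses and labels restored."""
    inner = (elem.text or "") + "".join(
        render(child) + (child.tail or "") for child in elem
    )
    if elem.tag in PARENTHESIZED:
        return "(" + inner + ")"
    label = elem.get("n")  # assumed attribute name for the stored label
    if elem.tag in ("hom", "sense") and label:
        return label + " " + inner
    return inner


# Placeholder content for illustration.
form_info = render(ET.fromstring("<formInfo>b</formInfo>"))
labeled = render(ET.fromstring('<sense n="1.">erste Bedeutung</sense>'))
unlabeled = render(ET.fromstring("<sense>Bedeutung</sense>"))
print(form_info, labeled, unlabeled, sep=" | ")
```
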
More problematic are:
  - Separators between <sense> elements, usually one or two em dashes; 
Streitberg did not use them consistently. They could be entered as text nodes 
in between sense-blocks, but this complicates the editing process; densely 
tagged mixed content blocks quickly become unwieldy. This may be a case where 
exact reproduction of the typographic form is less important. To be discussed.
  - Final devoicing (‘Auslautverhärtung’) is signaled in a boolean attribute, 
but not yet properly encoded. Streitberg generally puts the devoiced letter in 
italics and appends the stem consonant in parentheses, e.g. "hlaiFs (b)". He is 
not always consistent, so this has to be tagged manually (all 128 occurrences 
can be selected with the XPath expression //form[@auslautverhärtung]).
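
As a starting point for the manual tagging, the flagged forms can be listed 
with a few lines of Python. This is a sketch under assumptions: the attribute 
value "true", the sample forms and the pattern for "form (stem consonant)" are 
illustrative and do not describe how the forms are stored in the actual file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; only the attribute name comes from this README.
SAMPLE = (
    "<dictionary>"
    '<entry><form auslautverhärtung="true">hlaifs</form></entry>'
    "<entry><form>aba</form></entry>"
    "</dictionary>"
)

root = ET.fromstring(SAMPLE)

# ElementTree supports the [@attribute] predicate; ".//" is relative to root.
flagged = root.findall(".//form[@auslautverhärtung]")
texts = [form.text for form in flagged]
print(texts)

# Assumed shape of the printed convention: a form, a space, and the stem
# consonant in parentheses, as in the example "hlaiFs (b)".
PRINTED = re.compile(r"^(?P<form>\S+)\s+\((?P<stem>\w+)\)$")

match = PRINTED.match("hlaifs (b)")
print(match.group("form"), match.group("stem"))
```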

(9) Known issues with the current encoding:
  - In related sub-entries, the grammatical label usually (but not always) 
comes before the actual form (e.g. “Adv. abraba”). This is not reflected in the 
current encoding. Simply switching elements is not enough: there are some 
complications, e.g. “Komp. airis Adv.”.
  - About 20 to 30 entries have multiple forms or alternative reconstructions. 
This clashes with the current structure. An obvious solution would be to make 
the entire entry mixed content, allowing <form>, <grammar>, etc. as inline 
elements; on the other hand, having a few fixed fields really simplifies 
editing and querying. Problematic alternatives are currently stored in 
<formAlternative> or <grammarAlternative> elements, not unlike a TEI 
<dictScrap> container.
  - The distinction between quotation/mentioned and translation/mentioned is 
unclear. It is a matter of interpretation that could be avoided by simply 
tagging foreign words and phrases according to language. The schema 
documentation has more information on this.

(10) Finally, a few conventions are not (yet) expressed in XML:
  - Double underscore indicates ellipsis, as in “entweder __ oder”.
  - Tilde replaces dash whenever it is used as an iconic reference to the 
headword, as in “~ sik” (lemma ‘skaman’) or “sa ~da” (lemma 
‘ufarhiminakunds’). This corresponds to TEI <oRef/> and <oVar/>.
  - The language identifier is missing from the headword. Since every <form> 
is Gothic, the identifier is redundant – at least in this encoding stage.
It can easily be supplied later on. 
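
The first two conventions could later be resolved mechanically, at least in 
part. The Python sketch below is only an illustration: the function names are 
invented, rendering the double underscore as "…" is an assumption, and a tilde 
with attached letters (the <oVar/> case) is deliberately left untouched because 
expanding it requires knowing the stem of the headword.

```python
import re

# A tilde with optional word characters directly before or after it.
TILDE = re.compile(r"(\w*)~(\w*)")


def classify_tilde(text):
    """Return 'oRef' for a free-standing tilde, 'oVar' for one with
    attached letters, or None if the text contains no tilde."""
    match = TILDE.search(text)
    if match is None:
        return None
    return "oVar" if (match.group(1) or match.group(2)) else "oRef"


def expand(text, headword):
    """Replace '__' with an ellipsis and a free-standing '~' with the headword."""
    text = text.replace("__", "…")
    # Only expand a tilde that has no letters attached on either side.
    return re.sub(r"(?<!\w)~(?!\w)", headword, text)


print(expand("~ sik", "skaman"))
print(expand("entweder __ oder", "skaman"))
print(classify_tilde("~ sik"), classify_tilde("sa ~da"))
```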

Tom De Herdt, 2006-11-30 – last modified 2026-03-25.
https://www.wulfila.be/