This is the second in a series on querying Greek texts with
XQuery. Before you read this, you should consider reading Querying
Greek Texts in XML: Part 1, which
introduces simple queries on base texts. In this post, we introduce
morphologies and syntax trees. In the next post, we will show some of
the queries that can be done using treebank markup.
A note on XQuery and XML
People in the digital humanities use various languages for exploring
texts, and this tutorial does not claim that XQuery is the only tool
to use. The author believes that learning both XQuery and a scripting
language like Python is generally wise for people working with these
texts, but many scholars will choose other tools. One advantage of
XQuery for a tutorial like this is that it is very high level, so if
you choose another language, this tutorial should still be useful for
presenting the concepts. After the XQuery tutorial is done, there will
also be a tutorial in Python.
Regardless of the languages you use for querying or programming, XML is one of the best formats for encoding texts. One XML format, the one defined by the Text Encoding Initiative (TEI), is the most widely used format in digital humanities. XML provides good support for the kind of metadata people use in digital humanities, and almost all programming languages have libraries for working with XML. A future post will deal with formats for various kinds of data.
Now let’s continue the tutorial where we left off last time …
Morphologies and Syntax Trees
In Part 1, we queried the SBL Greek New Testament text. Here is an example of a verse in that format:
In this part of the tutorial, we need morphology and syntax. We will use the MorphGNT morphology. A morphology provides information on individual words, such as voice, tense, mood, person, number, and case, plus the dictionary form. MorphGNT is a high-quality morphology and is widely used.
For this verse, it looks like this:
That’s not XML, but we can get the same information in an XML treebank. The main treebanks for biblical Greek (Global Bible Initiative, Lowfat, PROIEL) all use the morphology done by James Tauber and Ulrik Sandborg-Petersen, the brains behind MorphGNT, as do most Bible software programs.
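To make the idea of a per-word morphology record concrete, here is a minimal sketch in Python (chosen so it runs with only the standard library). It splits one space-separated line into named fields and expands a positional parse code into features. The column order, the meaning of the eight parse-code positions, and the sample line itself are assumptions written by hand for illustration; they are not copied from the MorphGNT data, so check the MorphGNT documentation before relying on them.

```python
# Assumed column order for one morphology line (hypothetical, for illustration).
FIELDS = ("ref", "pos", "parse", "text", "word", "norm", "lemma")

# Assumed meaning of each position in the eight-character parse code.
PARSE_SLOTS = ("person", "tense", "voice", "mood",
               "case", "number", "gender", "degree")


def parse_line(line):
    """Turn one space-separated morphology line into a dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Keep only the positions that carry a value; "-" means "not applicable".
    record["features"] = {
        slot: code
        for slot, code in zip(PARSE_SLOTS, record["parse"])
        if code != "-"
    }
    return record


# Hand-written sample line, not an excerpt from the real data.
SAMPLE = "041135 V- 3AAI-S-- ἐδάκρυσεν ἐδάκρυσεν ἐδάκρυσεν δακρύω"

record = parse_line(SAMPLE)
print(record["lemma"], record["features"])
```

The point of the sketch is only that each word carries its own bundle of features plus a dictionary form; a treebank attaches the same kind of information to each word as attributes.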
For this tutorial, a treebank is more convenient. A syntax tree describes the relationships among the words of a sentence. There are different kinds of syntax trees: they describe different kinds of relationships, and they describe these relationships in different ways. Let’s look at three different treebanks:
The Lowfat treebanks define sentences in terms of words (w elements) and word groups (wg elements). The class attribute of a word identifies the kind of word that it is, e.g. a noun, a verb, a determiner, or an adjective. The class attribute of a word group identifies the kind of word group it is, e.g. a clause, a noun phrase, or a prepositional phrase. The role attribute of either a word or a word group identifies the role of the node with respect to the main verb, e.g. subject, object, indirect object, or adverbial. As in the other two treebank formats, the morphology data from MorphGNT has been incorporated into the treebank.
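The structure just described can be sketched in a few lines of Python. The w and wg element names and the class and role attribute names come from the description above; the specific attribute values (cl, np, det, s, v), the lemma attribute, and the sentence wrapper are assumptions made up for this illustration, and the fragment is hand-written rather than taken from the Lowfat files.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the spirit of the format; attribute values are assumed.
SAMPLE = """
<sentence>
  <wg class="cl">
    <w class="verb" role="v" lemma="δακρύω">ἐδάκρυσεν</w>
    <wg class="np" role="s">
      <w class="det" lemma="ὁ">ὁ</w>
      <w class="noun" lemma="Ἰησοῦς">Ἰησοῦς</w>
    </wg>
  </wg>
</sentence>
"""


def words(node):
    """Return the text of every w element at or below node, in document order."""
    return [w.text for w in node.iter("w")]


def nodes_with_role(root, role):
    """Return every word or word group whose role attribute equals role."""
    return [node for node in root.iter() if node.get("role") == role]


root = ET.fromstring(SAMPLE.strip())

# A subject may be a single word or a whole word group, so collect its words.
subjects = [" ".join(words(node)) for node in nodes_with_role(root, "s")]

# Word class lives on the individual w elements.
nouns = [w.text for w in root.iter("w") if w.get("class") == "noun"]

print(subjects, nouns)
```

In XPath terms, the two lookups correspond roughly to `//*[@role='s']` and `//w[@class='noun']`. Because roles can sit on either a word or a word group, the subject query deliberately matches any element rather than only w.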
The Lowfat treebanks shown above are derived from the Global Bible Initiative (GBI) treebanks and use their analysis. The GBI treebanks represent both words and word groups as Node elements, and contain all of the information found in the Lowfat treebanks. In addition, they contain information describing the parse tree that is not found in the Lowfat treebanks. The GBI trees are much larger, with more nodes and more attributes per node, and are perhaps more difficult to read.
The PROIEL treebank is a dependency treebank. It does not have word groups at all. Instead, it describes the relationships among words using identifiers.
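A dependency layout of this kind can be sketched as follows in Python. There are no word groups: each word points at its head by identifier, and a phrase has to be reassembled by following those pointers. The token element and the id, head-id, form, and relation attribute names, as well as the relation labels, are assumptions for illustration, and the fragment is hand-written rather than taken from the PROIEL files.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; element and attribute names are assumed.
SAMPLE = """
<sentence>
  <token id="1" form="ἐδάκρυσεν" relation="pred"/>
  <token id="2" form="ὁ" head-id="3" relation="aux"/>
  <token id="3" form="Ἰησοῦς" head-id="1" relation="sub"/>
</sentence>
"""

root = ET.fromstring(SAMPLE.strip())

# Index tokens by identifier so that head references can be resolved.
tokens = {t.get("id"): t for t in root.iter("token")}


def head_of(token):
    """Return the token this one depends on, or None for the root word."""
    return tokens.get(token.get("head-id"))


def dependents_of(token):
    """Return the tokens whose head-id points at this token."""
    return [t for t in tokens.values() if t.get("head-id") == token.get("id")]


verb = tokens["1"]
subject = next(t for t in dependents_of(verb) if t.get("relation") == "sub")

# Rebuild the subject phrase: the subject's own dependents, then the subject.
phrase = [t.get("form") for t in dependents_of(subject)] + [subject.get("form")]

print(phrase)
```

Compared with a word-group format, the grouping is implicit here: the determiner belongs to the subject only because its head identifier says so.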
Each of these treebank formats has advantages and disadvantages. In the next few parts of this tutorial, we will explore queries on the Lowfat treebank. A future set of posts will do the same queries using the PROIEL database.