Monthly Archives: August 2011

Getting Fieldworks lexical data into XLingpaper

I’ve written before about using WeSay to collect language data, and WeSay LIFT files can be fairly easily imported into Fieldworks Language Explorer (FLEx) for analysis. Recently, I’ve been working on getting data from the FLEx lexicon into XLingpaper, to facilitate the writing of reports and papers that can be full of data (which is the way I like them… :-)).
I start with a lexicon (basically just a word list) in FLEx that has been parsed for root forms (going through noun class categorization and obligatory morphology with a speaker of the language). I then figure out the canonical root syllable profile (e.g., around here, usually CVCV), and look for complementary distribution and contrast within that type, both (though separately) for nouns and verbs.
I have a script that outputs regular expressions based on the graphs we expect to use (those in Swahili, plus those we have added to orthographies in the past; since we start with data encoded in the Swahili/Lingala orthography, this covers most of the data we work with). This script puts out expressions like


which means that the whole word/lexeme form (between ^ and $) is just b, u, some consonant (one of m or n, or not, then a consonant letter that appears alone, or the first and second letters of a digraph), then some vowel (any of ten basic ones, long or short, plus diacritics, or not). In other notation, it gives buCV, or canonical structures with root-initial [bu]. This data is paired with data from other regular expression filters giving [b] before other vowels, to show a complete distribution of [b] before all vowels (presumably…).
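To give a concrete (and much simplified) sketch of the kind of expression described above: the consonant and vowel classes below are illustrative placeholders, not the script’s actual output, which covers a larger inventory with digraphs and diacritics.

```shell
# Illustrative buCV filter: ^bu, an optional prenasalized consonant
# (possibly a digraph in h), then an optional short or long vowel.
regex='^bu([mn]?[bdfghjklmnpstvwyz]h?|ny)?([aeiou]{1,2})?$'
echo "buta"  | grep -E "$regex"    # buCV: matches
echo "butra" | grep -E "$regex"    # consonant cluster: no match
```

Running a filter like this over the lexeme forms pulls out exactly the buCV-shaped roots.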

The script puts out another expression,


which gives me CaCa, as in the following screenshot:

(The \3 refers back to the third set of parentheses, (a), so that changing (a) to (i) gives you CiCi.) The data from these filters gives evidence of the independent identity of a vowel, as opposed to vowels created through harmony rules.
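A hedged sketch of how such a backreference works (again, the consonant classes here are illustrative, not the script’s real inventory): the third parenthesized group, (a), is reused via \3, so both vowels must be identical.

```shell
# CaCa filter: group 3 is the vowel (a); \3 forces the second vowel to match it.
regex='^([mn]?)([bdgkpt]h?)(a)[mn]?[bdgkpt]h?\3$'
echo "kata" | grep -E "$regex"   # CaCa: matches
echo "kati" | grep -E "$regex"   # second vowel differs: no match
```

Swapping (a) for (i) in the third group turns this into a CiCi filter, just as described above. (Backreferences in `grep -E` are a GNU extension.)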
So these regular expressions allow filtering of data in the FLEx lexicon to show just the data I need to prove a particular point (in my case, why just these letters should be in the alphabet). But then, how to get the data out of FLEx, and into a document you’re writing?
FLEx has a number of export options out of the box, but none of them seem designed for outputting words with their glosses, based on a particular filter/sort of the lexicon. In particular, I’m looking for export into a format that can be validated against an XLingpaper DTD, since I use XLingpaper XML for most of my writing, both for archivability and longevity of my data, and for cross-compatibility in differing environments (there are also developed stylesheets to make XLingpaper docs into HTML, PDF, and usually word processor docs, too). The basic XML export of the data on the above sort starts like this:

<?xml version="1.0" encoding="utf-8"?>
<LexEntry id="hvo16380">
<AStr ws="gey">
<Run ws="gey">bana</Run>
<LexSense number="1" id="hvo16382">
<AStr ws="en">
<Run ws="en">num</Run>
<AStr ws="en">
<Run ws="en">four (4)</Run>
<AStr ws="fr">
<Run ws="fr">quatre (4)s</Run>
<AStr ws="swh">
<Run ws="swh">nne</Run>
<AStr ws="pt">
<Run ws="pt">quatro (4)</Run>
<AStr ws="es">
<Run ws="es">cuatro</Run>
<LexEntry id="hvo11542">
<LexEntry_HeadWord>… and so on…

But this is way more information than I need (I got most of these glosses for free using the CAWL to elicit the data), and in the wrong form. The cool thing about XML is that you can take structured information and put it in another structure/form, to get the form you need. To do this, I needed to look (again) at XSL, the Extensible Stylesheet Language, which had successfully intimidated me a number of times already. But with a little time, energy, and desperation, I got a working stylesheet. And with some help from Andy Black, I made it simpler and more straightforward, so that it looks like XLPMultipleFormGlosses.xsl looks today. Put it, and an XML file describing it, into /usr/share/fieldworks/Language Explorer/Export Templates, and this is now a new export process from within FLEx. To see the power of this stylesheet, the above data is now exported from FLEx as

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xlingpaper PUBLIC "-//XMLmind//DTD XLingPap//EN" "XLingPap.dtd">
<xlingpaper version="2.8.0">
<section1 id="DontCopyThisSection">
<secTitle>Click the [+] on the left, copy the example that appears below, then paste it into your XLingpaper document wherever an example is allowed.</secTitle>
<example num="examples-hvo16380">
<listWord letter="hvo16380">
<langData lang="gey">bana</langData>
<gloss lang="en">four (4)</gloss>
<gloss lang="fr">quatre (4)s</gloss>
<listWord letter="hvo11542">

which contains just enough header/footer to validate against the XLingpaper DTD (so you don’t get errors opening it in XMLmind), and the word forms and just the glosses I want (English and French, but that is easily customizable in the stylesheet). The example node can be copied and pasted into an existing XLingpaper document, which then can eventually be transformed into other formats, like the pdf from this screenshot:

which I think is a pretty cool thing to be able to do, and an advance for the documentation of languages we are working with.

Wesay Wrapper

As happy as I am with WeSay, it is designed with a mindset of one user working on a given computer. It can work on a number of different languages (one at a time, of course!) out of the box, but switching isn’t straightforward. You have to navigate to the WeSay folder, then click on the .lift file each time you want to work in a language other than the one WeSay opened last. Since one of the goals of working in BALSA is to hide the (often confusing) directory structure from the (initiate) user, this is not ideal for switching between languages in WeSay on BALSA.

Since I’m doing just that (and a fair bit of switching on my own computer between WeSay in different languages), I wrote a script to invoke WeSay with the arguments calling for a given project each time it runs, so we will never need to think about what the last project was. It has a graphical tool (Zenity) to select which project to open, which is populated by the projects actually on that computer.
It assumes a structure of projects named by Ethnologue code, each in a folder with that name, each of which is in the same WeSay folder. It also assumes that each project has the full language name defined in the palaso:languageName tag in the language’s .ldml file (if it doesn’t, it will still work, but the GUI will look off, and the next language will end up on the same line).
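The language-name lookup is just a grep pipeline over the .ldml file. Here is a sketch of how it works, using a made-up two-line sample file (the real .ldml files have much more in them, and the language name below is a placeholder):

```shell
# Create a minimal sample mimicking the two-line layout the wrapper expects:
# the palaso:languageName tag opens on one line, its value sits on the next.
cat > sample.ldml <<'EOF'
<ldml><special><palaso:languageName
value="SampleLang" /></special></ldml>
EOF
# Grab the tag line plus one line of context, join them, then peel out the
# quoted value.
name=$(grep --after-context=1 "palaso:languageName" sample.ldml | tr -d '\n' \
  | grep -o '<palaso:languageName.*>' | grep -o '".*"' | tr -d '"')
echo "$name"   # prints: SampleLang
```

If the tag is missing, the pipeline just produces an empty string, which is why the Zenity list still works but its columns drift.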

I have the GUI in English and French, since that’s what we use here. 🙂

For those interested, here it is:
#!/bin/bash
#set -x
# Note: this assumes a directory structure of WeSay projects named after
# three-letter ISO/Ethnologue codes.

case $HOSTNAME in
	Balsa*) wsfolder=/home/balsa/WeSay/;;
	*) echo "Is WeSay set up on this computer? If so, update the $0 wrapper."; exit;;
esac

cd "$wsfolder"
for xyz in `ls -d ???`
do
	langs="$langs $xyz"
	langs="$langs `grep --after-context=1 "palaso:languageName" $xyz/WritingSystems/$xyz.ldml | tr -d '\n' | grep -o "<palaso:languageName.*>" | grep -o '".*"' | tr -d '"'`"
done
cd "$OLDPWD"

lang=`zenity --width=120 --height=400 --list --title "Open/Ouvrir WeSay" --text "Choose language / Choisissez la langue :" --column="ISO 639-3" --column="Name/Nom de langue" $langs`

echo "Ethnologue code entered: $lang"
if [ -f "$wsfolder/$lang/$lang.WeSayConfig" ]
then
	wesay "$wsfolder/$lang/$lang.lift"
else
	zenity --error --text "Sorry, please check the code and try again.\nDésolé, veuillez vérifier le code et réessayer."
fi