All posts by kentling

New Wesay Tasks

I found out today that the tasks in WeSay are more configurable than I had thought. I add a “Plural” field to any language project I work in, but until today we have been stuck inputting that either in FLEx (i.e., by me) or in “Dictionary Browse & Edit,” which is unideal at best.
Looking at the config file (in a text editor — make a copy first, etc.!), there are a number of <task/> nodes, several of which have the taskName attribute of “AddMissingInfo.” This appears to be a generic task (unlike some of the others), that comes with a number of options, and can be used multiple times.
In addition to the labeling, there is a field it checks for (if that field is populated, the record doesn’t show in the task), fields it shows as editable, and fields that are shown, but not editable. I don’t know what the last two do, since they are always empty nodes (in all the tasks in my config, anyway). I found out (by trying) that if you tell it to look for a field that isn’t there, it will crash (so make your new fields, then the tasks to add them). I think if you tell it to display a field that isn’t there, it simply won’t. Here is my task node for the new “Add Plurals” task from my config file:

<task
taskName=”AddMissingInfo”
visible=”true”>
<label>Plurals</label>
<longLabel>Add Plurals</longLabel>
<description>Add plural forms to entries where they are missing.</description>
<field>Plural</field>
<showFields>Plural</showFields>
<readOnly>definition, EntryLexicalForm, citation</readOnly>
<writingSystemsToMatch />
<writingSystemsWhichAreRequired />
</task>
Here is the new desktop:

and Here is the page for the new task:

Getting Fieldworks lexical data into XLingpaper

I’ve written before about using WeSay to collect language data, and Wesay lift files can be fairly easily imported into Fieldworks Language Explorer (FLEx) for analysis. Recently, I’ve been working on getting data from the FLEx lexicon into XLingpaper, to facilitate the writing of reports and papers than can be full of data (which is the way I like them…:-)).
I start with a lexicon (basically just a word list) in flex, that has been parsed for root forms (go through noun class categorization and obligatory morphology with a speaker of the language). Figure out the canonical root syllable profile (e.g., around here, usually CVCV), and look for complimentary distribution and contrast within that type, both (though separately) for nouns and verbs.
I have a script that is putting out regular expressions based on what graphs we expect to use (those in Swahili, plus those we have added to orthographies in the past –since we start with data encoded in the Swahili/Lingala orthography, this covers most of the data that we work with). This script puts out expressions like

^bu([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

which means that the whole word/lexeme form (between ^ and $) is just b, u, some consonant (one of m or n, or not, then a consonant letter that appears alone, or the first and second of a digraph), then some vowel (any of ten basic ones, long or short, plus diacritics, or not). In other notation, It is giving buCV, or canonical structures with root initial [bu]. This data is paired with data from other regular expression filters giving [b] before other vowels to show a complete distribution of [b] before all vowels (presumably…).

The script puts out another expression,

^([mn]{0,1})([[ptjfvmlryh]|[bdkgcsznw][hpby]{0,1}])(a)([mn]{0,1})([[ptjfvmlryh]|[bdkgcsznw][hpby]{0,1}])3([́̀̌̂]{0,1})$

which gives me CaCa, as in the following screenshot:

(The 3 refers to the third set of parentheses, (a), so that changing (a) to (i) gives you CiCi.) The data from these filters gives evidence of the independent identity of a vowel, as opposed to vowels created through harmony rules.
So these regular expressions allow filtering of data in the FLEx lexicon to show just the data that you I need to prove a particular point you’re trying to make (in my case, why just these letters should be in the alphabet). But then, how to get the data out of FLEx, and into a document you’re writing?
FLEx has a number of export options out of the box, but none of them seem designed for outputting words with their glosses, based on a particular filter/sort of the lexicon. In particular, I’m looking for export into a format that can be validated against an XLingpaper DTD, since I use XLingpaper XML for most of my writing, both for archivability and longevity of my data, as well as for cross-compatibility in differing environments (there are also developed stylesheets to make XLingpaper docs into html, pdf, and usually word processor docs, too). The basic XML export of the data on the above sort starts like this:

<?xml version=”1.0″ encoding=”utf-8″?>
<ExportedDictionary>
<LexEntry id=”hvo16380″>
<LexEntry_HeadWord>
<AStr ws=”gey”>
<Run ws=”gey”>bana</Run>
</AStr>
</LexEntry_HeadWord>
<LexEntry_Senses>
<LexSense number=”1″ id=”hvo16382″>
<MoMorphSynAnalysisLink_MLPartOfSpeech>
<AStr ws=”en”>
<Run ws=”en”>num</Run>
</AStr>
</MoMorphSynAnalysisLink_MLPartOfSpeech>
<LexSense_Definition>
<AStr ws=”en”>
<Run ws=”en”>four (4)</Run>
</AStr>
</LexSense_Definition>
<LexSense_Definition>
<AStr ws=”fr”>
<Run ws=”fr”>quatre (4)s</Run>
</AStr>
</LexSense_Definition>
<LexSense_Definition>
<AStr ws=”swh”>
<Run ws=”swh”>nne</Run>
</AStr>
</LexSense_Definition>
<LexSense_Definition>
<AStr ws=”pt”>
<Run ws=”pt”>quatro (4)</Run>
</AStr>
</LexSense_Definition>
<LexSense_Definition>
<AStr ws=”es”>
<Run ws=”es”>cuatro</Run>
</AStr>
</LexSense_Definition>
</LexSense>
</LexEntry_Senses>
</LexEntry>
<LexEntry id=”hvo11542″>
<LexEntry_HeadWord>… and so on…

But this is way more information than I need (I got most of these glosses for free using the CAWL to elicit the data), and in the wrong form. The cool thing about XML is that you can take structured information and put in in another structure/form, to get the form you need. To do this, I needed to look (again) and xsl, the extensible stylesheet language, which had succesfully intimidated me a number of times already. But with a little time, energy, and despration, I got a working stylesheet. And with some help from Andy Black, I made it simpler and more straightforward, so that it looks like XLPMultipleFormGlosses.xsl looks today. Put it, and an xml file describing it, into /usr/share/fieldworks/Language Explorer/Export Templates, and this is now a new export process from within FLEx. To see the power of this stylesheet, the above data is now exported from FLEx as

<?xml version=”1.0″ encoding=”utf-8″?>
<!DOCTYPE xlingpaper PUBLIC “-//XMLmind//DTD XLingPap//EN” “XLingPap.dtd”>
<xlingpaper version=”2.8.0″>
<lingPaper>
<section1 id=”DontCopyThisSection”>
<secTitle>Click the [+] on the left, copy the example that appears below, then paste it into your XLingpaper document wherever an example in allowed.</secTitle>
<example num=”examples-hvo16380″>
<listWord letter=”hvo16380″>
<langData lang=”gey”>bana</langData>
<gloss lang=”en”>four (4)</gloss>
<gloss lang=”fr”>quatre (4)s</gloss>
</listWord>
<listWord letter=”hvo11542″>

which contains just enough header/footer to validate against the XLingpaper DTD (so you don’t get errors opening it in XMLmind), and the word forms and just the glosses I want (English and French, but that is easily customizable in the stylesheet). The example node can be copied and pasted into an existing XLingpaper document, which then can eventually be transformed into other formats, like the pdf from this screenshot:

which I think is a pretty cool thing to be able to do, and an advance for the documentation of languages we are working with.

Wesay Wrapper

As happy as I am with WeSay, it is designed with a mindset of one user working on a given computer. It can work on a number of different languages (one at a time, of course!) out of the box, but it isn’t straightforward. You have to navigate to the WeSay folder, then click on the .lift file each time you want to work in a language other than the one WeSay opened last. Since one of the goals of working in BALSA is to hide the (often confusing) directory structure from the (initiate) user, this is unideal for switching between languages in Wesay on BALSA.

Since I’m doing just that (and a fair bit of switching on my own computer between WeSay in different languages), I wrote a script to invoke WeSay with the arguments calling for a given project each time it runs, so we will never need to think about what the last project was. It has a graphical tool (Zenity) to select which project to open, which is populated by the projects actually on that computer.
It assumes a structure of projects named by Ethnologue code, each in a folder with that name, each of which is in the same WeSay folder. It also assumes that each project has the full language name defined in the palaso:languageName tag in the language’s .ldml file (if it doesn’t, it will still work, but the gui will look off, and the next language will be on the same line).

I have the gui in English and French, since that’s what we use here. 🙂

For those interested, here it is:
#!/bin/bash
Version=2011.07.28
#set -x
#Note: this assumes a directory structure of WeSay projects named after three letter ISO/Ethnologue codes.

case $HOSTNAME in
Balsa*) wsfolder=/home/balsa/WeSay/;;
*) echo “Is WeSay set up on this computer? if so, update the $0 wrapper.”;exit;;
esac
langs=
PWD=`pwd`
cd $wsfolder
for xyz in `ls -d ???`
do
langs=”$langs $xyz”
langs=”$langs `grep –after-context=1 “palaso:languageName” $xyz/WritingSystems/$xyz.ldml|tr -d ‘n’| grep -o “<palaso:languageName.*>” |grep -o ‘”.*”‘|tr -d ‘”‘`”
done
cd $PWD
lang=`zenity –width=120 –height=400 –list –title “Open/Ouvrir Wesay” –text “Choose language / Choisissez langue:” –column=”ISO 639-3″ –column=”Name/Nom de langue” –multiple $langs`

echo “Ethnologue code entered: $lang”
if [ -f $wsfolder/$lang/$lang.WeSayConfig ]
then
wesay $wsfolder/$lang/$lang.lift
else
zenity –error –text “Sorry, please check the code and try again.n Désolé, SVP verifier la code, et éssayer encore.”
$0
fi
exit

Using computers to Help the Computer Illiterate Develop their Language

I’ve been working with orthography development a bit, and it has been a major challenge getting all the different pieces of the work to fit together well. One of the main divides I see in work is between people who would use computers exclusively, and others would who use them not at all (during the research process at least). I have never felt comfortable in either camp, in part because I am of the computer generation, but also because I have seen a lot that paper and pencil methodology has to offer. Perhaps the most relevant point to our work is that computers are inaccessible to most of our national colleagues. Which means that if I’m doing everything on a computer, I’m doing it by myself. Or I’m teaching people things I started learning in the third grade (they know what the 0/1 symbol is from cell phones, but other than that, I’m usually starting from scratch).
Fortunately for me, there are people working on making linguistic computer work more accessible to the less computer literate, so we can take advantage of computers, without pushing our national colleagues out of the work. One bright shining example is WeSay. Its interface is straightforward and simple. Think of a word and type it in. Give it a short meaning. Next word. Later, you can go back and add longer meanings, other senses, etc. You can work through a wordlist, or use semantic domains — both great ways to help people think of new words to put in their dictionary. I had someone working on it for several days, to bring his wordlist up to over 2,000 words, and we were both quite happy with how it went. I didn’t realize just how happy I should have been until I had the same guy do some other tasks on the same computer (I think it was still just typing, but in openoffice). Things immediately bogged down, and I was constantly needed to fix something.
So here’s my problem: I really like dictionaries, and I think a dictionary is a backbone to any other work done in a language. And it just doesn’t make sense to write a dictionary on pen and paper. But the majority of my time is developing writing systems, which is better done as a community process (i.e., not everyone standing around watching me use a computer). The first thing I do is collect a wordlist, which I eventually make the beginnings of a dictionary. But that dictionary is going to need to be put in a database in a standardized writing system, or it will be a mess. So writing system development and lexicography go best together, but how?
Constance Kutsch Lojenga (among others) has developed a participatory linguistic research methodology, which is good at bringing speakers of a language into the language discovery process from the beginning (c.f., “Participatory Research in Linguistics. ”Notes on Linguistics Vol. 73(2):13-27. Dallas, TX: SIL. 1996.). Language community members see the sound distinctions in their language (at about) the same time I do, and we get to make writing system decisions as a group, given the linguistic facts we have observed together. The basic discovery process puts two words together, and asks, “is the sound in question in these two words the same, or different?” If they are the same, they are put in the same pile, if different, in different piles. There is, of course, lots of background work to be done (like sorting words by syllable profiles, so we’re looking at the same position in the same type of word at a time), but we attempt to make the discovery process itself as attainable as possible, and I have seen it work well.
Which brings me to my dilemma. This methodology as currently conceived uses words written on paper, which are sorted as a group. Which means, that if I use WeSay to collect a wordlist, I (or some other computer savvy person) need to export that wordlist into a format that will print onto paper that can be cut into cards (not hard with mailmerge, and document templates, but it is work), and then after the sorting process, all the information gained (e.g., “fapa” really should be spelled “paba”) needs to be put back into the database — or else it will just remain on those cards. Even if we write up a summary of selected cards in a report, the wordlist/dictionary won’t improve, unless we can get the information off those cards, and back into the computer — which again is not hard, but it is time consuming, and prone to error.
Which got me thinking, would there be a way to simplify the round-trip, and make the same/different decisions in a way that the spelling change information would be immediately returned to the database? As nice as the simple card-sorting interface is, if we could make something (at least nearly) as simple on a computer, I know our colleagues would be up to it. If we could make it simple. Which got me thinking about orthography and WeSay.
Anyway, I’ve written up some thoughts on how this might work practically, which you can find here.

Wesay Orthography Proposal

I think it would be helpful to language communities to have an orthography development/checker tool in WeSay. For why I think that and what I’d want to do with it, see this other post. Here’s what I would like to see:
Config:Verification
Making the task simple for the user will necessarily require other complexities. We would need a config page, that would specify which words we’re looking for. It would basically be a filter for
1. Part of speech (maybe just from a list of noun, verb, and other)
2. Syllable type (e.g., px-CVCV, px-CVC, etc. –I’m not sure how hard this would be)
2b. position in the root: (C1, C2, V1, V2)
3. Consonants [] or Vowels [] (we just look at one or the other at a time)
4. Consonants available (for later selection by user)
5. Vowels available (for later selection by user)
6. Syllable types available (Options to choose from for dropdown in #2. For ortho development, this could contain just one or two canonical types; for full dictionary checking it would require more. I’m not sure how savvy WeSay could be at this point. It might be wise to split this into two groups, one for nouns, and another for verbs –or more.)

Based on the above information, the user is asked which letter to verify, and then would be presented with a “Verification” page where he would decide whether the sound in the position in question on each word is the same (or not) as others that are marked with the same letter (in the same position of the same types of words).
Verification
At the top of the page would be a new word, with LWC glosses and a picture, if available, and a list of words below it (all of which are filtered according to the settings on the config page, so we’re only looking at sounds/letters in one part of the same type of word) If it were simple to bold the letter in question in each word, that would be great, but not necessary to the task. In any case, it would be good to prominently display the letter under investigation, and more (very?) subtly the part of speech, syllable profile and root position. At the top of the page are two buttons: “Yes” and “No,” and at the bottom of the page are “Recheck” and “Verification OK/Next Letter”
The user task is to understand and pronounce the word at the top of the page, and decide if it belongs with the rest. If it does, he clicks “Yes”; otherwise, “No.” Once a button is clicked, another word is presented (If a mistake is made it will be sorted out later). Once the user has gone through all the input words and he is satisfied that the list of words all belong in the same group (if he isn’t sure, “Recheck” would redo the list –with just the “Yes” words), he clicks “Verification OK/Next Letter” and the following is done behind the scenes:
1. the letter (e.g., ‘p’, ‘b’, or ‘f’) in the place in question (e.g., first consonant) is changed to the letter being verified (e.g., ‘p’) for each word on the “Yes” list, if it isn’t already that letter.
2. those words are marked as verified for that letter (Is this possible? We could track progress elsewhere, if need be).
The user is then returned to the dialogue asking which letter to compare (next, and given the option to quit/pause). The same page is presented, but with words that had been taken out of the last group (i.e., “No” was clicked) along with words with the new letter, and the process repeats. If at any time the user terminates the task, it would be nice if the words that had been put in a box they might not belong in (i.e., marked “No” to ‘p,’ but not “Yes” to anything else yet) could be marked as unverified orthography (for that letter?). Then, when this task is started up again, it can start where it was left off.

At any point the user says, “give me the words with letter ‘x’,” he will get words that have been verified, and words that didn’t belong somewhere else (unless there were no “No” words on the last run). Thus, if there is a word that was wrongly put into a group (bad user input), a letter can always be re-verified (One can go back and click “No” on a word that one had just accidentally clicked “Yes” on). In any case, a spelling/letter is not corrected without comparing it to a list of words using the same letter in the same place of the same type of word. (I’m presuming good/new/consistent orthographies here, not ones like English, where a list of correctly spelled words would be more appropriate.)
Occasionally the process would need to be stopped to modify words that had been wrongly indicated for part of speech, root or syllable profile –unless the user doesn’t mind continually hitting “No,” or there is some other way of excluding them from the sort (maybe “Yes,” “No,” and “Wrong Category”?)
Once all the letters have been gone through for a given syllable type/position and part of speech, the settings on the config page could be changed, and the next set of words could be begun. If we put all but the letter to investigate on a hidden config page, then this would be just a one page task. Getting through all the letters in a given position can take awhile, so changes to those config settings (other than the letter to verify) wouldn’t need to be done all that often.

I originally had a “sorting” page before the verification page, since that is the way we do it with cards, but I think we could use a simple binary sort, as is present in the “verification” page, from the beginning. With cards on a table, a given sort might come up with a number of different piles (i.e., if there were m’s, f’s, and ɸ’s mixed up with p’s and b’s, for some reason. Or more likely, if there was under-differentiated vowels, and five vowels become nine.) But I think in either case, a binary sort would do, it would just take a bit longer (a few more sorts) to get each record in it’s proper pile.

Feasibility
I’m not sure how feasible this idea is; it is probably more like a “purple elephant” wish, but it would be nice to have, if possible. Without any real knowledge of the inner workings of WeSay, I imagine that knowledge of CV structure might be the largest hurdle to this idea’s implimentation. Normally when I collect words, I collect a couple forms: sg/pl for nouns, and infinitive/imperative for verbs, or whatever two forms will give me root structure information (a bit of preliminary research may be necessary to know what will work/help). While I can add fields for plural, infinitive and root in WeSay, there is no task to “add forms”, so giving WeSay information on syllable structure would be a more consultant-oriented task, using “Dictionary Browse and Edit”. Unless we had “Add forms: plural” and “Add forms: imperative” and then “Add Root forms”:
Add forms: plural:
word:
plural:
Add forms: imperative:
word:
imperative:
Add Root forms:
(in config:
noun root parsing fields to compare:
[ ] and
[ ]
verb root parsing fields to compare:
[ ] and
[ ])
field 1:
field 2:
root: (this might even be guessable?)

But there are probably other hurdles that I would be completely unaware of. Anyway, this is my idea.

EDIT:
Actually, looking back at our proceedures, we would like an ability to filte on words with V1=V2 (at least), and also C1=C2, if possible.
Then normally we would look at all vowels in V2 position with V1=a (for instance), then V1=i, etc.
This makes the investivation process much more straightforward (paka is more clearly ‘a’ than paki), though perhaps it would impossibly complicate the innards of the task, were they ever doable.

Discourse Workshop Finished

We just finished a workshop oriented towards the study of discourse properties, originally in several Central Soudanic languages of DRC, but also included was one Nilotic language.  I worked with Lese (Ethnologue: les), a language where I studied tone for some six months of this last year.

One of the highlights for the team was finally (I hope) getting the aspect distinctions of their language. That is, their basic verbal contrast is between something like perfective and non-perfective aspects, though all their formal education is in French, so they have been thinking in terms of past, present, and future tenses.  One morpheme they had glossed as ‘future,’ but from a survey of texts, only one of seven examples referred to certain, future action.  The rest were states or actions that would result from a choice that  was not, in fact, taken (either past, present, or future), or else a desired future action, which was never in fact, accomplished. So, not to weird to see aspectual and modal marking mapped onto tense functions, but it was good to see some of interpretationsthe team catching on to the grammatical distinctions in their language, and how it differs from French.