Creating Custom Fields in WeSay 0.9.28.0 for FieldWorks 7.0.5~beta5

I’ve been working with custom fields in FLEx and WeSay enough to feel the need to figure out what is really going on. The goal is to be able to straightforwardly create custom fields in one or the other that are editable and round-trip-able in the other. To do this, I’m going to look into the interface of each program, and see what impact adding fields has on the LIFT (and config, for WeSay) file. Today I’m making a field in WeSay, and seeing what it looks like there, and then in FLEx.

The WeSay Configuration Tool

The WeSay config tool looks like this (once you click on ‘Fields’ then ‘New Field’):

Once you save and exit, you get a section under the <fields> node in the WeSayConfig file that looks like this:

<field>
  <className>LexEntry</className>
  <dataType>MultiText</dataType>
  <displayName>*newField</displayName>
  <enabled>True</enabled>
  <fieldName>newField</fieldName>
  <multiParagraph>False</multiParagraph>
  <spellCheckingEnabled>False</spellCheckingEnabled>
  <multiplicity>ZeroOr1</multiplicity>
  <optionsListFile></optionsListFile>
  <visibility>Visible</visibility>
  <writingSystems>
    <id>en</id>
    <id>fr</id>
    <id>hav</id>
  </writingSystems>
</field>

Adding Data in WeSay

Returning to WeSay, one can add some bogus info to this field in one of the records:

Closing out WeSay and looking at the LIFT file, we see the following under this entry (between <lexical-unit> and the first <sense>):

<field type="newField">
  <form lang="fr">
    <text>BogusNewfield</text>
  </form>
</field>

What this Means

Putting this all together, we see that

  1. The ‘Name in file’ from the WeSay Config Tool corresponds to the field/fieldName node in the WeSayConfig file.
  2. Both of the above correspond to the LIFT entry/field ‘type’ attribute (once data is entered):
    ‘Name in file’ = (xyz.WeSayConfig)/configuration/components/viewTemplate/fields/field/fieldName = (xyz.lift)/lift/entry/field/@type
  3. ‘Name for display’ from the WeSay Config Tool is the label the WeSay user sees on the field, which corresponds to the contents of the field/displayName node, i.e., (.WeSayConfig)/configuration/components/viewTemplate/fields/field/displayName
  4. Therefore, the name a WeSay user sees for a field will not necessarily relate to anything in FLEx. The mapping between the WeSay label and the LIFT field lives in the WeSayConfig file (which FLEx doesn’t see), not in the LIFT file, which is what FLEx imports. So in setting up custom fields, we need to pay attention to what the config tool says for the ‘Name in file’, not the ‘Name for display’. (Note that it is ‘*newField’, not ‘newField’, in the WeSay user interface. The asterisk, which is visible in WeSay, appears only in displayName in the WeSayConfig, not in fieldName from the WeSayConfig or field/@type from the LIFT file.)
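To make the correspondence concrete, the cross-check below lists every field type used in a LIFT file and flags any with no matching fieldName in the WeSayConfig. It’s a minimal sketch in Python (standard library only), and the file names are placeholders for your project’s files.

# Minimal sketch: cross-check WeSayConfig fieldName entries against the
# field/@type values actually used in the LIFT file.
import xml.etree.ElementTree as ET

config_fields = {
    f.findtext("fieldName")
    for f in ET.parse("xyz.WeSayConfig").getroot().iter("field")
}
lift_types = {
    f.get("type")
    for f in ET.parse("xyz.lift").getroot().iter("field")
    if f.get("type")  # skip field elements without a type attribute
}
print("Used in LIFT but not defined in the config:", lift_types - config_fields)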

Importing to FLEx

I was happy to see that the field created in WeSay shows up under FLEx custom fields (after importing the WeSay LIFT file):

Note that Location, Type, and Writing System(s) are all grayed out. There may be some way of modifying these settings in FLEx once they have been set in WeSay, but it isn’t obvious at first glance. Here is the field in the lexicon editor:

I had to select ‘Show Hidden Fields’ to be able to see it the first time for some reason. But then I deselected it, and the field remained visible.
Note that the label in FLEx is ‘newField,’ without the asterisk, which comes from the type attribute of the field in the LIFT file. As far as I can see, there is no distinction between file and display names in FLEx. This is appropriate for at least the following two reasons:

  1. FLEx seems to deal fine with spaces in field names (I’ve had problems with this in WeSay).
  2. FLEx users should be able to handle whatever complexity the field names throw at them. WeSay, on the other hand, needs to control carefully what the user sees, and its relationship to the LIFT field in question. For instance, the form in lexical-unit in a LIFT file is displayed as “Word” by default in WeSay, since people are putting words into it. But when I analyze those words into roots, it is nice to be able to change that field’s display name to “Root” in WeSay, without having to change the underlying LIFT structure. This flexibility of the display name can help keep the WeSay user from getting confused without unnecessarily complicating the database.

Notes for Creating fields in WeSay to be imported to FLEx

  1. Pay attention to ‘Name in file’ in the WeSay Config Tool, since that will be what the field will be called in the LIFT file, and in FLEx (and presumably in other programs that would use LIFT).
  2. You may need to click on ‘Show Hidden Fields’ to see the field in FLEx.
  3. There doesn’t seem to be a way to put fields anywhere other than in the ‘Custom Fields’ section of FLEx, so I hope that’s where you want it (if not, stay tuned for the next installment, going the other way).

WeSay Instructions

When I started working with WeSay in BALSA, it became clear that the people I’m working with were still going to need either a lot of hand-holding, or some instructions. So I wrote these down, and have massaged them a little (not everything was as clear as it could have been), and hope I have something highly (but not completely) fool-proof. But your mileage may vary. I’m submitting them in case someone finds them useful.

Instructions.pdf
Instructions_fr-FR.pdf

Proposal revisited

I’ve written before about using WeSay to collect language data, and at one point I even wrote up a proposal for features I think it would be nice to have there, specifically targeted at orthography development, though the same tools could be used for spell checking (lining up similar-profile words, that is, not coming up with a list of correctly spelled words). Anyway, it doesn’t look like that proposal is going anywhere, so I thought I’d give it another try.

The Basic Problem

  1. I’m working in an increasingly large number of languages (five people representing three languages were in my office most of this morning), and I expect that trend to continue (looking toward the 60 unwritten languages in eastern DRC).
  2. I am one person, and can only work on one language at a time (however often the language may change throughout the day).
  3. The best tool we have for sorting and manipulating large amounts of lexical data (FLEx) is great at what it does, but is inaccessible to most of the people I work with (it was made for linguists, after all. :-))
  4. So I am left with doing all the work with each guy in FLEx (see #2), or else finding a tool that can be more easily used by the people in #1.

Getting from Data collection to an Orthography

For a number of tasks (word collection), WeSay does exactly what I want. Since most of these people are touching a computer for the first time, let’s get rid of as much of the complexity and room for error as possible. Do one thing at a time, in a constrained environment. But once I’ve collected a wordlist in WeSay, I still need to

  1. Parse the roots out of the forms collected. (I can collect plural forms in WeSay now, but we still need to get the full word forms into a field meant for full word forms (e.g., <citation>), and just roots in the lexical-unit field.) In case it isn’t obvious, a lot of the phonological analysis depends on the structure of the root: the first root consonant is more important to our analysis than the first word consonant, so we need to be able to sort/filter on it.
  2. Sort and filter the word forms, so we see just one kind of thing at a time. We don’t want to check whether kupaka and kukapa have the same ‘p’, since the difference in word position might cause a difference that is irrelevant to the phonemic system (English speakers pronounce p’s differently in different environments, but use one letter for them all; did you know?). A minimal filtering sketch follows this list.
  3. Go through each controlled list of words, to see where the current writing system (we use a national language to get us started) is not making enough distinctions, and where it is making too many.
  4. Mark changes on the appropriate words, returning the corrected information to the database.
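As promised in item 2, here is what that kind of filtering could look like over a LIFT file. This is a minimal sketch in Python; the pattern (CVCV ku- words with a root-initial ‘p’) and the ‘hav’ writing-system code are illustrative assumptions.

# Minimal sketch: print just the forms we want to compare. With this
# pattern, 'kupaka' makes the list but 'kukapa' does not, since its
# root-initial consonant is not 'p'.
import re
import xml.etree.ElementTree as ET

pattern = re.compile(r"^ku(p[aiɨuʉeɛoɔʌ][ptk][aiɨuʉeɛoɔʌ])$")

for entry in ET.parse("xyz.lift").getroot().iter("entry"):
    text = entry.findtext("lexical-unit/form[@lang='hav']/text")
    if text and pattern.match(text):
        print(text)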

While FLEx can handle all these tasks fine (however slowly at times), if I’m going to help other people move forward in their own language development, I need to find another tool (or tool set). I have these guys quite happily working in WeSay (yes, there are kinks, but we are often working 3-4 times as fast as if I were typing everything myself, since there are more of us typing), but I need to get ready for these next steps in a way that allows us to build on this momentum, rather than telling everyone but one team to go home until I have more time.

An Attempt

So I’ve been playing with XForms lately (for a lot of other reasons), and I’ve toyed with the idea of getting at and manipulating the LIFT file with another engine. The idea of being able to write a simple form to control the complexity of the underlying XML, and to manipulate it and save back to XML, was very exciting. I even have one form (for collecting noun class permutation examples) deployed. Then I read the post describing the death of the Firefox XForms extension, and I thought surely there must be a better way to do this, and I’m sure someone out there knows what it is. So I’ll spend some time outlining exactly what I’d like to see, and maybe someone will know what to do to make it happen.

A Proposal

I took some screenshots of some XForms I did, displayed in Firefox. Here’s the first form, for parsing roots (Havu is the name of the language I’m using to test this, and its speakers are among the next who will be looking for this to work):

The important aspects are

  1. The prefix of the original citation form (sg here), which is used to filter the database, allowing us to work on words that are (probably) of just one noun class at a time.
  2. A number of possible plural prefixes, to control the potential output forms.
  3. The originally input word form, with gloss (where present), to clearly identify each word.
  4. A number of buttons which allow a non-linguist to simply push a button to input the plural form and parse the root (probably including a “none of the above,” which would skip processing for this word).
  5. Under-the-hood processing which would, on a click:
    1. copy the original form into a citation form field in the database (potentially a new node in the LIFT XML),
    2. remove the prefix from the original form, and
    3. input the plural form into a field for the plural form (again, potentially a new node in the LIFT XML).

It would probably be better to show the user one word at a time, rather than the list in this screenshot, but I included it all here to show how the removal of the prefix, and the application of the new prefix, would need to apply to the form of each entry. Also, it would be nice if the form could be adapted for suffixing languages.
What I had so far for an example trigger (in XForms) would be

<xf:trigger>
  <xf:label>
    <xf:output value="concat('mo', substring(lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')], 4))"/>
  </xf:label>
  <xf:action ev:event="DOMActivate">
    <xf:insert context="." origin="instance('init')/citation"/>
    <xf:setvalue ref="./citation/form/text" value="lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')]"/>
    <xf:setvalue ref="lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')]" value="substring(lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')], string-length('aka') + 1)"/>
    <xf:send submission="Save"/>
  </xf:action>
</xf:trigger>

As you can see, the value of the prefix is hard-coded here, since I haven’t been able to get variables to work. Also, the setvalue expressions don’t really behave (neither node creation nor setting the right value for an existing node works). It’s hard to tell what is a limitation of XForms and what is a limitation of the Firefox extension; I tried another XForms renderer, but no luck so far… Needless to say, this is not what I do best, so help, anyone?
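In the meantime, the same three-step update can at least be sketched outside XForms. Below is a minimal sketch using Python’s ElementTree; the ‘aka’/‘mo’ prefixes and the ‘hav’ writing system are the same illustrative assumptions as in the trigger above, and the ‘Plural’ field name and file names are placeholders.

# Minimal sketch of the three steps described above, over a LIFT file.
import xml.etree.ElementTree as ET

SG, PL, LANG = "aka", "mo", "hav"

tree = ET.parse("xyz.lift")
for entry in tree.getroot().iter("entry"):
    text_el = entry.find(f"lexical-unit/form[@lang='{LANG}']/text")
    if text_el is None or not (text_el.text or "").startswith(SG):
        continue
    word = text_el.text
    # 1. copy the original form into a (new) citation node
    cit_form = ET.SubElement(ET.SubElement(entry, "citation"), "form", lang=LANG)
    ET.SubElement(cit_form, "text").text = word
    # 2. remove the prefix from the original form, leaving the root
    text_el.text = word[len(SG):]
    # 3. record the plural form in a field of its own
    pl_form = ET.SubElement(ET.SubElement(entry, "field", type="Plural"), "form", lang=LANG)
    ET.SubElement(pl_form, "text").text = PL + word[len(SG):]
tree.write("xyz-parsed.lift", encoding="utf-8", xml_declaration=True)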
The next form I’d like would set ATR values for whole words (usually the harmony around here is fairly strict, so it would help with a lot, but not all, of the vowel questions):

This form is similar to the above, in that I’m looking for a simple regular expression (or a more complicated one, if possible :-)) to control the data we’re looking at at once, and a binary choice for which vowel group the word belongs to (showing the new word form on the button). A choice of one or the other would set the word form accordingly.
The same caveats as above apply about the list of words on a page, and there should probably be a “neither” button for when a word doesn’t obey strict vowel harmony, which would bypass processing for that word.
The trigger I had so far (for the -ATR button, reverse for the +ATR) was

<xf:trigger>
  <xf:label>
    <xf:output value="translate(lexical-unit/form[@lang='hav']/text, 'aeiou', 'aɛɨɔʉ')"/>(-ATR)
  </xf:label>
  <xf:action ev:event="DOMActivate">
    <xf:setvalue context="." value="translate(lexical-unit/form[@lang='hav']/text, 'aeiou', 'aɛɨɔʉ')"/>
    <xf:send submission="Save"/>
  </xf:action>
</xf:trigger>

The translation includes a>a, but could be modified depending on what /a/ does in a given language (especially if there are 10 vowels).
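As a sanity check, the same mapping is a couple of lines of Python, with the same illustrative vowel pairing as in the trigger above:

minus_atr = str.maketrans("aeiou", "aɛɨɔʉ")  # a>a by design; adjust per language
print("kuseka".translate(minus_atr))  # -> kʉsɛka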
The third and final form, where we would spend the bulk of our time, might look like this:

Here we have:

  1. The ubiquitous regex to control the data we’re looking at.
  2. The letter that is being evaluated (we’re only analyzing one written form, be it digraph or not, at a time).
  3. The position to find that letter (something like first or second in the root would probably be good enough).
  4. The options for replacing it (buttons labeled with the new forms), in case a word in the list uses a sound that doesn’t sound the same as the sound in that position in the rest of the words in the list.

Again, a choice of new word form would write the new form to the database (at which point the word would likely disappear from the list, unless the regexp matched the new form of that word). It would be nice if the same replacement could be made in the citation and plural forms, presuming the same sounds and written letters would apply. I’m not sure if it would be necessary to devise two forms, one for consonants and another for vowels. It would depend (at least) on how the regex worked, since the underlying principle is the same for consonants and vowels: look at one thing in one position at a time, and mark each one as the same or different, compared to the other things on the list.
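To make the replacement step concrete, here is a minimal Python sketch; the filter pattern, the position (second root consonant), and the replacement letter are all illustrative assumptions.

# Minimal sketch: replace the letter under evaluation at one position in
# the root, but only in words that match the current filter.
import re

filter_re = re.compile(r"^ku(p)([aiɨuʉeɛoɔʌ])(p)([aiɨuʉeɛoɔʌ])$")

def respell(word: str, group: int, new: str) -> str:
    m = filter_re.match(word)
    if not m:
        return word  # not on this list; leave it alone
    start, end = m.span(group)
    return word[:start] + new + word[end:]

print(respell("kupapa", 3, "b"))  # second root consonant: -> kupaba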

Specifications

Some things we would need to make this tool useful:

  1. Read and modify LIFT format in a predictable and non-destructive way (making all and only the changes we’re looking for, to the fields we’re looking at, leaving everything else alone). This is the format we’re keeping all our lexical data in, and we need to play nice with a number of other programs that use the same data structure standard.
  2. Use of regular expressions. For the root parsing tool, a simple prefix (or suffix) filter would do, but for the others it would be nice to be able to constrain syllable type, as well as position in the word (i.e., kupapata should not appear on the same list as kupapa, even if we’re looking at the first ‘p’ in the root, since kupapata has a longer root). In FLEx, I use expressions like these (more included below), though a simpler format could do if that kind of power were not possible. Ideally, these expressions would be put into a config file, and the user would only see the label (I have these all done, and can come up with more if needed).
  3. Cross-platform. We run most of our work on BALSA, so it would need to be able to run on at least Linux, which provides BALSA’s OS.
  4. Simple UI. It is probably not possible to overstate the computer illiteracy we are dealing with here. People are eager to learn, and often capable of learning, but the less training we need, and the less room a given task has to screw everything else up, the better (WeSay‘s UI is a great model).
  5. Shareable. Even if not open source, any tool we might use here needs to be something we can legally put on every computer we and our colleagues use, and our colleagues don’t often have the money or internet access to buy licenses.
  6. Supportable. As hard as we try to keep the possibility of errors out of our workplace, errors happen. Today we had two technological problems, each of which required a significant amount of my time. If I weren’t here (or if the problems had been beyond me), the teams would have been stuck. The simpler and more accessible (or absolutely error-free!) the technology is under the hood, the more likely someone local will be able to deal with problems that arise in a timely manner.

Anyway, there it is.

Examples of Useful Regular Expressions for Filtering Lexical Data

The following are output by a script, which takes as input the kinds of graphs (e.g., d, t, ng’, and ngy) used in a given language’s writing system. For instance, these expressions do not allow ‘rh’ as a single consonant, but those I did for another language do. Similarly, these are based on ten particular vowel letters, which could also be changed for a given language.

^([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})$

(all CVCV; short vowels only)

^([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

(all CVCV; long vowels and diphthongs OK)

^([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})\1\2([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

(all CVCV with C1=C2, counting prenasalization)

^([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})\3([́̀̌̂]{0,1})$

(all CVCV with V1=V2; no long Vs or diphthongs)

^([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})(a)([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})\3([́̀̌̂]{0,1})$

(all CVCV with V1=V2=a)
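A quick way to sanity-check these (and the \1\2 backreference) is to run one against a few test words. Here is a minimal Python sketch, with the combining diacritics written as escapes for legibility:

# The C1=C2 pattern above: \1\2 forces the second consonant (with any
# prenasalization) to repeat the first.
import re

c1c2 = re.compile(
    r"^([mn]?)([ptjfvmlryh]|[bdgkcsznw][hpby]?)"
    r"([aiɨuʉeɛoɔʌ]{1,2})([\u0301\u0300\u030c\u0302]?)"
    r"\1\2([aiɨuʉeɛoɔʌ]{1,2})([\u0301\u0300\u030c\u0302]?)$"
)
for w in ["papa", "mbamba", "paba"]:
    print(w, bool(c1c2.match(w)))  # papa True, mbamba True, paba False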

New WeSay Tasks II: Parsing and Tone

Having found out how to configure and add new tasks in WeSay, I can now make some serious changes to my workflow (read: I can actually have people do some of their own data entry and parsing!)

The following task is a follow-up to the one presented here, where the root and meaning were visible but not editable, helping a non-linguist focus on adding plural forms to his dictionary based on those fields. In this next task, the plural form provides a basis for parsing the root, so the plural and definition fields are visible, but the only editable fields are the root (EntryLexicalForm, default English label ‘Word’) and the citation form. This allows the non-linguist to copy the form from the root field to the citation form field, and then remove affixes from the root.
This task is slightly unideal in that it doesn’t know whether a word has been parsed yet. It checks for the presence of a citation field (which is usually only created in this kind of process), but if you have a mono-morphemic word that is pronounceable as is (where standard lexicography practice would be to not add a citation field), it will think that word is still “to do.”
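That limitation is easy to see if you sketch the task’s “to do” test over a LIFT file (a minimal illustration in Python; the file name is a placeholder):

# An entry counts as unparsed when it has no citation element -- including
# mono-morphemic words that never need one.
import xml.etree.ElementTree as ET

root = ET.parse("xyz.lift").getroot()
todo = [e for e in root.iter("entry") if e.find("citation") is None]
print(len(todo), "entries would show up in the Parse Roots task")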

Here is my new root parser task:

<task
  taskName="AddMissingInfo"
  visible="true">
  <label>Roots</label>
  <longLabel>Parse Roots</longLabel>
  <description>Move citation info to Citation form field, leave roots in root field.</description>
  <field>citation</field>
  <showFields>EntryLexicalForm, citation</showFields>
  <readOnly>Plural, definition</readOnly>
  <writingSystemsToMatch />
  <writingSystemsWhichAreRequired />
</task>

While I was at it, I made another task to help populate the “Tone” field. Too bad I can’t get WeSay and FLEx to agree on a writing system to restrict the characters, so we’re likely to get a lot of malformed data here. But it should be a place to start. Anyway, here it is:

<task
  taskName="AddMissingInfo"
  visible="true">
  <label>Tone</label>
  <longLabel>Add Tone Info</longLabel>
  <description>Input tone information for each record.</description>
  <field>Tone</field>
  <showFields>Tone</showFields>
  <readOnly>Plural, citation, definition</readOnly>
  <writingSystemsToMatch />
  <writingSystemsWhichAreRequired />
</task>

This gives me a new desktop:

And a new parsing task:

And a new Tone input task:

And the nice thing about tasks in WeSay is that you can enable just the ones relevant to the task at hand, to simplify the UI. Which means that I can put all these in my new default project, and leave them disabled (<task taskName="AddMissingInfo" visible="false">, or unticking them in the config) until we need them.
I’ve already shared my dream, but in the meantime, this gets us a lot closer to doing low-level language development in WeSay.

New WeSay Tasks

I found out today that the tasks in WeSay are more configurable than I had thought. I add a “Plural” field to any language project I work in, but until today we have been stuck inputting that either in FLEx (i.e., by me) or in “Dictionary Browse & Edit,” which is unideal at best.
Looking at the config file (in a text editor — make a copy first, etc.!), there are a number of <task/> nodes, several of which have the taskName attribute of “AddMissingInfo.” This appears to be a generic task (unlike some of the others), that comes with a number of options, and can be used multiple times.
In addition to the labeling, there is a field it checks for (if that field is populated, the record doesn’t show in the task), fields it shows as editable, and fields that are shown but not editable. I don’t know what the last two nodes (writingSystemsToMatch and writingSystemsWhichAreRequired) do, since they are always empty (in all the tasks in my config, anyway). I found out (by trying) that if you tell it to look for a field that isn’t there, it will crash (so make your new fields first, then the tasks to add them). I think if you tell it to display a field that isn’t there, it simply won’t. Here is my task node for the new “Add Plurals” task from my config file:

<task
  taskName="AddMissingInfo"
  visible="true">
  <label>Plurals</label>
  <longLabel>Add Plurals</longLabel>
  <description>Add plural forms to entries where they are missing.</description>
  <field>Plural</field>
  <showFields>Plural</showFields>
  <readOnly>definition, EntryLexicalForm, citation</readOnly>
  <writingSystemsToMatch />
  <writingSystemsWhichAreRequired />
</task>
Here is the new desktop:

and here is the page for the new task:

Getting FieldWorks lexical data into XLingpaper

I’ve written before about using WeSay to collect language data, and WeSay LIFT files can be fairly easily imported into FieldWorks Language Explorer (FLEx) for analysis. Recently, I’ve been working on getting data from the FLEx lexicon into XLingpaper, to facilitate the writing of reports and papers that can be full of data (which is the way I like them…:-)).
I start with a lexicon (basically just a word list) in FLEx that has been parsed for root forms (go through noun class categorization and obligatory morphology with a speaker of the language). Figure out the canonical root syllable profile (e.g., around here, usually CVCV), and look for complementary distribution and contrast within that type, both (though separately) for nouns and verbs.
I have a script that puts out regular expressions based on what graphs we expect to use (those in Swahili, plus those we have added to orthographies in the past; since we start with data encoded in the Swahili/Lingala orthography, this covers most of the data that we work with). This script puts out expressions like

^bu([mn]{0,1})([ptjfvmlryh]|[bdgkcsznw][hpby]{0,1})([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

which means that the whole word/lexeme form (between ^ and $) is just b, u, some consonant (one of m or n, or not, then a consonant letter that appears alone, or the first and second letters of a digraph), then some vowel (any of ten basic ones, long or short, with diacritics or not). In other notation, it gives buCV, or canonical structures with root-initial [bu]. This data is paired with data from other regular expression filters giving [b] before other vowels, to show a complete distribution of [b] before all vowels (presumably…).

The script puts out another expression,

^([mn]{0,1})([ptjfvmlryh]|[bdkgcsznw][hpby]{0,1})(a)([mn]{0,1})([ptjfvmlryh]|[bdkgcsznw][hpby]{0,1})\3([́̀̌̂]{0,1})$

which gives me CaCa, as in the following screenshot:

(The \3 refers back to the third set of parentheses, (a), so that changing (a) to (i) gives you CiCi.) The data from these filters gives evidence of the independent identity of a vowel, as opposed to vowels created through harmony rules.
So these regular expressions allow filtering of data in the FLEx lexicon to show just the data needed to prove a particular point (in my case, why just these letters should be in the alphabet). But then, how do you get the data out of FLEx and into a document you’re writing?
FLEx has a number of export options out of the box, but none of them seem designed for outputting words with their glosses, based on a particular filter/sort of the lexicon. In particular, I’m looking for export into a format that can be validated against an XLingpaper DTD, since I use XLingpaper XML for most of my writing, both for the archivability and longevity of my data, and for cross-compatibility in differing environments (there are also existing stylesheets to turn XLingpaper docs into HTML, PDF, and usually word processor docs, too). The basic XML export of the data on the above sort starts like this:

<?xml version="1.0" encoding="utf-8"?>
<ExportedDictionary>
  <LexEntry id="hvo16380">
    <LexEntry_HeadWord>
      <AStr ws="gey">
        <Run ws="gey">bana</Run>
      </AStr>
    </LexEntry_HeadWord>
    <LexEntry_Senses>
      <LexSense number="1" id="hvo16382">
        <MoMorphSynAnalysisLink_MLPartOfSpeech>
          <AStr ws="en">
            <Run ws="en">num</Run>
          </AStr>
        </MoMorphSynAnalysisLink_MLPartOfSpeech>
        <LexSense_Definition>
          <AStr ws="en">
            <Run ws="en">four (4)</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="fr">
            <Run ws="fr">quatre (4)s</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="swh">
            <Run ws="swh">nne</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="pt">
            <Run ws="pt">quatro (4)</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="es">
            <Run ws="es">cuatro</Run>
          </AStr>
        </LexSense_Definition>
      </LexSense>
    </LexEntry_Senses>
  </LexEntry>
  <LexEntry id="hvo11542">
    <LexEntry_HeadWord>… and so on…

But this is way more information than I need (I got most of these glosses for free by using the CAWL to elicit the data), and in the wrong form. The cool thing about XML is that you can take structured information and put it in another structure/form, to get the form you need. To do this, I needed to look (again) at XSL, the extensible stylesheet language, which had successfully intimidated me a number of times already. But with a little time, energy, and desperation, I got a working stylesheet. And with some help from Andy Black, I made it simpler and more straightforward, so that it looks like XLPMultipleFormGlosses.xsl looks today. Put it, and an XML file describing it, into /usr/share/fieldworks/Language Explorer/Export Templates, and it becomes a new export process from within FLEx. To see the power of this stylesheet, the above data is now exported from FLEx as

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xlingpaper PUBLIC "-//XMLmind//DTD XLingPap//EN" "XLingPap.dtd">
<xlingpaper version="2.8.0">
  <lingPaper>
    <section1 id="DontCopyThisSection">
      <secTitle>Click the [+] on the left, copy the example that appears below, then paste it into your XLingpaper document wherever an example is allowed.</secTitle>
      <example num="examples-hvo16380">
        <listWord letter="hvo16380">
          <langData lang="gey">bana</langData>
          <gloss lang="en">four (4)</gloss>
          <gloss lang="fr">quatre (4)s</gloss>
        </listWord>
        <listWord letter="hvo11542">

which contains just enough header/footer to validate against the XLingpaper DTD (so you don’t get errors opening it in XMLmind), and the word forms and just the glosses I want (English and French, but that is easily customizable in the stylesheet). The example node can be copied and pasted into an existing XLingpaper document, which can then eventually be transformed into other formats, like the PDF in this screenshot:

which I think is a pretty cool thing to be able to do, and an advance for the documentation of languages we are working with.
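For the curious, the gist of what the stylesheet does can be sketched in a few lines of Python. This is an illustration only (the real export uses XLPMultipleFormGlosses.xsl); the input file name is a placeholder.

# Sketch of the mapping: ExportedDictionary -> XLingpaper listWord
# elements, keeping the headword plus the English and French definitions.
import xml.etree.ElementTree as ET

src = ET.parse("export.xml").getroot()
for entry in src.iter("LexEntry"):
    head = entry.find("LexEntry_HeadWord/AStr/Run")
    if head is None:
        continue
    word = ET.Element("listWord", letter=entry.get("id"))
    lang_data = ET.SubElement(word, "langData", lang=head.get("ws"))
    lang_data.text = head.text
    for d in entry.iter("LexSense_Definition"):
        run = d.find("AStr/Run")
        if run is not None and run.get("ws") in ("en", "fr"):
            ET.SubElement(word, "gloss", lang=run.get("ws")).text = run.text
    print(ET.tostring(word, encoding="unicode"))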

WeSay Wrapper

As happy as I am with WeSay, it is designed with a mindset of one user working on a given computer. It can work on a number of different languages (one at a time, of course!) out of the box, but it isn’t straightforward: you have to navigate to the WeSay folder and click on the .lift file each time you want to work in a language other than the one WeSay opened last. Since one of the goals of working in BALSA is to hide the (often confusing) directory structure from the (novice) user, this is unideal for switching between languages in WeSay on BALSA.

Since I’m doing just that (and a fair bit of switching on my own computer between WeSay in different languages), I wrote a script to invoke WeSay with the arguments calling for a given project each time it runs, so we will never need to think about what the last project was. It has a graphical tool (Zenity) to select which project to open, which is populated by the projects actually on that computer.
It assumes a structure of projects named by Ethnologue code, each in a folder with that name, all inside the same WeSay folder. It also assumes that each project has the full language name defined in the palaso:languageName tag in the language’s .ldml file (if it doesn’t, the script will still work, but the GUI will look off, and the next language will be on the same line).

I have the GUI in English and French, since that’s what we use here. 🙂

For those interested, here it is:
#!/bin/bash
Version=2011.07.28
#set -x
#Note: this assumes a directory structure of WeSay projects named after three-letter ISO/Ethnologue codes.

case $HOSTNAME in
  Balsa*) wsfolder=/home/balsa/WeSay/;;
  *) echo "Is WeSay set up on this computer? If so, update the $0 wrapper.";exit;;
esac
langs=
PWD=`pwd`
cd $wsfolder
for xyz in `ls -d ???`
do
  langs="$langs $xyz"
  langs="$langs `grep --after-context=1 "palaso:languageName" $xyz/WritingSystems/$xyz.ldml|tr -d '\n'| grep -o "<palaso:languageName.*>" |grep -o '".*"'|tr -d '"'`"
done
cd $PWD
lang=`zenity --width=120 --height=400 --list --title "Open/Ouvrir WeSay" --text "Choose language / Choisissez langue:" --column="ISO 639-3" --column="Name/Nom de langue" --multiple $langs`

echo "Ethnologue code entered: $lang"
if [ -f $wsfolder/$lang/$lang.WeSayConfig ]
then
  wesay $wsfolder/$lang/$lang.lift
else
  zenity --error --text "Sorry, please check the code and try again.\n Désolé, SVP vérifier le code, et essayer encore."
  $0
fi
exit

Using computers to Help the Computer Illiterate Develop their Language

I’ve been working with orthography development a bit, and it has been a major challenge getting all the different pieces of the work to fit together well. One of the main divides I see in the work is between people who would use computers exclusively, and others who would use them not at all (during the research process at least). I have never felt comfortable in either camp, in part because I am of the computer generation, but also because I have seen a lot that paper-and-pencil methodology has to offer. Perhaps the most relevant point for our work is that computers are inaccessible to most of our national colleagues. Which means that if I’m doing everything on a computer, I’m doing it by myself. Or I’m teaching people things I started learning in the third grade (they know what the 0/1 symbol is from cell phones, but other than that, I’m usually starting from scratch).
Fortunately for me, there are people working on making linguistic computer work more accessible to the less computer literate, so we can take advantage of computers, without pushing our national colleagues out of the work. One bright shining example is WeSay. Its interface is straightforward and simple. Think of a word and type it in. Give it a short meaning. Next word. Later, you can go back and add longer meanings, other senses, etc. You can work through a wordlist, or use semantic domains — both great ways to help people think of new words to put in their dictionary. I had someone working on it for several days, to bring his wordlist up to over 2,000 words, and we were both quite happy with how it went. I didn’t realize just how happy I should have been until I had the same guy do some other tasks on the same computer (I think it was still just typing, but in OpenOffice). Things immediately bogged down, and I was constantly needed to fix something.
So here’s my problem: I really like dictionaries, and I think a dictionary is a backbone to any other work done in a language. And it just doesn’t make sense to write a dictionary with pen and paper. But the majority of my time goes to developing writing systems, which is better done as a community process (i.e., not everyone standing around watching me use a computer). The first thing I do is collect a wordlist, which I eventually make into the beginnings of a dictionary. But that dictionary is going to need to be put in a database in a standardized writing system, or it will be a mess. So writing system development and lexicography go best together, but how?
Constance Kutsch Lojenga (among others) has developed a participatory linguistic research methodology, which is good at bringing speakers of a language into the language discovery process from the beginning (cf. “Participatory Research in Linguistics,” Notes on Linguistics 73(2):13-27. Dallas, TX: SIL. 1996). Language community members see the sound distinctions in their language at (about) the same time I do, and we get to make writing system decisions as a group, given the linguistic facts we have observed together. The basic discovery process puts two words together and asks, “is the sound in question in these two words the same, or different?” If they are the same, they are put in the same pile; if different, in different piles. There is, of course, lots of background work to be done (like sorting words by syllable profiles, so we’re looking at the same position in the same type of word at a time), but we attempt to make the discovery process itself as attainable as possible, and I have seen it work well.
Which brings me to my dilemma. This methodology as currently conceived uses words written on paper, which are sorted as a group. This means that if I use WeSay to collect a wordlist, I (or some other computer-savvy person) need to export that wordlist into a format that will print onto paper that can be cut into cards (not hard with mail merge and document templates, but it is work), and then, after the sorting process, all the information gained (e.g., “fapa” really should be spelled “paba”) needs to be put back into the database — or else it will just remain on those cards. Even if we write up a summary of selected cards in a report, the wordlist/dictionary won’t improve unless we can get the information off those cards and back into the computer — which again is not hard, but it is time-consuming and prone to error.
Which got me thinking, would there be a way to simplify the round-trip, and make the same/different decisions in a way that the spelling change information would be immediately returned to the database? As nice as the simple card-sorting interface is, if we could make something (at least nearly) as simple on a computer, I know our colleagues would be up to it. If we could make it simple. Which got me thinking about orthography and WeSay.
Anyway, I’ve written up some thoughts on how this might work practically, which you can find here.