Tag Archives: Language Development

WeSay and BALSA: Thanks!

I just finished a trip to Bunia and Nia-Nia, DRC, where I helped the Ndaka [ndk] and Mbo [zmw] communities develop draft alphabet charts and transition primers, the material for each language including all nine vowels (with ATR harmony) and the egressive/implosive stops. The Mbo version also includes the [p]/[ɸ] contrast, as /p/ is normally [ɸ] there. Each booklet includes a short story in the new draft orthography.

I’ve written before about using WeSay to collect language data in highly illiterate language communities, which was a major part of this work.  And since I don’t want to do IT work full time (or rather I have other things to do with my time), I’m using WeSay in BALSA. So since much of this work would not have been possible without the work of many people, especially those working on BALSA and WeSay, I wanted to take a minute to thank them. Without a budget to do so materially, I’ll do that through describing the work here, and explicitly saying that if you work on WeSay and/or BALSA, please feel free to take and use this story and/or pictures in your own publications; this is your story, too.

We met for this workshop in Nia-Nia, DRC, about an hour into the rainforest (by MAF) from Bunia, which is about an hour (again by MAF) from Entebbe, Uganda. We met in a church, with the Ndaka covering most of the workshop logistics, since this is their home turf.  The Mbo are also a bit on the run these days from a militia conflict that hasn’t seemed to end yet. And they’re a smaller and more illiterate people group. So much so that one of the guys on the Mbo team didn’t participate in a dictation exercise, as we were practicing new letters. And yet, here they are, around a BALSA machine, using WeSay:
WeSay̠Ung'inMbo_IMG_4222_sm
Admittedly, the guy touching the computer isn’t Mbo, but he’s helping them deal with the interface, and they’re choosing pictures through the Art of Reading interface in WeSay, which would seem to be even more popular with less literate communities.

This one is of the Ndaka team, using WeSay in BALSA independently:
WeSay_ndk_IMG_4220_sm
They’re also picking images to go with dictionary entries we put there earlier, though some people started modifying entries by the end of the workshop. Lest this seem trite, let me point out that this was the first time that ANY of them had used a computer of any kind. For some, it was the first time they had seen one. You can see in the background the current stabilizer, which is plugged into the generator we used to have electricity. Without the stabilizer, I wouldn’t plug in a computer, because of the risk of unstable current. After the stabilizer, we put a fridge guard. When I can, I put a second stabilizer in series, to even out what current irregularities the first one doesn’t catch. Which is all to say that this is not the most computer-friendly place, even after the alternating dust and humidity, and the heat. But these guys took to the tasks, and were able to work somewhat independently on computers for the first time. Having tried this in similar contexts with other software, I attribute this success entirely to WeSay and BALSA. Thanks, guys, for making that possible.
Here is the same team from another angle:

WeSay_ndk_II_IMG_4223_sm
And here is one with other people hanging around, showing that this is truly a community affair:

WeSay_ndk_kids_IMG_4224_sm
So this workshop was a success in part because people who had never used computers before (including the elementary school principal, shown in the background of this last pic), were able to get up and running in very little time, with very few frustrations.  They even enjoyed the work so much, I had to kick them out several evenings, after it was already too dark to walk home. So thanks to everyone involved, for your part in making this happen.

Creating Tone fields in Fieldworks 7.0.6~beta7 (not useful for WeSay 0.9.28.0+)

Creating Tone Fields by the Method Native to FLEx –The Better Way

(N.B.: this entry started with FW7.05~b5 and WS0.9.28, though I’m finishing it on FW7.06~b7 and WS1.1.11. Some of the screenshots may look different between these versions, but I haven’t noticed any difference in functionality with regard to these fields.)
After creating custom fields in this way for tone and plural forms, I found that tone fields are already accounted for in FLEx, though not particularly transparently. There is a set of pronunciation fields, which can be inserted here:

This option puts the set of pronunciation fields in the record you’re editing, not the whole database. It gives tone, as well as a couple other fields. It looks like this in FLEx:

What’s nice about this is that you can do this a number of times, for the same entry. This gives you the chance to have a number of pronunciations, in different contexts –which is important in phonology, especially with regard to tone. The “Location” field is an empty, customizable field, so I presume we could put things like “Before a High Tone” or “phrase finally” or whatever there, then know that that pronunciation is valid for that context. Filling in some bogus data, we see the following in FLEx:

Under the Hood

The above results in the following in the appropriate entry of the LIFT file:

<pronunciation>
<form lang="gey"><text>ba</text></form>
<field type="cv-pattern"><form lang="en"><text>CV</text></form>
</field>
<field type="tone"><form lang="en"><text>?H</text></form>
</field>
</pronunciation>
<pronunciation>
<form lang="gey"><text>bad</text></form>
<field type="cv-pattern"><form lang="en"><text>CVC</text></form>
</field>
<field type="tone"><form lang="en"><text>?HF</text></form>
</field>
</pronunciation>

So each pronunciation has a form/text set of nodes, and fields with type attributes for each of the visible fields with data in FLEx. Note that these fields are formatted exactly the same as the fields we created earlier here and here, that is

<field type="NameofFieldinFLEx">
<form lang="LanguageCode">
<text>Field Contents</text>
</form>
</field>

The only difference here is that the fields are under a <pronunciation> node, and not directly under the entry itself. But the fact that these fields are grouped together under repeatable pronunciation nodes should mean that we can organize contextually dependent pronunciation (tone or segmental) fields.
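As a rough illustration of that nesting, here is a minimal Python sketch that builds one pronunciation node with its child fields. The helper names (`make_field`, `make_pronunciation`) are mine, invented for illustration, and the sample values are just the bogus data from above; this is not how FLEx itself writes the file.

```python
import xml.etree.ElementTree as ET


def make_field(field_type, lang, text):
    """Build <field type=...><form lang=...><text>...</text></form></field>."""
    field = ET.Element("field", {"type": field_type})
    form = ET.SubElement(field, "form", {"lang": lang})
    ET.SubElement(form, "text").text = text
    return field


def make_pronunciation(form_lang, form_text, fields):
    """Build a <pronunciation> node; `fields` is a list of (type, lang, text)."""
    pron = ET.Element("pronunciation")
    form = ET.SubElement(pron, "form", {"lang": form_lang})
    ET.SubElement(form, "text").text = form_text
    for field_type, lang, text in fields:
        pron.append(make_field(field_type, lang, text))
    return pron


# Bogus sample data, mirroring the second pronunciation shown above.
pron = make_pronunciation(
    "gey", "bad", [("cv-pattern", "en", "CVC"), ("tone", "en", "?HF")]
)
print(ET.tostring(pron, encoding="unicode"))
```

Because `make_field` knows nothing about its parent, the same helper would serve for a field placed directly under an entry, which is the point made above: the field format is identical, and only the parent node differs.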

Sorting on Pronunciation Fields

I tried sorting on individual pronunciation nodes in FLEx, but wasn’t immediately impressed. I tried sorting the above fields for those with CVC in the cv-pattern, and this is what I got:

One can see that the entry is filtered, not the set of pronunciation fields. When working with Toolbox, it was possible to filter on either of a repeated field within an entry. Recalling that this was only when sorting on that field (therefore producing a record for each of the multiple fields), I tried that in FLEx, and it worked:

Note that there is only one pronunciation field listed, and the pronunciation form and tone fields listed are those that correspond to the CV field that was selected in the filter.
This data structure would also allow one to select only particular tone patterns, such as with an XPath expression like pronunciation[field[@type='cv-pattern']/form/text = 'CVC']/field[@type='tone']/form/text to get the information in the tone field under only those pronunciation nodes that also have cv-pattern fields with 'CVC' in them.
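For anyone who would rather script this than write XPath by hand, here is a minimal sketch of the same selection in Python. The function name `tones_for_cv_pattern` is made up for illustration, and the sample entry is just the bogus data from above wrapped in a bare entry node. Python's built-in ElementTree only supports a subset of XPath, so the comparison on the cv-pattern text is done in plain Python instead of in a predicate.

```python
import xml.etree.ElementTree as ET

# Bogus sample entry, using the two pronunciation nodes shown earlier.
ENTRY_XML = """
<entry>
  <pronunciation>
    <form lang="gey"><text>ba</text></form>
    <field type="cv-pattern"><form lang="en"><text>CV</text></form></field>
    <field type="tone"><form lang="en"><text>?H</text></form></field>
  </pronunciation>
  <pronunciation>
    <form lang="gey"><text>bad</text></form>
    <field type="cv-pattern"><form lang="en"><text>CVC</text></form></field>
    <field type="tone"><form lang="en"><text>?HF</text></form></field>
  </pronunciation>
</entry>
"""


def tones_for_cv_pattern(entry, cv_pattern):
    """Return tone values from pronunciation nodes with the given cv-pattern."""
    tones = []
    for pron in entry.findall("pronunciation"):
        patterns = [
            t.text for t in pron.findall("field[@type='cv-pattern']/form/text")
        ]
        if cv_pattern in patterns:
            tones.extend(
                t.text for t in pron.findall("field[@type='tone']/form/text")
            )
    return tones


entry = ET.fromstring(ENTRY_XML)
print(tones_for_cv_pattern(entry, "CVC"))
```

The loop runs over pronunciation nodes rather than entries, so an entry with several pronunciations contributes only the tone values that sit beside a matching cv-pattern field.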
Unfortunately, I haven’t been able to see these fields in WeSay (yet, I hope: see this bug report). Which is sad, because this is otherwise the best way to indicate tone in FLEx.

===Poetic Interlude===
I wrote most of the above several months ago, and had forgotten that I had worked this much out, until I ran into the problem of bulk editing on these fields. A quick Email to <Flex_Errors at sil.org>, and a fairly rapid response later, and I was back in business. When I went to write it up, I found the above in my drafts folder…
===End of Interlude===

So I’ve been doing a lot of data collection in the last couple months using the above paradigm, keeping different tone fields separate by their sibling location fields. I have XSL transforms to add this data to a LIFT file, and some reports to pull it out later, but how to mess with it in the meantime, should I need to? To get bulk editing on these fields to work, I needed two things:

  1. to sort on ‘pronunciation’ or one of its children (this I had apparently already figured out, but forgotten)
  2. to select the right columns for viewing in the bulk edit view.

Selecting the right columns for viewing in the bulk edit view

In case it isn’t obvious, the visible columns in the bulk edit view determine what fields you can act on. If “Lexeme” isn’t visible, you can’t copy to or from it, or modify it with a regular expression. So first, you need to make the fields you’re looking for visible, which is done through a dialog you can access by clicking in the upper right corner, with tooltip “Configure which columns to display”:

When you click on this, you get a menu of a number of (recently selected?) fields. To access other fields, to change column ordering, or to select language options, select “More column choices…” at the bottom:

This gives you access to the following dialog, where you can find fields not on the above list, and select which of a number of writing systems you want to see (and therefore Bulk Edit). The arrows on the right allow you to move the fields up and down (moving columns left and right on the Bulk Edit screen):

One trick that may not be obvious is that the ‘Tone’ field under ‘Pronunciation’ is available here as ‘Tones’. I presume this is because there are potentially a number of different Tone fields (as in my case). This is the same for ‘Location’ > ‘Locations’ and ‘CV Pattern’ > ‘CV Patterns’.

Sorting on Pronunciation Columns

Once all the fields you’re interested in are in the “current columns” (right) side of that dialog, you can select a column to sort on (showing light blue triangle). Selecting ‘Pronunciations’ gives three lines for this entry, and proclaims “Pronunciation” at the top of the page for slower ones like me.

If you’re in a context where you want to sort on two of these fields (if one doesn’t uniquely sort them, as in the screenshot above), you can select one, then shift-select another, which will give a secondary sort (and a smaller triangle) as in the following:

Here the location is the first sort, then the tone. Note that the pronunciation form isn’t sorted (a…z…k…a), though the duplicate HAfter-sg field for titi is (correctly) showing up as another pronunciation/tone field (with pronunciation/form atíti nɛ) –showing that sorting by any of the pronunciation fields gives this layout.

Bulk Editing Pronunciation Fields

Getting back to the point of it all (for me, anyway), with this configuration it is now possible to bulk copy to/from these fields:

Locations didn’t show up for me under “Bulk Replace”; I’m not sure why, though that sounds familiar –perhaps I didn’t configure it right, or maybe that’s a bug.

Summary

Though tone fields created under pronunciation fields are not currently helpful for WeSay collaboration, this seems a much more principled way of treating tone data in FLEx, since it natively allows for varied contexts, CV patterns, and segmental morphophonemics impacting the frame (since each pronunciation field has a form field, which can include the lexeme, frame, and any segmental interactions between them). In addition, these fields are accessible to FLEx filtering and sorting, including bulk edit operations.
Given the complexity of this configuration, I would not recommend what I have described to the computer non-savvy (e.g., users more comfortable in WeSay). But for those comfortable manipulating these configurations, FLEx can be a powerful tool for manipulating tone data.

Marantz CF Media and Formatting

I have a Marantz PMD670, for which I recently got a bigger (8GB) CF card (the 256MB card can only handle 30 mins in mono…). I had trouble getting it to format in a way that the Marantz recognizes. After asking for help getting that card to work, I finally just bought a few different cards, and tested them all. Having done the slow and tedious, I thought I should share my results, to hopefully spare someone else the trouble (or at least some of it). You understand that your mileage may vary, but this is what I found.

Summary: Transcend and Sandisk Ultra worked; Kingston and Pretec didn’t.

Cards tested (I have no idea what most of these part and serial numbers mean but they’re here, in case they mean something to someone):
Lexar 256MB
Kingston 8GB 9904318-025.AOOLF / 5393610-0488698 X001
Pretec 8GB 233x P/N: CFS208G
Sandisk Ultra 8GB 30MB/s
Transcend 8GB 133x P/N: TS8GCF133 S/N:596260-0952
Transcend 4GB 133x P/N: TS4GCF133 S/N:598666-3236

The Easy Ones
The Lexar, Sandisk, and Transcend cards all worked out of the box. You plug them in, turn on the Marantz, get “BlankCard”, and can start recording immediately.

Pretec and Kingston
For these two cards, things are more difficult:

  1. Turn the Marantz on, and it displays “You need FormatOnPC”
  2. Clicking stop/cancel gets past this error to “BlankCard” (Before I figured this out, I thought these cards were completely useless).
  3. If you try to record at this point, it says “Full Card” (ironically, just under “BlankCard”)
  4. It is possible to format at this point, following the manual instructions (“executing–>Done”)
  5. Once formatted, record and playback work normally, until you shut the machine off (either to conserve power, or to go into I/O mode).

But once you’ve recorded as above,

  • Kingston: you cannot see the recorded files on PC, either through CF card reader, or through the Marantz in I/O mode. The card mounts, but appears blank.
  • Pretec: you can see the recorded files on PC or in I/O mode, and manipulate/play them on a PC.
  • both: when the Marantz boots again, it gives “You need FormatOnPC”, and the whole process begins again (i.e., you can’t record/play until after formatting)

So, if you don’t turn off your Marantz much, and if you don’t mind formatting your card every time you do, these two cards might do something for you. The Kingston will allow playback of what has been recorded, though only on the Marantz itself. The Pretec, on the other hand, will also allow you to get the data off the card afterwards, which I assume most of us would want.

Personally, I’d go for one of the cards that really works. So much for standards. Unfortunately, the first larger card that I got was the one that performed the worst, and the later cards all performed at least a bit better. I hope the lesson isn’t that we need to just buy several cards whenever we need at least one to work, but maybe it is.

Creating a Custom Field II: in Fieldworks 7.0.5~beta5 for WeSay 0.9.28.0

Today I’m going to walk through creating a custom field in Fieldworks, and see how it looks in LIFT and in WeSay.

Fieldworks’ ‘Custom Fields’ Dialog

Creating custom fields in Fieldworks is easy, if you know where to look. I created a Tone field via Tools/Configure/Custom fields:

Clicking there produces the Custom Fields dialog box, where one can set up the new field:

Here I have already added Tone and Plural fields. As far as I can tell, there are pros and cons to this method:

  1. Fields are added to every record in the database (though I don’t think they take up space, at least in LIFT, until there is data in the field).
  2. Only one of these can appear in a record. I didn’t even notice this until I tried another kind of field (to come), but this may or may not be important to what you’re doing. If you want a couple tone fields for different environments (syntactic, tonal, or whatever), you would need to make them each here, or use another method (description to come).

This is what they look like in FLEx before they have been filled in (Note that I selected different options for the language of these fields):

These fields from the entry in the above screenshot didn’t show up in the LIFT file, since they were empty, but another took the following form (between lexical-unit and senses):

<field type="Plural">
<form lang="gey">
<text>baadisi</text>
</form>
</field>

And here it is in WeSay:

I saw it immediately on opening WeSay this time, since I had the field already configured earlier, like this:

Note that “Name in file” and “Name for display” are both “Plural”. This makes it a bit easier on the config, since you don’t have to keep track of one name for the WeSay user to see and a different one in the LIFT file (which is what you see in FLEx).
In the WeSayConfig file, you see this:

<field>
<className>LexEntry</className>
<dataType>MultiText</dataType>
<displayName>Plural</displayName>
<enabled>True</enabled>
<fieldName>Plural</fieldName>
<multiParagraph>False</multiParagraph>
<spellCheckingEnabled>False</spellCheckingEnabled>
<multiplicity>ZeroOr1</multiplicity>
<optionsListFile></optionsListFile>
<visibility>Visible</visibility>
<writingSystems>
<id>gey</id>
</writingSystems>
</field>

Note the fieldName and displayName values, each ‘Plural’.
When adding (and therefore naming) a new field in FLEx, that name would show in the same place as Plural in <field type="Plural"> (the ‘type’ attribute of the field node) for that field in the LIFT file. That would be what you would need to put in the “Name in file” field of the Configuration Tool/Fields dialog above (or in the fieldName field of the WeSayConfig file), in order to see it in WeSay.
A couple caveats for creating custom fields for collaboration between FLEx and WeSay in this manner:

  1. You can’t use spaces. One of the first custom fields I made in FLEx was “Noun Class of Plural.” When I tried to create the corresponding field in WeSay, I got something like this:

    I recall FLEx being perfectly happy writing the field ‘type’ attribute with spaces into the LIFT file, but there was no way to get such a WeSay field, either through the config tool, or through editing the config file by hand. Not that I could find, anyway; perhaps a developer can contradict me here if there is.
  2. A related point is that when creating the field in FLEx first, one is obligated to then create the field in WeSay, or you won’t see it there (the data should still be preserved, but that’s not the kind of collaboration I’m looking for).

But when creating a custom field in WeSay first (as I described here), FLEx creates the field that you created in WeSay automatically. There was a limitation on the options (relative to creating a custom field in FLEx), but going in that direction removes one configuration step for each custom field. So that would depend on the kind of flexibility you need (I haven’t needed those options, yet).
Probably the first issue where I would want those grayed out options would be for fields with option lists. Even in FLEx, the instructions say to set up the options (or at least the list) first, then the field that references them. When trying to collaborate with such a field in WeSay, that would all need to be done first. But I haven’t figured out yet how to get such a field into WeSay, or if the option list fields from WeSay (e.g., POS and SemDom) can go into FLEx, or if they are incompatible data types. If someone figures that one out, please let us all know; if I get time to work on it, I’ll post here.

Notes for creating fields in FLEx’s ‘Custom Fields’ dialog to be used in WeSay

  1. Don’t use spaces in the field name.
  2. Plan on also creating the custom field name in WeSay, with the FLEx field name in the WeSay Configuration Tool’s “Name in file” field.
  3. Don’t use this method for fields that might need to appear more than once per sense/entry, or else make one for each possible iteration you need.
  4. Use this method if you need broader configuration of FLEx custom fields.
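Since the first note is easy to forget, here is a minimal Python sketch that lists any field types in a LIFT file that contain spaces, before you try to set them up in WeSay. The function name `field_types_with_spaces` and the sample data are mine, invented for illustration; it assumes FLEx wrote the type attribute with spaces intact, as I recall it doing.

```python
import xml.etree.ElementTree as ET

# Bogus sample: one safe field name, and one with spaces (as in caveat 1).
LIFT_XML = """
<lift>
  <entry>
    <field type="Plural"><form lang="gey"><text>baadisi</text></form></field>
    <field type="Noun Class of Plural"><form lang="en"><text>2</text></form></field>
  </entry>
</lift>
"""


def field_types_with_spaces(lift_root):
    """Return the sorted field 'type' values that contain any whitespace."""
    types = {field.get("type", "") for field in lift_root.iter("field")}
    return sorted(t for t in types if any(ch.isspace() for ch in t))


lift_root = ET.fromstring(LIFT_XML)
print(field_types_with_spaces(lift_root))
```

For a real file, `ET.parse(path).getroot()` would replace the `ET.fromstring` call. Note that `iter("field")` also picks up fields nested under senses or pronunciation nodes, not just those directly under the entry.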

Creating Custom Fields in WeSay 0.9.28.0 for Fieldworks 7.0.5~beta5

I’ve been working with custom fields in FLEx and WeSay enough to feel the need to figure out what is really going on. The goal is to be able to straightforwardly create custom fields in one or the other that are editable and round-trip-able in the other. To do this, I’m going to look into the interface of each program, and see what impact adding fields has on the LIFT (and config, for WeSay) file. Today I’m making a field in WeSay, and seeing what it looks like there, and then in FLEx.

The WeSay Configuration Tool

The WeSay config tool looks like this (once you click on ‘Fields’ then ‘New Field’):

Once you save and exit, you get a section under the <fields> node in the WeSayConfig file that looks like this:

<field>
<className>LexEntry</className>
<dataType>MultiText</dataType>
<displayName>*newField</displayName>
<enabled>True</enabled>
<fieldName>newField</fieldName>
<multiParagraph>False</multiParagraph>
<spellCheckingEnabled>False</spellCheckingEnabled>
<multiplicity>ZeroOr1</multiplicity>
<optionsListFile></optionsListFile>
<visibility>Visible</visibility>
<writingSystems>
<id>en</id>
<id>fr</id>
<id>hav</id>
</writingSystems>
</field>

Adding Data in WeSay

Returning to WeSay, one can add some bogus info to this field in one of the records:

Closing out WeSay and looking at the LIFT file, we see the following under this entry (between <lexical-unit> and the first <sense>):

<field type="newField">
<form lang="fr">
<text>BogusNewfield</text>
</form>
</field>

What this Means

Putting this all together, we see that

  1. The ‘Name in file’ from the WeSay Config Tool corresponds to the field/fieldName node in the WeSayConfig file.
  2. Both of the above correspond to the LIFT entry/field ‘type’ attribute (once data is entered):
    ‘Name in file’ = (xyz.WeSayConfig)/configuration/components/viewTemplate/fields/field/fieldName = (xyz.lift)/lift/entry/field/@type
  3. ‘Name for display’ from the WeSay Config Tool is the label the WeSay user sees on the field, which corresponds to the contents of the field/displayName node, i.e., (.WeSayConfig)/configuration/components/viewTemplate/fields/field/displayName
  4. Therefore, the name a WeSay user sees for a field will not necessarily relate to anything in FLEx. This is because the WeSay label is related to the proper LIFT field in the WeSayConfig file (which FLEx doesn’t see), and not in the LIFT file, which is what FLEx imports. So in setting up custom fields, we need to pay attention to what the config tool says for the ‘Name in file’, not the ‘Name for display’ (Note that it is ‘*newField,’ and not ‘newField,’ in the WeSay user interface. The asterisk, which is visible in WeSay, is only present in displayName in the WeSayConfig, not in either of fieldName from the WeSayConfig or field/@type from the LIFT file.)
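To make that correspondence concrete, here is a minimal Python sketch that reads the fieldName/displayName pairs from a WeSayConfig file and reports which entry-level field types in a LIFT file have no matching fieldName. The function names and the cut-down sample files are mine, invented for illustration (the ‘Tone’ field in the sample is hypothetical). A real config file has more in it, including built-in fields that do not correspond to a field/@type, so this only makes sense for custom fields.

```python
import xml.etree.ElementTree as ET

# Cut-down, bogus samples following the paths given above.
CONFIG_XML = """
<configuration>
  <components>
    <viewTemplate>
      <fields>
        <field>
          <displayName>*newField</displayName>
          <fieldName>newField</fieldName>
        </field>
      </fields>
    </viewTemplate>
  </components>
</configuration>
"""

LIFT_XML = """
<lift>
  <entry>
    <field type="newField"><form lang="fr"><text>BogusNewfield</text></form></field>
    <field type="Tone"><form lang="en"><text>H</text></form></field>
  </entry>
</lift>
"""


def configured_names(config_root):
    """Map each fieldName ('Name in file') to its displayName."""
    names = {}
    for field in config_root.findall("components/viewTemplate/fields/field"):
        names[field.findtext("fieldName")] = field.findtext("displayName")
    return names


def unconfigured_types(lift_root, names):
    """Return entry-level field types in LIFT with no matching fieldName."""
    types = {field.get("type") for field in lift_root.findall("entry/field")}
    return sorted(types - set(names))


names = configured_names(ET.fromstring(CONFIG_XML))
missing = unconfigured_types(ET.fromstring(LIFT_XML), names)
print(names)
print(missing)
```

Anything that shows up in `missing` is a field whose data is in the LIFT file but which a WeSay user would not see until a field with that ‘Name in file’ is configured.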

Importing to FLEx

I was happy to see that the field created in WeSay shows up under FLEx custom fields (after importing the WeSay LIFT file):

Note that Location, Type, and Writing System(s) are all grayed out. There may be some way of modifying these settings in FLEx once they have been set in WeSay, but it isn’t obvious at first glance. Here is the field in the lexicon editor:

I had to select ‘Show Hidden Fields’ to be able to see it the first time for some reason. But then I deselected it, and the field remained visible.
Note that the label in FLEx is ‘newField,’ without the asterisk, which comes from the type attribute of the field in the LIFT file. As far as I can see, there is no distinction between file and display names in FLEx. This is appropriate for at least the following two reasons:

  1. FLEx seems to deal fine with spaces in field names (I’ve had problems with this in WeSay).
  2. FLEx users should be able to handle whatever complexity the field names throw at them. WeSay, on the other hand, needs to control carefully what the user sees, and its relationship to the LIFT field in question. For instance, the form in lexical-unit in a LIFT file is displayed as “Word” by default in WeSay, since people are putting words into it. But when I analyze those words into roots, it is nice to be able to change that field’s display name to “Root” in WeSay, without having to change the underlying LIFT structure. This flexibility of the display name can help keep the WeSay user from getting confused without unnecessarily complicating the database.

Notes for Creating fields in WeSay to be imported to FLEx

  1. Pay attention to ‘Name in file’ in the WeSay Config Tool, since that will be what the field will be called in the LIFT file, and in FLEx (and presumably in other programs that would use LIFT).
  2. You may need to click on ‘Show Hidden Fields’ to see the field in FLEx.
  3. There doesn’t seem to be a way to put fields anywhere other than in the ‘Custom Fields’ section of FLEx, so I hope that’s where you want it (if not, stay tuned for the next installment, going the other way).

WeSay instructions

When I started working with WeSay in BALSA, it became clear that the people I’m working with were still going to need either a lot of hand-holding, or some instructions. So I wrote these down, and have massaged them a little (not everything was as clear as it could have been), and hope I have something highly (but not completely) fool-proof. But your mileage may vary. I’m submitting them in case someone finds them useful.

Instructions.pdf
Instructions_fr-FR.pdf

Proposal revisited

I’ve written before about using WeSay to collect language data, and at one point I even wrote up a proposal for features I think it would be nice to have there –specifically targeted at orthography development, but the same tools could be used for spell checking (lining up similar profile words, that is, not coming up with a list of correctly spelled words). Anyway, it doesn’t look like that proposal is going anywhere, so I thought I’d give it another try.

The Basic Problem

  1. I’m working in an increasingly large number of languages (five people representing three languages were in my office most of this morning), and I’m looking to see that trend continue (looking to the 60 unwritten languages in eastern DRC).
  2. I am one person, and can only work on one language at a time (however often the language may change throughout the day).
  3. The best tool we have for sorting and manipulating large amounts of lexical data (FLEx) is great at what it does, but is inaccessible to most of the people I work with (it was made for linguists, after all. :-))
  4. So I am left with doing all the work with each guy in FLEx (see #2), or else finding a tool that can be more easily used by the people in #1.

Getting from Data collection to an Orthography

For a number of tasks (word collection), WeSay does exactly what I want. Since most of these people are touching a computer for the first time, let’s get rid of as much of the complexity and room for error as possible. Do one thing at a time, in a constrained environment. But once I’ve collected a wordlist in WeSay, I still need to

  1. Parse the roots out of the forms collected. (I can collect plural forms in WeSay now, but we still need to get the full word forms into a field that should be full word forms (e.g., <citation>), and just roots in the lexical-unit field.) In case it isn’t obvious, a lot of the phonological analysis depends on the structure of the root: the first root consonant is more important to our analysis than the first word consonant, so we need to be able to sort/filter on it.
  2. Sort and filter the word forms, so we see just one kind of thing at a time (we don’t want to see if kupaka and kukapa have the same ‘p’, since the difference in word position might cause a difference that is irrelevant to the phonemic system; English speakers pronounce p’s differently in different environments, but use one letter for them all, did you know?).
  3. Go through each controlled list of words, to see where the current writing system (we use a national language to get us started) is not making enough distinctions, and where it is making too many.
  4. Mark changes on the appropriate words, returning the corrected information to the database.

While FLEx can handle all these tasks fine (however slowly at times), if I’m going to help other people move forward in their own language development, I need to find another tool (or tool set). I have got these guys quite happily working in WeSay (yes, there are kinks, but still we all are often working 3-4 times as fast as if I was typing everything myself, since there are more of us typing), but I need to get ready for these next steps, in a way that allows us to build on this momentum, rather than tell everyone but one team to go home until I have more time.

An Attempt

So I’ve been playing with XForms lately (for a lot of other reasons), and I’ve toyed with the idea of getting at and manipulating the LIFT file by another engine. The idea of being able to write a simple form to control the complexity of the underlying XML, and to manipulate it and save back to XML, was very exciting. I even have one form (for collecting noun class permutation examples) deployed. Then I read the post describing the death of the Firefox XForms extension, and I thought surely, there must be a better way to do this, and I’m sure someone out there knows what it is. So I’ll spend some time outlining exactly what I’d like to see, and maybe someone will know what to do to make it happen.

A Proposal

I took some screenshots of some xforms I did, displayed in firefox. Here’s the first form, for parsing roots (Havu is the name of the language I’m using to test this, and one of the next who will be looking for this to work):

The important aspects are

  1. The prefix of the original citation form (sg here), which filters through the database, allowing us to work on words that are (probably) just of one noun class at a time.
  2. A number of possible plural prefixes, to control the potential output forms
  3. The originally input word form, with gloss (where present), to clearly identify each word.
  4. A number of buttons which allow a non-linguist to simply push a button to input the plural form and parse the root (probably including a “none of the above,” which would skip processing for this word).
  5. Under the hood processing which would, on a click:
      1. copy the original form into a citation form field in the database (potentially a new node in the lift XML),
      2. remove the prefix from the original form, and
      3. input the plural form into a field for the plural form (again, potentially a new node in the lift XML).

It would probably be better to show the user one word at a time, rather than the list in this screen-shot, but I included it all here, to show how the removal of the prefix, and the application of the new prefix, would need to apply to the form of each entry. Also, it would be nice if the form could be adapted for suffixing languages.
What I had so far for an example trigger (in XForms) would be

<xf:trigger><xf:label><xf:output value="concat('mo', substring(lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')], 4))" /></xf:label>
<xf:action ev:event="DOMActivate">
<xf:insert context="." origin="instance('init')/citation" />
<xf:setvalue ref="./citation/form/text" value="lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')]"/>
<xf:setvalue ref="lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')]" value="substring(lexical-unit/form[@lang='hav']/text[starts-with(., 'aka')], string-length('aka') +1)"/>
<xf:send submission="Save"/>
</xf:action>
</xf:trigger>

As you can see, the value of the prefix is hard-coded here, since I haven’t been able to get variables to work. Also, the setvalue expressions don’t really behave (neither node creation, nor setting the right value for an existing node). It’s hard to tell what is a limitation of XForms, and what is a limitation of the Firefox extension; I tried another XForms renderer, but no luck so far… Needless to say, this is not what I do best, so help, anyone?
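To pin down what the button is meant to do, independent of XForms, here is a minimal Python sketch of the three under-the-hood steps (copy to citation, strip the prefix, store the plural). The function name `parse_root`, the ‘Plural’ field type as the destination, and the sample word are all my assumptions for illustration; “akabogus” is bogus data, not a real Havu word.

```python
import xml.etree.ElementTree as ET


def parse_root(entry, lang, sg_prefix, pl_prefix):
    """Copy the full form to <citation>, strip the prefix, store the plural.

    Returns False, leaving the entry untouched, if the form lacks the prefix.
    """
    text = entry.find(f"lexical-unit/form[@lang='{lang}']/text")
    if text is None or not (text.text or "").startswith(sg_prefix):
        return False
    full_form = text.text
    root = full_form[len(sg_prefix):]

    # Step 1: copy the original form into a new citation node.
    citation = ET.SubElement(entry, "citation")
    cit_form = ET.SubElement(citation, "form", {"lang": lang})
    ET.SubElement(cit_form, "text").text = full_form

    # Step 2: leave only the root in lexical-unit.
    text.text = root

    # Step 3: store the plural form in a custom field (assumed type 'Plural').
    field = ET.SubElement(entry, "field", {"type": "Plural"})
    pl_form = ET.SubElement(field, "form", {"lang": lang})
    ET.SubElement(pl_form, "text").text = pl_prefix + root
    return True


# Bogus sample entry; 'aka' and 'mo' are the prefixes hard-coded above.
entry = ET.fromstring(
    '<entry><lexical-unit><form lang="hav"><text>akabogus</text></form>'
    "</lexical-unit></entry>"
)
changed = parse_root(entry, "hav", "aka", "mo")
print(changed, ET.tostring(entry, encoding="unicode"))
```

Here the prefixes are parameters rather than hard-coded, so one function could back every button on the form, and a “none of the above” button would simply not call it. The sketch appends the new nodes at the end of the entry and ignores whatever element ordering LIFT may expect.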
The next form I’d like sets ATR values for whole words (usually the harmony around here is fairly strict, so it would help with many, though not all, vowel questions):

This form is similar to the above, in that I’m looking for a simple regular expression (or a more complicated one, if possible. :-)) to control which data we’re looking at at one time, and a binary choice for which vowel group the word belongs to (showing the new word form on the button). A choice of one or another would set the word form accordingly.
The same caveats as above apply to the list of words on a page, and there should probably be a “neither” button for when a word doesn’t obey strict vowel harmony, which would bypass processing for that word.
The trigger I had so far (for the -ATR button, reverse for the +ATR) was

<xf:trigger><xf:label><xf:output value="translate(lexical-unit/form[@lang='hav']/text, 'aeiou', 'aɛɨɔʉ')"/>(-ATR)</xf:label>
<xf:action ev:event="DOMActivate">
<xf:setvalue context="." value="translate(lexical-unit/form[@lang='hav']/text, 'aeiou', 'aɛɨɔʉ')"></xf:setvalue>
<xf:send submission="Save"/>
</xf:action>
</xf:trigger>

The translation includes a>a, but could be modified depending on what /a/ does in a given language (especially if there are 10 vowels).
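The same character mapping can be sketched in Python, where str.translate does what the XPath translate() function does. This is only an illustration: the function name set_atr and the sample word are made up, and the vowel pairs are the ones from the trigger above.

```python
# Character maps matching the translate() call in the trigger,
# plus its reverse for the +ATR button.
MINUS_ATR = str.maketrans("aeiou", "aɛɨɔʉ")
PLUS_ATR = str.maketrans("aɛɨɔʉ", "aeiou")

def set_atr(word, plus):
    """Rewrite every vowel of word as +ATR (plus=True) or -ATR (plus=False)."""
    return word.translate(PLUS_ATR if plus else MINUS_ATR)

minus_form = set_atr("kubeli", plus=False)   # made-up word
plus_form = set_atr(minus_form, plus=True)
```

For a language where /a/ alternates, only the two strings passed to str.maketrans would need to change.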
The third and final form, where we would spend the bulk of our time, might look like this:

Here we have:

  1. The ubiquitous regex to control the data we’re looking at.
  2. The letter that is being evaluated (we’re only analyzing one written form, be it digraph or not, at a time).
  3. The position to find that letter (something like first or second in the root, would probably be good enough).
  4. The options for replacing it (buttons labeled with the new forms), in case a word in the list uses a sound that doesn’t sound the same as the sound in that position in the rest of the words in the list.

Again, a choice of new word form would write the new form to the database (at which point the word would likely disappear from the list, until the regexp matched the new form of that word). It would be nice if the same replacement could be made in the citation and plural forms, presuming the same sounds and written letters would apply. I’m not sure if it would be necessary to devise two forms, one for consonants, and another for vowels. It would depend (at least) on how the regex worked, since the underlying principle is the same for consonants and vowels — look at one thing in one position at a time, and mark each one as the same or different, compared to the other things on the list.
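As a rough sketch of the replacement step, assuming the filter regex marks the position being evaluated with a named group, the new form can be spliced in at that group’s span. The function name, the group names c1 and c2, and the sample word are all hypothetical.

```python
import re

def replace_at(word, pattern, group, new_graph):
    """Replace the text matched by one group of pattern; None if no match."""
    m = re.match(pattern, word)
    if m is None or m.start(group) == -1:
        return None
    return word[:m.start(group)] + new_graph + word[m.end(group):]

# Made-up filter: prefix ku-, then a root with 'p' in both consonant slots.
PATTERN = r"^ku(?P<c1>p)a(?P<c2>p)a$"

new_form = replace_at("kupapa", PATTERN, "c1", "b")
still_listed = re.match(PATTERN, new_form) is not None
```

Because the rewritten word no longer matches the filter, it drops off the list, as described above.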

Specifications

Some things we would need to make this tool useful:

  1. Read and modify LIFT format in a predictable and non-destructive way (making all and only the changes we’re looking for, to the fields we’re looking at, leaving everything else alone). This is the format we’re keeping all our lexical data in, and we need to play nice with a number of other programs that use the same data structure standard.
  2. Use of regular expressions. For the root parsing tool, a simple prefix (or suffix) filter would do, but for the others it would be nice to be able to constrain syllable type, as well as position in the word (i.e., kupapata should not appear on the same list as kupapa, even if we’re looking at first ‘p’ in the root, since the second is a longer root). In FLEx, I use expressions like these (more included below), though a simpler format could do, if that kind of power were not possible. Ideally, these expressions would be put into a config file, and the user would only see the label (I have these all done, and can come up with more if needed).
  3. Cross-platform. We run most of our work on BALSA, so it would need to be able to run on at least Linux, which provides BALSA’s OS.
  4. Simple UI. It is probably not possible to overstate the computer illiteracy we are dealing with here. People are eager to learn, and often capable of learning, but the less training we need, and the less room a given task has to screw everything else up, the better (WeSay‘s UI is a great model).
  5. Shareable. Even if not open sourced, any tool we might use here needs to be legally put on every computer we and our colleagues use, and they don’t often have the money or access to internet to buy licenses.
  6. Supportable. As hard as we try to keep the possibility of errors out of our workplace, they happen. Today we had two technological problems, each of which required a not insignificant amount of my time. If I weren’t here (or if the problems had been beyond me), the teams would have been stuck. The simpler and more accessible (or absolutely error free!) the technology is under the hood, the more likely someone local will be able to deal with problems that arise in a timely manner.

Anyway, there it is.

Examples of Useful Regular Expressions for Filtering Lexical Data

The following are output by a script, which takes as input the kinds of graphs (e.g., d, t, ng’, and ngy) used in a given language’s writing system. For instance, these expressions do not allow ‘rh’ as a single consonant, but those I did for another language do. Similarly, these are based on ten particular vowel letters, which could also be changed for a given language.

^([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})$

(all CVCV –short vowels only)

^([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

(all CVCV –long vowels and diphthongs OK)

^([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})\1\2([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

(all CVCV with C1=C2 –counting prenasalization)

^([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ])([́̀̌̂]{0,1})([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])\3([́̀̌̂]{0,1})$

(all CVCV with V1=V2 –no long V’s or diphthongs)

^([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])(a)([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])\3([́̀̌̂]{0,1})$

(all CVCV with V1=V2=a)
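To illustrate how expressions like these can be generated from a list of graphs, here is a simplified Python sketch. It is not the script mentioned above: the function names and the short consonant list are made up, prenasalization and long vowels are left out, and the words are assumed to be in decomposed (NFD) form so that tone marks are separate combining characters.

```python
import re

TONES = "\u0301\u0300\u030c\u0302"  # combining acute, grave, caron, circumflex

def alternation(graphs):
    """Non-capturing alternation, longest graph first so ng' beats n."""
    ordered = sorted(graphs, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(g) for g in ordered) + ")"

def cvcv_filters(consonants, vowels):
    """Return a dict of label -> compiled pattern for short-vowel CVCV words."""
    c = "(" + alternation(consonants) + ")"
    v = "(" + alternation(vowels) + ")"
    t = "[" + TONES + "]?"  # optional tone mark, not captured
    return {
        "CVCV": re.compile("^" + c + v + t + c + v + t + "$"),
        # \1 repeats the first consonant, \2 repeats the first vowel.
        "CVCV, C1=C2": re.compile("^" + c + v + t + r"\1" + v + t + "$"),
        "CVCV, V1=V2": re.compile("^" + c + v + t + c + r"\2" + t + "$"),
    }

def filter_words(words, pattern):
    """Keep only the words the pattern matches in full."""
    return [w for w in words if pattern.match(w)]

filters = cvcv_filters(["p", "t", "k", "n", "ng'", "ngy"], list("aiɨuʉeɛoɔʌ"))
words = ["kupa", "papa", "pata", "pati", "kupapa", "ng'ang'a"]  # made-up words
same_c = filter_words(words, filters["CVCV, C1=C2"])
same_v = filter_words(words, filters["CVCV, V1=V2"])
```

The dictionary keys play the role of the labels a user would see, with the expressions themselves kept out of sight.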

New WeSay Tasks II –Parsing and Tone

Having found out how to configure and add new tasks in WeSay, I can now make some serious changes to my workflow (read: I can actually have people do some of their own data entry and parsing!)

The following task is a follow-up to the one presented here, where the root and meaning were visible but not editable, helping a non-linguist focus on adding plural forms to his dictionary –based on those visible but not editable fields. In this next task, the plural form forms a basis for parsing the root, so the plural and definition fields are visible but the only editable fields are root (EntryLexicalForm, default English label ‘Word’) and citation form. This allows the non-linguist to copy the form from the root field to the citation form field, and remove affixes from the root.
This task is less than ideal in that it doesn’t know whether a word has been parsed yet. It checks for the presence of a citation field (which is usually only made in this kind of process), so if you have a mono-morphemic word which is pronounceable as is (i.e., standard lexicographic practice would be to not have a citation field), it would think that that word was still “to do.”

Here is my new root parser task:

<task
taskName="AddMissingInfo"
visible="true">
<label>Roots</label>
<longLabel>Parse Roots</longLabel>
<description>Move citation info to Citation form field, leave roots in root field.</description>
<field>citation</field>
<showFields>EntryLexicalForm, citation</showFields>
<readOnly>Plural, definition</readOnly>
<writingSystemsToMatch />
<writingSystemsWhichAreRequired />
</task>

While I was at it, I made another task to help populate the “Tone” field. Too bad I can’t get WeSay and FLEx to agree on a writing system to restrict the characters here, so we’re likely to get a lot of malformed data. But it should be a place to start. Anyway, here it is:

<task
taskName="AddMissingInfo"
visible="true">
<label>Tone</label>
<longLabel>Add Tone Info</longLabel>
<description>Input tone information for each record.</description>
<field>Tone</field>
<showFields>Tone</showFields>
<readOnly>Plural, citation, definition</readOnly>
<writingSystemsToMatch />
<writingSystemsWhichAreRequired />
</task>

This gives me a new desktop:

And a new parsing task:

And a new Tone input task:

The nice thing about tasks in WeSay is that you can enable just the ones relevant to the task at hand, to simplify the UI. This means that I can put all of these in my new default project, and leave them disabled (with <task taskName="AddMissingInfo" visible="false">, or by unticking them in the config) until we need them.
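Since these task nodes differ only in a handful of values, they could also be generated rather than typed by hand. The following Python sketch is one way to do that. The helper name is made up, and it uses only the element and attribute names that appear in the task nodes in these posts; anything else in the WeSay config file is outside its scope.

```python
import xml.etree.ElementTree as ET

def missing_info_task(label, long_label, description, field,
                      show_fields, read_only, visible=True):
    """Build an AddMissingInfo <task> node like the ones shown in these posts."""
    task = ET.Element("task", {
        "taskName": "AddMissingInfo",
        "visible": "true" if visible else "false",
    })
    ET.SubElement(task, "label").text = label
    ET.SubElement(task, "longLabel").text = long_label
    ET.SubElement(task, "description").text = description
    ET.SubElement(task, "field").text = field
    ET.SubElement(task, "showFields").text = ", ".join(show_fields)
    ET.SubElement(task, "readOnly").text = ", ".join(read_only)
    # Left empty, as in every task node in the config.
    ET.SubElement(task, "writingSystemsToMatch")
    ET.SubElement(task, "writingSystemsWhichAreRequired")
    return task

# The Tone task from above, generated in its disabled state.
tone_task = missing_info_task(
    "Tone", "Add Tone Info", "Input tone information for each record.",
    "Tone", ["Tone"], ["Plural", "citation", "definition"], visible=False)
tone_xml = ET.tostring(tone_task, encoding="unicode")
```

The resulting string would still need to be pasted into (a copy of) the config file by hand.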
I’ve already shared my dream, but in the meantime, this gets us a lot closer to doing low-level language development in WeSay.

New WeSay Tasks

I found out today that the tasks in WeSay are more configurable than I had thought. I add a “Plural” field to any language project I work in, but until today we have been stuck inputting that either in FLEx (i.e., by me) or in “Dictionary Browse & Edit,” which is less than ideal.
Looking at the config file (in a text editor — make a copy first, etc.!), there are a number of <task/> nodes, several of which have the taskName attribute of “AddMissingInfo.” This appears to be a generic task (unlike some of the others), that comes with a number of options, and can be used multiple times.
In addition to the labeling, there is a field it checks for (if that field is populated, the record doesn’t show in the task), fields it shows as editable, and fields that are shown, but not editable. I don’t know what the last two nodes (writingSystemsToMatch and writingSystemsWhichAreRequired) do, since they are always empty (in all the tasks in my config, anyway). I found out (by trying) that if you tell it to look for a field that isn’t there, it will crash (so make your new fields first, then the tasks to add them). I think if you tell it to display a field that isn’t there, it simply won’t. Here is my task node for the new “Add Plurals” task from my config file:

<task
taskName="AddMissingInfo"
visible="true">
<label>Plurals</label>
<longLabel>Add Plurals</longLabel>
<description>Add plural forms to entries where they are missing.</description>
<field>Plural</field>
<showFields>Plural</showFields>
<readOnly>definition, EntryLexicalForm, citation</readOnly>
<writingSystemsToMatch />
<writingSystemsWhichAreRequired />
</task>
Here is the new desktop:

And here is the page for the new task: