Category Archives: Language Development

Copying between fields in two writing systems in FLEx

Last month I taught the FLEx II course at CoLang, held at UTA. It was very interesting trying to teach about 45 students coming from a great variety of backgrounds, but I think we all learned something. One thing I taught seemed to deserve a further write-up, so I’ll do that here.
The problem is not immediately obvious unless you spend lots of time thinking about how FLEx does what it does, in particular how writing systems work. When you want to copy data from a field encoded in one writing system into a field encoded in another, you can’t just bulk copy and get the results you might expect. The reason is that the data itself is tagged for which writing system it is in, not just the field. So you can, theoretically, have Spanish data in a field that is supposed to have English. But this is actually a strength, as it allows you to tag one word or sentence in its correct language even when it is surrounded by another language, all within the same field (as when you write a note in English but include words in another language). All data is tracked by writing system, and you don’t lose that information when you copy from one field to another.
So, the task we were working on, which you may need to do some day, was copying data from one writing system to use as a base for another. For instance, if you have data in a practical working orthography and you want to also have an IPA field, you may notice (as I hope is true) that much of your orthography transfers directly over to the IPA. And for those things that don’t, there should be regular changes (like substituting [ɸ] for ‘ph’). This task is just begging to be done through bulk editing. Why type all that over again, just to change a few phonemes here and there? Why not make most of the systematic changes systematically? But we can’t just copy a practical orthography field into an IPA field, since the data would still be encoded as the practical orthography, even if the field should contain IPA data. So here’s what you should do.
First off, I assume you have your two writing systems set up; I’m using Mbo and the IPA variant in these screenshots:

Writing systems

Go to the Lexicon pane:

To Lexicon

Then Select Bulk Edit Entries:

To Bulk Edit entries

To be able to operate on both fields, you need to make them both visible. Click on the “Configure which columns to display” button:

To Show Columns

Then click on More Column Choices (unless your IPA field is in the list, in which case you just select it):

To More column choices

In the dialog that comes up, you select the Lexeme form field (or whatever field you’re copying to). Yes, it is already on the right; we want to display the Lexeme form field twice, once in each of the two writing systems:

Select Lexeme Form

Click Add:

Click Add

Initially you will probably have the same writing system for each of the two fields:

Both fields in same WS

Change one to the IPA variant:

Pick IPA WS

Then I like to move that second field up, so it displays next to the other one. While the field is selected, click on the up arrow:

Move WS field up

Then keep clicking until it is in place:

Move WS field up 2

Now that you have everything situated, click on OK:

Click OK on WS

That should take you back to the bulk edit pane, where you should see your IPA field. Assuming you’re just starting to work in this field, it should be empty:

IPA Empty

Then you Bulk edit, like normal, by selecting the Source Field (the one with data in it):

From LF

and the Target Field:

To LF-IPA

As always, you want to preview your bulk edit, to make sure it’s doing what you expect:

Preview Bulk Copy

And you should see blue arrows going from nothing to data, which matches the field next to it:

Preview Bulk Copy results

If you don’t like what you see, just click Clear, and fix whatever was wrong. But if you like it, click Apply:

Apply Bulk Copy

And then you’ll have both fields filled with the same data (and no more blue arrows):

Apply Bulk Copy results

But if you select data in the IPA field, the indicator above will show that the data is NOT in IPA, but still in the other writing system:

Wrong writing system

So this is the problem we need to fix. To do this, we’re going to Bulk Replace:

Select Bulk Replace

Select the target field (just the one we want to change writing systems on, the IPA one in this case):

Select WS_IPA

Then click on Setup…:

Setup Bulk Replace

This will give you the Bulk Replace Setup dialog:

Bulk Replace Setup dialog

There you can click in the “Find what:” box, then (clicking “More” if you have to) select Format/Writing System/xyz –whatever writing system you copied your data from:

Select From WS

My experience is that at this point, FLEx will figure out what you’re trying to do, and set the other field for you. You can verify this by seeing “Format: <Writing system name>” under each field:

Bulk Replace Setup w Langs

You don’t need to add anything to the empty fields in this box; you want to find everything. So you can just click OK. Unfortunately, I don’t see anything when I click “Preview” here, so we just trust that we’ve set it up correctly (you backed up your data before starting this, right? If not, stop and do it now), and hit Apply.
Back in the bulk replace pane, we can verify that the data in the IPA field is now indicated as IPA:

Mbo_IPA

While the orthographic field is still in the orthographic writing system:

Mbo

At this point, you can go through your IPA field and convert orthographic letters to IPA equivalents, either systematically through bulk replace (if appropriate) or manually. Then you can enjoy your dictionary database with both orthography and IPA in your entries!
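For the systematic part of that conversion, it may help to see what such substitutions look like in the abstract. The following is just a sketch in sed over a plain exported wordlist (not the FLEx bulk replace itself, which takes the same kind of find/replace pairs through its dialog); the correspondences shown (ph → ɸ, ny → ɲ, ch → tʃ) are made-up examples, not Mbo data:

#!/bin/bash
# Sketch only: apply systematic orthography-to-IPA substitutions to a plain
# wordlist (one word per line). The pairs below are illustrative, not real Mbo
# correspondences; inside FLEx the same pairs would go into Bulk Replace.
sed -e 's/ph/ɸ/g' \
    -e 's/ny/ɲ/g' \
    -e 's/ch/tʃ/g' \
    orthography_wordlist.txt > ipa_draft.txt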

WeSay and BALSA: Thanks!

I just finished a trip to Bunia and Nia-Nia, DRC, where I helped the Ndaka [ndk] and Mbo [zmw] communities develop draft alphabet charts and transition primers, the material for each language including all nine vowels (with ATR harmony) and the egressive/implosive stops. The Mbo version also includes the [p]/[ɸ] contrast, as /p/ is normally [ɸ] there. Each booklet includes a short story in the new draft orthography.

I’ve written before about using WeSay to collect language data in highly illiterate language communities, which was a major part of this work.  And since I don’t want to do IT work full time (or rather I have other things to do with my time), I’m using WeSay in BALSA. So since much of this work would not have been possible without the work of many people, especially those working on BALSA and WeSay, I wanted to take a minute to thank them. Without a budget to do so materially, I’ll do that through describing the work here, and explicitly saying that if you work on WeSay and/or BALSA, please feel free to take and use this story and/or pictures in your own publications; this is your story, too.

We met for this workshop in Nia-Nia, DRC, about an hour into the rainforest (by MAF) from Bunia, which is about an hour (again by MAF) from Entebbe, Uganda. We met in a church, with the Ndaka covering most of the workshop logistics, since this is their home turf.  The Mbo are also a bit on the run these days from a militia conflict that hasn’t seemed to end yet. And they’re a smaller and more illiterate people group. So much so that one of the guys on the Mbo team didn’t participate in a dictation exercise, as we were practicing new letters. And yet, here they are, around a BALSA machine, using WeSay:
WeSay̠Ung'inMbo_IMG_4222_sm
Admittedly, the guy touching the computer isn’t Mbo, but he’s helping them deal with the interface, and they’re choosing pictures through the Art of Reading interface in WeSay, which would seem to be even more popular with less literate communities.

This one is of the Ndaka team, using WeSay in BALSA independently:
WeSay_ndk_IMG_4220_sm
They’re also picking images to go with dictionary entries we put there earlier, though some people started modifying entries by the end of the workshop. Lest this seem trite, let me point out that this was the first time that ANY of them had used a computer of any kind. For some, it was the first time they had seen one. You can see in the background the current stabilizer, which is plugged into the generator we used to have electricity. Without the stabilizer, I wouldn’t plug in a computer, because of the risk of unstable current. After the stabilizer, we put a fridge guard. When I can, I put a second stabilizer in series, to even out whatever current irregularities the first one doesn’t catch. Which is all to say that this is not the most computer-friendly place, even before considering the alternating dust and humidity, and the heat. But these guys took to the tasks, and were able to work somewhat independently on computers for the first time. Having tried this in similar contexts with other software, I attribute this success entirely to WeSay and BALSA. Thanks, guys, for making that possible.
Here is the same team from another angle:

WeSay_ndk_II_IMG_4223_sm
And here is one with other people hanging around, showing that this is truly a community affair:

WeSay_ndk_kids_IMG_4224_sm
So this workshop was a success in part because people who had never used computers before (including the elementary school principal, shown in the background of this last pic) were able to get up and running in very little time, with very few frustrations. They even enjoyed the work so much that I had to kick them out several evenings, after it was already too dark to walk home. So thanks to everyone involved, for your part in making this happen.

Creating Tone fields in Fieldworks 7.0.6~beta7 (not useful for WeSay 0.9.28.0+)

Creating Tone Fields by the Method Native to FLEx –The Better Way

(N.B.: this entry started with FW7.05~b5 and WS0.9.28, though I’m finishing it on FW7.06~b7 and WS1.1.11. Some of the screenshots may look different between these versions, but I haven’t noticed any difference in functionality with regard to these fields.)
After creating custom fields in this way for tone and plural forms, I found that tone fields are already accounted for in FLEx, though not particularly transparently. There is a set of pronunciation fields, which can be inserted here:

This option puts the set of pronunciation fields in the record you’re editing, not the whole database. It gives tone, as well as a couple other fields. It looks like this in FLEx:

What’s nice about this is that you can do this a number of times for the same entry. This gives you the chance to have a number of pronunciations, in different contexts –which is important in phonology, especially with regard to tone. The “Location” field is an empty, customizable field, so I presume we could put things like “Before a High Tone” or “phrase finally” or whatever there, then know that that pronunciation is valid for that context. Filling in some bogus data, we see the following in FLEx:

Under the Hood

The above results in the following in the appropriate entry of the LIFT file:

<pronunciation>
  <form lang="gey"><text>ba</text></form>
  <field type="cv-pattern">
    <form lang="en"><text>CV</text></form>
  </field>
  <field type="tone">
    <form lang="en"><text>?H</text></form>
  </field>
</pronunciation>
<pronunciation>
  <form lang="gey"><text>bad</text></form>
  <field type="cv-pattern">
    <form lang="en"><text>CVC</text></form>
  </field>
  <field type="tone">
    <form lang="en"><text>?HF</text></form>
  </field>
</pronunciation>

So each pronunciation has a form/text set of nodes, and fields with type attributes for each of the visible fields with data in FLEx. Note that these fields are formatted exactly the same as the fields we created earlier here and here, that is

<field type="NameofFieldinFLEx">
  <form lang="LanguageCode">
    <text>Field Contents</text>
  </form>
</field>

The only difference here is that the fields are under a <pronunciation> node, and not directly under the entry itself. But the fact that these fields are grouped together under repeatable pronunciation nodes should mean that we can organize contextually dependent pronunciation (tone or segmental) fields.

Sorting on Pronunciation Fields

I tried sorting and filtering on individual pronunciation nodes in FLEx, but wasn’t immediately impressed. I filtered the above fields for those with CVC in the cv-pattern, and this is what I got:

One can see that the entry is filtered, not the set of pronunciation fields. When working with Toolbox, it was possible to filter on either instance of a repeated field within an entry. Recalling that this only worked when sorting on that field (thereby producing a record for each of the multiple fields), I tried that in FLEx, and it worked:

Note that there is only one pronunciation field listed, and the pronunciation form and tone fields listed are those that correspond to the CV field that was selected in the filter.
This data structure would also allow one to select only particular tone patterns, such as with an XPath expression like pronunciation[field[@type='cv-pattern']/form/text = 'CVC']/field[@type='tone']/form/text, to get the information in the tone field under only those pronunciation nodes that also have CV fields with ‘CVC’ in them.
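Outside of FLEx, the same selection can be tried directly on the LIFT file. This is just a sketch with xmllint, assuming an un-namespaced LIFT file (as in the snippets above), an xmllint new enough to have the --xpath option, and the illustrative file name gey.lift:

# Sketch: print the tone field of every pronunciation whose cv-pattern is CVC.
# Assumes an un-namespaced LIFT file; the file name gey.lift is illustrative,
# and --xpath requires a reasonably recent xmllint.
xmllint --xpath \
  "//pronunciation[field[@type='cv-pattern']/form/text='CVC']/field[@type='tone']/form/text/text()" \
  gey.lift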
Unfortunately, I haven’t been able to see these fields in WeSay (yet, I hope: see this bug report). Which is sad, because this is otherwise the best way to indicate tone in FLEx.

===Poetic Interlude===
I wrote most of the above several months ago, and had forgotten that I had worked this much out, until I ran into the problem of bulk editing on these fields. A quick email to <Flex_Errors at sil.org> and a fairly rapid response later, I was back in business. When I went to write it up, I found the above in my drafts folder…
===End of Interlude===

So I’ve been doing a lot of data collection in the last couple of months using the above paradigm, keeping different tone fields separate by their sibling location fields. I have XSL transforms to add this data to a LIFT file, and some reports to pull it out later, but how do I mess with it in the meantime, should I need to? To get bulk editing on these fields to work, I needed two things:

  1. to sort on ‘pronunciation’ or one of its children (this I had apparently already figured out, but forgotten)
  2. to select the right columns for viewing in the bulk edit view.

Selecting the right columns for viewing in the bulk edit view

In case it isn’t obvious, the visible columns in the bulk edit view determine what fields you can act on. If “Lexeme” isn’t visible, you can’t copy to or from it, or modify it with a regular expression. So first, you need to make the fields you’re looking for visible, which is done through a dialog you can access by clicking in the upper right corner, with tooltip “Configure which columns to display”:

When you click on this, you get a menu of a number of (recently selected?) fields. To access other fields, to change column ordering, or to select language options, select “More column choices…” at the bottom:

This gives you access to the following dialog, where you can find fields not on the above list and select which of a number of writing systems you want to see (and therefore bulk edit). The arrows on the right allow you to move the fields up and down (moving columns left and right on the Bulk Edit screen):

One trick that may not be obvious is that the ‘Tone’ field under ‘Pronunciation’ is available here as ‘Tones’. I presume this is because there are potentially a number of different Tone fields (as in my case). This is the same for ‘Location’ > ‘Locations’ and ‘CV Pattern’ > ‘CV Patterns’.

Sorting on Pronunciation Columns

Once all the fields you’re interested in are in the “current columns” (right) side of that dialog, you can select a column to sort on (shown with a light blue triangle). Selecting ‘Pronunciations’ gives three lines for this entry, and proclaims “Pronunciation” at the top of the page for slower ones like me.

If you’re in a context where you want to sort on two of these fields (if one doesn’t uniquely sort them, as in the screenshot above), you can select one, then shift-select another, which will give a secondary sort (and a smaller triangle) as in the following:

Here the location is the first sort, then the tone. Note that the pronunciation form isn’t sorted (a…z…k…a), though the duplicate HAfter-sg field for titi is (correctly) showing up as another pronunciation/tone field (with pronunciation/form atíti nɛ) –showing that sorting by any of the pronunciation fields gives this layout.

Bulk Editing Pronunciation Fields

Getting back to the point of it all (for me, anyway), with this configuration it is now possible to bulk copy to/from these fields:

Locations didn’t show up for me under “Bulk Replace”; I’m not sure why, though that sounds familiar –perhaps I didn’t configure it right, or maybe that’s a bug.

Summary

Though creating tone fields under pronunciation fields is not currently helpful for WeSay collaboration, it seems a much more principled way of treating tone data in FLEx, since it natively allows for varied contexts, CV patterns, and segmental morphophonemics impacting the frame (each pronunciation has a form field, which can include the lexeme, the frame, and any segmental interactions between them). In addition, these fields are accessible to FLEx filtering and sorting, including bulk edit operations.
Given the complexity of this configuration, I would not recommend what I have described to the computer non-savvy (e.g., users more comfortable in WeSay). But for those comfortable manipulating these configurations, FLEx can be a powerful tool for manipulating tone data.

Round-tripping LIFT data through XLingpaper

Rationale

The LIFT specification allows for interchange between the lexical databases we use, such as those in FLEx and WeSay. As an XML specification, it is also subject to XSL transformation, and can be converted to XML documents that conform to other specifications, such as XLingPaper, an XML specification for writing linguistics papers. I described before a means to get data out of FLEx into XLingPaper, but that required a script generating regular expressions which were then put into a FLEx filter by hand (metaphorically speaking). Computers should be able to automate this, and so (following my “If computers can do a particular task, they should” motto) I developed a script to take that regular expression generator and feed those expressions to an XSL stylesheet to produce XLingPaper XML from the LIFT XML automatically.
The other half of the rationale is that I hate exporting data from a database to a paper or report, seeing an error, and not being able to fix it in just one place: either I fix it in both the paper and the database, or else I fix it in the database and re-export to the paper. So a way to get data from LIFT to XLingPaper and back seemed helpful for drafting linguistics papers, even if one isn’t dealing with the volume of reports I’m looking at generating.
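The conversion scripts themselves aren’t released yet (see below), but the general shape of each direction is just an XSL transformation applied with xsltproc. The following is only a sketch: the stylesheet names are placeholders I’ve invented for illustration, while the report and LIFT file names match those used in the test procedure below.

# Sketch of the general shape of the round trip; the stylesheet names are
# placeholders, not the actual (unreleased) scripts.
xsltproc lift2xlingpaper.xsl gey.lift > Report_VowelInventory.gey.xml
# ...edit the report in XMLmind/XLingPaper, save as Report_VowelInventory.gey.mod.xml, then:
xsltproc xlingpaper2lift.xsl Report_VowelInventory.gey.mod.xml \
  > Report_VowelInventory.gey.mod.compiledfromXLP.lift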

Tools

One major caveat for this work is that these tools (FLEx, WeSay, and XLingpaper) are in active development, so functionality may vary over time. The tests in this post were run with the following:

  1. FLEx 7.0.6.40863 (for Linux)
  2. WeSay 1.1.9 (for Linux) –This doesn’t enter directly into these tests, but the LIFT files used often sync back and forth between these two programs.
  3. xsltproc from a standard Ubuntu Linux install (i.e., compiled against libxml 20706, libxslt 10126 and libexslt 815)
  4. GNU bash, also from standard Ubuntu Linux (i.e., version 4.1.5)
  5. GNU diffutils, also from standard Ubuntu Linux (i.e., version 2.8.1)
  6. XMLmind XML Editor, version 5.1.0
  7. XLingPaper, version 2.18.0_3

All of these tools are free (or have a free version) and available online from their respective sources, and most are open source.
The scripts I’ve written (to generate reports and call the XSL transforms) are not yet publicly available; I hope to have them cleaned up and more broadly tested before long.

Test Goals

I want to see if I can

  1. Get data from LIFT to XLingPaper format,
  2. Modify the XLingPaper document in XXE (which keeps it in conformity to the XLingPaper DTD),
  3. Get it back into LIFT and imported to FLEx,
  4. Show that the FLEx import made all and only the changes made by modifying the XLingPaper document (i.e., no other data loss)

To do this I will be using an output of diff between two versions of the XLingPaper document (original and modified), and another diff between two versions of the LIFT file (originally exported, and exported after import). To achieve #4, I will show that the two diffs show all and only the same changes to data entries (the modifications to the XLingPaper doc are the same as the changes to the FLEx database, as evidenced by its export to LIFT). FYI, this LIFT file has 2033 entries and takes up almost 2MB (plain text), so we’re not talking about a trivial amount of data.

Test procedure

  1. Backup Wesay folder (this is real [gey] data I’m working with, after all…)
  2. Export “Full Lexicon” from FLEx, and copy it to gey.ori.lift
  3. Run report (vowel inventory) on exported gey.lift (This creates Report_VowelInventory.gey.xml)
  4. Open created report in XXE
  5. Modify and save (because XXE changes format –this helps diff see real changes, not those irrelevant to xml)
  6. Save as Report_VowelInventory.gey.mod.xml, and modify one example of each field we’re interested in, including @root (at this point both files have been saved by XXE, for easier comparison).
  7. Run `diff Report_VowelInventory.gey.{,mod.}xml` (results below)
  8. Run `xlp-extract2lift Report_VowelInventory.gey.mod.xml .` (This creates Report_VowelInventory.gey.mod.compiledfromXLP.lift)
  9. Backup FLEx project (just in case, as there’s real data here, too)
  10. Import Report_VowelInventory.gey.mod.compiledfromXLP.lift to FLEx project, selecting “import the conflicting data and overwrite the current data (importing data overrules my work).” and unticking “Trust entry modification times” (This is important because if that box is selected entries won’t import unless you have also changed the ‘dateModified’ attribute on an entry –which I generally don’t).
  11. Export again, producing a second LIFT file exported by FLEx (one before, and one after the import)
  12. Run `diff gey{,.ori}.lift`
  13. Compare diffs to see fidelity of the process.

Test results

Here is the diff showing the changes between the original report and the modifications:

$ diff Report_VowelInventory.gey.{,mod.}xml
11c11
< >Rapport de l’Inventaire des Voyelles de [gey]</title
---
> >Rapport de l’Inventaire des Voyelles de [gey]MOD</title
23c23
< >Kent Rasmussen</author
---
> >Kent RasmussenMOD</author
42c42
< >Voyelles</secTitle
---
> >VoyellesMOD</secTitle
65c65
< >mbata</langData
---
> >mbataMOD</langData
89c89
< >pl: mabata</langData
---
> >pl: mabataMOD</langData
113c113
< >fissure, fente</gloss
---
> >fissure, fenteMOD</gloss
137c137
< >mke / wake</gloss
---
> >mke / wakeMOD</gloss
155c155
< externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono / mikono'"
---
> externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono / mikonoMOD'"
171c171
< externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orange'"
---
> externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orangeMOD'"
180c180
< externalID="root='paka'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"
---
> externalID="root='pakaMOD'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"

As you can see from this diff output, I changed data in a number of different types of fields, including the report title, author, secTitle, langData (from the citation form), langData (from the Plural field), and glosses in each of French and Swahili; the last three changes are to the Swahili and French definitions and the root, which are not visible in the printed report but are stored in an externalID attribute (recently added to XLingPaper to be able to store this kind of information without having to put it elsewhere in the structure of the document).

And here is the diff showing the changes between the original LIFT export and the one exported after importing the LIFT file with modifications:

$ diff gey{,.ori}.lift
2601c2601
< <form lang="swh"><text>mkono / mikonoMOD</text></form>
---
> <form lang="swh"><text>mkono / mikono</text></form>
10776c10776
< <gloss lang="swh"><text>mke / wakeMOD</text></gloss>
---
> <gloss lang="swh"><text>mke / wake</text></gloss>
15871c15871
< <form lang="gey"><text>pakaMOD</text></form>
---
> <form lang="gey"><text>paka</text></form>
23529c23529
< <form lang="gey"><text>mbataMOD</text></form>
---
> <form lang="gey"><text>mbata</text></form>
27587c27587
< <field type="Plural"><form lang="gey"><text>mabataMOD</text></form>
---
> <field type="Plural"><form lang="gey"><text>mabata</text></form>
31657c31657
< <form lang="fr"><text>orangeMOD</text></form>
---
> <form lang="fr"><text>orange</text></form>
32416c32416
< <gloss lang="fr"><text>fissure, fenteMOD</text></gloss>
---
> <gloss lang="fr"><text>fissure, fente</text></gloss>

Summary

  1. The first several MODs to the paper (to titles, etc.) are not in the second diff, since only example data is extracted into the LIFT file to import (this is what we want, right?).
  2. The other mods –root, citation, plural, gloss-swahili, gloss-french, definition-french and definition-swahili– all survived.
  3. No other changes existed between the exported LIFT files.

Discussion

Because FLEx exported essentially the same LIFT file (of 2033 entries and almost 2MB, remember), with all and only the changes made in XXE, I presume that there were no destructive changes to the underlying FLEx database, and this procedure is safe for further testing. I did not go so far as to diff the underlying fwdata file, as I probably wouldn’t understand its format anyway, and I wouldn’t know how to distinguish between differences in formatting and content (while it is also XML, I don’t understand its specification or how it is used in the program –which is not a bad thing).
Speaking of what I don’t know, I should be clear that my formal training is in Linguistics (M.A. Oregon 2002), not in IT. I’m doing this because there is a massive amount of linguistic data to collect, organize, analyze and verify, and I want to do that efficiently (the fact that this is fun is just a nice byproduct). In any case, I have certainly not followed best practices in my bash or XSL scripting. So if you read the attachments and think “this guy doesn’t know how to code efficiently or elegantly,” then we’re already in agreement on that. And you’d also be welcome to contribute on improvements. 🙂

Acknowledgements

I wouldn’t have gotten anywhere on this project without the work of many others, particularly including those who are giving of their own time and resources (which surely could have been spent elsewhere) on FLEx, WeSay, and the LIFT specification itself. Of particular note is Andy Black, who encouraged me to take another stab at XSLT (after I told him I’d tried and given up a few years ago), and who has provided invaluable and innumerable help, both in the development of the XLingPaper specification and in particular issues related to these transforms. Most of what is good here has roots in his work, though I hope no one holds him responsible for my errors and inelegance.

Wesay Wrapper

As happy as I am with WeSay, it is designed with a mindset of one user working on a given computer. It can work on a number of different languages (one at a time, of course!) out of the box, but it isn’t straightforward: you have to navigate to the WeSay folder, then click on the .lift file, each time you want to work in a language other than the one WeSay opened last. Since one of the goals of working in BALSA is to hide the (often confusing) directory structure from the (beginning) user, this is less than ideal for switching between languages in WeSay on BALSA.

Since I’m doing just that (and a fair bit of switching on my own computer between WeSay in different languages), I wrote a script to invoke WeSay with the arguments calling for a given project each time it runs, so we will never need to think about what the last project was. It has a graphical tool (Zenity) to select which project to open, which is populated by the projects actually on that computer.
It assumes a structure of projects named by Ethnologue code, each in a folder with that name, each of which is in the same WeSay folder. It also assumes that each project has the full language name defined in the palaso:languageName tag in the language’s .ldml file (if it doesn’t, it will still work, but the gui will look off, and the next language will be on the same line).

I have the gui in English and French, since that’s what we use here. 🙂

For those interested, here it is:
#!/bin/bash
Version=2011.07.28
#set -x
# Note: this assumes a directory structure of WeSay projects named after three-letter ISO/Ethnologue codes.

case $HOSTNAME in
Balsa*) wsfolder=/home/balsa/WeSay/;;
*) echo "Is WeSay set up on this computer? If so, update the $0 wrapper.";exit;;
esac
langs=
startdir=`pwd`
cd "$wsfolder"
# Build a list of "code name" pairs from each three-letter project folder,
# pulling the language name out of that project's .ldml file.
for xyz in `ls -d ???`
do
  langs="$langs $xyz"
  langs="$langs `grep --after-context=1 "palaso:languageName" $xyz/WritingSystems/$xyz.ldml|tr -d '\n'| grep -o "<palaso:languageName.*>" |grep -o '".*"'|tr -d '"'`"
done
cd "$startdir"
# Let the user pick a project from the list.
lang=`zenity --width=120 --height=400 --list --title "Open/Ouvrir Wesay" --text "Choose language / Choisissez langue:" --column="ISO 639-3" --column="Name/Nom de langue" --multiple $langs`

echo "Ethnologue code entered: $lang"
if [ -f "$wsfolder/$lang/$lang.WeSayConfig" ]
then
  wesay "$wsfolder/$lang/$lang.lift"
else
  zenity --error --text "Sorry, please check the code and try again.\nDésolé, SVP vérifier le code, et essayer encore."
  $0
fi
exit

Using computers to Help the Computer Illiterate Develop their Language

I’ve been working with orthography development a bit, and it has been a major challenge getting all the different pieces of the work to fit together well. One of the main divides I see in the work is between people who would use computers exclusively, and others who would use them not at all (during the research process, at least). I have never felt comfortable in either camp, in part because I am of the computer generation, but also because I have seen a lot that paper-and-pencil methodology has to offer. Perhaps the most relevant point to our work is that computers are inaccessible to most of our national colleagues. Which means that if I’m doing everything on a computer, I’m doing it by myself. Or I’m teaching people things I started learning in the third grade (they know what the 0/1 symbol is from cell phones, but other than that, I’m usually starting from scratch).
Fortunately for me, there are people working on making linguistic computer work more accessible to the less computer literate, so we can take advantage of computers without pushing our national colleagues out of the work. One bright shining example is WeSay. Its interface is straightforward and simple. Think of a word and type it in. Give it a short meaning. Next word. Later, you can go back and add longer meanings, other senses, etc. You can work through a wordlist, or use semantic domains — both great ways to help people think of new words to put in their dictionary. I had someone working on it for several days, to bring his wordlist up to over 2,000 words, and we were both quite happy with how it went. I didn’t realize just how happy I should have been until I had the same guy do some other tasks on the same computer (I think it was still just typing, but in OpenOffice). Things immediately bogged down, and I was constantly needed to fix something.
So here’s my problem: I really like dictionaries, and I think a dictionary is a backbone to any other work done in a language. And it just doesn’t make sense to write a dictionary with pen and paper. But the majority of my time goes to developing writing systems, which is better done as a community process (i.e., not everyone standing around watching me use a computer). The first thing I do is collect a wordlist, which I eventually make into the beginnings of a dictionary. But that dictionary is going to need to be put in a database in a standardized writing system, or it will be a mess. So writing system development and lexicography go best together, but how?
Constance Kutsch Lojenga (among others) has developed a participatory linguistic research methodology, which is good at bringing speakers of a language into the language discovery process from the beginning (cf. “Participatory Research in Linguistics.” Notes on Linguistics Vol. 73(2):13-27. Dallas, TX: SIL. 1996). Language community members see the sound distinctions in their language at (about) the same time I do, and we get to make writing system decisions as a group, given the linguistic facts we have observed together. The basic discovery process puts two words together and asks, “is the sound in question in these two words the same, or different?” If they are the same, they are put in the same pile; if different, in different piles. There is, of course, lots of background work to be done (like sorting words by syllable profiles, so we’re looking at the same position in the same type of word at a time), but we attempt to make the discovery process itself as accessible as possible, and I have seen it work well.
Which brings me to my dilemma. This methodology as currently conceived uses words written on paper, which are sorted as a group. Which means that if I use WeSay to collect a wordlist, I (or some other computer-savvy person) need to export that wordlist into a format that will print onto paper that can be cut into cards (not hard with mail merge and document templates, but it is work), and then, after the sorting process, all the information gained (e.g., “fapa” really should be spelled “paba”) needs to be put back into the database — or else it will just remain on those cards. Even if we write up a summary of selected cards in a report, the wordlist/dictionary won’t improve unless we can get the information off those cards and back into the computer — which again is not hard, but it is time-consuming, and prone to error.
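The export half of that round trip is the easy part. As a rough sketch only (assuming an un-namespaced LIFT file with the usual entry/lexical-unit/form/text structure, an xmllint new enough to have --xpath, GNU sed, and a made-up file name), something like this pulls out the word forms for a mail merge or card template:

# Sketch: list the lexeme forms in a LIFT file, one per line, for printing onto
# cards. Assumes no XML namespace; the file name ndk.lift is illustrative only.
xmllint --xpath '//entry/lexical-unit/form/text' ndk.lift \
  | sed -e 's/<text>//g' -e 's|</text>|\n|g'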
Which got me thinking, would there be a way to simplify the round-trip, and make the same/different decisions in a way that the spelling change information would be immediately returned to the database? As nice as the simple card-sorting interface is, if we could make something (at least nearly) as simple on a computer, I know our colleagues would be up to it. If we could make it simple. Which got me thinking about orthography and WeSay.
Anyway, I’ve written up some thoughts on how this might work practically, which you can find here.

Wesay Orthography Proposal

I think it would be helpful to language communities to have an orthography development/checker tool in WeSay. For why I think that and what I’d want to do with it, see this other post. Here’s what I would like to see:
Config:Verification
Making the task simple for the user will necessarily require other complexities. We would need a config page, that would specify which words we’re looking for. It would basically be a filter for
1. Part of speech (maybe just from a list of noun, verb, and other)
2. Syllable type (e.g., px-CVCV, px-CVC, etc. –I’m not sure how hard this would be)
2b. position in the root: (C1, C2, V1, V2)
3. Consonants [] or Vowels [] (we just look at one or the other at a time)
4. Consonants available (for later selection by user)
5. Vowels available (for later selection by user)
6. Syllable types available (Options to choose from for dropdown in #2. For ortho development, this could contain just one or two canonical types; for full dictionary checking it would require more. I’m not sure how savvy WeSay could be at this point. It might be wise to split this into two groups, one for nouns, and another for verbs –or more.)

Based on the above information, the user is asked which letter to verify, and then would be presented with a “Verification” page where he would decide whether the sound in the position in question on each word is the same (or not) as others that are marked with the same letter (in the same position of the same types of words).
Verification
At the top of the page would be a new word, with LWC glosses and a picture, if available, and a list of words below it (all of which are filtered according to the settings on the config page, so we’re only looking at sounds/letters in one part of the same type of word). If it were simple to bold the letter in question in each word, that would be great, but it is not necessary to the task. In any case, it would be good to prominently display the letter under investigation, and more (very?) subtly the part of speech, syllable profile and root position. At the top of the page are two buttons: “Yes” and “No,” and at the bottom of the page are “Recheck” and “Verification OK/Next Letter.”
The user’s task is to understand and pronounce the word at the top of the page, and decide if it belongs with the rest. If it does, he clicks “Yes”; otherwise, “No.” Once a button is clicked, another word is presented (if a mistake is made, it will be sorted out later). Once the user has gone through all the input words and is satisfied that the words all belong in the same group (if he isn’t sure, “Recheck” would redo the list –with just the “Yes” words), he clicks “Verification OK/Next Letter” and the following is done behind the scenes:
1. the letter (e.g., ‘p’, ‘b’, or ‘f’) in the place in question (e.g., first consonant) is changed to the letter being verified (e.g., ‘p’) for each word on the “Yes” list, if it isn’t already that letter.
2. those words are marked as verified for that letter (Is this possible? We could track progress elsewhere, if need be).
The user is then returned to the dialogue asking which letter to compare next (and given the option to quit/pause). The same page is presented, but with the words that had been taken out of the last group (i.e., where “No” was clicked) along with words with the new letter, and the process repeats. If at any time the user terminates the task, it would be nice if the words that had been put in a box they might not belong in (i.e., marked “No” to ‘p,’ but not “Yes” to anything else yet) could be marked as having unverified orthography (for that letter?). Then, when this task is started up again, it can start where it was left off.

At any point the user says, “give me the words with letter ‘x’,” he will get words that have been verified, and words that didn’t belong somewhere else (unless there were no “No” words on the last run). Thus, if there is a word that was wrongly put into a group (bad user input), a letter can always be re-verified (One can go back and click “No” on a word that one had just accidentally clicked “Yes” on). In any case, a spelling/letter is not corrected without comparing it to a list of words using the same letter in the same place of the same type of word. (I’m presuming good/new/consistent orthographies here, not ones like English, where a list of correctly spelled words would be more appropriate.)
Occasionally the process would need to be stopped to modify words that had been wrongly indicated for part of speech, root or syllable profile –unless the user doesn’t mind continually hitting “No,” or there is some other way of excluding them from the sort (maybe “Yes,” “No,” and “Wrong Category”?)
Once all the letters have been gone through for a given syllable type/position and part of speech, the settings on the config page could be changed, and the next set of words could be begun. If we put all but the letter to investigate on a hidden config page, then this would be just a one page task. Getting through all the letters in a given position can take awhile, so changes to those config settings (other than the letter to verify) wouldn’t need to be done all that often.
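To make the flow concrete, here is a very rough sketch of that same/different loop as a plain shell script over a text wordlist. This is nothing like WeSay’s internals; it is just the logic described above, with the file names and the “first letter” simplification invented for illustration:

#!/bin/bash
# Rough sketch of the verification loop described above, over a plain wordlist
# (one word per line). Everything here is illustrative; this is not WeSay code.
letter=${1:-p}                  # letter being verified (default 'p')
yes=(); no=()
while read -r word; do
  # ask the same/different question for each word in the filtered list
  read -r -p "Same sound as [$letter] in '$word'? (y/n) " answer </dev/tty
  if [ "$answer" = "y" ]; then
    # normalize the letter in question (here simplified to the first letter)
    yes+=("${letter}${word:1}")
  else
    no+=("$word")               # back into the pool for the next letter
  fi
done < wordlist.txt
printf '%s\n' "${yes[@]}" > "verified_${letter}.txt"
printf '%s\n' "${no[@]}"  > "unverified.txt"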

I originally had a “sorting” page before the verification page, since that is the way we do it with cards, but I think we could use a simple binary sort, as is present in the “verification” page, from the beginning. With cards on a table, a given sort might come up with a number of different piles (i.e., if there were m’s, f’s, and ɸ’s mixed up with p’s and b’s, for some reason, or, more likely, if there were under-differentiated vowels, and five vowels become nine). But I think in either case a binary sort would do; it would just take a bit longer (a few more sorts) to get each record in its proper pile.

Feasibility
I’m not sure how feasible this idea is; it is probably more like a “purple elephant” wish, but it would be nice to have, if possible. Without any real knowledge of the inner workings of WeSay, I imagine that knowledge of CV structure might be the largest hurdle to this idea’s implementation. Normally when I collect words, I collect a couple of forms: sg/pl for nouns, and infinitive/imperative for verbs, or whatever two forms will give me root structure information (a bit of preliminary research may be necessary to know what will work/help). While I can add fields for plural, infinitive and root in WeSay, there is no task to “add forms”, so giving WeSay information on syllable structure would be a more consultant-oriented task, using “Dictionary Browse and Edit”. Unless we had “Add forms: plural” and “Add forms: imperative” and then “Add Root forms”:
Add forms: plural:
word:
plural:
Add forms: imperative:
word:
imperative:
Add Root forms:
(in config:
noun root parsing fields to compare:
[ ] and
[ ]
verb root parsing fields to compare:
[ ] and
[ ])
field 1:
field 2:
root: (this might even be guessable?)

But there are probably other hurdles that I would be completely unaware of. Anyway, this is my idea.

EDIT:
Actually, looking back at our procedures, we would like an ability to filter on words with V1=V2 (at least), and also C1=C2, if possible.
Then normally we would look at all vowels in V2 position with V1=a (for instance), then V1=i, etc.
This makes the investigation process much more straightforward (paka is more clearly ‘a’ than paki), though perhaps it would impossibly complicate the innards of the task, were they ever doable.