Category Archives: Language Development

Segment Interpretation

I think I finished a new feature today, complete with a settings page for the user to make it work.

The question of segment interpretation has been an issue almost everywhere I’ve worked, and it is often a highly idiosyncratic (or language specific, as you like) thing. So I’ve set out four boolean settings (yes/no, True/False), which govern whether a given segment type is treated as a regular consonant, or separately from other C’s, in the syllable profile analysis:

  • N – Nasals, generally (or only word finally)
  • G – Glides/semivowels
  • S – Other sonorants (i.e., not the above)

In addition to this setting, there are also some segment combinations that could be treated separately:

  • NC – Nasal-Consonant sequences
  • CG – Consonant-Glide Sequences
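To make the interpretation logic concrete, here is a minimal sketch in Python (my own illustration, not A→Z+T's actual code; the segment inventories and setting names are made up):

```python
# Illustrative sketch of per-segment interpretation settings (not A→Z+T's code).
# The segment classes below are a toy inventory; a real analysis would use
# the language's own segment classification.
NASALS = set("mn")
GLIDES = set("wj")
SONORANTS = set("lr")  # other sonorants (not nasals or glides)
VOWELS = set("aeiou")

def profile(word, distinguish_N=False, N_final_only=False,
            distinguish_G=False, distinguish_S=False):
    """Map each segment to a profile symbol, per the boolean settings."""
    symbols = []
    for i, seg in enumerate(word):
        if seg in VOWELS:
            symbols.append("V")
        elif seg in NASALS and distinguish_N and (
                not N_final_only or i == len(word) - 1):
            symbols.append("N")
        elif seg in GLIDES and distinguish_G:
            symbols.append("G")
        elif seg in SONORANTS and distinguish_S:
            symbols.append("S")
        else:
            symbols.append("C")
    return "".join(symbols)

print(profile("banta"))                      # CVCCV: everything is a C
print(profile("banta", distinguish_N=True))  # CVNCV: nasals distinct everywhere
print(profile("banta", distinguish_N=True, N_final_only=True))  # CVCCV
```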

That’s all for now, but the infrastructure is there, so if anyone REALLY needed something else, we could talk about it.

The following slides show the options for syllable profiles, after analysis given the settings as on the page (which shows current settings on open).

Default Operation

On first open, everything is a C, and no CC sequences are collapsed:

Note the number of each syllable profile, which are sorted with the largest on the top, and how quickly they taper off. I’ve always appreciated being able to do a quick syllable profile analysis, so this is nice. Good to know which are your more canonical forms (e.g., CVC and CCV here) and which are not (e.g., CCVCCV and CVCCCV here).
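The tally itself is simple enough to sketch; assuming a list of analyzed profiles (counts made up here to echo the discussion), sorting largest-first is one call:

```python
from collections import Counter

# Hypothetical profiles from an analyzed word list (counts are illustrative)
profiles = ["CVC"] * 97 + ["CCV"] * 80 + ["CCVCCV"] * 3 + ["CVCCCV"] * 1

counts = Counter(profiles)
for prof, n in counts.most_common():  # largest group first
    print(prof, n)
# CVC 97
# CCV 80
# CCVCCV 3
# CVCCCV 1
```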

Distinguishing Segments by Type

By toggling the various settings (then hitting “Use these settings”, after noting the warning that this will trigger a data reanalysis), you can get other analyses. For instance, if you set N≠C, then you get the following (All nasals are distinct from other consonants, wherever they appear):

Or, you can just distinguish nasals only word finally:

In case it isn’t obvious, there is a clear practical trade-off to these selections. That is, the more distinctions you make, the smaller your groups become: from a max of 97 in this case, to 80 or 67, as the number of distinctions increases. So one is typically advised to use the distinctions important in the language (as soon as those are known!), and keep everything else together; hence the limited number of distinctions I’m offering.

And of course, you can distinguish both glides and nasals at the same time:

And even sonorants, too:

Distinguishing Sequences of Segment Types

The other kind of setting on this page has to do with sequences of particular segment types. That is, should NC be interpreted as such, the same as other CC sequences, or as a single C (all of which I’ve heard people want)? One advantage of this setting is that one can get NC sequences marked as such, without otherwise distinguishing nasals (as in the first two settings):

One can do the same for CG, resulting also in NCG sequences in syllable profiles:

For these two settings, one can leave the default CC interpretation, or specify NC or CG as above, but one can also set either (or both) to just C, so these sequences lump together with other C’s in the profiles (I assume when appropriate for the language!), as here:

Note that this lumping greatly increases your group sizes, and reduces your number of groups, which can help a lot in the analysis —though again you want to be sure that this is appropriate. If you don’t know why it would or wouldn’t be, it probably isn’t.
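As a sketch (again my own illustration, not the tool's code), the sequence settings can be thought of as a rewrite over an already-marked profile string, with each sequence either kept as such, treated as an ordinary cluster, or lumped to a single C:

```python
def interpret(classes, nc="NC", cg="CG"):
    """classes: per-segment string over C, V, N (nasal), G (glide), with N/G
    pre-marked. nc/cg choose how N+C and C+G sequences surface: 'NC'/'CG'
    (as such), 'CC' (like any cluster), or 'C' (a single C). Lone N/G fall
    back to plain C, i.e. per-segment distinctions are off in this sketch."""
    p = classes.replace("NC", "\x00").replace("CG", "\x01")  # protect sequences
    p = p.replace("N", "C").replace("G", "C")                # lump the rest
    return p.replace("\x00", nc).replace("\x01", cg)

print(interpret("CVNCV"))             # CVNCV: NC marked as such
print(interpret("CVNCV", nc="CC"))    # CVCCV: like any other CC cluster
print(interpret("CVNCV", nc="C"))     # CVCV: lumped to a single C
```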


Anyway, now as you go to sort words by surface tone pattern in various contexts, you can group words with certain segment types together or apart, as appropriate for the language, and work with the appropriate groups as you sort for tone. This will be available in version 0.5 of A→Z+T.

Record, Verify, Repeat

When I wrote earlier, I realized I was probably sending out a firehose of information. I really was just trying to get some feedback on the UI, but I realize (based on responses) that a lot of people would have appreciated a bit more context on what we’re doing. So I’ll try to lay things out a bit more piecemeal here, and hopefully in future blog posts.

One thing that I’ve thought from the beginning is that the participatory methods I learned depend heavily on being able to fix things. We assume mistakes will be made, so we build into the method ways to check and fix things.

For recording, there is a simple three button method for taking the recording of a framed word with a Click-Speak-Release work flow, like in WeSay and FLEx. On release of the record button, a number of things happen:

Under the hood:

  1. The recording is written to file with a (long!) meaningful filename, including syllable profile, part of speech, guid and gloss. This enables better searching of a pile of files, with whatever operating system tools you have for doing that.
  2. The filename is added to the LIFT file in the appropriate place for an audio writing system (as used in FLEx and WeSay), so it is immediately available in any tool that reads LIFT.
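The two steps above might be sketched like this; the filename scheme is illustrative (the real one includes the same kinds of pieces, but I'm not claiming this exact order or separator):

```python
import uuid

def recording_filename(pos, profile, guid, form, gloss):
    """Compose a long but meaningful filename from part of speech, syllable
    profile, guid, form and gloss (illustrative scheme, not the tool's exact one)."""
    safe_gloss = gloss.replace(" ", "_")  # keep the name one token
    return f"{pos}_{profile}_{guid}_{form}_{safe_gloss}.wav"

# The resulting name would then also be written into the LIFT file, in the
# audio writing system field, so any LIFT-reading tool can find it.
name = recording_filename("Nom", "CVC", uuid.uuid4(), "wur", "grain (m)")
print(name)
```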

Visible changes:

  1. The record button disappears, and is replaced by play and redo buttons. This allows the user to immediately hear what was just recorded, and approve it before going on. Or, to click redo, and do another take. In beta testing, there were lots of reasons to do this, including noise in the room, weirdness of sound cards starting and stopping recording, and users getting used to how to click and speak in synchrony, without excessive delays, while getting the whole utterance.

What it looks like

A→Z+T update: Surface Tone Group Names

Since before writing this, one question I’ve gotten a number of times is why tone groups are just called 1, 2, 3, etc. This was somewhat intentional, as I think it is healthy (both for the computer tool, and for the linguist) to remain agnostic about the value of sortings, until “same/different” piles are established. But while I stand by that statement at least initially, I recognize that most people are ultimately going to want to give even surface groupings some kind of descriptive names, and it would be nice for this tool to handle that.

So I decided to give myself a gift and implement this feature, and this is what I came up with:

The verify surface tone group page now has a “rename group” menu in the upper left, which when clicked on, provides the following very simple entry window:

I’ve given my position on how these labels should go, but you can of course call it ‘H’, ‘High’, or ‘Rocky Mountain High’, as you like. Analytically, though, I hope you get that giving any label at this point that implies an underlying form value would be inappropriate, as we are sorting on surface form here.

This name is then picked up in the tone report, providing more interesting results:

Unfortunately, I haven’t figured out how to display fonts with Graphite features here (yet; anyone?), but fortunately these are just standard Unicode points, so the text file output can be given a nicer font specification when viewed in LibreOffice, as usual:

And of course, wherever else you import this text will be able to handle Graphite font features, right? :-)

A→Z+T user interface (version 1)

I’ve written before (here and here —yes, not quite a decade ago!) about the need for a tool to collect data for my work in sound system analysis (phonology). This year, I decided to stop waiting for someone else to do this for me. So I learned Python and Tkinter, and I now have a draft tool that I’m going to try out in a live work environment next week. Here is a screenshot of the splash screen:

The database it refers to is a Lexical Interchange FormaT (LIFT) database, which allows collaboration with those using other tools that can read LIFT (e.g., FLEx, WeSay, and LexiquePro), as well as archiving online. Once it loads (1.5 sec, unless it takes 3 minutes to reanalyze syllable profiles), you get the following status field:


This window has a number of menus, which are intentionally hidden a bit, as the user shouldn’t have to use them much/at all. One can change which language is used for the program (currently English and French; other translations welcome!), which language is being analyzed, and one or two gloss languages (these last options depend on what is actually in the database):

One can also change the words being looked at in a given moment. The primary search parameters are part of speech (indicated in the dictionary database) and syllable profile (analyzed by this program).

Once the group of words to be studied is selected (based on part of speech and syllable profile), the third section of the basic menu lists options relevant to that group:

Many of these options are only meaningful when another is selected. For instance, “sound type” must be set to ‘Tone’ before you can select which tone frame you want to use. Similarly, if you are looking at CVC nouns, you won’t see V2 or V1=V2 as options!

The basic function of presenting and recording card sorting judgements can be done on the basis of this first menu alone. But other functions require a bit more work, either running reports or changing something most people shouldn’t. So in the Advanced menu, you can ask for a tone report (more on that later), or a basic report of CVs (not really doing much right now).

You can also record, with one window to let you test your system, another that pulls automatically from draft underlying tone groups (more later!), and a third that allows you to specify words by dictionary ID in a separate file, if you want to record a specific set of words that aren’t otherwise being picked up:

This Advanced menu also gives you access to the window where you can define a new tone frame. And another lets you add stuff, like a word you find wasn’t in your database but should be, or a group of words you skipped (more on that later).

Status indicators

As you make whatever changes you like through these menus and the windows they provide, the current status is reflected on the screen, as below:

This is now looking at CVC Verbs (not CVCV Nouns, as above), and we’re now looking at the infinitive frame (not the nominal “By Itself” frame).

If your preference is for Vowels (no accounting for taste; it happens), you’ll also get an “A” card on the big “Sort!” button, just to be clear we’re working on Vowels now:

But the Vowel sorting is not so advanced as the tone one at this point; it mainly lets you look at lists of data under different filters:

Once the verification process has started for a ps-profile combination (e.g., CVCV Nouns), the number of verified groups is displayed in the progress table:

No verified groups for CVCV nouns in “By Itself” frame

Sorting Data

When the user clicks “Sort!”, the appropriate searching and filtering is done, to provide the following screen to sort those words, one at a time:

The first framed word (here wara naha, “Here is chasse” —yes, a bad combo of an English glossed frame and a French glossed word, but you wouldn’t do that, would you?) is sorted automatically into the first group, and the user is asked if the second word is the same.

The user is also given the option “Different than the above”, as well as “Skip” —there are some word/frame combinations that just don’t work, either because of the meaning, or the syntax, or particular taboos, or whatever. Excluding one word (or a couple) isn’t the end of the world, as we’re looking to understand the system here. Words that are skipped can always be sorted again later.

If the user selects “Different than the above”, that word becomes another button to select a second group, for the third word:

If that word is also different, the program gives a third group from which to select when sorting the fourth word:

Up to as many as are needed:

Once all the words have been sorted into groups, each group is verified, to make sure that it contains just one tone melody. Once the user verifies a group, the main window updates:

One verified group for CVCV nouns in “By Itself” frame

Once the piles are sorted and confirmed to be different, the user is presented with a list of words like this:

The user can scroll through the whole list, clicking on any words that don’t belong to remove them. The button on the bottom of the page gives the user the opportunity to confirm that the remaining words are the same (People who care about this kind of detail may note that progress is indicated in the upper right –this is the second of three groups being verified):

Once that last button is clicked, the group is considered verified, and this is reflected on the status page:

Second CVCV noun “By Itself” group verified

And so on for a third and more groups, for as many as the user sorted the data into:

But as anyone who has run a participatory workshop can tell you, these processes are not particularly linear. Rather, there is often the need to circle back and repeat a process. So for instance, the fact that these three groups are each just one tone pattern does not mean that they are each different from each other. So we present the user this screen:

This allows the user to confirm that “these are all different”, or to say that two or more should be joined (something we would expect if there were lots of groups!). If the user selects one of the groups, the following window is presented; the first selected group is now presented as a given, and the user is asked which other group is the same.

Note that all of these windows have “Exit” buttons, in case the user arrives on a page by accident. Most changes are written to file as soon as a process is finished (e.g., joining groups), so this whole process should be fairly amenable to a non-linear and interrupted work flow.

If groups are joined (as above), the user is again presented the choice to join any of the remaining groups. If you made one too many groups, who is to say there aren’t another one (or two!) excessive divisions in your groupings?

Back to the cyclical nature of this process, if groups are joined, that means the joined group is no longer verified! So we verify it again (this time as one of two groups).

The user is again offered the chance to remove from this group any words or phrases (depending on the frame) which do not belong (anyone want to sing the Sesame Street song?). Any words that are removed from this group in this process are then sorted again:

In the sorting process, if any words are added to the group, it is marked as unverified, and sent to the user to verify again:

If the cards that were pulled earlier are resorted into multiple groups, each is verified again:

So just like in a physical card sorting workshop, word/phrase groupings are sorted, verified, and joined until everyone is happy that each pile is just one thing, and no two piles are the same.

Only in this case, the computer is tracking all of this, so when you’re done, you’re done —no writing down on each card which group it was in, then forgetting to commit that to your database months later.

And this progress is visible in the main window:

The Tone Report

Now, the above is exciting enough for me, given the number of times I’ve run into people that seem to feel that sorting by part of speech and syllable profile, then rigorously controlling and documenting surface forms, is just too much work. But it doesn’t stop there; many of us want to do something with this data, like make a report. So the Tone report option produces a report which groups words (within the ps-profile we’re studying at the moment!) according to their groupings in each tone frame:

group 1
group 2

This is the same thing human linguists do, just on a computer — so it is done faster, and the resulting draft underlying forms are put into the database in that time. But perhaps you’re thinking you could find and sort 145 CVCV Nouns into tone groups based on a single frame, and mark which was which in FLEx, in under 100 seconds, too. And I admit that sorting by one frame just isn’t that impressive —and not what we tonologists do. We sort by how words split up across a number of frames, until new frames no longer produce new groups. So we look at another tone frame (recall the window where you can define these yourself?):

When you verify a group in that frame:

Then in another:

Once that second frame has a couple groups verified, the tone report becomes a little more interesting:

The group provisionally labelled “Nom_CVCV_11” (yes, there are too many groups, as a result of my highly random decisions on group placements) is defined by the combination of group 1 in the “By Itself” frame, and group 1 in the “Plural” frame.

Group Nom_CVCV_12 differs in the “By Itself” grouping, though not in the “Plural” grouping —in this way we handle neutralizing contexts: any word grouping that is distinguished from another group in any frame gets its own (draft underlying tone) group.
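The grouping logic can be sketched as: a word's draft underlying group is just the combination (tuple) of its surface groups across all frames, so a difference in any one frame yields a distinct group. The data and group naming below are toy illustrations of mine, not the tool's:

```python
from collections import defaultdict

# word -> {frame: surface tone group}; toy data with made-up words
sortings = {
    "mabo":  {"By Itself": 1, "Plural": 1},
    "ndiko": {"By Itself": 1, "Plural": 1},
    "sama":  {"By Itself": 2, "Plural": 1},  # differs in just one frame
}

groups = defaultdict(list)
for word, frames in sortings.items():
    # the combination of groupings across frames defines the draft group
    groups[tuple(sorted(frames.items()))].append(word)

for i, (key, words) in enumerate(sorted(groups.items()), start=1):
    print(f"Nom_CVCV_{i}:", dict(key), words)
# Nom_CVCV_1: {'By Itself': 1, 'Plural': 1} ['mabo', 'ndiko']
# Nom_CVCV_2: {'By Itself': 2, 'Plural': 1} ['sama']
```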

Yes, there are of course groupings which don’t indicate underlying forms, such as if one group differed from another by the presence and absence of depressor consonants.

But still, I’m impressed that we can sort 145 CVCV nouns by two different frame groupings, in less than 120 seconds (32 CVC Nouns sort across six tone frames in about 45 seconds). These groupings (e.g., Nom_CVCV_12) are immediately written to LIFT, so could be available in FLEx, to sort on. Because this report costs almost no time (unlike physically sorting cards), this report can be done after each of a number of different sortings, and the progression of group splits can be monitored in more or less real time.

One caveat I would mention. Assuming the words are found in more or less the same order, their names should be more or less stable, unless a new frame creates a real split. But there is no reason that it should be so —neither I nor A→Z+T make any claim that these groupings are anything more than draft underlying form groupings, including the shape of those underlying forms. That is, words marked Nom_CVCV_12 may ultimately be called H, Low, mid, or something else — that is for the analyst to figure out (Yay! more time analyzing, less time collecting, organizing, and archiving data!).

One perhaps less obvious extension of this caveat is that Nom_CVCV_1 has no implicit relationship with Nom_CVC_1, apart from the fact that each was grouped first when its syllable profile was analyzed. So the linguist still needs to figure out which CV draft underlying tone melody groupings correspond to which CVC, CVCV, etc. draft underlying tone melody groupings —but why would we want to cut out the most fun part?


For many people, participatory card sorting is just uninteresting work —either because they question its scientific value, or because of its notoriety for data to not survive the workshop. How many times have you finished a workshop with hours of recordings, wondering when you would have the time to cut, label, organize and archive them? But what if you could make recordings in a way that immediately cropped, tagged, and made them available in your lexical database?

This tool provides a function which goes through the most recent tone report groupings (we’ll be doing that often, right?), and pulls (currently five) examples from each, and presents them one at a time on a page like this:

Below the bold title with word form and gloss are six rows with that word in each of the frames where it has been sorted: the word form is framed, as well as the gloss, so the user can see the entire context of what has been sorted (this info is taken from the LIFT file, where we put it earlier for safekeeping). Next to each line is a record button. The user presses down on the button, talks, and lifts up on the button, resulting in a sound file recording, stored in the repository’s “audio” folder, and a line in the LIFT example field telling us what that file is named (which is something like Nom_CVC_75438977-7cd0-49ac-8cae-ffb88cc606a1_ga_wur_grain_(m),graine(f)(pl).wav, for those interested). And the user now has a play button, to hear back what was recorded, and a redo button, in case it should be done again (chickens, anyone?)

The user can go down the page, until all are done:

Once the user selects “Done,” an example word from the next grouping is presented. Once each group has one word, the user is given a second word from each group. This was designed to allow even a minimal amount of recording to most likely include each of the draft underlying tone groups.
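That ordering is a round-robin over the groups; a sketch (group names and words are made up):

```python
from itertools import chain, zip_longest

def recording_order(groups, per_group=5):
    """Yield up to per_group examples from each group, interleaved so that
    even minimal recording covers every group once before any gets a second."""
    trimmed = [words[:per_group] for words in groups.values()]
    rounds = zip_longest(*trimmed)  # round 1, round 2, ...
    return [w for w in chain.from_iterable(rounds) if w is not None]

groups = {"g1": ["alpha", "beta"], "g2": ["gamma"], "g3": ["delta", "eps"]}
print(recording_order(groups))
# ['alpha', 'gamma', 'delta', 'beta', 'eps']
```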


This has been a brief overview of the user interface of A→Z+T, the dictionary and orthography checker. I have not discussed the inner workings of the tool, apart from the logic relative to the user, but may do that at a later date.

Please let me know (in the comments or by Email) if there is anything about the user interface (e.g., placement, fonts/sizes, wording) that you think might hinder the user experience. I hope to finish translations into French early next week, so I hope to finalize at least this version of text for now soon.

Copying between fields in two writing systems in FLEx

Last month I taught the FLEx II course at CoLang, held at UTA. It was very interesting trying to teach about 45 students, coming from a great variety of backgrounds, but I think we all learned something. There was one thing that I taught which I thought deserved further write-up, so I’ll do that here.
The problem is not immediately obvious, unless you spend lots of time thinking about how FLEx does what it does, in particular how writing systems work. But when you want to copy data from a field that is encoded in a particular writing system, into a field that is encoded in another writing system, you can’t just bulk copy and get the results you might expect. The reason is that the data itself is tagged for which writing system it is in, and not just the field. So you can, theoretically, have Spanish data in a field that is supposed to have English. But this is actually a strength, as it allows you to tag one word or sentence in its correct language, even if it is surrounded by another language, all within the same field (like if you write a note in English, but include in the note words in another language). All data is tracked by writing system, and you don’t lose that information when you copy from one field to another.
So, the task we were working on, which you may need to do some day, was copying data from one writing system, to use as a base for another. For instance, if you have data in a practical working orthography, and you want to also have an IPA field, you may notice (as I hope is true) that much of your orthography transfers directly over to the IPA. And for those things that don’t, there should be regular changes (like substituting [ɸ] for ‘ph’). This task is just begging to be done through bulk editing. Why type all that over again, just to change a few phonemes here and there? Why not make most of the systematic changes systematically? But we can’t just copy a practical orthography field into an IPA field, since the data would still be encoded as the practical orthography, even if the field should contain IPA data. So here’s what you should do.
First off, I assume you have your two writing systems set up; I’m using Mbo and the IPA variant in these screenshots:

Writing systems

Go to the Lexicon pane:

To Lexicon

Then Select Bulk Edit Entries:

To Bulk Edit entries

To be able to operate on both fields, you need to make them both visible. Click on the “Configure which columns to display” button:

To Show Columns

Then click on More Column Choices (unless your IPA field is in the list, in which case you just select it):

To More column choices

In the dialog that comes up, you select the Lexeme form field (or whatever field you’re copying to). Yes, it is already on the right; we want to display the Lexeme form field twice, once in each of two writing systems:

Select Lexeme Form

Click Add:

Click Add

Initially you will probably have the same writing system for each of the two fields:

Both fields in same WS

Change one to the IPA variant:


then I like to move that second field up, so it will display next to the other one. While the field is selected, click on the up arrow:

Move WS field up

Then keep clicking until it is in place:

Move WS field up 2

Now that you have everything situated, click on OK:

Click OK on WS

That should take you back to the bulk edit pane, where you should see your IPA field. Assuming you’re just starting to work in this field, it should be empty:

IPA Empty

Then you Bulk edit, like normal, by selecting the Source Field (the one with data in it):

From LF

and the Target Field:


As always, you want to preview your bulk edit, to make sure it’s doing what you expect:

Preview Bulk Copy

And you should see blue arrows going from nothing to data, which matches the field next to it:

Preview Bulk Copy results

If you don’t like what you see, just click clear, and fix whatever was wrong. But if you like it, click Apply:

Apply Bulk Copy

And then you’ll have both fields filled with the same data (and no more blue arrows):

Apply Bulk Copy results

But if you select data in the IPA field, the indicator above will show that the data is NOT in IPA, but still in the other writing system:

Wrong writing system

So this is the problem we need to fix. To do this, we’re going to Bulk Replace:

Select Bulk Replace

Select the target field (just the one we want to change writing systems on, the IPA one in this case):

Select WS_IPA

Then click on Setup…:

Setup Bulk Replace

This will give you Bulk Replace Setup dialog:

Bulk Replace Setup dialog

Here you can click in the “Find what:” box, then (clicking “more” if you have to) select Format/Writing System/xyz –whatever writing system you copied your data from:

Select From WS

My experience is that at this point, FLEx will figure out what you’re trying to do, and set the other field for you. You can verify this by seeing “Format: <Writing system name>” under each field:

Bulk Replace Setup w Langs

You don’t need to add anything to the empty fields in this box; you want to find everything. So you can just click OK. Unfortunately, I don’t see anything when I click “Preview” here, so we just trust that we’ve set it up correctly (you backed up your data before starting this, right? If not, stop and do it now.), and hit Apply.
Back in the bulk replace field, we can verify that the data in the IPA field is now indicated as IPA:


While the orthographic field is still in the orthographic writing system:


At this point, you can go through your IPA field and convert orthographic letters to IPA equivalents, either systematically through bulk replace (if appropriate) or manually. Then you can enjoy your dictionary database with both orthography and IPA in your entries!
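For what it's worth, the systematic part of that conversion is the same kind of ordered search-and-replace sketched below. Only the ‘ph’ → [ɸ] change comes from the example above; the other mappings are purely illustrative:

```python
# Order matters: replace multigraphs before any single letters they contain.
ORTH_TO_IPA = [
    ("ph", "ɸ"),  # from the example above
    ("ny", "ɲ"),  # illustrative
    ("ng", "ŋ"),  # illustrative
]

def to_ipa_draft(orth):
    """Apply the regular orthography-to-IPA changes; anything not listed
    transfers over unchanged, to be hand-corrected afterward."""
    out = orth
    for src, tgt in ORTH_TO_IPA:
        out = out.replace(src, tgt)
    return out

print(to_ipa_draft("phanga"))  # ɸaŋa
```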

WeSay and BALSA: Thanks!

I just finished a trip to Bunia and Nia-Nia, DRC, where I helped the Ndaka [ndk] and Mbo [zmw] communities develop draft alphabet charts and transition primers, the material for each language including all nine vowels (with ATR harmony) and the egressive/implosive stops. The Mbo version also includes the [p]/[ɸ] contrast, as /p/ is normally [ɸ] there. Each booklet includes a short story in the new draft orthography.

I’ve written before about using WeSay to collect language data in highly illiterate language communities, which was a major part of this work.  And since I don’t want to do IT work full time (or rather I have other things to do with my time), I’m using WeSay in BALSA. So since much of this work would not have been possible without the work of many people, especially those working on BALSA and WeSay, I wanted to take a minute to thank them. Without a budget to do so materially, I’ll do that through describing the work here, and explicitly saying that if you work on WeSay and/or BALSA, please feel free to take and use this story and/or pictures in your own publications; this is your story, too.

We met for this workshop in Nia-Nia, DRC, about an hour into the rainforest (by MAF) from Bunia, which is about an hour (again by MAF) from Entebbe, Uganda. We met in a church, with the Ndaka covering most of the workshop logistics, since this is their home turf.  The Mbo are also a bit on the run these days from a militia conflict that hasn’t seemed to end yet. And they’re a smaller and more illiterate people group. So much so that one of the guys on the Mbo team didn’t participate in a dictation exercise, as we were practicing new letters. And yet, here they are, around a BALSA machine, using WeSay:
Admittedly, the guy touching the computer isn’t Mbo, but he’s helping them deal with the interface, and they’re choosing pictures through the Art of Reading interface in WeSay, which would seem to be even more popular with less literate communities.

This one is of the Ndaka team, using WeSay in BALSA independently:
They’re also picking images to go with dictionary entries we put there earlier, though some people started modifying entries by the end of the workshop. Lest this seem trite, let me point out that this was the first time that ANY of them had used a computer of any kind. For some, it was the first time they had seen one. You can see in the background the current stabilizer, which is plugged into the generator we used to have electricity. Without the stabilizer, I wouldn’t plug in a computer, because of the risk of unstable current. After the stabilizer, we put a fridge guard. When I can, I put a second stabilizer in series, to even out whatever current irregularities the first one doesn’t catch. Which is all to say that this is not the most computer friendly place, even apart from the alternating dust and humidity, and the heat. But these guys took to the tasks, and were able to work somewhat independently on computers for the first time. Having tried this in similar contexts with other software, I attribute this success entirely to WeSay and BALSA. Thanks, guys, for making that possible.
Here is the same team from another angle:

And here is one with other people hanging around, showing that this is truly a community affair:

So this workshop was a success in part because people who had never used computers before (including the elementary school principal, shown in the background of this last pic), were able to get up and running in very little time, with very few frustrations.  They even enjoyed the work so much, I had to kick them out several evenings, after it was already too dark to walk home. So thanks to everyone involved, for your part in making this happen.

Creating Tone fields in Fieldworks 7.0.6~beta7 (not useful for WeSay)

Creating Tone Fields by the Method Native to FLEx –The Better Way

(N.B.: this entry started with FW7.05~b5 and WS0.9.28, though I’m finishing it on FW7.06~b7 and WS1.1.11. Some of the screenshots may look different between these versions, but I haven’t noticed any difference in functionality with regard to these fields.)
After creating custom fields in this way for tone and plural forms, I found that tone fields are already accounted for in FLEx, though not particularly transparently. There is a set of pronunciation fields, which can be inserted here:

This option puts the set of pronunciation fields in the record you’re editing, not the whole database. It gives tone, as well as a couple other fields. It looks like this in FLEx:

What’s nice about this is that you can do this a number of times, for the same entry. This gives you the chance to have a number of pronunciations, in different contexts –which is important in phonology, especially with regard to tone. The “Location” field is an empty, customizable field, so I presume we could put things like “Before a High Tone” or “phrase finally” or whatever there, then know that that pronunciation is valid for that context. Filling in some bogus data, we see the following in FLEx:

Under the Hood

The above results in the following in the appropriate entry of the LIFT file:

<pronunciation>
<form lang="gey"><text>ba</text></form>
<field type="cv-pattern"><form lang="en"><text>CV</text></form></field>
<field type="tone"><form lang="en"><text>?H</text></form></field>
</pronunciation>
<pronunciation>
<form lang="gey"><text>bad</text></form>
<field type="cv-pattern"><form lang="en"><text>CVC</text></form></field>
<field type="tone"><form lang="en"><text>?HF</text></form></field>
</pronunciation>

So each pronunciation has a form/text set of nodes, and fields with type attributes for each of the visible fields with data in FLEx. Note that these fields are formatted exactly the same as the fields we created earlier here and here, that is

<field type="NameofFieldinFLEx">
<form lang="LanguageCode">
<text>Field Contents</text>
</form>
</field>

The only difference here is that the fields are under a <pronunciation> node, and not directly under the entry itself. But the fact that these fields are grouped together under repeatable pronunciation nodes should mean that we can organize contextually dependent pronunciation (tone or segmental) fields.
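To make that nesting concrete, here is a minimal sketch (my own illustration, not part of the FLEx or LIFT tooling, which does this via XSLT) that builds one such pronunciation node with Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

def make_pronunciation(form_text, lang, fields):
    """Build a LIFT-style <pronunciation> node.

    fields: dict mapping field type to English text, e.g. {'tone': '?H'}.
    """
    pron = ET.Element('pronunciation')
    form = ET.SubElement(pron, 'form', lang=lang)
    ET.SubElement(form, 'text').text = form_text
    # Each visible FLEx field becomes a <field type="..."> sibling of <form>.
    for ftype, value in fields.items():
        field = ET.SubElement(pron, 'field', type=ftype)
        f = ET.SubElement(field, 'form', lang='en')
        ET.SubElement(f, 'text').text = value
    return pron

pron = make_pronunciation('ba', 'gey', {'cv-pattern': 'CV', 'tone': '?H'})
print(ET.tostring(pron, encoding='unicode'))
```

Because the whole group is one repeatable node, an entry can carry any number of these, one per context.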

Sorting on Pronunciation Fields

I tried sorting on individual pronunciation nodes in FLEx, but wasn’t immediately impressed. I tried sorting the above fields for those with CVC in the cv-pattern, and this is what I got:

One can see that the entry is filtered, not the set of pronunciation fields. When working with Toolbox, it was possible to filter on any one instance of a repeated field within an entry. Recalling that this worked only when sorting on that field (thereby producing a record for each of the multiple fields), I tried that in FLEx, and it worked:

Note that there is only one pronunciation field listed, and the pronunciation form and tone fields listed are those that correspond to the CV field that was selected in the filter.
This data structure would also allow one to select only particular tone patterns, such as with an XPath expression like pronunciation[field[@type='cv-pattern']/form/text = 'CVC']/field[@type='tone']/form/text to get the information in the tone field under only those pronunciation nodes that also have cv-pattern fields with 'CVC' in them.
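To illustrate that selection concretely, here is a minimal sketch of the same query using Python's standard-library ElementTree (my own illustration, not part of the FLEx or LIFT tooling; ElementTree supports only a subset of XPath, so the nested predicate is done in plain Python, and the LIFT snippet is made up for the example):

```python
import xml.etree.ElementTree as ET

# A made-up entry fragment with two <pronunciation> nodes, mirroring
# the structure shown above.
LIFT = """
<entry>
  <pronunciation>
    <form lang="gey"><text>ba</text></form>
    <field type="cv-pattern"><form lang="en"><text>CV</text></form></field>
    <field type="tone"><form lang="en"><text>?H</text></form></field>
  </pronunciation>
  <pronunciation>
    <form lang="gey"><text>bad</text></form>
    <field type="cv-pattern"><form lang="en"><text>CVC</text></form></field>
    <field type="tone"><form lang="en"><text>?HF</text></form></field>
  </pronunciation>
</entry>
"""

def tones_for_pattern(entry, pattern):
    """Return tone-field texts from pronunciations whose cv-pattern matches."""
    tones = []
    for pron in entry.findall('pronunciation'):
        # ElementTree can't nest predicates, so test the cv-pattern by hand.
        cv = pron.find("field[@type='cv-pattern']/form/text")
        if cv is not None and cv.text == pattern:
            tone = pron.find("field[@type='tone']/form/text")
            if tone is not None:
                tones.append(tone.text)
    return tones

entry = ET.fromstring(LIFT)
print(tones_for_pattern(entry, 'CVC'))  # ['?HF']
```

The same walk-and-test pattern would work for any of the sibling fields (location, cv-pattern, tone) under a pronunciation node.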
Unfortunately, I haven’t been able to see these fields in WeSay (yet, I hope: see this bug report), which is sad, because this is otherwise the best way to indicate tone in FLEx.

===Poetic Interlude===
I wrote most of the above several months ago, and had forgotten that I had worked this much out, until I ran into the problem of bulk editing these fields. A quick email to <Flex_Errors at> and a fairly rapid response later, I was back in business. When I went to write it up, I found the above in my drafts folder…
===End of Interlude===

So I’ve been doing a lot of data collection in the last couple of months using the above paradigm, keeping different tone fields separate by their sibling location fields. I have XSL transforms to add this data to a LIFT file, and some reports to pull it out later, but how do I mess with it in the meantime, should I need to? To get bulk editing on these fields to work, I needed two things:

  1. to sort on ‘pronunciation’ or one of its children (this I had apparently already figured out, but forgotten)
  2. to select the right columns for viewing in the bulk edit view.

Selecting the right columns for viewing in the bulk edit view

In case it isn’t obvious, the visible columns in the bulk edit view determine which fields you can act on. If “Lexeme” isn’t visible, you can’t copy to or from it, or modify it with a regular expression. So first you need to make the fields you’re looking for visible, which is done through a dialog accessed by clicking the button in the upper right corner (tooltip: “Configure which columns to display”):

When you click on this, you get a menu of a number of (recently selected?) fields. To access other fields, to change column ordering, or to select language options, select “More column choices…” at the bottom:

This gives you access to the following dialog, where you can find fields not on the above list and select which of a number of writing systems you want to see (and therefore bulk edit). The arrows on the right allow you to move fields up and down (moving columns left and right on the Bulk Edit screen):

One trick that may not be obvious is that the ‘Tone’ field under ‘Pronunciation’ is available here as ‘Tones’. I presume this is because there are potentially a number of different Tone fields (as in my case). This is the same for ‘Location’ > ‘Locations’ and ‘CV Pattern’ > ‘CV Patterns’.

Sorting on Pronunciation Columns

Once all the fields you’re interested in are in the “current columns” (right) side of that dialog, you can select a column to sort on (indicated by a light blue triangle). Selecting ‘Pronunciations’ gives three lines for this entry, and proclaims “Pronunciation” at the top of the page for slower ones like me.

If you’re in a context where you want to sort on two of these fields (if one doesn’t sort them uniquely, as in the screenshot above), you can select one, then shift-select another, which gives a secondary sort (and a smaller triangle), as in the following:

Here the location is the first sort, then the tone. Note that the pronunciation form isn’t sorted (a…z…k…a), though the duplicate HAfter-sg field for titi is (correctly) showing up as another pronunciation/tone field (with pronunciation/form atíti nɛ) –showing that sorting by any of the pronunciation fields gives this layout.

Bulk Editing Pronunciation Fields

Getting back to the point of it all (for me, anyway), with this configuration it is now possible to bulk copy to/from these fields:

Locations didn’t show up for me under “Bulk Replace”; I’m not sure why, though that sounds familiar –perhaps I didn’t configure it right, or maybe that’s a bug.


Though tone fields created under pronunciation nodes are not currently helpful for WeSay collaboration, this seems a much more principled way of treating tone data in FLEx, since it natively allows for varied contexts, CV patterns, and segmental morphophonemics impacting the frame (each pronunciation node has a form field, which can include the lexeme, the frame, and any segmental interactions between them). In addition, these fields are accessible to FLEx filtering and sorting, including bulk edit operations.
Given the complexity of this configuration, I would not recommend what I have described to the computer non-savvy (e.g., users more comfortable in WeSay). But for those comfortable manipulating these configurations, FLEx can be a powerful tool for manipulating tone data.

Round-tripping LIFT data through XLingPaper


The LIFT specification allows for interchange between the lexical databases we use, such as those in FLEx and WeSay. As an XML specification, it is also subject to XSL transformation, and can be converted to XML documents that conform to other specifications, such as XLingPaper, an XML specification for writing linguistics papers. I described before a means to get data out of FLEx into XLingPaper, but that required a script generating regular expressions which were then put into a FLEx filter by hand (metaphorically speaking). Computers should be able to automate this, so (following my “If computers can do a particular task, they should” motto) I developed a script to take that regular expression generator and feed those expressions to an XSL stylesheet, producing XLingPaper XML from the LIFT XML automatically.
The other half of the rationale is that I hate exporting data from a database to a paper or report, seeing an error, and not being able to fix it in just one place. Either I fix it in both the paper and the database, or else in the database and then re-export to the paper. So a way to get data from LIFT to XLingPaper and back seemed helpful for drafting linguistics papers, even for someone not dealing with the volume of reports I’m looking at generating.


One major caveat for this work is that these tools (FLEx, WeSay, and XLingPaper) are in active development, so functionality may vary over time. The tests in this post were run with the following:

  1. FLEx (for Linux)
  2. WeSay 1.1.9 (for Linux) –This doesn’t enter directly into these tests, but the LIFT files used often sync back and forth between these two programs.
  3. xsltproc from a standard Ubuntu Linux install (i.e., compiled against libxml 20706, libxslt 10126 and libexslt 815)
  4. GNU bash, also from standard Ubuntu Linux (i.e., version 4.1.5)
  5. GNU diffutils, also from standard Ubuntu Linux (i.e., version 2.8.1)
  6. XMLMind XML Editor (XXE), version 5.1.0
  7. XLingPaper, version 2.18.0_3

All of these tools are free (or have a free version) and available online from their respective sources, and most are open source.
The scripts I’ve written (to generate reports and call the XSL transforms) are not yet publicly available; I hope to have them cleaned up and more broadly tested before long.

Test Goals

I want to see if I can

  1. Get data from LIFT to XLingPaper format,
  2. Modify the XLingPaper document in XXE (which keeps it in conformity to the XLingPaper DTD),
  3. Get it back into LIFT and imported to FLEx,
  4. Show that the FLEx import made all and only the changes made by modifying the XLingPaper document (i.e., no other data loss)

To do this I will use the output of diff between two versions of the XLingPaper document (original and modified), and another diff between two versions of the LIFT file (as originally exported, and as exported after the import). To achieve #4, I will show that the two diffs contain all and only the same changes to data entries (i.e., the modifications to the XLingPaper doc are the same as the changes to the FLEx database, as evidenced by its export to LIFT). For reference, this LIFT file has 2033 entries and takes up almost 2MB as plain text, so we’re not talking about a trivial amount of data.
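The comparison in #4 can also be done mechanically: collect the changed lines from each diff and, since all my edits are marked with “MOD”, compare the sets of MOD-marked tokens. Here is a minimal sketch of that bookkeeping (the helper names `change_pairs` and `mod_tokens` are my own inventions, and I assume default `diff` output format, where removed lines start with `< ` and added lines with `> `):

```python
import re

def change_pairs(diff_text):
    """Return (old_lines, new_lines) from default-format `diff` output."""
    old = [l[2:] for l in diff_text.splitlines() if l.startswith('< ')]
    new = [l[2:] for l in diff_text.splitlines() if l.startswith('> ')]
    return old, new

def mod_tokens(diff_text):
    """Collect the MOD-marked tokens from the added (>) lines of a diff,
    ignoring surrounding markup, so diffs over different XML vocabularies
    (XLingPaper vs. LIFT) can be compared by content."""
    added = ' '.join(l for l in diff_text.splitlines() if l.startswith('> '))
    return set(re.findall(r"[^\s<>'\"|]*MOD[^\s<>'\"|]*", added))

# A made-up one-hunk diff for illustration.
sample = """3c3
< <text>paka</text>
---
> <text>pakaMOD</text>"""

print(change_pairs(sample))
print(mod_tokens(sample))  # {'pakaMOD'}
```

Equality of `mod_tokens(diff1)` and `mod_tokens(diff2)` (minus the title/author changes, which are not expected to round-trip) would then approximate “all and only the same changes”.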

Test procedure

  1. Back up the WeSay folder (this is real [gey] data I’m working with, after all…)
  2. Export “Full Lexicon” from FLEx, and copy it to gey.ori.lift
  3. Run report (vowel inventory) on exported gey.lift (This creates Report_VowelInventory.gey.xml)
  4. Open created report in XXE
  5. Save (because XXE changes the formatting –saving both files from XXE helps diff see real changes, not ones irrelevant to the XML)
  6. Save as Report_VowelInventory.gey.mod.xml, and modify one example of each field we’re interested in, including @root (at this point both files have been saved by XXE, for easier comparison).
  7. Run `diff Report_VowelInventory.gey.{,mod.}xml` (results below)
  8. Run `xlp-extract2lift Report_VowelInventory.gey.mod.xml .` (This creates Report_VowelInventory.gey.mod.compiledfromXLP.lift)
  9. Back up the FLEx project (just in case, as there’s real data here, too)
  10. Import Report_VowelInventory.gey.mod.compiledfromXLP.lift to FLEx project, selecting “import the conflicting data and overwrite the current data (importing data overrules my work).” and unticking “Trust entry modification times” (This is important because if that box is selected entries won’t import unless you have also changed the ‘dateModified’ attribute on an entry –which I generally don’t).
  11. Export again, producing a second LIFT file exported by FLEx (one before, and one after the import)
  12. Run `diff gey{,.ori}.lift`
  13. Compare diffs to see fidelity of the process.

Test results

Here is the diff showing the changes between the original report and the modifications:

$ diff Report_VowelInventory.gey.{,mod.}xml
< >Rapport de l’Inventaire des Voyelles de [gey]</title

> >Rapport de l’Inventaire des Voyelles de [gey]MOD</title
< >Kent Rasmussen</author

> >Kent RasmussenMOD</author
< >Voyelles</secTitle

> >VoyellesMOD</secTitle
< >mbata</langData

> >mbataMOD</langData
< >pl: mabata</langData

> >pl: mabataMOD</langData
< >fissure, fente</gloss

> >fissure, fenteMOD</gloss
< >mke / wake</gloss

> >mke / wakeMOD</gloss
< externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono

/ mikono'"

> externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono

/ mikonoMOD'"
< externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orange'"

> externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orangeMOD'"
< externalID="root='paka'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"

> externalID="root='pakaMOD'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"

As you can see from this diff output, I changed data in a number of different types of fields: the report title, author, secTitle, langData (from citation), langData (from plural), and glosses in each of French and Swahili. The last three changes are to root and definitions, which are not visible in the printed report but are stored in an externalID attribute (recently added to XLingPaper to be able to store this kind of information without having to put it elsewhere in the structure of the doc).

And here is the diff showing the changes between the original LIFT export and the one exported after importing the LIFT file with modifications:

$ diff gey{,.ori}.lift
< <form lang="swh"><text>mkono / mikonoMOD</text></form>

> <form lang="swh"><text>mkono / mikono</text></form>
< <gloss lang="swh"><text>mke / wakeMOD</text></gloss>

> <gloss lang="swh"><text>mke / wake</text></gloss>
< <form lang="gey"><text>pakaMOD</text></form>

> <form lang="gey"><text>paka</text></form>
< <form lang="gey"><text>mbataMOD</text></form>

> <form lang="gey"><text>mbata</text></form>
< <field type="Plural"><form lang="gey"><text>mabataMOD</text></form>

> <field type="Plural"><form lang="gey"><text>mabata</text></form>
< <form lang="fr"><text>orangeMOD</text></form>

> <form lang="fr"><text>orange</text></form>
< <gloss lang="fr"><text>fissure, fenteMOD</text></gloss>

> <gloss lang="fr"><text>fissure, fente</text></gloss>


  1. The first several MODs to the paper (to titles, etc.) are not in the second diff, since only example data is extracted into the LIFT file to import (this is what we want, right?).
  2. The other mods –root, citation, plural, gloss-swahili, gloss-french, definition-french and definition-swahili– all survived.
  3. No other changes existed between the exported LIFT files.


Because FLEx exported essentially the same LIFT file (of 2033 entries and almost 2MB, remember), with all and only the changes made in XXE, I presume that there were no destructive changes to the underlying FLEx database, and this procedure is safe for further testing. I did not go so far as to diff the underlying fwdata file, as I probably wouldn’t understand its format anyway, and I wouldn’t know how to distinguish between differences in formatting and content (while it is also XML, I don’t understand its specification or how it is used in the program –which is not a bad thing).
Speaking of what I don’t know, I should be clear that my formal training is in Linguistics (M.A. Oregon 2002), not in IT. I’m doing this because there is a massive amount of linguistic data to collect, organize, analyze and verify, and I want to do that efficiently (the fact that this is fun is just a nice byproduct). In any case, I have certainly not followed best practices in my bash or XSL scripting. So if you read the attachments and think “this guy doesn’t know how to code efficiently or elegantly,” then we’re already in agreement on that. And you’d also be welcome to contribute on improvements. 🙂


I wouldn’t have gotten anywhere on this project without the work of many others, particularly those who are giving of their own time and resources (which surely could have been spent elsewhere) on FLEx, WeSay, and the LIFT specification itself. Of particular note is Andy Black, who encouraged me to take another stab at XSLT (after I told him I’d tried and given up a few years ago), and who has provided invaluable and innumerable help, both in the development of the XLingPaper specification and in particular issues related to these transforms. Most of what is good here has roots in his work, though I hope no one holds him responsible for my errors and inelegance.