Writing Presentations in XLingPaper

I recently finished a series of presentations, each of which required me to prepare notes and a handout and/or slides. I worked out a workflow that streamlined things considerably, so I thought it worth documenting here.

Why do it

First of all, I’m sold on the use of XML standards like XLingPaper, an XML standard for writing linguistic documents. One of the main reasons is that it just makes sense, if you’re re-purposing a document, not to start from scratch each time (Simons and Black 2009; Black 2009). For instance, the three presentations I gave last month were all on the same language, and a bit of the content was common to all, even if the focus (and audience) was radically different for each. So rather than starting each from scratch, I modified one presentation to make the next, and used copy and paste liberally. I think most of us would do this, because it just makes economical sense. But what about applying the same principle across notes, slides, and a handout for the same presentation?
Each of my presentations required me to have notes to speak from, which were related to the slides I used, and the handout (for the academic talks). Why should I make three different documents in three different programs, when I can accomplish all three in one? Especially if I can use the same source document for all three, so I don’t have to update three separate documents when I need to make a change?

How to do it: Content Types

I accomplished this by writing my presentation in XLingPaper, through the XMLMind XML Editor (XXE), the recommended editor for XLingPaper. The basic mechanism for organizing the different outputs is Content Control, which is described in the XLingPaper documentation. Rather than repeat that information here, I’ll just describe how I used the system.
The circled section of this screenshot shows my content types (under content control, after the back matter):
[Screenshot: ContentControlTypes]

You can see that I have a content type for each of handout, slides, and paper/notes, as well as one for each pairwise combination (and a few others). This allows me to quickly mark a portion of the paper as belonging to just one of the three products, or to two of them but not the third (e.g., if I want something in the handout and my notes, but not on a slide, or in my notes and the slides, but not on the handout). Once this is set up, as I’m editing my paper I can mark a paragraph, list, section, etc. as belonging to the appropriate output, and know it will only show there.

So if I have a full table of data to put in the handout, but I don’t want all that data on a slide, I can make two different copies of that table, labeling one for the handout and paper, and the other for the slides. As I make updates to the table, I can easily cut and paste between the copies, since they’re adjacent in the source document. I haven’t so far had much trouble keeping track of which copy goes with which product (this may be a bit of an issue, since it isn’t obvious in the UI), but one could easily put a comment (which is highlighted in the UI, but doesn’t show in any output) in one or both copies as a reminder of which is which.
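To make the mechanism concrete, here is a rough sketch of the idea in XLingPaper-like XML. To be clear, the element and attribute names below are illustrative only (the real ones are defined in the Content Control section of the XLingPaper documentation); this shows the shape of the system, not its exact syntax:

<!-- Hypothetical names, for illustration only: -->
<contentTypes>
  <contentType id="ctPaper"/>
  <contentType id="ctHandout"/>
  <contentType id="ctSlides"/>
  <contentType id="ctSlidesOrHandout"/> <!-- slides and handout, but not paper -->
  <contentType id="ctExtra"/>           <!-- excluded from every output -->
</contentTypes>

<!-- Tagging a paragraph so it appears in the handout only: -->
<p contentType="ctHandout">This full table of data appears in the handout only.</p>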

How to do it: Content Control Choices

Once I have my content (at least provisionally) set up as belonging to one output or another (or not marked, if I want it in all outputs), it makes sense to test it out by selecting an output, here:
[Screenshot: ContentControlChoices]
I’ve set up these content control choices to give me quick access to what I want showing in each of the slides, handout, and paper outputs. For the selection above, you see the following exclusions:
[Screenshot: ContentControlExclusions]
That is, when “Publish Paper (Monolingual)” is selected, it excludes pieces of the document labeled as extra, handout, slides, or SlidesorHandout:
[Screenshot: ContentControlPaper]
Alternatively, when “Publish Slides (Monolingual)” is selected, it excludes pieces of the document labeled as extra, handout, paper, or PaperorHandout:
[Screenshot: ContentControlSlides]
And finally, when “Publish Handout (Monolingual)” is selected, it excludes pieces of the document labeled as extra, paper, slides, or PaperorSlides:
[Screenshot: ContentControlHandouts]
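In the same illustrative sketch form as above, each content control choice is just a named bundle of exclusions, so selecting a choice hides everything tagged for the other outputs (again, the real element names are in the XLingPaper documentation):

<!-- Hypothetical names, mirroring the “Publish Handout (Monolingual)” choice above: -->
<contentControlChoice label="Publish Handout (Monolingual)">
  <excludeContentType ref="ctExtra"/>
  <excludeContentType ref="ctPaper"/>
  <excludeContentType ref="ctSlides"/>
  <excludeContentType ref="ctPaperOrSlides"/>
</contentControlChoice>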
To be clear, this is just the way I’ve set up my documents, for my own convenience. I decided initially to have a ctExtra type, though I don’t really use it much. Additionally, I haven’t done many bilingual or student/teacher versioned documents since I started using content control (though there were some times when it would have been handy). Other categories may be relevant for you, both for tagging sections of your document, and for selecting them in content control choices.

Setting up a template

Regarding this setup, I wouldn’t want to organize all this for each paper I write, which is why I’ve made these options part of my modular paper template. So just as I have the same (ever longer) references section referenced in each paper, I now have the same content options available in each paper (which costs me nothing if I don’t use them). But this was a bit tricky to set up, since we want the options to be universal across documents, but not the selection itself. So as you can see in the screenshots above, the Content Control section has a white background, since it is part of the document (I want my content control choice to affect a particular document), but the content types node has a colored background, since it is in a referenced document (I want a change to that section to affect all documents). This does mean that if I develop new content control choices, I need to copy them into old documents where I want to use them, but I think that’s a worthwhile (and small) cost. You might see this as an unprincipled decision (and maybe it is); you’re more than welcome to organize your docs in a way that makes sense to you.

Outputting slides

One last point about this setup deals with the output formatting. My notes and the handout are both output to US letter, but you can set whatever you need in your stylesheet (e.g., A4), and assuming it’s the same for both, you can use the same stylesheet for each. But what about slides? I’ve tried showing letter-sized PDFs in a slideshow before, and it is ugly. So what I did this time was set up a new stylesheet that gives me 4″×6″ pages:
[Screenshot: SlidesStylesheet]
You may find a better solution for your material, but the important part is to get the aspect ratio right (assuming you know your projector’s aspect ratio). Additionally, I found this size works nicely in terms of limiting how much information fits on a slide. If you did 8″×12″, you’d have much more information on each page, which you probably don’t want for your slides (or maybe you do). In any case, there will probably be some fiddling to find the right mix of size and aspect ratio to get the right content on your slides. I also found that images sized smaller in my paper came out just right on the slides, but full-page-width images spilled over the edges of the slides. Again, figure out what works for you; I could see using two differently sized images with different content types, but I just made the images smaller and used the same ones for every output.
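For what it’s worth, the underlying change is just page geometry. XLingPaper’s PDF output can go through XSL-FO, and in XSL-FO terms a 4″×6″ landscape slide page amounts to the following (the XLingPaper publisher stylesheet exposes this through its own elements, so take this as the concept rather than what you’d literally type):

<fo:simple-page-master xmlns:fo="http://www.w3.org/1999/XSL/Format"
    master-name="slide" page-width="6in" page-height="4in">
  <!-- a small margin leaves most of the 3:2 page for slide content -->
  <fo:region-body margin="0.25in"/>
</fo:simple-page-master>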

Setting Stylesheets

One final bit of trouble is that there is currently no method for associating a stylesheet with a content type. That is, if you set XXE to output a handout but have the slides stylesheet applied, you’ll get really big words on oddly shaped paper for your handout PDF. Alternatively, if you set it to output slides but don’t have the slides stylesheet attached, you’ll get your slides material in a letter-sized PDF. So that’s just something to watch out for. But the good thing is that once you set up your source document, making changes like which output to produce and which stylesheet to apply is rather simple.

Conclusion

Overall, I was very happy with this application of the XLingPaper XML specification: it reduced duplication of my work, kept multiple versions of a single document in one place, and cut down the number of files I was managing, while still producing each of the three files I needed for each presentation. Just another reminder of why I’m editing my documents in XML (Black 2009).

Round-tripping LIFT data through XLingpaper

Rationale

The LIFT specification allows for interchange between the lexical databases we use, such as FLEx and WeSay. As an XML specification, it is also subject to XSL transformation, and can be converted to XML documents that conform to other specifications, such as XLingPaper, an XML specification for writing linguistics papers. I described before a means of getting data out of FLEx into XLingPaper, but that required a script generating regular expressions which were then put into a FLEx filter by hand (metaphorically speaking). Computers should be able to automate this, so (following my “if computers can do a particular task, they should” motto) I developed a script to take that regular-expression generator and feed its expressions to an XSL stylesheet, producing XLingPaper XML from the LIFT XML automatically.
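To give a feel for the moving parts, here is a minimal sketch of that LIFT-to-XLingPaper step. The stylesheet name, parameter name, and output structure here are simplifications of mine (the real transform does much more), but the core trick is that xsltproc’s EXSLT regular-expressions extension lets the generated expressions select entries straight out of the LIFT XML:

<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical lift2xlp.xsl; run as, e.g.:
     xsltproc --stringparam root-regex '^bu' lift2xlp.xsl gey.lift -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:regexp="http://exslt.org/regular-expressions"
    extension-element-prefixes="regexp">
  <!-- one generated expression is passed in as a parameter -->
  <xsl:param name="root-regex" select="'^bu'"/>
  <xsl:template match="/lift">
    <example num="generated">
      <!-- keep only entries whose lexeme form matches the expression -->
      <xsl:for-each select="entry[regexp:test(lexical-unit/form/text, $root-regex)]">
        <listWord letter="{@id}">
          <langData lang="gey">
            <xsl:value-of select="lexical-unit/form/text"/>
          </langData>
          <gloss lang="fr">
            <xsl:value-of select="sense/gloss[@lang='fr']/text"/>
          </gloss>
        </listWord>
      </xsl:for-each>
    </example>
  </xsl:template>
</xsl:stylesheet>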
The other half of the rationale is that I hate exporting data from a database to a paper or report, seeing an error, and not being able to fix it just once: either I fix it in both the paper and the database, or else I fix it in the database and re-export to the paper. So a way to get data from LIFT to XLingPaper and back seemed helpful for drafting linguistics papers, even if one isn’t dealing with the volume of reports I’m looking at generating.

Tools

One major caveat for this work is that these tools (FLEx, WeSay, and XLingpaper) are in active development, so functionality may vary over time. The tests in this post were run with the following:

  1. FLEx 7.0.6.40863 (for Linux)
  2. WeSay 1.1.9 (for Linux). This doesn’t enter directly into these tests, but the LIFT files used often sync back and forth between these two programs.
  3. xsltproc from a standard Ubuntu Linux install (i.e., compiled against libxml 20706, libxslt 10126 and libexslt 815)
  4. GNU bash, also from standard Ubuntu Linux (i.e., version 4.1.5)
  5. GNU diffutils, also from standard Ubuntu Linux (i.e., version 2.8.1)
  6. XMLMind XML Editor, version 5.1.0
  7. XLingPaper, version 2.18.0_3

All of these tools are free (or have a free version) and available online from their respective sources, and most are open source.
The scripts I’ve written (to generate reports and call the XSL transforms) are not yet publicly available; I hope to have them cleaned up and more broadly tested before long.

Test Goals

I want to see if I can

  1. Get data from LIFT to XLingPaper format,
  2. Modify the XLingPaper document in XXE (which keeps it in conformity to the XLingPaper DTD),
  3. Get it back into LIFT and import it into FLEx,
  4. Show that the FLEx import made all and only the changes made by modifying the XLingPaper document (i.e., no other data loss)

To do this I will use the output of diff between two versions of the XLingPaper document (original and modified), and another diff between two versions of the LIFT file (as originally exported, and as exported after the import). To achieve #4, I will show that the two diffs contain all and only the same changes to data entries (i.e., the modifications to the XLingPaper doc match the changes to the FLEx database, as evidenced by its export to LIFT). For reference, this LIFT file has 2033 entries and takes up almost 2MB as plain text, so we’re not talking about a trivial amount of data.

Test procedure

  1. Back up the WeSay folder (this is real [gey] data I’m working with, after all…)
  2. Export “Full Lexicon” from FLEx, and copy it to gey.ori.lift
  3. Run report (vowel inventory) on exported gey.lift (This creates Report_VowelInventory.gey.xml)
  4. Open created report in XXE
  5. Modify and save (XXE normalizes the formatting when it saves; this helps diff see real changes, not differences irrelevant to the XML)
  6. Save as Report_VowelInventory.gey.mod.xml, and modify one example of each field we’re interested in, including @root (at this point both files have been saved by XXE, for easier comparison).
  7. Run `diff Report_VowelInventory.gey.{,mod.}xml` (results below)
  8. Run `xlp-extract2lift Report_VowelInventory.gey.mod.xml .` (This creates Report_VowelInventory.gey.mod.compiledfromXLP.lift)
  9. Backup FLEx project (just in case, as there’s real data here, too)
  10. Import Report_VowelInventory.gey.mod.compiledfromXLP.lift into the FLEx project, selecting “import the conflicting data and overwrite the current data (importing data overrules my work)” and unticking “Trust entry modification times” (this is important because if that box is ticked, entries won’t import unless you have also changed the ‘dateModified’ attribute on an entry, which I generally don’t).
  11. Export again, producing a second LIFT file exported by FLEx (one before, and one after the import)
  12. Run `diff gey{,.ori}.lift`
  13. Compare diffs to see fidelity of the process.

Test results

Here is the diff showing the changes between the original report and the modifications:

$ diff Report_VowelInventory.gey.{,mod.}xml
11c11
< >Rapport de l’Inventaire des Voyelles de [gey]</title
---
> >Rapport de l’Inventaire des Voyelles de [gey]MOD</title
23c23
< >Kent Rasmussen</author
---
> >Kent RasmussenMOD</author
42c42
< >Voyelles</secTitle
---
> >VoyellesMOD</secTitle
65c65
< >mbata</langData
---
> >mbataMOD</langData
89c89
< >pl: mabata</langData
---
> >pl: mabataMOD</langData
113c113
< >fissure, fente</gloss
---
> >fissure, fenteMOD</gloss
137c137
< >mke / wake</gloss
---
> >mke / wakeMOD</gloss
155c155
< externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono / mikono'"
---
> externalID="ps='Noun'|senseid='hand_0d9c81ef-b052-4f61-bc6a-02840db4a49e'|senseorder=''|definition-swh='mkono / mikonoMOD'"
171c171
< externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orange'"
---
> externalID="ps='Noun'|senseid='orange_2924ca57-f722-44e1-b444-2a30d8674126'|senseorder=''|definition-fr='orangeMOD'"
180c180
< externalID="root='paka'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"
---
> externalID="root='pakaMOD'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'"

As you can see from this diff output, I changed data in a number of different types of fields: the report title, author, secTitle, langData (from the citation form), langData (from the Plural field), and glosses in each of French and Swahili. The last three changes are to root and definitions, which are not visible in the printed report but are stored in an externalID attribute (recently added to XLingPaper precisely to store this kind of information without having to put it elsewhere in the structure of the doc).
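To see what that looks like in the document itself, here is the shape of that last change, reconstructed schematically from the diff above (which element carries the attribute, and its text content, are my guesses; the point is the key='value'|key='value' encoding inside externalID):

<langData lang="gey" externalID="root='paka'|entrydateCreated='2011-08-05T10:57:05Z'|entrydateModified='2011-09-27T11:24:32Z'|entryguid='44dcf55e-9cd7-47a9-ac66-1713a3769708'|entryid='mopaka_44dcf55e-9cd7-47a9-ac66-1713a3769708'">mopaka</langData>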

And here is the diff showing the changes between the original LIFT export and the one exported after importing the LIFT file with modifications:

$ diff gey{,.ori}.lift
2601c2601
< <form lang="swh"><text>mkono / mikonoMOD</text></form>
---
> <form lang="swh"><text>mkono / mikono</text></form>
10776c10776
< <gloss lang="swh"><text>mke / wakeMOD</text></gloss>
---
> <gloss lang="swh"><text>mke / wake</text></gloss>
15871c15871
< <form lang="gey"><text>pakaMOD</text></form>
---
> <form lang="gey"><text>paka</text></form>
23529c23529
< <form lang="gey"><text>mbataMOD</text></form>
---
> <form lang="gey"><text>mbata</text></form>
27587c27587
< <field type="Plural"><form lang="gey"><text>mabataMOD</text></form>
---
> <field type="Plural"><form lang="gey"><text>mabata</text></form>
31657c31657
< <form lang="fr"><text>orangeMOD</text></form>
---
> <form lang="fr"><text>orange</text></form>
32416c32416
< <gloss lang="fr"><text>fissure, fenteMOD</text></gloss>
---
> <gloss lang="fr"><text>fissure, fente</text></gloss>

Summary

  1. The first several MODs to the paper (to titles, etc.) are not in the second diff, since only example data is extracted into the LIFT file to import (this is what we want, right?).
  2. The other MODs (root, citation, plural, gloss-swahili, gloss-french, definition-french, and definition-swahili) all survived.
  3. No other changes existed between the exported LIFT files.

Discussion

Because FLEx exported essentially the same LIFT file (of 2033 entries and almost 2MB, remember), with all and only the changes made in XXE, I presume that there were no destructive changes to the underlying FLEx database, and that this procedure is safe for further testing. I did not go so far as to diff the underlying fwdata file, as I probably wouldn’t understand its format anyway, and I wouldn’t know how to distinguish differences in formatting from differences in content (while it is also XML, I don’t understand its specification or how it is used in the program, which is not a bad thing).
Speaking of what I don’t know, I should be clear that my formal training is in linguistics (M.A. Oregon 2002), not in IT. I’m doing this because there is a massive amount of linguistic data to collect, organize, analyze, and verify, and I want to do that efficiently (the fact that this is fun is just a nice byproduct). In any case, I have certainly not followed best practices in my bash or XSL scripting. So if you read the attachments and think “this guy doesn’t know how to code efficiently or elegantly,” then we’re already in agreement. And you’d be welcome to contribute improvements. 🙂

Acknowledgements

I wouldn’t have gotten anywhere on this project without the work of many others, particularly those who have given their own time and resources (which surely could have been spent elsewhere) to FLEx, WeSay, and the LIFT specification itself. Of particular note is Andy Black, who encouraged me to take another stab at XSLT (after I told him I’d tried and given up a few years ago), and who has provided invaluable and innumerable help, both in the development of the XLingPaper specification and on particular issues related to these transforms. Most of what is good here has roots in his work, though I hope no one holds him responsible for my errors and inelegance.

Getting Fieldworks lexical data into XLingpaper

I’ve written before about using WeSay to collect language data, and WeSay LIFT files can be fairly easily imported into Fieldworks Language Explorer (FLEx) for analysis. Recently, I’ve been working on getting data from the FLEx lexicon into XLingPaper, to facilitate the writing of reports and papers that can be full of data (which is the way I like them… :-)).
I start with a lexicon (basically just a word list) in FLEx that has been parsed for root forms (i.e., after going through noun class categorization and obligatory morphology with a speaker of the language). The idea is to figure out the canonical root syllable profile (e.g., around here, usually CVCV), and to look for complementary distribution and contrast within that type, both (though separately) for nouns and verbs.
I have a script that generates regular expressions based on the graphs we expect to use (those in Swahili, plus those we have added to orthographies in the past; since we start with data encoded in the Swahili/Lingala orthography, this covers most of the data we work with). The script produces expressions like

^bu([mn]{0,1})([[ptjfvmlryh]|[bdgkcsznw][hpby]{0,1}])([aiɨuʉeɛoɔʌ]{1,2})([́̀̌̂]{0,1})$

which means that the whole word/lexeme form (between ^ and $) is just b, u, some consonant (an optional m or n, then either a consonant letter that appears alone or the first and second letters of a digraph), then some vowel (any of ten basic ones, short or long), with an optional tone diacritic. In other notation, it is giving buCV, i.e., canonical structures with root-initial [bu]. This data is paired with data from other regular-expression filters giving [b] before other vowels, to show a complete distribution of [b] before all vowels (presumably…).

The script also generates another expression,

^([mn]{0,1})([[ptjfvmlryh]|[bdkgcsznw][hpby]{0,1}])(a)([mn]{0,1})([[ptjfvmlryh]|[bdkgcsznw][hpby]{0,1}])\3([́̀̌̂]{0,1})$

which gives me CaCa, as in the following screenshot:

[Screenshot: the FLEx lexicon filtered with the CaCa expression]

(The \3 refers back to the third set of parentheses, (a), so that changing (a) to (i) gives you CiCi.) The data from these filters gives evidence of the independent identity of a vowel, as opposed to vowels created through harmony rules.
So these regular expressions allow filtering of the data in the FLEx lexicon to show just the data I need to prove a particular point (in my case, why just these letters should be in the alphabet). But then, how to get the data out of FLEx, and into a document you’re writing?
FLEx has a number of export options out of the box, but none of them seems designed for outputting words with their glosses based on a particular filter/sort of the lexicon. In particular, I’m looking for export into a format that can be validated against an XLingPaper DTD, since I use XLingPaper XML for most of my writing, both for archivability and longevity of my data, and for cross-compatibility across differing environments (there are also stylesheets already developed to turn XLingPaper docs into HTML, PDF, and usually word-processor docs, too). The basic XML export of the data under the above sort starts like this:

<?xml version="1.0" encoding="utf-8"?>
<ExportedDictionary>
  <LexEntry id="hvo16380">
    <LexEntry_HeadWord>
      <AStr ws="gey">
        <Run ws="gey">bana</Run>
      </AStr>
    </LexEntry_HeadWord>
    <LexEntry_Senses>
      <LexSense number="1" id="hvo16382">
        <MoMorphSynAnalysisLink_MLPartOfSpeech>
          <AStr ws="en">
            <Run ws="en">num</Run>
          </AStr>
        </MoMorphSynAnalysisLink_MLPartOfSpeech>
        <LexSense_Definition>
          <AStr ws="en">
            <Run ws="en">four (4)</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="fr">
            <Run ws="fr">quatre (4)s</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="swh">
            <Run ws="swh">nne</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="pt">
            <Run ws="pt">quatro (4)</Run>
          </AStr>
        </LexSense_Definition>
        <LexSense_Definition>
          <AStr ws="es">
            <Run ws="es">cuatro</Run>
          </AStr>
        </LexSense_Definition>
      </LexSense>
    </LexEntry_Senses>
  </LexEntry>
  <LexEntry id="hvo11542">
    <LexEntry_HeadWord>… and so on…

But this is way more information than I need (I got most of these glosses for free by using the CAWL to elicit the data), and in the wrong form. The cool thing about XML is that you can take structured information and put it in another structure/form, to get the form you need. To do this, I needed to look (again) at XSL, the extensible stylesheet language, which had successfully intimidated me a number of times already. But with a little time, energy, and desperation, I got a working stylesheet. And with some help from Andy Black, I made it simpler and more straightforward, so that it looks like XLPMultipleFormGlosses.xsl does today. Put it, and an XML file describing it, into /usr/share/fieldworks/Language Explorer/Export Templates, and you have a new export process from within FLEx. To see the power of this stylesheet, note that the above data is now exported from FLEx as

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xlingpaper PUBLIC "-//XMLmind//DTD XLingPap//EN" "XLingPap.dtd">
<xlingpaper version="2.8.0">
  <lingPaper>
    <section1 id="DontCopyThisSection">
      <secTitle>Click the [+] on the left, copy the example that appears below, then paste it into your XLingpaper document wherever an example is allowed.</secTitle>
      <example num="examples-hvo16380">
        <listWord letter="hvo16380">
          <langData lang="gey">bana</langData>
          <gloss lang="en">four (4)</gloss>
          <gloss lang="fr">quatre (4)s</gloss>
        </listWord>
        <listWord letter="hvo11542">

which contains just enough header/footer to validate against the XLingPaper DTD (so you don’t get errors opening it in XMLMind), and the word forms with just the glosses I want (English and French, though that is easily customizable in the stylesheet). The example node can be copied and pasted into an existing XLingPaper document, which can then be transformed into other formats, like the PDF in this screenshot:

[Screenshot: PDF output of the pasted example]

which I think is a pretty cool thing to be able to do, and an advance for the documentation of the languages we are working with.
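For reference, the heart of a stylesheet like this is quite short. The following is a simplified sketch in the spirit of XLPMultipleFormGlosses.xsl (not the actual file): it walks the ExportedDictionary structure FLEx produces, shown above, and emits one listWord per entry, keeping only the English and French definitions as glosses. The real stylesheet also emits the XLingPaper header and footer around the example:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/ExportedDictionary">
    <example num="examples">
      <xsl:apply-templates select="LexEntry"/>
    </example>
  </xsl:template>
  <xsl:template match="LexEntry">
    <listWord letter="{@id}">
      <langData lang="gey">
        <xsl:value-of select="LexEntry_HeadWord/AStr[@ws='gey']/Run"/>
      </langData>
      <!-- keep just the glosses wanted in the paper -->
      <gloss lang="en">
        <xsl:value-of select="LexEntry_Senses/LexSense/LexSense_Definition/AStr[@ws='en']/Run"/>
      </gloss>
      <gloss lang="fr">
        <xsl:value-of select="LexEntry_Senses/LexSense/LexSense_Definition/AStr[@ws='fr']/Run"/>
      </gloss>
    </listWord>
  </xsl:template>
</xsl:stylesheet>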