Showing posts with label xslt. Show all posts
Showing posts with label xslt. Show all posts

Monday, 27 July 2009

Learning XSLT 2.0 Part 1; Finding Names

We mark up a lot of names, so one of the first things I decided to do was to build an XSLT stylesheet that takes a list of names and tags those names when they occur in a separate XSLT file. To make things easier and clearer, I've ignored little things like namespaces, conformant TEI, etc, etc.



First up, the list of names, these are multi-word names. Notice the simple structure, this could easily be built from a comma seperated list or similar:



<?xml version="1.0" encoding="UTF-8"?>
<names>
<name>Papaver argemone</name>
<name>Papaver dubium</name>
<name>Papaver Rhceas</name>
<name>Zanthoxylum novæ-zealandiæ</name>
</names>

Next, some sample text:



<?xml version="1.0" encoding="UTF-8"?>
<doc>
<p> There are several names Papaver argemone in this document Papaver argemone</p>
<p> Some of them are the same as others (Papaver Rhceas Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like Zanthoxylum novæ-zealandiæ AKA Zanthoxylum novae-zealandiae</p>
</doc>

Finally the stylesheet. It consists of three parts: the regexp variable that builds a regexp from the names in the file; a default template for everything but text(); and a template for text()s that applies the rexexp.



<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >

<!-- build a regexp of the names -->
<xsl:variable name="regexp">
<xsl:value-of select="concat('(',string-join(document('name-list.xml')//name/text(), '|'), ')')"/>
</xsl:variable>

<!-- generic copy-everything-except-texts template -->
<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
</xsl:copy>
</xsl:template>

<!-- Look for binomal names in appreviated form where the genus name is in the immediately preceeding head -->
<xsl:template match="text()">
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<name type="taxonomic" subtype="matched">
<xsl:value-of select="regex-group(1)"/>
</name>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>


The output looks like:



<?xml version="1.0" encoding="UTF-8"?><doc>
<p> There are several names <name type="taxonomic" subtype="matched">Papaver argemone</name> in this document <name type="taxonomic" subtype="matched">Papaver argemone</name></p>
<p> Some of them are the same as others (<name type="taxonomic" subtype="matched">Papaver Rhceas</name> Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like <name type="taxonomic" subtype="matched">Zanthoxylum novæ-zealandiæ</name> AKA Zanthoxylum novae-zealandiae</p>
</doc>

As you may notice, I've not yet worked out the best way to handle the 'æ'

Tuesday, 26 August 2008

Saxon joy!

I've just moved to saxon from libxml for some XSLT stuff I'm doing, and I'm really loving it.

Not only does saxon run take much less memory, it also speaks XSLT 2.0.