Using XSLT to Re-represent GATE Output

Once one has processed some documents with GATE, what can one do with the result? After all, information extraction implies that the information is extracted, not simply annotated. (See introductory notes on this and related posts.)

There are several paths. One is to use Annotations in Context (ANNIC), which searches for and returns a display of annotated elements; we discuss how to use ANNIC in a separate post. However, ANNIC does not appear to support an “export” function for further processing of the results. Another path is to export the document with inline annotations; with a bit of further manual work, this can then be processed with Extensible Stylesheet Language Transformations (XSLT). There are other approaches (e.g. XQuery), but this post provides an example of using XSLT to present the output as a rule book.

In Legislative Rule Extraction, we annotated some legislation. Here we carry on with that annotated text.

Output of GATE

In addition to the graphic output from GATE’s application, we can output the results of the annotation either inline or offset. As we are interested in providing alternative presentations of the annotated material, we look at the inline annotation.

In GATE, right-click on the document file (after applying the application to it) and choose “Save preserving document format”. For our sample text, the result is:

<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
<ListFlagLevel1> 1) </ListFlagLevel1>
<SubListStatementPrefinal> the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance </SubListStatementPrefinal>;
<ListFlagLevel1> 2) </ListFlagLevel1>
<SubListStatementPrefinal> the supervision in the case of insurance and
reinsurance groups </SubListStatementPrefinal >;
<ListFlagLevel1> 3) </ListFlagLevel1>
<SubListStatementFinal> the reorganisation and winding-up of direct
insurance undertakings </SubListStatementFinal>.

Legal XML

The GATE output needs to be made into proper XML, with a single root element and properly nested elements. As there will be several rules, each extracted rule should be enclosed in its own legal XML element. There is an issue about how to save and process a full corpus, as the only options to save are XML or Datastore, but we leave this aside for the time being. For now, we ‘manually’ wrap our GATE output as below.
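The manual wrapping step could also be scripted. What follows is a minimal sketch in Python, not something GATE provides: the function name `wrap_as_rulebook` is hypothetical, and it assumes that each rule has been saved as a string of inline-annotated text like the sample above, with no unescaped `&` or `<` in the legislation itself. It wraps each fragment in `<rule>`, wraps the lot in `<rulebook>`, and uses the parser in Python's standard library to confirm that the result is well formed.

```python
import xml.etree.ElementTree as ET

def wrap_as_rulebook(fragments):
    """Wrap inline-annotated fragments in <rulebook>/<rule> and check the result.

    `fragments` is a list of strings, one per rule (an assumption about how
    the GATE output has been collected). Raises ET.ParseError if the combined
    text is not well-formed XML, for instance because annotations cross over.
    """
    rules = "\n".join("<rule>\n" + f.strip() + "\n</rule>" for f in fragments)
    document = "<rulebook>\n" + rules + "\n</rulebook>"
    ET.fromstring(document)  # well-formedness check only; the tree is discarded
    return document

# A shortened version of the sample GATE output
sample = """<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:"""

rulebook = wrap_as_rulebook([sample])
```

The returned string has no XML declaration; one can be prepended before saving the file if the XSLT tool requires it.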

I used the online XSLT editor at w3schools, which allows one to experiment and get results right away. In particular, one can cut and paste the XML rulebook (below) into the left-hand pane and the XSLT code (below) into the right-hand pane, hit the edit button, and get the transformed output. One caveat: the XML rulebook might need a bit of editing for spaces and line breaks, since there are some bumps between what appears in WordPress and what is needed to run the code.

The XML Rulebook:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rulebook>
<rule>
<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
<ListFlagLevel1> 1) </ListFlagLevel1>
<SubListStatementPrefinal> the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance </SubListStatementPrefinal>;
<ListFlagLevel1> 2) </ListFlagLevel1>
<SubListStatementPrefinal> the supervision in the case of insurance and
reinsurance groups </SubListStatementPrefinal >;
<ListFlagLevel1> 3) </ListFlagLevel1>
<SubListStatementFinal> the reorganisation and winding-up of direct
insurance undertakings </SubListStatementFinal>.
</rule>
</rulebook>

The XSLT code:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy® -->
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
  <html>
  <body>
  <h3>My Rulebook</h3>
  <xsl:apply-templates/>
  </body>
  </html>
</xsl:template>

<!-- Only the elements selected below are rendered. ListFlagLevel1 and the
     punctuation between elements are not selected, so they are dropped. -->
<xsl:template match="rule">
  <p>
    <xsl:apply-templates select="ArticleFlag"/>
    <xsl:apply-templates select="SectionType"/>
    <xsl:apply-templates select="ListStateTop"/>
    <xsl:apply-templates select="SubListStatementPrefinal"/>
    <xsl:apply-templates select="SubListStatementFinal"/>
  </p>
</xsl:template>

<xsl:template match="ArticleFlag">
  Reference Code: <span style="color:#ff0000">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="SectionType">
  Title: <span style="color:#00ffff">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="ListStateTop">
  Description: <span style="color:#00ff00">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<!-- Both kinds of list statement share one layout, so one template serves both -->
<xsl:template match="SubListStatementPrefinal | SubListStatementFinal">
  Description: <span style="color:#00ff00">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

</xsl:stylesheet>

XSLT Output

The result is the following:

[Figure: Output of XSLT on the XML Rulebook]

In general, one can create any number of rulebooks from the same underlying data, varying the layout and substance of the presentation. For example, we can change the colours or headers easily; we can present more or less information. This is a lot more powerful than the static book that now exists.
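To make the point about varying the presentation concrete, here is a rough sketch of the same idea in Python rather than XSLT. The function `render` and the label tables are hypothetical names introduced for illustration; the element names are those of the rulebook above. One table of labels gives a full rulebook, another gives a brief one, and the underlying XML is untouched.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML rulebook above
RULEBOOK = """<rulebook>
<rule>
<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
</rule>
</rulebook>"""

def render(rulebook_xml, labels):
    """Return one text line per labelled element, in document order.

    `labels` maps element names to headers. Elements that are not listed
    are left out, which is how a layout presents more or less information.
    """
    lines = []
    for rule in ET.fromstring(rulebook_xml).iter("rule"):
        for child in rule:
            if child.tag in labels:
                # Collapse the line breaks and padding left by the annotation
                text = " ".join((child.text or "").split())
                lines.append(labels[child.tag] + ": " + text)
    return lines

# Two presentations of the same data
full = render(RULEBOOK, {"ArticleFlag": "Reference Code",
                         "SectionType": "Title",
                         "ListStateTop": "Description"})
brief = render(RULEBOOK, {"ArticleFlag": "Ref"})
```

Changing a header or dropping a field is then a change to the label table only, much as it is a change to a single template in the stylesheet.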

Problems and Issues

Our example is a simple illustration of what can be done. Note that we have not yet fulfilled the requirements from our initial post since we have not numbered the sections, but this can be added later.

An important problem is that GATE annotations are not always in accordance with XML standards. In particular, XML elements must be strictly nested, as in

 <x> <y> </y> <z> </z> </x>

There can be no crossover such as in

<x> <y> <z> </y> </z> </x>

though this may well occur for GATE annotations. There may be several approaches to this problem, but we leave that for future discussion.
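Before attempting a repair, it helps to know where the crossovers are. The following Python sketch is one possible way to find them, not a GATE facility: the function name `find_crossovers` is hypothetical, and the regular expression is a simplification that assumes no `>` appears inside attribute values. It keeps a stack of open elements and reports every closing tag that does not match the most recently opened one.

```python
import re

# Matches opening, closing and self-closing tags; skips <?...?> and <!--...-->
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def find_crossovers(text):
    """Return (expected, found) pairs for closing tags that break the nesting."""
    stack, problems = [], []
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                      # <br /> opens and closes at once
        if not closing:
            stack.append(name)            # opening tag
        elif stack and stack[-1] == name:
            stack.pop()                   # properly nested closing tag
        else:
            problems.append((stack[-1] if stack else None, name))
            if name in stack:
                # Drop the innermost open element with that name so that
                # the tags that follow can still be matched up
                del stack[len(stack) - 1 - stack[::-1].index(name)]
    return problems

nested = find_crossovers("<x> <y> </y> <z> </z> </x>")
crossed = find_crossovers("<x> <y> <z> </y> </z> </x>")
```

For the two examples above, `nested` is empty and `crossed` holds the single pair `("z", "y")`: the checker expected `</z>` and found `</y>`.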

Another problem is that “Save preserving document format” only works with documents and not corpora, and we might want to work with corpora.

Finally, XSLT is useful for transforming XML files, not for extracting information from them, for which one would need something such as XQuery.
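As a rough illustration of what extraction, as opposed to transformation, might look like, here is a sketch in Python rather than XQuery. The function name `statements_mentioning` is hypothetical, and the element names are those of the rulebook above. It returns, for each list statement that mentions a keyword, the article it belongs to and the statement text.

```python
import xml.etree.ElementTree as ET

# A compact copy of the XML rulebook above
RULEBOOK = """<rulebook>
<rule>
<ArticleFlag> Article 1 </ArticleFlag>
<SubListStatementPrefinal> the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance </SubListStatementPrefinal>;
<SubListStatementPrefinal> the supervision in the case of insurance and
reinsurance groups </SubListStatementPrefinal>;
<SubListStatementFinal> the reorganisation and winding-up of direct
insurance undertakings </SubListStatementFinal>.
</rule>
</rulebook>"""

def statements_mentioning(rulebook_xml, keyword):
    """Return (article, statement) pairs whose statement contains `keyword`."""
    hits = []
    for rule in ET.fromstring(rulebook_xml).findall("rule"):
        article = " ".join((rule.findtext("ArticleFlag") or "").split())
        for elem in rule:
            if elem.tag.startswith("SubListStatement"):
                text = " ".join((elem.text or "").split())
                if keyword.lower() in text.lower():
                    hits.append((article, text))
    return hits

hits = statements_mentioning(RULEBOOK, "reinsurance")
```

Here `hits` contains the first two statements of Article 1 and leaves out the third, which mentions insurance but not reinsurance.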

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0
