Archive for the ‘GATE’ Category

Information Extraction of Legal Case Factors

Wednesday, January 20th, 2010

This post reports initial steps in legal case factor annotation. We first give a very brief and highly simplified overview of case based reasoning using case factors, then present how case factors can be identified using text mining. (See introductory notes on this and related posts.)

Case based reasoning background

In Common Law legal systems such as those of the USA and UK, judges make decisions concerning a case; we can say the judges make the law. This is in contrast to Civil Law legal systems, as in Europe (excluding the UK) or elsewhere, in which legislatures make the law, which judges must then follow. In practice, neither kind of system is purely common law or civil law: the USA and UK have laws made by legislatures; in Europe, the application of legislative acts in particular circumstances (refining the law to apply to the facts) takes on aspects of common law.

In a Common Law system, judges and lawyers argue using case based reasoning: a current undecided case is argued with respect to precedent cases, which are cases that have already been decided by a court and are accepted as “good law”. In essence, if the current case were exactly like a particular precedent case in all essential ways, then the current case ought to be decided as was the precedent case. Where the current case varies, one must argue comparatively with respect to other precedents. Among the ways in which cases are compared and contrasted, we find the case factors, where factors are prototypical fact patterns of a case. In virtue of the facts of a case and along with the applicable laws and precedents, a judge decides a case. It is, therefore, crucial to be able to identify the facts of a case in order to compare and contrast the cases.

In AI and Law, case based reasoning has a long and well developed history and literature (see the work of Hafner, Rissland, Ashley, and Bench-Capon among others). We make specific reference to Aleven’s 1997 Ph.D. Thesis. Given an analysis of cases in terms of factors, one can reason about how a current undecided case should, according to the precedents, be decided. However, a central problem is the knowledge bottleneck: how to analyse cases in terms of factors. By and large, this has been manual labour. In the CATO database of cases discussed in Aleven 1997 (about 140 cases concerning intellectual property), the factors are manually annotated. There has been some effort to automate textual identification of factors in cases (see Bruninghaus and Ashley), but this is done with case summaries, not “actual” cases; moreover, the database, annotation, and other system supports are unavailable, so the results of their experiments are not independently verifiable and cannot be developed by other researchers.

Factors in text

In the CATO system, texts of case decisions are presented to the student along with a menu of factors; the student associates the factors with the text, in effect, annotating the case as a whole with the factors, but not the linguistic aspects which gave rise to the annotation. The factors are not extracted. The CATO system has other components to support case based argumentation, but these are not relevant to our discussion at this point.

Factors are legal concepts that range over facts. While Aleven 1997 has 27 factors and a factor hierarchy, we only look at two factors in order to give a flavour of our approach.

  • Security-Measures
    • Description: The plaintiff took efforts to maintain the secrecy of its information.
    • The factor applies if: The plaintiff limited access to and distribution of information. Examples: nondisclosure agreements, notification that the information is confidential, securing the information with passwords and secure storage facilities, secure document distribution systems, etc.
  • Secrets-Disclosed-Outsiders
    • Description: The information was disclosed to outsiders or was in the public domain. The plaintiff either did not have secret information or did not have an interest in maintaining the secrecy of information.
    • The factor applies if: The plaintiff disclosed the product information to licensees, customers, subcontractors, etc.
    • The factor does not apply if: Plaintiff published the information in a public forum. All we know is that plaintiff marketed a product from which the information could be ascertained by reverse engineering.

Aleven 1997 illustrates the association of factors with textual passages in a case.

Mason v. Jack Daniels Distillery

Given the factor description, we make lists and rules which at least highlight candidate passages in the case which might be relevant.

Output

The results of annotating terms and sentences appear in:

Annotations for Secret and Disclose Terms in Trandes v. Atkinson

Annotations for Secret and Disclose Sentences in Trandes v. Atkinson

Note that the disclosure sentence seems to be a reasonable candidate for the disclosure factor, but the secrecy sentence is a discussion about the factor rather than a presentation of the factor itself. As we have said, at this point we provide candidate expressions for the factors; further work must be done to annotate the text automatically with greater accuracy.

GATE

The lists, JAPE rules, graphics, and application state are in the archive. See the related post Information Extraction with ANNIC which uses a GATE plugin to further analyse the results so they can be improved.

Lists

To highlight the relevant passages, we created Lookup lists and then JAPE rules. To create the Lookups, we turned to disclosure and secret in WordNet, taking the SynSets of each, as well as looking at hypernyms (superordinate terms). Making a selection, we created lists using the infinitival, lower case form. This gave us two lists — disclosure.lst and secret.lst.

  • disclosure.lst: announce, betray, break, bring out, communicate, confide, disclose, discover, divulge, expose, give away, impart, inform, leak, let on, let out, make known, pass on, reveal, tell
  • secret.lst: confidential, confidentiality, hidden, private, secrecy, secret

In the gazetteer itself, disclosure.lst has a majorType disclose, and secret.lst has a majorType secret. With these lists, we homogenize the alternative words for these concepts. It is important that these particular lists are integrated into a lists.def file; in our example, this is ListGaz, which is not included in the distribution. As the application uses the Flexible Gazetteer (not discussed here), we can Lookup alternative morphological forms of words in the lists.
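The lookup step can be sketched outside GATE in a few lines of Python. This is only an illustration under stated assumptions: the names GAZETTEER, variants, and lookup are hypothetical, the lists are abbreviated to single-word entries, and the suffix generation is a crude stand-in for the morphological matching that the Flexible Gazetteer provides.

```python
import re

# Abbreviated lists; the keys play the role of majorType.
GAZETTEER = {
    "disclose": ["disclose", "divulge", "reveal", "leak", "tell", "inform"],
    "secret": ["confidential", "confidentiality", "hidden", "private",
               "secrecy", "secret"],
}


def variants(word):
    """Crude inflected forms of a list entry (assumption, not GATE's morphology)."""
    forms = {word, word + "s"}
    if word.endswith("e"):
        forms |= {word + "d", word[:-1] + "ing"}
    else:
        forms |= {word + "ed", word + "ing"}
    return forms


def build_index(gazetteer):
    """Map every generated surface form to its majorType."""
    index = {}
    for major_type, entries in gazetteer.items():
        for entry in entries:
            for form in variants(entry):
                index[form] = major_type
    return index


INDEX = build_index(GAZETTEER)


def lookup(text):
    """Return (start, end, matched text, majorType) for each word found in the lists."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        major_type = INDEX.get(match.group().lower())
        if major_type:
            hits.append((match.start(), match.end(), match.group(), major_type))
    return hits
```

For a sentence such as "The plaintiff disclosed confidential drawings.", lookup flags disclosed with majorType disclose and confidential with majorType secret. The suffix generation also produces forms that are not real words, so it is no substitute for a proper morphological analyser.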

JAPE rules

Then we write JAPE rules so that the relevant passages are easier to identify. The first pair of rules turns the majorType into an annotation in the annotation set, highlighting any occurrence of the terms; we could have skipped this, but it is worthwhile to see where and how the terms appear. The second pair of rules classifies sentences as relating to disclosure or secrecy.

  • SecretFactor01.jape: Annotates any word from the secret.lst.
  • DisclosureFactor01.jape: Annotates any word from the disclosure.lst.
  • SecretFactorSentence01.jape: Annotates any sentence which contains an annotation Secret.
  • DisclosureFactorSentence01.jape: Annotates any sentence which contains an annotation Disclosure.
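The effect of the two sentence rules can be sketched in Python as follows. This is not the JAPE code from the archive: the sentence splitter is naive, the term sets are abbreviated, the label names SecretSentence and DisclosureSentence are hypothetical, and the sample sentences in the usage note are invented for illustration.

```python
import re

# Abbreviated term sets; inflected forms are listed explicitly to keep the sketch short.
SECRET_TERMS = {"confidential", "confidentiality", "hidden", "private",
                "secrecy", "secret"}
DISCLOSE_TERMS = {"disclose", "disclosed", "divulge", "divulged",
                  "reveal", "revealed"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def classify_sentences(text):
    """Label each sentence according to the kinds of terms it contains."""
    results = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        labels = []
        if words & SECRET_TERMS:
            labels.append("SecretSentence")
        if words & DISCLOSE_TERMS:
            labels.append("DisclosureSentence")
        results.append((sentence, labels))
    return results
```

Applied to "The plaintiff kept the recipe secret. The plaintiff disclosed the recipe to a customer. The court adjourned.", the first sentence is labelled SecretSentence, the second DisclosureSentence, and the third receives no label. Abbreviations such as "v." would confuse the naive splitter, which is one reason to rely on a proper sentence splitter in practice.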

Application order

The order of application of the processing resources is:

  • Document Reset PR
  • ANNIE Sentence Splitter
  • ANNIE English Tokeniser
  • ListGaz
  • SecretFactor01.jape
  • DisclosureFactor01.jape
  • SecretFactorSentence01.jape
  • DisclosureFactorSentence01.jape

Discussion

As we have already pointed out, the annotations highlight potentially relevant passages. Further refinement is needed. This would be clearer were one to look at more applications of the annotation. It will also be important to consider more factors on more cases and across more domains of case law.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Information Extraction of Conditional Rules

Wednesday, January 20th, 2010

In this post, we extract conditional rules, such as If it rains, then the sidewalk is wet, both from simple examples and from a sample fragment of legislation. (See introductory notes on this and related posts.)

Sample legislation

In legislation (and elsewhere in the law), conditional statements of the form If P, then Q are used. A well-researched example in AI and Law is the UK Nationality Act. In this post, we provide some initial JAPE rules to annotate conditional statements.

We work with several variants of simple conditional statements and a (modified) conditional statement from the UK Nationality Act. We want to annotate each statement as a rule and to identify the parts of the rule.

    If Bill is happy, then Jill is happy.

    Jill is happy, if Bill is happy.

    Jill is happy if:

        1) Bill is happy;
        2) Bill and Jill are together.

    Acquisition by birth or adoption

        (1) A person born in the United Kingdom after commencement shall be a British citizen if –
        (a) at the time of the birth his father or mother is a British citizen; or
        (b) at the time of the birth his father or mother is settled in the United Kingdom.

Output

We want not only to identify a sentence as a rule, but also to identify the parts of the rule, namely the antecedent and the consequent. This may be useful for further processing.

The results appear in a graphic as:

Rule Output

Below, we discuss some of the problems with annotating the legislative rule.

GATE

In the zip file we have the application state, text, graphic, and JAPE rules.

Lists

There are no particular lists for this section; we used the same lists from the rulebook development.

JAPE Rules

We have a cascade of rules as follows.

  • AntecedentInd01: finds the token “if” in the text. We use this as an indicator that the sentence is or may be a rule. We may have a range of such rules that we take to indicate a rule. We can use them to examine results from a body of texts, refining what is identified as a rule and how. Overgenerate, then prune. After we are clear about the results from individual rules, we can gather the annotations together under another annotation, which generalises the result.
  • AntecedentInd02: finds the conditional indicator inside a sentence and annotates the resulting sentence as a rule with a conditional. A general rule like this can be used as we refine the indicators of rules. It is also an example of sentence annotation with respect to properties contained in the sentence.
  • ConditionalParts01: finds the string between if and some punctuation, then labels it antecedent. This labels Bill is happy as antecedent in simple sentences such as If Bill is happy, then Jill is happy and Jill is happy, if Bill is happy. It does not work for the list.
  • ConditionalParts02: finds the string between a preceding sentence and a comma followed by a conditional indicator, then labels it consequent. This labels Jill is happy as consequent in simple sentences such as Jill is happy, if Bill is happy.
  • ConditionalParts03: finds the string between then and the end of the sentence, labelling it consequent. This labels Jill is happy as consequent in simple sentences such as If Bill is happy, then Jill is happy.
  • ConditionalParts04: finds the string between a preceding sentence and a conditional indicator followed by a colon, then labels it consequent. This labels Jill is happy as consequent in constructions where the antecedents are presented in a list such as Jill is happy if: Bill is happy and Jill and Bill are together.
  • ConditionalParts05: finds the strings between list indicators (see the section on legislative presentation) and some punctuation (semi-colon or period), and labels them as antecedents. This labels Bill is happy as antecedent in Jill is happy if: Bill is happy and Jill and Bill are together.
  • ConditionalSentenceClass: annotates sentences as conditionals if they contain a conditional indicator.
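The division of labour among the ConditionalParts rules can be illustrated outside GATE with regular expressions. This Python sketch is not a translation of the JAPE rules; the patterns and the function name parse_conditional are assumptions, and they cover only the three simple sample statements, not the legislative example.

```python
import re

SIMPLE_PATTERNS = [
    # "If P, then Q."
    re.compile(r"^If\s+(?P<antecedent>.+?),\s*then\s+(?P<consequent>.+?)\.$", re.I),
    # "Q, if P."
    re.compile(r"^(?P<consequent>.+?),\s*if\s+(?P<antecedent>[^:]+?)\.$", re.I),
]
# "Q if: 1) P1; 2) P2."
LIST_PATTERN = re.compile(r"^(?P<consequent>.+?)\s+if\s*:\s*(?P<items>.+)$",
                          re.I | re.S)
# One list item: a number and ")" followed by text up to ";" or ".".
LIST_ITEM = re.compile(r"\d+\)\s*(.+?)\s*[;.]")


def parse_conditional(sentence):
    """Return the consequent and antecedents of a simple conditional, or None."""
    sentence = sentence.strip()
    for pattern in SIMPLE_PATTERNS:
        match = pattern.match(sentence)
        if match:
            return {"consequent": match.group("consequent"),
                    "antecedents": [match.group("antecedent")]}
    match = LIST_PATTERN.match(sentence)
    if match:
        return {"consequent": match.group("consequent"),
                "antecedents": LIST_ITEM.findall(match.group("items"))}
    return None
```

For all three sample statements the consequent is Jill is happy; the list variant yields the two antecedents Bill is happy and Bill and Jill are together. A sentence without a conditional indicator returns None.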

Application order

The order of application of the processing resources is:

  • Document Reset PR
  • ANNIE English Tokeniser
  • ANNIE Sentence Splitter
  • ListFlagLevel1
  • AntecedentInd01
  • ConditionalParts01
  • ConditionalParts02
  • ConditionalParts03
  • ConditionalParts04
  • ConditionalParts05
  • ConditionalSentenceClass

Comments

While our application clearly works well for the simple samples of conditional statements, it does not do well with respect to our sample legislation. There are a range of problems: list recognition “(x)”, use of “;” , use of “–”, and use of “or”. Most of these have to do with refining the notions of lists that we inherited from the rulebook example, so we need to refine the rules to the particular context of use. We leave this as an exercise.


Introduction to a Series of Posts on Legal Information Extraction with GATE

Wednesday, January 20th, 2010

This post has notes on and links to several other posts about legal information annotation and extraction using the General Architecture for Text Engineering system (GATE). The information in the posts was presented at my tutorial at JURIX 2009, Rotterdam, The Netherlands; the slides are available here. See the GATE website or my slides for introductory material about NLP and text annotation. For particulars about NLP and legal resources, see the posts and files at the links below.

The Posts

The following posts discuss different aspects of legal information extraction using GATE (live links indicate live posts):

Prototypes

The samples presented in the posts are prototypes only. No doubt there are other ways to accomplish similar tasks, the material is not as streamlined or cleanly presented as it could be, and each section is but a very small fragment of a much larger problem. In addition, there are better ways to present the lists and rules “in one piece”; however, during development and for discussion, it seems more helpful to have elements separate. Nonetheless, as a proof of concept, the samples make their point.

If there are any problems, contact Adam Wyner at adam@wyner.info.

Files

The posts are intended to be self-contained and to work with GATE 5.0. The archive files include the .xgapp file, which is a saved application state, along with text/corpus, the lists, and JAPE rules needed to run the application. In addition, the archive files include any graph outputs as reference. As noted, one may need to ‘fiddle’ a bit with the gazetteer lists in the current version.

Graphics

Graphics in the posts can be viewed in a larger and clearer size by right clicking on the graphic and selecting View Image. The Back button on your browser will close the image and return you to the post.

License

The materials are released under the following license:

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

If you want to commercially exploit the material, you must seek a separate license with me. That said, I look forward to further open development on these materials; see my post on Open Source Legal Information.

Using XSLT to Re-represent GATE Output

Wednesday, January 20th, 2010

Once one has processed some documents with GATE, what can one do with the result? After all, information extraction implies that the information is extracted, not simply annotated. (See introductory notes on this and related posts.)

There are several paths. One is to use Annotations in Context (ANNIC), which searches for and returns a display of annotated elements; we discuss how to use ANNIC in a separate post. However, ANNIC does not appear to support an “export” function for further processing of the results. Another path is to export the document with inline annotations; this, with a bit of further manual work, can then be processed with EXtensible Stylesheet Language Transformations (XSLT). There are other approaches (e.g. XQuery), but this post provides an example of using XSLT to present output as a rule book.

In Legislative Rule Extraction, we annotated some legislation. We carry on with the annotated legislation.

Output of GATE

In addition to the graphic output from GATE’s application, we can output the results of the annotation either inline or offset. As we are interested to provide alternative presentations of the annotated material, we look at the inline annotation.

In GATE, right click on the document file (after applying the application to it) and choose “Save preserving document format”. For our sample text, the result is:

<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
<ListFlagLevel1> 1) </ListFlagLevel1>
<SubListStatementPrefinal> the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance </SubListStatementPrefinal>;
<ListFlagLevel1> 2) </ListFlagLevel1>
<SubListStatementPrefinal> the supervision in the case of insurance and
reinsurance groups </SubListStatementPrefinal >;
<ListFlagLevel1> 3) </ListFlagLevel1>
<SubListStatementFinal> the reorganisation and winding-up of direct
insurance undertakings </SubListStatementFinal>.

Legal XML

The GATE output needs to be made into proper XML, having a root and being properly nested. As there will be several rules, each rule extracted should go between some legal XML annotation. There is an issue about how to save and process a full corpus, as the only options to save are XML or Datastore, but we leave this aside for the time being. For now, we ‘manually’ wrap our GATE output as below.

I used the online XSLT editor at w3schools, which has nice online functionality which allows one to experiment and get results right away. In particular, one can cut and paste the XML rulebook (below) into the left hand pane and the XSLT code (below) into the right hand pane, hit the edit button, and get the transformed output. Caveat, one might have to do a bit of editing on the XML rulebook for spaces and returns since there are some bumps between what appears in WordPress and what is needed to run code.

The XML Rulebook:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rulebook>
<rule>
<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
<ListFlagLevel1> 1) </ListFlagLevel1>
<SubListStatementPrefinal> the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance </SubListStatementPrefinal>;
<ListFlagLevel1> 2) </ListFlagLevel1>
<SubListStatementPrefinal> the supervision in the case of insurance and
reinsurance groups </SubListStatementPrefinal >;
<ListFlagLevel1> 3) </ListFlagLevel1>
<SubListStatementFinal> the reorganisation and winding-up of direct
insurance undertakings </SubListStatementFinal>.
</rule>
</rulebook>
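The manual wrapping shown above can also be scripted. The following Python sketch, using only the standard library, wraps inline-annotated fragments in rule and rulebook elements and checks that the result parses; the function names are hypothetical, and it assumes the GATE output contains no crossing annotations and no unescaped characters such as & in the text.

```python
import xml.etree.ElementTree as ET


def wrap_as_rulebook(fragments):
    """Wrap each inline-annotated fragment in <rule>, all under one <rulebook> root."""
    rules = ["<rule>\n" + fragment.strip() + "\n</rule>" for fragment in fragments]
    return "<rulebook>\n" + "\n".join(rules) + "\n</rulebook>"


def is_well_formed(xml_text):
    """True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

The raw GATE output fails the check because it has several top-level elements and no root; after wrapping, it parses, with one rule element per fragment.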

The XSLT code:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy® -->
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
  <html>
  <body>
  <h3>My Rulebook</h3>
  <xsl:apply-templates/>
  </body>
  </html>
</xsl:template>

<xsl:template match="rule">
  <p>
    <xsl:apply-templates select="ArticleFlag"/>
    <xsl:apply-templates select="SectionType"/>
    <xsl:apply-templates select="ListStateTop"/>
    <xsl:apply-templates select="SubListStatementPrefinal"/>
    <xsl:apply-templates select="SubListStatementFinal"/>
  </p>
</xsl:template>

<xsl:template match="ArticleFlag">
  Reference Code: <span style="color:#ff0000">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="SectionType">
  Title: <span style="color:#00ffff">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="ListStateTop">
  Description: <span style="color:#00ff00">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="SubListStatementPrefinal">
  Description: <span style="color:#00ff00">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

<xsl:template match="SubListStatementFinal">
  Description: <span style="color:#00ff00">
  <xsl:value-of select="."/></span>
  <br />
</xsl:template>

</xsl:stylesheet>

XSLT Output

The result is the following:

Output of XSLT on the XML Rulebook

In general, one can create any number of rulebooks from the same underlying data, varying the layout and substance of the presentation. For example, we can change the colours or headers easily; we can present more or less information. This is a lot more powerful than the static book that now exists.

Problems and Issues

Our example is a simple illustration of what can be done. Note that we have not yet fulfilled the requirements from our initial post since we have not numbered the sections, but this can be added later.
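The missing numbering can be sketched as follows. The sketch uses Python's standard library rather than XSLT; the function names, the shortened RULEBOOK sample, and the convention of deriving the level prefix from the article number are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Shortened version of the XML rulebook, for illustration only.
RULEBOOK = """<rulebook>
<rule>
<ArticleFlag> Article 1 </ArticleFlag>
<SectionType> Subject matter </SectionType>
<ListStateTop> This Directive lays down rules concerning the
following </ListStateTop>:
<ListFlagLevel1> 1) </ListFlagLevel1>
<SubListStatementPrefinal> the taking-up and pursuit </SubListStatementPrefinal>;
<ListFlagLevel1> 2) </ListFlagLevel1>
<SubListStatementFinal> the reorganisation and winding-up </SubListStatementFinal>.
</rule>
</rulebook>"""

STATEMENT_TAGS = ("ListStateTop", "SubListStatementPrefinal", "SubListStatementFinal")


def clean(text):
    """Collapse line breaks and repeated spaces."""
    return " ".join(text.split())


def render_rule(rule):
    """Render one <rule> element as labelled lines with numbered levels."""
    article = clean(rule.findtext("ArticleFlag", default=""))
    number = re.search(r"\d+", article)
    prefix = number.group() if number else "?"
    lines = ["Reference Code: " + article,
             "Title: " + clean(rule.findtext("SectionType", default=""))]
    level = 0
    for child in rule:  # children are visited in document order
        if child.tag in STATEMENT_TAGS:
            lines.append("Level: %s.%d" % (prefix, level))
            lines.append("Description: " + clean(child.text or ""))
            level += 1
    return lines


def render_rulebook(xml_text):
    root = ET.fromstring(xml_text)
    return [render_rule(rule) for rule in root.iter("rule")]
```

Joining the lines returned for the first rule with newlines gives the Reference Code, Title, and then alternating Level and Description lines, with the top statement at level 1.0 and the list items at 1.1 and 1.2.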

An important problem is that GATE annotations are not always in accordance with XML standards. In particular, XML markups must be strictly embedded as in

 <x> <y> </y> <z> </z> </x>

There can be no crossover such as in

<x> <y> <z> </y> </z> </x>

though this may well occur for GATE annotations. There may be several approaches to this problem, but we leave that for future discussion.
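One way to see whether a document is affected is to check the annotation offsets before exporting. The following Python sketch assumes annotations are available as (name, start, end) triples with character offsets; that representation and the function name find_crossings are assumptions for illustration, not part of GATE's API.

```python
def find_crossings(annotations):
    """Return pairs of annotation names whose spans overlap without nesting.

    annotations: iterable of (name, start, end) with end exclusive.
    """
    # Sort by start; for equal starts put the longer span first, so that a
    # containing annotation always precedes the ones it contains.
    ordered = sorted(annotations, key=lambda a: (a[1], -a[2]))
    crossings = []
    for i, (name_a, start_a, end_a) in enumerate(ordered):
        for name_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # later spans start even further right: no overlap with a
            if end_b > end_a:
                # b starts inside a but ends outside it: the spans cross
                crossings.append((name_a, name_b))
    return crossings
```

With x spanning 0 to 20, y spanning 2 to 10, and z spanning 5 to 15, the function reports that y and z cross, which corresponds to the second markup pattern above; if z instead spans 12 to 18, nothing is reported.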

Another problem is that “Save preserving document format” only works with documents and not corpora, and we might want to work with corpora.

Finally, XSLT is useful for transforming XML files, not for extracting information from them; for extraction, one would need something such as XQuery.


Legislative Rule Extraction

Wednesday, January 20th, 2010

In this post, we discuss the annotation of information from legislation, for example, to create a rule book from legislation. There are two distinct tasks and two tools. First, we want to take the original legislation and annotate it; for this, we use GATE. Second, we want to transform the output of GATE, using the annotations, into some alternative, web-compatible format; for this, we use EXtensible Stylesheet Language Transformations (XSLT). The second task is presented in the post Using XSLT to Re-represent GATE Output. John Cyriac of compliancetrack outlined the problem that is addressed in these two posts. (See introductory notes on this and related posts.)

Sample legislation and text

The text we are working with is a sample from Insurance and Reinsurance (Solvency II) from the European Parliament.

SUBJECT MATTER AND SCOPE
Article 1
Subject matter
This Directive lays down rules concerning the following:
1) the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
2) the supervision in the case of insurance and reinsurance groups;
3) the reorganisation and winding-up of direct insurance undertakings.
Article 2
Scope
1. This Directive shall apply to direct life and non-life insurance undertakings which are established in the territory of a Member State or which wish to become established there. It shall also apply to reinsurance undertakings, which conduct only reinsurance activities, and which are established in the territory of a Member State or which wish to become established there with the exception of Title IV.

There are additional articles which we do not work with. The article is not a logical statement (an If, then statement), but identifies the matters which the directive is concerned with. Each statement of the article may be understood as a conjunct: the rules concern a, b, and c. However, this is not yet relevant to our analysis. See the separate post about rule extraction for conditionals.

Target result

We want to annotate the first article, picking out each section for extraction. In particular, for a practitioner to use the extraction, he should have it in a format which identifies the following:

Reference Code: Article 1
Title: Subject Matter
Level: 1.0
Description: This Directive lays down rules concerning the following:
Level: 1.1
Description: the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
Level: 1.2
Description: the supervision in the case of insurance and reinsurance groups;
Level: 1.3
Description: the reorganisation and winding-up of direct insurance undertakings;

Output

The output of GATE appears in the following figure:

Annotating the structure of legislative rules

GATE

To get this output, we used the files and application state in GATELegislativeRulebook.tar.gz.

Text

The text is a fragment of the legislation above and is found in the SmallRulebookText.tex file.

Lists

We use the following lists in addition to standard ANNIE lists, meaning that a lists.def file ought to incorporate the files. This is the resource ListGaz given in the .xgapp file (though this may require some additional fiddling and files to work).

  • roman_numerals_i-xx.lst: It has majorType = roman_numeral. This is a list of roman numbers from i to xx.
  • rulebooksectionlabel.lst: It has majorType = rulebooksection. This is a list of section headings such as: Subject matter, Scope, Statutory systems, Exclusion from scope due to size, Operations, Assistance, Mutual undertakings, Institutions, Operations and activities.

The list of section headings is taken from the legislation, which presumably follows standard guidelines for section heading labels. For the list of roman numerals, there are more general methods using Regex to match well-formed numerals (see Roman Numerals in Python and Regex for Roman Numerals); however, for our purposes it is simpler to use limited lists rather than Regex. In either case, several problems arise, as we see later.
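As a sketch of the Regex alternative, the following Python pattern accepts only well-formed lower-case roman numerals. The pattern is the commonly used decomposition into thousands, hundreds, tens, and units; the function names are our own, for illustration.

```python
import re

# Thousands, hundreds, tens, units; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)
VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def is_roman_numeral(token):
    """True if the token is a well-formed roman numeral (case-insensitive)."""
    return bool(ROMAN.match(token.lower()))


def roman_to_int(token):
    """Convert a well-formed roman numeral to an integer."""
    token = token.lower()
    if not is_roman_numeral(token):
        raise ValueError("not a well-formed roman numeral: %r" % token)
    total = 0
    for current, following in zip(token, token[1:] + " "):
        value = VALUES[current]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if following in VALUES and VALUES[following] > value:
            total -= value
        else:
            total += value
    return total
```

Note that ordinary words such as mix are also well-formed numerals, so a Regex alone does not remove the need for context. A check such as 1 <= roman_to_int(token) <= 20 restricts matches to the range i to xx covered by the list.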

JAPE rules

  • ListArticleSection.jape: What is annotated with Article (from the lookup) and a number is annotated ArticleFlag.
  • ListFlagLevel1.jape: A number followed by a period or closed parenthesis is annotated ListFlagLevel1.
  • ListFlagLevel1sub.jape: A number followed by a letter followed by a period is annotated ListFlagLevel1sub.
  • ListFlagLevel2.jape: A string of lower case letters followed by a closed parenthesis is annotated ListFlagLevel2.
  • ListFlagLevel3.jape: A roman number from a lookup list followed by a closed parenthesis is annotated ListFlagLevel3.
  • RuleBookSectionLabel.jape: Looks up section labels from a list and annotates them SectionType. For example, Subject matter, Scope, and Statutory systems.
  • ListStatement01.jape: A string which occurs between SectionType annotation and a colon is annotated ListStateTop.
  • ListStatement02.jape: A string which occurs between a ListFlagLevel1 and a semicolon is annotated SubListStatementPrefinal.
  • ListStatement03.jape: A string which occurs between a ListFlagLevel1 and a period is annotated SubListStatementFinal.
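The interplay of ListFlagLevel1 with the two sub-list statement rules can be sketched in Python. This is an illustration rather than the JAPE rules themselves: the regular expression, the function name annotate_list, and the assumption that each list item sits on its own line are ours.

```python
import re

# A number followed by a period or closed parenthesis at the start of a line.
LIST_FLAG_LEVEL1 = re.compile(r"^\s*(\d+)[.)]\s+")


def annotate_list(text):
    """Return (flag number, annotation name, statement) for each first-level item."""
    annotations = []
    for line in text.splitlines():
        match = LIST_FLAG_LEVEL1.match(line)
        if not match:
            continue
        statement = line[match.end():].strip()
        if statement.endswith(";"):
            label = "SubListStatementPrefinal"
        elif statement.endswith("."):
            label = "SubListStatementFinal"
        else:
            continue  # no closing punctuation: leave the line unannotated
        annotations.append((match.group(1), label, statement[:-1].strip()))
    return annotations
```

On the three list items of Article 1, the first two are labelled SubListStatementPrefinal and the third SubListStatementFinal. A numbered paragraph such as the one beginning "1. This Directive shall apply" in Article 2 would also be labelled SubListStatementFinal, which illustrates how easily such rules overgenerate.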

Application order

The order of application of the processing resources is:

  • Document Reset PR
  • ANNIE Sentence Splitter
  • ANNIE English Tokeniser
  • ListGaz
  • RulebookSectionLabel
  • ListArticleSection
  • ListStatement01
  • ListFlagLevel1
  • ListStatement02
  • ListStatement03

Additional issues

This example does not show the other list flag levels (e.g. using letters, roman numerals etc.), nor the results on other parts of the legislation.

While the result for the specific text is attractive, there is much work to be done. The lists and rules overgenerate. For example, the rules indicate that avrt is a level flag because v is recognised as a roman numeral. In other cases, too long a passage is selected as the statement at the top of the list. Yet, the example is still useful to demonstrate a proof of concept, particularly in conjunction with the post on XSLT.


Open Source Information Extraction: Data, Lists, Rules, and Development Environment

Wednesday, January 6th, 2010

Open source software development and standards are widely discussed and practiced. They have led to a range of useful applications and services. GATE is one such example.

However, one quickly learns that open source can easily mean open to a certain extent: GATE is open source, but the applications and additional functionalities that are developed with respect to GATE often are not. On the one hand, this makes perfect sense as the applications and functionalities are added value, labour intensive, and so on. On the other hand, the scientific community cannot verify, validate, or build on prior work unless the applications and functionalities are available. This can also hinder commercial development since closed development impedes progress, dissemination, and a common framework from which everyone benefits. It also does not recognise the fundamentally experimental aspect of information extraction. In contrast, the rapid growth and contributions of the natural (Biology, Physics, Chemistry, etc) or theoretical (Maths) sciences could only have occurred in an open, transparent development environment.

I advocate open source information extraction where an information extraction result can only be reported if it can be independently verified and built on by members of the scientific community. This means that the following must be made available concurrent with the report of the result:

  • Data and corpora
  • Lists (e.g. gazetteers)
  • Rules (e.g. JAPE rules)
  • Any additional processing components (e.g. information extraction to schemes or XSLT)
  • Development environment (e.g. GATE)

In other words, the results must be independently reproducible in full. The slogan is:

No publication without replicability.

This would:

  • Contribute to the research community and build on past developments.
  • Support teaching and learning.
  • Encourage interchange. The Semantic Web chokes on different formats.
  • Return academic research to the common (i.e. largely taxpayer funded) good, rather than leaving it owned by the researcher or university. If someone needs to keep their work private, they should work at a company.
  • Lead to distributive, collaborative research and results, reducing redundancy and increasing the scale and complexity of systems.

The knowledge bottleneck, particularly in relation to language, has not been and likely will not be solved by any one individual or research team. Open source information extraction will, I believe, make greater progress toward addressing it.

Obviously, money must be made somewhere. One source is public funding, including contributions from private organisations which see a value in building public infrastructure. Another source is, like other open source software, systems, or other public information, to make money “around” the free material by adding non-core goods, services, or advertising.


Natural Language Processing Techniques for Managing Legal Resources on the Semantic Web — Tutorial Slides

Sunday, December 20th, 2009

I gave a tutorial on natural language processing for legal resource management at the International Conference on Legal Information Systems (JURIX) 2009 in Rotterdam, The Netherlands. The slides are available below. Comments welcome.

The following people attended:

  • Andras Forhecz, Budapest University of Technology and Economics, Hungary
  • Ales Gola, Ministry of Interior of Czech Republic
  • Harold Hoffman, University Krems, Austria
  • Czeslaw Jedrzejek, Poznan University of Technology, Poland
  • Manuel Maarek, INRIA Grenoble, Rhone-Alpes
  • Michael Sonntag, Johannes Kepler University Linz, Austria
  • Vit Stastny, Ministry of Interior of Czech Republic

I thank the participants for their comments and look forward to continuing the discussions which we started in the tutorial.

The slides can be found at the link below. The file is 2.2MB. The slides were originally prepared using Open Office’s Impress, then converted to PowerPoint.

Natural Language Processing Techniques for Managing Legal Resources on the Semantic Web

There is a bit more in the slides than was presented at the tutorial, covering in addition ontologies, parsers, and semantic interpreters.

In the coming weeks, I will make available detailed instructions as well as gazetteers and JAPE rules. I also plan to continue adding text mining materials.

Instructions for GATE’s Onto Root Gazetteer

Tuesday, November 24th, 2009

In this post, I present User Manual notes for GATE’s Onto Root Gazetteer (ORG) and references to ORG. In Discussion of GATE’s Onto Root Gazetteer, I discuss aspects of Onto Root Gazetteer which I found interesting or problematic. These notes and discussion may be of use to those researchers in legal informatics who are interested in text mining and annotation for the semantic web.

Thanks to Diana Maynard, Danica Damljanovic, Phil Gooch, and the GATE User Manual for comments and materials which I have liberally used. Errors rest with me (and please tell me where they are so I can fix them!).

Purpose

Onto Root Gazetteer links text to an ontology by creating Lookup annotations which come from the ontology rather than a default gazetteer. The ontology is preprocessed to produce a flexible, dynamic gazetteer; that is, it is a gazetteer which takes into account alternative morphological forms and can be added to. An important advantage is that text can be annotated as an individual of the ontology, thus facilitating the population of the ontology.

Besides being flexible and dynamic, ORG has some advantages over other gazetteers:

  • It is more richly structured (think of it as a gazetteer containing other gazetteers).
  • It allows one to relate textual and ontological information by adding instances.
  • It gives one richer annotations that can be used for further processes.
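
To make the idea of a gazetteer derived from an ontology concrete, here is a minimal sketch in Python. It is not ORG's implementation: it assumes, as a simplification, that entries come only from camel-cased class names, and the helpers split_camel_case, naive_root, and build_gazetteer are hypothetical names of my own. The root function is a crude stand-in for a real morphological analyser.

```python
import re

def split_camel_case(name):
    """Split a resource name such as 'LanguageResource' into lowercase words."""
    words = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+", name)
    return [w.lower() for w in words]

def naive_root(word):
    """Very rough plural stripping; a stand-in for real morphological analysis."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def build_gazetteer(resource_names):
    """Map each root-normalised phrase to the ontology resource it came from."""
    gazetteer = {}
    for name in resource_names:
        phrase = " ".join(naive_root(w) for w in split_camel_case(name))
        gazetteer[phrase] = name
    return gazetteer

# Hypothetical class names, used only for illustration.
gazetteer = build_gazetteer(["LanguageResource", "ResourceParameter"])

# "Dynamic": a new entry can be added when the ontology gains a resource.
gazetteer["processing resource"] = "ProcessingResource"
print(gazetteer)
```

Because the keys are built from roots rather than surface forms, a later lookup that also works on roots can match "language resources" as well as "language resource".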

In the following, we present the step by step instructions for ‘rolling your own’, then show the results of the ‘prepackaged’ example that comes with the plugin.

Setup

Step 1. Add (if not already used) the Onto Root Gazetteer plugin to GATE following the usual plugin instructions.

Step 2. Add (if not already used) the Ontology Tools (OWLIM Ontology LR, OntoGazetteer, GATE Ontology Editor, OAT) plugin. ORG uses ontologies, so one must have these tools to load them as language resources.

Step 3. Create (or load) an ontology with OWLIM (see the instructions on the ontologies). This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology. It is important to note that OWLIM can only use OWL-Lite ontologies (see the documentation about this). Also, I succeeded in loading an ontology only from the resources folder of the Ontology_Tools plugin (rather than from another drive); I don’t know if this is significant.

Step 4. In GATE, create processing resources with default parameters:

  • Document Reset PR
  • RegEx Sentence Splitter (or ANNIE Sentence Splitter, but that one is likely to run slower)
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser

Step 5. When all these PRs are loaded, create an Onto Root Gazetteer PR and set its initial parameters. The mandatory ones are as follows (though some are set by default):

  • Ontology: select previously created myOntology
  • Tokeniser: select previously created Tokeniser
  • POSTagger: select previously created POS Tagger
  • Morpher: select previously created Morpher

Step 6. Create another PR, a Flexible Gazetteer. In its initial parameters, it is mandatory to select the previously created OntoRootGazetteer for gazetteerInst. For another parameter, inputFeatureNames, click on the button on the right and, when prompted with a window, add ‘Token.root’ in the provided text box, then click the Add button. Click OK, give a name to the new PR (optional), and then click OK.

Step 7. To create an application, right click on Applications, then New –> Pipeline (or Corpus Pipeline). Add the following PRs to the application in this order:

  • Document Reset PR
  • RegEx Sentence Splitter
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser
  • Flexible Gazetteer
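
The order of the PRs matters because later ones read what earlier ones write. The following Python sketch checks an ordering against a table of dependencies. The requires/provides sets are my own reading of how these PRs relate to one another (for example, that the Flexible Gazetteer needs Token.root from the Morphological Analyser); they are assumptions for illustration, not something read out of GATE, and check_order is a hypothetical helper.

```python
# Each stage: (PR name, annotations/features it needs, annotations/features it adds).
# The dependency sets are assumptions made for this illustration.
PIPELINE = [
    ("Document Reset PR", set(), set()),
    ("RegEx Sentence Splitter", set(), {"Sentence"}),
    ("ANNIE English Tokeniser", set(), {"Token"}),
    ("ANNIE POS Tagger", {"Sentence", "Token"}, {"Token.category"}),
    ("GATE Morphological Analyser", {"Token", "Token.category"}, {"Token.root"}),
    ("Flexible Gazetteer", {"Token.root"}, {"Lookup"}),
]

def check_order(pipeline):
    """Return (stage name, missing inputs) for every stage that runs too early."""
    available = set()
    problems = []
    for name, requires, provides in pipeline:
        missing = requires - available
        if missing:
            problems.append((name, sorted(missing)))
        available |= provides
    return problems

print(check_order(PIPELINE))  # no problems are reported for the order given above

# Moving the Flexible Gazetteer to the front leaves it without Token.root.
reordered = [PIPELINE[-1]] + PIPELINE[:-1]
print(check_order(reordered))
```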

Step 8. Run the application over the selected corpus.

Step 9. Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear.

Small Example

The ORG plugin comes with a demo application which sets up not only all the PRs and LRs (the text, corpus, and ontology) but also the application, ready to run. This is the file exampleApp.xgapp, which is in the resources folder of the plugin (Ontology_Based_Gazetteer). To load it, start GATE with a clean slate (no other PRs, LRs, or applications), right click on Applications, choose Restore application from file, and load the file from the folder just given.

The ontology used for illustration describes GATE itself, giving the classes, subclasses, and instances of the system. While the ontology is loaded along with the application, one can find it here. The text is simple (and comes with the application): language resources and parameters.

FIGURE 1 (missing at the moment)

FIGURE 2 (missing at the moment)

One can see that the phrase “language resources” is annotated with respect to the class LanguageResource, “resources” is annotated with GATEResource, and “parameters” is annotated with ResourceParameter. We discuss this further below.
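
A toy Python sketch shows how matching on roots, rather than on surface strings, can yield these three annotations for the example text. The ENTRIES table is an assumption: I have simply written down root phrases that would produce the annotations reported above, without claiming this is how ORG derives them from the ontology. The lookup function and the plural stripping are likewise hypothetical simplifications.

```python
# Assumed gazetteer entries: a tuple of roots mapped to an ontology class.
ENTRIES = {
    ("language", "resource"): "LanguageResource",
    ("resource",): "GATEResource",
    ("parameter",): "ResourceParameter",
}

def naive_root(word):
    """Strip a plural 's'; a crude stand-in for the Token.root feature."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def lookup(text, entries):
    """Return (matched text, class) pairs, comparing roots rather than surface forms."""
    tokens = text.lower().split()
    roots = [naive_root(t) for t in tokens]
    annotations = []
    for start in range(len(roots)):
        for phrase, cls in entries.items():
            end = start + len(phrase)
            if tuple(roots[start:end]) == phrase:
                annotations.append((" ".join(tokens[start:end]), cls))
    return annotations

for matched, cls in lookup("language resources and parameters", ENTRIES):
    print(matched, "->", cls)
```

Note that overlapping matches are all kept, which is why “resources” receives its own annotation even though it also falls inside “language resources”.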

One further aspect is important and useful. Since the ontology tools have been loaded and a particular ontology has been used, one can not only see the ontology (open the OAT tab in the window with the text), but one can annotate the text with respect to the ontology — highlight some text and a popup menu allows one to select how to annotate the text. With this, one can add instances (or classes) to the ontology.

Documentation

One can consult the following for further information about how the gazetteer is made, among other topics:

Discussion

See the related post Discussion of GATE’s Onto Root Gazetteer.
