Archive for the ‘legal knowledge engineering’ Category

Introduction to a Series of Posts on Legal Information Extraction with GATE

Wednesday, January 20th, 2010

This post has notes on and links to several other posts about legal information annotation and extraction using the General Architecture for Text Engineering system (GATE). The information in the posts was presented at my tutorial at JURIX 2009, Rotterdam, The Netherlands; the slides are available here. See the GATE website or my slides for introductory material about NLP and text annotation. For particulars about NLP and legal resources, see the posts and files at the links below.

The Posts

The following posts discuss different aspects of legal information extraction using GATE (live links indicate live posts):

Prototypes

The samples presented in the posts are prototypes only. No doubt there are other ways to accomplish similar tasks, the material is not as streamlined or cleanly presented as it could be, and each section is but a very small fragment of a much larger problem. In addition, there are better ways to present the lists and rules “in one piece”; however, during development and for discussion, it seems more helpful to have elements separate. Nonetheless, as a proof of concept, the samples make their point.

If there are any problems, contact Adam Wyner at adam@wyner.info.

Files

The posts are intended to be self-contained and to work with GATE 5.0. The archive files include the .xgapp file, which is a saved application state, along with text/corpus, the lists, and JAPE rules needed to run the application. In addition, the archive files include any graph outputs as reference. As noted, one may need to ‘fiddle’ a bit with the gazetteer lists in the current version.

Graphics

Graphics in the posts can be viewed in a larger and clearer size by right clicking on the graphic and selecting View Image. The Back button on your browser will close the image and return you to the post.

License

The materials are released under the following license:

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

If you want to commercially exploit the material, you must seek a separate license with me. That said, I look forward to further open development on these materials; see my post on Open Source Legal Information.

Legislative Rule Extraction

Wednesday, January 20th, 2010

In this post, we discuss the annotation of information from legislation, for example, to create a rule book from legislation. There are two distinct tasks and two tools. First, we want to take the original legislation and annotate it; for this, we use GATE. Second, we want to transform the output of GATE, using the annotations, into some alternative, web-compatible format; for this, we use EXtensible Stylesheet Language Transformations (XSLT). This is presented in STUB. John Cyriac of compliancetrack outlined the problem that is addressed in these two posts. (See introductory notes on this and related posts.)

Sample legislation and text

The text we are working with is a sample from Insurance and Reinsurance (Solvency II) from the European Parliament.

SUBJECT MATTER AND SCOPE
Article 1
Subject matter
This Directive lays down rules concerning the following:
1) the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
2) the supervision in the case of insurance and reinsurance groups;
3) the reorganisation and winding-up of direct insurance undertakings.
Article 2
Scope
1. This Directive shall apply to direct life and non-life insurance undertakings which are established in the territory of a Member State or which wish to become established there. It shall also apply to reinsurance undertakings, which conduct only reinsurance activities, and which are established in the territory of a Member State or which wish to become established there with the exception of Title IV.

There are additional articles which we do not work with. The article is not a logical statement (an If, then statement), but identifies the matters which the directive is concerned with. Each statement of the article may be understood as a conjunct: the rules concern a, b, and c. However, this is not yet relevant to our analysis. See the separate post about rule extraction for conditionals.

Target result

We want to annotate the first article, picking out each section for extraction. In particular, for a practitioner to use the extraction, he should have it in a format which identifies the following:

Reference Code: Article 1
Title: Subject Matter
Level: 1.0
Description: This Directive lays down rules concerning the following:
Level: 1.1
Description: the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
Level: 1.2
Description: the supervision in the case of insurance and reinsurance groups;
Level: 1.3
Description: the reorganisation and winding-up of direct insurance undertakings;
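
As a rough illustration of the target layout only (this is not the GATE application described below), the following Python sketch turns the Article 1 fragment into the Reference Code, Title, Level, and Description fields listed above. The function name article_to_records and the assumption that the article arrives as one heading line, one title line, one lead-in line, and then numbered items are ours, chosen to fit this sample.

```python
import re

# Sample text copied from the Article 1 fragment quoted in this post.
ARTICLE_TEXT = """Article 1
Subject matter
This Directive lays down rules concerning the following:
1) the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
2) the supervision in the case of insurance and reinsurance groups;
3) the reorganisation and winding-up of direct insurance undertakings.
"""


def article_to_records(text):
    """Turn one article into (field, value) pairs in the target layout.

    Assumes (for this sample only) the line order: heading, title,
    lead-in statement, then numbered list items.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    number = re.match(r"Article\s+(\d+)$", lines[0]).group(1)
    records = [
        ("Reference Code", lines[0]),
        ("Title", lines[1]),
        ("Level", f"{number}.0"),
        ("Description", lines[2]),
    ]
    for ln in lines[3:]:
        # A list item is a number followed by "." or ")" and then the text.
        item = re.match(r"(\d+)[.)]\s+(.*)$", ln)
        if item:
            records.append(("Level", f"{number}.{item.group(1)}"))
            records.append(("Description", item.group(2)))
    return records


records = article_to_records(ARTICLE_TEXT)
for field, value in records:
    print(f"{field}: {value}")
```

The sketch relies on line position, which is exactly what breaks on less regular legislation; the annotation approach with GATE is meant to be more robust than this.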

Output

The output of GATE appears in the following figure:

Annotating the structure of legislative rules

GATE

To get this output, we used the files and application state in GATELegislativeRulebook.tar.gz.

Text

The text is a fragment of the legislation above and is found in the SmallRulebookText.tex file.

Lists

We use the following lists in addition to standard ANNIE lists, meaning that a lists.def file ought to incorporate the files. This is the resource ListGaz given in the .xgapp file (though this may require some additional fiddling and files to work).

  • roman_numerals_i-xx.lst: It has majorType = roman_numeral. This is a list of roman numerals from i to xx.
  • rulebooksectionlabel.lst: It has majorType = rulebooksection. This is a list of section headings such as: Subject matter, Scope, Statutory systems, Exclusion from scope due to size, Operations, Assistance, Mutual undertakings, Institutions, Operations and activities.

The list of section headings is taken from the legislation, which presumably follows standard guidelines for section heading labels. For the list of roman numerals, there are more general methods using Regex to match well-formed numerals (see Roman Numerals in Python and Regex for Roman Numerals); however, for our purposes it is simpler to use limited lists rather than Regex. In either case, several problems arise, as we see later.
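
For comparison with the list approach, here is a minimal Python sketch of the Regex alternative, restricted to the same range as roman_numerals_i-xx.lst. The pattern and the helper name is_roman_i_to_xx are our own assumptions for illustration; they are not taken from the linked pages or from the GATE application.

```python
import re

# Hypothetical pattern: lower-case roman numerals from i (1) to xx (20),
# the same range as the list roman_numerals_i-xx.lst.
#   xx                      -> 20
#   x?(ix | iv | v?i{0,3})  -> 1..19 (also matches "", which is excluded below)
ROMAN_I_TO_XX = re.compile(r"(?:xx|x?(?:ix|iv|v?i{0,3}))")


def is_roman_i_to_xx(token):
    """True if the whole token is a well-formed numeral between i and xx."""
    return bool(token) and ROMAN_I_TO_XX.fullmatch(token) is not None


print([t for t in ["iv", "xix", "xx", "iiii", "vv", "xxi"] if is_roman_i_to_xx(t)])
```

Note that the check is applied to a whole token. A well-formed numeral such as v or i is still ambiguous with an ordinary letter, so neither the list nor the Regex settles by itself whether a given string is a list flag.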

JAPE rules

  • ListArticleSection.jape: The string Article (from the lookup) followed by a number is annotated ArticleFlag.
  • ListFlagLevel1.jape: A number followed by a period or closed parenthesis is annotated ListFlagLevel1.
  • ListFlagLevel1sub.jape: A number followed by a letter followed by a period is annotated ListFlagLevel1sub.
  • ListFlagLevel2.jape: A string of lower case letters followed by a closed parenthesis is annotated ListFlagLevel2.
  • ListFlagLevel3.jape: A roman numeral from a lookup list followed by a closed parenthesis is annotated ListFlagLevel3.
  • RuleBookSectionLabel.jape: Looks up section labels from a list and annotates them SectionType. For example, Subject matter, Scope, and Statutory systems.
  • ListStatement01.jape: A string which occurs between SectionType annotation and a colon is annotated ListStateTop.
  • ListStatement02.jape: A string which occurs between a ListFlagLevel1 and a semicolon is annotated SubListStatementPrefinal.
  • ListStatement03.jape: A string which occurs between a ListFlagLevel1 and a period is annotated SubListStatementFinal.
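
The JAPE files themselves are in the archive. As a loose approximation of what the four list flag rules describe, the patterns can be sketched in Python as follows. The regular expressions, the first-match ordering, and the function name classify_flag are our assumptions; JAPE rules work over annotations rather than raw strings and would each fire independently.

```python
import re

ROMAN_NUMERALS = (
    "i ii iii iv v vi vii viii ix x "
    "xi xii xiii xiv xv xvi xvii xviii xix xx"
).split()

# Checked in order; the first pattern that matches the whole token wins.
# Level 3 is tried before level 2 because a flag such as "i)" fits both.
FLAG_PATTERNS = [
    ("ListFlagLevel1sub", re.compile(r"\d+[a-z]\.")),   # e.g. "3a."
    ("ListFlagLevel1", re.compile(r"\d+[.)]")),         # e.g. "1)" or "2."
    ("ListFlagLevel3", re.compile(r"(?:%s)\)" % "|".join(ROMAN_NUMERALS))),
    ("ListFlagLevel2", re.compile(r"[a-z]+\)")),        # e.g. "b)"
]


def classify_flag(token):
    """Return the name of the first flag pattern matching the token, or None."""
    for name, pattern in FLAG_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return None


for token in ["1)", "2.", "3a.", "b)", "iv)", "i)", "Article"]:
    print(token, classify_flag(token))
```

The ordering hides a real ambiguity: "i)" may be the ninth item of a lettered list or the first item of a roman numbered one, and only the surrounding list can decide which.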

Application order

The order of application of the processing resources is:

  • Document Reset PR
  • ANNIE Sentence Splitter
  • ANNIE English Tokeniser
  • ListGaz
  • RulebookSectionLabel
  • ListArticleSection
  • ListStatement01
  • ListFlagLevel01
  • ListStatement02
  • ListStatement03
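
The order matters because later resources read annotations written by earlier ones. The Python sketch below checks that property for the order listed above. The annotation types each resource reads and writes are assumptions inferred from the rule descriptions in this post, not read out of the .xgapp file, and the function name unmet_inputs is ours.

```python
# Each entry: (processing resource, annotation types read, annotation types written).
# The reads and writes are assumptions inferred from the rule descriptions above.
PIPELINE = [
    ("Document Reset PR", set(), set()),
    ("ANNIE Sentence Splitter", set(), {"Sentence"}),
    ("ANNIE English Tokeniser", set(), {"Token"}),
    ("ListGaz", set(), {"Lookup"}),
    ("RulebookSectionLabel", {"Lookup"}, {"SectionType"}),
    ("ListArticleSection", {"Lookup"}, {"ArticleFlag"}),
    ("ListStatement01", {"SectionType"}, {"ListStateTop"}),
    ("ListFlagLevel01", {"Token"}, {"ListFlagLevel1"}),
    ("ListStatement02", {"ListFlagLevel1"}, {"SubListStatementPrefinal"}),
    ("ListStatement03", {"ListFlagLevel1"}, {"SubListStatementFinal"}),
]


def unmet_inputs(pipeline):
    """List (resource, missing types) for resources run before their inputs exist."""
    available = set()
    problems = []
    for name, reads, writes in pipeline:
        missing = reads - available
        if missing:
            problems.append((name, sorted(missing)))
        available |= writes
    return problems


print(unmet_inputs(PIPELINE))
```

Under these assumptions the listed order has no unmet inputs, whereas moving ListStatement02 ahead of ListFlagLevel01 would leave it with nothing to match against.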

Additional issues

This example does not show the other list flag levels (e.g. using letters, roman numerals etc.), nor the results on other parts of the legislation.

While the result for the specific text is attractive, there is much work to be done. The lists and rules overgenerate. For example, the rules indicate that avrt is a level flag because v is recognised as a roman numeral. In other cases, too long a passage is selected as the statement at the top of the list. Yet, the example is still useful to demonstrate a proof of concept, particularly in conjunction with the post on XSLT.
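
To make the overgeneration concrete, the Python sketch below contrasts a lookup that accepts a roman numeral anywhere in the text with one that requires the numeral to open a line and be followed by a closed parenthesis. Both patterns and both function names are hypothetical; this is one possible guard, and not necessarily how the constraint would best be expressed in JAPE.

```python
import re

# Longest alternatives first so that, for example, "viii" is preferred to "v".
ROMAN = "xx|xix|xviii|xvii|xvi|xv|xiv|xiii|xii|xi|x|ix|viii|vii|vi|v|iv|iii|ii|i"

# Naive: a numeral anywhere, even inside a longer word.
NAIVE = re.compile(r"(?:%s)" % ROMAN)
# Guarded: the numeral must open a line and be followed by ")" and whitespace.
GUARDED = re.compile(r"^\s*(%s)\)\s" % ROMAN, re.MULTILINE)


def naive_flags(text):
    """All numeral-like substrings, including those inside ordinary words."""
    return NAIVE.findall(text)


def guarded_flags(text):
    """Only numerals that look like line-initial list flags."""
    return GUARDED.findall(text)


print(naive_flags("avrt"))
print(guarded_flags("iv) the supervision of avrt groups;\n"))
```

The guard removes the match inside avrt, but it also assumes that list flags always start a line, which will not hold for every piece of legislation.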

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Discussion with Jeremy Tobias-Tarsh of Practical Law Company

Monday, January 18th, 2010

On Wednesday January 13, 2010, I had a meeting with Jeremy Tobias-Tarsh, director of Practical Law Company (PLC) and currently in charge of overseeing the company’s three-year development plan. We had a very engaging, far-ranging discussion about the company’s interests in technological innovation in the legal domain. His colleagues at the meeting were Brigitte Kaltenbacher, who works on usability tests for searches among the company’s resources, and Sara Stangalini, who works with Brigitte.

The post gives an overview of our discussion — what PLC does, the ambitions for the future, a range of issues and tools to handle them, and some suggestions about moving ahead.

About PLC

PLC provides know-how for lawyers, meaning written analysis of current legal developments, practice notes (legal situations lawyers face and how the law treats them), standard draft documents, and checklists for managing actions. The services cover a range of legal areas such as arbitration, competition, corporate, construction, employment, finance, pensions, tax, and so on.

Jeremy spoke of an ambition at the company to use Semantic Web technologies on the company’s resources in order to give users faster, more precise, more meaningful and relevant results for searches in the resources — making the company’s content more findable. This might be done by annotating the content of the resources and supporting search with respect to the annotations. (Along these lines, an important advantage is that the company has been using an XML editor (Epic) for its documents for some time, so there is broad and widespread familiarity with what XML offers.)

Similarly, PLC could develop tools which improve the searches among a law firm’s documents. This is especially crucial where searches are done by junior staff with less knowledge of how and where to search. As made clear in discussions of knowledge management in law firms, an important task of senior lawyers in a firm is to train the new and junior lawyers in the details of the practice. While law schools may train law students in legal analysis and the law, the students may be unprepared for how to practice, which may have less to do with the law and more to do with finding and working with the relevant documents.

Any technology which can support junior lawyers in learning their tasks would be an advantage. In addition, any technology which could encode a senior lawyer’s knowledge would be useful to share throughout the firm and to preserve that knowledge where the lawyer is unavailable.

Some Sample Problems and Tools

Contracts

An instance of such a tool might apply to contracts. PLC and firms have catalogues of preformatted draft documents, each of which may have variants developed over time. This may be seen as a contract base. A junior lawyer may be asked to find among this contract base a contract which is either an exact match for the current circumstances or close enough so that with some modifications it would suit. This can be viewed as an instance of case based reasoning, where the ‘factors’ are the particulars of the contracts and the current contractual setting. So, not only must there be some way to match similarity and difference among the documents, but there ought also to be some systematic way to manage the modifications.

To address this, three technologies could be used. Contracts could be annotated with the factors, and case based reasoning then applied to the annotations. Alternatively, contracts could be linked to an ontology, so that the properties and relationships among the documents are made explicit. Researchers could search for the relevant documents using the ontology. Along with this, a contract modification tracking system, such as a modified version of which meets the MetaLex standard, could be developed.
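
As a toy illustration of the first option, the Python sketch below treats each contract as a set of factors and retrieves the closest one from a contract base, together with the factors that would have to be added or removed. The factor names, the contract names, and the use of Jaccard overlap as the similarity measure are all invented for the example; a real case based reasoning system would weigh factors rather than simply count them.

```python
def jaccard(a, b):
    """Share of factors two contracts have in common (1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical contract base: contract name -> annotated factors.
CONTRACT_BASE = {
    "employment_v1": {"fixed_term", "uk_jurisdiction", "non_compete"},
    "employment_v2": {"permanent", "uk_jurisdiction", "non_compete", "share_options"},
    "supply_v1": {"fixed_term", "uk_jurisdiction", "delivery_schedule"},
}


def closest_contract(current, base):
    """Return (name, score, factors to add, factors to remove) for the best match."""
    name = max(sorted(base), key=lambda n: jaccard(current, base[n]))
    factors = base[name]
    return name, jaccard(current, factors), current - factors, factors - current


current_setting = {"permanent", "uk_jurisdiction", "non_compete"}
print(closest_contract(current_setting, CONTRACT_BASE))
```

The two difference sets are a first step toward managing modifications systematically: they state what must change in the retrieved draft to fit the current setting.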

Due Diligence

Another problem relates to due diligence. Law firms are up against constraints of time and money in satisfying the requirements of due diligence. Firms are increasingly responsible for showing due diligence in a wider range of areas. This means that more lawyers must be hired and more billable hours accrued. However, the companies which hire the law firms are reluctant to pay more for due diligence. Consequently, firms have a motivation to find ways to make due diligence more efficient. Moreover, it is not a task that junior lawyers can easily undertake without extensive training. Natural language expert systems might provide a useful technology.

Policy Consultations

We also had a discussion about policy consultations. PLC helped form and serves as secretariat for the General Counsel 100 Group, which is comprised of senior legal officers drawn from FTSE 100 companies. The group is a forum for businesses to give input on policy consultations and to share best practices in law, risk management, compliance, and other common interests (see the various public papers on the link). In my EU Framework 7 proposal on argumentation, we explicitly referred to policy consultation as a key area to develop and apply the tool. Broadly speaking, we had a systematic plan to develop a tool which takes as input statements in natural language, then translates them into a logical formalism. Claims pro and con on a particular issue are systematically structured into an ‘argument’ network in order to ‘prove’ outcomes given premises as well as to provide sets of consistent statements for and against a claim. Other argument mapping technologies might be useful here as well.

Ontologies

We also talked about the development of ontologies and whether they can be automatically extracted from textual sources. This is an area where there is a lot of current interest and some significant progress.

Moving Ahead

Finally, we also touched on how to move ahead. A brainstorming and road-mapping exercise could be a very valuable experience. The exercise would include not only company representatives, but also clients served by PLC. Parties on ‘both sides of the fence’ could discover more about what they know, want, and imagine could be done. In addition, Jeremy suggested that I might be engaged to present some of the ‘main points’ about Semantic Web technologies and the law to some of PLC’s editors and clients.

It was an enjoyable and spirited discussion, which I hope we will find the opportunity in the near future to continue.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

CFP: Workshop on Semantic Processing of Legal Texts

Tuesday, January 12th, 2010

LREC 2010 Workshop on

SEMANTIC PROCESSING OF LEGAL TEXTS (SPLeT-2010)

CALL FOR PAPERS

23 May 2010, Malta

Workshop description

The legal domain represents a primary candidate for web-based information distribution, exchange and management, as testified by the numerous e-government, e-justice and e-democracy initiatives worldwide. The last few years have seen a growing body of research and practice in the field of Artificial Intelligence and Law which addresses a range of topics: automated legal reasoning and argumentation, semantic and cross-language legal information retrieval, document classification, legal drafting, legal knowledge discovery and extraction, as well as the construction of legal ontologies and their application to the law domain. In this context, it is of paramount importance to use Natural Language Processing techniques and tools that automate and facilitate the process of knowledge extraction from legal texts.

Within the last two years, a number of dedicated workshops and tutorials focussing on different aspects of semantic processing of legal texts have demonstrated the current interest in research on Artificial Intelligence and Law in combination with Language Resources (LR) and Human Language Technologies (HLT). The LREC 2008 Workshop on “Semantic processing of legal texts” was held in Marrakech, Morocco, on the 27th of May 2008. The JURIX 2008 Workshop on “the Natural Language Engineering of Legal Argumentation: Language, Logic, and Computation (NaLEA)” focussed on recent advances in natural language engineering and legal argumentation. The ICAIL 2009 Workshops were “LOAIT ’09 – the 3rd Workshop on Legal Ontologies and Artificial Intelligence Techniques joint with the 2nd Workshop on Semantic Processing of Legal Texts” and “NALEA’09 – Workshop on the Natural Language Engineering of Legal Argumentation: Language, Logic, and Computation”, the former focussing on Legal Knowledge Representation with particular emphasis on the issue of ontology acquisition from legal texts, the latter tackling issues related to legal argumentation and linguistic technologies.

To continue this momentum, a 3rd Workshop on “Semantic Processing of Legal Texts” is being organised at the Language Resources and Evaluation Conference to bring to the attention of the broader LR/HLT community the specific technical challenges posed by the semantic processing of legal texts and also share with the community the motivations and objectives which make it of interest to researchers in legal informatics. The outcome of these interactions is expected to advance research and applications and foster interdisciplinary collaboration within the legal domain.

The main goals of the workshop are to provide an overview of the state-of-the-art in legal knowledge extraction and management, to explore new research and development directions and emerging trends, and to exchange information regarding legal LRs and HLTs and their applications.

Areas of Interest

The workshop will focus on the topics of the automatic extraction of information from legal texts and the structural organisation of the extracted knowledge. Particular emphasis will be given to the crucial role of language resources and human language technologies.

Papers are invited on, but not limited to, the following topics:

  • Building legal resources: terminologies, ontologies, corpora
  • Ontologies of legal texts, including subareas such as ontology acquisition, ontology customisation, ontology merging, ontology extension, ontology evolution, lexical information, etc.
  • Information retrieval and extraction from legal texts
  • Semantic annotation of legal texts
  • Legal text processing
  • Multilingual aspects of legal text semantic processing
  • Legal thesauri mapping
  • Automatic Classification of legal documents
  • Logical analysis of legal language
  • Automated parsing and translation of natural language arguments into a logical formalism
  • Linguistically-oriented XML mark up of legal arguments
  • Dialogue protocols for argumentation
  • Legal argument ontology
  • Computational theories of argumentation that are suitable to natural language
  • Controlled language systems for law.

Submissions

Submissions are solicited from researchers working on all aspects of semantic processing of legal texts. Authors are invited to submit papers describing original completed work, work in progress, interesting problems, case studies or research trends related to one or more of the topics of interest listed above. The final version of the accepted papers will be published in the Workshop Proceedings.

Short or full papers can be submitted. Short papers are expected to present new ideas or new visions that may influence the direction of future research, yet they may be less mature than full papers. While an exhaustive evaluation of the proposed ideas is not necessary, insight and in-depth understanding of the issues are expected. Full papers should be better developed and evaluated. Short papers will be reviewed the same way as full papers by the Program Committee and will be published in the Workshop Proceedings.

Full paper submissions should not exceed 10 pages, short papers 6 pages; both should be typeset using a font size of 11 points. Style files will be made available by LREC for the camera-ready versions of accepted papers. Papers should be submitted electronically, no later than February 10, 2010. The only accepted format for submitted papers is Adobe PDF. Submission will be electronic using START paper submission software available at

SPLeT 2010 Workshop

Note that when submitting a paper through the START page, authors will be kindly asked to provide relevant information about the resources that have been used for the work described in their paper or that are the outcome of their research. In this way, authors will contribute to the LREC2010 Map, our new feature for LREC 2010. For further information on this initiative, please refer to

LREC2010 Map of Language Resources

Important Dates

Paper submission deadline: 10 February 2010
Acceptance notification sent: 5 March 2010
Final version deadline: 21 March 2010
Workshop date: 23 May 2010

Workshop Chairs

  • Enrico Francesconi (Istituto di Teoria e Tecniche dell’Informazione Giuridica of CNR, Florence, Italy)
  • Simonetta Montemagni (Istituto di Linguistica Computazionale of CNR, Pisa, Italy)
  • Wim Peters (Natural Language Processing Research Group, University of Sheffield, UK)
  • Adam Wyner (Department of Computer Science, University College London, UK)

Address any queries regarding the workshop to: lrec10_legalWS@ilc.cnr.it

Program Committee

  • Johan Bos (University of Rome, Italy)
  • Danièle Bourcier (Humboldt Universität, Berlin, Germany)
  • Thomas R. Bruce (Cornell Law School, Ithaca, NY, USA)
  • Pompeu Casanovas (Institut de Dret i Tecnologia, UAB, Barcelona, Spain)
  • Alessandro Lenci (Dipartimento di Linguistica, Università di Pisa, Pisa, Italy)
  • Leonardo Lesmo (Dipartimento di Informatica, Università di Torino, Torino, Italy)
  • Raquel Mochales Palau (Catholic University of Leuven, Belgium)
  • Paulo Quaresma (Universidade de Évora, Portugal)
  • Erich Schweighofer (Universität Wien, Rechtswissenschaftliche Fakultät, Wien, Austria)
  • Manfred Stede (University of Potsdam, Germany)
  • Daniela Tiscornia (Istituto di Teoria e Tecniche dell’Informazione Giuridica of CNR, Florence, Italy)
  • Tom van Engers (Leibniz Center for Law, University of Amsterdam, Netherlands)
  • Radboud Winkels (Leibniz Center for Law, University of Amsterdam, Netherlands)

Open Source Information Extraction: Data, Lists, Rules, and Development Environment

Wednesday, January 6th, 2010

Open source software development and standards are widely discussed and practiced. They have led to a range of useful applications and services. GATE is one such example.

However, one quickly learns that open source can easily mean open to a certain extent: GATE is open source, but the applications and additional functionalities that are developed with respect to GATE often are not. On the one hand, this makes perfect sense as the applications and functionalities are added value, labour intensive, and so on. On the other hand, the scientific community cannot verify, validate, or build on prior work unless the applications and functionalities are available. This can also hinder commercial development since closed development impedes progress, dissemination, and a common framework from which everyone benefits. It also does not recognise the fundamentally experimental aspect of information extraction. In contrast, the rapid growth and contributions of the natural (Biology, Physics, Chemistry, etc) or theoretical (Maths) sciences could only have occurred in an open, transparent development environment.

I advocate open source information extraction where an information extraction result can only be reported if it can be independently verified and built on by members of the scientific community. This means that the following must be made available concurrent with the report of the result:

  • Data and corpora
  • Lists (e.g. gazetteers)
  • Rules (e.g. JAPE rules)
  • Any additional processing components (e.g. information extraction to schemes or XSLT)
  • Development environment (e.g. GATE)

In other words, the results must be independently reproducible in full. The slogan is:

No publication without replication.

This would:

  • Contribute to the research community and build on past developments.
  • Support teaching and learning.
  • Encourage interchange. The Semantic Web chokes on different formats.
  • Return academic research to the common (i.e. largely taxpayer funded) good rather than owned by the researcher or university. If someone needs to keep their work private, they should work at a company.
  • Lead to distributive, collaborative research and results, reducing redundancy and increasing the scale and complexity of systems.

The knowledge bottleneck, particularly in relation to language, has not been and likely will not be solved by any one individual or research team. Open source information extraction will, I believe, make greater progress toward addressing it.

Obviously, money must be made somewhere. One source is public funding, including contributions from private organisations which see a value in building public infrastructure. Another source, as with other open source software, systems, or public information, is to make money “around” the free material by adding non-core goods, services, or advertising.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Research on Argumentation at the Leibniz Center for Law in Amsterdam

Monday, January 4th, 2010

I have a 3 month research job at the Leibniz Center for Law, University of Amsterdam starting February 1 and working with Tom van Engers. This is part of the IMPACT project:

IMPACT is an international project, partially funded by the European Commission under the 7th framework programme. It will conduct original research to develop and integrate formal, computational models of policy and arguments about policy, to facilitate deliberations about policy at a conceptual, language-independent level. To support the analysis of policy proposals in an inclusive way which respects the interests of all stakeholders, research on tools for reconstructing arguments from data resources distributed throughout the Internet will be conducted. The key problem is translation from these sources in natural language to formal argumentation structures, which will be input for automatic reasoning.

My role will be to set up a Ph.D. research project concerning the key problem. This is based on an unsuccessful larger research proposal that I made with Tom. I’ll be organising the database, the literature, some of the software, and outlining the approach the student would take. I’ll make notes on the progress as it happens.

I’m looking forward to living for a while in Amsterdam, working with Tom and my other colleagues at the center: Joost Breuker, Rinke Hoekstra, and Emile de Maat. The Netherlands also has a very lively Department of Argumentation Theory. As an added bonus, my colleagues from Linguistics, Susan Rothstein and Fred Landman, are in Amsterdam on sabbatical. It will be a very interesting and fun period.

Natural Language Processing Techniques for Managing Legal Resources on the Semantic Web — Tutorial Slides

Sunday, December 20th, 2009

I gave a tutorial on natural language processing for legal resource management at the International Conference on Legal Information Systems (JURIX) 2009 in Rotterdam, The Netherlands. The slides are available below. Comments welcome.

The following people attended:

  • Andras Forhecz, Budapest University of Technology and Economics, Hungary
  • Ales Gola, Ministry of Interior of Czech Republic
  • Harold Hoffman, University Krems, Austria
  • Czeslaw Jedrzejek, Poznan University of Technology, Poland
  • Manuel Maarek, INRIA Grenoble, Rhone-Alpes
  • Michael Sonntag, Johannes Kepler University Linz, Austria
  • Vit Stastny, Ministry of Interior of Czech Republic

I thank the participants for their comments and look forward to continuing the discussions which we started in the tutorial.

At the link, one can find the slides. Comments are very welcome. The file is 2.2MB. The slides were originally prepared using Open Office’s Impress, then converted to PowerPoint.

Natural Language Processing Techniques for Managing Legal Resources on the Semantic Web

There is a bit more in the slides than was presented at the tutorial, covering in addition ontologies, parsers, and semantic interpreters.

In the coming weeks, I will make available additional detailed instructions as well as gazetteers and JAPE rules. I also plan to continue to add additional text mining materials.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Annotating Rules in Legislation

Friday, November 27th, 2009

Over the last couple of months, I have had discussions about text mining and annotating rules in legislation with several people (John Sheridan of The Office of Public Sector Information, Richard Goodwin of The Stationery Office, and John Cyriac of Compliance Track). While nothing yet concrete has resulted from these discussions, it is clearly a “hot topic”.

In the course of these discussions, I prepared a short outline of the issues and approaches, which I present below. Comments, suggestions, and collaborations are welcome.

Vision, context, and objectives

One of the main visions of artificial intelligence and law has been to develop a legislative processing tool. Such a tool has several related objectives:

      [1.] To guide the drafter to write well-formed legal rules in natural language.
      [2.] To automatically parse and semantically represent the rules.
      [3.] To automatically identify and annotate the rules so that they can be extracted from a corpus of legislation for web-based applications.
      [4.] To enable inference, modeling, and consistency testing with respect to the rules.
      [5.] To reason with respect to domain knowledge (an ontology).
      [6.] To serve the rules on the web so that users can use natural language to input information and receive determinations.

While no such tool exists, there has been steady progress on understanding the problems and developing working software solutions. In early work (see The British nationality act as a logic program (1986)), an act was manually translated into a program, allowing one to draw inferences given ground facts. Haley is a software and service company which provides a framework which partially addresses 1, 2, 4, and 6 (see Policy Automation). Some research addresses aspects of 3 (see LKIF-Core Ontology). Finally, there are XML annotation schemas for legislation (and related input support) such as The Crown XML Schema for Legislation and Akoma Ntoso, both of which require manual input. Despite these advances, there is much progress yet to be made. In particular, no results fulfill [3.].

In consideration of [3.], the primary objective of this proposal is to use the General Architecture for Text Engineering (GATE) framework in order to automatically identify and annotate legislative rules from a corpus. The annotation should support web-based applications and be consistent with semantic web mark ups for rules, e.g. RuleML. A subsidiary objective is to define an authoring template which can be used within existing authoring applications to manually annotate legislative rules.

Benefits

Attaining these objectives would:

  • Support automated creation, maintenance, and distribution of rule books for compliance.
  • Contribute to the development of a legislative processing tool.
  • Make legislative rules accessible for web-based applications. For example, given other annotations, one could identify rules that apply with respect to particular individuals in an organisation along with relevant dates, locations, etc.
  • Enable further processing of the rules such as removing formatting, parsing the content of the rules, and representing them semantically.
  • Allow an inference engine to be applied over the formalised rule base.
  • Make legislation more transparent and communicable among interested parties such as government departments, EU governments, and citizenry.

Scope

To attain the objectives, we propose the following phases, where the numbers represent weeks of effort:

  • Create a relatively small sample corpus to scope the study.
  • Manually identify the forms of legislative rules within the corpus.
  • Develop or adapt an annotation scheme for rules.
  • Apply the analysis tools of GATE and annotate the rules.
  • Validate that GATE annotates the rules as intended.
  • Apply the annotation system to a larger corpus of documents.

For each phase, we would produce a summary of results, noting where difficulties are encountered and how they might be addressed.

Extending the work

The work can be extended in a variety of ways:

  • Apply the GATE rules to a larger corpus with more variety of rule forms.
  • Process the rules for semantic representation and inference.
  • Take into consideration defeasibility and exceptions.
  • Develop semantic web applications for the rules.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Instructions for GATE’s Onto Root Gazetteer

Tuesday, November 24th, 2009

In this post, I present User Manual notes for GATE’s Onto Root Gazetteer (ORG) and references to ORG. In Discussion of GATE’s Onto Root Gazetteer, I discuss aspects of Onto Root Gazetteer which I found interesting or problematic. These notes and discussion may be of use to those researchers in legal informatics who are interested in text mining and annotation for the semantic web.

Thanks to Diana Maynard, Danica Damljanovic, Phil Gooch, and the GATE User Manual for comments and materials which I have liberally used. Errors rest with me (and please tell me where they are so I can fix them!).

Purpose

Onto Root Gazetteer links text to an ontology by creating Lookup annotations which come from the ontology rather than a default gazetteer. The ontology is preprocessed to produce a flexible, dynamic gazetteer; that is, it is a gazetteer which takes into account alternative morphological forms and can be added to. An important advantage is that text can be annotated as an individual of the ontology, thus facilitating the population of the ontology.

Besides being flexible and dynamic, ORG has some advantages over other gazetteers:

  • It is more richly structured (think of it as a gazetteer containing other gazetteers).
  • It allows one to relate textual and ontological information by adding instances.
  • It gives one richer annotations that can be used for further processes.
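
To make the idea of a gazetteer derived from an ontology concrete, here is a minimal sketch in Python. It is a toy model, not ORG's code or algorithm. The ontology fragment, the instance name, the camel-case splitting, and the helper names are all assumptions made for illustration; only the two class names echo the GATE demo ontology.

```python
import re

# Hypothetical ontology fragment: class name -> list of instance names.
# The class names echo the GATE demo ontology; the instance is invented.
ONTOLOGY = {
    "LanguageResource": ["GATE Corpus"],
    "ResourceParameter": [],
}


def normalise(name):
    """Split a camel-cased or spaced name into lower-case words.

    'LanguageResource' -> 'language resource' (an assumed heuristic).
    """
    words = re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])", name)
    return " ".join(word.lower() for word in words)


def build_gazetteer(ontology):
    """Map each normalised label to (ontology name, kind)."""
    entries = {}
    for cls, instances in ontology.items():
        entries[normalise(cls)] = (cls, "class")
        for inst in instances:
            entries[normalise(inst)] = (inst, "instance")
    return entries


gazetteer = build_gazetteer(ONTOLOGY)

# "Dynamic": adding an instance to the ontology and rebuilding the
# gazetteer makes the new instance available for lookup.
ONTOLOGY["LanguageResource"].append("Legal Corpus")  # invented instance
updated = build_gazetteer(ONTOLOGY)
```

In this sketch every gazetteer entry keeps a pointer back to the class or instance it came from, which is what lets a lookup in text be related to the ontology rather than to a flat list.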

In the following, we present step-by-step instructions for ‘rolling your own’, then show the results of the ‘prepackaged’ example that comes with the plugin.

Setup

Step 1. Add (if not already used) the Onto Root Gazetteer plugin to GATE following the usual plugin instructions.

Step 2. Add (if not already used) the Ontology Tools (OWLIM Ontology LR, OntoGazetteer, GATE Ontology Editor, OAT) plugin. ORG uses ontologies, so one must have these tools to load them as language resources.

Step 3. Create (or load) an ontology with OWLIM (see the instructions on the ontologies). This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology. It is important to note that OWLIM can only use OWL-Lite ontologies (see the documentation about this). Also, I succeeded in loading an ontology only from the resources folder of the Ontology_Tools plugin (rather than from another drive); I don’t know if this is significant.

Step 4. In GATE, create processing resources with default parameters:

  • Document Reset PR
  • RegEx Sentence Splitter (or ANNIE Sentence Splitter, but that one is likely to run slower)
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser

Step 5. When all these PRs are loaded, create an Onto Root Gazetteer PR and set its initial parameters. The mandatory ones are as follows (though some are set by default):

  • Ontology: select previously created myOntology
  • Tokeniser: select previously created Tokeniser
  • POSTagger: select previously created POS Tagger
  • Morpher: select previously created Morpher.

Step 6. Create another PR, a Flexible Gazetteer. Among the initial parameters, it is mandatory to select the previously created OntoRootGazetteer for gazetteerInst. For the parameter inputFeatureNames, click on the button on the right and, when prompted with a window, add ‘Token.root’ in the text box provided, then click the Add button. Click OK, give a name to the new PR (optional), and then click OK.

Step 7. To create an application, right click on Applications, then New –> Pipeline (or Corpus Pipeline). Add the following PRs to the application in this order:

  • Document Reset PR
  • RegEx Sentence Splitter
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser
  • Flexible Gazetteer

Step 8. Run the application over the selected corpus.

Step 9. Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear.
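
Why the order in Step 7 matters, and why the Flexible Gazetteer is pointed at Token.root, can be sketched with a toy pipeline in Python. This is not GATE code: the stage functions, the crude plural stripping, and the two gazetteer entries are hypothetical stand-ins, and the sentence splitter and POS tagger are left out for brevity.

```python
# Toy stand-ins for the processing resources; none of this is GATE code.

# Hypothetical gazetteer of root forms -> ontology class names.
GAZETTEER = {"resource": "GATEResource", "parameter": "ResourceParameter"}


def reset(doc):
    """Clear earlier annotations, like a document reset."""
    doc["tokens"] = []
    doc["lookups"] = []
    return doc


def tokenise(doc):
    """Whitespace tokeniser: each token records its surface string."""
    doc["tokens"] = [{"string": word} for word in doc["text"].split()]
    return doc


def morph(doc):
    """Add a 'root' feature by crudely stripping a plural 's'."""
    for tok in doc["tokens"]:
        s = tok["string"].lower()
        tok["root"] = s[:-1] if s.endswith("s") and len(s) > 3 else s
    return doc


def flexible_gazetteer(doc, feature="root"):
    """Look up the chosen token feature instead of the raw text."""
    for i, tok in enumerate(doc["tokens"]):
        key = tok.get(feature)
        if key in GAZETTEER:
            doc["lookups"].append((i, tok["string"], GAZETTEER[key]))
    return doc


def run(pipeline, text):
    """Apply the stages in order and return the Lookup-like results."""
    doc = {"text": text}
    for stage in pipeline:
        doc = stage(doc)
    return doc["lookups"]


text = "language resources and parameters"

# Correct order: the roots exist before the gazetteer runs.
in_order = run([reset, tokenise, morph, flexible_gazetteer], text)

# Gazetteer before the morphological step: no roots yet, so no matches.
out_of_order = run([reset, tokenise, flexible_gazetteer, morph], text)

# Matching on the surface string misses the plural forms.
on_string = run(
    [reset, tokenise, morph, lambda d: flexible_gazetteer(d, feature="string")],
    text,
)
```

Here in_order contains matches for “resources” and “parameters”, while out_of_order and on_string are both empty: the lookup depends on a root feature that only exists once the morphological step has run.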

Small Example

The ORG plugin comes with a demo application which sets up not only all the PRs and LRs (the text, corpus, and ontology) but also the application itself, ready to run. This is the file exampleApp.xgapp, which is in the resources folder of the plugin (Ontology_Based_Gazetteer). To run it, start GATE with a clean slate (no other PRs, LRs, or applications), right click on Applications, select Restore application from file, and load the file from the folder just given.

The ontology used for illustration is for GATE itself, giving the classes, subclasses, and instances of the system. While the ontology is loaded along with the application, one can find it here. The text is simple (and comes with the application): language resources and parameters.

FIGURE 1 (missing at the moment)

FIGURE 2 (missing at the moment)

One can see that the phrase “language resources” is annotated with respect to the class LanguageResource, “resources” is annotated with GATEResource, and “parameters” is annotated with ResourceParameter. We discuss this further below.

One further aspect is important and useful. Since the ontology tools have been loaded and a particular ontology has been used, one can not only see the ontology (open the OAT tab in the window with the text), but one can annotate the text with respect to the ontology — highlight some text and a popup menu allows one to select how to annotate the text. With this, one can add instances (or classes) to the ontology.

Documentation

One can consult the following for further information about how the gazetteer is made, among other topics:

Discussion

See the related post Discussion of GATE’s Onto Root Gazetteer.


Meeting with John Sheridan on the Semantic Web and Public Administration

Tuesday, August 11th, 2009

I met today with John Sheridan, Head of e-Services, Office of Public Sector Information, The National Archives, located at the Ministry of Justice, London, UK. Also at the meeting was John’s colleague Clare Allison. John and I had met at the ICAIL conference in Barcelona, where we briefly discussed our interests in applications of Semantic Web technologies to legal informatics in the public sector. Recently, John got back in contact to talk further about how we might develop projects in this area.

Perhaps most striking to me is that John made it clear that the government (at least his sector) is proactive, looking for research and development projects that make government data available and usable in a variety of ways. In addition, he wanted to develop a range of collaborations to better understand the opportunities the Semantic Web may offer.

As part of catching up with what is going on, I took a look around the web for relatively recent documents on related activities.

In our discussion, John gave me an overview of the current state of affairs in public access to legislation, in particular the legislative markup and API. The markup is intended to support publication, revision, and maintenance of legislation, among other possibilities. We also had some discussion about developing an ontology of government which would be linked to legislation.

Another interesting dimension is that John’s office is one of the few I know of that are actively engaged in developing a knowledge economy, partly encouraged by public administrative requirements and goals. Others in this area are the Dutch and the US (with xml.gov). All very promising, and the discussions are well worth following up on.

Copyright © 2009 Adam Wyner