Archive for the ‘text mining’ Category

Text Mining Legal Resources with GATE — Study 1

Monday, October 12th, 2009

This page reports the results of a first study of applying GATE to a legal resource. The focus of this study was to annotate a list of cases.

I used a web page from BAILII which contains a list of cases with the following information:

  • The first party in a case, e.g. Meade.
  • The second party in a case, e.g. Mason.
  • The citation date, e.g. [1999].
  • The court level in which the case was decided, e.g. England and Wales Court of Appeal.
  • The court within the level, e.g. Civil.
  • The citation number, e.g. 780.
  • The date of the decision, e.g. 12 February 1999.

A sample of entries from the page I worked with is:

McSpadden v Keen [1999] EWCA Civ 1515 (27 May 1999)
McTaggart, R v [1997] EWCA Crim 3050 (24th November, 1997)
McTaggart, R v [1997] EWCA Crim 3137 (2nd December, 1997)
McVeigh & Anor, R v [1998] EWCA Crim 784 (3rd March, 1998)
McWhirter & Anor, R (on the application of) v Secretary of State for Foreign and Commonwealth Affairs [2003] EWCA Civ 384 (05 March 2003)
M-D v D [2008] EWHC 1929 (Fam) (19 December 2008)
MD (Guinea) v Secretary of State for the Home Department [2009] EWCA Civ 733 (17 June 2009)
MD (Iran) v Secretary of State for the Home Department [2007] EWCA Civ 532 (27 April 2007)

Below is a screenshot of the result of annotation in GATE. The parts of the annotation are colour coded as shown in the column on the right. In Firefox, one can right-click on the image and choose View Image to see a larger version, then use the browser's back button to return to the post.

[Image: GATE annotations on a list of legal case information]

There was a range of irregularities in the source which had to be accommodated:

  • v and v. for the versus relation.
  • Decision date formats.
  • Length of the names of the parties.
  • Different orders of court and court level, e.g. EWCA Civ 1515 versus EWHC 1929 (Fam).
  • Variations that arise as a consequence of using a page stripped of HTML annotations. The first name in the image is an artifact.
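To make the decision date irregularity concrete, here is a minimal sketch in Python (not one of the JAPE files used in the study) that maps the two date styles seen in the sample, such as (27 May 1999) and (24th November, 1997), onto a single form. The function name parse_decision_date is hypothetical, and the sketch assumes English month names and a day-month-year order.

```python
import re
from datetime import date, datetime

# Ordinal suffixes directly after a day number, as in "24th" or "2nd".
# The lookbehind keeps the "st" in "August" from being touched.
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_decision_date(text: str) -> date:
    """Parse decision dates such as '27 May 1999' or '(24th November, 1997)'."""
    cleaned = text.strip().strip("()")          # drop surrounding brackets
    cleaned = _ORDINAL.sub("", cleaned)         # '24th' -> '24'
    cleaned = cleaned.replace(",", " ")         # 'November, 1997' -> 'November  1997'
    cleaned = " ".join(cleaned.split())         # collapse repeated spaces
    return datetime.strptime(cleaned, "%d %B %Y").date()


if __name__ == "__main__":
    for raw in ["(27 May 1999)", "(24th November, 1997)", "(05 March 2003)"]:
        print(raw, "->", parse_decision_date(raw).isoformat())
```

A date that follows neither pattern raises ValueError, which seems preferable to silently guessing at a format.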

In this approach, I did not annotate the parties as plaintiff and defendant, since the case decisions themselves associate the parties with different roles in different court contexts; annotating them simply as first and second party is more general. Given the variants among case citations, I opted to identify each piece of the citation, which allows one to extract and reconstruct the citation in subsequent work.
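The idea of identifying each piece of a citation can also be sketched outside GATE. The following Python fragment is an illustration only, not the JAPE grammar used in the study: the pattern, the field names, and the functions parse_citation and rebuild_neutral_citation are all assumptions, written against the sample entries above. It handles both v and v., index entries of the form "McTaggart, R v" in which no second party follows the versus, and both orders of division and citation number.

```python
import re
from typing import Optional

CITATION = re.compile(
    r"""
    ^(?P<first_party>.+?)\s+v\.?            # first party, then v or v.
    (?:\s+(?P<second_party>.+?))?           # second party, absent in some index entries
    \s+\[(?P<year>\d{4})\]                  # citation year in square brackets
    \s+(?P<court>[A-Z]+)                    # court level abbreviation, e.g. EWCA
    (?:
        \s+(?P<division>[A-Z][a-z]+)\s+(?P<number>\d+)          # Civ 1515
      | \s+(?P<number2>\d+)\s+\((?P<division2>[A-Z][a-z]+)\)    # 1929 (Fam)
    )
    \s+\((?P<decision_date>[^()]+)\)$       # decision date in round brackets
    """,
    re.VERBOSE,
)


def parse_citation(line: str) -> Optional[dict]:
    """Split one index entry into its parts, or return None if it does not fit."""
    match = CITATION.match(line.strip())
    if match is None:
        return None
    g = match.groupdict()
    return {
        "first_party": g["first_party"],
        "second_party": g["second_party"],
        "year": g["year"],
        "court": g["court"],
        "division": g["division"] or g["division2"],
        "number": g["number"] or g["number2"],
        # Remember which order the source used so it can be reproduced.
        "division_in_parens": g["division2"] is not None,
        "decision_date": g["decision_date"],
    }


def rebuild_neutral_citation(parts: dict) -> str:
    """Reassemble the year, court, division and number in their original order."""
    if parts["division_in_parens"]:
        return f"[{parts['year']}] {parts['court']} {parts['number']} ({parts['division']})"
    return f"[{parts['year']}] {parts['court']} {parts['division']} {parts['number']}"


if __name__ == "__main__":
    entry = "M-D v D [2008] EWHC 1929 (Fam) (19 December 2008)"
    parts = parse_citation(entry)
    print(parts)
    print(rebuild_neutral_citation(parts))
```

A single regular expression is brittle compared with separate gazetteer lists and rules per component, so this should be read as a compact statement of the target structure rather than a replacement for the GATE pipeline.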

While this is a small-scale and relatively simple task, the result has one main strength: it gives us a list of parties to cases. It is difficult to identify parties automatically in general, but with this approach we can extract those entities which have been involved in a case, then use that information for subsequent annotation tasks. Another strength is that we have isolated the components of the case citation, which can then be reconstructed as we wish.

The list of parties could be further refined by isolating last names, distinguishing among parties which appear in a list, differentiating persons from organisations, and filtering out additional information that appears. This is left for future work.

The Case Base List zip file contains the following files, which were used with GATE.

  • ew-cases-0133.html, which is the HTML file that lists the cases.
  • ew-cases-0133SHORT.xml, which is the XML file with the result of annotation. This is the file related to the graphic above. The file is a short version of ew-cases-0133.html so that one can more easily see the results of the annotation. These appear as stand-off annotations. In the first part of the file, one can see the tokens of the file with numerical ranges (node numbers); later in the file, one can see the annotations themselves, which refer to the starting and ending node numbers of each token.
  • GraphicListAnnotation.png, the graphic above.
  • CiteYear.jape, this annotates the citation year for use in the citation, as in [1998].
  • Courts_abbr.jape, this annotates the court level in terms of abbreviations as in EWCA, which is the England and Wales Court of Appeal.
  • dateAWynerMods.jape, this annotates the decision date such as (23rd June, 2001) and (21 July 2000).
  • FirstParty.jape, this annotates the first party, which is that party to the left of versus.
  • SecondParty.jape, this annotates the second party, which is that party to the right of versus.
  • SubCourts_abbr.jape, this annotates the courts within a court level such as civil courts (Civ) and criminal courts (Crim).
  • Versus.jape, this annotates the versus divider.
  • england_wales_courts_hierarchy.lst, this is a list of courts in England and Wales.
  • england_wales_courts_hierarchy_abbr.lst, this is a list of abbreviations for the courts in England and Wales.
  • england_wales_courts_subclass.lst, this is a list of the divisions within a court level.
  • england_wales_courts_subclass_abbr.lst, this is a list of abbreviations of courts within a court level.
  • cite_year.lst, this is a list of years with square brackets as in [1999]. Perhaps a rule can be written for this, taking into account the brackets.
  • list.def, the ‘master list’ of lists for use in GATE.
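On the remark about cite_year.lst, that a rule could take the place of a list of bracketed years: in GATE this would be a JAPE rule, which I do not attempt here. As a rough sketch of the same idea in Python, a pattern for four digits inside square brackets covers every year without listing them. The names CITE_YEAR and find_cite_years are hypothetical.

```python
import re

# Four digits enclosed in square brackets, e.g. [1999].
CITE_YEAR = re.compile(r"\[(\d{4})\]")


def find_cite_years(text: str) -> list:
    """Return every bracketed citation year in the text as an integer."""
    return [int(year) for year in CITE_YEAR.findall(text)]


if __name__ == "__main__":
    entry = "McVeigh & Anor, R v [1998] EWCA Crim 784 (3rd March, 1998)"
    print(find_cite_years(entry))
```

Because the brackets are part of the pattern, the year inside the decision date is not picked up as a citation year.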

The files are released under a Creative Commons Attribution-ShareAlike license. The main objective of the contribution is to foster open, public, and collaborative development of text mining tools for legal resources.

Advice, suggestions, alternatives, and contributions along the lines of this work are very welcome.

Cheers,
Adam

Copyright © 2009 Adam Wyner

Meeting with John Sheridan on the Semantic Web and Public Administration

Tuesday, August 11th, 2009

I met today with John Sheridan, Head of e-Services, Office of Public Sector Information, The National Archives, located at the Ministry of Justice, London, UK. Also at the meeting was John’s colleague Clare Allison. John and I had met at the ICAIL conference in Barcelona, where we briefly discussed our interests in applications of Semantic Web technologies to legal informatics in the public sector. Recently, John got back in contact to talk further about how we might develop projects in this area.

Perhaps most striking to me is that John made it clear that the government (at least his sector) is proactive, looking for research and development projects that make government data available and usable in a variety of ways. In addition, he wanted to develop a range of collaborations to better understand the opportunities the Semantic Web may offer.

As part of catching up with what is going on, I took a look around the web for relatively recent documents on related activities.

In our discussion, John gave me an overview of the current state of affairs in public access to legislation, in particular, the legislative markup and API. The markup is intended to support publication, revision, and maintenance of legislation, among other possibilities. We also had some discussion about developing an ontology of government which would be linked to legislation.

Another interesting dimension is that John’s office is one of a few that I know of which are actively engaged in developing a knowledge economy, partly encouraged by public administrative requirements and goals. Others in this area are the Dutch and the US (with xml.gov). All very promising, and the discussions are well worth following up on.


Session I of “Automated Content Analysis and the Law” Workshop

Monday, August 3rd, 2009

Today is session I of the NSF-sponsored workshop on Automated Content Analysis and the Law. The theme of today’s meeting is the state of judicial/legal scholarship, with the aim to:

  • Identify the theoretical and substantive puzzles in legal and judicial scholarship which might benefit from automated content analysis.
  • Discuss the kinds of data/measures that are required to address these puzzles which automated content analysis could provide.

Further comments later in the day after the session.

–Adam Wyner


London GATE Users Group

Saturday, August 1st, 2009

At the recent GATE Summer School in Sheffield, there was some discussion among people from London about forming an occasional, informal users group where GATE users based in London can arrange to meet to go over tutorials, develop tutorials, discuss how we work with GATE, help one another with problems, and generally have a bit of a blab over tea with others who have similar interests.

As the informal organiser of this informal group, I thought my blog (which touches on topics related to text analytics) might be an acceptable place to announce and maintain the group. If things really get going, then perhaps the group will hive off to its own site.

I would like to suggest Thursday, August 20 in the early evening (e.g. 19:00) as our first meeting time. The meeting would likely run till 20:30. Place (somewhere in central London, around Covent Garden/Leicester Square) to be announced. Please let me know if this time and vicinity suit you, as we are looking to have more than one person show up.

Likely people will bring laptops, but we’ll try to arrange a projector as well for public show and tell. If you have something you would like to discuss or show, that would be good, but we can always find something to do and discuss.

It is an open group, and if you would like to be kept informed of any upcoming meetings, please send an email to Adam Wyner (adam@wyner.info). Feel free also to join this blog as one way to keep in touch with this group.

The group currently has the following participants:

  • Dipti Garg (Fizzback)
  • Hercules Fisherman (Fizzback)
  • Adam Wyner (University College London)
  • Auhood Alfaries (Brunel University)
  • Helen Flatley (EqualMedia)
  • Gerhard Brey (King’s College London)
  • Daniel Elias (Hawk Ridge Capital Management)
  • Renato Souza (Universidade Federal de Minas Gerais)

We look forward to our first meeting and to hearing from other people who may be interested in working with GATE. Comments on this topic are very welcome.

Cheers!
Adam Wyner

NSF sponsored workshop: Automated Content Analysis and the Law

Wednesday, July 22nd, 2009

I was invited to participate in an NSF-sponsored workshop, Automated Content Analysis and the Law, on August 3 and 4 at NSF HQ in Arlington, VA, organised by Georg Vanberg (UNC).

There are two sessions planned. The first session will focus on identifying the theoretical/substantive puzzles in legal and judicial scholarship that might benefit from automated content analysis as well as what data and measurements are required. For the second session, the focus is on the state of automated content analysis/natural language processing, exploring the extent to which current technology is relevant to providing results with respect to issues raised in the first session and what might be needed.

There is an interesting mix of people, with a strong emphasis on legal scholarship bearing on the US Supreme Court and opinion mining. I had an email exchange about this with Georg, the workshop organiser, and we agree that attention ought to turn from the Supreme Court to lower levels of the legal system. I also suggested that participants consider some of the following points, which bear on the motives and objectives of these lines of research in terms of who is being served and how the data or conclusions would be used.

Questions for Discussion

  • What sorts of artifacts and technologies (if any) will emerge from the research?
  • How does the research relate to the Semantic Web?
  • What public service does the research provide or support?
  • How does this research relate to:
    • E-discovery
    • Textual legal case based reasoning
    • Legislative XML Markup
    • Other research communities e.g. ICAIL and JURIX

Participants

  • Scott Barclay (NSF) – Barclay@uamail.albany.edu
  • Cliff Carrubba (Emory) – ccarrub@emory.edu
  • Skyler Cranmer (UNC) – skylerc@email.unc.edu
  • Barry Friedman (NYU) – friedmab@juris.law.nyu.edu
  • Susan Haire (NSF) – shaire@nsf.gov
  • Lillian Lee (Cornell) – llee@cs.cornell.edu
  • Jimmy Lin (Maryland) – jimmylin@umd.edu
  • Stefanie Lindquist (Texas) – SLindquist@law.utexas.edu
  • Will Lowe (Nottingham) – will.lowe@nottingham.ac.uk
  • Andrew Martin (Wash U) – admartin@wustl.edu
  • Wendy Martinek (NSF) – wemartin@nsf.gov
  • Kevin McGuire (UNC) – kmcguire@unc.edu
  • Wayne McIntosh (Maryland) – wmcintosh@gvpt.umd.edu
  • Burt Monroe (Penn State) – blm24@psu.edu
  • Kevin Quinn (Harvard) – kevin_quinn@harvard.edu
  • Jonathan Slapin (Trinity College) – jonslapin@gmail.com
  • Jeff Staton (Emory) – jkstato@emory.edu
  • Georg Vanberg (UNC) – gvanberg@unc.edu
  • Adam Wyner (University College London) – adam@wyner.info

General Architecture for Text Engineering Summer School

Wednesday, July 22nd, 2009

Next week I’m attending a week-long summer school on General Architecture for Text Engineering (GATE). GATE is an open-source and extensible toolkit for text mining, which has been used in a variety of areas. After having worked with people who had their “hands on” the tools, I decided it would better suit me to be able to work with the material myself. I’ve been looking forward to this summer school for some time and am excited at the prospect of applying GATE tools to a database of legal cases as well as developing an ontology.