Legal resources such as legislation, public notices, case law, and other legally relevant documents are increasingly freely available on the internet. They are almost entirely presented in natural language and in text. Legal professionals, researchers, and students need to extract and represent information from such resources to support compliance monitoring, analyse cases for case-based reasoning, and extract information in the discovery phase of a trial (e-discovery), amongst a range of possible uses. Powerful text analytic tools are available to support such tasks. The tutorial presents an in-depth demonstration, with examples, of one toolkit, the General Architecture for Text Engineering (GATE), along with several briefer demonstrations of other tools.
Goals
Participants in the tutorial should come away with some theoretical sense of what textual information extraction is about. They will also see some practical examples of how to work with a corpus of materials, develop an information extraction system using GATE and the other tools, and share their results with the research community. Participants will be provided with information on where to find additional materials and learn more.
Intended Audience
The intended audience includes legal researchers, legal professionals, law school students, and political scientists who are new to text processing as well as experienced AI and Law researchers who have used NLP, but wish to get a quick overview of using GATE.
Covered Topics
Motivations to annotate, extract, and represent legal textual information.
Uses and domains of textual information extraction. Sample materials from legislation, case decisions, gazettes, e-discovery sources, among others.
Motivations to use an open source tool for open source development of textual information extraction tools and materials.
The relationship to the semantic web, linked documents, and data visualisation.
Linguistic/textual problems that must be addressed.
Alternative approaches (statistical, knowledge-light, machine learning) and a rationale for a particular bottom-up, knowledge-heavy approach in GATE.
Outline of natural language processing modules and tasks.
Introduction to GATE – loading and running simple applications, inspecting the results, refining the search results.
Development of fragments of a GATE system – lists, rules, and examination of results.
Discussion of more complex constructions and issues such as fact pattern identification, which is essential for case-based reasoning, named entity recognition, and structures of documents.
Introduction to ontologies.
Link textual information extraction to ontologies.
Introduction to related tools and approaches: C&C/Boxer (parser and semantic interpreter), Attempto Controlled English, scraperwiki, among others.
Date, Time, Location, and Logistics
Monday, June 10, afternoon session. Exact time will be announced as the conference program becomes available.
The tutorial will be held at the Casa dell’Aviatore, viale dell’Università 20 in Rome, Italy.
Dr. Adam Wyner
Lecturer, Department of Computing Science, University of Aberdeen
Aberdeen, Scotland
azwyner at abdn dot ac dot uk
The lecturer has a PhD in Linguistics, a PhD in Computer Science, and research background in computational linguistics. The lecturer has previously given a tutorial on this topic at JURIX 2009 and ICAIL 2011 along with an invited talk at RuleML 2012, has published several conference papers on text analytics of legal resources using GATE and C&C/Boxer, and continues to work on text analysis of legal resources.
Dan, co-organiser Renee Knake at Michigan State University, and their colleagues at the University of Westminster are up to good things in law and technology – well worth watching.
To cap off the Law Program, the summer program organised a Law Tech Camp of short, TED-style presentations. It is an excellent program of talks from members of the legal industry, practicing lawyers, and academics. I have a talk about Crowdsourcing Legal Text Annotation, which is also discussed in a previous post. The talks are videotaped and made available online (TBA).
Abstract
Legislation and regulations are expressed in natural language. Machine-readable forms of the texts may be represented as linked documents, semantically tagged text, or translation to a logic. The paper considers the last form, which is key to testing consistency of laws, drawing inferences, and providing explanations relative to input. To translate laws to a machine-readable logic, sentences must be parsed and semantically translated. Manual translation is time- and labour-intensive, usually involving narrowly scoping the rules. While automated translation systems have made significant progress, problems remain. The paper outlines systems to automatically translate legislative clauses to a semantic representation, highlighting key problems and proposing some tasks to address them.
Abstract
Large corpora of legal texts are increasingly available in the public domain. To make them amenable to automated text processing, various sorts of annotations must be added. We consider semantic annotations bearing on the content of the texts – legal rules, case factors, and case decision elements. Adding annotations and developing gold standard corpora (to verify rule-based or machine learning algorithms) is costly in terms of time, expertise, and money. To make the processes efficient, we propose several instances of GATE’s Teamware to support annotation tasks for legal rules, case factors, and case decision elements. We engage annotation volunteers (law school students and legal professionals). The reports on the tasks are to be presented at the workshop.
A study in online, collaborative legal informatics
Adam Wyner, University of Liverpool
Wim Peters, University of Sheffield
– Introduction –
This is an academic research study on legal informatics (information processing of the law). The study uses an online, collaborative tool to crowdsource the annotation of legal cases. The task is similar to legal professionals’ annotation of cases. The result will be a public corpus of searchable, richly annotated legal cases that can be further processed, analysed, or queried for conceptual annotations.
Adam and Wim are computer scientists who are interested in language, law, and the Internet.
We are inviting people to participate in this collaborative task. This is a beta version of the exercise, and we welcome comments on how to improve it. Please read through this blog post, look at the video, and get in contact.
– Highlighting, Annotations, and Legal Case Briefs –
In reading, analysing, and preparing a summary of a legal case, law students and legal professionals annotate cases by highlighting and colour coding elements of the case to make for easy identification. Different elements are annotated: the holding, the parties, the facts, and so on. A sample image of annotations is:
Annotations for Case Citations, Legal Roles, Jurisdiction, Hearing Date
– Problem –
To analyse a legal case, legal professionals annotate the case into its constituent parts. The analysis is summarised in a case brief. However, the current approach is very limited:
Analysis is time-consuming and knowledge-intensive.
Case briefs may miss relevant information.
Case analyses and briefs are privately held.
Case analyses are in paper form, so not searchable over the Internet.
Current search tools are for text strings, not conceptual information. We want to search for concepts, such as the holdings by a particular judge with respect to causes of action against a particular defendant.
With annotated legal cases, we can enable conceptual search.
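As a minimal sketch of what conceptual search over annotated cases could look like, the Python below stores annotations as simple records and finds the holdings in cases heard by a given judge. The record layout, the case identifiers, and the function name are hypothetical illustrations; this is not the format that GATE or Teamware uses. The type and feature names follow the annotation scheme of this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    case_id: str   # hypothetical identifier for the case
    type: str      # annotation type, e.g. "Reasoning Outcomes"
    feature: str   # feature within the type, e.g. "Holding"
    text: str      # the annotated span of text


def holdings_by_judge(annotations, judge):
    """Return Holding texts from every case with a Judge Name matching judge."""
    cases = {
        a.case_id
        for a in annotations
        if a.feature == "Judge Name" and a.text == judge
    }
    return [
        a.text
        for a in annotations
        if a.feature == "Holding" and a.case_id in cases
    ]


# Invented toy data, for illustration only.
corpus = [
    Annotation("case-1", "Indexes", "Judge Name", "Judge A"),
    Annotation("case-1", "Reasoning Outcomes", "Holding", "Holding text one"),
    Annotation("case-2", "Indexes", "Judge Name", "Judge B"),
    Annotation("case-2", "Reasoning Outcomes", "Holding", "Holding text two"),
]

print(holdings_by_judge(corpus, "Judge A"))
```

A plain string search for "Judge A" would return every mention of the name; the query here only returns text that an annotator marked as a holding, in cases where the name was marked as the judge.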
– Solution: Crowdsource Annotation –
We use an online legal case annotation tool and share the results to support:
Online search in legal cases for case details and concepts.
Semantic web applications and information extraction.
Crowdsourcing of a legal case corpus.
The results of the study would be useful to:
Law school students learning case analysis.
Legal professionals in identifying relevant cases.
Researchers of legal informatics.
Broadly speaking, a corpus of analysed cases makes case law a public resource.
– Annotations: types and features –
To crowdsource conceptual annotations of legal cases, we use the General Architecture for Text Engineering (GATE) Teamware tool. Teamware is a web-based application that provides an annotator with a text to annotate and a list of annotations to use. The task is a web-based version of what legal analysts of cases already do.
We use familiar annotations for legal cases, divided (for ease of reference) into types and features. For example, we have a type Legal Roles and various features to select among, e.g. defendant. We are counting on you to have learned and used these annotations in the course of your legal study and practice.
You do not need to memorise the types and features as they will appear in the GATE Teamware tool. It may be handy to keep this webpage open so you can consult it or you could also print out the page.
The annotations we use are:
Argument For Party – arguments for a particular party, using the most general notion:
for Appellee, for Appellant, for Defendant, for Plaintiff.
Facts – legal and procedural facts:
Cause of Action – the specific legal theory upon which the plaintiff brings the suit.
Defenses raised by Defendant – the defendant’s defenses against the cause of action.
Legal Facts – the legally relevant facts of the case that are used in arguing the issues.
Remedy requested by Plaintiff – what the plaintiff asks the court to grant.
Indexes – various indicative information:
Case Citation – the citation of the particular case being annotated.
Court Address – the address of the court.
Hearing Date – the date of the hearing.
Judge Name – the names of the judges, annotated one at a time.
Jurisdiction – the legal jurisdiction of the case.
Issues – the issues before the court:
Procedural Issue – what the appellant claims that the lower court did wrong.
Substantive Issue – the point of law that is in dispute (legal facts have their own annotation).
Legal Roles – the role of the parties in the case:
General – buyer/seller, employer/employee, landlord/tenant, etc.
Other – relevant information not covered by the other annotations.
Procedural History – the disposition of the case with respect to the lower court(s):
Appeal Information – who appealed and why they appealed.
Damages – the damages awarded by the lower court.
Lower Court Decision – the lower court’s decision.
Reasoning Outcomes – various parts of the legal decision:
Concurring Opinion.
Dicta – commentary about the judgement and holding, but not part of the rationale.
Dissenting Opinion.
Holding – the rule of law or legal principle that was applied in making the judgement.
Judgement – the court’s final decision about the rights of the parties, the court’s response to a party’s request for relief, and bearing on prior decisions (e.g. affirmed, reversed, remanded, etc.).
Rationale – the court’s analysis of the issues and the reasons for the holding.
– Collaborate –
Take a look at the instructional video below. If you wish to collaborate on the task, send an email to Adam Wyner – adam@wyner.info
In the email, please include brief information for:
Your name
Your professional affiliation, e.g. institution, company, firm…
Your role where you work
Your background as a legal professional
This will help us know who we are collaborating with; from the pool of candidates, we will select participants for this early study.
You will be sent a user name and password so you can login to Teamware.
We respect your privacy. We are only interested in data in the aggregate and will not reveal any personal data to third parties.
– Next –
We have an instructional video that you can open in a new tab or window and that uses QuickTime. It lasts about 14 minutes. This will give you a good idea of what you will be doing. The presenter is Adam Wyner. You can see this here:
There are additional points about using the tool in the section below on questions, problems, and observations.
After reading this blog, viewing the instructional video, and receiving your username and password, you can login to begin annotating at — GATE Teamware
– Survey –
When you are done with your task, please answer the questions on the survey to give us feedback on your experience using the annotation tool. The survey is available below. You can scroll down and answer the questions. Don’t forget to hit the “Done” button to submit your responses, which will be very useful in helping us understand your experience and thoughts about using the tool:
– What Then? –
We analyse the annotations from several annotators, comparing and contrasting them (interannotator agreement). This will show us similarities and differences in how the annotations and cases are understood. The results will also help us develop a Gold Standard Corpus of legal cases, that is, a set of cases with annotations that annotators agree on. A Gold Standard is essential for information extraction and the development of advanced processing. We will publicly report the analysis of the exercise and make the annotated cases publicly available for re-use.
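Interannotator agreement can be measured in several ways, and this post does not fix a particular measure. As one common choice, the sketch below computes Cohen's kappa for two annotators who have each assigned one label to the same list of text spans. The labels and the data are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["Holding", "Rationale", "Legal Facts", "Holding"]
annotator_2 = ["Holding", "Rationale", "Rationale", "Holding"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Spans on which annotators agree are natural candidates for the Gold Standard.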
Once we have a better sense of how this study goes, we plan to roll out a larger version with more cases. And this is only the start….
– Questions, Problems, and Observations –
Thanks to participants for letting us know about their problems and sending their observations.
How easy is it to learn to use the tool? Take a look at the video to get a sense of this. With a little bit of practice, it is rather straightforward.
What if I don’t agree with some of your annotations or features? Write a comment or send us an email, and we will consider your comment. Try to be as clear and specific as you can. We are not lawyers, and we are dealing with a global community with local variation, so it is likely there will be some disagreement and variation.
Can I get the results of my annotations? Our approach is to make individual contributions to the whole. So, you will be able to access annotated cases after the exercise. There will be further information on how to work with the material.
How many cases must I do? You can do one or you can do as many as we have (not many in the beta project).
How much time will it take? About as long as it would take you to do a similar highlighting and annotation task with paper and markers.
What if I have a problem with using the tool or if the tool is buggy? Be patient and try to work with the tool. Sometimes things go wrong. Write a comment or send us an email, and we will try to advise. Note – we are only consumers of GATE Teamware, so are not responsible for the system.
How thoroughly should I annotate the cases? The more cases that are annotated fully and accurately, the better. Apply the same diligence as you would to thoroughly and carefully analyse cases with pen and paper. As you will be the beneficiary of the work of others, so too should you work to benefit them.
Do we track good annotators and bad annotators? We are interested in data in the aggregate, and are only interested in interannotator agreement and disagreement. This information will help us better understand differences in how the cases are understood and annotated. But, we can see how much time each person takes with each annotation task and measure how they perform against other annotators or a gold standard. If we have bad annotators, we will see this in the results; we would contact the annotator and see how best to improve the situation. As we noted above, we are not sharing information with third parties.
I cannot login with the username and password. Please let me know if you have this problem, and I will look into it.
I can login, but I cannot get the Java Web Start file to start. This is a tough problem to address over the internet. Some people have no problem, but some do. Please let me know if you have this problem. Do check that you have followed the instructions (on the blog and in the movie).
I can login and start the annotation tool, but I cannot get the task. Please let me know, and I will look into it.
The text is too small and single spaced. At the moment, there is nothing we can do about this. We’ll try to keep this in mind for the future.
The highlighting tool is not easy to use. When I want to move from one annotated text to some new text, the tool doesn’t move to the new text. This is a bit of a problem with the tool, which is not entirely reliable in its functionality. Try to play around with this to see what works for you. One strategy that I have found improves performance is to annotate something. Then the annotation type appears in the upper right hand corner window among the list of annotations. When the problem occurs, it is sometimes a good idea to click the annotations in that upper right hand corner window off and on (toggle them). This seems to clear the system a bit so that one can go on to the next annotation. Give this a try. If you have problems, please let me know.
I found it very challenging. It is important for us to know this in order to gauge how much text and how wide a variety of annotations to include. We might reduce the number of annotations, breaking up the whole set into parts of the overall task.
Decision date is more important than hearing date, or at least should be provided in addition to hearing date. Probably this will be added to future iterations.
A participant, e.g. “Cone”, was originally a defendant, but was dismissed out before this appeal. I wonder if he should still be coded as “Defendant” or if he should be coded as an other role-holder. Good observation. I’ll have to consult with some lawyers further about this point.
There are sentences where the court introduced a fact and also appeared to reason using it. Is it right to code the whole sentence both as a legal fact and as a rationale? Yes, this is the way to handle this. Double annotations are always possible.
A similar problem occurred where the court offered a fact but also put a gloss on it as to its legal significance. Double annotations are always possible.
Some of the names of the categories were confusing or unclear. For example, using “Holding” for the name of the legal rule or principle was confusing (“Legal Rule” might be more intuitive). This is another point that we will need to consult further with other lawyers. There may also be some variation in terminology.
There is sometimes a lack of clarity about role-players. A case involved a plaintiff, who was an appellee but also a cross-appellant, and a defendant who was thus an appellant and cross-appellee. These can be coded where one is plaintiff and appellee and the other defendant and appellant. But, they could have both been coded as appellee and appellant, given the existence of the cross appeal. Double (or more) annotating is fine.
Procedural History/Damages might be better framed as Procedural History/Remedies, as courts often provide injunctive relief or, as in this case, an accounting, as a remedy. This is another point that we will need to consult further with lawyers about terminology.
What if a case does not state any legal rules? Can implicit legal rules be annotated? For example, where novelty and non-obviousness are a sine qua non of a valid patent, one would not have known to mark some of the sentences as rationales. This isn’t a problem. If something is not in the case, then it is not annotated. We are not (yet) concerned with implicit information. But, if you know the implicit information, then annotate it.
How can I automatically search for and annotate the same string with the same annotation? In the instructional video, we wanted to keep the material short and to the point, so there are aspects of the annotation tool we did not cover. However, it is tedious to manually search for the same string and annotate it with the same annotation. Teamware’s Annotation Editor has a tool to support automatic search and annotation. To see how to do this, we have the video here:
How should I annotate holdings which may appear as holdings in cited cases and as part of the procedural history, as holdings in the current case, or as part of the rationale in the current case? This is an interesting and subtle point for us, and we will have to have a full consultation with lawyers to decide. But, for the time being, there can be no harm in multiple annotations, which we can then look at and work with later.
– Paper –
If you are interested in some of the ideas behind this project, please see our paper:
The paper will appear in May 2012 in the Proceedings of the LREC Conference Workshop on Semantic Processing of Legal Texts, Istanbul, Turkey. The exercise here is a version of the exercise proposed in the paper.
Adam Wyner, University of Liverpool, adam@wyner.info
Neil Benn, University of Leeds, n.j.l.benn@leeds.ac.uk
Paper Submission Deadline: May 28, 2012
We invite submission of papers on modelling policy-making. Below we outline the intended audience, context, the topics of interest, and submission details.
Context
We live in an age where citizens are beginning to demand greater transparency and accountability of their political leaders. Furthermore, those who govern and decide on policy are beginning to realise the need for new governance models that emphasise deliberative democracy and promote widespread public participation in all phases of the policy-making cycle: 1) agenda setting, 2) policy analysis, 3) lawmaking, 4) implementation, and 5) monitoring. As governments must become more efficient and effective with the resources available, modern information and communications technology (ICT) is being drawn on to address problems of information processing in the phases. One of the key problems is policy content analysis and modelling, particularly the gap between, on the one hand, policy proposals and formulations that are expressed in quantitative and narrative forms and, on the other hand, formal models that can be used to systematically represent and reason with the information contained in the proposals and formulations.
Special Issue Theme
The editors invite submissions of original research about the application of ICT and Computer Science to the first three phases of the policy cycle – agenda setting, policy analysis, and lawmaking. The research should seek to address the gap noted above. The journal volume focusses particularly on using and integrating a range of subcomponents – information extraction, text processing, representation, modelling, simulation, reasoning, and argument – to provide policy making tools to the public and public administrators. While submissions about tool development and practice are welcome, the editors particularly encourage submission of articles that address formal, conceptual, and/or computational issues. Some specific topics within the theme are:
information extraction from natural language text
policy ontologies
formal logical representations of policies
transformations from policy language to executable policy rules
argumentation about policy proposals
web-based tools that support participatory policy-making
tools for increasing public understanding of arguments behind policy decisions
visualising policies and arguments about policies
computational models of policies and arguments about policies
integration tools
multi-agent policy simulations
Submission Details:
Authors are invited to submit an original, previously unpublished, research paper of up to 30 pages pertaining to the special issue theme. The paper should follow the journal’s instructions for authors and be submitted online. See the dropdown tab under the section FOR AUTHORS AND EDITORS.
Each submitted paper will be carefully peer-reviewed based on originality, significance, technical soundness, and clarity of exposition and relevance for the journal.
Abstract
Rules in regulations such as those found in the US Code of Federal Regulations can be expressed using conditional and deontic rules. Identifying and extracting such rules from the language of the source material would be useful for automating rulebook management and translating into an executable logic. The paper presents a linguistically-oriented, rule-based approach, which is in contrast to a machine learning approach. It outlines use cases, discusses the source materials, reviews the methodology, then provides initial results and future steps.
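To give a flavour of what a rule-based approach involves, the sketch below flags sentences that contain a deontic modal or a conditional marker, using plain regular expressions. It is a deliberately simple illustration with invented word lists and example sentences; it is not the method or the rule set used in the paper.

```python
import re

# Hypothetical, deliberately small word lists.
DEONTIC = re.compile(r"\b(must|shall|may not|may|should)\b", re.IGNORECASE)
CONDITIONAL = re.compile(r"\b(if|unless|when|provided that)\b", re.IGNORECASE)


def classify(sentence):
    """Return the set of rule markers found in a sentence."""
    markers = set()
    if DEONTIC.search(sentence):
        markers.add("deontic")
    if CONDITIONAL.search(sentence):
        markers.add("conditional")
    return markers


# Invented example sentences, not quotations from any regulation.
examples = [
    "If a sample is contaminated, the establishment must discard it.",
    "The director may inspect the records.",
    "This part describes the labelling of products.",
]
for s in examples:
    print(sorted(classify(s)), "-", s)
```

Keyword matching of this kind over-generates (for instance, "may" also expresses possibility), which is one reason a linguistically-oriented approach looks at sentence structure as well as trigger words.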
Abstract
The paper addresses the extraction, formalisation, and presentation of public policy arguments. Arguments are extracted from documents that comment on public policy proposals. Formalising the information from the arguments enables the construction of models and systematic analysis of the arguments. In addition, the arguments are represented in a form suitable for presentation in an online consultation tool. Thus, the forms in the consultation correlate with the formalisation and can be evaluated accordingly. The stages of the process are outlined with reference to a working example.
Wednesday December 14, 2011
University of Vienna
Vienna, Austria
Context:
As the European Union develops, issues about governance, legitimacy, and transparency become more pressing. National governments and the EU Commission realise the need to promote widespread, deliberative democracy in the policy-making cycle, which has several phases: 1) agenda setting, 2) policy analysis, 3) lawmaking, 4) administration and implementation, and 5) monitoring. As governments must become more efficient and effective with the resources available, modern information and communications technology (ICT) is being drawn on to address problems of information processing in the phases. One of the key problems is policy content analysis and modelling, particularly the gap between, on the one hand, policy proposals and formulations that are expressed in quantitative and narrative forms and, on the other hand, formal models that can be used to systematically represent and reason with the information contained in the proposals and formulations.
Submission Focus:
The workshop invites submissions of original research about the application of ICT to the early phases of the policy cycle, namely those before the legislators fix the legislation: agenda setting, policy analysis, and lawmaking. The research should seek to address the gap noted above. The workshop focuses particularly on using and integrating a range of subcomponents – information extraction, text processing, representation, modelling, simulation, reasoning, and argument – to provide policy making tools to the public and public administrators.
Intended Audience:
Legal professionals, government administrators, political scientists, and computer scientists.
Areas of Interest:
information extraction from natural language text
policy ontologies
formal logical representations of policies
transformations from policy language to executable policy rules
argumentation about policy proposals
web-based tools that support participatory policy-making
tools for increasing public understanding of arguments behind policy decisions
visualising policies and arguments about policies
computational models of policies and arguments about policies
integration tools
multi-agent policy simulations
Preliminary Workshop Schedule:
09:45-10:00 Workshop Opening comments
10:00-11:00 Paper Session 1
Using PolicyCommons to support the policy-consultation process: investigating a new workflow and policy-deliberation data model
Neil Benn and Ann Macintosh
A Problem Solving Model for Regulatory Policy Making
Alexander Boer, Tom Van Engers and Giovanni Sileno
11:00-11:15 Break (coffee, tea, air etc.)
11:15-12:15 Paper Session 2
Linking Semantic Enrichment to Legal Documents
Akos Szoke, Andras Forhecz, Krisztian Macsar and Gyorgy Strausz
Semantic Models and Ontologies in Modelling Policy-making
Adam Wyner, Katie Atkinson and Trevor Bench-Capon
12:15-13:15 Lunch break
13:15-14:45 Paper Session 3
Consistent Conceptual Descriptions to Support Formal Policy Model Development: Metamodel and Approach
Sabrina Scherer and Maria Wimmer
The Policy Modeling Tool of the IMPACT Argumentation Toolbox
Thomas Gordon
Ontologies for Governance, Risk Management and Policy Compliance
Jorge Gonzalez-Conejero, Albert Merono-Penuela and David Fernandez Gamez
14:45-15:00 Break (coffee, tea, air etc.)
15:00-16:00 Paper Session 4 and Closing discussion
Policy making: How rational is it?
Tom Van Engers, Ignace Snellen and Wouter Van Haaften
Closing discussion
Workshop Registration and Location:
Please see the JURIX 2011 website for all information about registration and location.
Submit position papers of 2 to 5 pages in length in PDF format and using the IOS Press style files and authors’ guidelines at: IOS Press Author Instructions
A call for selected extended versions of the papers will be issued for a special issue of AI and Law on Modelling Policy-making.
Contact Information:
Adam Wyner, adam@wyner.info
Neil Benn, n.j.l.benn@leeds.ac.uk
Program Committee Co-Chairs:
Adam Wyner (University of Liverpool, UK)
Neil Benn (University of Leeds, UK)
Program Committee (Preliminary):
Katie Atkinson
Trevor Bench-Capon
Bruce Edmonds
Tom van Engers
Euripidis Loukis
Tom Gordon
Ann Macintosh
Gunther Schefbeck
Maria Wimmer
Radboud Winkels
In this note, I point to various parts of a discussion on developing and analysing legal textual data raised at ICAIL 2011. Please feel free to add comments to this document (or to me in person, by email, on your blog and linked to this, etc), which I can then add to the post (I’m very happy to attribute contributions). The intention is to stimulate discussion on these matters to help the community of researchers move ahead on common interests.
Corpus Development
Unlike the situation from several years ago, we have accessible sources of large corpora of legal textual information. The World Legal Information Institutes provide free, independent and non-profit access to worldwide law. For example, one can go to the US site and download cases: United States v Grant [1961] USCA9 19; 286 F.2d 157 (19 January 1961); one can request zipped files or screen scrape cases. The LIIs have introduced standardised references and formats for cases. There are boolean and regex searches.
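Standardised references make cases easy to process automatically. As a rough sketch, the Python below splits the example citation above into its parts with a regular expression. The pattern and the field names (parties, year, court, number, reporter, date) are my own reading of that one example, not an official LII specification, so other citation forms may need a different pattern.

```python
import re

# Pattern written against the single example citation; an assumption,
# not a published LII format definition.
CITATION = re.compile(
    r"^(?P<parties>.+?)\s+"
    r"\[(?P<year>\d{4})\]\s+"
    r"(?P<court>[A-Za-z0-9]+)\s+(?P<number>\d+);\s+"
    r"(?P<reporter>.+?)\s+"
    r"\((?P<date>\d{1,2} [A-Z][a-z]+ \d{4})\)$"
)


def parse_citation(citation):
    """Return the named parts of a citation, or None if it does not match."""
    match = CITATION.match(citation.strip())
    return match.groupdict() if match else None


example = "United States v Grant [1961] USCA9 19; 286 F.2d 157 (19 January 1961)"
for field, value in parse_citation(example).items():
    print(f"{field}: {value}")
```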
From the contacts that I have had (e.g. in the US and UK), the LIIs would be very happy to collaborate with academic researchers in the analysis of their data and in keeping with their primary mission. In particular, developing tools that can be integrated and deployed with their platforms might be a way to go, thereby addressing significant platform and dissemination issues.
Another source of corpora is public.resource.org, which distributes a range of corpora covering legislation, codes, and cases.
Analysis and Annotation
There is a range of issues about information retrieval and extraction. Others can speak about IR, statistical, and machine learning approaches. What I know better is annotation, whether fully automatic, semi-automatic, or manual. Here we have issues about what to annotate and how. Some low level information is unproblematic (e.g. entities of a range of sorts, sections, and sentiment); higher level information (e.g. factors) might be more complex. I have some suggestions for annotations for low level information; a good starting point for factors is the set of CATO factors, though there is a general issue about how to extend factor identification to other domains (CATO factors are specific to intellectual property).
One general problem with analysis is that different researchers might use different tools in their work and just report the results. This means results are not interchangeable, which is particularly problematic with annotation work. If a common ‘framework’ tool is used and some consensus is developed about (at least) low level annotation types, then work can proceed more collaboratively, transparently, and reproducibly. One can make a more forceful argument for researchers (public service bodies and information providers) to promote such an open development methodology; among the reasons are justification and traceability (see Wyner and Peters 2010 and David Lewis’s ICAIL 2011 keynote address on related points). The General Architecture for Text Engineering (GATE) is an open framework for text processing modules.
There are ‘open’ systems for text annotation, such as Open Calais and the Open Up platform’s data enrichment service from The Stationery Office. However, there are intellectual property issues that need to be considered.
Another general issue is how to carry out manual annotation, for example to build gold standards, which are required for machine learning systems. There has been significant progress, for example, with TeamWare, which provides for curated, web-based annotation tools along with annotation analysis (e.g. inter-annotator agreement). For a short tutorial (for an experiment) on using TeamWare for annotation of some legal case factors, see Web-based Annotation Support for the Law. Wim Peters and I proposed to law school faculty to use this tool to support their student exercises for first and second year students since these exercises often require identifying and extracting information from cases. Wim and I think integrating annotation exercises into legal e-learning could both help to develop large annotated sets of data and to serve an important educational purpose. See our paper about some of these points and proposals.
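To make the idea of inter-annotator agreement concrete, here is a minimal sketch in Python of one common agreement measure, Cohen's kappa, for two annotators labelling the same passages. This is an illustration only: I am not claiming it is the measure TeamWare computes, and the factor labels and the function name `cohen_kappa` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Proportion of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six passages of a case (not real data).
annotator_a = ["FactorA", "FactorA", "none", "FactorB", "none", "FactorA"]
annotator_b = ["FactorA", "none", "none", "FactorB", "none", "FactorA"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.739
```

The two annotators agree on five of six passages, but kappa is lower than 5/6 because some of that agreement would be expected by chance. A measure of this kind is what lets a group decide whether an annotation scheme is reliable enough to serve as a gold standard.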
Research Questions
Large corpora can be formed and tools can be applied to them, but for fundraising, the community needs to develop a range of motivating research questions and use cases. Aside from questions pursued in the AI and Law community, we might consult further with public bodies (National Center for State Courts and similar), legal information service providers (Lexis-Nexis, ThomsonReuters, Practical Law Company), law societies, political scientists, etc. The kinds of answers we look for partially guide how we structure not only the corpora but, more so, the annotations.
Funding Opportunities
One opportunity is Digging into Data and its Request for Proposals, but the due date is June 16 (I had been working on a proposal, but needed better research questions to hold local interest). Though the deadline is too soon to submit a proposal, it does demonstrate a widespread interest among funding bodies in the development and analysis of large corpora in the humanities and social sciences. The other obvious funding sources are national (US, UK, French, etc.) and international (EU and Digging into Data).
I’m giving a talk tomorrow, April 11 2011, at BILETA, the annual conference of the British & Irish Law, Education and Technology Association at Manchester Metropolitan University School of Law. My collaborators are Wim Peters (University of Sheffield) and Fiona Beveridge (University of Liverpool).
The abstract and slides are below:
Web-based Software Tools to Support Students’ Empirical Study of the Law
Adam Wyner (University of Liverpool, Computer Science), Wim Peters (University of Sheffield, Computer Science), and Fiona Beveridge (University of Liverpool, Law School)
The paper investigates and proposes tools to support students in empirically investigating legal cases using text analytic software. Web-based tools can be used to engage and leverage the collective skills and ambitions of law students to crowd-source the development of legal resource materials. Law school students must develop skills in close textual analysis of legal source material such as legal cases. To use source material such as case decisions to reason about how precedents apply in case-based reasoning, law students must learn to identify a range of elements in legal cases, for example, parties, jurisdiction, material facts, legislative and case citations, cause of action, ratio decidendi, and others. Moreover, students should be able to address complex queries to a case or a case base (a corpus of cases) in order to answer questions of particular legal interest; for example, about relationships between a judge, parties, cause of action, and ratio. Currently students simply rely on their own analytic abilities to read a case and find answers to questions; legal search tools (e.g. Lexis-Nexis) provide search support, but are restricted to a limited number of coarse-grained parameters and cannot search for deep, particular semantic relationships in the text. To enable automated support of queries of the corpus, and so enable deep empirical research on cases, it is essential to have a corpus of legal cases which are annotated with machine readable (XML) tags that signal the semantic properties of passages of text. Creating such a corpus requires a tool to annotate the text. Such a tool would reinforce students’ examination of the source document. The paper describes recent developments of tools using Semantic Web technologies, text analysis, and web-based annotation support.
With the text analysis software, General Architecture for Text Engineering (GATE), which is customised for legal applications, law students can annotate legal cases for a fine-grained range of legally relevant concepts and linguistic relations; they can also use GATE to write grammars and automatically annotate the text. Using GATE TeamWare, an online text annotation tool that automatically evaluates inter-annotator agreement, students can collaboratively analyse and agree on a gold standard corpus of legal cases. The corpus can be automatically indexed using Lucene, thereby allowing fast results for complex queries over any string or annotation used.