Open source software development and standards are widely discussed and practiced. They have led to a range of useful applications and services. GATE is one such example.
However, one quickly learns that open source can easily mean open only to a certain extent: GATE is open source, but the applications and additional functionalities that are developed on top of GATE often are not. On the one hand, this makes perfect sense, as the applications and functionalities are added value, labour intensive, and so on. On the other hand, the scientific community cannot verify, validate, or build on prior work unless the applications and functionalities are available. This can also hinder commercial development, since closed development impedes progress, dissemination, and a common framework from which everyone benefits. Closed development also fails to recognise the fundamentally experimental aspect of information extraction. In contrast, the rapid growth and contributions of the natural (Biology, Physics, Chemistry, etc.) or theoretical (Maths) sciences could only have occurred in an open, transparent development environment.
I advocate open source information extraction, in which an information extraction result can be reported only if it can be independently verified and built on by members of the scientific community. This means that the following must be made available concurrently with the report of the result:
- Data and corpora
- Lists (e.g. gazetteers)
- Rules (e.g. JAPE rules)
- Any additional processing components (e.g. information extraction to schemes or XSLT)
- Development environment (e.g. GATE)
In other words, the results must be independently reproducible in full. The slogan is:
No publication without replication.
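The checklist above could, in principle, be checked mechanically before a result is reported. The following is only a minimal sketch: the directory names (`corpora`, `lists`, `rules`, `components`, `environment`) and the function names are hypothetical conventions invented for illustration, not part of GATE or of any existing standard.

```python
from pathlib import Path

# Hypothetical layout: one subdirectory per artifact category in the list above.
# The directory names are assumptions made for this sketch only.
REQUIRED_ARTIFACTS = {
    "corpora": "Data and corpora",
    "lists": "Lists (e.g. gazetteers)",
    "rules": "Rules (e.g. JAPE rules)",
    "components": "Any additional processing components",
    "environment": "Development environment (e.g. GATE)",
}


def missing_artifacts(release_dir):
    """Return the labels of artifact categories that are absent or empty."""
    root = Path(release_dir)
    missing = []
    for dirname, label in REQUIRED_ARTIFACTS.items():
        path = root / dirname
        # A category counts as present only if its directory exists
        # and contains at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(label)
    return missing


def is_replicable(release_dir):
    """True when every required artifact category is present."""
    return not missing_artifacts(release_dir)
```

A check like this only confirms that the material has been published alongside the result; whether the result can actually be reproduced still has to be established by someone rerunning the work.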
This would:
- Contribute to the research community and build on past developments.
- Support teaching and learning.
- Encourage interchange. The Semantic Web chokes on different formats.
- Return academic research to the common (i.e. largely taxpayer-funded) good, rather than leaving it owned by the researcher or university. If someone needs to keep their work private, they should work at a company.
- Lead to distributive, collaborative research and results, reducing redundancy and increasing the scale and complexity of systems.
The knowledge bottleneck, particularly in relation to language, has not been and likely will not be solved by any one individual or research team. Open source information extraction will, I believe, make greater progress toward addressing it.
Obviously, money must be made somewhere. One source is public funding, including contributions from private organisations which see a value in building public infrastructure. Another source, as with other open source software, systems, or public information, is to make money “around” the free material by adding non-core goods, services, or advertising.
By Adam Wyner
Distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0