One of the key concepts for this work is the datument, a term coined in 2004 (P. Murray-Rust & Rzepa 2004):
The term is not widely used but it is useful here because it encapsulates the goals for an eResearch publication. For the purposes of this paper let us assume that a datument will consist of an HTML document or documents, which include machine readable graphics and data files appropriate to a discipline. How would one create such a thing?
The abstract of this paper glibly refers to 'average' and 'ordinary' researchers. This is not derogatory; it is a placeholder for the assumptions made here about which researchers we are considering. We are interested in people who, in the course of their research, wish to:
This article is taking a very narrow view of a complex field – looking at a slice through the eResearch process from the point of view of the word processor users, so the intricate interconnections of repositories and services described by the DART project (Treloar & Groenewegen 2007) are not in focus here, nor are the dynamics of researcher collaboration or the impact it will have on their writing, nor the data networks that will support this enterprise.
A reasonable assumption about the baseline technology available to a typical researcher is access to Microsoft Office: Microsoft Word for writing papers, books and theses, a reference manager which can insert in-text citations and format bibliographies, and on-campus training in how to use them. For the most part, the journals and conferences to which these authors submit will request papers in Word format, with guidance ranging from nil through to detailed templates with strict rules about how papers are to be formatted and structured and which referencing format to follow. Some disciplines and sub-disciplines use different tools, such as LaTeX or XML (Bray et al. 2006); this paper is not about those researchers, although they may face some of the same issues described here. At the end of this paper there is a call to action which includes gathering statistics about how many researchers in which disciplines use word processors in writing up their research. To give an idea of the numbers who choose word processors: at USQ around 4% of courses are authored using LaTeX and submitted camera-ready, and less than one percent are authored directly in the legacy in-house XML system which was intended to be used for all courseware. For the remainder, all editing is done by academic staff in Word or, increasingly, in OpenOffice.org.
Peter Murray-Rust, who coined the term datument with Henry Rzepa, is a senior chemist and Open Data activist with a high level of awareness of web technologies and the semantic web. He related on his blog the problems he had in submitting a datument to his university repository:
Further hurdles needed to be cleared. The paper had to be converted to a word processing format for it to be submitted to the journal, a process that necessarily lost some of its semantic richness. For example, there is no practical way to embed SVG graphics (Ferraiolo 2001) in a Microsoft Word document in such a way that downstream users will be able to render them, so graphics that were infinitely scalable and machine readable (literally: the text in an SVG picture is text that could be read out by a speech synthesiser) have been turned into bitmaps.
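As an aside on that machine readability: a short sketch (purely illustrative, not part of any ICE or journal tooling) shows how trivially the text of an SVG figure can be recovered with a standard XML parser, which is exactly what is lost when the figure becomes a bitmap.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# A minimal SVG figure with a labelled axis, standing in for a real graphic.
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <text x="10" y="20">Concentration (mol/L)</text>
  <rect x="0" y="30" width="100" height="50"/>
</svg>"""

def svg_text(document):
    """Collect the character data from every <text> element."""
    root = ET.fromstring(document)
    return [t.text for t in root.iter(SVG_NS + "text")]

print(svg_text(svg))  # ['Concentration (mol/L)']
```

A speech synthesiser, a search engine, or a screen reader can do the same; a bitmap of the identical figure yields nothing.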
So the outcome for this eResearcher was not optimal: the paper as disseminated in the journal can talk about semantic richness but has been stripped of its own, and the repository version is delivered to readers as a zipped package, not as part of the repository web site, making it impossible to realise its full potential.
If it is this hard for a technically savvy user with software development skills, how is a typical non-specialist eResearcher going to fare? Recently I was contacted by the head of one of the research centres at USQ, relaying a question from one of her graduate students: are there tools for interlinear text that are compatible with Microsoft Word? Interlinear text is a method of displaying linguistic analysis of lines of text with lines of explanatory material aligned beneath them. At its simplest, an author could use a monospaced font or tables to encode it, but this would lose a lot of the semantics of the analysis (Bow, Hughes, & Bird 2003).
There are plenty of tools to create interlinear text. The question is which one to choose to meet the goals outlined above. The first hit on a Google search turned up something that looks promising ... to me, but possibly not to the research student in question. The web site says:
So, to understand what this tool does you need to know all about the eXtensible Markup Language (XML) and Document Type Definitions (DTDs), and then work out how to embed your text into Word, how this might be rendered as HTML, linked back to a corpus in a repository, and so on. This would be a significant distraction for anyone, even given the technical skills, if all they wanted was a tool for writing up their thesis as a datument. Multiply that distraction several times if the goal was to start the document authoring on a wiki, or post work in progress to a blog. There was no time in this case to assist.
But even on a more general level, working with a word processor has challenges for researchers. They have to deal with the following:
This section describes an ongoing project I lead at the University of Southern Queensland, which is designed to produce writing tools for mainstream academic authors, as well as considering some of the outcomes from the ARCHER project. This work uses an open source content management toolkit called the Integrated Content Environment (ICE) (P. Sefton 2006b). Along with a discussion of ICE and its current shortcomings, some other tools that might help researchers are considered, including tools developed (DART project n.d.) in the ARROW, DART and ARCHER projects (Treloar & Groenewegen 2007).
This section takes a look at some of the tools that have emerged from the Australian repositories movement over the last few years, funded by the Australian government's Systemic Infrastructure Initiative: specifically the Integrated Content Environment for Research and Scholarship (ICE-RS) (P. Sefton 2006b), a project to look at how researchers can write up their results efficiently by extending the Integrated Content Environment (ICE), an open-source content management system for academia (P. Sefton 2006a).
Journals often provide templates, but these require users to re-learn how to follow a template for each new publisher, they do not solve the problem of creating HTML, and they typically do not address the complex issues around data integration.
ICE is based around word processor templates, so that authors can keep working in their word processor. There is good support for recent versions of Microsoft Word (excepting Word 2008 for the Mac platform, which lacks the required scripting language) and for the OpenOffice.org family of word processors, which run on a variety of platforms. The templates contain styles which are designed to map onto HTML, guaranteeing good quality Web 2.0-ready content; a prerequisite for datument production. ICE also provides a toolbar for the templates so that styles can be applied using the same kinds of buttons that appear on most modern editors. The ICE website has a silent screencast showing the interface in action. The idea is for authors to learn one interface and then be able to use it to write for multiple output media and publishers.
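To make that style-to-HTML mapping concrete, here is a minimal, hypothetical sketch of the kind of lookup such templates rely on. The style names and the code are illustrative only, not ICE's actual vocabulary or implementation:

```python
# Hypothetical mapping from word processor paragraph styles to HTML elements.
# A real system would also handle character styles, lists and nesting.
STYLE_MAP = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Quote":     "blockquote",
}

def paragraph_to_html(style, text):
    """Render one styled paragraph; unknown styles fall back to <p>."""
    tag = STYLE_MAP.get(style, "p")
    return f"<{tag}>{text}</{tag}>"

print(paragraph_to_html("Heading 1", "Introduction"))  # <h1>Introduction</h1>
print(paragraph_to_html("Body", "Some prose."))        # <p>Some prose.</p>
```

Because every style has a defined HTML target, the conversion is deterministic; this is what distinguishes a constrained template from arbitrary word processor output.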
As far as we know, there are no other active projects with the same aims as ICE, or with the same coverage of features. Surveys of related word processor based projects for academic writing are available in earlier papers (P. Sefton 2006b, 2007a).
Currently adapting ICE templates to journal formats is possible, but it is a manual process. There are three current cases, with a fourth potential case:
In cases 2, 3 and 4 above it would be possible to make ICE templates and conversion code available for download, or as plug-ins for the ICE server or an ICE-like service. More on this in the conclusion.
One of the key features of the ICE project is that it turns word processing documents into XHTML and PDF automatically, with the ability to include data integration. It is widely known that word processors (we are particularly concerned with Microsoft Word and OpenOffice.org Writer) do not by default produce preservation-quality XHTML, although serious consideration of this is hard to find in the literature. Most users would have little chance of producing high quality HTML from a journal template, but with an ICE-adapted version of the same template they can create datuments in HTML and still submit to the journal. The paper you are reading (or hearing) now was prepared using the ICE service.
This paper has embedded metadata. The following screenshot shows the author's details; the name Peter Sefton is marked with a meaningful style:
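A hypothetical sketch of how such style-based metadata can be harvested from a document: the style names (e.g. "p-meta-author") and the title value are invented for illustration, not ICE's real conventions.

```python
# A document modelled as (style, text) paragraph pairs. Styles beginning
# with an agreed prefix (assumed here to be "p-meta-") carry metadata.
paragraphs = [
    ("p-meta-author", "Peter Sefton"),
    ("p-meta-title",  "Towards the datument"),   # illustrative title
    ("Body",          "Ordinary prose is ignored by the harvester."),
]

def harvest_metadata(paras, prefix="p-meta-"):
    """Collect metadata fields from paragraphs carrying metadata styles."""
    return {style[len(prefix):]: text
            for style, text in paras if style.startswith(prefix)}

print(harvest_metadata(paragraphs))
# {'author': 'Peter Sefton', 'title': 'Towards the datument'}
```

The point is that the metadata lives in the document itself, applied with ordinary word processor styles, so no separate metadata entry step is needed at deposit time.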
There are alternative approaches to encoding metadata in documents: for Word, via a Microsoft project to create an XML authoring tool; and for the OpenDocument format (OASIS 2005), via new metadata support in its forthcoming version 1.2. But it is beyond the scope of this paper to speculate on how they might impact on an eResearcher, and neither has been released in complete form.
The same style-based mechanism can be used to support semantics such as geographical data. See this example, which uses an approach that can also be applied in other contexts, such as wikis or online word processors, with data embedded either using links or by marking up co-ordinates with a word processing style.
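The geo microformat that such markup maps onto is simple. A sketch, assuming a hypothetical word processor character style holding "latitude, longitude" text, of the HTML it would become:

```python
# Convert the text of a (hypothetical) "geo"-styled span into the geo
# microformat's class-based HTML representation.
def geo_span(coords_text):
    lat, lon = (part.strip() for part in coords_text.split(","))
    return (f'<span class="geo">'
            f'<span class="latitude">{lat}</span>, '
            f'<span class="longitude">{lon}</span></span>')

# Coordinates near Toowoomba, where USQ's main campus is located.
print(geo_span("-27.56, 151.95"))
```

Any microformat-aware tool (a map overlay, an aggregator) can then pick the coordinates out of the published HTML without special arrangements with the author.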
One of the key attributes of a datument is having data embedded in, or linked to, a publication in such a way that it can be retrieved intact and also viewed in medium-appropriate ways. The ICE project is developing a service-oriented framework for linking data into documents using word processor based microformats. Khare and Çelik describe microformats:
Pragmatic aptly describes our approach too, as we have to work within the limitations of not one but a number of existing software solutions, standards and formats. The ICE approach is to define conventions for data integration that can be used as flexibly as possible, so that the same or similar microformats could work in a wiki or online editor as well as in the word processor.
One early test case has been including Chemical Markup Language. Even for ICE, which can handle this kind of datument under certain circumstances, there are problems in trying to create a paper. For example, to submit a Chemical Markup Language file to a journal or a repository, one would also need to submit the Java applet that renders it. There is no indication in the author guidelines for this conference whether this would be possible, but a reasonable assumption is that the organisers would not be willing to host code that may need to be maintained, could be subject to security concerns, and so on. See this demonstration of how a live three-dimensional view of a molecule can be provided for the web using the ICE system.
Using a microformat approach means that services developed for ICE should be able to be used in other contexts, such as the wiki services provided by ARCHER (link forthcoming).
ICE can package content in a variety of ways: IMS content packages (IMS 2005) for learning object repositories or learning environments, and packages using the Australian METS profile (Pearce et al. 2008) for use in library systems. In addition, there are proof-of-concept solutions for adding ICE documents directly to repositories; for example, an ICE-based repository ingest system was presented at the OpenRepositories 2008 Repository Challenge competition (Monus et al. 2008).
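As a rough illustration of what such packaging involves, the following sketch zips an HTML rendition together with a skeleton manifest. The manifest is indicative only, not a complete, schema-valid IMS manifest, and the identifiers are invented:

```python
import zipfile

# Skeleton of an IMS-style manifest listing one web-content resource.
MANIFEST = """<?xml version="1.0"?>
<manifest identifier="paper-2008"
          xmlns="http://www.imsglobal.org/xsd/imscp_v1p1">
  <organizations/>
  <resources>
    <resource identifier="paper" type="webcontent" href="index.html">
      <file href="index.html"/>
    </resource>
  </resources>
</manifest>"""

def make_package(path, html):
    """Write a zip containing the manifest and the HTML rendition."""
    with zipfile.ZipFile(path, "w") as pkg:
        pkg.writestr("imsmanifest.xml", MANIFEST)
        pkg.writestr("index.html", html)

make_package("paper.zip", "<html><body><h1>A datument</h1></body></html>")
```

The essential idea is the same for METS packaging: the payload files plus a machine-readable inventory describing them, in one archive that a repository can ingest.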
Work has started at USQ on the ICE-TheOREM project, in collaboration with the Unilever Centre for Molecular Informatics at the University of Cambridge, working with a thesis management system to allow ICE to publish resource maps using OAI-ORE (Open Archives Initiative Object Reuse and Exchange).
ICE uses the OpenDocument Format, an OASIS and ISO standard, as a back-end storage system, helping to ensure that documents are in a preservable format. More than this, ICE is designed to use an interoperable subset of ODF, and we do a considerable amount of work to make sure that users of other systems, particularly Word, are supported.
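Part of what makes ODF attractive as a back end is its transparency: an OpenDocument file is an ordinary zip archive with the document body in content.xml, so it can be inspected with commodity tools. A toy sketch (not ICE code):

```python
import zipfile

def read_odf_part(path, part="content.xml"):
    """Pull one member (by default the document body) out of an ODF package."""
    with zipfile.ZipFile(path) as odf:
        return odf.read(part).decode("utf-8")

# Build a toy .odt to demonstrate; a real one has more members
# (mimetype, styles.xml, meta.xml, META-INF/manifest.xml, ...).
with zipfile.ZipFile("toy.odt", "w") as odf:
    odf.writestr("content.xml", "<office:document-content/>")

print(read_odf_part("toy.odt"))  # <office:document-content/>
```

This openness matters for preservation: a future reader needs only a zip tool and an XML parser to get at the content, with no dependence on any one vendor's software.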
ICE has collaborative mechanisms:
There are plenty of collaborative tools around. Microsoft Word has a change-tracking feature with which two or more authors can serially edit a document, but there are more spectacularly collaborative tools as well, such as Google Documents and Spreadsheets, which allows multiple authors to edit the same line of text at the same time; research at USQ has shown that it is actually difficult to create conflicts using the service (Dekeyser & Watson 2006). On the other hand, such services have no specific tools for academic authoring, such as the reference management that is supported in word processors, and thus in ICE.
In an eResearch context, the DART project explored the use of wikis and collaboration tools and produced some work packages, but as with ICE, the non-technical or under-resourced eResearcher would struggle to install, learn and use the tools. Help is needed for our research communities. The biggest issue here is that while authors might collaborate on text in an online application, the result then needs to be integrated back into Word, references added and so on; there is a clear need for more tools to close the gap between collaborative authoring and the word processor that is used to submit to a publisher.
ICE is currently being made a core IT system at the University of Southern Queensland (USQ), meaning that it is considered essential to the functioning of the university. This is for courseware, though, not for eResearch. A major barrier to ICE's adoption outside USQ is that it is too hard to install: it requires a number of things to be installed on a client machine, and server setup is non-trivial. This means that institutional-level support is required.
Several other problems remain:
The following diagram shows the ICE system in context, mapped onto two axes; the vertical axis is the degree of collaboration. What is needed for the eResearch community is to fill in the places in this diagram where there are dotted arrows: that is, to be able to take a wiki document and turn it into a high-quality word processing document, with data integration and citations preserved. ICE has a service-oriented architecture which exposes its conversion services to other applications, so it could act as a content exchange hub in an eResearch architecture, for example providing a text-to-speech service or a Word-to-wiki service for other applications.
Illustration 4: The Integrated Content Environment (ICE) as a content-hub
In summary, if you are an 'ordinary' Microsoft Word-wielding researcher with some or all of the nine aspirations listed above, the situation is grim.
Experiments with ICE show that it is possible to use a word processor to produce documents that begin to meet the definition of a datument; but it is not easy.
The largest issue with uptake is not that most researchers do not have the goals outlined in this paper; it is that even if they did, there is nowhere for them to put their datuments, and hence no reason to have the goals. Getting large numbers of researchers from where they are now to where they could be is a gargantuan task which involves priming an enormous engine. It means not only creating services, software, documentation and training packages that do not yet exist, but also changing the behaviour of eResearchers and repository managers. The community itself can change its own practices by creating data-integrated documents and repositories; journal publishers may take some time to respond.
Action is needed on a number of fronts. Specific projects towards the broad aims:
Bow, C., Hughes, B., & Bird, S., 2003. Towards a General Model of Interlinear Text. In Proceedings of the E-MELD Workshop on Digitizing and Annotating Texts and Field Recordings. Michigan.
Bray, T. et al., 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition), World Wide Web Consortium. Available at: http://www.w3.org/TR/REC-xml/ [Accessed April 23, 2007].
DART project, n.d. DART Work Packages and Outcomes. Available at: http://dart.edu.au/workpackages/ [Accessed May 20, 2008].
Dekeyser, S. & Watson, R., 2006. Extending Google Docs to Collaborate on Research Papers (departmental presentation). Available at: http://www.sci.usq.edu.au/research/seminars/files//seminar132/GoogleDocsSeminar.pdf.
Ferraiolo, J., 2001. Scalable Vector Graphics (SVG) 1.0 Specification, World Wide Web Consortium. Available at: http://www.w3.org/TR/2001/REC-SVG-20010904/.
Foster, N.F. & Gibbons, S., 2005. Understanding faculty to improve content recruitment for institutional repositories. D-Lib Magazine, 11(1), p.1082-9873.
IMS, 2005. IMS Content Packaging Overview Version 1.2 Public Draft. Available at: http://www.imsglobal.org/content/packaging/cpv1p2pd/imscp_oviewv1p2pd.html.
Khare, R. & Çelik, T., 2006. Microformats: a pragmatic path to the semantic web. In Proceedings of the 15th international conference on World Wide Web. Edinburgh, Scotland: ACM, p. 865-866. Available at: http://portal.acm.org/citation.cfm?id=1135777.1135917 [Accessed February 25, 2008].
Monus, L. et al., 2008. Zero Click Ingest. Available at: http://pubs.or08.ecs.soton.ac.uk/119/ [Accessed May 20, 2008].
Murray-Rust, P. & Rzepa, H.S., 2004. The Next Big Thing: From Hypermedia to Datuments. Journal of Digital Information, 5(1), p.248. Available at: http://jodi.tamu.edu/Articles/v05/i01/Murray-Rust/?printable=1.
Murray-Rust, P., 2008. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Data: Datument submitted to Elsevier’s Serials Review. petermr's blog: A scientist and the Web. Available at: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=913 [Accessed February 21, 2008].
OASIS, 2005. OpenDocument v1.0 specification. Available at: http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf.
Pearce, J. et al., 2008. The Australian METS Profile–A Journey about Metadata. D-Lib Magazine, 14(3/4), p.1082-9873.
Recordon, D. & Reed, D., 2006. OpenID 2.0: a platform for user-centric identity management. Proceedings of the second ACM workshop on Digital identity management, p.11-16.
Sefton, P., 2006a. The integrated content environment. In AusWeb06. Noosa. Available at: http://eprints.usq.edu.au/archive/00000697/01/Sefton_ICE-ausweb06-paper-revised-3.pdf.
Sefton, P., 2006b. The Integrated Content Environment for Research and Scholarship. Available at: http://ice.usq.edu.au/introduction/ice_rs.htm [Accessed April 30, 2007].
Sefton, P., 2007a. An integrated approach to preparing, publishing,
Sefton, P., 2007b. Hooking up authoring processes and tools to institutional repositories. PT's blog. Available at: http://ptsefton.com/2007/12/19/hooking-up-authoring-processes-and-tools-to-institutional-repositories.htm [Accessed February 21, 2008].
Treloar, A. & Groenewegen, D., 2007. ARROW, DART and ARCHER: A Quiver Full of Research Repository and Related Projects. Ariadne, (51). Available at: http://www.ariadne.ac.uk/issue51/treloar-groenewegen/ [Accessed May 16, 2007].