eResearch for Word users?

Peter Sefton

The University of Southern Queensland


Abstract: This paper documents the plight of 'average' modern researchers as they apply their academic writing skills in the new world of eResearch.

We might expect researchers to have mastered some of the basic generic writing tools: an office suite with a word processor and the ability to generate charts from tables of data; a reference manager that can insert citations; and tools of their discipline, like statistics packages.

But the 'ordinary' researcher who tunes in to the clamour about ideas and tools from a conference like eResearch Australia could easily be overwhelmed by the gap between the obvious potential and their own command of the technology they have to hand.

Eight things to which a tuned-in researcher might aspire: (a) to share data with colleagues, (b) to collaborate on semantically rich documents which include appropriate data visualizations, (c) to blog their research as it happens, (d) to annotate data and works in progress, (e) to submit to journals, (f) to deposit appropriate copies of papers into various discipline and institutional repositories, and (g) to produce those papers not just in PDF format but also in HTML, with rich interactivity and links to their data. They might also aspire to ensure (h) preservation of their data and their writing without accidentally choosing a doomed data format in which to store it.

The question is how do we get there from here? The starting point is using Microsoft Word with references in EndNote emailed around a workgroup then sent to a publisher. The goal is to collaborate on a document which has embedded rich semantics, such as geographical data points that can be displayed on maps and overlaid with data from other sources. The document needs to be viewed on the web with interactive maps, and annotated, tagged and commented upon, as well as being distributed as a traditional paper paper and stored in the dreaded PDF file. Finally it must be automatically deposited in appropriate repositories, one of which is a publisher's review queue.

Focussing on the writing process, this paper explores some of the aspirations listed above and offers some practical advice for researchers and their support staff. It then discusses the Integrated Content Environment, an academically focussed collaborative content management system with integration into repository systems, which can help with some of the aspirations of the modern eResearcher, although a lot of work remains to be done. Other tools are also considered and found wanting.

The conclusion suggests some more areas for research and development, addressed to research funding bodies both in the Australasian context and globally. How can our researchers get there from here?

Introduction and assumptions

One of the key concepts for this work is the datument, a term coined in 2004 (P. Murray-Rust & Rzepa 2004):

A datument is a hyperdocument for transmitting "complete" information including content and behaviour. We differentiate between "machine-readability", merely that a document such as a JPEG image can be read into a system, and "understandability", where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML). Understandability may require ontological (meaning) or semantic (behaviour) support for components. Neither are yet fully formalised but within domains it is often possible to find that certain concepts are sufficiently agreed that programs from different authors will behave in acceptable manners on the same documents. We shall assume that most scientific disciplines can, given the will, support machine-understandability for large parts of their information.

The term is not widely used but it is useful here because it encapsulates the goals for an eResearch publication. For the purposes of this paper let us assume that a datument will consist of an HTML document or documents, which include machine readable graphics and data files appropriate to a discipline. How would one create such a thing?

The abstract of this paper glibly refers to 'average' and 'ordinary' researchers. This is not derogatory; it is a placeholder for the assumptions made here about which researchers we are considering. We are interested in people who, in the course of their research, wish to:

  1. share data with colleagues,

  2. collaborate on semantically rich documents which include appropriate data visualizations,

  3. blog their research as it happens,

  4. annotate data and works in progress,

  5. submit to journals as painlessly as possible,

  6. deposit appropriate copies of papers into various discipline and institutional repositories, preferably automatically as a by-product of writing the paper in a content management application,

  7. automatically create documents in PDF and HTML, with rich interactivity and links to their data,

  8. ensure preservation of their data and their writing without accidentally choosing a doomed data format.

This article takes a very narrow view of a complex field, looking at a slice through the eResearch process from the point of view of word processor users. The intricate interconnections of repositories and services described by the DART project (Treloar & Groenewegen 2007) are therefore not in focus here, nor are the dynamics of researcher collaboration or the impact it will have on their writing, nor the data networks that will support this enterprise.

Illustration 1: DART High level architecture

Generic issues with generic tools

A reasonable assumption about the baseline technology available to a typical researcher is that they will have access to Microsoft Office, with Microsoft Word for writing papers, books and theses, along with a reference manager which can insert citations in-text and format bibliographies, and on-campus training in how to use it. For the most part, journals and conferences to which these authors submit will request that papers be submitted in Word format, giving guidance ranging from nil through to detailed templates with strict rules about how papers are to be formatted and structured and which referencing format to follow. Some disciplines and sub-disciplines use different tools, such as LaTeX or XML (Bray et al. 2006); this paper is not about those researchers, although they may face some of the same issues described here. At the end of this paper there is a call to action which includes gathering statistics about how many researchers in which disciplines use word processors in writing up their research. To give an idea of the numbers who choose to use word processors, at USQ around 4% of courses are authored using LaTeX and submitted camera-ready. Less than one percent are authored directly in the legacy in-house XML system which was intended to be used for all courseware. With the remainder all editing is done by academic staff in Word or increasingly in

Peter Murray-Rust, who coined the term datument with Henry Rzepa, is a senior chemist and Open Data activist with a high level of awareness of web technologies and the semantic web. He related on his blog the problems he had in submitting a datument to his university repository:

So, as a good Open Access advocate I have reposited it in the Cambridge DSpace. DSpace does not deal wth hyperdocuments (please tell me I'm wrong). I would have to go through all the documents and find the relative URLs and expand them to the Cambridge DSpace base URL. This, of course, means that the documents are not portable. So I had to reposit a ZIP file. 15 years after the invention of HTML and we cannot reposit HTML hyperdocuments.

[UPDATE: I have since found that it does accept HTML so we'll see how it comes out. ]

[UPDATE2: Yes, it accepts HTML, but no the links don't work. You have to know the address of each image before you deposit them. Then you have to edit the main paper to make them work. Which means it breaks if you export it. So basically you cannot reposit normal HTML in DSpace and expect it to work.]

(P. Murray-Rust 2008)
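The manual step described in the quotation above, finding each relative URL in a hyperdocument and expanding it against the repository's base URL, can be sketched in a few lines of Python. This is an illustration only and is not part of DSpace or any other repository software: the base URL and the sample markup are hypothetical, and a regular expression is used for brevity where a robust tool would use a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes with double-quoted values.
LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def expand_relative_urls(html: str, base_url: str) -> str:
    """Rewrite every href/src value as an absolute URL under base_url."""
    def expand(match: re.Match) -> str:
        attribute, url = match.group(1), match.group(2)
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        return f'{attribute}="{urljoin(base_url, url)}"'
    return LINK_ATTR.sub(expand, html)


# Hypothetical deposit location; real repositories assign their own URLs.
BASE = "https://repository.example.org/bitstream/1234/"
page = '<p><img src="figures/molecule.png"/> <a href="http://www.w3.org/">W3C</a></p>'
expanded = expand_relative_urls(page, BASE)
print(expanded)
```

As the quotation points out, a document rewritten this way is no longer portable: its links only work at that one address.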

Further hurdles needed to be cleared. The paper had to be converted to a word processing format for it to be submitted to the journal, a process that necessarily lost some of its semantic richness. For example, there is no practical way to embed SVG graphics (Ferraiolo 2001) in a Microsoft Word document in such a way that downstream users will be able to render them, so graphics that were infinitely scalable and machine readable (literally: the text in an SVG picture is text that could be read out by a speech synthesiser) have been turned into bitmaps.

So the outcome for this eResearcher was not optimal: the paper as disseminated in the journal can talk about semantic richness but has been stripped of its own richness, and the reposited version is delivered to readers as a zipped package, not as part of the web site for the repository, making it impossible to realise its full potential.

If it's this hard for a technically savvy user with software development skills, then how is a typical non-specialist eResearcher going to fare? Recently I was contacted by the head of one of the research centres at USQ, relaying a question from one of her graduate students: are there tools for interlinear text that are compatible with Microsoft Word? Interlinear text is a method of displaying linguistic analysis of lines of text with lines of explanatory material aligned beneath them. At its simplest an author could use a monospaced font or tables to encode it, but this would lose a lot of the semantics of the analysis (Bow, Hughes, & Bird 2003).

There are plenty of tools to create interlinear text. The question is which one to choose to meet the goals outlined above. The first hit on a Google search turned up something that looks promising ... to me, but possibly not to the research student in question. The web site says:

The data format is XML. No particular DTD is imposed, but the default options assume the use of the Lacito Archivage DTD. Advanced users can parametrize the program for other DTDs, and incorporate these parameters as options for end users. The following are required to adapt the program to a new DTD:

So, to understand what this tool does you need to know all about the eXtensible Markup Language (XML), and Document Type Definitions (DTDs), and then work out how to embed your text into Word, how this might be rendered as HTML, linked back to a corpus in a repository and so on. This would be a significant distraction for anyone, even given the technical skills, if all they wanted was a tool for writing-up their thesis as a datument. Multiply that distraction several times if the goal was to start the document authoring on a wiki, or post work in progress to a blog. There was no time in this case to assist.

But even on a more general level, working with a word processor has challenges for researchers. They have to deal with the following:

  1. A lack of word processing conventions and standards. Researchers have to deal with templates supplied by journals and conferences on a case by case basis.

  2. Difficulty producing the required HTML to make a semantically aware document, embed data or references to data. Each user in each discipline will be forced to work out solutions for themselves.

  3. Most journals, conferences and repositories cannot deal with datuments. Institutional repositories are typically set up to accept PDF files, which lack the semantic encoding that would make them datuments.

  4. Domain-specific data and writing tools are not made available, and moving documents from collaborative environments like wikis back to the word processor is a manual process.

What can we do now? The Integrated Content Environment, ARCHER and alternatives

This section describes an ongoing project I lead at the University of Southern Queensland, which is designed to produce writing tools for mainstream academic authors, and considers some of the outcomes from the ARCHER project. The work uses an open source content management toolkit called the Integrated Content Environment (ICE) (P. Sefton 2006b). Along with a discussion of ICE and its current shortcomings, some other tools that might help researchers are considered, including tools that were developed (DART project n.d.) in the ARROW, DART and ARCHER projects (Treloar & Groenewegen 2007).

This section takes a look at some of the tools that have emerged from the Australian repositories movement over the last few years, funded by the Australian government's Systemic Infrastructure Initiative. Specifically, it looks at a project on how researchers can write up their results in an efficient manner: the Integrated Content Environment for Research and Scholarship (ICE-RS) (P. Sefton 2006b), which was a project to extend the Integrated Content Environment (ICE), an open-source content management system for academia (P. Sefton 2006a).

Templates and user interface

Journals often provide templates, but these require users to re-learn how to follow a template for each new publisher, and they typically neither solve the problems of creating HTML nor address the complex issues around data integration.

ICE is based around templates that let authors work in their word processors. There is good support for recent versions of Microsoft Word (excepting Word 2008 for the Mac platform, because it does not have the required scripting language) and for the family of word processors, which run on a variety of platforms. The templates contain styles which are designed to map onto HTML, guaranteeing good quality web 2.0-ready content; a prerequisite for datument production. ICE also provides a toolbar for the templates so that styles can be applied using the same kinds of buttons that appear on most modern editors. The ICE website has a silent screencast showing the interface in action. The idea is for authors to learn one interface, and then to be able to use it to write for multiple output media and publishers.

As far as we know, there are no other active projects with the same aims as ICE, or with the same coverage of features. Surveys of related word processor based projects for academic writing are available in earlier papers (P. Sefton 2006b, 2007a).

Currently, adapting ICE templates to journal formats is possible, but it is a manual process. There are three current cases, with a fourth potential case:

  1. No submission format is specified. This is the case with the conference to which this paper has been submitted where the specification is 'single spaced Times New Roman, 10 pt'. I have written it using the default ICE template in the absence of any advice on the call for participation page. This involved very little work, just resetting the basic paragraph style in an ICE document to 10pt Times New Roman, then using the ICE toolbar to auto-generate a complete set of styles based on that one.

  2. A format is specified and/or a template is supplied, but it is the look of the paper that is important, not the contents. An example is a paper I submitted to a conference in 2007 (P. Sefton 2007a): a template was supplied, but the organisers, when contacted, did not mind whether the style names they suggested were used, only that the same font, margins and indents were used.

    In this case all that is required is to rename the styles supplied to match ICE's styles.

    A related example would be guidelines for submission of theses which typically specify that certain margins and line-spacing are used, but make no reference to the structure of the document, in terms of style-names. In this case it is possible to adapt or create a template so that it uses the ICE style-set and complies with the required format.

  3. The most complicated case is where a journal has a template and they care about the style names. In this case it is possible to convert the document so it can be edited in ICE, then convert it back. At this stage this is a semi-automated process at best using ad-hoc macros. More work is required.

  4. A journal could (but none so far do) accept content using the ICE style-set, no matter what the formatting, and re-format to their desired look automatically.

In cases 2, 3 and 4 above it would be possible to make ICE templates and conversion code available for download or as plug-ins for the ICE server, or for an ICE-like service. More on this in the conclusion.
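For case 2 above, renaming the styles in a supplied template amounts to applying a lookup table, and case 3 needs the same table in reverse. The sketch below illustrates the idea on a simplified model in which a document is a list of (style name, text) pairs. It does not use ICE's own code, and all the style names on both sides of the table are invented for the example.

```python
from typing import Dict, List, Tuple

Paragraph = Tuple[str, str]  # (style name, paragraph text)

# Hypothetical mapping from a journal template's style names to the
# style names used in an ICE-like template.
JOURNAL_TO_ICE: Dict[str, str] = {
    "Paper Title": "Title",
    "Section Heading": "h1",
    "Body Text": "p",
    "Block Quotation": "bq1",
}


def rename_styles(paragraphs: List[Paragraph],
                  mapping: Dict[str, str]) -> Tuple[List[Paragraph], List[str]]:
    """Return the restyled paragraphs plus any style names left unmapped."""
    restyled: List[Paragraph] = []
    unmapped: List[str] = []
    for style, text in paragraphs:
        if style in mapping:
            restyled.append((mapping[style], text))
        else:
            # Keep the paragraph unchanged and report it for manual review.
            restyled.append((style, text))
            if style not in unmapped:
                unmapped.append(style)
    return restyled, unmapped


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse table needed to convert back for submission."""
    return {ice: journal for journal, ice in mapping.items()}


draft = [("Paper Title", "eResearch for Word users?"),
         ("Body Text", "This paper documents..."),
         ("Figure Caption", "Illustration 1")]
ice_version, needs_review = rename_styles(draft, JOURNAL_TO_ICE)
round_trip, _ = rename_styles(ice_version, invert(JOURNAL_TO_ICE))
print(ice_version, needs_review)
```

The list of unmapped styles is where the manual work would concentrate; a round trip through the inverted table returns the journal's original style names.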

Conversion services

One of the key features of the ICE project is to turn word processing documents into XHTML and PDF automatically, with the ability to include data integration. It is widely known that word processors (we are particularly concerned with Microsoft Word and Writer) do not by default produce preservation-quality XHTML, although serious consideration of this is not to be found in the literature. Most users would have little chance of producing high quality HTML from a journal template, but with an ICE-adapted version of the same template they can create datuments in HTML and still submit to the journal. The paper you are reading (or hearing) now was prepared using the ICE service.
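The reason a disciplined set of styles makes clean XHTML possible is that each style can be mapped to exactly one element. A minimal sketch of that mapping follows, working on (style, text) pairs rather than real word processor files, and using hypothetical style names rather than ICE's actual style-set or conversion code:

```python
from html import escape
from typing import List, Tuple

# Hypothetical style-to-element table; a real style-set would be larger.
STYLE_TO_ELEMENT = {"Title": "h1", "h1": "h2", "p": "p", "bq1": "blockquote"}


def to_xhtml(paragraphs: List[Tuple[str, str]]) -> str:
    """Convert styled paragraphs to an XHTML fragment, one element each."""
    parts = []
    for style, text in paragraphs:
        element = STYLE_TO_ELEMENT.get(style)
        if element is None:
            # Unknown styles fall back to a plain paragraph that records
            # the original style name, so no content is silently dropped.
            parts.append(f'<p class="{escape(style, quote=True)}">{escape(text)}</p>')
        else:
            parts.append(f"<{element}>{escape(text)}</{element}>")
    return "\n".join(parts)


fragment = to_xhtml([("Title", "Datuments & documents"),
                     ("p", "Plain text."),
                     ("Sidebar", "Unmapped style.")])
print(fragment)
```

Direct formatting (bold applied by hand to fake a heading, for instance) carries no style name, which is why it cannot be converted reliably in this way.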

Embedding metadata and semantics in the document

This paper has embedded metadata. The following screenshot shows the author's details. The name Peter Sefton is marked with a meaningful style, p-meta-author-name, with similar styles for the affiliation and email address. The ICE system can process this information and expose it to other applications.

Illustration 2: Embedded Metadata using style-based microformats
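To illustrate how style-based metadata might be harvested, the following sketch collects every paragraph whose style name starts with the p-meta- prefix. Only p-meta-author-name is taken from this paper; the affiliation style name is assumed by analogy, and the (style, text) document model is a simplification rather than the way ICE reads documents.

```python
from typing import Dict, List, Tuple

META_PREFIX = "p-meta-"


def extract_metadata(paragraphs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect the text of every paragraph whose style starts with p-meta-."""
    metadata: Dict[str, List[str]] = {}
    for style, text in paragraphs:
        if style.startswith(META_PREFIX):
            field = style[len(META_PREFIX):]  # e.g. "author-name"
            # A list per field allows for several authors.
            metadata.setdefault(field, []).append(text.strip())
    return metadata


document = [
    ("Title", "eResearch for Word users?"),
    ("p-meta-author-name", "Peter Sefton"),
    # Assumed style name: only p-meta-author-name appears in the paper.
    ("p-meta-author-affiliation", "The University of Southern Queensland"),
    ("p", "Abstract: ..."),
]
metadata = extract_metadata(document)
print(metadata)
```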

There are alternative approaches to encoding metadata in documents: for Word, via a Microsoft project to create an XML authoring tool in Word; and for the OpenDocument format (OASIS 2005), via new metadata support in its forthcoming version 1.2. But it is beyond the scope of this paper to speculate on how they might impact on an eResearcher, and neither has been released in its complete form.

The same style-based mechanism can be used to support semantics such as geographical data. See this example, which uses an approach which can also be applied in other contexts, such as wikis or online word processors, with data embedded either using links or by embedding co-ordinates with a word processing style.
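One way to picture the coordinate case: a run of text such as "-27.56, 151.95", marked with a dedicated style in the word processor, could be turned into the class names of the geo microformat when the document is converted to HTML. The sketch below assumes that approach. The function name and the sample coordinates are illustrative, and it is not the implementation behind the example mentioned above.

```python
from html import escape


def geo_span(label: str, coordinates: str) -> str:
    """Render 'latitude, longitude' text as a geo microformat span."""
    lat_text, lon_text = (part.strip() for part in coordinates.split(","))
    latitude, longitude = float(lat_text), float(lon_text)
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        raise ValueError(f"coordinates out of range: {coordinates!r}")
    return (f'<span class="geo">{escape(label)} '
            f'(<span class="latitude">{latitude}</span>, '
            f'<span class="longitude">{longitude}</span>)</span>')


# Illustrative coordinates only.
marked_up = geo_span("Sample site", "-27.56, 151.95")
print(marked_up)
```

Because the result is ordinary HTML with agreed class names, a mapping service can find the point without knowing anything about the word processor that produced it.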

Some progress on embedded data support

One of the key attributes of a datument is having data embedded or linked to a publication in such a way that it can be retrieved intact, and also viewed in medium-appropriate ways. The ICE project is developing a service oriented framework for linking data into documents using word processor based microformats. Khare and Çelik describe microformats:

Microformats are a clever adaptation of semantic XHTML that makes it easier to publish, index, and extract semi-structured information such as tags, calendar entries, contact information, and reviews on the Web. This makes it a pragmatic path towards achieving the vision set forth for the Semantic Web. (Khare & Çelik 2006).

Pragmatic aptly describes our approach too, as we have to work within the limitations of not one but a number of existing software solutions, standards and formats. The ICE approach is to define conventions for data integration that can be used as flexibly as possible so that the same or similar microformats could work in a wiki or online editor as well as in the word processor.

One early test case has been including Chemical Markup Language. Even for ICE, which can handle this kind of datument under certain circumstances, there are problems trying to create a paper. For example, to submit a Chemical Markup Language file to a journal or a repository, one would need to also submit the Java applet to render it. There is no indication in the author guidelines for this conference whether this would be possible, but a reasonable assumption is that the organisers would not be willing to host code that may need to be maintained, could be subject to security concerns and so on. See this demonstration of how a live three dimensional view of a molecule can be provided for the web using the ICE system.

Using a microformat approach means that services developed for ICE should be able to be used in other contexts, such as the wiki services provided by ARCHER (link forthcoming).

Potential for repository integration

ICE can package content in a variety of ways: as IMS content packages (IMS 2005) for learning object repositories or learning environments, or, using the Australian METS profile (Pearce et al. 2008), as packages for use in library systems. In addition to this there are proof of concept solutions for adding ICE documents directly to repositories; for example, an ICE-based repository ingest system was presented at the OpenRepositories 2008 Repository Challenge competition (Monus et al. 2008).

Work has started at USQ on the ICE-TheOREM project in collaboration with the Unilever Centre for Molecular Informatics at the University of Cambridge, working with a thesis management system to allow ICE to publish resource maps using OAI ORE (Open Archives Initiative Object Reuse and Exchange).

Preservation ready

ICE uses the Open Document Format, which is an OASIS and an ISO standard, as a back-end storage system, helping to ensure that documents are in a preservable format. But more than this, ICE is designed to use an interoperable subset of ODF, and we do a considerable amount of work to make sure that users of other systems, particularly Word, are supported.

Collaboration facilities

ICE has collaborative mechanisms:

  • Inline threaded annotation.

Illustration 3: Threaded annotation inline in an ICE document

  • Publishing to a weblog, where comments can be solicited. The advantage of using ICE to do this is that the styles, references and so on are preserved, i.e. the document can retain its datument status. ICE uses the ATOM Publishing Protocol to post to blogs.
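For readers unfamiliar with the Atom Publishing Protocol, a post is an Atom entry document sent by HTTP POST to a collection URL on the blog server. The sketch below builds such an entry with XHTML content using only the Python standard library. It is a generic illustration of the format, not ICE's implementation; the title, author and body are placeholders, and the HTTP request itself is omitted.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_atom_entry(title: str, author: str, xhtml_body: str) -> bytes:
    """Build an Atom entry whose content is a well-formed XHTML fragment."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # id and updated are required elements of an Atom entry.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"urn:uuid:{uuid.uuid4()}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = (
        datetime.now(timezone.utc).isoformat())
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "xhtml"})
    # Atom requires xhtml content to be wrapped in a single XHTML div.
    content.append(ET.fromstring(f'<div xmlns="{XHTML_NS}">{xhtml_body}</div>'))
    return ET.tostring(entry, encoding="utf-8")


entry_xml = build_atom_entry("Work in progress", "A. Researcher",
                             "<p>Draft text.</p>")
print(entry_xml.decode("utf-8"))
```

Because the content travels as structured XHTML rather than pasted text, class names and links in the body can survive the trip to the blog.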

There are plenty of collaborative tools around. Microsoft Word has a change-tracking feature where two or more authors can serially edit a document, but there are more spectacularly collaborative tools as well, such as Google Documents and Spreadsheets, which allows multiple authors to edit the same line of text at the same time. Research at USQ has shown that it is actually difficult to create conflicts using the service; on the other hand, it has no specific tools for academic authoring, such as reference management, which is supported in word processors, and thus in ICE (Dekeyser & Watson 2006).

In an eResearch context, the DART project explored the use of wikis and collaboration tools and produced some work packages, but as with ICE, the non-technical or under-resourced eResearcher would struggle to install, learn and use the tools. Help is needed for our research communities. The biggest issue here is that while authors might collaborate on text in an online application, the result then needs to be integrated back into Word, references added and so on; there is a clear need to build more tools to close the gap between collaborative authoring and the word processor that is used to submit to a publisher.

ICE issues

ICE is currently being made a core IT system at the University of Southern Queensland (USQ), meaning that it is considered essential to the functioning of the university. This is for courseware, though, not for eResearch. A major barrier to ICE's adoption outside of USQ is that it is too hard to install: it requires a number of things to be installed on a client machine, and server setup is non-trivial. This means that institutional-level support is required.

Several other problems remain:

  1. Even though there are no standards for styles, ICE has an idiosyncratic set with no overlap with any style that might come with a standard word processor. This is by design, but presents a potential barrier to adoption where authors may want to use 'Standard' styles, and adapting ICE for use with a particular journal format is currently a manual process.

  2. Currently there are no downloadable journal templates for ICE, and a total lack of journals taking content in ICE.

  3. If the goal is to produce datuments, then only a tiny amount of progress has been made in setting up services and defining microformats that researchers can use to embed data and visualizations.

  4. It is too difficult to pull together globally distributed ad-hoc work groups from within institutional networks; even with the Australian Access Federation, not all collaborators are going to have a federation login. Supporting OpenID (Recordon & Reed 2006) as well may prove more flexible, as an eResearcher should be able to manage their own list of trusted OpenIDs (and OpenID providers) regardless of the institutional affiliation of their colleagues.

The following diagram shows the ICE system in context, mapped onto two axes. The vertical axis is the degree of collaborativeness. What is needed for the eResearch community is to fill in the places in this diagram where there are dotted arrows; that is, to be able to take a wiki document and turn it into a high-quality word processing document, with data integration and citations preserved. ICE has a service-oriented architecture which exposes its conversion services to other applications, so it could act as a content exchange hub in an eResearch architecture, for example providing a text to speech service or a word to wiki service for other applications.


Illustration 4: The Integrated Content Environment (ICE) as a content-hub

Conclusion: A call to action

In summary, if you are an 'ordinary' Microsoft Word wielding researcher with some or all of the eight aspirations listed above, then the situation is grim.

Experiments with ICE show that it is possible to use a word processor to produce documents that begin to meet the definition of a datument; but it is not easy.

The largest issue with uptake is not that most researchers do not have the goals outlined for this paper. It is that even if they did, there is nowhere for them to put their datuments and hence no reason to have the goals. Getting large numbers of researchers from where they are now to where they could be is a gargantuan task which involves priming an enormous engine. It involves not only creating new services, software, documentation and training packages that do not yet exist, but also changing the behaviour of eResearchers and repository managers. The community itself can change its own practices, by creating data-integrated documents and repositories; journal publishers may take some time to respond.

Action is needed on a number of fronts. Specific projects towards the broad aims:

  1. Establish a standards group in the style of the Metadata Advisory Committee for Australian Repositories (MACAR) to ensure that Australian data repositories, working document and institutional repositories can interoperate with semantic data; so that authors can create datument-style works and have them ingested in appropriate repositories along with their attendant data, while still being able to supply ordinary documents to journals and conference sites.

  2. Survey researchers and the journals they are targeting to find out what researchers use to create documents, and what the publishers and conferences they target their work at expect from them, starting from existing studies of research practice such as the work at Rochester (Foster & Gibbons 2005).

  3. Run a project to trial journal submission using an ICE-like template, initially aiming for just HTML and PDF from the same source, but working towards full support for datument-style integration of data-semantics and documents as well as supporting collaborative authoring.

  4. Run supported trials of an ICE-like process for writing theses in a small number of disciplines with the same approach to datument support as above, with a view to running a program like the Australasian Digital Thesis program. A long term goal would be to replace the current mishmash of advice offered at an institutional and departmental level with a national standard for structuring a thesis using a word processor in such a way that it is automatically a datument and can be ingested into a repository not just as a monolithic blob of PDF, but also as an HTML document.

  5. Establish a central resource where authoring resources for eResearchers and technical staff can be posted and/or referenced.

    1. Links to software such as ICE and the software outcomes of the ARCHER project.

    2. Documents along the lines of How do I write my thesis using a word processor? (A start has been made on these under the ICE project).

    3. Answers to the question how do I embed X in my document (datument), where X might be interlinear text, or Chemical Markup Language, or geographical co-ordinates or rainfall data or any number of commonly used data types and formats that many, many researchers would need to add in to their documents should they aspire to create datuments.

    4. Downloadable templates for various journals that can be used to create datuments, not just documents, using a similar model to the Zotero system for adding citation styles.


Bow, C., Hughes, B., & Bird, S., 2003. Towards a General Model of Interlinear Text. In Proceedings of the E-Meld Workshop on Digitizing and Annotating Texts and Field Recordings. Michigan.

Bray, T. et al., 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition), World Wide Web Consortium. Available at: [Accessed April 23, 2007].

DART project, DART Work Packages and Outcomes. Available at: [Accessed May 20, 2008].

Dekeyser, S. & Watson, R., 2006. Extending Google Docs to Collaborate on Research Papers (departmental presentation). Available at:

Ferraiolo, J., 2001. Scalable Vector Graphics (SVG) 1.0 Specification, World Wide Web Consortium. Available at: .

Foster, N.F. & Gibbons, S., 2005. Understanding faculty to improve content recruitment for institutional repositories. D-Lib Magazine, 11(1), p.1082-9873.

IMS, 2005. IMS Content Packaging Overview Version 1.2 Public Draft. Available at: .

Khare, R. & Çelik, T., 2006. Microformats: a pragmatic path to the semantic web. In Proceedings of the 15th international conference on World Wide Web. Edinburgh, Scotland: ACM, p. 865-866. Available at: [Accessed February 25, 2008].

Monus, L. et al., 2008. Zero Click Ingest. Available at: [Accessed May 20, 2008].

Murray-Rust, P. & Rzepa, H.S., 2004. The Next Big Thing: From Hypermedia to Datuments. Journal of Digital Information, 5(1), p.248. Available at:

Murray-Rust, P., 2008. Unilever Centre for Molecular Informatics, Cambridge - petermr's blog » Blog Archive » Open Data: Datument submitted to Elsevier's Serials Review. petermr's blog: A scientist and the Web. Available at: [Accessed February 21, 2008].

OASIS, 2005. OpenDocument v1.0 specification. Available at: .

Pearce, J. et al., 2008. The Australian METS Profile - A Journey about Metadata. D-Lib Magazine, 14(3/4), p.1082-9873.

Recordon, D. & Reed, D., 2006. OpenID 2.0: a platform for user-centric identity management. Proceedings of the second ACM workshop on Digital identity management, p.11-16.

Sefton, P., 2006a. The integrated content environment. In Noosa. Available at:

Sefton, P., 2006b. The Integrated Content Environment for Research and Scholarship. Available at: [Accessed April 30, 2007].

Sefton, P., 2007a. An integrated approach to preparing, publishing, presenting and preserving theses. Available at: [Accessed July 2, 2007].

Sefton, P., 2007b. Hooking up authoring processes and tools to institutional repositories. PT's blog. Available at: [Accessed February 21, 2008].

Treloar, A. & Groenewegen, D., 2007. ARROW, DART and ARCHER: A Quiver Full of Research Repository and Related Projects. Ariadne, (51). Available at: [Accessed May 16, 2007].