Slide 1: Publish for Pleasure This talk is about how I am currently creating research publications and why I think the process and tools that I use are absolutely genius. Slide 2: Disclaimer I acknowledge that not everyone will have the technical background or interest or freedom to indulge in these technologies. Slide 3: Notation I will use the word "document" to describe the piece of work that I create to document my research. This is instead of more loaded words like report (which implies something less formal or worthy) or article (which implies something that must be published in a journal). Slide 4: Three Things I Care About I care about not wasting my time and effort. I care about producing something good. I care about everyone being able to access my work and pass it on to others. I care about other people being able to repeat what I did and build on what I did. SIX things I care about! Slide 5: Part I This section covers tools that I use to *create* a document. Slide 6: Text Files This is an example of the sort of text file that I write to create a document. EVERYTHING within the file is text. Images are references to external files. The file consists of content plus a description of the structure of the content (e.g., what is a heading, what is normal text and what is code, ...). This is the same idea as a LaTeX file or a Markdown file, but NOT the same as a Word document. Slide 7: Text Files I do not write the final document; I write a text description of the final document, which is processed by various software tools to generate the final document. MANY programs can create and modify text files, so you are not tied to a particular piece of software to write your document AND you are not placing any burden on anyone else who wants to view or use your document. It is easy to write code to generate text, which means we can automate the generation of parts of a document. This is the basis for tools like 'Sweave' and 'knitr' in R. 
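As a minimal sketch of code generating part of a document, the sentence that reports a result can be computed instead of typed by hand. This is written in Python rather than the R tools named above; the data and the helper name `summary_paragraph` are invented for illustration.

```python
from statistics import mean


def summary_paragraph(label, values):
    """Return an HTML paragraph whose numbers are computed from the data."""
    return "<p>The mean {} over {} observations was {:.2f}.</p>".format(
        label, len(values), mean(values))


# Invented example data; rerunning with new data regenerates the text.
waiting_times = [2.0, 3.5, 4.0, 2.5]
fragment = summary_paragraph("waiting time", waiting_times)
print(fragment)
```

Because the fragment is plain text, it can be pasted or piped into the document source by any tool.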
Writing text also means that I can write exactly what I mean and I can write complicated things - this becomes more important when I layer XML structure on top ... Slide 8: XML This is an example of the sort of XML that I write to create a document. It is basically HTML, plus some tags that I made up for my own convenience. It is VERY easy to transform this to pure HTML. It is also very easy to transform it to other things (e.g., .Rhtml). Slide 9: XML One weakness of text files is that the content can be unstructured, which makes it hard to process. XML solves this problem by using tags to label the structure of the document. Tools like XPath and XSLT take advantage of that structure to provide excellent processing tools. This means that we can do more than just process from XML to a publication format (e.g., HTML). We can also process to other formats, e.g., these slides come from an XML file that has been transformed to HTML, but the same XML file also gets transformed to produce speaker notes (and a version for printing as handouts). The fact that XML is an Open Standard means that there are lots of editors and processing tools for working with XML (so a low burden for me and for anyone I want to share or collaborate with). In effect, I create a document by writing code and the computer creates the document from my code. BUT the code that I am writing is in a very simple language. I use XML rather than Markdown because Markdown does not give me enough control over the final HTML format and because Markdown is limited to generating the final format, whereas XML allows for a much wider range of transformations. The source file for these slides is an XML document that is transformed to HTML for the slides and a .txt file for speaker notes (and a modified HTML document for printing handouts). 
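The idea of one XML source feeding several outputs can be sketched as follows. This is only an illustration in Python using the standard library; the element names (`talk`, `slide`, `title`, `notes`) and both functions are assumptions made up for the sketch, not the format or the tools actually used for these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical source format: the element names are invented for this sketch.
SOURCE = """<talk>
  <slide><title>Text Files</title><notes>Everything is text.</notes></slide>
  <slide><title>XML</title><notes>Tags label the structure.</notes></slide>
</talk>"""


def to_html(talk):
    """Build the slide view: one heading per slide, notes left out."""
    body = ET.Element("body")
    for slide in talk.findall("slide"):
        ET.SubElement(body, "h1").text = slide.findtext("title")
    return ET.tostring(body, encoding="unicode")


def to_notes(talk):
    """Build the speaker notes as plain text."""
    lines = []
    for number, slide in enumerate(talk.findall("slide"), start=1):
        lines.append("Slide {}: {}".format(number, slide.findtext("title")))
        lines.append(slide.findtext("notes"))
    return "\n".join(lines)


talk = ET.fromstring(SOURCE)
html = to_html(talk)
notes = to_notes(talk)
print(html)
print(notes)
```

The tags are what make this possible: each output function simply selects the parts of the structure that it needs.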
However, Markdown also works if XML is too much, especially if you have the freedom to select Markdown as part of your workflow and you have the freedom to make use of the final format that Markdown produces for you. LaTeX just does not process as easily as XML and it only easily transforms to PDF. Slide 10: XPath and XSLT This is an example of the XPath and XSLT code used to transform an XML document (to an .Rhtml file). The XPath bit is '//rcode', which matches ANY rcode element ANYWHERE in the document. The XSLT bit is everything else, which says, IF you get a match to '//rcode', start an XML comment, followed by 'begin.rcode', followed by the content of the rcode element, followed by 'end.rcode', followed by the end of the XML comment. The 'xsltproc' program can be used to apply the XSLT code to the XML document. Slide 11: XPath and XSLT This is an example of the .Rhtml file that is produced by processing the XML document that I wrote using XPath and XSLT. Slide 12: Literate Documents A literate document allows code chunks to be included in the document so that, when the document is processed, the code can be run to generate some of the document content. Slide 13: Literate Documents The process is now: I write an XML document, then I transform it with XPath and XSLT to a .Rhtml document, then I transform it again with 'knitr' to generate an HTML document. Slide 14: HTML This is an example of the HTML that is produced by processing the .Rhtml file that was produced by processing the literate XML document that I wrote. Slide 15: HTML The final document that I produce is an HTML document. HTML does not do typesetting as well as LaTeX does, but you can still produce a nice-looking result (like these slides), usually by making use of someone else's efforts with CSS. HTML is a great publication format because it is an Open Standard, so there are lots of (free) viewers (including web browsers), so no burden is placed on your audience. Oh, and it's still text. 
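The transformation described for Slide 10 (find every rcode element, then emit a comment containing 'begin.rcode', the element content, and 'end.rcode') can be sketched in Python. This is a stand-in for the XSLT stylesheet and 'xsltproc', not a copy of them; the sample input and the function name `rcode_to_comments` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Invented input: a document with one rcode element between two paragraphs.
SOURCE = "<body><p>A plot:</p><rcode>plot(1:10)</rcode><p>Done.</p></body>"


def rcode_to_comments(root):
    """Replace each rcode element, anywhere in the tree, with a comment."""
    # ElementTree has no parent pointers, so visit every element as a parent.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "rcode":
                text = "begin.rcode\n{}\nend.rcode".format(child.text or "")
                comment = ET.Comment(text)
                comment.tail = child.tail  # keep any text that followed rcode
                parent[index] = comment
    return root


root = rcode_to_comments(ET.fromstring(SOURCE))
rhtml = ET.tostring(root, encoding="unicode")
print(rhtml)
```

The result has the shape of the .Rhtml file on Slide 11: ordinary HTML, with the R code held in comments for 'knitr' to find and run.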
Where HTML nails other options (like PDF) is that it is easy to produce dynamic and interactive effects in HTML. HTML is also text and (informal) XML, so inherits all of their nice editing and processing properties. For example, suppose that the tool I used to generate the final HTML document (from my XML document) does not produce EXACTLY what I want; with HTML, I can easily tweak the final result with further processing (much more easily than I could if I had generated a PDF document as my final document). The relevance of HTML as a publication format is demonstrated by the fact that traditional publishers now offer HTML versions of articles online. Some web browsers now have "native" support for viewing PDFs, but not as part of a web page. PDF is not a web format. Slide 16: HTML This is an example of an HTML report with an interactive feature: click on the plus to toggle visibility of the code chunks. Slide 17: SVG If the publication format is HTML, then the best format for images is SVG. It is vector (rather than raster), so it looks good at any size, it is an Open Standard so there is lots of software support (all browsers now have native support), PLUS it is XML with all the benefits previously mentioned about structured text. It is also easy to add dynamic and interactive features to an SVG image, which is handy if you are writing a document that describes the creation of dynamic and interactive statistical graphics. Slide 18: SVG This is an example of an SVG image (within an HTML document) with an interactive feature: drag the blue rectangle to scroll the window shown in the large plot. Slide 19: Part I Summary This process is similar to the R Markdown workflow that we have our 20x students use (right?), EXCEPT that I am working at a lower level, so it is more flexible, and I understand more about what is happening, and I have more of a focus on HTML as the final document format and SVG as the graphics format. 
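To make the Slide 17 point that SVG is just structured text concrete, here is a small sketch that writes an SVG image with ordinary XML tools. It is a Python illustration only; the function name `bar_chart`, the sizes, and the data are invented, and the values are assumed to be no larger than the image height.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def bar_chart(values, bar_width=20, height=100):
    """Return SVG text with one rectangle per value (values <= height)."""
    # Serialise the SVG namespace as the default namespace (no prefix).
    ET.register_namespace("", SVG_NS)
    svg = ET.Element("{%s}svg" % SVG_NS, {
        "width": str(bar_width * len(values)),
        "height": str(height),
    })
    for i, value in enumerate(values):
        ET.SubElement(svg, "{%s}rect" % SVG_NS, {
            "x": str(i * bar_width),
            "y": str(height - value),  # SVG y coordinates grow downwards
            "width": str(bar_width - 2),
            "height": str(value),
            "fill": "steelblue",
        })
    return ET.tostring(svg, encoding="unicode")


chart = bar_chart([30, 60, 90])
print(chart)
```

The output is text that a browser can draw at any size, and that the same XML tools can read back in and modify.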
Slide 20: Part II Having created a document using efficient and effective tools, we now turn to the issue of disseminating the document. The tools in this section focus on making it as efficient as possible for others to access and make use of my work. Slide 21: Electronic distribution Electronic distribution implies BOTH an electronic format (rather than hard copy) AND availability on the world wide web. The best way (the only sensible way) to publish a document is in electronic format (rather than print). Copies are (virtually) free, copying is fast and *exact*, and we gain features like colour and interactivity. It is not controversial to claim that an electronic format is good, but how many of you are still preparing documents for print (e.g., PDFs in A4 page format)? I enjoy writing a document with the screen as my main format (e.g., freely using colour and interactive content). It is possible to format HTML nicely for print, but the screen format is now the primary concern. JSS was originally aiming for HTML, but backslid to LaTeX/PDF. Slide 22: Electronic distribution Distribution of an electronic document via the web means that the publication can be accessed from virtually anywhere virtually instantly. Distribution on the web can be just a matter of placing material on a public web server, but existing search engines and social media can further increase the visibility and discoverability of the material. People care about these things - the web gives them to us for free. Again, it is not crazy to suggest that putting a document on the web is good, but how many of you are writing for the web (e.g., producing HTML documents or interactive SVGs)? I enjoy writing documents for the web. Slide 23: Creative Commons If you are conducting publicly funded research, the results have already been paid for. It makes sense to provide the results as openly as possible. The CC BY licence fits this situation perfectly. 
Slide 24: Traditional Transfer of Copyright In the traditional publishing model the copyright is signed over to the publisher and they limit access in order to charge for access (business model). Note that the transfer of copyright is quite substantial and quite persistent. My children are unlikely to live long enough to see my article enter the public domain. This is not good. I enjoy publishing documents with a CC BY licence. Slide 25: Creative Commons Contrast the expressions used here with those used in the traditional publisher contract. These words are compatible with sharing and unrestricted reuse. This is good. Slide 26: DIY Publishing Publishing outside of a traditional journal makes it possible to really take advantage of the available tools. We are not restricted by journal format rules, we are not hindered by the slow peer review process, we are not restricted by journal copyright assertions. If we no longer need to publish through a journal, we can think about escaping other artificial constraints like bundling articles into volumes or issues. A publication can be published on its own. For example, the Journal of Statistical Software ONLY publishes in electronic format and it publishes individual articles. JSS and R Journal still dictate the format (PDF) and still have a slow review process. I withdrew an article from JSS after 2 years in the review process because the software had changed so much that the article had essentially become a bunch of lies! Slide 27: DIY Publishing If you don't want to run your own web server, the department has a Technical Blog where you can easily publish a document. My latest publication is a technical report published on the department's technical blog. I now have authored (or co-authored) 29 publications on the technical blog. DIY publishing allows for a greater variety of publishing models - rather than a one-size-fits-all journal article, we can have shorter or longer pieces of work. 
Smaller publications allow for documenting smaller pieces of work, such as student projects. More than 10 of my publications on the technical blog are based on student projects (BScHons, Masters, or Summer Scholarship). There are several more that are single-authored student publications (PhD or research assistant). I could also have used something like arXiv, though that is still focused on preprints of print articles. Slide 28: Research Repositories In addition to making a document available *now*, we should be concerned with the document *remaining* available. There are independent services like figshare. These promise to provide persistent storage and increase visibility for works. UoA has its own figshare portal. UoA library has ResearchSpace. This has the advantage of having a reasonable chance of existing for as long as UoA exists. Slide 29: figshare Slide 30: figshare Slide 31: Research Space Slide 32: Research Space Slide 33: Part II Summary Slide 34: Part III We have already talked about literate documents and tools like 'knitr', which make it easy to create a document that can be reproduced by someone other than the original author. This part of the talk describes some other important pieces that we can share. The focus is still on others having access to my work, but we have moved on to allowing others to do more than just *view* the work. Slide 35: Supplementary Material EVERYTHING needed to reproduce the report is available online under a permissive licence. The open standard text formats and permissive licences also make it easy to create new work based on these materials. Slide 36: Supplementary Material This shows the list of materials provided for one of my recent publications. A lot of these materials have nice features like the fact that they are all text files. However, even with all of these resources, there is no guarantee that someone else has all of the tools available to work with them (I work on Linux). This is where Docker comes in ... 
Slide 37: Docker A Dockerfile is a text description of a computer (operating system and installed programs); a sort of virtual machine. Slide 38: Docker We can build a Docker image from a Dockerfile, then we can run that image (virtual machine) and run a command within the virtual machine. We can also publish the Docker image on the internet (DockerHub) so that others can easily access and reuse the virtual machine (e.g., to reproduce the research document). Slide 39: Docker Hub Slide 40: Part III Summary Slide 41: Part IV This section is about "recognition" in the sense of avoiding confusion and ambiguity with regard to who wrote a document. There are a couple of slides at the end that briefly address "recognition" in the sense of being rewarded for work (with fame and professional advancement). Slide 42: ORCID ORCiD provides a unique identifier for every researcher, so that I cannot be confused with someone else, just in case there is another Paul Murrell in the world who happens to specialise in producing slow statistical graphics software. Slide 43: Summary Technology is allowing us to do more and more of the publishing process ourselves. DIY publishing means greater access to and freedom to explore these technologies. Another way to look at it is that these technologies allow you to work (on your publication) more like a programmer - yet another reason for computing to have a greater presence in our curriculum! Slide 44: Summary #2 My *primary* concern is publishing and sharing my work. The problem of *measuring* my worth is *secondary*. I believe that is the correct order. Recently, I have been conducting small pieces of work, consisting of adding new features to an R package. Within a very short time, I can develop the R package, make it available, write a document describing the changes and demonstrating their use (if only as a reminder to myself of what I have done!), and make that document available. 
The entire research cycle can occur within the space of a couple of weeks. Slide 45: Acknowledgements Slide 46: Technology Roll of Honour Slide 47: Finis Slide 48: Peer review If you want independent peer review, there are already services offering that (for a price), e.g., Rubriq. Slide 49: Metrics Once it becomes easy to publish, number of publications is less meaningful. The emphasis is more on quality, as measured by things like citations, or other measures of how much your work is used and valued by others. Slide 50: