Heilmeier criteria

What are you trying to do? Articulate your objectives using absolutely no jargon. What is the problem? Why is it hard?

Develop a new way of delivering the full cycle of data analytics to natural scientists, for diverse uses in scientific discovery and development. We want to do this in a manner that allows much more rapid development and deployment, incorporation of the state of the art, and, crucially, the ability to cross organisational and technological boundaries.
In addition to building a platform that provides this service to other CSIRO business units, we propose, at the same time (not as a separate exercise), to capitalise on the network-based technology stack we build. We will architect it for IP segmentation, with a view to creating spin-out companies and licensing opportunities with external parties.
The problem is that, at present, most of the work is bespoke, much of it built on an architectural model that presumes the installation of software on a user's computer, and as a consequence it often entails much reimplementation. Furthermore, many critical problems are simply not solved: the complex particularities of scientific needs are not adequately addressed by simply importing tools developed for other processes; it is very difficult to build end-to-end pipelines across different walled gardens (isolated technology components that do not talk to each other); and some aspects (such as proper uncertainty management, provenance management, and the handling of confidential data) are not dealt with at all.

It is hard because data sources can be distributed and are highly heterogeneous, there are very many different tasks that scientists need to perform, and it is necessary to propagate rich and complex information (for example uncertainty and provenance) between tools. Furthermore, there is not yet an adequate language for describing and cataloguing end-user problems in a way that links easily to the extant catalogues of techniques.
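To make the propagation problem concrete, here is a minimal sketch of a value that carries its uncertainty and provenance through pipeline steps. All names here are hypothetical, invented for illustration; a real system would need full distributions, correlated errors, and standardised provenance records rather than first-order error propagation on a single scalar.

```python
class TrackedValue:
    """A measurement that carries its uncertainty and its processing history.

    Illustrative sketch only: real pipelines would exchange much richer
    metadata between independently developed tools.
    """

    def __init__(self, value, std, provenance):
        self.value = value
        self.std = std                      # one-sigma uncertainty
        self.provenance = list(provenance)  # ordered processing history

    def apply(self, fn, deriv, step_name):
        """Push the value through fn, propagating uncertainty to first
        order (|f'(x)| * sigma) and appending the step to the trail."""
        return TrackedValue(
            fn(self.value),
            abs(deriv(self.value)) * self.std,
            self.provenance + [step_name],
        )


# A sensor reading passed through a linear calibration step.
reading = TrackedValue(300.0, 2.0, ["sensor-A"])
calibrated = reading.apply(lambda x: 1.1 * x, lambda x: 1.1, "calibrate-v1")
print(calibrated.value, calibrated.std, calibrated.provenance)
# 330.0 2.2 ['sensor-A', 'calibrate-v1']
```

The point of the sketch is that once such information rides along with the data, any downstream tool that honours the same interface preserves it without bespoke glue code.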

How is it done today, and what are the limits of current practice?

Today it is done more as a craft than as an industrial science. Typically scientists have to install various packages that do not necessarily work well with each other and are hard to maintain, and multiple packages are usually needed to cover all the required tasks. These packages do not handle provenance and uncertainty in a systematic way. The monolithic approach is intrinsically not very agile, making development cycles long, and current approaches to dealing with confidential information do not scale well across organisational boundaries.

What's new in your approach and why do you think it will be successful?

We start from the outset with the view that the solution needs to be network-centric, pluralistic, composable, and agile. There will be a strong focus on end-user problems, rather than the traditional technique-driven methodology.
We have done prior work illustrating key elements of the approach (National Map, and a proof of concept of machine learning as a service). Furthermore, the approach is much closer to the large-scale and very effective deployments of internet commerce companies, so we will be able to build on the deep technology stacks they have developed.
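As a toy illustration of the composable style (all function names here are invented for the sketch, not components of any proposed system), independent analysis steps that share one small interface can be chained freely and swapped individually, in contrast to a monolithic installed package:

```python
from functools import reduce


def clean(records):
    """Drop missing records (stand-in for a data-cleaning service)."""
    return [r for r in records if r is not None]


def normalise(records):
    """Rescale records to [0, 1] (stand-in for a transformation service)."""
    lo, hi = min(records), max(records)
    return [(r - lo) / (hi - lo) for r in records]


def summarise(records):
    """Report simple statistics (stand-in for an analytics service)."""
    return {"n": len(records), "mean": sum(records) / len(records)}


def pipeline(*steps):
    """Compose steps left to right; any step can be replaced or reused."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


run = pipeline(clean, normalise, summarise)
print(run([3.0, None, 1.0, 2.0]))
# {'n': 3, 'mean': 0.5}
```

In the network-centric setting each step would be a service behind a uniform interface rather than a local function, but the composition principle is the same.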

Who cares?

The other CSIRO business units with whom Data61 is partnering care because they need this diverse analytics capability to conduct their core business.
Data61 cares because we want to push forward the science and technology of networked full-stack data analytics.
Other scientific research organisations will care because we will be able to share much of what we do with them.
CSIRO as a whole will care because we simultaneously propose to capitalise on the technology developed in a manner that leads to economic surpluses.
The business world will care because many of the problems we will develop solutions for have their analogues in business. By solving them in the domain of science first, we will be able to transfer those solutions, with the consequent opportunity to create new businesses and realise substantial economic gains.

If you're successful, what difference will it make? What impact will success have? How will it be measured?

We will accelerate, and make easier, the pursuit of diverse scientific challenges within CSIRO. We will put Australia on the map as the leading place to do complex scientific data analytics. Success will be measured by uptake across CSIRO and other research organisations, and ultimately by economic performance measures on the developed technology stack.

What are the risks and the payoffs?

A key risk is that the project is too complex and gets bogged down in planning rather than executing. Another risk is that the technology runs off in a direction orthogonal to the needs of end users. The adoption of an agile methodology and the presence of the steering committee will strongly mitigate both risks.

Another risk is that alternatives will be developed elsewhere in the world. Our view is that this risk is modest given the ambitions of the other players in this space. The payoff could be an enormous increase in the efficiency of scientific discovery and development within CSIRO, the global positioning of Australia in this space, and the opportunity for high-value economic outcomes from capitalising on the technology.
A final risk is competing demands on Data61's time. The way the project is architected (to maximise re-use and multiple use, adopting a stack of networkable technological components rather than a massive monolithic platform) will mitigate this.

How much will it cost?

The initial investment is $15M per year, rising to $20M per year by year 5. This level of investment is comparable to the $100M "Big Bet" investments IBM has used to develop new capabilities.

How long will it take?

The whole vision will take at least five years, but the very point of the agile methodology is that we will deliver usable and valuable technology early on, and continually refine it through continuous integration.

What are the midterm and final "exams" to check for success? How will progress be measured?

There will be many small exams throughout the agile process. But the key dimensions are: 1) are the tools useful for what scientists need to do; 2) does the work push the state of the art forward, as measured by impact in the data analytics community; and 3) is the developed technology sufficiently attractive that we can capitalise on it economically through licensing or business creation.
