Application
As alluded to in the snowpack example above, environmental data collection is shifting from a regime dominated by field stations maintained over long periods of time to one of near-real-time global monitoring by satellite instruments. A tremendous amount of data on the state of the planet is produced each day, providing a rich time series with which to understand global change. To turn these data into meaningful quantitative information, however, the field of earth science must move into the so-called "fourth paradigm" of science: that of large-scale data analysis. In other words, we must employ algorithms capable of operating on planetary-scale datasets, and given the quantity of information to be processed, these algorithms are necessarily parallel.
One ongoing attempt to provide access to the resources necessary for analyzing this information is Google Earth Engine. This application likely differs from much of what we'll study in this course because of its goals: first, it is not a simulation engine but rather a combined data-storage and analysis tool, and second, its architecture is designed to be accessible to earth scientists without computer science degrees. The latter is accomplished by abstracting much of the parallelization and optimization away from direct user control. This gives users without a CS background the power to perform analyses far more expensive than they otherwise could, yet it also limits customizability.
The platform consists of three co-located units: (a) data storage, in which petabytes of earth imagery and model results reside; (b) distributed computation resources; and (c) a JavaScript-based query language used to access the data and the engine. In this query-based structure, requests for data and operations are submitted from the client side to Google's servers, where the code is compiled just-in-time with a "lazy computation" mechanism: image tiles are only fetched, and operations only performed, when needed by a subsequent request. The optimized, compiled code is then executed on a server-side cluster of 66,000 CPUs, and the results are fed back to the client. Unfortunately, aside from the number of CPUs (which may be an outdated figure), I don't have the specs of that supercomputer on hand and am not sure whether they are publicly available.
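To make the "lazy computation" idea concrete, here is a minimal, self-contained JavaScript sketch of the underlying pattern. This is not the actual Earth Engine API: the `LazyValue` class, its methods, and the "band" values are all hypothetical stand-ins. The point it illustrates is that client-side calls only build a deferred expression graph, and no work happens until a result is explicitly requested, analogous to Earth Engine fetching tiles and running operations only when a downstream request needs them.

```javascript
// Sketch of deferred ("lazy") computation: method calls build a graph
// of thunks; evaluation happens only when a result is requested.
class LazyValue {
  constructor(computeFn) {
    this.computeFn = computeFn; // deferred computation for this node
    this.evaluated = false;     // tracks whether work was actually done
    this.cache = undefined;     // memoized result after first evaluation
  }

  // Wrap a plain value in a lazy node.
  static of(x) {
    return new LazyValue(() => x);
  }

  // Build a new graph node from a one-input operation; no arithmetic
  // happens here -- we only record what to do later.
  map(fn) {
    return new LazyValue(() => fn(this.getInfo()));
  }

  // Build a node combining two lazy inputs, again without evaluating.
  combine(other, fn) {
    return new LazyValue(() => fn(this.getInfo(), other.getInfo()));
  }

  // Evaluation is triggered only by an explicit request, loosely
  // analogous to asking the server for a concrete result.
  getInfo() {
    if (!this.evaluated) {
      this.cache = this.computeFn();
      this.evaluated = true;
    }
    return this.cache;
  }
}

// Describe a pipeline over two hypothetical "band" values:
// scale one band, then sum it with the other.
const band1 = LazyValue.of(10);
const band2 = LazyValue.of(32);
const scaled = band1.map((x) => x * 2);
const summed = scaled.combine(band2, (a, b) => a + b);

console.log(scaled.evaluated); // false: nothing computed yet
console.log(summed.getInfo()); // 52: the whole graph is evaluated on demand
console.log(scaled.evaluated); // true: the upstream node was pulled in
```

A consequence of this design, which matters for the debugging difficulties described below, is that the site where an expression is *written* can be far from the site where it is finally *evaluated*, so errors and timeouts surface only at the final request.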
From personal experience, the platform (which is continuously being developed) is extremely powerful, yet the abstraction of the parallelization can limit its performance in certain use cases and makes debugging quite difficult. Because the platform is free for all research-focused users, Google imposes strict time limits on individual steps within algorithms. When these limits are hit, it can be difficult to determine which steps are causing the timeout and how the issue might be solved. This has happened to me with a particular use case, and in this class I'd like to learn how I might use a different supercomputer (where I have control over the parallelization strategy) to address such cases. Nevertheless, a data catalog co-located with an extremely powerful and easily accessed computation engine has enabled analyses that I would never have been able to perform otherwise. Overall, I would argue that this tool gives tremendous power to the earth science field by providing easier access to both large-scale environmental datasets and the parallel computing resources needed to analyze them.