With the widespread adoption of data analysis in scientific fields such as climatology, medicine, astronomy, and astrophysics, the scientific community increasingly recognizes the need for an appropriate analytics infrastructure. Appropriate tools and applications are required to process the large volume of data that researchers collect and generate. One of the biggest challenges is that these tools must be assembled and adapted for specific domains. Bioclimatic data is one field that still has much to improve in this respect: little effort has gone into providing methodologies and tools that help researchers understand the complex phenomena through which environmental variables influence the planet's biodiversity. The purpose of this work is therefore to propose a big data analytics architecture, organized as a software ecosystem, that systematizes and simplifies the work of scientists dealing with the complexity of bioclimatic data analysis. It provides tools for storage, management, analysis with machine learning and data mining algorithms, and visualization. Methodologically, we carried out a thorough literature review to identify the most widely used tools and to assess the suitability of each for this purpose. The literature also pointed to methodologies for implementing software ecosystems, which guided the architecture design. Within the architecture, we attempted to gather a set of bioclimatic data drawn from a subset of the Atmospheric Radiation Measurement (ARM) data repository for climatic data and from the Brazilian Biodiversity Portal for biodiversity data.
As a result, we assembled a set of tools covering data access (Cassandra), distributed processing (Spark), a programming interface (Jupyter Notebook), system modules for data format conversion, machine learning libraries, and data visualization software. This research discusses the importance of domain-specific design of a data analysis architecture for bioclimatic data. We conclude that this type of ecosystem is essential to facilitate the research process and improve the quality of the results.

Mike Frame

and 2 more

In 2013, the Office of Science and Technology Policy (OSTP) issued a memorandum directing Federal agencies with over $100 million in annual research and development expenditures to develop a plan to support increased access to federally funded research results. In response, the US Geological Survey (USGS) developed a Public Access Plan and published four new data management policies. The policies focus on review and approval of scientific data supporting scholarly conclusions, requirements for metadata, preservation, and data management planning. The new policies, in conjunction with the Public Access Plan, represent a cultural shift in how the USGS manages and provides access to its science data. The USGS recognizes that successful implementation of these new policies requires multiple pillars of support, from leadership and staff buy-in to effective tools. Active community engagement in the Bureau is stimulated through the Community for Data Integration (CDI), an open forum for community discussion and engagement and an important component in creating buy-in and contributing to the success of the new policies. Also critical is a suite of tools available to scientists to ensure their ability to implement the policies. Finally, leadership support is a critical component; it is manifested in the Fundamental Science Practices Advisory Council (FSPAC), a committee of representatives from across the Bureau who preside over policies and guidance. Although the work is far from complete, the USGS has shifted its approach to science data management by engaging the community, offering tools to support policy, and providing leadership support for the quality and scientific integrity of USGS science data.
