# Final report: Multivariate statistics for geothermal system prediction from some areas in Indonesia

Abstract

This document is the final report of a study funded by the Institut Teknologi Bandung Research Grant 2016. We apply multivariate statistical approaches to build a clustering model of a geothermal hydrochemistry dataset. The work is complete: 416 records were compiled from various sources. The objective is to test whether a machine learning method, run with open source applications, can learn the type of geothermal system (volcanic or non-volcanic) from the geochemical composition of hot water samples used as a training dataset. If a workable model can be built, the next step is to predict the geothermal system of new samples.

We used R (with the RStudio IDE) and multivariate analysis packages to try to extract the somewhat "hidden" patterns in the dataset. The methods were principal component analysis, cluster analysis, and multiple regression. The code was developed from freely available tutorials. The code and dataset can be downloaded freely from the Open Science Framework server under a CC-BY license, in order to invite more public participation in improving this work.

Our results show that the water samples separate into two geothermal systems, volcanic and non-volcanic. However, some samples fall between the two systems. The data suggest that, although geology is the major control on the system, the chemical stability of the waters can show hybrid characteristics.

The outputs so far include blog posts, slide decks presented to Bappeda West Java, two proceedings papers (one for IIGW 2016 and one submitted as an abstract to IIGW 2017), and a draft paper that will be submitted to the ScienceOpen Research journal. The full report is also available on Authorea.

Keywords: multivariate statistics, geothermal, hydrochemistry

Authors:

• Dasapta Erwin Irawan,
• N. Rina Herdianita,
• Yuano Rezky,
• Anggita Agustin, and
• Ali Lukman

# Introduction

This project was set up to test the use of R for classifying hot water samples based on their hydrochemical properties. The samples come from all over Indonesia, but we tried to choose locations with distinct geological character. We want to find out whether this open source statistical package can bring out that distinctiveness and use it to classify the water samples.

We obtained the samples through direct field collection and from other geothermal studies. We are very thankful to the fellow researchers who allowed us to reuse their data within the scope of this research, and we commit to the citation agreements made when the datasets were transferred. The output of this research will also be published as scientific papers in several media. The first paper was published at the 2016 ITB Geothermal Workshop last month.

The core team of the research is:

• Dr. Rina Herdianita (Geology, ITB)
• Dr. Dasapta Erwin Irawan (Geology, ITB)
• Yuanno Rezky, ST., MT (EBTKE counterpart)
• Anggita Agustin (student, master, Groundwater Engineering, ITB)
• Ali Lukman (student, master, Geology, ITB)

Contributors:

• Prana Ugi Gio (Universitas Sumatra Utara) for keen support on statistical theory
• Fatkhurrokhman (student, undergraduate, Geology, ITB)
• Fithriyani Fauzziah (student, master, Groundwater Engineering, ITB)

Additional contributors will be added as we apply open science principles in this research. All materials and resources are available on the Open Science Framework server and GitHub.

# Objectives

The goal of this research is to reproduce the deterministic observation and classification made by field geologists using multivariate statistical approaches. We hope that the model built from the training dataset can then be used to analyze new datasets.

# Value of the research

If we can reproduce the deterministic approach, then we can propose a machine learning technique to classify hydrochemical samples of geothermal water. We expect this approach to classify large datasets robustly and faster.

# Dataset

## Value of dataset

The dataset from this research is valuable for the following reasons:

• It is a nationwide overview of geothermal data. Because it is aggregated from multiple sources with various settings, it also serves as a model of the aggregation process.
• It is a model of low-budget research in terms of software costs, because it uses open source tools (software, file formats, and data repository infrastructure).
• It endorses the open science concept through easy access to all resources, transparency of the process, and rapid dissemination through an open source repository and publication media.

## Data sources

We gathered 416 records from the following sources (see the data table and data descriptor):

• Geology ITB
• Center of Geological Research and Development
• Other sources and own testing for verification (specifically for dataset from West Java Province)

The preliminary result is a dataset with 416 rows and 66 columns. The hot water sample data consist of:

• major elements
• trace elements
• various geothermometry calculations

We identified each report that we had in the library, as follows:

• Krahmer, M. S. (1995)
• Dirasutisna, S. (2004)
• Suparman et al. (1999)
• Dedi K (2016)
• Euis (2016)
• Fauziyyah et al. (2016)
• Haq et al. (2016)
• Prabowo et al. (2016)
• Alfiandy et al. (2016)
• Wahyudi et al. (2016)
• Phuong et al. (2005)
• Emianto and Aribowo (2011)
• Nukman (2009)
• Nurohman and Aribowo (2012)
• Iqbal et al. (2016)

However, not all samples have all measurements, so we curated the dataset very carefully. Blank cells are labeled NA. The complete dataset is attached in the dataset folder. We also had difficulties collecting the coordinates of the sampling points.
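To illustrate the NA convention, the following minimal Python sketch counts the missing cells per column of a small CSV excerpt. It is only an illustration under stated assumptions: the project itself worked in R, and the column names and values below are hypothetical, not taken from the actual dataset.

```python
import csv
import io

# Hypothetical excerpt; the real dataset has 416 rows and 66 columns.
SAMPLE_CSV = """sample_id,temp_c,ph,cl_mg_l,sio2_mg_l
S01,65.2,6.8,120.5,NA
S02,NA,7.1,NA,88.0
S03,48.0,NA,15.3,102.4
"""


def count_missing(csv_text, na_label="NA"):
    """Return a dict mapping each column name to its number of missing cells.

    A cell counts as missing when it is absent, empty, or equal to na_label.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            if value is None or value.strip() in ("", na_label):
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE_CSV)
for column, n_missing in missing.items():
    print(f"{column}: {n_missing} missing")
```

A per-column count like this helps decide which variables are complete enough to enter a multivariate analysis and which need to be dropped or handled separately.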

## Data management

The dataset is the most important part of this research. It affects how we see the data and what it says about the controlling hydrogeological processes. Therefore, we also describe in detail how we set up the dataset.

### Data consolidation and treatment

The final dataset was consolidated from various sources. Each source contained different kinds of measurements, so the data matrices had various sizes. We therefore extended the combined matrix piece by piece until it incorporated all data matrices. The final dataset and variable descriptor can be found in the repository (please refer to data.csv and datadescriptor.csv). This was the hardest step, because the data came from many sources, mainly in spreadsheet and PDF formats, with inconsistent typing conventions, e.g. different decimal separators. We had to resolve these problems before we could use the data in our analysis.
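The consolidation step can be sketched as follows. This is a minimal Python illustration, not the procedure actually used in the project (which worked in R and spreadsheets). The function names, column names, and values are hypothetical, and the sketch assumes that a comma in a number is always a decimal separator, never a thousands separator.

```python
def parse_number(text):
    """Convert '65,2' or '65.2' to a float; blank or 'NA' becomes None.

    Assumption: a comma is only ever used as a decimal separator.
    """
    cleaned = text.strip()
    if cleaned in ("", "NA"):
        return None
    return float(cleaned.replace(",", "."))


def consolidate(tables):
    """Append tables (lists of dicts) into one table with the union of columns.

    Columns keep the order in which they are first seen. A measurement that
    a source did not report is filled with None (written as NA on export).
    """
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    merged = [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, merged


# Two hypothetical sources with different columns and decimal separators.
source_a = [{"sample_id": "A1",
             "temp_c": parse_number("65,2"),
             "cl_mg_l": parse_number("120,5")}]
source_b = [{"sample_id": "B1",
             "temp_c": parse_number("48.0"),
             "sio2_mg_l": parse_number("NA")}]

columns, merged = consolidate([source_a, source_b])
```

Taking the union of columns means that no measurement is discarded when sources differ, at the cost of more empty cells in the consolidated matrix.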

### Data repository

Every dataset is accessible from the OSF repository (the final link may change, and a DOI will be added as soon as it is registered) and can be searched by title, topic, and field location. An interactive map will be available soon using the QGIS Cloud system, although we still have to deal with the coordinate problem. We tried to place the points based on the geographical indications written in the original reports.

### Data use/sharing policy

All resources are released to the public under a CC-BY license. They may be freely copied, distributed, edited, remixed, and built upon, provided that you give acknowledgement as described below:

• Give proper acknowledgement. Publications, models, and data products that make use of these datasets must include proper acknowledgement, including citing the datasets in a similar way to citing a journal article (i.e. author, title, year of publication, edition or version, and URL or DOI access information). (See Why cite data.)
• Let us know how you will use the data. The dataset creators would appreciate hearing of any plans to use the dataset. Consider consultation or collaboration with dataset creators.

### Software

We used the following open source software for the analysis: