Development

Data

The data set used for this research was provided by Cuneyt Gurcan Akcora, Yulia Gel and Murat Kantarcioglu and is currently available at the UCI Machine Learning Repository \cite{sets}, hosted by the University of California Irvine \cite{re3dataorg}. This data set contains a total of 2916697 rows with 10 columns each, collected from January 2009 to December 2018. 
Each row represents a transaction from the Bitcoin Blockchain and contains the following set of attributes: addressyeardaylengthweightcountloopedneighborsincome and label. A full description of each of these features can be found in \cite{set}. For the purpose of this research only the meaning of address and label are of importance. 
The label attribute is used to determine the known family of ransomware each transaction relates to, if any. There are 28 possible families of ransomware within the dataset. The entries that are not related to any of those families have been labeled as white by Akcora et al. It is imperative to state that this does not necessarily mean such transactions are known to be completely unrelated to any sort of ransomware, as there might exist families unbeknownst to the analysts that put the data set together.
The address feature represents the Bitcoin address and is not used for the purposed of this research. The reasoning behind this decision revolves around two issues. The first issue is this attribute being a string value, therefore not quantifiable. The second, although still related to the previous, pertains how such values are generated and what kind of meaning they would bring to a clustering process. The K-Means algorithm is based on the positioning of each sample in an N-dimension Euclidean Space, where closer samples tend to have a higher similarity. That is, the position in each axis has a meaning. Bitcoin addresses do not hold any sort of sequential meaning since they are randomly generated hashes.

Tools

I have developed a script using the Python language, and SciKi-Learn and Pandas packages. Python is a general purpose programming language \cite{pub:5007} \cite{programming} that has gained notoriety in the data science field \cite{kdnuggets} \cite{results}; SciKit-Learn is tool set for machine learning \cite{Geron2017Handson}; and Pandas is a library for data analysis \cite{articleb}. For this particular research, I used SciKit-Learn for generating the K-Means model and feeding it with the data set again to retrieve the relationship between each transaction and the expected cluster. Then, Pandas is used to manage the data set, cross-reference the known labels and the predicted clusters, and evaluating the level of similarity. The source code has been published under MIT License and can be found in my repository on GitHub \cite{thiagorcdlbitcoin_heist}: https://github.com/thiagorcdl/bitcoin_heist.
The script starts by preparing the data set: loads the CSV file containing the data, converts the type from strings to floating point representation when appropriate and the address column, as previously explained. Then a copy of the data is created without the label to avoid taking it into consideration, and the K-Means model for k clusters is generated using the remaining 8 attributes: yeardaylengthweightcountloopedneighbors, and income. The result of this model is a set of k centroids, which represent the center of each cluster in an 8-dimension Euclidean Space, and are written to a CSV file. Then, the model is used to predict the cluster for each of the entries in the data set and the result is then used, along with the previously known labels, to generate a Confusion Matrix, which is written to a separate CSV file.

Methodology

The effectiveness of the classification using K-Means is calculated by retrieving the Rand Index, which measures the similarity between how the known labels and how the predicted clusters are plotted \cite{Rand_1971} in the 8-dimension space. Finally the Adjusted Rand Index is also calculated, taking into consideration the similarity to random model clustering \cite{articlec}.
Considering the purpose of this research is to find out whether it is possible to segment the transactions in such way to allow the easy categorization of whether they represent a ransomware-related transaction or not, I decided to feed the script with two different values for k, that is to create two different sets of clusters. The first one is to create 29 cluster, ideally one for each family of ransomware plus the white ones. Since clustering is an unsupervised classification algorithm, the k clusters do not have a pre-established meaning. In other words, unless there was a perfect segregation of the 29 possible labels, it is not necessarily possibly to know which cluster relates to which label. Thus, the second attempt was to simply create two cluster, which a perfect fit could separate the transactions that are known to be malicious from those that have no evidence.