ABSTRACT

In this dissertation, we investigate the application of data mining algorithms to online criminal information. Since the start of the information era, the development of the World Wide Web has made people’s lives far more convenient. At the same time, however, criminals use the web for illegal activities such as drug smuggling and online fraud. Cryptomarkets and instant messaging software are the two most popular online platforms for criminal activity. Here, we use data mining algorithms to extract useful information from open source intelligence related to these two platforms.

Cryptomarkets (or darknet markets) are commercial hidden-service websites that operate on The Onion Router (Tor) anonymity network, and they have grown rapidly in recent years. In this dissertation, we discover interesting characteristics of Bitcoin transaction patterns in cryptomarkets. We present a method to identify vendors’ Bitcoin addresses by matching vendors’ feedback reviews with Bitcoin transactions in the public ledger. We further propose a cost-effective algorithm that accelerates both the matching step and the address-selection step. Comprehensive experimental results demonstrate the effectiveness and efficiency of the proposed method.

Instant messaging (IM) software is another base for such criminal activities. Users of IM applications can easily hide their identities while interacting with strangers online. In this dissertation, we propose an effective model to discover hidden networks of influence between the members of a group chat. By converting the whole chat history into sequential events, we model message sequences as a multi-dimensional Hawkes process and learn the Granger causality between individuals. We learn the influence graph by applying an expectation-maximization (EM) algorithm to our text-biased multi-dimensional Hawkes process. Users of IM software often maintain multiple accounts, so we also propose a model to cluster the accounts that belong to the same user.

CHAPTER ONE

INTRODUCTION

Illegal online sales have grown exponentially [1]. Vendors can easily sell illicit products through cryptomarkets or encrypted IM software such as Telegram. On the darknet, the privacy of participants in illicit online transactions is protected by both Tor and cryptocurrency. The darknet uses The Onion Router (Tor) network to hide users’ IP addresses from the internet service provider. Darknet markets choose cryptocurrencies as the payment currency mainly because they are anonymous. Unlike traditional currencies, cryptocurrencies such as Bitcoin are decentralized [2]: no central authority is responsible for issuing them, and online transfers do not need a trusted third party such as a bank [3,4]. Both buyers and vendors can trade anonymously through cryptocurrencies [5]. IM software such as Telegram provides encryption services for its users, who can easily hide their identities while interacting with strangers online. The privacy provided by the darknet and IM software makes it hard for law enforcement to trace illicit business online. This motivates our research, which extracts useful information from public data on these platforms through data mining algorithms in order to combat cybercrime.

We first investigate the characteristics of Bitcoin transactions behind cryptomarkets in part 2. Darknet markets use decentralized cryptocurrencies as the payment currency. We conduct transactions on different types of markets to discover how cryptomarkets manage their funds. In part 3, we further propose a method to identify vendors’ Bitcoin addresses by matching vendors’ feedback reviews with Bitcoin transactions in the public ledger. Each feedback review is matched to a Bitcoin transaction based on the timestamp and the value transferred in that transaction. Therefore, a Bitcoin address whose transaction history matches more of a vendor’s reviews is more likely to belong to that vendor. In part 4, we propose a model to discover hidden influence networks between the members of a group chat. We model message sequences as a Temporal-Textual Multi-Dimensional Hawkes process and learn the Granger causality between individuals. In part 5, we propose a model to cluster the accounts that belong to the same user in IM software. We design a 24-7 CNN to learn representations of timestamp lists. By leveraging the posting and time patterns of accounts, we learn an embedding for each account and train a binary classifier to identify accounts that belong to the same user.

1.1 Background

In this section, we provide background information on our topic, covering the darknet, Bitcoin, and the Hawkes process in turn.

1.1.1 Darknet

A darknet market is a commercial website on the dark web that operates via darknets such as Tor. Such markets function primarily as black markets, selling or brokering transactions involving drugs, weapons, counterfeit currency, stolen credit card details, forged documents, unlicensed pharmaceuticals, steroids, and other illicit goods, as well as legal products. Tor is a network of virtual tunnels that lets users improve their privacy and security on the Internet. Tor works by sending a user’s traffic through three random relays in the Tor network. The last relay in the circuit (the “exit relay”) then sends the traffic out onto the public Internet. Tor also provides hidden services (also known as onion services), which let operators hide their locations and identities while offering web publishing services. Vendors and buyers can browse the darknet through the Tor Browser without leaking their IP addresses to internet service providers.

1.1.2 Bitcoin

Bitcoin is the first decentralized cryptocurrency (also known as digital currency or electronic cash). It operates on a peer-to-peer network without intermediaries, central banks, or administrators. Transactions are verified by network nodes via cryptography and recorded in a public distributed ledger called a blockchain. Users can transfer Bitcoin pseudonymously because funds are tied to Bitcoin addresses rather than to real-world entities, and the owners of Bitcoin addresses are not explicitly identified. We focus on Bitcoin in this dissertation because it is the most popular cryptocurrency and is accepted by all darknet markets [6]. Using blockchain and distributed ledger technology, the Bitcoin system promises great transparency and improved trust across transaction value chains [7,8]. Because no third party guarantees a transaction, the Bitcoin system publishes its entire transaction history: the Bitcoin ledger stores every transaction record and is public to all Bitcoin users. A user wallet can hold multiple Bitcoin addresses, which serve as the “pseudonymous identities” of that user in the public ledger.

1.1.3 Hawkes Process

The one-dimensional Hawkes process is a type of temporal point process that can model self-exciting event sequences. The intuition behind it is that previous events may trigger the occurrence of future events. A temporal point process can be represented as a counting process, N = {N(t) | t ∈ [0,T]}, where N(t) records the number of events before time t. The intensity function λ(t) = E[dN(t) | H]/dt represents the expected instantaneous rate of events given the event history H. Due to self-excitation, the intensity function of a Hawkes process is conditioned on the event history. Given a sequence of n events at times {t₁, t₂, t₃, …, t_n}, the intensity function of the Hawkes process is formally

\lambda(t) = \mu + \sum_{j : t_j < t} \alpha \cdot g(t - t_j) \qquad (1.1)

where µ is the exogenous base intensity, which is independent of history, and the second term on the right-hand side is the impact of previous events. t_j is the occurrence time of a previous event. g(∆t) is the triggering kernel, which decays with the time difference: the earlier the previous event, the less impact it has on the current event. α is a coefficient measuring the amount of influence that previous events have on the current event. Here we use an exponentially decaying function to capture the influence.

g(t_i - t_j) = \beta e^{-\beta (t_i - t_j)} \qquad (1.2)
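As a minimal sketch of Equations (1.1) and (1.2), the intensity can be evaluated directly from a list of past event times. The function names, event times, and parameter values below are hypothetical and chosen only for illustration; they are not taken from the dissertation's experiments.

```python
import math


def exp_kernel(dt, beta):
    """Triggering kernel g(dt) = beta * exp(-beta * dt), as in Eq. (1.2)."""
    return beta * math.exp(-beta * dt)


def hawkes_intensity(t, history, mu, alpha, beta):
    """Intensity of Eq. (1.1): mu plus alpha * g(t - t_j) for every event before t."""
    return mu + sum(alpha * exp_kernel(t - t_j, beta) for t_j in history if t_j < t)


# Hypothetical event times and parameters (illustrative values only).
events = [1.0, 1.5, 4.0]
mu, alpha, beta = 0.2, 0.5, 1.0

baseline = hawkes_intensity(0.5, events, mu, alpha, beta)     # before any event
after_burst = hawkes_intensity(1.6, events, mu, alpha, beta)  # just after two events
later = hawkes_intensity(3.9, events, mu, alpha, beta)        # the burst has decayed

print(baseline, after_burst, later)
```

Before any event the intensity equals the base rate µ; it jumps after the two closely spaced events and then decays back toward µ, which is the self-exciting behavior described above.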

1.2 Characteristics of Bitcoin Transactions on Cryptomarkets

The darknet is a portion of the Internet that purposefully protects the identities and privacy of both web servers and clients. The Onion Router (Tor) is the most popular instance of a darknet and also the most popular anonymity network. A cryptomarket (or darknet market) is a commercial website operating on the darknet. Specifically, in Tor, a cryptomarket is a hidden-service website with a “.onion” address. Most products sold in cryptomarkets are illicit; popular examples include drugs, malware, and stolen credit cards. After the demise of Silk Road, the first cryptomarket, in 2013, new cryptomarkets proliferated. Bitcoin is accepted in all cryptomarkets. As the first decentralized cryptocurrency, Bitcoin operates on a peer-to-peer network without intermediaries, central banks, or administrators. In this section, we systematically study the Bitcoin privacy vulnerabilities that exist in cryptomarkets. We identify and categorize patterns of Bitcoin transactions in cryptomarkets. These observations are then used to discuss the possibility of re-identifying Bitcoin addresses related to cryptomarkets. The conclusions of this study can help design better Bitcoin payment systems and strengthen privacy protection. They can also help law enforcement understand the activities in cryptomarkets.

1.3 Identifying Darknet Vendor Wallets by Matching Feedback Reviews with Bitcoin Transactions

In part 3, we aim to find the Bitcoin addresses that vendors use in darknet markets by matching feedback reviews with Bitcoin transactions. To narrow the scope of the problem, we choose Bitcoin and Wall Street Market as the case study. Each feedback review is matched to a Bitcoin transaction based on the timestamp and the value transferred in that transaction. Specifically, we decompose the problem into two sub-problems: the Bounding Box Matching Problem and the Maximum Review Coverage Problem. In the Bounding Box Matching Problem, we construct a bounding box for each review and find the Bitcoin transactions that fall inside it. We build a K-D tree from the massive Bitcoin transaction data to support fast range searches within a bounding box. In the Maximum Review Coverage Problem, we prove that the problem is NP-hard. We exploit the submodularity of the objective function and design a greedy algorithm with an approximation ratio of (1 − 1/e) that finds a set of addresses covering a near-optimal number of the product reviews received by one vendor. Our method can discover the number of addresses used by one vendor, realizing a one-to-many mapping. We further develop an algorithm that effectively accelerates both the matching step and the greedy algorithm.
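The two sub-problems can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dissertation's implementation: the bounding box is assumed to be symmetric around each review's timestamp and value, a brute-force scan stands in for the K-D tree range search, and the budget `k`, the tolerances, and all data values are hypothetical.

```python
def match_reviews(reviews, transactions, time_window, value_tol):
    """Map each address to the set of review ids matched by its transactions.

    reviews: list of (review_id, timestamp, value)
    transactions: list of (address, timestamp, value)
    Assumed box: |tx_time - rev_time| <= time_window and
                 |tx_value - rev_value| <= value_tol.
    """
    covered = {}
    for address, tx_time, tx_value in transactions:
        for review_id, rev_time, rev_value in reviews:
            in_time = abs(tx_time - rev_time) <= time_window
            in_value = abs(tx_value - rev_value) <= value_tol
            if in_time and in_value:
                covered.setdefault(address, set()).add(review_id)
    return covered


def greedy_cover(covered, k):
    """Pick up to k addresses, each time taking the largest marginal gain in reviews."""
    chosen, seen = [], set()
    for _ in range(k):
        best = max(covered, key=lambda a: len(covered[a] - seen), default=None)
        if best is None or not (covered[best] - seen):
            break  # no remaining address covers a new review
        chosen.append(best)
        seen |= covered[best]
    return chosen, seen


# Hypothetical reviews and transactions (timestamps in seconds, values in BTC).
reviews = [("r1", 100, 0.50), ("r2", 200, 0.10), ("r3", 300, 0.25), ("r4", 400, 0.75)]
transactions = [
    ("addrA", 105, 0.50), ("addrA", 198, 0.10), ("addrA", 305, 0.25),
    ("addrB", 402, 0.75), ("addrB", 101, 0.50),
    ("addrC", 250, 0.10),  # right value, but outside every time window
]

covered = match_reviews(reviews, transactions, time_window=10, value_tol=0.01)
chosen, seen = greedy_cover(covered, k=2)
print(chosen, sorted(seen))
```

The greedy step relies only on marginal gains, which is what the submodularity of the coverage objective makes effective; on real ledger data the inner loop over reviews would be replaced by a range query on a spatial index.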

Our contributions are as follows:

We propose the problem of identifying vendors’ Bitcoin addresses by matching public Bitcoin transactions to vendors’ feedback reviews in darknet markets. This problem is important because of two potential applications. First, it helps law enforcement trace illicit transactions. Second, it reveals a privacy weakness of cryptocurrencies and can therefore inform the design of new ones.
We decompose the complicated problem into two sub-problems and provide efficient algorithms for them. We further propose a Cost-Effective Addresses Searching (CEAS) algorithm to accelerate the whole process, which reduces the number of matching calculations by about 60% in our experiments.
We extensively evaluate our methods on both real and synthetic data and demonstrate the effectiveness and accuracy of our method.

1.4 Learning Infectivity Graph in Chat Group via Temporal Textual Multi-dimensional Hawkes Process

Instant messaging (IM) applications provide a convenient way for people to communicate and exchange confidential information. Users of IM applications can easily hide their identities while interacting with strangers online. To protect users’ privacy, some IM developers provide encryption services for their customers. However, this convenient, privacy-protecting software has been exploited by criminals for illegal activities such as drug smuggling, online fraud, and even anti-social activities [9].

In part 4, we propose a framework that extracts a weighted directed infectivity graph by applying data mining and natural language processing techniques to the chat log of a group. The chat history is a sequence of messages, where each message records the time at which it was posted, the member who posted it, and its text content. The timestamps make the chat history time-series data, which can be viewed as event sequences containing multiple event types and modeled via multi-dimensional point processes. Each posted message can be viewed as an event with a timestamp, and the identity of the poster represents the corresponding event type. To construct a Granger causality graph over event types (members of the group), we model the data with a special class of point processes called Hawkes processes. The Hawkes process is a type of temporal point process that is widely used to model self-exciting event sequences such as earthquakes. When there are multiple event types, the Hawkes process can describe mutually triggering patterns among them. We relate the influence between users to the likelihood of replies among members. Natural language processing techniques are used to find dialogues in group chat logs. The impact functions of the Hawkes process then capture the influence graph.
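To make the mapping from a chat log to a multi-dimensional point process concrete, the sketch below turns messages into (timestamp, sender) events and evaluates a per-member intensity. It uses the standard multi-dimensional Hawkes form with an exponential kernel and omits the text-biased component entirely. The infectivity values in `alpha`, the base rates in `mu`, the member names, and the messages are all hypothetical; in the framework described above such coefficients would be learned rather than set by hand.

```python
import math


def chat_to_events(messages):
    """Turn (timestamp, sender, text) messages into time-sorted (timestamp, sender) events."""
    return sorted((ts, sender) for ts, sender, _text in messages)


def user_intensity(user, t, events, mu, alpha, beta):
    """lambda_user(t) = mu[user] + sum of alpha[user][sender] * g(t - t_j) over earlier events.

    alpha[user][sender] is the assumed influence of `sender`'s messages on `user`.
    """
    total = mu[user]
    for t_j, sender in events:
        if t_j < t:
            total += alpha[user][sender] * beta * math.exp(-beta * (t - t_j))
    return total


# Hypothetical chat log: (timestamp in minutes, sender, text).
messages = [
    (10.5, "bob", "hi alice"),
    (10.0, "alice", "hello"),
    (30.0, "alice", "ok"),
]
events = chat_to_events(messages)

mu = {"alice": 0.1, "bob": 0.1}
alpha = {
    "alice": {"alice": 0.1, "bob": 0.0},
    "bob": {"alice": 0.8, "bob": 0.1},  # alice's messages strongly excite bob
}

alice_rate = user_intensity("alice", 10.2, events, mu, alpha, beta=1.0)
bob_rate = user_intensity("bob", 10.2, events, mu, alpha, beta=1.0)
print(alice_rate, bob_rate)
```

Reading `alpha` as a weighted adjacency matrix gives the directed influence graph: a large `alpha["bob"]["alice"]` means a message from alice raises the rate at which bob posts shortly afterward.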

Our contributions are as follows:

We propose the problem of detecting the influence graph of a group chat. Solving it helps law enforcement analyze the organizational structure and the key persons behind criminal activities in group chats.
We present a modeling framework based on a text-biased marked multi-dimensional Hawkes process. The Hawkes process can extract mutually triggering patterns among the individuals in a group. We further apply natural language processing techniques to identify conversations in the chat log and update the impact functions of the Hawkes process with the reply embedding. By applying an EM algorithm to the model, we learn the influence graph over individuals from chat logs.

1.5 Clustering of Accounts in Online Messaging Software through Attributed Heterogeneous Information Networks

In this work, we propose a model that learns a representation of each account in a group chat through attributed heterogeneous information networks.

The aim of this work is to cluster accounts based on the time pattern and the text of their posts. The intuition behind our method is that if a vendor has several accounts, he or she will post similar content with a similar time pattern across these accounts.
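The time-pattern intuition can be illustrated by binning each account's post timestamps into a weekday-by-hour grid and comparing the grids. This is only a sketch: the 7 × 24 layout is an assumption about how a timestamp list might be arranged, and plain cosine similarity is used here as a simple stand-in for a learned embedding. The account names and timestamps are hypothetical.

```python
import math
from datetime import datetime, timezone


def weekly_activity_grid(timestamps):
    """Count posts per (weekday, hour) cell; rows are Monday..Sunday, columns are hours 0..23."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        grid[moment.weekday()][moment.hour] += 1
    return grid


def cosine_similarity(grid_a, grid_b):
    """Cosine similarity between two flattened activity grids (0.0 if either is empty)."""
    flat_a = [v for row in grid_a for v in row]
    flat_b = [v for row in grid_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return dot / norm if norm else 0.0


def utc(year, month, day, hour, minute=0):
    """Helper: Unix timestamp for a UTC calendar time."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc).timestamp()


# Hypothetical accounts: A and B post on Monday mornings and Tuesday evenings,
# while C posts early on Saturday.
account_a = [utc(2024, 1, 1, 9), utc(2024, 1, 1, 9, 30), utc(2024, 1, 2, 21)]
account_b = [utc(2024, 1, 8, 9, 15), utc(2024, 1, 9, 21, 10)]
account_c = [utc(2024, 1, 6, 3)]

grid_a = weekly_activity_grid(account_a)
grid_b = weekly_activity_grid(account_b)
grid_c = weekly_activity_grid(account_c)

sim_ab = cosine_similarity(grid_a, grid_b)
sim_ac = cosine_similarity(grid_a, grid_c)
print(sim_ab, sim_ac)
```

Accounts A and B score close to 1 because they are active in the same weekly slots, while A and C share no slot at all. A grid of this shape is also a natural two-dimensional input for a convolutional model.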

Our contributions are as follows:

We present a model that learns representations of timestamp series by training a CNN auto-encoder. The learned embedding can be used to measure the similarity of two timestamp lists effectively.
We build an Attributed Heterogeneous Information Network (AHIN) that contains four types of nodes: user, account, post, and product. We train a model to learn the embedding of each node. To effectively measure the relationships between nodes in the constructed AHIN, we sample paths from it through weighted random walks and propose a new network embedding model, User2Vector, to learn the hidden representation of each user. We further train a binary classifier on pairs of user representations learned by User2Vector.
We extensively evaluate our methods on both real and synthetic data and demonstrate the accuracy of our method.
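The path-sampling step mentioned in the contributions can be sketched as a weighted random walk over a small typed graph. The graph, its edge weights, the `type:id` node naming, and the walk length are hypothetical, and user nodes are left out of this toy example for brevity. The sketch covers only the sampling of paths, not the User2Vector embedding model itself.

```python
import random


def weighted_random_walk(graph, start, length, rng):
    """Walk up to `length` steps from `start`; each step picks a neighbor
    with probability proportional to the edge weight."""
    path = [start]
    for _ in range(length):
        neighbors = graph.get(path[-1], {})
        if not neighbors:
            break  # dead end: stop the walk early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        path.append(rng.choices(nodes, weights=weights, k=1)[0])
    return path


# Hypothetical AHIN fragment: adjacency dict of {node: {neighbor: weight}}.
graph = {
    "account:a1": {"post:p1": 1.0, "post:p2": 2.0},
    "account:a2": {"post:p3": 1.0},
    "post:p1": {"account:a1": 1.0, "product:x": 1.0},
    "post:p2": {"account:a1": 2.0, "product:y": 1.0},
    "post:p3": {"account:a2": 1.0, "product:x": 1.0},
    "product:x": {"post:p1": 1.0, "post:p3": 1.0},
    "product:y": {"post:p2": 1.0},
}

rng = random.Random(0)  # fixed seed so the sampled walks are reproducible
walks = [weighted_random_walk(graph, "account:a1", 4, rng) for _ in range(3)]
for walk in walks:
    print(" -> ".join(walk))
```

Walks that pass from `account:a1` through a shared product to `account:a2` are the kind of co-occurrence signal a skip-gram style embedding model could pick up when it is trained on the sampled paths.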

Statement of the Problem

Cybercrime is on the rise, and traditional approaches to combating it are proving insufficient. Criminals use sophisticated techniques to launch attacks, and law enforcement agencies often struggle to keep pace. Open source intelligence (OSINT) provides a wealth of information that can help identify potential threats, but the volume and complexity of the data make manual analysis ineffective. Data mining algorithms have the potential to automate the process of sifting through OSINT data, identifying patterns, and predicting cyber threats. However, there is a need to develop and implement effective data mining solutions tailored to the specific challenges of cybercrime prevention.

Objectives of the Study

The primary objectives of this research project are as follows:

To develop a comprehensive understanding of the current landscape of cybercrime and the role of OSINT in combating it.
To identify the specific challenges and limitations in using OSINT data for cybercrime prevention.
To explore and select appropriate data mining algorithms and techniques for analyzing OSINT data.
To design and implement a data mining system capable of processing and analyzing OSINT data to detect and predict cyber threats.
To evaluate the performance and effectiveness of the implemented data mining algorithms in identifying cyber threats and vulnerabilities.
To provide recommendations and guidelines for law enforcement agencies on the integration of data mining into their cybercrime prevention strategies.

Significance of the Study

This research project holds significant importance in the field of cybersecurity and law enforcement for several reasons:

Enhanced Cybercrime Prevention: The successful implementation of data mining algorithms on OSINT can provide law enforcement agencies with a powerful tool to proactively identify and mitigate cyber threats, thereby enhancing cybersecurity.

Resource Optimization: By automating the analysis of OSINT data, this research can contribute to resource optimization within law enforcement agencies, allowing them to allocate their personnel and resources more efficiently.

Knowledge Advancement: The project will contribute to the academic and practical understanding of how data mining techniques can be applied to combat cybercrime, potentially leading to advancements in the field.

Public Safety: As cybercrime poses a growing threat to individuals and organizations, the outcomes of this research can ultimately contribute to increased public safety in the digital age.

1.6 Proposed Dissertation Organization

In this dissertation proposal, we investigate how to efficiently extract useful information from the darknet and IM software. In part 2, we describe our purchase experiments in cryptomarkets and summarize the Bitcoin transaction mechanisms behind cryptomarkets. In part 3, we present a greedy method to identify vendors’ Bitcoin addresses by matching vendors’ feedback reviews with Bitcoin transactions in the public ledger. In part 4, we present our work that models group chats with the Hawkes process to discover hidden networks of influence between members. In part 5, we propose a model that clusters the accounts of the same user by learning embeddings of these accounts. In part 6, we conclude this dissertation.

AN IMPLEMENTATION OF DATA MINING ALGORITHMS ON OPEN SOURCE INTELLIGENCE TO COMBAT CYBER CRIME

Author: SPROJECT NG
