Fast Forensic Analysis in Computer Inspection using Clustering Methods
Authors: L. JAYASREE, A. NARAYANA RAO
Abstract: Hundreds of thousands of files are usually examined in computer forensic analysis. The data in those files consists of
unstructured text. Algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the
documents under analysis. In this paper we present an approach that applies document clustering algorithms to forensic
analysis of computers seized in police investigations. The proposed approach is illustrated by carrying out extensive
experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link,
and CSPA) applied to five real-world datasets. Experiments have been performed with different combinations of parameters,
resulting in 16 different instantiations of algorithms. The experiments show that the Average Link and Complete Link
algorithms provide the best results for our application domain. Two relative validity indexes were used to automatically
estimate the number of clusters.
Keywords: Clustering, Text Mining and Forensic Computing.
INTRODUCTION
It is estimated that the volume of data in the digital world
increased from 161 exabytes in 2006 to 988 exabytes, which is
about 18 times the amount of information present in all the
books ever written, and it continues to grow exponentially.
This large amount of data has a direct impact on Computer
Forensics, which can be broadly defined as the discipline
that combines elements of law and computer science to
collect and analyze data from computer systems. Clustering
algorithms are typically used for exploratory data analysis,
where there is little or no prior knowledge about the data.
Our datasets consist of unlabeled objects: the classes or
categories of documents that can be found are a priori
unknown. Even assuming that labeled datasets could be available
from previous analyses, there is almost no hope that the
same classes (possibly learned earlier by a classifier in a
supervised learning setting) would still be valid for the
upcoming data, obtained from other computers and
associated with different investigation processes: the new data
sample would come from a different population. In this
context, the use of clustering algorithms, which are capable
of finding latent patterns from text documents found in
seized computers, can enhance the analysis performed by
the expert examiner. Clustering algorithms group objects so that
those within a valid cluster are more similar to each other than
they are to objects belonging to a different cluster.
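
The idea of grouping similar documents and estimating the number of clusters automatically can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes plain term-count vectors, cosine distance, Average Link agglomerative clustering, and the silhouette width as the relative validity index. The source does not name the two indexes it used, and the function names and sample documents below are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term counts for one document (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def average_link(dist, k):
    """Merge the two clusters with the smallest average pairwise
    distance until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[p][q] for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def silhouette(dist, clusters):
    """Mean silhouette width of a partition; singletons score 0."""
    scores = []
    for ci, c in enumerate(clusters):
        for p in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of p's own cluster
            a = sum(dist[p][q] for q in c if q != p) / (len(c) - 1)
            # b: mean distance to the nearest other cluster
            b = min(sum(dist[p][q] for q in o) / len(o)
                    for oi, o in enumerate(clusters) if oi != ci)
            m = max(a, b)
            scores.append((b - a) / m if m > 0 else 0.0)
    return sum(scores) / len(scores)

def cluster_documents(docs, k_min=2, k_max=None):
    """Try every k in [k_min, k_max] and keep the partition with the
    highest silhouette width."""
    vecs = [tf_vector(d) for d in docs]
    n = len(vecs)
    dist = [[cosine_distance(vecs[i], vecs[j]) for j in range(n)]
            for i in range(n)]
    k_max = k_max or n - 1
    candidates = (average_link(dist, k) for k in range(k_min, k_max + 1))
    return max(candidates, key=lambda cl: silhouette(dist, cl))

# Hypothetical sample documents, for illustration only.
docs = [
    "invoice payment bank transfer",
    "bank transfer invoice overdue",
    "payment invoice bank account",
    "meeting schedule monday office",
    "office meeting schedule agenda",
]
clusters = cluster_documents(docs)
print(clusters)
```

On these five sample documents the three finance-related texts and the two meeting-related texts share no terms, so the silhouette width peaks at two clusters. A real forensic corpus would need proper tokenization, stop-word removal, and a more efficient clustering implementation than this cubic-time sketch.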