Strand: Fast Sequence Comparison using MapReduce and Locality Sensitive Hashing

Figure 1: High-level Diagram of the Strand MapReduce Style Pipeline.

Authors

This research was originally published at ACM BCB 2014.  The publication can be downloaded here.

ABSTRACT

The Super Threaded Reference-Free Alignment-Free N-sequence Decoder (Strand) is a highly parallel technique for the learning and classification of gene sequence data into any number of associated categories or gene sequence taxonomies.  Current methods, including the state-of-the-art sequence classification method RDP, balance accuracy against performance by using a shorter word length. Strand, in contrast, uses a much longer word length, and does so efficiently by implementing a divide-and-conquer algorithm leveraging MapReduce style processing and locality sensitive hashing. Strand is able to learn gene sequence taxonomies and classify new sequences approximately 20 times faster than the RDP classifier while still achieving comparable accuracy results. This paper compares the accuracy and performance characteristics of Strand against RDP using 16S rRNA sequence data from the RDP training dataset and the Greengenes sequence repository.

1. INTRODUCTION

Many popular gene sequence analysis tools are based on the idea of pairwise sequence alignment [16, 19]. Sequence alignment finds the least costly transformation of one sequence into another using insertions, deletions, and replacements. This method is related to the well-known distance metric called Levenshtein or edit distance. However, finding the optimal pairwise alignment is computationally expensive and requires dynamic programming.

Gene sequence words are sub-sequences of a given length. In addition to words, they are often also referred to as k-mers or n-grams, where k and n represent the word length. Words are extracted from individual gene sequences and used for similarity estimations between two or more gene sequences [21]. Methods like BLAST [1] were developed for searching large sequence databases. Such methods search for seed words first and then expand matches. These so-called alignment-free methods [21] are based on gene sequence word counts and have become increasingly popular since the computationally expensive sequence alignment method is avoided. One of the most successful word-based methods is the RDP classifier [22], a naive Bayesian classifier widely used for organism classification based on 16S rRNA gene sequence data.

Numerous methods for the extraction, retention, and matching of word collections from sequence data have been studied. Some of these methods include: 12-mer collections with the compression of 4 nucleotides per byte using byte-wise searching [1], sorting of k-mer collections for the optimized processing of shorter matches within similar sequences [11], modification of the edit distance calculation to include only removals (maximal matches) in order to perform distance calculations in linear time [20], and the use of locality sensitive hashing for inexact matching of longer k-mers [5].

This research combines the following three primary contributions in a novel way to achieve the results presented:

  1. Jaccard similarity is used as an approximation for edit distance when determining the similarities and differences between gene sequence data.
  2. A form of locality sensitive hashing called minhashing is used to rapidly process much longer word lengths for enhanced accuracy. Minhashing allows us to estimate Jaccard similarity without computing and storing information for all possible words extracted from a gene sequence. Instead, we use the intersection of the minhash signatures produced during the minhashing process to quickly calculate an accurate approximation of the Jaccard similarity between sequences and known taxonomy categories.
  3. A MapReduce style parallel pipeline is used to simultaneously identify unique gene sequence words, minhash each word generating minhash signatures, and intersect minhash signatures to estimate Jaccard similarity for highly accurate and efficient identification of gene sequence taxonomies.

Buhler [5] previously used locality sensitive hashing to allow for inexact matches between longer words of a predefined length. We use locality sensitive hashing in a very different way: as a selection strategy for performing exact word matching when the number of possible words becomes much too large to store. For example, with an alphabet of 4 symbols, the number of unique words of length 60 is 4^60, which is already more than 10^36 distinct words. The RDP classifier utilizes a fixed word length of only 8 bases to perform its taxonomy classification processing, making the total possible number of unique words (i.e., features for the classifier) only 4^8 = 65,536 [22].

Strand is able to very rapidly classify sequences while still taking advantage of the increased accuracy provided by extracting longer words. Using the much larger possible feature space provided by a longer word length, combined with locality sensitive hashing to reduce memory requirements, Strand achieves classification accuracy similar to RDP in processing times which are orders of magnitude faster. All stages of Strand processing are highly parallelized, concurrently mapping all identified words from a gene sequence while simultaneously reducing mapped words into minhash signatures. The unique relationship between the Jaccard similarity of sets and locality sensitive hashing [17] allows minhashing to occur during learning, storing only a predetermined number of minimum hash values in place of all words extracted from the gene sequences in the learning set. This process reduces the amount of memory used during learning and classification to a manageable amount.

2. BACKGROUND

This paper combines the three concepts of longer length word extraction, minhashing, and multicore MapReduce style processing in a novel way that produces superior gene sequence classification results. The following sections briefly describe the background material relevant to this research.

2.1 Word Extraction

The general concept of k-mers or words was originally defined as n-grams in 1948 in an information theoretic context [18] as a sub-sequence of n consecutive symbols. We will use the terms words or k-mers in this paper to refer to n-grams created from a gene sequence. Over the past twenty years, numerous methods utilizing words for gene sequence comparison and classification have been presented [21]. These methods are typically much faster than alignment-based methods and are often called alignment-free methods. The most common method for word extraction uses a sliding window of a fixed size. Once the word length k is defined, the sliding window moves from left to right across the gene sequence data, producing each word by capturing k consecutive bases from the sequence.
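
As an illustration, the following minimal C# sketch implements sliding-window extraction; the class and method names are ours, not part of any published tool:

```csharp
using System.Collections.Generic;

static class WordExtraction
{
    // Slide a fixed-size window across the sequence, emitting each k-mer.
    public static IEnumerable<string> ExtractWords(string sequence, int k)
    {
        for (int i = 0; i + k <= sequence.Length; i++)
            yield return sequence.Substring(i, k);
    }
}
```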

Strand performs word extraction using lock-free data structures to identify unique gene sequence words. This method is similar to other highly parallel word counting tools such as Jellyfish [14]. Traditionally, computationally expensive lock objects are used in parallel programs to synchronize thread-level access to a shared resource. Each thread must either acquire the lock object or block until it becomes available prior to entering critical sections of code. Lock-free data structures avoid the overhead associated with locking by making use of low-level atomic read/write transactions and other lock-free programming techniques.
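
The C# sketch below illustrates the general idea using .NET's ConcurrentDictionary, which lets many threads record words without caller-managed lock objects (internally it relies on fine-grained synchronization and atomic updates). It is a simplified illustration of the technique, not Strand's actual data structure:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelWordCounts
{
    // Many threads add words concurrently; AddOrUpdate performs the
    // insert-or-increment without any caller-managed lock objects.
    public static ConcurrentDictionary<string, int> Count(
        IEnumerable<string> sequences, int k)
    {
        var counts = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(sequences, seq =>
        {
            for (int i = 0; i + k <= seq.Length; i++)
                counts.AddOrUpdate(seq.Substring(i, k), 1, (_, c) => c + 1);
        });
        return counts;
    }
}
```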

2.2 Minhashing

In word-based sequence comparison, sequences are often considered to be sets of words. A form of locality sensitive hashing called minhashing uses a family of random hash functions to generate a minhash signature for each set. Each hash function in the family of n hash functions implements a unique permutation function, imposing an order on the set to be minhashed. Choosing the element with the minimal hash value from each of the n permutations of the set results in a signature of n elements. Typically the original set is several orders of magnitude larger than n, resulting in a significant reduction of the memory required for storage. From these signatures an estimate of the Jaccard similarity between two sets can be calculated [2, 17]. Minhashing has been successfully applied in numerous applications including estimating similarity between images [3] and documents [2], document clustering on the internet [4], image retrieval [8], detecting video copies [6], and relevant news recommendations [13].

In this paper we apply minhashing to estimate the similarity between sequences which have been transformed into very large sets of words.
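
As a hedged illustration, the C# sketch below computes a minhash signature with a simple linear hash family h_i(x) = (a_i·x + b_i) mod p and estimates Jaccard similarity as the fraction of agreeing signature positions; the hash family, coefficients, and word hashing are assumptions made for the example, not Strand's exact functions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MinHash
{
    const long P = 4294967311L; // a prime larger than 2^32

    // One signature entry per hash function h_i(x) = (a_i * x + b_i) mod P.
    // Draw a_i from [1, 2^31) so a_i * x cannot overflow a long.
    public static long[] Signature(IEnumerable<string> words, long[] a, long[] b)
    {
        var sig = Enumerable.Repeat(long.MaxValue, a.Length).ToArray();
        foreach (var word in words)
        {
            long x = (uint)word.GetHashCode(); // word -> 32-bit integer
            for (int i = 0; i < a.Length; i++)
            {
                long h = (a[i] * x + b[i]) % P;
                if (h < sig[i]) sig[i] = h;    // keep the minimum per function
            }
        }
        return sig;
    }

    // The fraction of agreeing positions estimates the Jaccard index.
    public static double EstimateJaccard(long[] s1, long[] s2) =>
        s1.Zip(s2, (x, y) => x == y ? 1.0 : 0.0).Average();
}
```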

2.3 MapReduce Style Processing

MapReduce style programs break algorithms down into map and reduce steps which represent independent units of work that can be executed using parallel processing [7, 10]. Initially, input data is split into many pieces and provided to multiple instances of the mapping functions executing in parallel. The result of mapping is a key-value pair including an aggregation key and its associated value or values. The key-value pairs are redistributed using the aggregation key and then processed in parallel by multiple instances of the reduce function producing an intermediary or final result.

MapReduce is highly scalable and has been used by large companies such as Google and Yahoo! to successfully manage rapid growth and massive data processing tasks [15]. Over the past few years, MapReduce processing has been proposed for use in many areas including: analyzing gene sequencing data [15], machine learning on multiple cores [12], and highly fault tolerant data processing systems [7, 9].

Strand uses MapReduce style processing to quickly map gene sequence data into words while simultaneously reducing the mapped words from multiple sequences into their appropriate corresponding minhash signatures.
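
For readers unfamiliar with the pattern, the following C# sketch shows the canonical map/shuffle/reduce shape as a PLINQ word count; it illustrates the style only and is not Strand's pipeline:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MapReduceSketch
{
    public static Dictionary<string, int> WordCounts(
        IEnumerable<string> sequences, int k)
    {
        return sequences
            .AsParallel()
            .SelectMany(seq => Enumerable
                .Range(0, Math.Max(0, seq.Length - k + 1))
                .Select(i => seq.Substring(i, k)))     // map: sequence -> words
            .GroupBy(word => word)                     // shuffle: group by key
            .ToDictionary(g => g.Key, g => g.Count()); // reduce: count per key
    }
}
```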

3. LEARNING CATEGORY SIGNATURES

Figure 1 illustrates the high-level process of the Strand MapReduce pipeline, including the mapping of gene sequence data into words, the reduction of words into minimum hash values, and finally the last reduce step, which organizes the minhash signatures by category. In the following we describe each stage in detail.

Figure 1: High-level Diagram of the Strand MapReduce Style Pipeline.

3.1 Mapping Sequences into Words

The input data are sequences with associated categories.

DEFINITION 1     (Sequence). Let S be a single input sequence, a sequence of |S| symbols from alphabet Σ = {A, C, G, T}.

DEFINITION 2     (Category). Let C be the set of all L known taxonomic categories and cl ∈ C be a single category where l = {1, 2, … , L}. Each sequence S is assigned a unique true category cl ∈ C.

The goal of mapping sequences into words is to create a word profile for each sequence.

DEFINITION 3     (Sequence Word Profile). Let 𝒮 denote the word profile of sequence S, i.e., the set of all words sj ∈ 𝒮, j = {1, 2, … , |𝒮|}, of length k extracted from sequence S.
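
For example, with k = 3 the sequence ACGTAC yields the word profile {ACG, CGT, GTA, TAC}.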

3.2 Creating Sequence Minhash Signatures

As words are produced, minhashing operations are performed simultaneously in parallel to create minhash signatures.

DEFINITION 4     (Sequence Minhash Signature). Minhashing (min-wise locality sensitive hashing) applies a family of random hash functions h1, h2, …, hk to the input sequence word profile 𝒮, producing k independent random permutations, and chooses the element with the minimal hash value from each. We define the minhash function:

$$\mathrm{minhash}(\mathcal{S}) = \Big(\min_{s \in \mathcal{S}} h_1(s),\ \min_{s \in \mathcal{S}} h_2(s),\ \ldots,\ \min_{s \in \mathcal{S}} h_k(s)\Big)$$

Min-wise locality sensitive hashing operations are performed in parallel and continually consume all randomly selected word hashes for each processed sequence. The minhash signature’s length is predetermined by the number of random hash functions used during minhash processing, and this length impacts both processing time and overall classification accuracy. Using more hash functions (i.e., longer minhash signatures) has been proven to produce more accurate Jaccard estimates [17]. However, careful consideration must be given to the tradeoff between the minhash signature’s impact on performance and Jaccard estimation accuracy.

A thread-safe minhash signature collection contains one minhash signature for each unique input sequence. During minhash processing, all hash values produced for each sequence’s unique set of word hashes are compared to the sequence’s current minhash signature values. The minimum hash values across all unique words for each sequence and each unique hash function are then retained within each sequence’s final minhash signature. In applications where the similarity calculation incorporates word length or the frequency of each minimum hash value, the length and frequency of any word resulting in a particular minimum hash value can also be stored as additional properties within each minhash signature’s values. However, lengths are not retained within the Strand data structure during training since they can quickly be determined during any subsequent classification’s minhashing process. In some cases, the total number of hash functions and overall minhashing operations can be reduced by saving the n smallest hash values generated from each individual hash function. For instance, the number of hashing operations can be cut in half by retaining the 2 smallest values produced from each unique hash function.
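
A minimal sketch of that last optimization, assuming a small helper that retains the n smallest hash values seen for a single hash function (the class is hypothetical):

```csharp
using System.Collections.Generic;

// Hypothetical helper: retains the n smallest distinct hash values seen
// for one hash function, so each function contributes n signature entries.
sealed class SmallestN
{
    private readonly SortedSet<long> values = new SortedSet<long>();
    private readonly int n;

    public SmallestN(int n) { this.n = n; }

    public void Offer(long hash)
    {
        values.Add(hash);              // duplicates are ignored by the set
        if (values.Count > n)
            values.Remove(values.Max); // evict the largest, keep n smallest
    }

    public IEnumerable<long> Values => values;
}
```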

3.3 Reducing Sequence Minhash Signatures into Category Signatures

Next we discuss how to represent an entire category as a signature built from the minhash signatures of all sequences in the category.

DEFINITION 5     (Category Minhash Signature). We define the category minhash signature of category cl ∈ C as the union of the sequence minhash signatures of all sequences assigned to category cl:

$$\mathcal{C}_l = \bigcup_{\{i \,:\, S_i \text{ assigned to } c_l\}} \mathrm{minhash}(\mathcal{S}_i)$$

where the union is calculated for each minhash hashing function separately.

The Strand data structure is an array containing one data structure for each unique hash function used (see Figure 1). Since this structure is keyed using minhash values, hash function partitions must exist to separate the minhash values produced by each unique hash function. Within each individual hash function partition, a collection of key-value pairs (kvp) exists which contains the minhash value as a key and a second, nested collection of categorical key-value pairs as its value. The nested collection of kvp-values contains all category numbers and associated frequencies (when required) that have been encountered for a particular minhash value. In practice, however, minhash values seldom appear to be associated with more than one taxonomy category, which drastically reduces the opportunity for imbalance between categories, especially when minhash value frequencies are not used within the classification similarity function.

During learning, minimum hash values for each unique hash function are retained within the array of nested categorical key-value pair collections, partitioned by hash function. Each hash function’s collection contains all unique minimum hash values, their associated taxonomies, and optional frequencies. Using the Strand data structure, minhash signatures for each input data sequence can quickly be compared during classification to all minimum hash values associated with each known taxonomy, including the taxonomy frequency (when utilized). During training, each value in the input data sequence’s minhash signature is reduced into the Strand data structure by either creating a new entry or adding additional taxonomy categories to an existing entry’s nested categorical key-value pair collection.
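
The following C# sketch illustrates a structure in this spirit: one dictionary per hash function, keyed by minhash value, with a nested per-category frequency map. Types and names are assumptions for illustration, not Strand's exact implementation:

```csharp
using System.Collections.Generic;

// One dictionary per hash function, keyed by minhash value; each entry
// holds the categories (and optional frequencies) seen for that value.
sealed class CategorySignatureStore
{
    public readonly Dictionary<ulong, Dictionary<int, int>>[] Partitions;

    public CategorySignatureStore(int numHashFunctions)
    {
        Partitions = new Dictionary<ulong, Dictionary<int, int>>[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++)
            Partitions[i] = new Dictionary<ulong, Dictionary<int, int>>();
    }

    // Reduce one sequence minhash signature into its category's signature.
    public void Learn(ulong[] signature, int category)
    {
        for (int i = 0; i < signature.Length; i++)
        {
            if (!Partitions[i].TryGetValue(signature[i], out var categories))
                Partitions[i][signature[i]] = categories = new Dictionary<int, int>();
            categories.TryGetValue(category, out int freq);
            categories[category] = freq + 1; // optional frequency count
        }
    }
}
```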

All results presented in this research were achieved using only a binary classification similarity function. This approach produced optimal performance while still achieving comparable accuracy results when benchmarked against the current top performer in this domain.

4. CLASSIFICATION PROCESS

The MapReduce style architectures used for learning and classification are very similar. While the process of mapping words into minhash signatures is identical, the reduce function now creates classification scores instead of category signatures.

The word profile of a sequence is the set of words contained within the sequence. A common way to calculate the similarity between sets is the Jaccard index. The Jaccard index between two sequence word profiles 𝒮1 and 𝒮2 is defined as:

$$J(\mathcal{S}_1, \mathcal{S}_2) = \frac{|\mathcal{S}_1 \cap \mathcal{S}_2|}{|\mathcal{S}_1 \cup \mathcal{S}_2|}$$

However, after minhashing we only have the sequence minhash signatures M1 = minhash(𝒮1) and M2 = minhash(𝒮2) representing the two sequences. Fortunately, minhashing [17] allows us to efficiently estimate the Jaccard index using only the minhash signatures:

$$\hat{J}(\mathcal{S}_1, \mathcal{S}_2) = \frac{|M_1 \cap M_2|}{k}$$

where the intersection is taken hash-wise, i.e., how many minhash values agree between the two signatures.

Next, we discuss scoring the similarity between a sequence minhash signature and the category minhash signatures used for classification. Category signatures are not restricted to k values since they are created using the unique minhash values of all sequence minhash signatures belonging to the category. This is why we do not directly estimate the Jaccard index, but define a similarity measure based on the number of collisions between the minhash values in the sequence signature and the category signature.

DEFINITION 6     (Minhash Category Collision). We define the Minhash Category Collision between a sequence S, represented by its minhash signature M, and a category minhash signature 𝒞l as:

$$\mathrm{MCC}(M, \mathcal{C}_l) = |M \cap \mathcal{C}_l|,$$

where the intersection again counts hash-wise collisions between signature values.

We calculate MCC for each category and classify the sequence to the category resulting in the largest category collision count.
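
A minimal C# sketch of this scoring step against the nested structure sketched above (names and types are again illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

static class MccClassifier
{
    // Count hash-wise collisions between the sequence signature and every
    // category, then return the category with the largest collision count.
    public static int Classify(ulong[] signature,
        Dictionary<ulong, Dictionary<int, int>>[] partitions)
    {
        var collisions = new Dictionary<int, int>();
        for (int i = 0; i < signature.Length; i++)
        {
            if (!partitions[i].TryGetValue(signature[i], out var categories))
                continue;                      // no category shares this value
            foreach (int category in categories.Keys)
            {
                collisions.TryGetValue(category, out int c);
                collisions[category] = c + 1;
            }
        }
        if (collisions.Count == 0) return -1;  // nothing matched during learning
        return collisions.OrderByDescending(kv => kv.Value).First().Key;
    }
}
```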

Many other more sophisticated approaches to score sequences are possible. These are left for future research.

5. RESULTS

In this section we report on a set of initial experiments. First, we compare different word sizes and sequence signature lengths (i.e., the number of hash functions used for minhashing). Then we compare Strand with RDP using two different data sets.

The data sets we use are all 16S rRNA data sets. The RDP classifier raw training set was obtained from the RDP download page; it contains 9,217 sequences. The second data set is extracted from the Greengenes database; we randomly selected 150,000 unaligned sequences with complete taxonomic information for our experiments. We used 10-fold cross-validation for all experiments. During 10-fold cross-validation, the entire training file is randomly shuffled and then divided into ten equally sized folds or segments. While nine folds are learned, one fold is held out for classification testing. This process is repeated until all ten folds have been held out and classified against.

The experiments were performed on a Windows 8 (64-bit) machine with a 2.7 GHz Intel i7-4800MQ quad core CPU and 28 GB of main memory installed. For the RDP classifier we used version 2.5 (rdp-classifier-2.5.jar), and we implemented Strand in C#.

5.1 Choosing Word Size and Signature Length

Both the word size and the signature length need to be specified for Strand. We expect both parameters to have an impact on classification accuracy, space, and run time. While it is clear that an increasing signature length also increases the time needed to compute sequence signatures and the space needed to store them, the impact of word size is not so clear. In the following we empirically find good values for both parameters.

To look at the impact of the word length, we set the signature size (i.e., the number of hash functions used for minhashing) to 300, which was empirically found to be a reasonable value. Next we perform 10-fold cross-validation on the RDP training data for different word lengths ranging from 8 bases to 200 bases. The average accuracy of Genus prediction and overall prediction (average accuracy over all phylogenetic ranks) depending on the word length is shown in Figure 2. We see that accuracy increases with word length until the word length reaches 60 bases and then starts to fall quickly at lengths larger than 100 bases. This shows that the optimal word length for the 16S rRNA data used is around 60 bases.

Figure 2: Strand word size accuracy on RDP 16S rRNA.

Next, we look at the impact of sequence signature length. We use a fixed word size of 60 and again perform 10-fold cross-validation on the RDP training data set using signature lengths from 50 to 500. Figure 3 shows the impact of signature length. Accuracy initially increases with signature length, but flattens at about 300. Since an increased signature length directly increases run time (more hashes need to be calculated) and storage, we conclude that 300 is the optimal size for the 16S rRNA data used, but signature lengths of 200 or even 100 also provide good accuracy at lower computational cost.

Figure 3: Strand signature length accuracy on RDP 16S rRNA.

5.2 Comparison of Strand and RDP on the RDP Training Data

In this section we compare Strand and RDP in terms of run time and accuracy. For the comparison we again use 10-fold cross-validation on the RDP training data set. Table 1 shows the run times for learning and classification using Strand and RDP. While Strand's learning times are approximately 30% faster, Strand classifies sequences almost 20 times faster than RDP. Strand trained on 8,296 sequences averaging around 23 seconds per fold, while RDP averaged around 33 seconds on the same 8,296 training sequences. During 10-fold cross-validation, Strand classified 921 sequences averaging 3 seconds per fold, while RDP’s average classification time was 59 seconds per fold. This is substantial since classification occurs much more frequently than training in a typical application. Because Strand uses longer words during training and classification, no bootstrap sampling or consensus is needed to determine taxonomy assignments. Combining this factor with a highly parallel MapReduce style processing pipeline, Strand greatly increases classification speed compared to RDP.

Table 1: 10-fold cross-validation performance comparison between Strand and RDP.

An accuracy comparison between Strand and RDP is shown in Table 2. In cross-validation it is possible to have a sequence in the test set whose Genus does not appear in the learning set. We exclude such sequences from the results since we cannot predict a Genus which we have not encountered during learning. Strand achieves overall accuracy similar to RDP, however, as we saw above, in a fraction of the time.

Table 2: 10-fold cross-validation accuracy comparison between Strand and RDP.

5.3 Comparison of Strand and RDP on the Greengenes Data

Here we compare Strand and RDP on a sample of 150,000 sequences from the Greengenes project. While the RDP training set is relatively small and well curated to create a good classifier, these sequences contain more variation. To analyze the impact of data set size, we hold back 25,000 sequences for testing and then use incrementally larger training sets from 25,000 to 125,000 sequences in increments of 25,000. For Strand we use a word size of 60 and a sequence signature length of 100.

Figure 4 shows the classification accuracy of Strand and RDP using increasingly larger training sets. Strand has slightly higher accuracy, and accuracy increases for both classifiers with training set size. However, it is interesting that after 100,000 sequences the accuracy starts to drop, with a significantly steeper drop for RDP.

Figure 4: Strand and RDP accuracy on Greengenes data.

Figure 5 compares the time needed to train the classifiers with increasing training set size. During training, Strand execution times consistently outperform RDP, with the training time deltas widening further as input training volumes increase. In Figure 5, RDP training times increase rapidly as the training set size grows, while Strand training times increase at a rate closer to linear. When training on 125,000 Greengenes sequences, Strand completes training in 5:39 (mm:ss) while RDP takes 16:50 (mm:ss). Classification time does not vary much with training set size. Strand’s average classification time for the 25,000 test sequences is 1:41 (mm:ss) while RDP’s is 20:15 (mm:ss).

Figure 5: Strand and RDP running time on Greengenes data.

Finally, we look at memory requirements for Strand. Since we use a word size of 60 bases, there exist 4^60 (more than 10^36) unique words. For RDP’s word size of 8 there are only 4^8 = 65,536 unique words. Strand deals with the huge number of possible words by using minhashing and adding only a small signature for each sequence to the category signature. The Strand data structure will still continue to grow as the volume of training data increases; its overall space consumption is directly related to the selected word size, the minhash signature length, and the number of sequences learned. Figure 6 shows the percentage of unique signature entries (minhash values) stored relative to the number of words processed. With increasing training set size, the fraction of retained entries falls from 2% at 25,000 sequences to just above 1% at 125,000 sequences, a consequence of many sequences sharing words. In total, Strand stores 1.7 million entries for 125,000 sequences, which is more than the 65,536 entries stored by RDP, but easily fits in less than 1 GB of main memory, less than is found in most modern smartphones.

Figure 6: Percentage of retained entries in the Strand data structure.

6. CONCLUSION

In this paper we have introduced a novel word-based sequence classification scheme that utilizes large word sizes.  A highly parallel MapReduce style pipeline is used to simultaneously map gene sequence input data into words, map words into word hashes, reduce all word hashes within a single sequence into a minimum hash signature, and populate a data structure with category minhash signatures which can be used for rapid classification. Experiments show that for 16S rRNA data a word size of 60 bases and a sequence minhash signature length of 300 produce the best classification accuracy. Compared to RDP, Strand provides comparable accuracy while performing classification 20 times faster.

Acknowledgements

The systems and methods described in this publication are protected by the following US patent applications:

  1. Jake Drew. 2014. Collaborative Analytics Map Reduction Classification Learning Systems and Methods. (Feb. 2013). Patent Application No. 14/169,689, filed February 6, 2013.
  2. Jake Drew, Michael Hahsler, Tyler Moore. 2014. System and Method for Machine Learning and Classifying Data. (May 2013). Patent Application No. 14/283,031, filed May 20, 2013.

7. REFERENCES

  1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990.
  2. A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997.
  3. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 327-336. ACM, 1998.
  4. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157-1166, 1997.
  5. J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17(5):419-428, 2001.
  6. C.-Y. Chiu, H.-M. Wang, and C.-S. Chen. Fast min-hashing indexing and robust spatio-temporal matching for detecting video copies. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 6(2):10, 2010.
  7. C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for Machine Learning on Multicore. In B. Schölkopf, J. C. Platt, and T. Hofmann, editors, NIPS, pages 281-288. MIT Press, 2006.
  8. O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Finding a (thick) needle in a haystack. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 17-24. IEEE, 2009.
  9. J. Dean and S. Ghemawat. Mapreduce: A flexible data processing tool. Communications of the ACM, 53(1):72-77, 2010.
  10. J. Drew. Mapreduce: Map reduction strategies using C#, 2013.
  11. R. C. Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics (Oxford, England), 26(19):2460-1, 2010.
  12. D. Keco and A. Subasi. Parallelization of genetic algorithms using hadoop map/reduce. South East Europe Journal of Soft Computing, 1(2), 2012.
  13. L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. Scene: A scalable two-stage personalized news recommendation system. In ACM Conference on Information Retrieval (SIGIR), 2011.
  14. G. Marçais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, 2011.
  15. A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al. The genome analysis toolkit: A mapreduce framework for analyzing next-generation dna sequencing data. Genome research, 20(9):1297-1303, 2010.
  16. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.
  17. A. Rajaraman and J. Ullman. Mining of Massive Datasets. Mining of Massive Datasets. Cambridge University Press, 2012.
  18. C. E. Shannon. A mathematical theory of communication. The Bell Systems Technical Journal, 27:379-423, 1948.
  19. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, March 1981.
  20. E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191-211, 1992.
  21. S. Vinga and J. Almeida. Alignment-free sequence comparison - a review. Bioinformatics, 19(4):513-523, 2003.
  22. Q. Wang, G. M. Garrity, J. M. Tiedje, and J. R. Cole. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16):5261-5267, 2007.

Clustering Similar Images Using MapReduce Style Feature Extraction with C# and R

Cluster images are displayed as they are selected by the user.

Abstract

This article provides a full demo application using both the C# and R programming languages interchangeably to rapidly identify and cluster similar images.  The demo application includes a directory with 687 screenshots of webpages.  Many of these images are very similar, with different domain names but nearly identical content.  Some images are only slightly similar, with the sites using the same general layouts but different colors and different images on certain portions of the page.  The demo application can create a pairwise distance matrix including measures of dissimilarity between all 687 websites in around 67 seconds on my laptop.  Next, the R programming language is used to perform Hierarchical Agglomerative Clustering, grouping all similar images together in the same clusters.  The application demonstrates how to create very tight, highly similar clusters or larger, very loose, slightly dissimilar clusters by adjusting the clustering cut height.  After reviewing this article and the included demo project code, the reader should be able to identify similar images and weave together high-performance processing pipelines using inputs and outputs from both the C# and R programming languages.

Here is a screen shot of the demo application displaying the first 4 similar images from cluster 7:

Cluster images are displayed as they are selected by the user.

Introduction

While the C# programming language is “a multi-paradigm programming language encompassing strong typing, imperative, declarative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines” [1], R is an open source functional programming language [2] which has become very popular for more complex statistical data analysis.  Typically, these two programming languages are seldom used together.  There have been attempts to integrate R and C# within the same process using R.NET. [3]  However, this project is relatively unused, with around 93 downloads on its homepage as of today.

One of the primary advantages of using R is the large number of “packages” contributed by users across the globe which can be quickly integrated into any R project for added programming features and functionality.  C#, on the other hand, has some extensive built-in libraries for rapid parallel programming development which are simply hard to surpass.  I was recently excited to discover that the standard R installation includes a program called Rscript.exe which can be used from the command line to execute R programming code contained within a file.  After a little experimenting, I was able to write a small class within C# which will execute any R program file using Rscript.exe.  In fact, you can even pass variables back and forth between the two languages interchangeably within a single processing pipeline.

By the end of this article, the reader can expect to understand the following concepts:

  1. Executing R programming code from within a C# program.
  2. Passing variables and information between C# and R programs.
  3. Extracting luminosity histogram features from images using C#.
  4. Comparing two image’s luminosity histograms to measure image similarity.
  5. Executing steps 3 and 4 in parallel using C# MapReduce style processing to create a pairwise distance matrix from a large collection of images.
  6. Performing Hierarchical Agglomerative Clustering using the R programming language to cluster similar images together using input from the pairwise distance matrix.
  7. Understanding the impact to the clusters created when changing the cut height parameter during Hierarchical Agglomerative Clustering.

This may sound like a lot of material to cover.  However, all of the code and clustering results can be reviewed using the fully functional demo application and sample images which are included in the Resources section below.  The demo application was written in C# using Visual Studio 2013 and requires a full installation of the R programming language to work properly.

Executing R Programming Code From Within a C# Program

The RScriptRunner class executes R programs from C# using the Rscript.exe program which comes with the standard R installation.  This class simply starts Rscript.exe in the background and then passes it any number of command line arguments which can be accessed and used within the R program.

The C# code starts the Rscript.exe program as a separate process and redirects the R standard output to a string variable which is returned to C# when the R program completes:

Figure 1 – Run an R program from C#
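
Since the code in Figure 1 is shown as an image, here is a minimal sketch of what such a runner can look like using System.Diagnostics.Process; the details of the article's actual RScriptRunner class may differ:

```csharp
using System.Diagnostics;

static class RScriptRunnerSketch
{
    // Start Rscript.exe, pass the script path plus space-separated arguments,
    // capture standard output, and return it once the R program exits.
    public static string RunFromCmd(string rScriptExe, string rScriptFile,
                                    string args)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = rScriptExe,                   // e.g. "Rscript.exe"
            Arguments = $"\"{rScriptFile}\" {args}",
            UseShellExecute = false,                 // required for redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (var process = Process.Start(startInfo))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```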

The only cumbersome part of using the RScriptRunner‘s RunFromCmd() function is making sure that you provide the correct location for the Rscript.exe program.  On my machine, this was as simple as providing the value “Rscript.exe”.  However, on other computers where Rscript.exe is not recognized as a valid command line command (i.e. a proper path / environment variable was not set), this argument may require a fully qualified path for the “Rscript.exe” program.  My best advice would be to figure out what is required to enter in the command line window to start Rscript.exe and then use that value.

The following C# code snippet executes the R program ImageClustering.r using the RscriptRunner.RunFromCmd() function:

Figure 2 – Run R code from C# and then use the results.

Prior to execution of the RscriptRunner.RunFromCmd() function, the three variables workingDirectory, clusteringOutput, and cutHeight are separated by spaces and passed in as arguments to the function.  Once these arguments are passed to Rscript.exe, they can be accessed directly from within the ImageClustering.r program.

Figure 3 demonstrates how each variable is captured in the ImageClustering.r program, and then how the first argument is used to set the working directory within R:

Figure 3 – Using a C# variable in an R program.

Since the output file location for the ImageClustering.r program was actually provided by C# as the variable “clusteringOutput” in Figure 2, the C# program can easily begin to process the output created by the R clustering program when the RscriptRunner.RunFromCmd() function call completes.  The last foreach() loop in Figure 2 demonstrates C# reading in the clustering assignments created by the ImageClustering.r program.

Extracting Luminosity Histogram Features From Images Using C#

“Luminance is a photometric measure of the luminous intensity per unit area of light” which “describes the amount of light that passes through or is emitted from a particular area, and falls within a given solid angle.” [4]  Relative luminance follows this definition, but it normalizes its values to 1 or 100 for a reference white. [5]  When separated into RGB components, a luminosity histogram acts as a very powerful machine learning fingerprint for an image.  Since these types of histograms only evaluate the distribution and occurrence of luminosity color information, they can handle affine transformations quite well. [6]  In simple terms, it is very easy to separate the RGB components of an image using C#.  In fact, using the “unsafe” keyword you can very, very rapidly calculate the relative luminance of each pixel within an image.  I am certainly not the first person to think of doing this in C# either.  There are several C# Computer Vision pioneers who deserve some credit in this regard. [7,8]  For this project, I was able to quickly adapt the luminosity histogram feature extraction program contained within the Eye.Open library. [8]

The following figure demonstrates how an image luminosity histogram can quickly be calculated using C#:

Figure 4 – Use RGB channels from image to calculate luminosity for each pixel.

Once an image has been processed, both vertical and horizontal luminosity histograms are extracted and retained for similarity calculations between images.  Figure 4 shows each pixel’s red, green, and blue channels being isolated for use within the luminosity calculation.  Images can quickly be compared for similarity using the similarity function shown in Figure 5, which calculates the deviation between two histograms using the weighted mean.  Once both the vertical and horizontal similarities have been calculated, the maximum or average vertical / horizontal similarity is retained for each image pair to create the pairwise image distance matrix.
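
Because the Figure 4 code is embedded as an image, here is a minimal C# sketch of the per-pixel idea, assuming Rec. 709 luminance weights and a simple 256-bin histogram; the adapted Eye.Open code uses unsafe pointer access rather than the slow GetPixel() shown here, and its exact weights and binning may differ:

```csharp
using System.Drawing;

static class Luminosity
{
    // Per-pixel relative luminance (Rec. 709 weights) binned into 256 buckets.
    // GetPixel() is slow and used here only for clarity; the article's code
    // uses the "unsafe" keyword and raw pixel access for speed.
    public static int[] Histogram(Bitmap image)
    {
        var histogram = new int[256];
        for (int y = 0; y < image.Height; y++)
            for (int x = 0; x < image.Width; x++)
            {
                Color c = image.GetPixel(x, y);
                double lum = 0.2126 * c.R + 0.7152 * c.G + 0.0722 * c.B;
                histogram[(int)lum]++;
            }
        return histogram;
    }
}
```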

Figure 5 demonstrates how similarity can be measured between two luminosity histograms:

 

Figure 5 – Luminosity Histogram Similarity Function

Keep in mind that there are many ways to extract features from images and create image fingerprints.  Many people recommend other approaches such as taking 2D Haar wavelets of each image. [9]  In addition, higher performance options such as using minhashing with tf-idf weighting have also been implemented. [10]  Haar features have been successfully used for more complex Computer Vision tasks such as detecting faces within an image. [11]  However, the speed and performance of luminosity histograms are hard to ignore if the recognition quality meets your requirements.  Other image feature engine implementations in C# such as CEDD: Color and Edge Directivity Descriptors [12], FCTH: Fuzzy Color and Texture Histograms [13], a hybrid scheme using NMRR and ANMRR values [14], and LIRe: Lucene Image Retrieval [15] can be located within the C# application provided here. [7]

Using MapReduce Style Processing to Extract and Compare Image Fingerprints in Parallel

A parallel pipeline within the demo project is used to “map” images into luminosity histograms while simultaneously “reducing” luminosity histograms into a pairwise distance matrix.  For those of you wondering, a pairwise distance matrix is simply a matrix that contains a measure of dissimilarity for all possible pairs of images considered.  Dissimilarity for a pair of images is calculated as 1 - the similarity between both images.  The distance matrix will always contain (n^2 - n) / 2 meaningful entries, where n equals the total number of unique items in the matrix.  This is because all items intersect with themselves down the center diagonal, and the values on either side of the diagonal are duplicates that only need to be stored once.
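
For the 687 demo images, that works out to (687^2 - 687) / 2 = 235,641 meaningful distance entries.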

Figure 6 further demonstrates this concept:

Figure 6 – Only the information below the diagonal is required.

The pairwise distance matrix is required as input when using the Hierarchical Agglomerative Clustering function within R.  In order to create this matrix in parallel, a three-stage parallel pipeline is used.  First, pairwise image matches are created using a C# yield return enumeration.  Each time the generateMatches() function in Figure 7 produces a pairwise match, processing stops and “yield returns” the match to the createPairwiseMatches() function’s Parallel.ForEach loop.

Figure 7 shows the relationship between these two functions, which execute in a background process during the distance matrix creation:

Figure 7 – The MapReduce Mapping Stage.

The createPairwiseMatches() function shown in Figure 7 extracts features in parallel, mapping images to vertical and horizontal luminosity histograms.  Furthermore, the histograms for each image are saved in a hash table for quick reference since each image’s features will be repeatedly matched against other images.  Once the match features are extracted, the match is immediately placed in a thread-safe blocking collection for further downstream reduction processing.  While the mapping functions shown in Figure 7 execute in a background thread, parallel reduce functions simultaneously process each completed match to calculate the similarity between the match images.
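
As a simplified sketch of this hand-off (not the demo project's exact code), mappers can publish completed matches into a BlockingCollection while reducers consume them concurrently:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class PipelineSketch
{
    // Mappers publish completed matches into a bounded BlockingCollection
    // while reducers consume and score them concurrently.
    public static void Run(IEnumerable<(string A, string B)> matches,
                           Action<(string A, string B)> reduce)
    {
        var queue = new BlockingCollection<(string A, string B)>(boundedCapacity: 1024);

        var mapper = Task.Run(() =>
        {
            Parallel.ForEach(matches, m => queue.Add(m)); // map stage
            queue.CompleteAdding();                       // signal end of input
        });

        Parallel.ForEach(queue.GetConsumingEnumerable(), reduce); // reduce stage
        mapper.Wait();
    }
}
```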

Figure 8 shows the calculateDistances() function reducing image features into a single similarity measure:

Figure 8 – MapReduce Reduction Stage.

Once this process has completed, a pairwise distance matrix is saved to disk which can be used as input into the R program’s Hierarchical Agglomerative Clustering engine.  Similar to Figure 6 above, the final distance matrix contains pairwise distances for all images in the input directory.

Figure 9 demonstrates that only the minimum required number of pairwise distances were retained:

Figure 9 – The Pairwise Distance Matrix.

Hierarchical Agglomerative Clustering in R

While there are multitudes of packages and options for clustering data within R, the base language provides functions for simple HAC clustering.  The purpose of this article is not to explain in too much detail how HAC clustering works.  Rather, it demonstrates how HAC clustering can be used to identify similar images.

The purpose of clustering is to divide a collection of items into groups based on their similarity to each other.  Very simplistically, the HAC clustering of images works by comparing the pairwise distances of all images and then grouping them into a structure called a dendrogram.  The dendrogram is a map of all possible clustering assignments at various dissimilarity thresholds.  The dissimilarity threshold dictates the maximum amount two images (or two clusters of images) are allowed to be dissimilar and still end up being merged into the same cluster.  Once the dendrogram has been created, all images can quickly be assigned to clusters using any dissimilarity threshold value, which is referred to as the “cut height”.  The cut height is typically provided by the user.  This process can also occur in reverse, with the user requesting a particular total number of clusters, at which point the algorithm calculates the best cut height to achieve the requested result.  While a small cut height will produce smaller clusters with highly similar images, a large cut height will create larger clusters containing more dissimilar images.

What this all means is that the dendrogram maps out all possible clustering memberships based on each image’s dissimilarity to the cluster as a group.  This can be done using each cluster’s minimum, maximum, average, or centroid distances.  Depending on which measure you choose, the clustering type is referred to by “nerds” as either single-linkage (minimum), complete-linkage (maximum), UPGMA (Unweighted Pair Group Method with Arithmetic Mean) (average), or centroid-based (centroid) clustering.  While there are many types of clustering methods, these seem to be the most common.

The clustering algorithms typically work by starting with all images in their own individual clusters of one, and then successively combining the clusters which are in closest distance proximity based on the distance metrics described above.  The clusters are combined until no more clusters can be joined without violating the provided cut height threshold.  While this description is a slight over-simplification, additional research regarding this topic is left up to the reader.

Figure 10 shows a dendrogram with all possible clustering combinations for 40 images at various cut heights which are displayed along the y axis:

Figure 10 – A Dendrogram for 40 Images.

Looking at the dendrogram in Figure 10 above, it is easy to see that a cut height of 0.60 would produce only two clusters containing all 40 images.  In this case, the two clusters are very large and likely contain many dissimilar images, since the cut height threshold allows images with a distance of up to 0.60 to be included within the same cluster.  At the other extreme, a cut height of 0.10 places all but 2 images into singleton clusters containing only one image each.  This is because images must be at least 90% similar to be included within the same cluster.  Using the demo application, the cut height can be adjusted to explore the impact on clustering similar images.

Figure 11 illustrates how image clusters are changed by a 10% adjustment in cut height when using the demo application’s sample images:

Figure 11 – Cut Height Impact on Clusters

Figure 11 shows a 10% reduction in cut height forcing the third image out of the cluster.  Since two of the images are highly similar, they remain in a cluster of 2 once the cut height dissimilarity threshold is reduced to 0.25.  It is important to understand that the optimal cut height for image clustering will vary greatly depending on the types of images you are trying to cluster and the image features used to create the pairwise image distance matrix.  Even within the sample images provided, strong arguments can be made for adjustments in the cut height depending on individual goals.

For instance, Figure 12 shows 4 images included in a single cluster.  However, some might argue that one of these images is sufficiently different from the other three:

Figure 12 – Cut Height Impact on Cluster Similarity.

Conversely, if one were trying to identify websites made from the same template, all of the images above would be clustered acceptably.  In fact, you would even want these images to be included in the same cluster even if they had different color schemes.  In this instance, a more generous cut height might be applied, and in some cases, different features might be required for the image matching exercise at hand.

Once the distance matrix has been created, it is relatively straightforward to perform the clustering using R.  The R program used to perform  Hierarchical Agglomerative Clustering on the image distance matrix can be seen in Figure 13 below:

Figure 13 – HAC Clustering Using R.

Once the image distance matrix is saved to disk by C#, the ImageClustering.r program reads in the file and converts it to an R distance matrix (dist) object.  Next, the hclust() function creates the dendrogram mapping using the UPGMA or “average” distance method.  The cutree() function then cuts the dendrogram at the “cutHeight” which is specified by the user.  It is important to note that this value was actually specified by the user on the demo application’s form.  This value can be seen being passed as an argument from C# to the RscriptRunner’s RunFromCmd() function in Figure 2.  It is also seen being captured within the ImageClustering.r program in Figure 3.  The R clustering program is very efficient, clustering the 687 images in just over 1 second on my machine.  Finally, the file names are sorted by cluster number and written back to disk.

Conclusion

This article successfully demonstrates that the C# and R programming languages can be combined to create powerful parallel processing pipelines using MapReduce style programming and harnessing the analytical powers of R as needed.  Information produced in both R and C# can also easily be exchanged and combined across programs when necessary.  The demo application provided in the Resources section below uses the powerful parallel programming libraries in C# to rapidly extract and compare luminosity histograms from images, creating a pairwise distance matrix for all images contained in a directory folder.  Next, C# uses R to perform HAC clustering to combine similar images and then displays the output from R on the demo application’s form.  The demo application gives the user a thumbnail preview of the currently selected image row and also the next 3 images below the currently selected image.  This allows the user to change the clustering cut height and quickly re-run the R clustering program until they are satisfied with the image clustering results.

Article Resources

  • Feel free to learn more about me at: www.jakemdrew.com
  • The demo application, screenshot images, and source code presented within this article are available for review on the web and can be downloaded using the links below.

View Code On The Web (html)

  • ImageClustering.r – The R program for HAC clustering using an image distance matrix.
  • RScriptRunner – C# class used for executing R programs from C#.
  • ImageDistanceMatrix – C# class which creates a pairwise distance matrix for all images in a folder.
  • RgbProjector – C# Adaptation of the Eye.Open project’s code used for extracting luminosity histograms from images and comparing them for similarity.

Demo Application Downloads (zip file)

  • ImageClusteringSmall.zip – The entire image clustering Visual Studio project including all source code, 40 webpage screen shots, and distance matrix / clustering outputs for the sample images. (42,068 KB)
  • ImageClusteringAll.zip – The entire image clustering Visual Studio project including all source code, 687 webpage screen shots, and distance matrix / clustering outputs for the sample images. Warning!!! (210,432 KB)
  • ImageClustering.zip – The image clustering Visual Studio class library project including the RScriptRunner, ImageDistanceMatrix, and RgbProjector source and no demo application or images. (42 kb)
  • ImageClusteringApp.zip – The image clustering Visual Studio class library and demo app with NO SCREENSHOTS. (232 kb)

References

  1. Wikipedia, C#, http://en.wikipedia.org/wiki/C_Sharp_(programming_language), accessed on 06/24/2014.
  2. Advanced R,  Hadley Wickham, http://adv-r.had.co.nz/Functional-programming.html, accessed on 06/24/2014.
  3. CodePlex, R.NET, https://rdotnet.codeplex.com/, accessed on 06/24/2014.
  4. Wikipedia, Luminance, http://en.wikipedia.org/wiki/Luminance, accessed on 06/24/2014.
  5. Wikipedia, Relative Luminance, http://en.wikipedia.org/wiki/Luminance_(relative)#cite_note-1, accessed on 06/24/2014.
  6. Stack Overflow, Image Fingerprint, http://stackoverflow.com/questions/596262/image-fingerprint-to-compare-similarity-of-many-images
  7. AI Applications, C# Tutorials, http://savvash.blogspot.com/p/c-tutorials.html, accessed on 06/24/2014.
  8. CodePlex, Similar Images Finder, https://similarimagesfinder.codeplex.com/, accessed on 06/24/2014.
  9. Stack Overflow, Near-Duplicate Image Detection,  http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1076507#1076507, accessed on 06/24/2014.
  10. Chum et al., Near Duplicate Image Detection: min-Hash and tf-idf Weighting, http://cmp.felk.cvut.cz/~chum/papers/chum_bmvc08.pdf, accessed on 06/24/2014.
  11. CodeProject, Haar-feature Object Detection in C#, http://www.codeproject.com/Articles/441226/Haar-feature-Object-Detection-in-Csharp, accessed on 06/24/2014.
  12. Savvas et al., CEDD: Color and Edge Directivity Descriptors, http://orpheus.ee.duth.gr/anaktisi/pdfs/CEDD.pdf, accessed on 06/24/2014.
  13. Savvas et al., FCTH: Fuzzy Color and Texture Histogram, http://orpheus.ee.duth.gr/anaktisi/pdfs/FCTH.pdf, accessed on 06/24/2014.
  14. Savvas et al., A Hybrid Scheme for Fast and Accurate Image Retrieval based on Color Descriptors, http://www.actapress.com/Abstract.aspx?paperId=31620, accessed on 06/24/2014.
  15. Mathias et al., Lire: Lucene Image Retrieval: an extensible Java CBIR library, http://www.researchgate.net/publication/221573372_Lire_lucene_image_retrieval_an_extensible_java_CBIR_library/file/3deec526a1ca298d98.pdf, accessed on 06/24/2014.

Automatic Identification of Replicated Criminal Websites Using Combined Clustering Methods

The following publication was presented at the 2014 IEEE International Workshop on Cyber Crime and received the Best Paper Award on 5/18/2014.  The original IEEE LaTeX formatted PDF publication can also be downloaded from here: IWCC Combined Clustering.

AUTHORS

ABSTRACT

To be successful, cybercriminals must figure out how to scale their scams. They duplicate content on new websites, often staying one step ahead of defenders that shut down past schemes. For some scams, such as phishing and counterfeit-goods shops, the duplicated content remains nearly identical. In others, such as advanced-fee fraud and online Ponzi schemes, the criminal must alter content so that it appears different in order to evade detection by victims and law enforcement. Nevertheless, similarities often remain, in terms of the website structure or content, since making truly unique copies does not scale well. In this paper, we present a novel consensus clustering method that links together replicated scam websites, even when the criminal has taken steps to hide connections. We evaluate its performance against two collected datasets of scam websites: fake-escrow services and high-yield investment programs (HYIPs). We find that our method more accurately groups similar websites together than do existing general-purpose consensus clustering methods.

I. INTRODUCTION AND BACKGROUND

Cybercriminals have adopted two well-known strategies for defrauding consumers online: large-scale and targeted attacks.  Many successful scams are designed for massive scale. Phishing scams impersonate banks and online service providers by the thousand, blasting out millions of spam emails to lure a very small fraction of users to fake websites under criminal control [8], [20]. Miscreants peddle counterfeit goods and pharmaceuticals, succeeding despite very low conversion rates [12]. The criminals profit because they can easily replicate content across domains, despite efforts to quickly take down content hosted on compromised websites [20]. Defenders have responded by using machine learning techniques to automatically classify malicious websites [23] and to cluster website copies together [4], [16], [18], [27].

Given the available countermeasures to untargeted large-scale attacks, some cybercriminals have instead focused on individualized attacks suited to their target. Such attacks are much more difficult to detect using automated methods, since the criminal typically crafts bespoke communications; they are often not detected until after the attack has already succeeded.

Yet these two approaches represent extremes among the strategies available to cybercriminals. In fact, many miscreants operate somewhere in between, carefully replicating the logic of scams without completely copying all material from prior iterations of the attack. For example, criminals engaged in advanced-fee frauds may create bank websites for non-existent banks, complete with online banking services where the victim can log in to inspect their “deposits”. When one fake bank is shut down, the criminals create a new one that has been tweaked from the former website. Similarly, criminals establish fake escrow services as part of a larger advanced-fee fraud [21]. On the surface, the escrow websites look different, but they often share similarities in page text or HTML structure.  Yet other examples are online Ponzi schemes called high-yield investment programs (HYIPs) [22]. These programs offer outlandish interest rates to draw investors, which means they inevitably collapse when new deposits dry up. The perpetrators behind the scenes then create new programs that often share similarities with earlier versions.

The designers of these scams have a strong incentive to keep their new copies distinct from the old ones: prospective victims may be scared away if they realize that an older version of the website has been reported as fraudulent. Hence, the criminals make a concerted effort to distinguish each new copy from its predecessors.

While in principle the criminals could start over from scratch with each new scam, in practice it is expensive to recreate entirely new content repeatedly. Hence, attributes that can be changed easily are changed (e.g., service name, domain name, registration information). Website structure (if the site comes from a kit) or the text on a page (if the criminal’s English or writing composition skills are weak) are more costly to change, so they typically receive only minor edits.

The purpose of this paper is to design, implement, and evaluate a method for clustering these “logical copies” of scam websites. In Section II we describe two sources of data on scam websites that we will use for evaluation: fake-escrow websites and HYIPs. In Section III we outline a consensus clustering method that weighs HTML tags, website text, and file structure in order to link disparate websites together. We then evaluate the method compared to other approaches in the consensus clustering literature and cybercrime literature to demonstrate its improved accuracy in Section IV. In Section V we apply the method to the entire fake-escrow and HYIP datasets and analyze the findings. We review related work in Section VI and conclude in Section VII.

II. DATA COLLECTION METHODOLOGY

In order to demonstrate the generality of our clustering approach, we collected datasets on two very different forms of cybercrime: online Ponzi schemes known as High-Yield Investment Programs (HYIPs) and fake-escrow websites. In both cases, we fetched the HTML using wget, following links to a depth of 1 while duplicating each website’s directory structure. All communications were run through the anonymizing service Tor [6]. A rough sketch of this collection step appears below.
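The paper collects pages with wget over Tor and includes no code, so the following Python sketch is only an illustrative approximation of a depth-1, structure-preserving crawl; it is not the authors’ tooling. It assumes a local Tor SOCKS proxy on 127.0.0.1:9050 and the third-party requests[socks] and beautifulsoup4 packages, and all names in it are ours.

    # Illustrative depth-1 crawl through Tor (approximates the paper's
    # wget-based collection; not the authors' tooling).
    import os
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    TOR = {"http": "socks5h://127.0.0.1:9050",
           "https": "socks5h://127.0.0.1:9050"}

    def save_page(base_dir, url, content):
        # Mirror the site's directory structure on disk, like wget -r.
        rel = urlparse(url).path.lstrip("/") or "index.html"
        local = os.path.join(base_dir, rel)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as f:
            f.write(content)

    def crawl_depth_one(start_url, base_dir):
        resp = requests.get(start_url, proxies=TOR, timeout=60)
        save_page(base_dir, start_url, resp.content)
        host = urlparse(start_url).netloc
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(start_url, a["href"])
            if urlparse(link).netloc == host:  # follow on-site links only
                page = requests.get(link, proxies=TOR, timeout=60)
                save_page(base_dir, link, page.content)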

Data Source 1: Online Ponzi schemes We use the HYIP websites identified by Moore et al. in [22]. HYIPs peddle dubious financial products that promise unrealistically high returns on customer deposits in the range of 1–2% interest, compounded daily. HYIPs can afford to pay such generous returns by paying out existing depositors with funds obtained from new customers; thus, they meet the classic definition of a Ponzi scheme. Because HYIPs routinely fail, a number of ethically questionable entrepreneurs have spotted an opportunity to track HYIPs and alert investors to when they should withdraw money from schemes prior to collapse. Moore et al. repeatedly crawled the websites listed by HYIP aggregators, such as hyip.com, which monitor for new HYIP websites as well as track those that have failed. In all, we have identified 4,191 HYIP websites operational between November 7, 2010 and September 27, 2012.

Data Source 2: Fake-escrow websites A long-running form of advanced-fee fraud is for criminals to set up fraudulent escrow services [21] and dupe consumers with attractively priced high-value items, such as cars and boats, that cannot be paid for using credit cards. After the sale, the fraudster directs the buyer to use an escrow service chosen by the criminal, which is in fact a sham website. A number of volunteer groups track these websites and attempt to shut them down by notifying hosting providers and domain name registrars. We identified reports from two leading sources of fake-escrow websites, aa419.org and escrow-fraud.com, and used automated scripts to check for new reports daily. When new websites are reported, we collect the relevant HTML. In all, we have identified 1,216 fake-escrow websites reported between January 7, 2013 and June 6, 2013.

For both data sources, we expect that the criminals behind the schemes are frequently repeat offenders. As earlier schemes collapse or are shut down, new websites emerge.  However, while there is usually an attempt to hide evidence of any link between the scam websites, it may be possible to identify hidden similarities by inspecting the structure of the HTML code and website content. We next describe a process for identifying such similarities.

III. METHOD FOR IDENTIFYING REPLICATING CRIMINAL WEBSITES

We now describe a general-purpose method for identifying replicated websites. Figure 1 provides a high-level overview; we briefly list each step here and then detail each one below.


Fig. 1: High-level diagram explaining how the method works.

  1. Data scraper: Raw information on websites is gathered (as described in Section II).
  2. Input attributes: Complementary attributes such as website text and HTML tags are extracted from the raw data on each website.
  3. Distance matrices: Pairwise similarities between websites for each attribute are computed using Jaccard distance metrics.
  4. Clustering stage 1: Hierarchical, agglomerative clustering methods are calculated using each distance matrix, rendering distinct clusterings for each input attribute.
  5. Consensus matrix: A single distance matrix is calculated by combining the individual distance matrices.
  6. Clustering stage 2: Hierarchical, agglomerative clustering methods are calculated using the consensus distance matrix to arrive at the final clustering.

Extracting Website Input Attributes We identified three input attributes of websites as potential indicators of similarity: website text sentences, HTML tags, and website file names.  To identify the text that renders on a given webpage, we used a custom “headless” browser adapted from the WatiN package for C#. We extracted text from all pages associated with a given website, then split the text into sentences using the OpenNLP sentence breaker for C#.  We extracted all HTML tags in the website’s HTML files, while noting how many times each tag occurs. We then constructed a compound tag from the tag name and its frequency.  For example, if the “<br>” tag occurs 12 times within the targeted HTML files, the extracted key would be “<br>12”. A sketch of this tag-counting step follows.
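The authors’ implementation is in C#; the following Python version of the compound-tag extraction is ours and purely illustrative.

    # Count each HTML tag and emit "tag+frequency" compound keys, so a
    # site with twelve <br> tags contributes the key "<br>12".
    from collections import Counter
    from html.parser import HTMLParser

    class TagCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def handle_starttag(self, tag, attrs):
            self.counts[tag] += 1

    def compound_tags(html):
        parser = TagCounter()
        parser.feed(html)
        return {f"<{t}>{n}" for t, n in parser.counts.items()}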

Finally, we examined the directory structure and file names for each website, since these can betray structural similarity even when the other content has changed. However, some subtleties must be accounted for during the extraction of this attribute. First, the directory structure is incorporated into the filename (e.g., admin/home.html). Second, since most websites include a home or main page with a standard name, such as index.htm, index.html, or Default.aspx, websites comprised of only one file would look deceptively alike even when they are in fact quite different. Consequently, we exclude this input attribute from consideration for such websites, as in the sketch below.
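As a hedged sketch (the names and structure here are ours, not the authors’ code), the file-structure attribute might be extracted as follows.

    # Collect directory-qualified file names (e.g. "admin/home.html") and
    # withhold the attribute for one-file sites whose only page has a
    # generic default name, as described above.
    import os

    DEFAULT_PAGES = {"index.htm", "index.html", "default.aspx"}

    def file_structure(site_root):
        paths = set()
        for dirpath, _, files in os.walk(site_root):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), site_root)
                paths.add(rel.replace(os.sep, "/"))
        if len(paths) == 1 and next(iter(paths)).lower() in DEFAULT_PAGES:
            return None  # exclude this attribute for such websites
        return paths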

Constructing Distance Matrices For each input attribute, we calculated the Jaccard distance between all pairs of websites. The Jaccard distance between two sets S and T is defined as 1 − J(S, T), where:

J(S, T) = |S ∩ T| / |S ∪ T|

Consider comparing website similarity by sentences. If website A has 50 sentences in the text of its web pages and website B has 40 sentences, and they have 35 in common, then the union contains 50 + 40 − 35 = 55 sentences and the Jaccard distance is 1 − J(A, B) = 1 − 35/55 ≈ 0.36.
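The distance computation itself is a one-liner per pair; the worked example above can be checked with a short sketch (ours, not the paper’s code).

    def jaccard_distance(s, t):
        # 1 - |S ∩ T| / |S ∪ T|; 0 means identical attribute sets.
        if not s and not t:
            return 0.0
        return 1.0 - len(s & t) / len(s | t)

    # Example from the text: 50 and 40 sentences with 35 shared,
    # so the union has 55 and the distance is 1 - 35/55 ≈ 0.36.
    a, b = set(range(50)), set(range(15, 55))
    assert abs(jaccard_distance(a, b) - (1 - 35 / 55)) < 1e-9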

Clustering Stage 1 We compute clusterings for each input attribute using a hierarchical clustering algorithm [11]. Instead of selecting a static height cutoff for the resulting dendrogram, we employ a dynamic tree cut using the method described in [15]. These individual clusterings are computed because, once we evaluate the clusters against ground truth, we may find that one of the individual clusterings works better. If we intend to incorporate all input attributes regardless, this step can be skipped.
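As an illustration of stage 1 only: the sketch below uses SciPy’s agglomerative clustering on a precomputed distance matrix. The paper’s dynamic tree cut [15] is implemented as an R package (dynamicTreeCut); the fixed cut height here is a simplifying assumption, not the authors’ procedure.

    # Agglomerative clustering over one attribute's Jaccard-distance matrix.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_attribute(dist_matrix, cut_height=0.5):
        condensed = squareform(np.asarray(dist_matrix), checks=False)
        tree = linkage(condensed, method="average")  # hierarchical merge tree
        # Stand-in for the dynamic tree cut: a fixed-height cut.
        return fcluster(tree, t=cut_height, criterion="distance")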

Creating a Consensus Matrix Combining orthogonal distance measures into a single measure is necessarily an information-lossy operation. A number of other consensus clustering methods have been proposed [2], [5], [7], [9], yet as we will demonstrate in the next section, these algorithms do not perform well when linking together replicated scam websites, often yielding less accurate results than clusterings based on individual input attributes.

Consequently, we have developed a simple and more accurate approach to combining the different distance matrices.  We define the pairwise distance between two websites a and b as the minimum distance across all input attributes.  The rationale for doing so is that a website may be very different across one measure but similar according to another.  Suppose a criminal manages to change the textual content of many sentences on a website, but uses the same underlying HTML code and file structure. Using the minimum distance ensures that these two websites are viewed as similar. Figure 2 demonstrates examples of both replicated website content and file structures. The highlighted text and file structures for each website displayed are nearly identical.


Fig. 2: Examples of replicated website content and file structures for the HYIP dataset.

One could also imagine circumstances in which the average or maximum distance among input attributes was more appropriate.  We calculate those measures too, mainly to demonstrate the superiority of the minimum approach.
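All three combination rules reduce to element-wise operations over the per-attribute distance matrices; a minimal sketch (ours) follows.

    # Consensus matrices: element-wise minimum (the paper's preferred rule),
    # plus the average and maximum variants computed for comparison.
    import numpy as np

    def consensus(distance_matrices, rule="min"):
        stack = np.stack([np.asarray(m) for m in distance_matrices])
        if rule == "min":
            return stack.min(axis=0)   # similar under ANY attribute => linked
        if rule == "mean":
            return stack.mean(axis=0)
        return stack.max(axis=0)       # requires similarity under ALL attributes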

Clustering Stage 2 We then cluster the websites a second time, based upon the consensus matrix. Once again, hierarchical clustering with a dynamic tree cut is used.

When labeled clusters are available for a sample of websites, the final step is to compare the consensus clustering following stage 2 to the individual clusterings based on single input attributes.  The more accurate method is selected for subsequent use.

IV. EVALUATION AGAINST GROUND-TRUTH DATA

One of the fundamental challenges to clustering logical copies of criminal websites is the lack of ground-truth data for evaluating the accuracy of automated methods. Some researchers have relied on expert judgment to assess similarity, but most forego any systematic evaluation due to a lack of ground truth (e.g., [17]). We now describe a method for constructing ground truth datasets for samples of fake-escrow services and high-yield investment programs.

We developed a software tool to expedite the evaluation process. This tool enabled pairwise comparison of website screenshots and input attributes (i.e., website text sentences, HTML tag sequences and file structure) by an evaluator.

A. Performing Manual Ground Truth Clusterings

After the individual clusterings were calculated for each input attribute, we sorted the websites to identify manual-clustering candidates: websites placed in exactly the same clusters by each individual input attribute’s automated clustering.  Populations of websites placed into the same clusters for all three input attributes were used as a starting point in identifying the manual ground-truth clusterings.  These websites were then analyzed using the comparison tool in order to make a final assessment of whether each website belonged to a cluster. Multiple passes through the website populations were performed in order to place them into the correct manual ground-truth clusters. When websites were identified that did not belong in their originally assigned cluster, these sites were placed into the unassigned website population for further review and other potential clustering opportunities.

Deciding when to group together similar websites into the same cluster is inherently subjective. We adopted a broad definition of similarity, in which sites were grouped together if they shared most, but not all of their input attributes in common. Furthermore, the similarity threshold only had to be met for one input attribute. For instance, HYIP websites are typically quite verbose. Many such websites contain 3 or 4 identical  paragraphs of text, along with perhaps one or two additional paragraphs of completely unique text. For the ground-truth evaluation, we deemed such websites to be in the same cluster. Likewise, fake-escrow service websites might appear visually identical in basic structure for most of the site.  However, a few of the websites assigned to the same cluster might contain extra web pages not present in the others.

We note that while our approach does rely on individual input attribute clusterings as a starting point for evaluation, we do not consider the final consensus clustering in the evaluation.  This is to maintain a degree of detachment from the consensus clustering method ultimately used on the datasets. We believe the manual clusterings identify a majority of clusters with greater than two members. Although the manual clusterings contain some clusters including only two members, manual clustering efforts were ended when no more clusters of greater than two members were being identified.

B. Results

In total, we manually clustered 687 of the 4,191 HYIP websites and 684 of the 1,216 fake-escrow websites. We computed an adjusted Rand index [24] to evaluate the consensus clustering method described in Section III against the ground-truth datasets constructed above.  We also evaluated other consensus clustering methods for comparison. The adjusted Rand index has a maximum of 1, where a score of 1 indicates a perfect match between distinct clusterings.
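For the comparison itself, any adjusted Rand implementation will do; the sketch below uses scikit-learn (our library choice, since the paper only cites Rand [24]).

    # Compare a candidate clustering against manual ground-truth labels.
    from sklearn.metrics import adjusted_rand_score

    ground_truth = [0, 0, 1, 1, 2, 2]  # toy manual cluster labels
    candidate = [0, 0, 1, 2, 2, 2]     # toy labels from a consensus clustering
    print(adjusted_rand_score(ground_truth, candidate))  # 1.0 = perfect match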

Table Ia shows the adjusted Rand index for both datasets for all combinations of input attributes and consensus clustering methods. The first three columns show the Rand index for each individual clustering. For instance, for fake-escrow services, clustering based on tags alone yielded a Rand index of 0.672. Thus, clustering based on tags alone is much more accurate than clustering by file structure or website sentences alone (Rand indices of 0.072 and 0.10, respectively).  When combining these input attributes, however, we see further improvement.  Clustering based on taking the minimum distance between websites according to HTML tags and sentences yields a Rand index of 0.9, while taking the minimum of all three input attributes yields an adjusted Rand index of 0.952. This combined score far exceeds the Rand indices for any of the other comparisons.

TABLE I: Evaluation of various consensus clustering methods against the ground-truth datasets.

(a) Adjusted Rand index for different consensus clusterings, varying the number of input attributes considered.

(b) Adjusted Rand index for different consensus clusterings.

Because cybercriminals act differently when creating logical copies of websites for different types of scams, the input attributes that are most similar can change. For example, for HYIPs, clustering by website sentences yields the most accurate Rand index, instead of HTML tags as is the case for fake-escrow services. We can also see that for some scams, combining input attributes does not yield a more accurate clustering: for HYIPs, clustering based on the minimum distance of all three attributes yields a Rand index of 0.301, far worse than clustering based on website sentences alone.  This underscores the importance of evaluating the individual distance scores against the combined scores, since in some circumstances an individual input attribute, or a combination of a subset of the attributes, may fare better.

We used several general-purpose consensus clustering methods from R’s Clue package [10] as benchmarks against our “best minimum” approach:

  1. “SE” – Implements “a fixed-point algorithm for obtaining soft least squares Euclidean consensus partitions” by minimizing using Euclidean dissimilarity [5], [10].
  2. “DWH” – Uses an extension of the greedy algorithm to implement soft least squares Euclidean consensus partitions [5], [10].
  3. “GV3” – Utilizes an SUMT algorithm, which is equivalent to finding the membership matrix m for which the sum of the squared differences between the co-membership matrix C(m) = m m′ and the weighted average co-membership matrix Σ_b w_b C(m_b) of the partitions is minimal [9], [10].
  4. “soft/symdiff” – Given a maximal number of classes, uses an SUMT approach to minimize using Manhattan dissimilarity of the co-membership matrices coinciding with symdiff partition dissimilarity in the case of hard partitions [7], [10].

Table Ib summarizes the best-performing measures for different consensus clustering approaches. We can see that our “best minimum” approach performs best. It yields more accurate results than other general-purpose consensus clustering methods, as well as the custom clustering method used to group spam-advertised websites by the authors of [18].

V. EXAMINING THE CLUSTERED CRIMINAL WEBSITES

We now apply the best-performing clustering methods identified in the prior section to the entire fake-escrow and HYIP datasets. The 4,191 HYIP websites formed 864 clusters of at least size two, plus an additional 530 singletons. The 1,216 fake-escrow websites observed between January and June 2013 formed 161 clusters of at least size two, plus seven singletons.

A. Evaluating Cluster Size

We first study the distribution of cluster size in the two datasets. Figure 3a plots a CDF of the cluster size (note the logarithmic scale on the x-axis). We can see from the blue dashed line that the HYIPs tend to have smaller clusters. In addition to the 530 singletons (40% of the total clusters), 662 clusters (47% of the total) include between 2 and 5 websites. 175 clusters (13%) are sized between 6 and 10 websites, with 27 clusters including more than 10 websites. The biggest cluster included 20 HYIP websites. These results indicate that duplication in HYIPs, while frequent, does not occur on the same scale as many other forms of cybercrime.

Fig. 3: Evaluating the distribution of cluster size in the escrow-fraud and HYIP datasets. (a) Cumulative distribution function of cluster size.

There is more overt copying in the escrow-fraud dataset.  Only 7 of the 1,216 escrow websites could not be clustered with another website. 80 clusters (28% of the total) include between 2 and 5 websites, but another 79 clusters are sized between 6 and 20. Furthermore, two large clusters (including 113 and 109 websites, respectively) can be found. We conclude that duplication is used more often as a criminal tactic in the fake-escrow websites than for the HYIPs.

Another way to look at the distribution of cluster sizes is to examine the rank-order plot in Figure 3b. Again, we can observe differences in the structure of the two datasets. Rank-order plots sort the clusters by size and show the percentage of websites that are covered by the smallest number of clusters.  For instance, we can see from the red solid line the effect of the two large clusters in the escrow-fraud dataset: these two clusters account for nearly 20% of the total escrow-fraud websites. After that, the next-biggest clusters make a much smaller contribution in identifying more websites. Nonetheless, the incremental contributions of the HYIP clusters (shown in the dashed blue line) are also quite small. This relative dispersion of clusters differs from the concentration found in other cybercrime datasets where there is large-scale replication of content.

Fig. 3(b): Rank-order plot of cluster sizes.

B. Evaluating Cluster Persistence

We now study how frequently the replicated criminal websites are re-used over time. One strategy available to criminals is to create multiple copies of the website in parallel, thereby reaching more victims more quickly. The alternative is to reuse copies in a serial fashion, introducing new copies only after time has passed or the prior instances have collapsed. We investigate both datasets to empirically answer the question of which strategy is preferred.

Fig. 4: Top 10 largest clusters in the fake-escrow dataset, by the date the websites are identified.

Figure 4 groups the 10 largest clusters from the fake-escrow dataset and plots the date at which each website in the cluster first appears. We can see that for the two largest clusters there are spikes where multiple website copies are spawned on the same day. For the smaller clusters, however, we see that websites are introduced sequentially. Moreover, for all of the biggest clusters, new copies are introduced throughout the observation period. From this we can conclude that criminals are likely to use the same template repeatedly until stopped.

Next, we examine the observed persistence of the clusters.  We define the “lifetime” of a cluster as the difference in days between the first and last appearance of a website in the cluster. For instance, the first-reported website in one cluster of 18 fake-escrow websites appeared on February 2, 2013, while the last occurred on May 7, 2013; hence, the lifetime of that cluster is 94 days. Longer-lived clusters indicate that cybercriminals can create website copies for long periods of time with impunity.

We use a survival probability plot to examine the distribution of cluster lifetimes. A survival function S(t) measures the probability that a cluster’s lifetime is greater than time t.  Survival analysis takes into account “censored” data points, i.e., when the final website in the cluster is reported near the end of the study. We deem any cluster with a website reported within 14 days of the end of data collection to be censored.  We use the Kaplan-Meier estimator [13] to calculate a survival function.
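A Kaplan-Meier fit with right-censoring is only a few lines in the lifelines package (our choice of tooling; the paper cites only the estimator itself [13]).

    # Survival function over cluster lifetimes, marking clusters whose last
    # website appeared within 14 days of the end of collection as censored.
    from lifelines import KaplanMeierFitter

    lifetimes = [94, 10, 150, 37, 61]            # toy lifetimes in days
    observed = [True, True, False, True, True]   # False = censored cluster

    kmf = KaplanMeierFitter()
    kmf.fit(lifetimes, event_observed=observed)
    print(kmf.survival_function_)  # P(lifetime > t) at each observed t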

Fig. 5: Survival probability of fake-escrow clusters (left) and HYIP clusters (right).

Figure 5 gives the survival plots for both datasets (solid lines indicate the survival probability, while dashed lines indicate 95% confidence intervals). In the left graph, we can see that around 75% of fake-escrow clusters persist for at least 60 days, and that the median lifetime is 90 days. Note that around 25% of clusters remained active at the end of the 150-day measurement period, so we cannot reason about how long these most-persistent clusters will remain.

Because we tracked HYIPs for a much longer period (Figure 5 (right)), nearly all clusters eventually ceased to be replicated. Consequently, the survival probability for even long-lived clusters can be evaluated. 20% of HYIP clusters persist for more than 500 days, while 25% do not last longer than 100 days. The median lifetime of HYIP clusters is around 250 days. The relatively long persistence of many HYIP clusters should give law enforcement some encouragement: because the criminals reuse content over long periods, tracking them down becomes a more realistic proposition.

VI. RELATED WORK

A number of researchers have applied machine learning methods to cluster websites created by cybercriminals. Wardman et al. examined the file structure and content of suspected phishing webpages to automatically classify reported URLs as phishing [27]. Layton et al. cluster phishing webpages together using a combination of k-means and agglomerative clustering [16].  Several researchers have classified and clustered web spam pages. Urvoy et al. use HTML structure to classify web pages, and they develop a clustering method using locality-sensitive hashing to cluster similar spam pages together [25]. Lin uses HTML tag multisets to classify cloaked webpages [19]; Lin’s technique is used by Wang et al. [26] to detect when the cached HTML is very different from what is presented to the user. Anderson et al. use image shingling to cluster screenshots of websites advertised in email spam [4]. Similarly, Levchenko et al. use a custom clustering heuristic to group similar spam-advertised web pages [18]; we implemented and evaluated this clustering method on the cybercrime datasets in Section IV. Finally, Leontiadis et al. group similar unlicensed online pharmacy inventories [17]. They did not attempt to evaluate against ground truth; instead they used Jaccard distance and agglomerative clustering to find suitable clusters.

Separate from the work on cybercriminal datasets, other researchers have proposed consensus clustering methods for different applications. DISTATIS is an adaptation of the STATIS methodology used specifically to integrate distance matrices for different input attributes [3]. DISTATIS can be considered a three-way extension of metric multidimensional scaling [14], which transforms a collection of distance matrices into cross-product matrices used in the cross-product approach to STATIS. Consensus can be performed between two or more distance matrices by using DISTATIS and then converting the cross-product matrix output into a (squared) Euclidean distance matrix, which is the inverse transformation of metric multidimensional scaling [1].

Our work follows in the line of both of the above research thrusts. It differs in that it considers multiple attributes that an attacker may change (site content, HTML structure and file structure), even when she may not modify all attributes. It is also tolerant of greater changes by the cybercriminal than previous approaches. At the same time, though, it is more specific than general consensus clustering methods, which enables the method to achieve higher accuracy in cluster labelings.

VII. CONCLUDING REMARKS

When designing scams, cybercriminals face trade-offs between scale and victim susceptibility, and between scale and evasiveness from law enforcement. Large-scale scams cast a wider net, but this comes at the expense of lower victim yield and faster defender response. Highly targeted attacks are much more likely to work, but they are more expensive to craft.  Some frauds lie in the middle, where the criminals replicate scams while taking care to give the appearance that each attack is distinct.

In this paper, we propose and evaluate a consensus clustering method to automatically link together such semi-automated scams. We have shown it to be more accurate than general-purpose consensus clustering approaches, as well as approaches designed for large-scale scams such as phishing that use more extensive copying of content. In particular, we applied the method to two classes of scams: HYIPs and fake-escrow websites.

The method could prove valuable to law enforcement, as it helps tackle cybercrimes that individually are too minor to investigate but collectively may cross a threshold of significance.  For instance, our method identifies two distinct clusters of more than 100 fake escrow websites each. Furthermore, our method could substantially reduce the workload for investigators as they prioritize which criminals to investigate.

There are several promising avenues of future work we would like to pursue. First, the accuracy of the HYIP clustering could be improved. Second, it would be interesting to compare the accuracy of the consensus clustering method to other areas where clustering has already been tried, such as in the identification of phishing websites and spam-advertised storefronts. Finally, additional input attributes such as WHOIS registration details and screenshots could be considered.

ACKNOWLEDGMENTS

We would like to thank the operators of escrow-fraud.com and aa419.org for allowing us to use their lists of fake-escrow websites.

This work was partially funded by the Department of Homeland Security (DHS) Science and Technology Directorate, Cyber Security Division (DHS S&T/CSD) Broad Agency Announcement 11.02, the Government of Australia and SPAWAR Systems Center Pacific via contract number N66001-13-C-0131. This paper represents the position of the authors and not that of the aforementioned agencies.

REFERENCES

  1.  H. Abdi. Encyclopedia of Measurement and Statistics, pages 598–605. SAGE Publications, Inc., 2007.
  2. H. Abdi, A. O’Toole, D. Valentin, and B. Edelman. Distatis: The analysis of multiple distance matrices. In Computer Vision and Pattern Recognition – Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, pages 42–42, June 2005.
  3. H. Abdi, L. J. Williams, D. Valentin, and M. Bennani-Dosse. Statis and distatis: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2):124–167, 2012.
  4. D. S. Anderson, C. Fleizach, S. Savage, and G. M. Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the 16th USENIX Security Symposium, pages 10:1–10:14, Berkeley, CA, USA, 2007. USENIX Association.
  5. E. Dimitriadou, A. Weingessel, and K. Hornik. A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16(07):901–912, 2002.
  6. R. Dingledine, N. Mathewson, and P. Syverson. Tor: The second-generation onion router. In 13th USENIX Security Symposium, Aug. 2004.
  7. A. V. Fiacco and G. P. McCormick. Nonlinear programming: sequential unconstrained minimization techniques. Number 4. Siam, 1990.
  8. D. Florencio and C. Herley. Evaluating a trial deployment of password re-use for phishing prevention. In Second APWG eCrime Researchers Summit, eCrime ’07, pages 26–36, New York, NY, USA, 2007. ACM.
  9.  A. Gordon and M. Vichi. Fuzzy partition models for fitting a set of partitions. Psychometrika, 66(2):229–247, 2001.
  10. K. Hornik. A CLUE for cluster ensembles. Journal of Statistical Software, 14(12), 2005.
  11. S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
  12. C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. Voelker, V. Paxson, and S. Savage. Spamalytics: An empirical analysis of spam marketing conversion. In Conference on Computer and Communications Security (CCS), Alexandria, VA, Oct. 2008.
  13. E. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53:457–481, 1958.
  14. S. Krolak-Schwerdt. Three-way multidimensional scaling: Formal properties and relationships between scaling methods. In D. Baier, R. Decker, and L. Schmidt-Thieme, editors, Data Analysis and Decision Support, Studies in Classification, Data Analysis, and Knowledge Organization, pages 82–90. Springer Berlin Heidelberg, 2005.
  15. P. Langfelder, B. Zhang, and S. Horvath. Defining clusters from a hierarchical cluster tree. Bioinformatics, 24(5):719–720, Mar. 2008.
  16. R. Layton, P. Watters, and R. Dazeley. Automatically determining phishing campaigns using the uscap methodology. In eCrime Researchers Summit (eCrime), 2010, pages 1–8, Oct 2010.
  17. N. Leontiadis, T. Moore, and N. Christin. Pick your poison: pricing and inventories at unlicensed online pharmacies. In Proceedings of the fourteenth ACM conference on Electronic commerce, pages 621–638. ACM, 2013.
  18. K. Levchenko, A. Pitsillidis, N. Chachra, B. Enright, M. Félegyházi, C. Grier, T. Halvorson, C. Kanich, C. Kreibich, H. Liu, D. McCoy, N. Weaver, V. Paxson, G. M. Voelker, and S. Savage. Click trajectories: End-to-end analysis of the spam value chain. In Proceedings of the 2011 IEEE Symposium on Security and Privacy, SP ’11, pages 431–446, Washington, DC, USA, 2011. IEEE Computer Society.
  19. J.-L. Lin. Detection of cloaked web spam by using tag-based methods. Expert Syst. Appl., 36(4):7493–7499, May 2009. Available at http://dx.doi.org/10.1016/j.eswa.2008.09.056.
  20. T. Moore and R. Clayton. Examining the impact of website take-down on phishing. In Second APWG eCrime Researchers Summit, eCrime ’07, Pittsburgh, PA, Oct. 2007. ACM.
  21. T. Moore and R. Clayton. The Impact of Incentives on Notice and Take-down, pages 199–223. Springer, 2008.
  22. T. Moore, J. Han, and R. Clayton. The postmodern Ponzi scheme: Empirical analysis of high-yield investment programs. In A. D. Keromytis, editor, Financial Cryptography, volume 7397 of Lecture Notes in Computer Science, pages 41–56. Springer, 2012.
  23. N. Provos, P. Mavrommatis, M. Rajab, and F. Monrose. All your iFrames point to us. In 17th USENIX Security Symposium, Aug. 2008.
  24. W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
  25. T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):3:1–3:28, Mar. 2008.
  26. D. Y. Wang, S. Savage, and G. M. Voelker. Cloak and dagger: dynamics of web search cloaking. In Proceedings of the 18th ACM conference on Computer and communications security, CCS ’11, pages 477–490, New York, NY, USA, 2011. ACM. Available at http://doi.acm.org/10.1145/2046707.2046763.
  27. B. Wardman and G. Warner. Automating phishing website identification through deep MD5 matching. In eCrime Researchers Summit, 2008, pages 1–7. IEEE, 2008.

Resources