Author: Jake Drew

Combining over 15 years of Fortune 100 experience in banking analytics and revenue optimization with cutting-edge Computer Science techniques, Jake Drew is producing innovative results in the fields of Bioinformatics and Cybercrime Economics. He is currently a Ph.D. student and research assistant at Southern Methodist University under the supervision of Professors Tyler Moore and Michael Hahsler. He has also recently acted as an expert witness, expert assistant, and technical consultant in over 20 intellectual-property cases. He holds a Master of Computer Science from Southern Methodist University and a B.S. in Computer Information Systems from The University of Texas at Tyler. Previously, he worked as a Vice President in Revenue Optimization at Bank of America, spending almost 15 years in the banking industry.

6 thoughts on this post

  1. Hello Jake,
    I need your help:

    I have a dataset of text documents. I need to extract features from the documents and then apply LSH to those features so that similar documents fall into the same bucket and dissimilar documents into different buckets, building a bucket index along the way. Each bucket would be indexed on the features shared by the similar documents it contains.
    I then need to search for a particular word across the document set.

    How can LSH be applied to build an index over a text document set that minimizes the search space?
    In other words, how is this done in practice?

    For example, say the dataset contains 100 text documents.

    Please help me solve this.

    Thanks…

    • It sounds like what you are looking for is a term/document frequency matrix to use for calculating term frequency / inverse document frequency. If you only have hundreds of documents, then locality sensitive hashing is not required. However, if you are using a very large number of documents, then you can use LSH and construct a matrix where minhash values take the place of words. You will need to track minhash value frequencies across documents. However, since you are using multiple hashing functions to create the minhash values, you would need multiple matrices (one matrix to track minhash value frequencies across documents for each hash function used).

      Another way to think of this would be as a three-dimensional matrix whose third dimension contains a minhash frequency vector for each hash function used during minhashing across all documents. The reason you must partition these frequencies by hash function is that the exact same minhash value under two different hashing functions will map to two totally different words.
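      Purely as an illustration of that layout, here is a minimal Python sketch under made-up assumptions (the hash-function parameters, documents, and tokenization are all hypothetical placeholders): one frequency matrix per hash function, keyed by minhash value and document.

      ```python
      import re
      import zlib
      from collections import defaultdict

      # Hypothetical hash-function family: h_i(x) = (a_i * x + b_i) mod P.
      P = 2_147_483_647
      HASH_PARAMS = [(2, 7), (13, 101), (31, 997)]   # one (a, b) pair per hash function

      def minhash(tokens, a, b):
          """Minimum hash value of a token set under h(x) = (a*x + b) mod P."""
          return min((a * zlib.crc32(t.encode()) + b) % P for t in tokens)

      docs = {
          "doc1": "the quick brown fox jumps over the lazy dog",
          "doc2": "the lazy dog sleeps all day",
      }

      # freq[hash_fn][minhash_value][doc_id] -> count. One matrix per hash function,
      # because the same minhash value under two different hash functions
      # corresponds to two entirely different words.
      freq = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

      for doc_id, text in docs.items():
          tokens = set(re.findall(r"\w+", text.lower()))
          for i, (a, b) in enumerate(HASH_PARAMS):
              freq[i][minhash(tokens, a, b)][doc_id] += 1

      # Documents sharing a minhash value under the same hash function become
      # candidates for the same LSH bucket.
      for i in range(len(HASH_PARAMS)):
          for value, by_doc in freq[i].items():
              print(f"hash fn {i}, minhash {value}: {dict(by_doc)}")
      ```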

      The problem you will run into with minhashing is that it is a form of lossy compression, so it will not work as well when searching for a single word. For example, if you minhashed a 1,000-word document word by word into a minhash signature of 100 values, only about 10% of that document's words would be represented. You can make it work, but success really depends on your requirements, the content you are hashing, and how well you design the minhash system.
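      To make the lossy-compression point concrete, here is a small sketch (the document, vocabulary size, and signature length are all invented for the example) that measures what fraction of a document's distinct words actually survive into a word-level minhash signature; a single-word search can only hit words that survive.

      ```python
      import random
      import zlib

      P = 2_147_483_647
      random.seed(42)
      # 100 hypothetical hash functions -> a 100-value minhash signature.
      hash_params = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(100)]

      # A made-up 1,000-word document drawn from a 5,000-word vocabulary.
      vocabulary = [f"word{i}" for i in range(5000)]
      document = random.choices(vocabulary, k=1000)
      distinct_words = set(document)

      # Word-level minhash signature: each hash function keeps only the word
      # that minimizes it, so at most 100 distinct words survive.
      surviving = {
          min(distinct_words, key=lambda w: (a * zlib.crc32(w.encode()) + b) % P)
          for a, b in hash_params
      }

      coverage = len(surviving) / len(distinct_words)
      print(f"{len(surviving)} of {len(distinct_words)} distinct words survive "
            f"({coverage:.1%}); a search for any other word cannot match.")
      ```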

      Hope this helps!

      • Hi Jake!

        Warm Greetings! Thank you for sharing your wealth of knowledge.

        Please, I need your help. I would like to use the LSH technique in Java for document similarity detection.
        I have a dataset (a text document) and need to compute its similarity index against another set of documents (at least 1 TB in size; perhaps I can extract them from the web).

        Thanks and regards,
        Afolabi
