2025

Document Similarity and Clustering Toolkit

Data Mining

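A Python toolkit for document similarity analysis and clustering. It combines shingling, cosine similarity, and locality-sensitive hashing (LSH) for similarity search, k-means for clustering, and the local outlier factor (LOF) for outlier detection.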
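
As a rough illustration of two of the techniques named above, the sketch below computes word-level shingles with Jaccard similarity, and cosine similarity over term counts. It is not taken from the toolkit: the function names (shingles, jaccard, cosine_similarity) and the default shingle size k=3 are assumptions made for this example, and it uses only the Python standard library rather than NumPy, PySpark, or scikit-learn.

```python
import math
from collections import Counter


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps"
    doc_b = "the quick brown dog jumps"
    print(jaccard(shingles(doc_a), shingles(doc_b)))
    print(cosine_similarity(doc_a, doc_b))
```

On a large corpus, pairwise comparison like this becomes expensive, which is the usual motivation for approximating it with LSH.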

Language:

  • Python

Libraries:

  • NumPy

  • PySpark

  • scikit-learn

Platform:

  • Spark

README.md