README.md
2025
Document Similarity and Clustering Toolkit
Data Mining
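A Python toolkit for document clustering and similarity analysis. It combines shingling, cosine similarity, and locality-sensitive hashing (LSH) for similarity search, k-means for clustering, and Local Outlier Factor (LOF) for outlier detection.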
Language: Python
Libraries: NumPy, PySpark, scikit-learn
Platform: Spark
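
To illustrate how the techniques named above fit together, here is a minimal single-machine sketch using scikit-learn. It is not the toolkit's own code: the helper names `shingles` and `jaccard`, the sample `docs`, and all parameter values (shingle length 5, 3 clusters, 2 neighbors) are assumptions chosen for illustration. LSH and the PySpark side are left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import LocalOutlierFactor


def shingles(text, k=5):
    """Return the set of character k-shingles of a lowercased,
    whitespace-normalized text (hypothetical helper)."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up sample documents: two near-duplicate pairs and one unrelated text.
docs = [
    "spark runs distributed jobs on a cluster",
    "spark runs distributed tasks on a cluster",
    "k-means groups similar documents together",
    "k-means groups similar texts together",
    "my cat sleeps on the warm windowsill",
]

# Set-based similarity on shingles.
near_duplicate = jaccard(shingles(docs[0]), shingles(docs[1]))
unrelated = jaccard(shingles(docs[0]), shingles(docs[2]))

# Character 5-gram TF-IDF vectors (rows are L2-normalized by default).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(5, 5))
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarity as a dense (n_docs, n_docs) matrix.
sim = cosine_similarity(X)

# k-means on the TF-IDF vectors; the cluster count is an assumption.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# LOF with cosine distance; fit_predict returns -1 for outliers, 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=2, metric="cosine")
outlier_flags = lof.fit_predict(X.toarray())

print("Jaccard (near-duplicate vs. unrelated):", near_duplicate, unrelated)
print("Cosine similarity matrix:\n", np.round(sim, 2))
print("Cluster labels:", labels)
print("LOF flags:", outlier_flags)
```

Character shingles are used for both the Jaccard and TF-IDF steps so the two similarity measures work from the same features. On a real corpus the shingle length, cluster count, and neighbor count would need tuning.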