Privacy-Preserving Clustering of Data Streams

Ching-Ming Chao; Po-Zung Chen; Chu-Hao Sun

doi:10.6180/jase.2010.13.3.14

Privacy-Preserving Clustering of Data Streams

Computer Science and Information Engineering

Impact of variation on number of clusters.

Ching-Ming Chao¹ , Po-Zung Chen² and Chu-Hao Sun ²

¹Department of Computer Science and Information Management, Soochow University, Taipei, Taiwan 100, R.O.C.
²Department of Computer Science and Information Engineering, Tamkang University, Tamsui, Taiwan 251, R.O.C.

Received: July 3, 2009
Accepted: March 2, 2010
Publication Date: September 1, 2010

Download Citation: ||https://doi.org/10.6180/jase.2010.13.3.14

ABSTRACT

As most previous studies on privacy-preserving data mining placed specific importance on the security of massive amounts of data from a static database, consequently data undergoing privacy-preservation often leads to a decline in the accuracy of mining results. Furthermore, following by the rapid advancement of Internet and telecommunication technology, subsequently data types have transformed from traditional static data into data streams with consecutive, rapid, temporal, and unpredictable properties. Due to the increase of such data types, traditional privacy-preserving data mining algorithms requiring complex calculation are no longer applicable. As a result, this paper has proposed a method of Privacy-Preserving Clustering of Data Streams (PPCDS) to improve data stream mining procedures while concurrently preserving privacy with a high degree of mining accuracy. PPCDS is mainly composed of two phases: Rotation-Based Perturbation and cluster mining. In the phase of data rotating perturbation phase, a rotation transformation matrix is applied to rapidly perturb the data streams in order to preserve data privacy. In the cluster mining phase, perturbed data will first establish a micro-cluster through optimization of cluster centers, then applying statistical calculation to update a micro-cluster, as well as using geometric time frame to allocate and store a micro-cluster, and finally output mining result through a macro-cluster generation. Two simple data structure are added in the macro-cluster generation process to avoid recalculating the distance between the macro-point and the cluster center in the generation process. This process reduces the repeated calculation time in order to enhance mining efficiency without losing mining accuracy.

Keywords: Privacy-Preserving, Data Mining, Data Stream, Clustering

REFERENCES

[1]Cohen, E. and Strauss, M., “Maintaining Time Decaying Stream Aggregates,” Proceedings of the 22th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, California, U.S.A., pp. 223-233 (2003).
[2]Chang, J. H. and Lee, W. S., “Finding Recent Frequent Itemsets Adaptively over Online Data Stream,” Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., U.S.A., pp. 487-492 (2003).
[3]Agrawal, R. and Srikant, R., “Privacy-Preserving Data Mining,” Proceeding of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, U.S.A., pp. 439-450 (2000).
[4]Ketel, M. and Homailfar, A., “Privacy-Preserving Mining by Rotational Data Transformation,” Proceedings of the 43th Annual Southeast Regional Conference, Kennesaw, Georgia, pp. 233-236 (2005).
[5]Oliveira, R. M. and Zaїane., O. R., “Privacy Preserving Clustering by Data Transformation,” Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Brazil, pp. 304-318 (2003).
[6]Kumari, P. K., Raju, K. and Rao., S. S., “Privacy Preserving in Clustering Using Fuzzy Sets,” Proceedings of the 2006 International Conference on Data Mining, Las Vegas, Nevada, U.S.A., pp. 26-29 (2006).
[7]Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X. and Zhu, M. Y., “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations Newsletter, 4, No. 2, pp. 28-34 (2002).
[8]Vaidya, J. and Clifton, C., “Privacy-Preserving K-Means Clustering over Vertically Partitioned Data,” Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., U.S.A., pp. 206-215 (2003).
[9]Meregu, and Ghosh, J., “Privacy-Preserving Distributed Clustering Using Generative Models,” Proceedings of the 3th IEEE International Conference on Data Mining, Melbourne, Florida, U.S.A., pp. 211-218 (2003).
[10]Liu, L. and Thuraisingham, B., “The Applicability of the Perturbation Model-Based Privacy Preserving Data Mining for Real-World Data,” Proceedings of the 6th IEEE International Conference on Data Mining, Hong Kong, China, pp. 507-512 (2006).
[11]Chen, T. S., Lin, C. C., Chiu, Y. H. and Chen, R. C., “Combined Density-Based and Constraint-Based Algorithm for Clustering,” Journal of Donghua University, Vol. 23, No. 6, pp. 36-38 (2006).
[12]Hulten, G., Spencer, L. and Domingos, P., “Mining Time-Changing Data Streams,” Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, U.S.A., pp. 97-106 (2001).
[13]Domingos, P. and Hulten, G., “A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering,” Proceedings of the 18th International Conference on Machine Learning, Williamstown, Massachusetts, S.A., pp. 106-113 (2001).
[14]Domingos, P. and Hulten, G., “Mining High-Speed Data Streams,” Proceedings of the Association for Computing Machinery 6th International Conference on Knowledge Discovery and Data Mining, Boston, U.S.A., pp. 71-80 (2000).
[15]Ordonez, C., “Clustering Binary Data Streams with K-means,” Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, California, U.S.A., pp. 12-19 (2003).
[16]Aggarwal, C., Han, J., Wang, J. and Yu, P. S., “A Framework for Clustering Evolving Data Streams,” Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, pp. 81-92 (2003).
[17]Gaber, M. M., Krishnaswamy, S. and Zaslavsky, A., “On-Board Mining of Data Streams in Sensor Networks,” Springer, Berlin Heidelberg, Germany, pp. 307-335 (2005).
[18]Yang, C. and Zhou, J., “HClustream: a Novel Approach for Clustering Evolving Heterogeneous Data Stream,” Proceedings of the 6th IEEE International Conference on Data Mining, Hong Kong, China, pp. 682-688 (2006).
[19]Zhang,, Ramakrishnan, R. and Livny, M., “BIRCH: an Efficient Data Clustering Method for Very Large Databases,” Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal Canada, pp. 103-114 (1996).
[20]Farnstrom, F., Lewis, J. and Elkan, C., “Scalability for Clustering Algorithms Revisited,” ACM SIGKDD Explorations Newsletter, Vol. 2, No 1, pp. 51-57 (2000).