A Two-Step Method for Clustering Mixed Categorical and Numeric Data

Ming-Yi Shih; Jar-Wen Jheng; Lien-Fu Lai

doi:10.6180/jase.2010.13.1.02

Ming-Yi Shih¹, Jar-Wen Jheng¹ and Lien-Fu Lai¹

¹Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua, Taiwan 500, R.O.C.

Received: January 8, 2010
Accepted: March 3, 2010
Publication Date: March 3, 2010

ABSTRACT

Various clustering algorithms have been developed to group data into clusters in diverse domains. However, these algorithms work effectively on either pure numeric data or pure categorical data; most of them perform poorly on mixed categorical and numeric data. In this paper, a new two-step clustering method is presented to find clusters in this kind of data. In this approach, the items in categorical attributes are first processed to construct the similarities or relationships among them based on the idea of co-occurrence; all categorical attributes can then be converted into numeric attributes based on these constructed relationships. Since all categorical data are thereby converted into numeric data, existing clustering algorithms can be applied to the dataset directly. Nevertheless, existing clustering algorithms each suffer from some disadvantages or weaknesses, so the proposed two-step method integrates hierarchical and partitioning clustering algorithms, adding attributes to the cluster objects. This method defines the relationships among items and mitigates the weaknesses of applying a single clustering algorithm. Experimental evidence shows that robust results can be achieved by applying this method to cluster mixed numeric and categorical data.

Keywords: Data Mining, Clustering, Mixed Attributes, Co-Occurrence
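
The co-occurrence idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity formula, so the Jaccard-style ratio used below, the function name `cooccurrence_similarity`, and the toy records are all assumptions made for illustration; they should not be read as the authors' definition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(records):
    """Jaccard-style co-occurrence similarity between categorical items.

    records: list of dicts mapping attribute name -> categorical value.
    An "item" is an (attribute, value) pair. Returns a dict keyed by a
    sorted pair of items, giving
        count(both) / (count(a) + count(b) - count(both)).
    This formula is an illustrative assumption, not taken from the paper.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for rec in records:
        items = sorted(rec.items())          # stable order for pair keys
        item_counts.update(items)            # how often each item occurs
        pair_counts.update(combinations(items, 2))  # items seen together
    return {
        (a, b): both / (item_counts[a] + item_counts[b] - both)
        for (a, b), both in pair_counts.items()
    }


# Hypothetical toy data: two categorical attributes, four records.
toy_records = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
toy_sims = cooccurrence_similarity(toy_records)
```

On this toy data, "red" and "round" appear together in two of the three records containing either item, so their similarity is 2/3, while "red" and "square" score 1/4. Items that co-occur more often would, under this assumed measure, be mapped to closer numeric values before a numeric clustering algorithm is applied.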