Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data

Kou, Gang; Lou, Chunwei

doi:10.1007/s10479-010-0704-3

Published: 14 February 2010

Annals of Operations Research, Volume 197, pages 123–134 (2012)

Abstract

The development of the World Wide Web and advances in digital data collection and storage technologies over the last two decades allow companies and organizations to store and share huge numbers of electronic documents. Organizing, analyzing, and presenting these documents manually is hard and inefficient. Search engines help users find relevant information by presenting a list of web pages in response to queries. Helping users efficiently find the most relevant web pages in vast text collections is a major challenge. The purpose of this study is to propose a hierarchical clustering method that combines multiple factors to identify clusters of web pages that can satisfy users’ information needs. The clusters are primarily envisioned for search and navigation, and potentially for some form of visualization as well. An experiment on clickstream data from a professional search engine was conducted, and the results show that the clustering method is effective and efficient in terms of both objective and subjective measures.

Author information

Authors and Affiliations

School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, 610054, P.R. China
Gang Kou & Chunwei Lou

Corresponding author

Correspondence to Gang Kou.

About this article

Cite this article

Kou, G., Lou, C. Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data. Ann Oper Res 197, 123–134 (2012). https://doi.org/10.1007/s10479-010-0704-3

Published: 14 February 2010
Issue Date: August 2012
DOI: https://doi.org/10.1007/s10479-010-0704-3
