Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data

Annals of Operations Research

Abstract

Developments in the World Wide Web and advances in digital data collection and storage technologies during the last two decades allow companies and organizations to store and share huge amounts of electronic documents. It is hard and inefficient to manually organize, analyze and present these documents. Search engines help users find relevant information by presenting a list of web pages in response to queries. Helping users efficiently find the most relevant web pages in vast text collections is a major challenge. The purpose of this study is to propose a hierarchical clustering method that combines multiple factors to identify clusters of web pages that can satisfy users’ information needs. The clusters are primarily envisioned to be used for search and navigation and potentially for some form of visualization as well. An experiment on clickstream data from a professional search engine was conducted, and the results show that the clustering method is effective and efficient in terms of both objective and subjective measures.
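The abstract does not say which factors are combined or which linkage rule is used, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes two hypothetical factors per page (a set of content terms and the set of queries that led to clicks on the page), blends their Jaccard similarities with an assumed weight, and merges clusters bottom-up with average linkage until the best similarity falls below an assumed threshold. All names, weights, and sample data are invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(p, q, weight=0.5):
    """Blend two hypothetical factors: content terms and click-through queries."""
    content = jaccard(p["terms"], q["terms"])
    clicks = jaccard(p["queries"], q["queries"])
    return weight * content + (1 - weight) * clicks


def cluster_pages(pages, threshold=0.3, weight=0.5):
    """Average-linkage agglomerative clustering over the combined similarity.

    Merging stops once the most similar pair of clusters scores below
    `threshold` (an assumed stopping rule, not taken from the article).
    """
    clusters = [[name] for name in sorted(pages)]

    def linkage(c1, c2):
        sims = [combined_similarity(pages[a], pages[b], weight)
                for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Invented sample data: four pages with terms and click-through queries.
pages = {
    "a": {"terms": {"loan", "rate", "bank"}, "queries": {"mortgage rates"}},
    "b": {"terms": {"loan", "rate", "credit"},
          "queries": {"mortgage rates", "loan calculator"}},
    "c": {"terms": {"hotel", "flight"}, "queries": {"cheap flights"}},
    "d": {"terms": {"hotel", "flight", "visa"}, "queries": {"cheap flights"}},
}

print(cluster_pages(pages))  # [['a', 'b'], ['c', 'd']]
```

Raising the threshold yields finer clusters, while lowering it merges more aggressively; the weight controls how much the clickstream factor counts relative to page content. This pairwise search is quadratic per merge step, so a large-scale setting like the one the article targets would need a more efficient scheme.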



Author information

Corresponding author

Correspondence to Gang Kou.


About this article

Cite this article

Kou, G., Lou, C. Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data. Ann Oper Res 197, 123–134 (2012). https://doi.org/10.1007/s10479-010-0704-3


  • DOI: https://doi.org/10.1007/s10479-010-0704-3
