Approaches to Cluster Analysis



Introduction to Cluster Analysis

Introduction

One of the most fundamental capacities of living things is the ability to classify objects by grouping them together. Early humans, for example, must have realized that many distinct objects shared certain traits, such as being edible, toxic, or ferocious. Classifying comparable things into categories is therefore clearly a primordial concept.

The creation of language, which consists of words that help us recognize and describe the various categories of events, objects, and people we experience, requires classification in its broadest sense. Each noun in a language is essentially a label used to designate a collection of items that share distinguishing characteristics. Animals, for example, are named cats, dogs, horses, and so on, and each such name groups individuals together. The terms "naming" and "classifying" are nearly interchangeable.

As well as being a basic human conceptual activity, classification is crucial to most disciplines of research. In biology, for example, the classification of species has been a focus from the beginning of scientific research. Aristotle devised a complex method for classifying animal species, beginning with the division of animals into two groups: those with red blood (roughly equivalent to what we now call vertebrates) and those without (the invertebrates). He further subdivided these two groups according to how the young are produced, such as whether they are born alive, in eggs, as pupae, and so forth.

Theophrastos, following Aristotle, published the first fundamental explanations of plant structure and taxonomy. The publications that resulted were so well-documented, profound, and all-encompassing in scope that they served as the foundation for biological inquiry for generations. They were only overtaken in the 17th and 18th centuries, when famous European explorers, by opening the rest of the world to curious travelers, provided the impetus for a second, similar research and collection program, led by Swedish naturalist Linnaeus.

Taxonomy is the term used in biology for the theory and practice of classifying organisms. Initially, taxonomy was more of an art than a scientific method, but less subjective techniques were eventually developed, largely by Adanson (1727–1806), who is credited by Sokal and Sneath (1963) with introducing the polythetic type of system into biology, in which classifications are based on many characteristics of the objects being studied rather than on a single characteristic.

Animal and plant classification has undoubtedly played a significant role in biology and zoology, particularly as a foundation for Darwin's theory of evolution. However, classification has played an important part in the formation of theories in a variety of domains. For example, Mendeleyev's classification of the elements in the periodic table, which he created in the 1860s, has had a significant impact on our understanding of the structure of the atom. In astronomy, the use of the Hertzsprung–Russell plot of temperature against luminosity to classify stars into dwarf and giant stars has had a significant impact on ideas of stellar evolution.

Cluster analysis techniques are concerned with examining large data sets to see whether they can be meaningfully described in terms of a limited number of groups, or clusters, of items or individuals that resemble one another but differ in some respects from individuals in other clusters. Over the last four decades or so, a wide range of clustering approaches has been developed.

Cluster analysis can help answer common research questions such as the following:

  • Medicine - What are the diagnostic clusters? To address this question, the researcher would create a diagnostic questionnaire that covers possible symptoms (for example, in psychology, anxiety, depression, and so on). Cluster analysis can then be used to find groups of patients with similar symptoms.
  • Marketing - What are the customer segments? To answer this question, a market researcher may survey customers' requirements, views, demographics, and behavior. The researcher can then use cluster analysis to find homogeneous groups of customers with similar requirements and attitudes.
  • Education - Which groups of students need special attention? Researchers can measure psychological, aptitude, and achievement characteristics. A cluster analysis can then be used to identify homogeneous groups of students (for example, high achievers in all subjects, or students who excel in certain subjects but fail in others).
  • Biology - What is the taxonomy of species? Researchers can compile a data set of different plants and record various phenotypic characteristics. By grouping those observations into a series of clusters, a cluster analysis can help develop a taxonomy of groups and subgroups of similar plants.

Approaches to Cluster Analysis

A cluster analysis can be carried out using a variety of methods, which fall into the following categories:

Method 1: Hierarchical methods

  • Agglomerative methods, in which each subject begins in its own cluster. The two 'closest' (most similar) clusters are then joined, and the process is repeated until all of the subjects are in a single cluster. Finally, the best number of clusters is chosen from among all the cluster solutions.
  • Divisive methods, in which all subjects begin in the same cluster and the above strategy is applied in reverse until each subject is in its own cluster. Because agglomerative methods are used more often than divisive ones, this post focuses on the former.
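
The agglomerative procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production implementation: the function name `agglomerate`, the toy `points`, and the choice to measure 'closeness' by the distance between the two nearest members are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Repeatedly merge the two closest clusters and record every merge.

    'Closest' is taken here to mean the smallest distance between any two
    members of the clusters; this is an assumption made for the sketch.
    """
    clusters = [[i] for i in range(len(points))]  # each subject starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters separated by the smallest distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        history.append(merged)
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


# Hypothetical toy data: two tight pairs and one distant subject.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
history = agglomerate(points)
for step, merged in enumerate(history, start=1):
    print(step, merged)
```

Each entry in `history` is the cluster formed at that step, so choosing the 'best' number of clusters amounts to deciding how far along this list of merges to stop.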

Hierarchical agglomerative methods

Within this approach to cluster analysis, there are several ways to decide which clusters should be joined at each stage. The most important methods are listed below.

  • Nearest neighbour method (single linkage method) - The distance between two clusters is defined as the distance between their two nearest members, or neighbours. This method is relatively simple, but it is frequently criticized because it ignores cluster structure and can lead to a problem known as chaining, in which clusters become long and straggly. However, it performs better than the other methods when the natural clusters are not spherical or elliptical in shape.
  • Furthest neighbour method (complete linkage method) - The distance between two clusters is defined as the greatest distance between members, i.e. the distance between the two subjects that are farthest apart. This method tends to produce compact clusters of similar size but, like the nearest neighbour method, it ignores cluster structure. It is also highly sensitive to outliers.
  • Average (between-groups) linkage method (sometimes referred to as UPGMA) - The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters. This is generally considered a fairly robust method.
  • Centroid method - The centroid (the mean value of each variable) of each cluster is calculated, and the distance between centroids is used as the distance between clusters. The clusters whose centroids are closest together are merged. This method is also fairly robust.
  • Ward’s method - Every possible pair of clusters is provisionally combined, and the sum of the squared distances within each cluster is calculated and then summed over all clusters. The combination that gives the smallest total sum of squares is chosen. This method tends to produce clusters of roughly equal size, which is not always desirable. It is also highly sensitive to outliers. Despite this, it is one of the most widely used methods, along with the average linkage method.
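
To make the differences between these linkage rules concrete, here is a small Python sketch that computes each distance for the same pair of clusters. The function names, the toy clusters `a`, `b`, and `c`, and the use of Euclidean distance are assumptions chosen for illustration; statistical packages provide their own, more efficient implementations.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between the two nearest members."""
    return min(dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the two members that are farthest apart."""
    return max(dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Average distance over all between-cluster pairs."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Mean value of each variable in the cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))


def within_ss(cluster):
    """Sum of squared distances from each member to the cluster centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)


def ward_total(clusters, i, j):
    """Total within-cluster sum of squares if clusters i and j were merged."""
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return within_ss(clusters[i] + clusters[j]) + sum(within_ss(c) for c in rest)


# Hypothetical toy clusters in two dimensions.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
c = [(10.0, 0.0)]

print("single  ", single_linkage(a, b))
print("complete", complete_linkage(a, b))
print("average ", average_linkage(a, b))
print("centroid", centroid_linkage(a, b))

# Ward's rule: try every possible merge and keep the smallest total.
clusters = [a, b, c]
best_pair = min(
    combinations(range(len(clusters)), 2),
    key=lambda ij: ward_total(clusters, *ij),
)
print("Ward merges clusters", best_pair)
```

In this toy example every rule agrees that `a` and `b` belong together, but the numbers differ, and on less tidy data those differences are what produce chaining under single linkage or equal-sized clusters under Ward's method.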

Method 2: Non-hierarchical or k-means clustering methods

In these methods, the desired number of clusters is specified in advance and the 'best' solution is chosen. The steps in such a method are as follows:

    * Choose initial cluster centers (essentially a set of observations that are far apart; each of these subjects forms a cluster of one, and its center is the value of the variables for that subject).

    * Assign each subject to its 'nearest' cluster, defined in terms of the distance to the centroid.

    * Find the centroids of the clusters that have been formed.

    * Recalculate the distance from each subject to each centroid, and move any observations that are not in the cluster they are closest to.

    * Continue until the centroids remain relatively stable.
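
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: `far_apart_centers` uses a simple farthest-first heuristic as one way of picking observations that are far apart, the stopping rule is exact equality of the centroids between passes, and all names and data are hypothetical.

```python
from math import dist


def far_apart_centers(points, k):
    """Pick k mutually distant observations (farthest-first, an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the observation farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def mean(cluster):
    """Centroid: the mean value of each variable."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, k, max_iter=100):
    centers = far_apart_centers(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each subject to the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Recompute the centroids; an empty cluster keeps its old center.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:  # centroids are stable, so stop
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, 2)
print(labels)
print(centers)
```

Because subjects are reassigned on every pass, a subject can change cluster as the centroids move, which is the key difference from the hierarchical methods.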

Non-hierarchical cluster analysis tends to be used when dealing with very large data sets. It is sometimes preferred because it allows subjects to move from one cluster to another (this is not possible in hierarchical cluster analysis, where a subject, once assigned, cannot move to a different cluster).

One possible strategy is to use a hierarchical approach first to determine the number of clusters in the data, and then to use the cluster centers obtained from this as the initial cluster centers in the non-hierarchical method.
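
A rough sketch of this two-stage strategy is given below. Everything in it is an assumption made for illustration: the hierarchical stage uses average linkage and stops merging once the closest pair of clusters is farther apart than a hypothetical `stop_distance` (in practice the number of clusters is usually judged from the full merge history), and `refine` is a bare-bones k-means style pass.

```python
from itertools import combinations
from math import dist


def mean(cluster):
    """Centroid: the mean value of each variable."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def average_distance(a, b):
    """Average linkage distance between two clusters of points."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def hierarchical_centers(points, stop_distance):
    """Merge clusters until the closest pair is farther than stop_distance."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: average_distance(*ab))
        if average_distance(a, b) > stop_distance:
            break
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return [mean(c) for c in clusters]


def refine(points, centers, max_iter=100):
    """Reassign subjects and update centroids, starting from given centers."""
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        updated = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(mean(members) if members else centers[c])
        if updated == centers:  # centroids are stable, so stop
            break
        centers = updated
    return labels, centers


# Hypothetical toy data and threshold.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
start_centers = hierarchical_centers(points, stop_distance=3.0)
labels, centers = refine(points, start_centers)
print(len(start_centers), "clusters suggested by the hierarchical stage")
print(labels)
```

The hierarchical stage supplies both the number of clusters and sensible starting centers, and the non-hierarchical stage is then free to move subjects between clusters.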

    References:

    https://www.qualtrics.com/au/experience-management/research/cluster-analysis/

    https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/cluster-analysis

    https://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html

    https://www.statstutor.ac.uk/resources/uploaded/clusteranalysis.pdf




