Cluster validation

Francesco_le · April 8, 2020, 1:56pm

Hello,
I am trying to figure out whether R can help me with my work. I want to start by saying that I am new to this world: I only recently started writing commands and using these systems in general.

I had a starting file that contained this information:

NAME NAME_two column1 column2 column3 up to column10 and finally CLASS
Aa aae 5 3 4 3 3 0 5 1 2 4 YES
Ab and 11 3 5 6 4 5 5 2 3 2 NOT
Ac acd 9 4 4 2 7 5 5 3 6 1 NOT
Ad aaqff 0 2 0 1 0 2 1 1 0 YES
Ae ewg 1 0 2 1 1 0 4 1 0 0 NOT
Af wegv 10 5 9 5 6 0 3 2 3 7 NOT
Ag rwg 10 5 10 6 5 0 3 1 4 4 NOT
Ah wfq 1 0 2 0 1 0 2 1 1 0 NOT
Ai he 1 0 2 2 2 0 4 1 0 0 NOT
Al efgwa 0 0 1 0 1 0 1 0 1 0 NOT
Am h4h 0 0 3 1 1 0 1 0 1 0 NOT

So there are 10 columns with variable numbers (from 0 onwards) and at the end the name of a class (the classes are two: YES or NOT). The elements examined in this way are around 17,000.
With SOMbrero I have created clusters.
At this point, I would like to see whether these clusters make sense: whether they were done well, or whether the same clusters could even have arisen by chance.

So here I am with the question: can I do this type of analysis with R? Is there a way to assess the value of these clusters, and to understand which clustering worked better and which worse?
I saw that there is a clValid package that could be useful to me, in particular the BSI functions or the Davies-Bouldin index. But I didn't understand how I can use them in my case, and I don't know how to write this analysis in R. Above all, I don't know whether these analyses really do what I want.

Thanks for your attention, and to anyone who can help me.
Best regards
Francesco Coppola

technocrat · April 9, 2020, 6:39am

Hi Francesco, and welcome!

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract more answers and attract them sooner. It is probably not required for this kind of question, although it might help me better understand the goal.

Also, people pretty much stopped counting when the number of R packages crossed 10,000, so for all but the most popular packages, there may be no one familiar with self-organizing maps who sees the question.

To start thinking about this problem, I'd like to see a sample object from the package that produces a cluster object that can be directly inspected, rather than by using a visual plot. That will help me start thinking about whether a vertex cluster in a graph object (graph as in network, not visuals) has an appropriate statistical test.

Francesco_le · April 9, 2020, 8:19am

Hi technocrat,
I am here to learn because I want to add these skills and connect them with my degree in Medicinal Chemistry (and maybe even be able to work in fields related to computational chemistry). So any kind of advice on how to proceed in this life mission will be welcome, even just pointers to other works to read!
Thank you very much for your answer. Unfortunately, as I said, I'm completely new to this world and honestly I wouldn't even know how to set this up. In fact, to get the clusters I used SOMbrero, which is a package that has a web interface (so I didn't have to write functions in R). So I don't know which package to install or which library to load ... and that's why I'm here, to ask you.

I should also say that I'm not sure I understood your technical question correctly. I think you would like to see how the clustering was done?

I gave the input to create 25 clusters (actually even more but I think it's easier to work with small numbers to learn).

Now what I got is therefore a .txt (or .rda) file like this:

"Name" "cluster"
"aa" 4
"ab" 10
"ac" 7
"ad" 9
"ae" 1
"af" 10
"ag" 25
"ah" 1
"ai" 12
"al" 1
"am" 18
and so on (up to 17,000 elements).

The elements in the "yes" class, which are the ones I hope to group in a cluster, are usually concentrated in a few clusters, so I wondered if there was a way to understand and quantify the performance. For now, my problem is not about how many clusters to get.
The big question I have is the following: is there a way to understand whether the elements in the "yes" class are there by chance, or because there is really something that binds them? I believe that if there is an answer to this question, there is also a way to quantify the performance. Or am I wrong?
I hope I have answered your question, and thank you again for your collaboration.

technocrat · April 9, 2020, 8:29am

No worries, Francesco!

This is a process we all go through. Getting up to speed can be daunting.

A good place to start is R for Data Science, which is available online free of charge; buying the book is well worth the modest cost.

You're right that all the data isn't helpful, just enough to illustrate. Where we are going with this thing called reprex is to get something that we can inspect more thoroughly than by just looking at a plot.

At this point, I can't see past the grouping to the objects that produced them, which will inform me where to look for an appropriate statistical test (assuming, who knows?, that one exists).

It's late in Seattle, WA USA and I'm about to sign off for the evening. Tomorrow, I'll see if I can put a reprex together from the examples in help(SOMbrero). If it falls off my radar, just edit one of your posts and I'll be notified automatically.

Here's a bit of orientation to R that may help:

One of the hard things to get used to in R is the concept that everything is an object that has properties. Some objects have properties that allow them to operate on other objects to produce new objects. Those are functions.

Think of R as school algebra writ large: f(x) = y, where the objects are f, a function, x, an object (and there may be several) termed the argument and y is an object termed a value, which can be as simple as a single number (aka an atomic vector) or a very packed object with a multitude of data and labels.

And, because functions are also objects, they can be arguments to other functions, like the old g(f(x)) = y. (Trivia, this is called being a first class object.)
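As a small illustration of the f(x) = y idea and of functions as arguments, here is a sketch in base R. The names `square`, `add_one`, and `compose` are made up for this example.

```r
# f(x) = y: `square` is the function, `x` the argument, `y` the value
square <- function(x) x^2
x <- c(1, 2, 3)
y <- square(x)

# g(f(x)): the value of one function becomes the argument of another
add_one <- function(x) x + 1
z <- add_one(square(x))

# Functions are objects too, so they can be passed to other functions
compose <- function(g, f) function(x) g(f(x))
add_one_after_square <- compose(add_one, square)
w <- add_one_after_square(x)
```

Here `w` and `z` hold the same value, reached once by nesting calls and once by building a new function out of two existing ones.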

Although there are function objects in R that operate like control statements in imperative/procedural languages, they are best used "under the hood." As it presents to users interactively, R is a functional programming language. Instead of saying

take this, take that, do this, then do that, then if the result is this one thing, do this other thing, but if not do something else and give me the answer

in the style of most common programming languages, R allows the user to say

use this function to take this argument and turn it into the value I want for a result
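To make the contrast concrete, here is the same small task written both ways. The vector `counts` and the threshold of 4 are invented for illustration.

```r
counts <- c(5, 11, 9, 0, 1)

# Imperative style: spell out each step and each branch
labels_loop <- character(length(counts))
for (i in seq_along(counts)) {
  if (counts[i] > 4) {
    labels_loop[i] <- "high"
  } else {
    labels_loop[i] <- "low"
  }
}

# Functional style: one function turns the argument into the value
labels_fun <- ifelse(counts > 4, "high", "low")
```

Both versions produce the same character vector; the second simply states what is wanted and leaves the looping to the function.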

Francesco_le · April 9, 2020, 12:28pm

Really thank you for your explanation! This is exactly what I was looking for: not simply the solution to the problem but also the explanation! Otherwise I don't think I can ever become an independent user of R! But above all because I want to make these systems my job (if I'm lucky to find one).

I immediately start reading what you recommended and I am looking for a guide on how to create a reprex. Thanks again for your help.

technocrat · April 9, 2020, 8:51pm

After sleeping on it, the short answer is that I don't think it's possible to test a graph object for its error term and infer whether it passes some statistical test for the probability that it arose from random variation.

My opinion is based on two reasons: first, because a graph object is an abstraction; and second, we have nothing to compare it to.

The elements of a graph object are nodes in relation to zero or more edges. A graph of a single node with no edges is the simplest possible non-empty graph object. It is by definition clustered since there are no other points with which to share edges.

Further, edges are defined by their relation to at least two points. The relation can have the attribute of directionality, in that some property flows from one node to the other but nothing flows in return.

Both nodes and edges exist in a binary state. They either do or do not exist. An analogy is a logistic regression test for a binary result.

fit <- glm(y ~ x, family = binomial, data = something)

The something nature is important. Within the graph object either a node or an edge exists or it doesn't. However, graph objects that are representations of external objects depend on whether or not the existence of the external object is correctly identified.
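For readers who have not met a logistic regression, here is a runnable sketch of such a call on simulated data. The data frame and the coefficients 0.5 and 1.5 are invented; note `family = binomial`, which is what makes `glm()` fit a logistic regression rather than its default Gaussian model.

```r
set.seed(42)  # simulated data, for illustration only
something <- data.frame(x = rnorm(200))

# y is 1 or 0; the chance of a 1 rises with x
something$y <- rbinom(200, size = 1,
                      prob = plogis(0.5 + 1.5 * something$x))

# family = binomial requests a logistic regression
fit <- glm(y ~ x, family = binomial, data = something)
coef(fit)
```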

The accuracy of the classification cannot be inferred without knowing the statistical distribution of the population from which the nodes and edges were drawn. I know of no principled way to do so.

Given the underlying observation is accurate (we have no intrinsic way of knowing otherwise and cannot estimate without knowing what the population distribution should be) the most we can test is whether there is a complete or partial correspondence between the "real" nodes as observed and their representation in the graph.
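One way to look at that correspondence, as a descriptive sketch only, is to cross-tabulate cluster membership against the known class. The data frame below uses the eleven rows posted earlier in the thread, on the assumption that the two files line up by name (ignoring case); the column names `cluster` and `class` are my own.

```r
# The eleven rows shown earlier in the thread, matched by name
d <- data.frame(
  name    = c("aa", "ab", "ac", "ad", "ae", "af",
              "ag", "ah", "ai", "al", "am"),
  cluster = c(4, 10, 7, 9, 1, 10, 25, 1, 12, 1, 18),
  class   = c("YES", "NOT", "NOT", "YES", "NOT", "NOT",
              "NOT", "NOT", "NOT", "NOT", "NOT")
)

# Rows are clusters, columns are classes, cells are counts
tab <- table(cluster = d$cluster, class = d$class)
tab

# Share of each cluster's members that are YES
yes_share <- prop.table(tab, margin = 1)[, "YES"]
yes_share
```

A cluster whose YES share is close to 1 or close to 0 corresponds cleanly to one class. This describes the correspondence; by itself it says nothing about whether the pattern could have arisen by chance.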

The clusters identified in a graph object vary with the algorithm used to compute them. There are various measures of connectedness, centrality and other concepts used to identify them, based on criteria for what represents a cluster, or sub-graph. A sub-graph may be an isolate, unconnected to any other sub-graph, which is the least uncertain case of forming a cluster within the graph object in which it is embedded. Within any graph or set of sub-graphs there may be overlapping sub-clusters, which require a decision rule for assigning nodes to one, the other or both.

Graph theory, similar to set theory, takes very simple elements that can be combined into highly complicated objects. It is definitely not the best entry into statistical analysis.

This is not the best community for deeper exploration of the subject. For that you should join the statnet community.

I hope this has been helpful.

Francesco_le · April 10, 2020, 3:47pm

Thank you so much for your time and your valuable help. Your explanation is very clear and I will try to deepen what you told me!

However, I would like to ask you one thing about this:

I know that all "classes" with "YES" should be together. Using this as a best case, can I then do what you say here? And how? In this way I would have a comparison method to understand which clustering system went better. In any case, now I will also try to write in that group that you told me. Thanks again so much for everything! Good day!

technocrat · April 10, 2020, 4:08pm

1. The set of all 1s in a graph object can be defined as a cluster.
1. All of the 1s are a cluster if all nodes are directly or indirectly connected.
1. If fewer than all are connected, those that are connected form a cluster.
1. If the directness of the connection is used to identify clusters, then the degree of separation can be used to partition into groups that meet the criterion.
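Points 2 and 3 amount to finding connected components. A minimal base R sketch follows, on a hypothetical undirected graph of six nodes; the edge list is invented, and packages dedicated to network analysis offer ready-made functions for the same job.

```r
# Hypothetical undirected graph: six nodes, edges given as pairs
edges <- rbind(c(1, 2), c(2, 3), c(4, 5))
n <- 6

# Adjacency matrix: adj[i, j] is TRUE when an edge joins i and j
adj <- matrix(FALSE, n, n)
adj[edges] <- TRUE
adj[edges[, 2:1]] <- TRUE

# Label each node with the id of its connected component
component <- rep(NA_integer_, n)
current <- 0L
for (start in seq_len(n)) {
  if (!is.na(component[start])) next
  current <- current + 1L
  component[start] <- current
  queue <- start
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    # Neighbours not yet labelled join the current component
    nb <- which(adj[node, ] & is.na(component))
    component[nb] <- current
    queue <- c(queue, nb)
  }
}
component
```

Nodes 1 to 3 are directly or indirectly connected and share one label, nodes 4 and 5 share another, and node 6 is an isolate with a label of its own.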

Francesco_le · April 14, 2020, 7:06am

I'm sorry technocrat, but as I said I'm inexperienced with these things. Can you (when you have time) show me how to do it? How to write these functions?

Really sorry for the inconvenience.

technocrat · April 14, 2020, 5:25pm

I'm going to have to refer you to

Douglas Luke, A User’s Guide to Network Analysis in R (2015)
Eric D Kolaczyk, Statistical Analysis of Network Data with R (2014)

for beginning-to-intermediate and intermediate-to-advanced texts for this subject using R.

My unpublished paper Social Network Analysis of the Enron Corpus illustrates the nature of graph objects.
