Grouping partially multivariate data


This is my first post in this forum and I am relatively new to RStudio. I am currently working on a project where I collected elevation measurements from geomorphological features over a large area. Each feature is composed of 1 to 4 individual phases distinguished by different elevation values. There are likely to be more than 4 phases in total, but not all of them are apparent at every multiphase feature. That is, feature 1 may show elevations a, b and c, whereas feature 2 shows a, c and d. There is high variance in the data with no apparent natural breaks or clustering. I now need to divide all these measurements into individual groups to create distinct site-wide developmental phases, as all the features were created by the same process. I attempted to create individual groups manually but I need something more comprehensible. My only guides for grouping my measurements are the multiphase sites, but the majority of features comprise only 1 or 2 phases. The following conditions need to be met:

  1. The number of individual phases needs to be kept to an absolute minimum but can be >4.
  2. Individual phases must not overlap.
  3. Variance within each group should be as low as possible without compromising conditions 1 and 2.

Is there a function in RStudio that would allow me to do that?

Hi, welcome!

My undergraduate geology prof was fond of saying that a quaternary geologist was one who carried a shovel, mapped in at least seven shades of yellow and used the word "crap" indiscriminately.

Without a reproducible example, called a reprex, I can only offer general guidance, which is to look to K-nearest-neighbor classification or K-means clustering, both of which work from the n-dimensional distance between points. You might start with the kknn package. That package provides for weighting and will do far better than eyeballing what can be a messy process.

Come back with any follow-up questions, please.

(Disclosure: after graduate school, I had only a brief career in geology before wandering off the lithosphere into law.)

Much obliged. I have since put a bit of thought into this and into how to define the problem so other forum members better understand what I would like to achieve. My survey was carried out at different sites, each providing 1 to 4 observations of a certain event that took place at the site, each differing in elevation. This tells me that the process that produced the observations took place more than once, and at least 4 times, with each occurrence discernible by its individual elevation. There is a lot of variation in the data, which makes it challenging to determine which of my observations were created by the same event. What I need is an algorithm that can move individual observations horizontally across a matrix of undetermined width (but at least 4 columns). The aim is to have as few columns as possible whilst having the smallest possible variability within each column. Does this make sense?

Hi, that helps.

Let's step through the data, conceptually.

  1. Each observation is a row, containing variables.
  2. Three of the variables lie in longitude, latitude, elevation 3-space, or "location."
  3. One or more of the variables represent a classification based on your criteria for identifying an event. I'll call these "attributes."
  4. The same location may have multiple observations at different elevations and attributes. I'll refer to these as the "event."
  5. The goal is to classify the observations into an undetermined number of categories that each represents the same regional occurrence of some source process.

The data structure I've described is tidy in the sense that observations are rows and variables are columns.

Generically, this is a classification problem of high(ish) dimensional data. I'm going to assume that the number of observations n is greater than the dimensionality p.

I would begin with location to identify non-intersecting surfaces in 3-space. For this, the sf package and the spatial tools it provides will be helpful. Depending on the underlying process (such as an ashfall), these could be relatively flat, like sedimentary strata. If they are pro-glacial features, they could be relatively convoluted.

If your underlying process model supports it, in 4-space, adding relative time, the surfaces should be stratigraphic.

Next, you want to place each attribute in that 4-space. Your null hypothesis, H_0 is that the attributes are randomly distributed. Your alternative hypothesis, H_1 is that they are non-randomly distributed under a distribution to be determined.

In addition to K-nearest-neighbor and K-means clustering, there's also principal component analysis, a method loosely analogous to linear regression but in higher dimensions, which seeks orthogonal axes that capture as much of the variance as possible (equivalently, that minimize the residual variance).

I'm sorry to have to address your question at such a high level of generality. Representative data would help me and others make more specific suggestions.


You were right insofar as it is a classification problem, with the added complication that my observations are not assigned to a specific attribute yet. I need to determine the attributes (i.e. column or event) based on my observations.

But I think I need to provide a bit more context to make my problem more tangible. I did a survey of erosional notches in an isolated limestone massif. It is going to be used to model a local sea level curve and reconstruct a palaeo-coastline. There is insufficient temporal data (i.e. dates for the notches) to figure out which notch sites are contemporaneous, nor is there any obvious clustering of values that allows some form of categorisation. The majority of the sites show only one notch, whereas others are compound notches with multiple components at different elevations. Each component pertains to a period of still stand during transgression/regression. Global sea level curves indicate that notches at higher elevations formed during the mid-Holocene high stand, whereas the lower ones are likely of later date. So, there is no meaningful chronological sequence and most notches are devoid of datable material. Elevations of components at one compound notch do not necessarily mirror those at another. There is also no apparent patterning in their spatial distribution that could indicate local uplift/subsidence, so their cartographic location is only of limited use in this task. Apart from two outliers, the difference between min and max elevation of all measured notches is c. 4 m.

So, my matrix looks a little like this:

A 4.23
B 4.85
C 4.61
D 7.70
E 6.72
F 6.93 7.59
G 4.71
H 7.32 7.81 9.46
I 5.17
J 4.60
K 7.69 8.33
L 5.93 6.35 6.86
M 5.97
N 8.62
O 3.56 5.19
P 6.05 7.46 7.49
Q 6.96
R 4.77 8.97
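
As a possible starting point, the table above could be entered in R in long form, with one row per measurement. This is only a sketch of one layout, not something prescribed anywhere in this thread; the names `notches`, `obs` and `min_groups` are made up for illustration.

```r
# Elevations (m) per site, copied from the table above.
notches <- list(
  A = 4.23, B = 4.85, C = 4.61, D = 7.70, E = 6.72,
  F = c(6.93, 7.59), G = 4.71, H = c(7.32, 7.81, 9.46),
  I = 5.17, J = 4.60, K = c(7.69, 8.33), L = c(5.93, 6.35, 6.86),
  M = 5.97, N = 8.62, O = c(3.56, 5.19), P = c(6.05, 7.46, 7.49),
  Q = 6.96, R = c(4.77, 8.97)
)

# Long ("tidy") form: one row per measurement, site label plus elevation.
obs <- data.frame(
  site = rep(names(notches), lengths(notches)),
  elev = unlist(notches, use.names = FALSE)
)

# Sort by elevation so neighbouring rows are neighbouring notches.
obs <- obs[order(obs$elev), ]
rownames(obs) <- NULL

# A site with m components needs at least m groups, so the largest
# number of components at any one site is a lower bound on the group count.
min_groups <- max(table(obs$site))
```

In this layout the "one value per site per group" rule becomes a simple check on duplicated `site` labels within a group, which is easier to work with than empty cells in a wide matrix.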


I now want to deduce the minimum number of periods of still stand that would explain the occurrence of notches at different elevations. So, what I seek is a function that can query and compare data not only within a variable and between rows but also across columns, and move measurements into other columns. The only condition is that no two values from the same row can occupy the same column.

Imagine a matrix with 4 columns and 50 rows. Each row is a site and each column is a variable containing an elevation measurement (or observation). 35 of the sites only have values in column A; 10 rows also have values in column B; 3 have values in column C and 2 have values in column D. The currently existing rows are purely containers that allocate measurements to a specific site but do not relate to a specific event. What I need R to do is assign each measurement to an event by moving individual values in each row from column to column and calculating how well a value fits with the other values in a given column. If the value doesn't fit in any of the existing columns, it is moved to a new, temporary column, which then becomes part of the process. Should more values fit into this new column, it becomes permanent. If not, the values are moved to the column where they fit best. At the end of the process I would like to have a matrix with as few columns as possible and the lowest possible variance or deviation per column. Another question is, of course, what is an acceptable maximum number of columns, and which measure of best fit to use (variance, deviation?).
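
The procedure described above can be restated, under some assumptions, as cutting the sorted elevations into consecutive runs so that no site appears twice in a run. The sketch below is not a function from any package; it is a small dynamic program in base R. The assumptions are mine: "phases must not overlap" is read as "each group is a contiguous elevation band", fit is measured by the within-group sum of squares, and the names `cost`, `segment`, `min_k` and `wss_by_k` are hypothetical.

```r
# Measurements from the table in this thread, one row per notch component.
obs <- data.frame(
  site = c("A","B","C","D","E","F","F","G","H","H","H","I","J","K","K",
           "L","L","L","M","N","O","O","P","P","P","Q","R","R"),
  elev = c(4.23,4.85,4.61,7.70,6.72,6.93,7.59,4.71,7.32,7.81,9.46,5.17,4.60,
           7.69,8.33,5.93,6.35,6.86,5.97,8.62,3.56,5.19,6.05,7.46,7.49,6.96,
           4.77,8.97)
)
obs <- obs[order(obs$elev), ]
n <- nrow(obs)

# cost[i, j]: within-group sum of squares of sorted values i..j,
# or Inf if any site occurs twice in that run.
cost <- matrix(Inf, n, n)
for (i in seq_len(n)) {
  for (j in i:n) {
    if (anyDuplicated(obs$site[i:j])) break  # longer runs stay Inf
    v <- obs$elev[i:j]
    cost[i, j] <- sum((v - mean(v))^2)
  }
}

# Best split of the sorted values into exactly k contiguous groups.
segment <- function(k) {
  best <- matrix(Inf, k, n)          # best[g, j]: min cost of 1..j in g groups
  from <- matrix(NA_integer_, k, n)  # start index of the last group
  best[1, ] <- cost[1, ]
  from[1, ] <- 1L
  if (k > 1) for (g in 2:k) for (j in g:n) {
    starts <- g:j                    # the last group covers starts..j
    tot <- best[g - 1, starts - 1] + cost[starts, j]
    best[g, j] <- min(tot)
    from[g, j] <- starts[which.min(tot)]
  }
  if (!is.finite(best[k, n])) return(NULL)  # no valid split into k groups
  grp <- integer(n)
  j <- n
  for (g in k:1) {                   # walk back through the stored starts
    i <- from[g, j]
    grp[i:j] <- g
    j <- i - 1L
  }
  list(group = grp, wss = best[k, n])
}

# Condition 1: smallest number of groups that satisfies the site rule.
min_k <- 1
while (is.null(segment(min_k))) min_k <- min_k + 1
obs$group <- segment(min_k)$group

# Condition 3: how much the total within-group SS drops with extra groups.
wss_by_k <- sapply(min_k:(min_k + 4), function(k) segment(k)$wss)
```

`split(obs$elev, obs$group)` then lists the elevations per proposed phase, and `wss_by_k` gives a rough basis for deciding whether an extra group buys enough reduction in spread to be worth it. Note that the two P components at 7.46 and 7.49 are forced into different groups by the site rule, however close they are.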

I wish I could just send on the data but I cannot.


OK, I've got a much clearer idea now. I'm going to need to chew over this a while. Please send a message if you don't see something in 7-10 days.

Did you get a chance to look into the problem? Someone has helped me solve parts of it using reshape2, K-means and Jenks binning. While that created some neat groups, the main problem persists: each site can be represented in a group only once. I wonder if it's possible to add a condition to the binning process that assigns a measurement to a new group. Something like: if a measurement x fits into group A BUT a better-fitting measurement from the same site already exists in that group, OR if x deviates significantly from the group mean, then assign it to group B, etc. Assign x to a new group if it does not fulfil either condition.
The idea is to first create as few groups as possible using the initial method to create a baseline and then apply the conditions.
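
A baseline of this kind, plus a check of where it breaks the one-per-site rule, might look roughly like the following. It is a sketch only: it uses `kmeans()` from base R's stats package on the elevations from the table earlier in this thread, the choice of 4 centres is an assumption, and the names `counts` and `clashes` are made up. It flags the conflicts but does not resolve them.

```r
# Measurements from the table earlier in this thread.
obs <- data.frame(
  site = c("A","B","C","D","E","F","F","G","H","H","H","I","J","K","K",
           "L","L","L","M","N","O","O","P","P","P","Q","R","R"),
  elev = c(4.23,4.85,4.61,7.70,6.72,6.93,7.59,4.71,7.32,7.81,9.46,5.17,4.60,
           7.69,8.33,5.93,6.35,6.86,5.97,8.62,3.56,5.19,6.05,7.46,7.49,6.96,
           4.77,8.97)
)

# Baseline grouping on elevation alone (ignores the site rule).
set.seed(1)  # k-means uses random starts
km <- kmeans(obs$elev, centers = 4, nstart = 25)
obs$group <- km$cluster

# Count measurements per site and group; a count above 1 breaks the rule.
counts <- as.data.frame(table(site = obs$site, group = obs$group),
                        stringsAsFactors = FALSE)
clashes <- counts[counts$Freq > 1, ]

# Mark the individual measurements involved in a clash.
obs$clash <- paste(obs$site, obs$group) %in% paste(clashes$site, clashes$group)
obs[obs$clash, ]
```

One caveat on the repair step: moving a clashing measurement to the next-best group can make the groups' elevation ranges overlap, so condition 2 would need to be re-checked after every move.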

Thanks for the bump. I've been in the process of moving boxes instead of objects, but I have some time coming up.

I've taken a brief lit survey of the problem and found Trenhaile, Alan S., "Coastal notches: Their morphology, formation, and function," Earth-Science Reviews, Volume 150, Nov 1, 2015.

I had begun to think of the problem as a result of deltas from an isobase sea level datum, where tidal elevations could be inferred from x-y-z coordinates, using the stratigraphic assumption that elevation correlates with sequence in the absence of structural deformation. By interpolating contours of equal elevations of notches, I hypothesized that you would be able to detect sea level changes, even with some observational points lacking one or more notches.

However, the geomorphological processes are much more involved than I appreciated. The author of the cited paper concludes that

Further research is needed on the nature and efficacy of the erosional processes to critically analyze and test the assumptions of much of the literature to date ... .

I wish that I could help you more with your problem, but it seems to be highly domain sensitive. You might want to contact the author for suggestions.


Yes, it is a highly complex process and the problem I outlined here is only one of many aspects I am investigating. I am indeed in contact with said author.
However, the help I am looking for here is R-related, not geological, although I do appreciate your thoughts on the matter.
I can do, and have done, the classification by hand, but an automated solution in R is simply more elegant and comprehensive. It seems, though, that what I am looking for is rather unusual and not easily achievable.

