I’ve recently needed to find a way to break a large dataset of points into manageable clusters, ideally within a Postgres database so this post looks at methods of clustering spatial data.
I’ve not found anything in PostGIS, although would love to be proved wrong, so I looked to the R statistical language and to using the PL/R extension in Postgres (sorry, this post assumes R and PL/R have been installed).
After muddling around in a bunch of things at the edge of my understanding, I’ve managed to create a PL/R function that takes a set of spatially distributed points, processes them using R’s k-means tools, and returns them with a cluster id appended.
Others may be able to suggest better ways of doing this, especially how to present the returning data but it seems to work.
At the moment it returns a composite value so I’ve created a special type, structured to accept the R output:
CREATE TYPE kmean_cluster_id AS (
lon double precision,
lat double precision,
It means that using the result looks a little unconventional in an SQL script:
(myresult.r_cluster_id).gid AS gid, (myresult.r_cluster_id).my_result AS cluster_id
The input to the function is:
- The number of clusters you wish to generate
- An array of the GIS index (gid) of the data points so that the data can be linked back to the original points,
- Arrays of the latitude (lat) and longitude (long).
CREATE OR REPLACE FUNCTION r_cluster_id(
lon double precision,
lat double precision
RETURNS SETOF kmean_cluster_id AS
data <- cbind(lon, lat)
cl <- kmeans(data, num_of_clusters)
my_result = cl$cluster
result <- data.frame(gid = gid, lon = lon, lat = lat, my_result = my_result)
$BODY$ LANGUAGE 'plr';
An example of using the function to add a cluster column to a table and update it with the kmeans cluster ids:
ALTER TABLE public.my_spatial_points ADD COLUMN cluster_id integer;
UPDATE my_spatial_points SET cluster_id = cluster_id.cluster_id
FROM ( SELECT (my_result.r_cluster_id).gid AS gid,
(my_result.r_cluster_id).my_result AS cluster_id
FROM (SELECT r_cluster_id( 10, -- The number of clusters
) AS my_result
) AS cluster_id
WHERE my_spatial_points.gid = cluster_id.gid;
Loading the resulting table into a GIS package like QGIS and styling it using the new cluster_id column will hopefully give you something that looks like this: