I’ve  recently needed to find a way to break a large dataset of points into manageable clusters, ideally within a Postgres database so this post looks at methods of clustering spatial data.

I’ve not found anything in PostGIS, although would love to be proved wrong, so I looked to the R statistical language and to using the PL/R extension in Postgres (sorry, this post assumes R and PL/R have been installed).

After muddling around in a bunch of things at the edge of my understanding, I’ve managed to create a PL/R function that takes a set of spatially distributed points, processes them using R’s k-means tools, and returns them with a cluster id appended.

Others may be able to suggest better ways of doing this, especially how to present the returning data but it seems to work.

At the moment it returns a composite value so I’ve created a special type, structured to accept the R output:

CREATE TYPE kmean_cluster_id AS (
gid integer,
lon double precision,
lat double precision,
my_result integer);

It means that using the result looks a little unconventional in an SQL script:

(myresult.r_cluster_id).gid AS gid, (myresult.r_cluster_id).my_result AS cluster_id

The input to the function is:

  1. The number of clusters you wish to generate
  2. An array of the GIS index (gid) of the data points so that the data can be linked back to the original points,
  3. Arrays of the latitude (lat) and longitude (long).

CREATE OR REPLACE FUNCTION r_cluster_id(         

         num_of_clusters integer,

         gid integer[],

         lon double precision[],

         lat double precision[]

     ) RETURNS SETOF kmean_cluster_id AS


    data <- cbind(lon, lat)

    cl <- kmeans(data, num_of_clusters)

    my_result = cl$cluster

    result <- data.frame(gid = gid, lon = lon, lat = lat, my_result = my_result)



An example of using the function to add a cluster column to a table and update it with the kmeans cluster ids:

ALTER TABLE public.my_spatial_points ADD COLUMN cluster_id integer;

UPDATE my_spatial_points SET cluster_id = cluster_id.cluster_id

FROM ( SELECT (my_result.r_cluster_id).gid AS gid,

              (my_result.r_cluster_id).my_result AS cluster_id

       FROM (SELECT r_cluster_id( 10, -- The number of clusters

                    array_agg(gid), array_agg(ST_X(the_geom)),

                    array_agg(ST_Y(the_geom)) )

             FROM my_spatial_points

            ) AS my_result

     ) AS cluster_id

WHERE my_spatial_points.gid = cluster_id.gid;

Loading the resulting table into a GIS package like QGIS and styling it using the new cluster_id column will hopefully give you something that looks like this:

k-means clusters

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>