A year or so ago I wrote a blog on how to cluster points of geographical data from within an SQL database. It was a working solution but inelegant and a bit clunky so I’ve finally got around to tidying it up and turning it into a proper  function.

PostgreSQL and PostGIS don’t have built-in clustering functions so I looked to the R statistical language and to using the PL/R extension in Postgres. Sorry, this post assumes R and PL/R have been installed – if you haven’t Boston GIS have a very good guide.

The function is a lot simpler than the previous one, and doesn’t require any special data types – its works just like any other function.

CREATE OR REPLACE FUNCTION r_cluster_kmeans(x float8, y float8, clust_no float8)
RETURNS int
AS $$
   set.seed(101)
   if (pg.state.firstpass == TRUE) {
       pg.state.firstpass <<- FALSE
       c <- kmeans(cbind(farg1, farg2), clust_no)
       assign("cluster", c$cluster, env = .GlobalEnv)
   }

   return(cluster[prownum])
$$
WINDOW
STRICT
VOLATILE
LANGUAGE plr;

And using it it equally easy. The function just needs the X and Y co-ordinates or the point, and the number of clusters you want to create, and it returns the specific cluster the point belongs to:

SELECT gid,
       r_cluster_kmeans(ST_X(the_geom), ST_Y(the_geom), 5) OVER () AS cluster,
       the_geom AS geom
FROM mytable;

In this example, it assigns each point in “mytable” to one of 5 clusters.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>