Wednesday, July 05, 2006

Shapefiles considered harmful

There have been a couple of recent posts about the usefulness of the shapefile going into the future. Jeff Thurston posed the question in his Moving Beyond the Shapefile post. A responding post by the Drkside of gis took an appropriately opposing position on the idea of using personal geodatabases as a replacement. Closed formats are no longer suitable in such an important data market. They reduce compatibility and introduce vendor lock-in on data that should be, and in many cases is, in the public domain.

As a GIS file format discussion, this is near and dear to my heart. I agree wholeheartedly with the premise that the shapefile's days are and should be numbered. The question is, who will be big enough to put out a competitor? The .shp, .dbf and .shx troika to describe a single set of data reeks of '80s design. A single unit of data should reside in a single file. The complexity of opening three files and coordinating the shared representation of data between them would be a non-starter if the format were designed today. I understand the convenience of de-coupling the index and attributes from the data when adding records to a shapefile (or should I say shapefiles), since you can just write to the end of each file rather than shuffle things around in a single file. Sorry, but that just isn't a good enough excuse anymore.

This isn't just a helpless rant about a data format that has outlived its time. I am proposing a real alternative, a non-closed geodatabase, which I will call a "shapedb", that will work with the open source database SQLite. If you haven't seen it yet, it is a cross-platform, open source embedded database engine. It can be built into your commercial or non-commercial product thanks to its very non-restrictive license. I have been using it in EarthBrowser 2.8 onward and am relying on it heavily in version 3. I can't tell you how easy it makes things to have an embedded database. The ability to manipulate data with SQL statements to extract just the information you need is a quantum leap from the old dumb file format. SQLite files are cross-platform compatible, so the endian differences between Motorola and Intel processors aren't a problem (believe me, that's usually a big issue, even for shapefiles). You can add as much data as you like to the file and index it however you like. Not only that, you can have raster data as well as vector data, gml data, kml data or whatever your requirements are.

Just saying "put it in a cross-platform database file and all your problems are solved" isn't really a complete solution to the problem. There must be standards for how the data is organized and formatted, just like the shapefile. I propose that a small group of GIS programmers and a user or two (or perhaps just me if nobody is interested) provide a standard template for each data type. As the simplest example I can think of, just to illustrate the point, how about something like this:

Tables describing shapefile, shape objects and attributes:

create table shp (type integer,
xmin real, ymin real, zmin real, mmin real,
xmax real, ymax real, zmax real, mmax real);

create table shp_atts(id integer primary key,
...user defined attributes...);

create table shp_object (id integer primary key,
atts_id integer, shp_order integer,
shp_type integer, nvertices integer,
xmin real, ymin real, zmin real, mmin real,
xmax real, ymax real, zmax real, mmax real,
vertices blob);
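
To make the template concrete, here is a minimal sketch that creates these tables with Python's built-in sqlite3 module and stores one shape object. Several details are assumptions that the template above does not specify: the `name` and `state` columns stand in for the user defined attributes, the helper names are hypothetical, and the vertices blob is packed as little-endian (x, y) doubles. SQLite handles byte order for its own integer and real columns, but a blob is opaque to it, so a real standard would have to fix the blob encoding explicitly.

```python
import sqlite3
import struct

# The template tables; name/state are hypothetical user defined attributes.
SCHEMA = """
create table shp (type integer,
    xmin real, ymin real, zmin real, mmin real,
    xmax real, ymax real, zmax real, mmax real);
create table shp_atts (id integer primary key, name text, state text);
create table shp_object (id integer primary key,
    atts_id integer, shp_order integer,
    shp_type integer, nvertices integer,
    xmin real, ymin real, zmin real, mmin real,
    xmax real, ymax real, zmax real, mmax real,
    vertices blob);
"""

def pack_vertices(points):
    """Pack (x, y) pairs as little-endian doubles (an assumed encoding)."""
    flat = [coord for point in points for coord in point]
    return struct.pack("<%dd" % len(flat), *flat)

def unpack_vertices(blob):
    """Inverse of pack_vertices: bytes back to a list of (x, y) tuples."""
    flat = struct.unpack("<%dd" % (len(blob) // 8), blob)
    return list(zip(flat[0::2], flat[1::2]))

def create_shapedb(path=":memory:"):
    """Open (or create) a shapedb file and set up the template tables."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def add_object(db, atts_id, shp_order, shp_type, points):
    """Insert one 2D shape object; z and m bounds are left NULL."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    db.execute(
        "insert into shp_object (atts_id, shp_order, shp_type, nvertices,"
        " xmin, ymin, xmax, ymax, vertices) values (?,?,?,?,?,?,?,?,?)",
        (atts_id, shp_order, shp_type, len(points),
         min(xs), min(ys), max(xs), max(ys), pack_vertices(points)))
```

Keeping the bounding box in plain real columns means a reader can filter shapes by extent in SQL without ever decoding the blob.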


The need for a shx file can be eliminated with the shp_order field: just use a select with an "order by shp_order". Another problem that I can't stand is removed as well. Each shape object in a shapefile has to have a corresponding entry in the dbf attributes, which is an ugly redundancy to me. You also have to group shapes manually by checking against a key in the attributes that is not known beforehand and is different for each shapefile. With the atts_id field, you can group all shapes that belong to a particular entity. You could do a nice query like the following:

select * from shp_object as o, shp_atts as a where
a.state='OR' and a.id=o.atts_id order by shp_order;


You now have the vector outline of Oregon. Amazingly simple!
You can run with that idea and say you put in an attribute as to whether the vector is on the shoreline and all of a sudden you can do something like:

select * from shp_object as o, shp_atts as a where
a.state='OR' and a.shoreline=0 and a.id=o.atts_id
order by shp_order;

Now you have the outline of Oregon without the shoreline portion.
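
The two queries above can be tried end to end with a small sketch like the one below. It uses a cut-down version of the tables, and the sample rows, the `build_demo` and `outline` helper names, and the placeholder blob contents are all made up for illustration. Note that because shoreline lives in shp_atts, the shoreline and non-shoreline pieces of Oregon get separate attribute rows here.

```python
import sqlite3

def build_demo():
    """Build an in-memory shapedb with a few made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table shp_atts (id integer primary key,
            state text, shoreline integer);
        create table shp_object (id integer primary key,
            atts_id integer, shp_order integer, vertices blob);
    """)
    db.executemany(
        "insert into shp_atts (id, state, shoreline) values (?,?,?)",
        [(1, "OR", 0), (2, "OR", 1), (3, "WA", 0)])
    # Rows are inserted out of order on purpose; shp_order restores it.
    db.executemany(
        "insert into shp_object (atts_id, shp_order, vertices) values (?,?,?)",
        [(1, 2, b"east"), (2, 1, b"coast"), (1, 0, b"north"),
         (3, 0, b"wa"), (1, 3, b"south")])
    return db

def outline(db, state, include_shoreline=True):
    """Return (shp_order, vertices) rows for one state, in drawing order."""
    sql = ("select o.shp_order, o.vertices"
           " from shp_object as o, shp_atts as a"
           " where a.state = ? and a.id = o.atts_id")
    if not include_shoreline:
        sql += " and a.shoreline = 0"
    sql += " order by o.shp_order"
    return db.execute(sql, (state,)).fetchall()
```

Passing the state as a bound parameter (the `?`) rather than pasting it into the SQL string keeps the query safe when the value comes from user input.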

With the right configuration of attributes, you could even put multiple shapefiles into one shapedb. In fact, that would be very simple and make a lot of sense. Not only that, you could include many different raster formats in the same shapedb with your shape objects. That would create a neat little package to distribute a set of data that belongs together anyway. No more unpacking zip and tar files and getting the file paths correct. Just dump it all into one shapedb and send it out to your customers.

The raster format could be even simpler. You could just have an identifier, format information (like 'image/jpeg') and an image blob. Why not just throw the well-known-text projection information into a field as well, or some gml data? The problem with many image formats is the all too common restriction in decompression libraries that the data must reside in a file. For problems like this, you could just dump the data to a temp file which is deleted upon completion of the operation. Using the shapedb format for a 1GB MrSID or ECW file would probably not be the best use of the format anyway, for performance reasons. However, a set of relatively small tif, png, jpg or jp2 files (a few megabytes each) would work fine, and you could include multiple resolution levels, a system of image tiles or whatever application-specific use you can think of. One of the pre-defined table templates should still be adhered to if you want the shapedb to be compatible with other applications using the shapedb format.
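
A raster table along these lines, together with the temp-file workaround, might look like the following sketch. The table name `raster`, its column names and the helper functions are assumptions made for illustration, not part of any agreed template.

```python
import os
import sqlite3
import tempfile

def create_raster_table(db):
    """Create a hypothetical raster table: id, format, projection, image."""
    db.execute(
        "create table if not exists raster (id integer primary key,"
        " format text, projection_wkt text, image blob)")

def add_raster(db, fmt, image_bytes, projection_wkt=None):
    """Store one encoded image and return its row id."""
    cur = db.execute(
        "insert into raster (format, projection_wkt, image) values (?,?,?)",
        (fmt, projection_wkt, image_bytes))
    return cur.lastrowid

def with_raster_file(db, raster_id, use_file):
    """Dump an image blob to a temp file, call use_file(path), then delete it.

    This is for decoders that can only read from a file on disk.
    """
    row = db.execute(
        "select image from raster where id = ?", (raster_id,)).fetchone()
    if row is None:
        raise KeyError(raster_id)
    fd, path = tempfile.mkstemp(suffix=".img")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(row[0])
        return use_file(path)
    finally:
        os.remove(path)  # the temp file never outlives the operation
```

The caller passes in whatever file-based decoder it needs as `use_file`, so the cleanup happens in one place even if decoding raises an error.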

An important consideration is how easy it would be to convert current shapefiles to a shapedb. The short answer is that it would be almost trivial. The long answer is that, depending on how complex you decide to make the table setup, it could require a little more logic. A straight read of the .shp, .dbf and .shx files, inserting each shape object and attribute list into the respective tables in the order given by the shx file, would do the trick. You could get slightly more tricky and collapse identical dbf records so they are unique, then reference those from the shp_object table, which would improve the speed of your select statements.
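
As a rough sketch of that conversion, the code below walks the records of a .shp file held in memory and inserts them into cut-down shapedb tables, collapsing identical attribute rows as it goes. The .shp layout it relies on (100-byte header, big-endian file code 9994 and length in 16-bit words, big-endian record headers, little-endian shape type) comes from the published shapefile specification. Everything else is an assumption: decoding the .dbf is left out, so `attribute_rows` is a hypothetical list of already-decoded tuples in record order, the `name` and `state` columns are placeholders, and the raw geometry bytes are stored unchanged in the vertices blob.

```python
import sqlite3
import struct

def iter_shp_records(data):
    """Yield (record_number, shape_type, content) from raw .shp bytes."""
    if struct.unpack(">i", data[0:4])[0] != 9994:
        raise ValueError("not a .shp file")
    # File length is stored in 16-bit words, big-endian, at byte 24.
    end = min(struct.unpack(">i", data[24:28])[0] * 2, len(data))
    pos = 100  # records start right after the fixed 100-byte header
    while pos + 8 <= end:
        rec_num, words = struct.unpack(">2i", data[pos:pos + 8])
        content = data[pos + 8:pos + 8 + words * 2]
        shape_type = struct.unpack("<i", content[0:4])[0]
        yield rec_num, shape_type, content
        pos += 8 + words * 2

def convert(shp_bytes, attribute_rows, db):
    """Insert shapes and de-duplicated attributes into a shapedb."""
    db.executescript("""
        create table if not exists shp_atts (id integer primary key,
            name text, state text);
        create table if not exists shp_object (id integer primary key,
            atts_id integer, shp_order integer, shp_type integer,
            vertices blob);
    """)
    atts_ids = {}  # attribute tuple -> shp_atts.id, to collapse duplicates
    records = zip(iter_shp_records(shp_bytes), attribute_rows)
    for order, ((_, shape_type, content), atts) in enumerate(records):
        if atts not in atts_ids:
            cur = db.execute(
                "insert into shp_atts (name, state) values (?, ?)", atts)
            atts_ids[atts] = cur.lastrowid
        db.execute(
            "insert into shp_object (atts_id, shp_order, shp_type, vertices)"
            " values (?,?,?,?)",
            (atts_ids[atts], order, shape_type, content[4:]))
    db.commit()
```

Because the records in a .shp file are already stored in sequence, this sketch never needs the .shx file: the loop counter becomes shp_order directly.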

So in summary, I propose a new open format for the interchange of GIS data, called shapedb, which:
- is based on the SQLite database file format
- can be shared cross-platform
- can store data from many shapefiles simultaneously
- can store multiple raster files
- can store application/vendor-specific data
- lets data be accessed and operated upon with SQL statements
- allows new formats by conforming to a table structure "template"

I am currently using this as my data format for EarthBrowser v3 and am considering spinning off an open source library to support the format in other applications.

Let me know what you think!

4 comments:

Anonymous said...

That sounds good to me, Matt, and I am looking forward to version 3 of EB. When do you expect to launch version 3 of EB?

Anonymous said...

Hi,

Have you had any more thoughts about shapedb? Any plans to release the library? I think you are on the right track with this, although implementing spatial indexing would perhaps be the biggest difficulty. Your idea ties in very nicely with David Blasby's "Spatial DB in Box" http://docs.codehaus.org/display/GEOS/SpatialDBBox

Cheers,
Dave

matt_giger said...

Thanks Dave,
I have been working toward creating the shapedb format for some time now and things are going well. There have been some revisions and extensions to how it will work that I think are going to be very nice and useful. I've also been in contact with the NASA World Wind people a little bit, and they sound like they may be interested as well.

I'm going to be posting with more updated information on shapedb in the next few weeks so stay tuned... :-)

Dave said...

Hi Matt,

I wonder how your work on shapedb is going? Let me know if you'd like some help with this. I think it's a very worthwhile project. I guess you are aware of sqlite's new virtual indexing mechanism...

Cheers,
Dave Robertson