Tuesday, July 11, 2006

The seduction of The One

As a programmer, the notion of The One is very tempting to me. Let me explain.

When designing code, you come across many different elements that have to be coordinated, manipulated and routed. Data and state information may need to be transmitted to other parts of your code, other programs on your system and sometimes even remote systems. Usually you come up with a model of how these different parts will interact with each other, and you can make simplifications in the code that will enable enormous flexibility and scalability. It also gives me a good feeling inside, knowing that I've just created a quality tool that will make the project easier in the future. I don't know much about Eastern philosophy, but perhaps this is a Zen or Tao feeling of "rightness" in the code. Anyone who has spent much time programming will know this feeling.

Having an abstraction that provides a single interface between many sources of state and many consumers of that state is something that, when done right, reduces the complexity of the code by an order of magnitude. This is "The One": a single representation of an idea that interoperates with all or most of your code, making state changes nearly effortless.

However, in real coding, things are never that simple. There are always problems with dependencies and synchronization, and sometimes it is like trying to fit a square peg into a round hole. There is a saying attributed to Einstein along the lines of "make things as simple as possible, but no simpler." This rings true again and again when coding. I have wasted countless days, weeks, even months trying to create an abstract superset of functionality that the project would fit into nicely, with plenty of room to expand. Wouldn't that be nice? To go from being an expert programmer to a master code craftsman, one must learn to avoid this pitfall at all costs. Nothing eats up more time than writing code that winds up never being used. We all throw away big blocks of code when a better replacement comes along; that is unavoidable. But the planning stage of a project is where an over-enthusiastic programmer can really mess things up with a "simplification." There are local maxima and minima in programming, and going over a little hill of work will sometimes put you in a state where things are much easier. More often, however, doing a little foundation work to smooth the interface out will leave you where you started or, even worse, make things more complex.

To tie this to my recent post about the shapedb format, the ability to add raster data to the shapedb is certainly nice and simplifies distribution of related data. However, the need that gives rise to the shapedb format is not a convenient repository for data, but the processing overhead required to extract data from shapefiles and convert it into something useful. Now that the madness Hopefully I've just saved myself a few days of trying to make a nice "geodatabase" format that fits all sizes. I'll just focus on vector data for now.

Wednesday, July 05, 2006

Shapefiles considered harmful

There were a couple of posts recently about the usefulness of the shapefile going into the future. Jeff Thurston posed the question in his Moving Beyond the Shapefile post. A responding post by the Drkside of gis took an appropriately opposing position to the idea of using personal geodatabases as a replacement. Closed formats are no longer suitable in such an important data market. They reduce compatibility and introduce vendor lock-in on data that should be, and in many cases is, in the public domain.

As a GIS file format discussion, this is near and dear to my heart. I agree wholeheartedly with the premise that the shapefile's days are and should be numbered. The question is, who will be big enough to put out a competitor? The .shp, .dbf and .shx troika used to describe a single set of data reeks of '80s design. A single unit of data should reside in a single file. The complexity of opening three files and coordinating the shared representation of data between them would be a non-starter if the format were designed today. I understand the convenience of decoupling the index and attributes from the data when adding records to a shapefile (or should I say shapefiles), since you can just write to the end of each file rather than shuffle things around in a single file. Sorry, but that just isn't a good enough excuse anymore.

This isn't just a helpless rant about a data format that has outlived its time. I am proposing a real alternative, a non-closed geodatabase, which I will call a "shapedb", that will work with the open source database SQLite. If you haven't seen it yet, it is a cross-platform, open source embedded database engine. It can be built into your commercial or non-commercial product thanks to its very non-restrictive license. I have been using it in EarthBrowser from version 2.8 onward and am relying on it heavily in version 3. I can't tell you how easy it makes things to have an embedded database. The ability to manipulate data with SQL statements to extract just the information you need is a quantum leap from the old dumb file format. SQLite files are cross-platform compatible, so the endian issues between Motorola and Intel go away (believe me, those are usually a big problem, even for shapefiles). You can add as much data as you like to the file and index it however you like. Not only that, you can have raster data as well as vector data, GML data, KML data or whatever your requirements are.

Just saying "put it in a cross-platform database file and all your problems are solved" isn't really a complete solution to the problem. There must be standards for how the data is organized and formatted, just as there are for the shapefile. I propose that a small group of GIS programmers and a user or two (or perhaps just me if nobody is interested) provide a standard template for each data type. As the simplest example I can think of, just to illustrate the point, how about something like this:

Tables describing shapefile, shape objects and attributes:

create table shp (type integer,
xmin real, ymin real, zmin real, mmin real,
xmax real, ymax real, zmax real, mmax real);

create table shp_atts(id integer primary key,
...user defined attributes...);

create table shp_object (id integer primary key,
atts_id integer, shp_order integer,
shp_type integer, nvertices integer,
xmin real, ymin real, zmin real, mmin real,
xmax real, ymax real, zmax real, mmax real,
vertices blob);
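
For anyone who wants to try the template, here is a minimal sketch in Python using the standard sqlite3 module. It is not part of the proposal itself. The `state` and `shoreline` columns are hypothetical stand-ins for the user defined attributes, the helper names are made up for illustration, and the layout of the `vertices` blob (little-endian 64-bit x,y pairs) is only an assumption, since the template above does not fix an encoding.

```python
import sqlite3
import struct

# The template tables. "state" and "shoreline" are hypothetical columns
# filling in for the user defined attributes.
SCHEMA = """
create table shp (type integer,
  xmin real, ymin real, zmin real, mmin real,
  xmax real, ymax real, zmax real, mmax real);
create table shp_atts (id integer primary key,
  state text, shoreline integer);
create table shp_object (id integer primary key,
  atts_id integer, shp_order integer,
  shp_type integer, nvertices integer,
  xmin real, ymin real, zmin real, mmin real,
  xmax real, ymax real, zmax real, mmax real,
  vertices blob);
"""

def create_shapedb(path=":memory:"):
    """Open a database and create the template tables in it."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def pack_vertices(points):
    """Assumed blob layout: little-endian float64 x,y pairs."""
    flat = [coord for point in points for coord in point]
    return struct.pack("<%dd" % len(flat), *flat)

def unpack_vertices(blob):
    """Inverse of pack_vertices: blob back to a list of (x, y) tuples."""
    flat = struct.unpack("<%dd" % (len(blob) // 8), blob)
    return list(zip(flat[0::2], flat[1::2]))

def add_shape(conn, atts_id, shp_order, shp_type, points):
    """Insert one 2D shape object; z and m bounds are left as NULL."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cur = conn.execute(
        "insert into shp_object (atts_id, shp_order, shp_type, nvertices,"
        " xmin, ymin, xmax, ymax, vertices) values (?,?,?,?,?,?,?,?,?)",
        (atts_id, shp_order, shp_type, len(points),
         min(xs), min(ys), max(xs), max(ys), pack_vertices(points)))
    return cur.lastrowid
```

Storing the bounding box next to the blob means a reader can reject shapes outside a view with a plain SQL comparison, without unpacking any vertices.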


The need for a .shx file can be eliminated with the shp_order field: just use a select with an "order by shp_order". Another problem that I can't stand is removed as well. Each shape object in a shapefile has to have a corresponding entry in the .dbf attributes, which strikes me as an ugly redundancy. You also have to manually group shapes by checking against a key in the attributes that is not known beforehand and is different for each shapefile. With the atts_id field, you can group all shapes that belong to a particular entity. You could do a nice query like the following:

select * from shp_object as o, shp_atts as a where
a.state='OR' and a.id=o.atts_id order by shp_order;


You now have the vector outline of Oregon. Amazingly simple!
You can run with that idea and, say, add an attribute recording whether the vector is on the shoreline. All of a sudden you can do something like:

select * from shp_object as o, shp_atts as a where
a.state='OR' and a.shoreline=0 and a.id=o.atts_id
order by shp_order;

Now you have the outline of Oregon without the shoreline portion.
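
To make the two queries concrete, here is a small Python sketch that runs them against a toy in-memory database. The rows, the ids and the `shape_ids` helper are invented for illustration, and the tables are trimmed to the columns the queries touch.

```python
import sqlite3

def build_demo():
    """Toy database: two attribute rows for Oregon (inland and shoreline)
    and one for Washington, with shapes inserted out of order."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table shp_atts (id integer primary key,
          state text, shoreline integer);
        create table shp_object (id integer primary key,
          atts_id integer, shp_order integer, vertices blob);
    """)
    conn.executemany(
        "insert into shp_atts (id, state, shoreline) values (?,?,?)",
        [(1, "OR", 0), (2, "OR", 1), (3, "WA", 0)])
    conn.executemany(
        "insert into shp_object (id, atts_id, shp_order) values (?,?,?)",
        [(10, 2, 2), (11, 1, 1), (12, 3, 3), (13, 1, 4)])
    return conn

def shape_ids(conn, state, shoreline=None):
    """Ids of the shapes for one state in shp_order, optionally
    filtered on the (hypothetical) shoreline attribute."""
    sql = ("select o.id from shp_object as o, shp_atts as a"
           " where a.state = ? and a.id = o.atts_id")
    params = [state]
    if shoreline is not None:
        sql += " and a.shoreline = ?"
        params.append(shoreline)
    sql += " order by o.shp_order"
    return [row[0] for row in conn.execute(sql, params)]
```

With this data, `shape_ids(conn, "OR")` returns the three Oregon shapes in drawing order, and `shape_ids(conn, "OR", shoreline=0)` drops the one shoreline segment.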

With the right configuration of attributes, you could even put multiple shapefiles into one shapedb. In fact, that would be very simple and make a lot of sense. Not only that, you could include many different raster formats in the same shapedb with your shape objects. That would create a neat little package for distributing a set of data that belongs together anyway. No more unpacking zip and tar files and getting the file paths correct. Just dump it all into one shapedb and send it out to your customers.

The raster format could be even simpler. You could just have an identifier, format information (like 'image/jpeg') and an image blob. Why not throw the well-known-text projection information into a field as well, or some GML data? The problem with many image formats is the all too common restriction in decompression libraries that the data must reside in a file. For problems like this, you could just dump the data to a temp file which is deleted upon completion of the operation. Using the shapedb format for a 1GB MrSID or ECW file would probably not be the best use of the format anyway, for performance reasons. However, a set of relatively small tif, png, jpg or jp2 files (a few megabytes each) would work fine. You could include multiple resolution levels, a system of image tiles or whatever application-specific use you can think of. If you want the shapedb to be compatible with other applications using the format, though, you should adhere to one of the pre-defined table templates.
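
Here is a rough Python sketch of that raster idea, including the temp file workaround. The `raster` table, its column names and the helper functions are all hypothetical, since no raster template has been defined yet. The callback stands in for whatever file-only decoder you need to call.

```python
import os
import sqlite3
import tempfile

def create_raster_table(conn):
    """Hypothetical raster template: id, format string, WKT, image blob."""
    conn.execute(
        "create table raster (id integer primary key,"
        " format text, wkt text, image blob)")

def add_raster(conn, fmt, wkt, data):
    """Store one encoded image (bytes) and return its row id."""
    cur = conn.execute(
        "insert into raster (format, wkt, image) values (?,?,?)",
        (fmt, wkt, data))
    return cur.lastrowid

def with_temp_file(conn, raster_id, func):
    """Dump the image blob to a temp file, call func(path), and delete
    the file afterwards, even if func raises."""
    row = conn.execute(
        "select image from raster where id = ?", (raster_id,)).fetchone()
    if row is None:
        raise KeyError(raster_id)
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(row[0])
        return func(path)
    finally:
        os.remove(path)
```

A caller would pass the decoder as `func`, for example `with_temp_file(conn, rid, some_decoder)`, where `some_decoder` is any function that takes a file path.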

An important consideration is how easy it would be to convert current shapefiles to a shapedb. The short answer is that it would be almost trivial. The long answer is that, depending on how complex you decide to make the table setup, it could require a little more logic. A straight read of the .shp, .dbf and .shx files, inserting each shape object and attribute list into the respective tables in the order given by the .shx file, would do the trick. You could get slightly more tricky and collapse identical .dbf records so that they are unique, then reference those from the shp_object table, which would improve the speed of your select statements.
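
The collapsing step could look something like the Python sketch below. It assumes the shapefile has already been parsed into (attribute, vertices) pairs in file order. The parsing itself is left out, and the single `state` attribute and the `load_records` name are hypothetical.

```python
import sqlite3

def load_records(conn, records):
    """Insert parsed shapefile records, collapsing identical attribute
    records into one shp_atts row each.

    records: iterable of (state, vertices_blob) in shapefile order.
    Returns the number of unique attribute rows written."""
    conn.executescript("""
        create table shp_atts (id integer primary key, state text);
        create table shp_object (id integer primary key,
          atts_id integer, shp_order integer, vertices blob);
    """)
    seen = {}  # attribute tuple -> shp_atts.id
    for order, (state, blob) in enumerate(records):
        key = (state,)
        atts_id = seen.get(key)
        if atts_id is None:
            cur = conn.execute(
                "insert into shp_atts (state) values (?)", key)
            atts_id = cur.lastrowid
            seen[key] = atts_id
        conn.execute(
            "insert into shp_object (atts_id, shp_order, vertices)"
            " values (?,?,?)", (atts_id, order, blob))
    # Index the grouping key so lookups by entity avoid a full scan.
    conn.execute("create index shp_object_atts on shp_object (atts_id)")
    conn.commit()
    return len(seen)
```

The position of each record in the input becomes its shp_order, which is what replaces the index file in this scheme.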

So in summary, I propose shapedb as a new open format for the interchange of GIS data, which:
- is based on the SQLite database file format
- can be shared across platforms
- can store data from many shapefiles simultaneously
- can store multiple raster files
- can store application- or vendor-specific data
- lets data be accessed and operated upon with SQL statements
- allows new formats by conforming to a table structure "template"

I am currently using this as my data format for EarthBrowser v3 and am considering spinning off an open source library to support the format in other applications.

Let me know what you think!