Sensitivity analysis made (relatively) simple

posted on 19 Jan 2014

Best bang for your analytical buck

As (geo)data scientists, we spend much of our time working with data models that try (with varying degrees of success) to capture some essential truth about the world while still being as simple as possible to provide a useful abstraction. Inevitably, complexity starts to creep into every model and we don’t often stop to assess the value added by that complexity. When working with models that require a large number of parameters and a huge domain of potential inputs that are expensive to collect, it becomes difficult to answer the question:

What parameters of the model are the most sensitive?

In other words, if I am going to spend my resources obtaining or refining data for this model, where should I focus my efforts to get the best bang for the buck? If I spend weeks deriving a single parameter for the model, I want some assurance that the parameter is critically important to the model’s prediction. The flip side, of course, is that if a parameter has little effect on the model’s predictive power, I could save time by settling for a quick-and-dirty approximation.

SALib: a python module for testing model sensitivity

I was thrilled to find SALib, which implements a number of vetted methods for quantitatively assessing parameter sensitivity. There are three basic steps to running SALib:

  1. Define the parameters to test, define their domain of possible values and generate n sets of randomized input parameters.
  2. Run the model n times and capture the results.
  3. Analyze the results to identify the most/least sensitive parameters.

I’ll leave the details of these steps to the SALib documentation. The beauty of the SALib approach is that you have the flexibility[1] to run any model in any way you want, so long as you can manipulate the inputs and outputs adequately.
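
As a rough sketch (not the code I actually ran), the three steps map onto SALib’s Python API something like this. It assumes SALib’s Saltelli sampler and Sobol analyzer plus a hypothetical run_model() function that returns a single output value per parameter set; the exact API has shifted a bit between SALib versions:

import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# 1. Define the parameters, their ranges, and generate n sets of inputs
problem = {
    'num_vars': 2,
    'names': ['param_a', 'param_b'],        # illustrative parameter names
    'bounds': [[0.0, 1.0], [0.0, 1.0]],
}
param_values = saltelli.sample(problem, 1000)

# 2. Run the model once per sample and capture the results
Y = np.array([run_model(a, b) for a, b in param_values])   # run_model is hypothetical

# 3. Analyze: Si['S1'] holds first-order and Si['ST'] total-order indices
Si = sobol.analyze(problem, Y)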

Case Study: Climate effects on forestry

I wanted to compare a forest growth and yield model under different climate change scenarios in order to determine which climate-related variables the model was most sensitive to. I identified 4 variables:

  • Climate model (4 global circulation models)
  • Representative Concentration Pathways (RCPs; 3 different emission trajectories)
  • Mortality factor for species viability (0 to 1)
  • Mortality factor for equivalent elevation change (0 to 1)

In this case, I was using the Forest Vegetation Simulator (FVS), which requires a configuration file for every model iteration. So, for Step 2, I had to iterate through each set of input variables and use them to generate an appropriate configuration file. This involved translating the real numbers from the samples into categorical variables in some cases. Finally, in order to get the result of each model iteration, I had to parse the outputs of FVS and do some post-processing to obtain the variable of interest (the average volume of standing timber over 100 years). So the flexibility of SALib comes at a slight cost: unless your model works directly with the file format SALib expects, the inputs and outputs may require some data manipulation.
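
That translation step is essentially binning: each sampled real number has to be mapped onto one of the categorical choices before it can be written into the configuration file. A hypothetical sketch of that mapping (the category names and sampling bounds here are illustrative, not my actual configuration values):

# Hypothetical sketch: convert one row of sampled values into model inputs.
# Assumes the first two parameters were sampled on [0, 4) and [0, 3) so that
# truncating them selects a category; the mortality factors are used directly.
CIRCULATION_MODELS = ['gcm_a', 'gcm_b', 'gcm_c', 'gcm_d']   # illustrative names
RCPS = ['rcp_low', 'rcp_mid', 'rcp_high']                   # illustrative names

def sample_to_config(row):
    circ, rcp, mortviab, mortelev = row
    return {
        'circulation': CIRCULATION_MODELS[min(int(circ), 3)],
        'rcp': RCPS[min(int(rcp), 2)],
        'mortviab': mortviab,   # 0 to 1 factor, used as-is
        'mortelev': mortelev,   # 0 to 1 factor, used as-is
    }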

After running all the required iterations of the model[2], I was able to analyze the results and assess the sensitivity of the four parameters.

Here’s the output of SALib’s analysis (formatted slightly for readability):

Parameter    First_Order First_Order_Conf Total_Order Total_Order_Conf
circulation  0.193685    0.041254         0.477032    0.034803
rcp          0.517451    0.047054         0.783094    0.049091
mortviab    -0.007791    0.006993         0.013050    0.007081
mortelev    -0.005971    0.005510         0.007162    0.006693

The first order effects represent the effect of that parameter alone. The total order effects are arguably more relevant to understanding the overall interaction of that parameter with your model. The “Conf” columns represent confidence and can be interpreted as error bars.
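
For reference, these are Sobol-style sensitivity indices: the first-order index of a parameter X is Var(E[Y|X]) / Var(Y), the share of output variance explained by X on its own, while the total-order index, 1 - Var(E[Y|X_rest]) / Var(Y) (where X_rest is every parameter except X), also counts all interactions involving X. Small negative first-order values like those above are a numerical artifact of estimating an effect that is essentially zero.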

In this case, we interpret the output as follows:

Parameter    Total Order Effect   
circulation  0.47  +- 0.03  (moderate influence)      
rcp          0.78  +- 0.05  (dominant parameter)
mortviab     0.01  +- 0.007 (weak influence)
mortelev     0.007 +- 0.006 (weak influence)

We can graph each of the input parameters against the results to visualize this:

[Figure: model results plotted against each of the four input parameters]

Note that the ‘mortelev’ component is basically flat (as the factor increases, the result stays the same) whereas the choice of ‘rcp’ has a heavy influence (as emissions increase to the highest level, the resulting predictions for timber volume are noticeably lower).

The conclusion is that the climate variables, particularly the RCPs related to human-caused emissions, were the strongest determinants[3] of tree growth for this particular forest stand. This ran counter to our initial intuition that the mortality factors would play a large role in the model. Based on this sensitivity analysis, we may be able to avoid wasting effort on refining parameters that are of minor consequence to the output.


Footnotes:

  1. Compared to more tightly integrated, model-specific methods of sensitivity analysis
  2. 20 thousand iterations took approximately 8 hours; sensitivity analysis generally requires lots of processing
  3. Note that the influence of a parameter says nothing about direct causality

Leaflet Simple CSV

posted on 30 Sep 2013

Simple Leaflet-based template for mapping tabular point data on a slippy map

Anyone who’s worked with spatial data and the web has run across the need to take some simple tabular data and put points on an interactive map. It’s the fundamental “Hello World” of web mapping. Yet I always find myself spending way too much time solving this seemingly simple problem. When you consider zoom levels, attributes, interactivity, clustering, querying, etc., it becomes apparent that interactive maps require a bit more legwork. But that functionality is fairly consistent from case to case, so I’ve developed a generalized solution that works for the majority of basic use cases out there:

leaftlet-simple-csv on github

The idea is pretty generic but useful for most point marker maps:

  • Data is in tabular delimited-text (csv, etc.) with two required columns: lat and lng
  • Points are plotted on a full-screen Leaflet map
  • Point markers are clustered dynamically based on zoom level
  • Clicking on a point cluster will zoom into the extent of the underlying features
  • Hovering on a point will display the name
  • Clicking a point will display a popup with columns/properties displayed as an html table
  • Full text filtering with typeahead
  • Completely client-side javascript with all dependencies included or linked via CDN

Of course this is mostly just a packaged version of existing work, namely Leaflet with the geoCSV and markercluster plugins.

Usage

  1. Grab the leaflet-simple-csv zip file and unzip it to a location accessible through a web server.
  2. Copy the config.js.template to config.js
  3. Visit the index.html page to confirm everything is working with the built-in example.
  4. Customize your config.js for your dataset.

An example config:

var dataUrl = 'data/data.csv';
var maxZoom = 9;
var fieldSeparator = '|';
var baseUrl = 'http://otile{s}.mqcdn.com/tiles/1.0.0/osm/{z}/{x}/{y}.jpg';
var baseAttribution = 'Data, imagery and map information provided by <a href="http://open.mapquest.co.uk" target="_blank">MapQuest</a>, <a href="http://www.openstreetmap.org/" target="_blank">OpenStreetMap</a> and contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/" target="_blank">CC-BY-SA</a>';
var subdomains = '1234';
var clusterOptions = {showCoverageOnHover: false, maxClusterRadius: 50};
var labelColumn = "Name";
var opacity = 1.0;

The example dataset:

Country|Name|lat|lng|Altitude
United States|New York City|40.7142691|-74.0059738|2.0
United States|Los Angeles|34.0522342|-118.2436829|115.0
United States|Chicago|41.8500330|-87.6500549|181.0
United States|Houston|29.7632836|-95.3632736|15.0
...

I make no claims that this is the “right” way to do it, but leveraging 100% client-side javascript libraries and native delimited-text formats seems like the simplest approach. Many of the features included here (clustering, filtering) apply to most basic use cases, so hopefully you’ll find it useful.


Python raster stats

posted on 24 Sep 2013

A common task in many of my data workflows involves summarizing geospatial raster datasets based on vector geometries (i.e. zonal statistics). Despite many alternatives (starspan, the QGIS Zonal Statistics plugin, ArcPy, and R), there were none that were:

  • open source
  • fast enough
  • flexible enough
  • worked with python data structures

We’d written a wrapper around starspan for madrona (see madrona.raster_stats) but relying on shell calls and an aging, unmaintained C++ code base was not cutting it.

So I set out to create a solution using numpy, GDAL and python. The rasterstats package was born.

`python-raster-stats` on github

Example

Let’s jump into an example. I’ve got a polygon shapefile of continental US state boundaries and a raster dataset of annual precipitation from the North American Environmental Atlas.

[Figure: continental US state boundaries overlaid on the annual precipitation raster]

>>> states = "/data/workspace/rasterstats_blog/boundaries_contus.shp"
>>> precip = "/data/workspace/rasterstats_blog/NA_Annual_Precipitation_GRID/NA_Annual_Precipitation/data/na_anprecip/hdr.adf"

The raster_stats function is the main entry point. Provide a vector and a raster as input and expect a list of dicts, one for each input feature.

>>> from rasterstats import raster_stats
>>> rain_stats = raster_stats(states, precip, stats="*", copy_properties=True)
>>> len(rain_stats)  # continental US; 48 states plus District of Columbia

49

Print out the stats for a given state:

>>> [x for x in rain_stats if x['NAME'] == "Oregon"][0]

{'COUNTRY': 'USA',
 'EDIT': 'NEW',
 'EDIT_DATE': '20060803',
 'NAME': 'Oregon',
 'STATEABB': 'US-OR',
 'Shape_Area': 250563567264.0,
 'Shape_Leng': 2366783.00361,
 'UIDENT': 124704,
 '__fid__': 35,
 'count': 250510,
 'majority': 263,
 'max': 3193.0,
 'mean': 779.2223903237395,
 'median': 461.0,
 'min': 205.0,
 'minority': 3193,
 'range': 2988.0,
 'std': 631.539502512283,
 'sum': 195203001.0,
 'unique': 2865}

Find the three driest states:

>>> [(x['NAME'], x['mean']) for x in 
       sorted(rain_stats, key=lambda k: k['mean'])[:3]]

[('Nevada', 248.23814034118908),
 ('Utah', 317.668743027571),
 ('Arizona', 320.6157232064074)]

And write the data out to a csv.

from rasterstats import stats_to_csv
with open('out.csv', 'w') as fh:
    fh.write(stats_to_csv(rain_stats))

Geo interface

The basic usage above shows the path of an entire OGR vector layer as the first argument. But raster-stats also supports other vector features/geometries.

  • Well-Known Text/Binary
  • GeoJSON string and mappings
  • Any python object that supports the __geo_interface__ protocol
  • Single objects or iterables

In this example, I use a geojson-like python mapping to specify a single geometry:

>>> geom = {'coordinates': [[
   [-594335.108537269, -570957.932799394],
   [-422374.54395311, -593387.5716581973],
   [-444804.1828119133, -765348.1362423564],
   [-631717.839968608, -735441.9510972851],
   [-594335.108537269, -570957.932799394]]],
 'type': 'Polygon'}
 
>>> raster_stats(geom, precip, stats="min median max")
 
[{'__fid__': 0, 'max': 1011.0, 'median': 451.0, 'min': 229.0}]

Categorical

We’re not limited to descriptive statistics for continuous rasters either; we can get unique pixel counts for categorical rasters as well. In this example, we’ve got a raster of 2005 land cover (i.e. general vegetation type).

[Figure: continental US state boundaries overlaid on the 2005 land cover raster]

Note that we specify only the stats that make sense for categorical data, and that categorical=True adds a count of each unique pixel value.

>>> landcover = "/data/workspace/rasterstats_blog/NA_LandCover_2005.img"
>>> veg_stats = raster_stats(states, landcover, 
    stats="count majority minority unique",
    copy_properties=True,
    nodata_value=0,
    categorical=True)
>>> [x for x in veg_stats if x['NAME'] == "Oregon"][0]

{1: 999956,
 3: 6,
 5: 3005,
 6: 198535,
 8: 2270805,
 10: 126199,
 14: 20883,
 15: 301884,
 16: 17452,
 17: 39246,
 18: 28872,
 19: 2174,
 'COUNTRY': 'USA',
 'EDIT': 'NEW',
 'EDIT_DATE': '20060803',
 'NAME': 'Oregon',
 'STATEABB': 'US-OR',
 'Shape_Area': 250563567264.0,
 'Shape_Leng': 2366783.00361,
 'UIDENT': 124704,
 '__fid__': 35,
 'count': 4009017,
 'majority': 8,
 'minority': 3,
 'unique': 12}

Of course the pixel values alone don’t make much sense. We need to interpret the pixel values as land cover classes:

Value   Class_name
1       Temperate or sub-polar needleleaf forest
2       Sub-polar taiga needleleaf forest
3       Tropical or sub-tropical broadleaf evergreen
4       Tropical or sub-tropical broadleaf deciduous
5       Temperate or sub-polar broadleaf deciduous
6       Mixed Forest
7       Tropical or sub-tropical shrubland
8       Temperate or sub-polar shrubland
9       Tropical or sub-tropical grassland
10      Temperate or sub-polar grassland
11      Sub-polar or polar shrubland-lichen-moss
12      Sub-polar or polar grassland-lichen-moss
13      Sub-polar or polar barren-lichen-moss
14      Wetland
15      Cropland
16      Barren Lands
17      Urban and Built-up
18      Water
19      Snow and Ice

So, for our Oregon example above we can see that, despite Oregon’s reputation as a lush green landscape, the majority land cover class (#8) is “Temperate or sub-polar shrubland” at roughly 2.27 million of the 4 million total pixels.
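
To make the counts readable, the integer-keyed pixel counts can be joined to the class names. A small sketch, assuming the veg_stats result from above and a classes dict mapping pixel value to class name (built from the table):

# Sketch: report named land cover percentages for one feature.
# `classes` is assumed to map pixel value -> class name,
# e.g. {8: 'Temperate or sub-polar shrubland', ...}
oregon = [x for x in veg_stats if x['NAME'] == "Oregon"][0]
pixel_counts = dict((k, v) for k, v in oregon.items() if isinstance(k, int))
total = float(sum(pixel_counts.values()))
for value, count in sorted(pixel_counts.items(), key=lambda kv: kv[1], reverse=True):
    print("%-45s %5.1f%%" % (classes.get(value, value), 100 * count / total))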

There’s a lot more functionality that isn’t covered in this post but you get the picture… please check it out and let me know what you think.

Creating UTFGrids directly from a polygon datasource

posted on 20 Aug 2012

We’ve begun to rely on the interactivity provided by UTFGrids in many of our recent web maps. (Quick recap: UTFGrids are “invisible” map tiles that allow direct interactivity with feature attributes without querying the server.) Earlier this year, I created the initial OpenLayers UTFGrid support and was glad to see it accepted into OpenLayers 2.12 (with some enhancements).

With the client-side javascript support in place, the only missing piece in the workflow was to create the UTFGrid .json files. We had experimented with several alternative UTFGrid renderers but Mapnik’s rendering was by far the fastest and produced the best results. Using TileMill was a convenient way to leverage the Mapnik UTFGrid renderer but it came at the cost of a somewhat circuitous and manual workflow:

  1. Load the data into TileMill
  2. Configure interactivity fields
  3. Export to .mbtiles
  4. Convert to .json files

What we really needed was a script to take a polygon shapefile and render the UTFGrids directly to files. Mapnik would provide the rendering while the Global Map Tiles python module would provide the logic for going back and forth between geographic coordinates and tile grid coordinates. From there it’s just a matter of determining the extent of the dataset and, for a specified set of zoom levels, looping through and using Mapnik to render the UTFGrid to a .json file in a Z/X/Y.json directory structure.
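
The core loop ends up looking roughly like the sketch below. This is a simplification, not the actual create-utfgrids code: it assumes a prepared Mapnik XML style for the layer (the style filename here is hypothetical), Klokan’s globalmaptiles.py on the python path, and Mapnik 2.x python bindings (whose grid-encoding API can vary slightly by version):

import os, json
import mapnik
from osgeo import ogr
from globalmaptiles import GlobalMercator

# Extent of the (already web mercator) shapefile;
# note OGR's (minx, maxx, miny, maxy) ordering
ds = ogr.Open('test_data/bailey_merc.shp')
minx, maxx, miny, maxy = ds.GetLayer(0).GetExtent()

m = mapnik.Map(256, 256)
mapnik.load_map(m, 'ecoregions_style.xml')   # hypothetical Mapnik XML style for the layer
merc = GlobalMercator()

for zoom in range(3, 6):
    tminx, tminy = merc.MetersToTile(minx, miny, zoom)
    tmaxx, tmaxy = merc.MetersToTile(maxx, maxy, zoom)
    for tx in range(tminx, tmaxx + 1):
        for ty in range(tminy, tmaxy + 1):
            # zoom the map to this tile's bounds and render just the UTFGrid
            m.zoom_to_box(mapnik.Box2d(*merc.TileBounds(tx, ty, zoom)))
            grid = mapnik.Grid(256, 256)
            mapnik.render_layer(m, grid, layer=0, fields=['dom_desc', 'div_desc'])
            gy = (2 ** zoom - 1) - ty            # flip TMS y to a top-origin (OSM/Google) y
            tiledir = os.path.join('ecoregions', str(zoom), str(tx))
            if not os.path.exists(tiledir):
                os.makedirs(tiledir)
            with open(os.path.join(tiledir, '%d.json' % gy), 'w') as fh:
                json.dump(grid.encode('utf', resolution=4), fh)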

Get `create-utfgrids` on github

If we have a mercator polygon shapefile of ecoregions and want to render UTFGrids for zoom levels 3 through 5 using the dom_desc and div_desc attributes, we could use a command like

$ ./create_utfgrids.py test_data/bailey_merc.shp 3 5 ecoregions -f dom_desc,div_desc

WARNING:
This script assumes a polygon shapefile in spherical mercator projection.
If any of these assumptions are not true, don't count on the results!
 * Processing Zoom Level 3
 * Processing Zoom Level 4
 * Processing Zoom Level 5

and inspect the output (e.g. zoom level 5, X=20, Y=18)

$ cat ecoregions/5/20/18.json | python -mjson.tool
{
    "data": {
        "192": {
            "div_desc": "RAINFOREST REGIME MOUNTAINS", 
            "dom_desc": "HUMID TROPICAL DOMAIN"
        }, 
...
    "grid": [
        "  !!!!!!!!!#####$%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%", 
...

Some caveats:

  • This currently only works for polygon datasets in a Web Mercator projection.
  • It’s only tested with shapefiles as it assumes a single-layer datasource at the moment. Full OGR Datasource support would not be too difficult to add for PostGIS, etc.
  • It assumes a top-origin tile scheme (as do OSM and Google Maps). Supporting TMS bottom-origin schemes in the future should be straightforward.
  • Requires OGR and Mapnik >= 2.0 with python bindings. Finding Windows binaries for the required version of Mapnik may be difficult, so using OS X/Linux is recommended at this time.

Many thanks to Dane Springmeyer for his help on UTFGrid related matters and to Klokan Petr Přidal for his MapTiler docs.

Introducing the Madrona framework

posted on 11 Jul 2012

Madrona: A software framework for effective place-based decision making

[Image: Madrona logo]

My work at Ecotrust mainly revolves around creating web-based spatial analysis tools - software to bring data-driven science to the place-based decision making process. This began several years ago when I joined the MarineMap team. Since working with Ecotrust, we’ve taken the MarineMap software far beyond its original niche. What was once a specific tool for marine protected area planning has now become a powerful framework for all sorts of web-based spatial tools in the realms of marine planning, forestry, conservation planning, aquatic habitat restoration, and more. So, in a sense, Madrona is a recognition of that evolution.

From the official Madrona release announcement on the Ecotrust blog:

Over the last year we’ve distilled the best ideas from our most successful tools into a suite of software building blocks that can be mixed and matched to create cutting-edge software for decision support and spatial planning at any scale. These building blocks are already at the heart of our work and now we’re ready to share them with you.

So what is Madrona from a developer’s perspective?

  • A set of python django apps that provide models, views and templates for representing spatial features and solving problems specific to spatial decision tools.
  • A RESTful API for accessing spatial features
  • A collection of javascript libraries (based on jQuery) to provide a web-based interface to the API.

In short, we think it’s a great platform for spatial tools and we want to open it up to a wider developer audience. Ecotrust already has many Madrona-based apps in the wild (with many more in development) but we’re hoping to get other folks using (and contributing to) the Madrona framework in the future.

I know this post is short on technical details but there will be more to come … for now, check out the technology page for an overview or the developer’s page to dive in.
