Sensitivity analysis made (relatively) simple

posted on 19 Jan 2014

Best bang for your analytical buck

As (geo)data scientists, we spend much of our time working with data models that try (with varying degrees of success) to capture some essential truth about the world while remaining simple enough to be a useful abstraction. Inevitably, complexity creeps into every model, and we rarely stop to assess the value added by that complexity. When a model requires a large number of parameters drawn from a huge domain of potential inputs that are expensive to collect, it becomes difficult to answer the question:

Which parameters of the model are the most sensitive?

In other words, if I am going to spend my resources obtaining or refining data for this model, where should I focus my efforts in order to get the best bang for the buck? If I spend weeks deriving a single parameter for the model, I want some assurance that the parameter is critically important to the model’s prediction. The flip side, of course, is that if a parameter is not that important to the model’s predictive power, I can save time by using a quick-and-dirty approximation instead.

SALib: a Python module for testing model sensitivity

I was thrilled to find SALib, which implements a number of vetted methods for quantitatively assessing parameter sensitivity. There are three basic steps to running SALib:

  1. Define the parameters to test, define their domains of possible values, and generate n sets of randomized input parameters.
  2. Run the model n times and capture the results.
  3. Analyze the results to identify the most/least sensitive parameters.

I’ll leave the details of these steps to the SALib documentation. The beauty of the SALib approach is that you have the flexibility[1] to run any model in any way you want, so long as you can manipulate the inputs and outputs adequately.
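
In outline, the three steps might look like the following minimal sketch, using SALib's Saltelli sampler and Sobol analyzer (the parameter names mirror the case study below, and run_model is a placeholder for whatever wraps your actual model):

from SALib.sample import saltelli
from SALib.analyze import sobol
import numpy as np

# Step 1: define the parameters, their domains of possible values,
# and generate n sets of randomized inputs.
problem = {
    'num_vars': 4,
    'names': ['circulation', 'rcp', 'mortviab', 'mortelev'],
    'bounds': [[0, 4], [0, 3], [0, 1], [0, 1]],
}
param_values = saltelli.sample(problem, 1000)

# Step 2: run the model once per sample and capture a scalar result.
def run_model(x):
    # placeholder: replace with a call to your actual model
    return x[0] + 2 * x[1]

Y = np.array([run_model(X) for X in param_values])

# Step 3: analyze the results to estimate sensitivity indices.
Si = sobol.analyze(problem, Y)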

Case Study: Climate effects on forestry

I wanted to compare a forest growth and yield model under different climate change scenarios in order to assess which climate-related variables were the most sensitive. I identified four variables:

  • Climate model (4 global circulation models)
  • Representative Concentration Pathways (RCPs; 3 different emission trajectories)
  • Mortality factor for species viability (0 to 1)
  • Mortality factor for equivalent elevation change (0 to 1)

In this case, I was using the Forest Vegetation Simulator (FVS), which requires a configuration file for every model iteration. So, for Step 2, I had to iterate through each set of input variables and use them to generate an appropriate configuration file. In some cases this meant translating the real numbers from the samples into categorical variables. Finally, to get the result of each model iteration, I had to parse the outputs of FVS and do some post-processing to obtain the variable of interest (the average volume of standing timber over 100 years). So the flexibility of SALib comes at a slight cost: unless your model works directly with the file formats SALib expects, the inputs and outputs may require some data manipulation.
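
For example, mapping a continuous sample value onto one of the categorical inputs might look something like this (a hypothetical sketch; the GCM and RCP names are placeholders, and the real FVS configuration step involved more than this):

# Hypothetical translation of continuous sample values into the
# categorical and numeric inputs used to build an FVS config file.
GCMS = ['gcm_a', 'gcm_b', 'gcm_c', 'gcm_d']   # placeholder model names
RCPS = ['rcp_low', 'rcp_mid', 'rcp_high']     # placeholder trajectories

def translate_sample(sample):
    circulation, rcp, mortviab, mortelev = sample
    return {
        'circulation': GCMS[min(int(circulation), len(GCMS) - 1)],
        'rcp': RCPS[min(int(rcp), len(RCPS) - 1)],
        'mortviab': mortviab,   # already continuous on [0, 1]
        'mortelev': mortelev,
    }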

After running all the required iterations of the model[2], I was able to analyze the results and assess the sensitivity of the four parameters.

Here’s the output of SALib’s analysis (formatted slightly for readability):

Parameter    First_Order First_Order_Conf Total_Order Total_Order_Conf
circulation  0.193685    0.041254         0.477032    0.034803
rcp          0.517451    0.047054         0.783094    0.049091
mortviab    -0.007791    0.006993         0.013050    0.007081
mortelev    -0.005971    0.005510         0.007162    0.006693

The first order effects represent the contribution of each parameter on its own; the total order effects also capture that parameter’s interactions with the others, which arguably makes them more relevant to understanding how the parameter behaves within the model as a whole. The “Conf” columns are confidence intervals on those indices and can be interpreted as error bars.
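
In code, those columns correspond to keys in the dictionary returned by sobol.analyze; continuing the sketch above, they can be printed like so:

# Si holds arrays keyed by 'S1', 'S1_conf', 'ST', and 'ST_conf'.
for i, name in enumerate(problem['names']):
    print("%s: S1=%.3f (conf %.3f), ST=%.3f (conf %.3f)" % (
        name, Si['S1'][i], Si['S1_conf'][i], Si['ST'][i], Si['ST_conf'][i]))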

In this case, we interpret the output as follows:

Parameter    Total Order Effect   Interpretation
circulation  0.47 ± 0.03          moderate influence
rcp          0.78 ± 0.05          dominant parameter
mortviab     0.01 ± 0.007         weak influence
mortelev     0.007 ± 0.006        weak influence

We can graph each of the input parameters against the results to visualize this:

[Figure: each input parameter plotted against the resulting timber volume]

Note that the ‘mortelev’ component is basically flat (as the factor increases, the result stays the same), whereas the choice of ‘rcp’ has a heavy influence (as emissions increase to the highest level, the predicted timber volumes noticeably decrease).

The conclusion is that the climate variables, particularly the RCPs related to human-caused emissions, were the strongest determinants[3] of tree growth for this particular forest stand. This ran counter to our initial intuition that the mortality factors would play a large role in the model. Based on this sensitivity analysis, we can avoid wasting effort refining parameters that are of minor consequence to the output.


Footnotes:

  1. Compared to more tightly integrated, model-specific methods of sensitivity analysis
  2. 20 thousand iterations took approximately 8 hours; sensitivity analysis generally requires lots of processing
  3. Note that the influence of a parameter says nothing about direct causality

Leaflet Simple CSV

posted on 30 Sep 2013

Simple Leaflet-based template for mapping tabular point data on a slippy map

Anyone who’s worked with spatial data and the web has run across the need to take some simple tabular data and put points on an interactive map. It’s the fundamental “Hello World” of web mapping. Yet I always find myself spending way too much time solving this seemingly simple problem. When you consider zoom levels, attributes, interactivity, clustering, querying, etc., it becomes apparent that interactive maps require a bit more legwork. But that functionality is fairly consistent from case to case, so I’ve developed a generalized solution that works for the majority of basic use cases out there:

leaflet-simple-csv on GitHub

The idea is pretty generic but useful for most point marker maps:

  • Data is in tabular delimited text (CSV, etc.) with two required columns: lat and lng.
  • Points are plotted on a full-screen Leaflet map.
  • Point markers are clustered dynamically based on zoom level.
  • Clicking on a point cluster zooms to the extent of the underlying features.
  • Hovering over a point displays its name.
  • Clicking a point displays a popup with its columns/properties rendered as an HTML table.
  • Full-text filtering with typeahead.
  • Completely client-side JavaScript with all dependencies included or linked via CDN.

Of course this is mostly just a packaged version of existing work, namely Leaflet with the geoCSV and markercluster plugins.

Usage

  1. Grab the leaflet-simple-csv zip file and unzip it to a location accessible through a web server.
  2. Copy config.js.template to config.js.
  3. Visit the index.html page to confirm everything is working with the built-in example.
  4. Customize your config.js for your dataset.
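
Since the page loads the CSV with an AJAX request, most browsers will want it served over HTTP rather than opened directly from the filesystem. For quick local testing, Python’s built-in static server works (run it from the unzipped directory):

python -m SimpleHTTPServer 8000   # Python 2
python -m http.server 8000        # Python 3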

An example config:

var dataUrl = 'data/data.csv';   // path to the delimited-text data file
var maxZoom = 9;                 // maximum zoom level
var fieldSeparator = '|';        // column delimiter used in the data file
var baseUrl = 'http://otile{s}.mqcdn.com/tiles/1.0.0/osm/{z}/{x}/{y}.jpg';  // base tile layer URL template
var baseAttribution = 'Data, imagery and map information provided by <a href="http://open.mapquest.co.uk" target="_blank">MapQuest</a>, <a href="http://www.openstreetmap.org/" target="_blank">OpenStreetMap</a> and contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/" target="_blank">CC-BY-SA</a>';
var subdomains = '1234';         // tile server subdomains substituted for {s}
var clusterOptions = {showCoverageOnHover: false, maxClusterRadius: 50};  // passed to the markercluster plugin
var labelColumn = "Name";        // column displayed when hovering over a point
var opacity = 1.0;               // marker opacity

The example dataset:

Country|Name|lat|lng|Altitude
United States|New York City|40.7142691|-74.0059738|2.0
United States|Los Angeles|34.0522342|-118.2436829|115.0
United States|Chicago|41.8500330|-87.6500549|181.0
United States|Houston|29.7632836|-95.3632736|15.0
...

I make no claims that this is the “right” way to do it, but leveraging 100% client-side JavaScript libraries and native delimited-text formats seems like the simplest approach. Many of the features included here (clustering, filtering) apply to most situations, so hopefully you’ll find it useful.