An Open Citizen Science Data Platform

Request for Comments

We are building a platform for storing, viewing, and sharing citizen science data. The platform will allow anyone anywhere to create online data sets by uploading data from their own environmental sensors, mobile devices, do-it-yourself science equipment, and other measurement tools.

The system will provide a web-based interface for exploring the data, including maps and graphs (time series, histograms, and scatter plots). We will include tools for editing data set attributes and creating data entry forms, so that anyone can set up a web-based community data entry project.

The platform will be entirely open source. We will provide both a working service and the ability to clone our service. You will be able to deploy a copy of our system to a platform-as-a-service (e.g. OpenShift) in a few simple steps. You will never be locked into the system, because you can freely transfer your data and copy our code.

Example Applications

Students can attach sensors to their backpacks, bicycles, and skateboards for measuring their commutes to and from school, including the air quality, noise, weather, and movement. They can explore and analyze the data in class or at home. They can share the data; a student in Kansas can compare her commute with a student in New York City.

Citizens around a polluted bay can set up stations that measure water temperature, salinity, turbidity, and dissolved oxygen. They can also enter manual observations of jellyfish, birds, and unusual storms. The data can be used to help understand ecosystem changes and to augment community outreach related to damaging runoff.

A middle school science class can set up a table-top ecosystem that cycles water between a fish tank and plant growing boxes. Sensors can measure changes in pH, conductivity, and other water attributes. Students can learn about nitrogen cycles and ecosystem dynamics. The sensors and data viewer become a classroom tool, like a microscope, that uses local observation to illustrate larger phenomena.

A biohacker community group can augment their do-it-yourself biology equipment (thermocycler, incubator, algae bioreactor) with sensors that feed data into a public repository, where their experiments can be documented, observed by the public, and later replicated by others.

Primary Features

  • Anyone can create a data set and upload data to it.
  • We provide a web-based interface for viewing the data with maps and graphs (time series, histograms, scatter plots).
  • The system is completely open source. We let you clone our server configuration and give you instructions on how to do so.
  • Data can be spatial, temporal, or neither (e.g. a list of attributes of different plants).
  • Data can be freely downloaded and freely shared (subject to the license attached to the data).
  • Users can configure user-friendly data entry forms for any data set.
  • A REST API provides access to all data and metadata.
  • Low cost: small data sets are free and large data sets are affordable. (If you do not want to pay anything, you can easily set up your own server to host your data.)

Specific Questions

  • What terminology should we use? Field or channel or column? Row or record? Table or data set?
  • What kinds of geospatial data and operations should be supported?
  • License per data set or per record?
  • What can we do to allow the system to handle huge data sets?
  • What should we call it? ManyData? Sensaur?

Preliminary Design Decisions

  • Every field has a name.
  • Every record has an ID number (not necessarily sequential or ascending).
  • Every field has metadata (type and units).
  • User interface metadata (decimal places, allowed entry options, etc.) will be stored with this field metadata.
  • A data set can have missing values.
  • A data set can have multiple timestamp fields, but only one primary timestamp field (to be used when querying by timestamp).
  • A data set can have multiple latitude and longitude fields, but only one primary lat-long pair.
  • Simple devices (e.g. Arduinos) can submit data via HTTP using a private access key hashed with the data record.
  • If a record is submitted twice (i.e. all of its fields have the same values as the previous record), it will be created only once. A data set created by uploading a file may still contain duplicate records (all field values the same).
  • A data set will have permissions: users can assign view, edit, append, and admin permissions to other users (and possibly groups of users).
  • The API will allow adding, deleting, and reordering fields in existing data sets. Such a change will lock the data set until all records have been updated.
  • A field value can refer to an image or other file.
  • A data set can be configured to automatically route new records to other server instances or outside services (from among a list of supported services).
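
As a rough illustration of the device submission and duplicate handling described above, here is a minimal Python sketch. It is not our implementation. It assumes the "private access key hashed with the data record" is an HMAC-SHA256 over a canonical JSON form of the record, and it treats a record as a duplicate if it matches any earlier submission (one possible reading of the rule). The names canonical_json, sign_record, verify_record, and DataSet are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_json(record: dict) -> bytes:
    """Serialize a record deterministically so equal records give equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(access_key: str, record: dict) -> str:
    """Device side: HMAC-SHA256 of the record, keyed by the private access key."""
    key = access_key.encode("utf-8")
    return hmac.new(key, canonical_json(record), hashlib.sha256).hexdigest()


def verify_record(access_key: str, record: dict, signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_record(access_key, record), signature)


class DataSet:
    """Hypothetical in-memory stand-in for a data set receiving device submissions."""

    def __init__(self, access_key: str):
        self.access_key = access_key
        self.records = []
        self._seen = set()  # digests of records already stored

    def submit(self, record: dict, signature: str) -> bool:
        """Store the record; return False if it duplicates an earlier submission."""
        if not verify_record(self.access_key, record, signature):
            raise PermissionError("signature does not match access key")
        digest = hashlib.sha256(canonical_json(record)).hexdigest()
        if digest in self._seen:
            return False  # assumption: same field values means same record
        self._seen.add(digest)
        self.records.append(record)
        return True
```

With this approach the access key itself never travels over HTTP, only the signature does. A real design would likely also need a timestamp or counter in the record to guard against replayed requests; that is left open here.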

Preliminary Implementation

  • The preliminary implementation is fairly simple: everything is stored in a single Postgres database.
  • Data set table: stores metadata about a data set, including license, timezone, permissions, and column metadata.
  • Data set record table: stores each record as a database row. This table includes a data set ID, record ID, primary timestamp field (may be blank), primary latitude/longitude fields (may be blank), and field values (as a JSON text field).
  • Our server is implemented in Python using Django. We plan to provide a complete Django system that can be immediately deployed to a new platform (OpenShift, Heroku, etc.).
  • In addition to API documentation, we'll provide internal documentation so that anyone can create new Django packages that extend the system's functionality.
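
To make the two-table layout above more concrete, here is a small Python sketch. It uses the standard library's sqlite3 module as a stand-in for Postgres so that it runs without a server, and all table, column, and function names are assumptions for illustration, not our actual schema. Permissions are omitted for brevity.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the two tables described above (names are hypothetical)."""
    conn.executescript("""
        CREATE TABLE dataset (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            license TEXT,
            timezone TEXT,
            field_meta TEXT NOT NULL      -- JSON: type and units per field
        );
        CREATE TABLE dataset_record (
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            record_id INTEGER NOT NULL,
            primary_timestamp TEXT,       -- ISO 8601 string, may be NULL
            primary_latitude REAL,        -- may be NULL
            primary_longitude REAL,       -- may be NULL
            field_values TEXT NOT NULL,   -- JSON text holding all field values
            PRIMARY KEY (dataset_id, record_id)
        );
    """)


def add_dataset(conn, name, field_meta, license=None, timezone=None) -> int:
    """Insert a data set row and return its ID."""
    cur = conn.execute(
        "INSERT INTO dataset (name, license, timezone, field_meta) VALUES (?, ?, ?, ?)",
        (name, license, timezone, json.dumps(field_meta)),
    )
    return cur.lastrowid


def add_record(conn, dataset_id, record_id, values, timestamp=None, lat=None, lon=None):
    """Insert one record; the primary timestamp and lat-long may be left blank."""
    conn.execute(
        "INSERT INTO dataset_record VALUES (?, ?, ?, ?, ?, ?)",
        (dataset_id, record_id, timestamp, lat, lon, json.dumps(values)),
    )


def records_between(conn, dataset_id, start, end):
    """Query by primary timestamp; records without a timestamp are not returned."""
    rows = conn.execute(
        "SELECT record_id, field_values FROM dataset_record "
        "WHERE dataset_id = ? AND primary_timestamp BETWEEN ? AND ? "
        "ORDER BY primary_timestamp",
        (dataset_id, start, end),
    )
    return [(record_id, json.loads(values)) for record_id, values in rows]
```

Keeping the primary timestamp and lat-long pair in their own columns lets the database filter on them directly, while the JSON text field lets each data set define its own fields without a schema change. The string comparison on timestamps in this sketch only works if all timestamps share one ISO 8601 format and timezone.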

Some Existing Platforms

(Let us know if you know of others.)

Contact

Please send us your comments: feedback@manylabs.org.

This project is sponsored by manylabs.org.


Creative Commons License