NASA Tournament Lab

NASA Airathon

Air pollution is a serious environmental threat to our health. Strengthen public health with accurate, high-resolution air quality estimates.
Stage: Won
Prize: $50,000
Summary

Overview

Air pollution is one of the greatest environmental threats to human health. It can result in heart and chronic respiratory illness, cancer, and premature death.

Currently, no single satellite instrument provides ready-to-use, high resolution information on surface-level air pollutants. This gap in information means that millions of people cannot take daily action to protect their health. Help NASA deliver accurate, high-resolution air quality information to improve public health and safety!

This challenge focuses on two critical air quality measures: particulate matter 2.5 (PM2.5) and nitrogen dioxide (NO2). Each measure is the target for a predictive track in the challenge.

  • Particulate Track (PM2.5)
    PM2.5 refers to particulate matter less than 2.5 micrometers in size. It can last days to weeks in the atmosphere and penetrate deep into human lungs, increasing the risk of heart disease, lower respiratory infections, and poor pregnancy outcomes.
  • Trace Gas Track (NO2) 
    NO2 forms in the atmosphere from the burning of fossil fuels such as coal, oil, or gas, and has a short lifetime on the order of hours near the surface. It can cause respiratory issues, while also contributing to the production of ozone and nitrate aerosols, a component of PM2.5.

Existing high-quality ground monitors measure PM2.5 and NO2, but are expensive and have large gaps in coverage. Models that make use of widely available satellite data have the potential to provide local, daily air quality information. Recent studies have described such algorithms, but leave unsettled what data inputs or models produce the highest performance. Additionally, these experimental models have yet to be made available for easy public consumption.

In this challenge, your task is to use remote sensing data and other geospatial data sources to develop models for estimating daily levels of PM2.5 and NO2 with high spatial resolution. Successful models could provide critical data to help the public take action to reduce their exposure to air pollution.

 

Prizes 

Particulate Track (PM2.5) - $25,000

Trace Gas Track (NO2) - $25,000

Total: $50,000

 

Breakdown

Particulate Track

Evaluated on predicted PM2.5 values across all 5x5km grid cells in the test set.

1st - $12,000

2nd - $8,000

3rd - $5,000

 

Trace Gas Track

Evaluated on predicted NO2 values across all 5x5km grid cells in the test set.

1st - $12,000

2nd - $8,000

3rd - $5,000

 

Note on prize eligibility:

NASA Employees are prohibited by Federal statutes and regulations from receiving an award under this Challenge. NASA Employees are still encouraged to submit a solution. If you are a NASA Employee and wish to submit a solution please contact shobhana.gupta@nasa.gov who will connect you with the NASA Challenge owner. If your solution meets the requirements of the Challenge, any attributable information will be removed from your submission and your solution will be evaluated with other solutions found to meet the Challenge criteria. Based on your solution, you may be eligible for an award under the NASA Awards and Recognition Program or other Government Award and Recognition Program if you meet the criteria of both this Challenge and the applicable Awards and Recognition Program.

If you are an Employee of another Federal Agency, contact your Agency's Office of General Counsel regarding your ability to participate in this Challenge.

If you are a Government contractor or are employed by one, your participation in this challenge may also be restricted. If you are or your employer is receiving Government funding for similar projects, you or your employer are not eligible for award under this Challenge. Additionally, the U.S. Government may have Intellectual Property Rights in your solution if your solution was made under a Government Contract, Grant or Cooperative Agreement. Under such conditions, you may not be eligible for award.

Particulate Track

Problem description

To submit to the Particulate Track, navigate to the challenge track on DrivenData.org here.

The goal of this challenge is to generate daily estimates of surface-level PM2.5 for a set of 5km x 5km geometries across three urban areas:

  1. Los Angeles (South Coast Air Basin)
  2. Delhi
  3. Taipei

These locations offer readily available satellite data but varying levels of pollution and historical data. High-performing models should do well in all three locations.

There is a separate track for the prediction of each pollutant. This is the Particulate Track (PM2.5); you can find the Trace Gas Track (NO2) here. More information on the dataset, performance metric, and submission specifications is provided below.

Finalists and runners-up will be determined by performance on the private test set. These participants will then have the opportunity to submit their code to be audited using an out-of-sample verification set. The top 3 eligible teams that pass this final check in each track will be awarded prizes.

 

Data

Submission Format

Performance metric

 

Data

There are two types of pre-approved data that can be used as input to your model: satellite data and ancillary (meteorological and topographic) data. We provide satellite data through a public S3 bucket hosted by DrivenData. These files have already been subsetted to the correct times and geographies. Ancillary data can be accessed through public portals. You may use any approved data sources you like, but you must use at least one dataset from the list of approved satellite data.

Note that the data are provided in various spatiotemporal resolutions and formats. Malfunctioning instruments and other errors may lead to missing and incorrect values. The datasets may also cover different date ranges or geographical areas. It is your job to figure out how to best combine these datasets to build the best model!

 

Approved features (inputs)

All data sources detailed below are pre-approved for both model training and inference. These data have undergone a careful selection process and are freely and publicly available. Note that only the listed data products and access locations are pre-approved for use. You must go through an approved access location when using these sources for inference.

If you are interested in using additional sources that are not listed, see the process for requesting additional data sources below.

 

SATELLITE DATA

NASA provides many data products through the Earthdata portal. Select data sources that are likely to be of greatest use in this track are listed below. You must use at least one of these satellite sources to train your model.

The primary satellite data products related to estimating PM2.5 measure something called Aerosol Optical Depth (AOD), also called Aerosol Optical Thickness (AOT). It is a unitless measure of aerosols (e.g., urban haze, smoke particles, desert dust, sea salt) distributed within a column of air from Earth's surface to the top of the atmosphere. PM2.5 itself, by contrast, is typically reported in μg/m3 (micrograms per cubic meter).

MODIS/Terra and Aqua MAIAC Land Aerosol Optical Depth Daily L2G 1 km SIN Grid (MCD19A2):

Multi-Angle Implementation of Atmospheric Correction (MAIAC) is an algorithm that uses data from the two MODIS satellite instruments to derive high-resolution aerosol and land surface reflectance products. The MCD19A2 Version 6 gridded Level 2 product is produced daily at 1 kilometer (km) pixel resolution. These data are provided as Hierarchical Data Format files.

MISR Level 2 FIRSTLOOK Aerosol Product (MIL2ASAF):

MIL2ASAF is a near real-time version of the Multi-angle Imaging SpectroRadiometer (MISR) Level 2 Aerosol product. It contains information on retrieved aerosol column amount, aerosol particle properties, and ancillary information based on Level 1B2 geolocated radiances observed by MISR at 4.4 km spatial resolution. Only the FIRSTLOOK version of this product is allowed for use in this competition. These data are provided as NetCDF files.

Approved satellite data are hosted in a public S3 bucket in the pm25 folder. The AWS CLI will be useful for downloading data (you will probably need the --no-sign-request argument to the CLI). See pm25_satellite_metadata.csv for filepaths by bucket region and file format. Further instructions are detailed in airathon_download_instructions_pm25.txt. CSV and text files are available on the data download page. The data is replicated to buckets hosted in the US, the EU (Germany), and Asia (Singapore). Pick the bucket geographically closest to your machine to maximize transfer speeds.

 

ANCILLARY DATA

NASA Digital Elevation Model (NASADEM):

NASADEM data products were derived from original telemetry data from the Shuttle Radar Topography Mission (SRTM). In addition to Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM) Version 2 data, NASADEM also relied on ground control points from the Ice, Cloud, and Land Elevation Satellite (ICESat) Geoscience Laser Altimeter System (GLAS) lidar shots to improve surface elevation measurements and geolocation accuracy.

NASADEM can be accessed through Microsoft's Planetary Computer through the nasadem collection via the STAC API. Using the PySTAC library created by Azavea, you can load, traverse, and access data within these STACs programmatically. This quickstart guide demonstrates how to search for data using the STAC API with PySTAC.

To get a SAS Token to enable access to the STAC API, use the Planetary Computer’s Data Authentication API. Alternatively, you can use the planetary-computer package to generate tokens and sign asset HREFs for access.

Global Forecast System: The Global Forecast System (GFS) is a National Centers for Environmental Prediction (NCEP) weather forecast model that generates data for dozens of atmospheric and land-soil variables, including temperatures, winds, precipitation, soil moisture, and atmospheric ozone concentration. The system couples four separate models (atmosphere, ocean, land/soil, and sea ice) that work together to accurately depict weather conditions.

Only the GFS Forecasts dataset is pre-approved for this competition (as opposed to GFS Analysis/GFS-ANL). You may use either the 0.5° or 1° grid.

A note on time: Estimates may only factor in data that would be available at the time of inference. When generating predictions, you may only use input data available up through the day of estimation; using any data from after that day is prohibited. This restriction applies to inference, not to model training.

 

Additional data sources

Only the approved sources listed above may be used in this challenge track. Any additional data must be reviewed and approved in order to be eligible for use.

If you would like for any additional sources to be approved, you are welcome to submit an official request form and the challenge organizers will review the request. Only select sources that demonstrate a strong case for use will be considered.

To qualify for approval, data sources must meet the following minimum requirements:

  • Freely and publicly available to all participants
  • Produced reliably by an operational data product
  • Does not incorporate reference-grade surface monitor data in any way
  • Provides clear value beyond existing approved sources

Keep in mind that data sources used in this challenge cannot be derived from models that use reference-grade monitoring data as input, or include data collected from reference-grade monitors of NO2 or PM2.5 pollutants. Low-cost sensor data may be considered as a separate category; these sensors are less expensive to manufacture and are typically less accurate and robust than reference-grade monitors. Any data in this category must be approved before use, based on clear documentation of how the data is incorporated and a demonstrated case for added value.

Any requests to add approved data sources must be received by February 21 to be considered. An announcement will be made to all challenge participants if your data source has been approved for use.

 

Labels (outputs)

Reference-grade ground monitor data will be provided for all training times and geographies as target measures for model output. PM2.5 is reported in μg/m3, or micrograms per cubic meter.

The training period for PM2.5 spans Feb 2018 - Dec 2020. The test period spans two disjoint periods: Jan 2017 - Jan 2018 and Jan 2021 - Aug 2021.

train_labels.csv contains the following columns:

  • datetime (string): The UTC datetime of the measurement in the format YYYY-MM-DDTHH:mm:ssZ. A value represents the average between 12:00am and 11:59pm local time. The datetime provided represents the start of that 24-hour period in UTC time. Remember that for each observation, you may only use input values that are available before this time.
  • grid_id (string): A 5-character alphanumeric ID that uniquely identifies a 5x5 km grid cell
  • value (float): A float indicating the average daily reference-grade monitor measurement for PM2.5

A unique row is identified by the combination of datetime and grid_id.
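As a sketch of how these columns can be parsed with the Python standard library (the rows below are made-up stand-ins for train_labels.csv, not real labels):

```python
import csv
import io
from datetime import datetime, timezone

# A small in-memory stand-in for train_labels.csv; the grid IDs and
# values here are invented for illustration.
sample = io.StringIO(
    "datetime,grid_id,value\n"
    "2018-02-01T08:00:00Z,A1B2C,12.5\n"
    "2018-02-01T16:00:00Z,D3E4F,48.0\n"
)

labels = {}
for row in csv.DictReader(sample):
    # Parse the UTC timestamp; fromisoformat on older Python versions
    # does not accept the trailing Z, so replace it explicitly.
    ts = datetime.fromisoformat(row["datetime"].replace("Z", "+00:00"))
    labels[(ts, row["grid_id"])] = float(row["value"])

# Each (datetime, grid_id) pair identifies exactly one row.
assert len(labels) == 2
```

Keying the dictionary on (datetime, grid_id) mirrors the row-identifier convention used throughout the challenge files.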

You may use historical ground truth training data as feature input to your model. Note that for verification, you may only use historical data up until the point of inference in order to make your prediction. Your algorithm must be able to perform well under these conditions in order to qualify for a prize.

After entering the challenge, you can download the training data on the data download page.

 

Metadata

Additionally, we provide two metadata CSV files.

grid_metadata.csv contains metadata about each grid cell and contains the following columns:

  • grid_id (string): A 5-character alphanumeric ID that uniquely identifies a 5x5 km grid cell
  • location (string): The location associated with a grid cell (Delhi, Los Angeles(SoCAB), or Taipei)
  • tz (string): The timezone used to localize the dates. Note that the dates for Los Angeles (SoCAB) ignore daylight saving time and use the Etc/GMT+8 timezone.
  • wkt (WKT): The geometry / polygonal coordinates of the grid cell
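The Etc/GMT+8 convention is easy to trip over, since its sign is inverted relative to the usual UTC offset notation. A small sketch using Python's standard zoneinfo module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# In the Etc naming convention the sign is inverted: Etc/GMT+8 is a
# fixed UTC-8 offset with no daylight saving adjustments.
local_midnight = datetime(2021, 1, 7, 0, 0, tzinfo=ZoneInfo("Etc/GMT+8"))
as_utc = local_midnight.astimezone(timezone.utc)

# Local midnight at UTC-8 corresponds to 08:00 UTC on the same day.
assert as_utc.hour == 8
```

This matches the labels convention: the UTC datetime of a Los Angeles (SoCAB) observation is the UTC instant at which its local 24-hour averaging window begins.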

pm25_satellite_metadata.csv contains metadata about hosted satellite data. Each file for a particular dataset is referred to as a granule:

  • granule_id (str): The filename for each granule
  • time_start (datetime): The start time of the granule in YYYY-MM-DDTHH:mm:ss.sssZ
  • time_end (datetime): The end time of the granule in YYYY-MM-DDTHH:mm:ss.sssZ
  • product (str): The concise name for the satellite data source
  • location (str): One of the three locations for this challenge
  • us_url (str): The file location of the granule in the public s3 bucket in the US East (N. Virginia) region
  • eu_url (str): The file location of the granule in the public s3 bucket in the Europe (Frankfurt) region
  • as_url (str): The file location of the granule in the public s3 bucket in the Asia Pacific (Singapore) region
  • cksum (int): The result of running the unix cksum command on the granule
  • granule_size (int): The filesize in bytes

The cksum and granule_size columns are especially useful for confirming that downloaded files are not corrupted.

TIP: To identify the corresponding granules for a given observation from train_labels.csv, first use grid_metadata.csv to find the observation's location. Then, using the location and datetime, find the S3 locations of relevant granules in pm25_satellite_metadata.csv. Remember that you are not allowed to use future data, so the time_end of a granule must be before the datetime of a given observation.
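The tip above can be sketched in Python; the helper names and dict-based granule records are illustrative assumptions, not part of any official data loader:

```python
from datetime import datetime

def parse_utc(s):
    # fromisoformat on older Python versions rejects a trailing Z.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def relevant_granules(granules, location, obs_datetime):
    """Return the granules usable for one observation.

    `granules` is an iterable of dicts carrying the metadata columns
    (location, time_end, us_url, ...); `location` comes from joining
    the observation's grid_id against grid_metadata.csv.
    """
    obs = parse_utc(obs_datetime)
    return [
        g for g in granules
        if g["location"] == location
        # No future data: the granule must end before the observation.
        and parse_utc(g["time_end"]) < obs
    ]
```

A real pipeline would then download each returned granule via its us_url / eu_url / as_url entry.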

 

Submission format

The format for submissions is a .csv with columns: datetime, grid_id, value. The combined grid_id and datetime are the row identifiers and should exactly match the provided submission format example. You will fill in the value column with your predictions.

To create a submission, download the submission format and replace the placeholder values in the value column with your predicted values. Prediction values must be floats or they will not be scored correctly.

 

EXAMPLE

For example, if you predicted a value of 1.0 for the first 5 grid cells for January 7, 2017, your predictions would look like the following:

datetime              grid_id  value
2017-01-07T16:00:00Z  1X116    1.0
2017-01-07T16:00:00Z  9Q6TA    1.0
2017-01-07T16:00:00Z  KW43U    1.0
2017-01-07T16:00:00Z  VR4WG    1.0
2017-01-07T16:00:00Z  XJF9O    1.0

And the first few rows and columns of the .csv file that you submit would look like:

datetime,grid_id,value
2017-01-07T16:00:00Z,1X116,1.0
2017-01-07T16:00:00Z,9Q6TA,1.0
2017-01-07T16:00:00Z,KW43U,1.0
2017-01-07T16:00:00Z,VR4WG,1.0
2017-01-07T16:00:00Z,XJF9O,1.0

You can see an example of the format that your submission must conform to, including headers and row names, in submission_format.csv.
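One minimal way to fill in the format with Python's standard csv module; predict() is a placeholder for your own model, and the two rows below are an in-memory stand-in for submission_format.csv:

```python
import csv
import io

# In-memory stand-in for submission_format.csv with placeholder values.
submission_format = io.StringIO(
    "datetime,grid_id,value\n"
    "2017-01-07T16:00:00Z,1X116,0.0\n"
    "2017-01-07T16:00:00Z,9Q6TA,0.0\n"
)

def predict(dt, grid_id):
    # Placeholder for a trained model's inference call.
    return 1.0

rows = list(csv.DictReader(submission_format))
for row in rows:
    # Keep the row order and identifiers exactly as given; only the
    # value column changes, and it must be a float.
    row["value"] = predict(row["datetime"], row["grid_id"])

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["datetime", "grid_id", "value"])
writer.writeheader()
writer.writerows(rows)
```

Writing back through the rows of the provided format file, rather than constructing the index yourself, is the easiest way to guarantee an exact match with the expected row identifiers.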

In this challenge, predictions for a given test grid cell may use approved features and predictions from other test samples that would be available at the time of inference (i.e., it is not necessary for each test sample to be processed independently without the use of information from other cases in the test set).

 

REQUIREMENTS FOR WINNING SOLUTIONS

Note that while only grid cells listed in the submission format are required for evaluation during the challenge, winning solutions must be able to produce predictions for the same grid cells on a single new day.

Finalists will need to deliver a solution that includes a trained model that ingests a new date, processes the relevant features, and outputs predicted pollutant concentrations on that date without additional training. Algorithms should be able to run inference within a reasonable runtime of one hour or less to predict one day’s concentrations for all three cities on a single GPU node. Submitted models will be run on an out-of-sample time period to confirm that they are able to execute successfully with comparable performance to the test set.

A readme file accompanying these solutions must clearly describe the data sources used in both training and inference, including where each is incorporated and how the solution ensures time is treated correctly during inference, along with easy-to-follow instructions for running all parts of the solution. Additional instructions will be provided to finalists at the end of the submission period.

 

Performance metric

To measure your model’s performance, we’ll use a metric called the coefficient of determination R2 (R squared). R2 indicates the proportion of the variation in the dependent variable that is predictable from the independent variables. This is an accuracy metric, so a higher value is better.

R2 is defined as:

R² = 1 − Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

where:

  • n = number of values in the dataset
  • yᵢ = the ith true value
  • ŷᵢ = the ith predicted value
  • ȳ = average of all true y values

For each submission, a secondary metric called Root Mean Square Error (RMSE) will also be reported on the leaderboard. RMSE is the square root of the mean of squared differences between estimated and observed values. This is an error metric, so a lower value is better.

RMSE is defined as:

RMSE = √( (1/N) Σᵢ₌₁ᴺ (ŷᵢ − yᵢ)² )

where:

  • ŷᵢ is the ith predicted value
  • yᵢ is the ith true value
  • N is the number of samples

While both R2 and RMSE will be reported, only R2 will be used to determine your official ranking and prize eligibility.
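Both metrics are straightforward to implement; a minimal Python sketch (not the official scoring code):

```python
import math

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: the share of the variance in the true
    # values that is explained by the predictions.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    # Square root of the mean squared difference between predicted
    # and observed values.
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```

A model that always predicts the mean of the true values scores R² = 0, anything worse goes negative, and a perfect model scores 1; RMSE, by contrast, is bounded below by 0 and carries the units of the target (μg/m3 for PM2.5).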
