Retrievals: Dealing with multiple datasets

Written by Evert Nasedkin. Please cite pRT’s retrieval package (Nasedkin et al. 2024) in addition to pRT (Mollière et al. 2019) if you make use of the retrieval package for your work.

This advanced tutorial will use JWST observations of WASP39 b to demonstrate how to incorporate multiple datasets into the pRT retrieval framework. This is the nominal Radtrans workflow. For the SpectralModel workflow, see the SpectralModel retrieval notebook.

Getting started

Please make sure to have worked through the“Basic Retrieval Tutorial”before looking at the material below.

In this tutorial, we will outline the process of setting up a RetrievalConfig object, which is the class used to set up a pRT retrieval. The basic process is always to set up the configuration, and then pass it to the Retrieval class to run the retrieval using, for example, pyMultiNest. Like mentioned in the “Basic Retrieval Tutorial” several standard plotting outputs will also be produced by the retrieval class. Most of the classes and functions used in this tutorial have more advanced features than what will be explained here, so it’s highly recommended to take a look at the code and API documentation. There should be enough flexibility built in to cover most typical retrieval studies, but if you have feature requests please get in touch, or open an issue on gitlab.

[1]:

# Let's start by importing everything we need.
import os
# To not have numpy start parallelizing on its own
os.environ["OMP_NUM_THREADS"] = "1"
#os.environ["pRT_input_data_path"] = "/path/to/petitRADTRANS/petitRADTRANS/input_data/"

import numpy as np
import matplotlib.pyplot as plt

from petitRADTRANS.radtrans import Radtrans
from petitRADTRANS import physical_constants as cst

# Import the class used to set up the retrieval.
from petitRADTRANS.retrieval import Retrieval,RetrievalConfig

# Import Prior functions, if necessary.
from petitRADTRANS.retrieval.utils import gaussian_prior

# Import atmospheric model function
from petitRADTRANS.retrieval.models import guillot_patchy_transmission

Lets start out by setting up a simple run definition. We’ll add the data AFTER we define the model function below Full details of the parameters can be found in the API documentation.

[4]:

# Lets start out by setting up a simple run definition
# Full details of the parameters can be found in retrieval_config.py

# Since our retrieval has not run yet, we'll set the mode to 'retrieve'.
# If it has already run, we can set it to 'evaluate', so we can make some plots.
# In general, the evaluate functions will also be called after the 'retrieve' mode has finished.

retrieval_config = RetrievalConfig(retrieval_name="WASP39b_Guillot_FreeChem_PatchyGreyHaze",
                                      run_mode="retrieve", # This must be 'retrieve' to run PyMultiNest, 'evaluate' if looking at an existing run
                                      pressures=np.logspace(-8,3,100), # Extend up to 10^-8 bar
                                      amr=False, # We won't be using adaptive mesh refinement for the pressure grid
                                      scattering_in_emission=False) # This would turn on scattering when calculating emission spectra.
                                                          # Scattering is automatically included for transmission spectra.

Data

Let’s start with reading in the data. The data must be a 1D spectrum with error bars or a covariance matrix.

As in the basic tutorial, we’re reading in text files, but this time we also include the column that describes the wavelength bins:

# Wavelength [micron], Bins [micron], Flux [W/m2/micron or (Rp/Rstar)^2], Flux Error [W/m2/micron or (Rp/Rstar)^2]

As mentioned, the Data class is arguably the most important part of setting up the retrieval. Not only do you input your data here, but you also choose your model function and resolution. This means that you can design a retrieval around different datatypes and retrieve simultaneously on both - for example, if you want the day and nightside of a planet, or want to combine the eastward and westward limbs of a transmission spectrum with different models.

You can also set a distance to your object, which will allow you to automatically scale the flux and error of your data using the scale_to_distance() method - useful if you have data normalized to 10pc! Finally, there’s also a arguments scale, scale_err and offset_bool, which tells the retrieval that the flux or uncertaines should be scaled by an arbitrary multiplicative factor or have an additive offset, both which is set up as a normal retrieval parameter using the RetrivalConfig.add_parameter() method. The name must be of the format DATANAME_scale_factor or DATANAME_offset. This is useful if two datasets are incompatible in absolute photometry, but you still want to use the spectral shape to inform the retrieval.

In this retrieval we’re going to include several datasets from different JWST instruments, starting with NIRISS SOSS orders 1 and 2. To include both of them , we simply add more than one dataset to our RetrievalConfig object. Notice that we’re also telling the retrieval that we want the data of Order 2 to have an additive offset: this will allow the data to float relative to Order 1, which remains fixed. This can be used to compensate for differences in transit depth between different instruments.

We’re also using the built-in guillot_patchy_transmission model for this retrieval, rather than writing our own model function from scratch.

[8]:

# Here we import petitRADTRANS to find the path of the example files on your machine.
# In general this is not required, you just put the files in the folder that you are running
# Your script in, for example.
import petitRADTRANS # need to get the name for the example data

path_to_data = ""
path_to_data = petitRADTRANS.__file__.split('__init__.py')[0] # Default location for the example
transmission_directory = "retrieval/examples/transmission/"
retrieval_config.add_data('JWST/NIRISSSOSS/O1',
                       f"{path_to_data}{transmission_directory}observations/JWST/WASP39b_niriss_soss1.txt",
                       data_resolution=700,
                       model_resolution=300,
                       model_generating_function = guillot_patchy_transmission,
                       external_radtrans_reference=None)
retrieval_config.add_data('JWST/NIRISSSOSS/O2',
                       f"{path_to_data}{transmission_directory}observations/JWST/WASP39b_niriss_soss2.txt",
                       data_resolution=700,
                       model_resolution=300,
                       offset_bool=True,
                       model_generating_function = guillot_patchy_transmission,
                       external_radtrans_reference=None)

External references

Sometimes the datasets will include regions that overlap in wavelength, or where one dataset falls entirely within the wavelength range of another dataset. In that case we can use an “external reference”: for the dataset (let’s call it “short”) that falls within the wavelength range of the other (let’s call it “long”) we will not compute a model spectrum. Instead, we will use the model spectrum calculated for the “long” dataset. This way we only need to compute the spectrum once in the same wavelength range, rather than having the retrieval package initialise two Radtrans objects and calculating a spectrum for each. This saves both memory and time. However, be careful here: the model resolution in the reference object should be high enough to properly sample any datasets that reference it! If you have multiple data sets that do not overlap, but lie back-to-back to each other in wavelength space, with at most small gaps, it makes also sense to use external references, since it reduced computational overheads.

In this example, the NIRSpec PRISM data covers the entire NIRISS SOSS wavelength range, so we can use it as a reference for both NIRISS SOSS orders.

Note that here we appear to do something odd: the data resolution is more than twice larger than the model resolution for NIRISS SOSS. In the special case here the SOSS data files are binned to R = 100. Thus, while the spectrum was indeed recorded at R=700, it is OK to use a lower pRT model resolution for the retrievals.

[9]:

retrieval_config.data = {} # Remove the previous data that was added above to start with a clean slate.
retrieval_config.add_data('JWST/NIRSPEC/PRISM',
                       f"{path_to_data}{transmission_directory}observations/JWST/WASP39b_nirspec_prism.txt",
                       data_resolution=100,
                       model_resolution=300,
                       offset_bool=True,
                       model_generating_function = guillot_patchy_transmission)
retrieval_config.add_data('JWST/NIRISSSOSS/O1',
                       f"{path_to_data}{transmission_directory}observations/JWST/WASP39b_niriss_soss1.txt",
                       data_resolution=700,
                       model_resolution=300,
                       model_generating_function = guillot_patchy_transmission,
                       external_radtrans_reference='JWST/NIRSPEC/PRISM') # here we set the external pRT reference to PRISM
retrieval_config.add_data('JWST/NIRISSSOSS/O2',
                       f"{path_to_data}{transmission_directory}observations/JWST/WASP39b_niriss_soss2.txt",
                       data_resolution=700,
                       model_resolution=300,
                       offset_bool=True,
                       model_generating_function = guillot_patchy_transmission,
                       external_radtrans_reference = 'JWST/NIRSPEC/PRISM') # here we set the external pRT reference to PRISM

Model parameters

Here we’re using a more complicated atmospheric model to fit the JWST data. The temperature profile is taken from Guillot 2010, and includes four parameters to describe the shape. We’re freely retrieving the chemical abundances, and include both patchy grey clouds and an enhanced power law slope as a proxy for hazes.

[12]:

# WASP 39 parameters
retrieval_config.add_parameter(name = 'stellar_radius', free = False, value = 0.9324 * cst.r_sun)

# Fix the reference pressure in bar
retrieval_config.add_parameter('reference_pressure',False,value  = 0.01)


# Choose two of log_g, radius and mass priors
retrieval_config.add_parameter('log_g',True,
                            transform_prior_cube_coordinate = \
                            lambda x : 2.0+3.5*x)
retrieval_config.add_parameter('planet_radius', True,
                            transform_prior_cube_coordinate = \
                            lambda x : 0.8*cst.r_jup_mean+ (x*0.8*cst.r_jup_mean))

# Priors for Guillot 2010 Temperature Profile
retrieval_config.add_parameter("T_int", True,
                            transform_prior_cube_coordinate = \
                            lambda x : 100 + 3500*x)
retrieval_config.add_parameter("T_equ", True,
                            transform_prior_cube_coordinate = \
                            lambda x : 100 + 3500*x)
retrieval_config.add_parameter("gamma", True,
                            transform_prior_cube_coordinate = \
                            lambda x :  10**(-(x/2.)**2./2.))
retrieval_config.add_parameter("log_kappa_IR", True,
                            transform_prior_cube_coordinate = \
                            lambda x : -4.0 + 6.0*x)

# Grey cloud top pressure
retrieval_config.add_parameter('log_Pcloud',True,
                            transform_prior_cube_coordinate = \
                            lambda x : -8.+11.*x)

# Enhanced haze scattering slope
# kappa
retrieval_config.add_parameter('haze_factor',True,
                            transform_prior_cube_coordinate = \
                            lambda x : -4.+14.*x)
# gamma
retrieval_config.add_parameter('power_law_opacity_coefficient',True,
                            transform_prior_cube_coordinate = \
                            lambda x : -20.+22.*x)
# Cloud fraction
retrieval_config.add_parameter('patchiness',True,
                            transform_prior_cube_coordinate = \
                            lambda x : x)

# Data offsets
retrieval_config.add_parameter('JWST/NIRSPEC/PRISM_offset',True,
                            transform_prior_cube_coordinate = \
                            lambda x : gaussian_prior(x, 0, 1e-4))
retrieval_config.add_parameter('JWST/NIRISSSOSS/O2_offset',True,
                            transform_prior_cube_coordinate = \
                            lambda x : gaussian_prior(x, 0, 1e-4))

Opacities

[14]:

retrieval_config.set_rayleigh_species(['H2', 'He'])
retrieval_config.set_continuum_opacities(['H2-H2', 'H2-He'])

# Here we setup the line species for a free retrieval,
# setting the prior bounds of the log10 abundance with the abund_lim parameter
# So the retrieved value is the log mass fraction.
retrieval_config.set_line_species(["H2O__Pokazatel", "CO-NatAbund", "CO2", "CH4", "SO2"], eq=False, abund_lim = (-8.0,0.0))

Plotting

Please see “Basic Retrieval Tutorial” if you do not know what’s happening here.

[15]:

##################################################################
# Define what to put into corner plot if run_mode == 'evaluate'
##################################################################
for key, value in retrieval_config.parameters.items():
    value.plot_in_corner = True
    value.corner_label = key.replace("_"," ")

##################################################################
# Define axis properties of spectral plot if run_mode == 'evaluate'
##################################################################
retrieval_config.plot_kwargs["spec_xlabel"] = 'Wavelength [micron]'
retrieval_config.plot_kwargs["spec_ylabel"] = r'$(R_{\rm P}/R_*)^2$ [ppm]'
retrieval_config.plot_kwargs["y_axis_scaling"] = 1e6 # so we have units of ppm
retrieval_config.plot_kwargs["xscale"] = 'linear'
retrieval_config.plot_kwargs["yscale"] = 'linear'

# Use at least ~100 samples to plot 3 sigma curves
retrieval_config.plot_kwargs["nsample"] = 10

##################################################################
# Define from which observation object to take P-T
# in evaluation mode (if run_mode == 'evaluate'),
# add PT-envelope plotting options
##################################################################
retrieval_config.plot_kwargs["take_PTs_from"] = 'JWST/NIRSPEC/PRISM'
retrieval_config.plot_kwargs["temp_limits"] = [150, 3000]
retrieval_config.plot_kwargs["press_limits"] = [1e1, 1e-7]

# If in doubt, define all of the plot_kwargs used here.

Running the retrieval

Like in “Basic Retrieval Tutorial” we can now run the retrieval. Most of the various parameters used to control pyMultiNest or Ultranest can be set in the retrieval.run() function, see the API documentation.

NOTE: This retrieval example only uses 40 live points so that it can be run quickly. However, this should be increased to at least 400 live points for a full retrieval, and will probably require a server or cluster.

Once the retrieval is complete, we can use the plot_all function to generate plots of the best fit spectrum, the pressure-temperature profile and the corner plots, also see “Basic Retrieval Tutorial”.

[ ]:

# If you want to run the retrieval, you need to choose a different output directory name,
# due to pRT requirements.
output_dir = f"{path_to_data}retrieval/examples/transmission/"
retrieval = Retrieval(retrieval_config,
                      output_dir=output_dir,
                      sample_spec=False,         # Output the spectrum from nsample random samples.
                      pRT_plot_style=True,         # We think that our plots look nice.
                      ultranest=False)             # Let's use pyMultiNest rather than Ultranest

# Default pymultinest retrieval setup, but with only 40 live points
retrieval.run(sampling_efficiency=0.8,
              const_efficiency_mode=False,
              n_live_points=40
              # const_efficiency_mode=False,  # Examples for additional PyMultiNest or Ultranest parameters, see API.
              # sampling_efficiency=0.8,
              # log_z_convergence=0.5,
              # step_sampler=False,
              # warmstart_max_tau=0.5,
              # n_iter_before_update=50,
              # resume=False,
              # max_iters=0,
              # frac_remain=0.1,
              # importance_nested_sampling = True,
              #Lepsilon=0.3)
              )

[ ]:

# Automatically generate all of the standard output plots. The contribution
# argument means that the PT profile and abundance profile plots will display
# the contribution function. The mode argument means we'll be plotting a model
# based on the median retrieved parameter values, rather than the minimum likelihood
# model.

retrieval.plot_all(contribution = True, mode = 'median')

Contact

If you need any additional help, don’t hesitate to contact Evert Nasedkin.