Climate data often comes as NetCDF, and most of the time we have to deal with a large number of files, for instance when the data is split into one file per year. So, what can we do if we want to process all files in the same way?
Luckily, there are tools to accomplish this task easily and even improve the performance by parallel execution. Here, I will show you a simple way to do this. In this example, I will download a small part of a global climate data set and extract a region from it. It’s just a one-liner.
Download test data
First, we need some NetCDF data to test our method. For this, I will use data from the NCEP-DOE Reanalysis 2 (R2) project, provided by the NOAA/ESRL Physical Sciences Laboratory. More specifically, I will download six years of global daily mean 2 m air temperature data. Since each file contains data for one year, we will download six files.
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2015.nc
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2016.nc
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2017.nc
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2018.nc
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2019.nc
wget ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid/air.2m.gauss.2020.nc
Depending on your download speed, the process can take a few minutes to complete. In total, we download about 75 MB.
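The six wget calls differ only in the year, so they can also be generated in a loop. The following is a minimal sketch, not part of the original workflow: the function name list_urls and the DOWNLOAD switch are hypothetical helpers introduced here for illustration. By default it only prints the URLs, so nothing is fetched by accident.

```shell
# Base URL taken from the wget commands above.
base_url="ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis2.dailyavgs/gaussian_grid"

# Print one URL per year from 2015 to 2020 (hypothetical helper).
list_urls() {
    year=2015
    while [ "$year" -le 2020 ]; do
        printf '%s/air.2m.gauss.%s.nc\n' "$base_url" "$year"
        year=$((year + 1))
    done
}

# DOWNLOAD is an assumed switch: print the URLs by default,
# fetch them one by one with wget only when DOWNLOAD=1 is set.
if [ "${DOWNLOAD:-0}" = "1" ]; then
    list_urls | xargs -n 1 wget
else
    list_urls
fi
```

Changing the two year bounds in the loop is then enough to fetch a different period.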
Use cdo to select a specific region
Our goal is to cut out the same region from each NetCDF file and save it as a new file. Since the NetCDF data is on a lon/lat grid, the appropriate cdo operator is sellonlatbox. After a first look at the data with ncview, Australia seems to be a good candidate.
We can use ncview to roughly figure out the coordinates of the grid points that define the box, or use a tool like bboxfinder. The operator expects the box as lon1,lon2,lat1,lat2. Once we have found the coordinates, we can give it a try with cdo.
cdo sellonlatbox,97,172,-47,-2 air.2m.gauss.2015.nc air.2m.gauss.2015_box.nc
This cdo command will create a new file which contains only the selected box. We can check again whether we succeeded. I think that it looks good.
Note that we ran cdo without any other options, so the resulting NetCDF files are uncompressed. You can easily add compression yourself. See the cdo documentation for details.
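As a hedged illustration of what adding compression could look like, the sketch below combines cdo's -f nc4 option (NetCDF-4 output) with -z zip_5 (deflate compression, level 5). The output file name and the level are assumptions chosen for this example. The command only runs if cdo and the input file are present; otherwise it is printed.

```shell
in_file="air.2m.gauss.2015.nc"
out_file="air.2m.gauss.2015_box_zip.nc"   # hypothetical output name

# -f nc4 selects NetCDF-4 output; -z zip_5 requests deflate compression level 5.
box_cmd="cdo -f nc4 -z zip_5 sellonlatbox,97,172,-47,-2 $in_file $out_file"

# Run the command only if cdo and the input file exist; otherwise just show it.
if command -v cdo >/dev/null 2>&1 && [ -f "$in_file" ]; then
    $box_cmd
else
    echo "dry run: $box_cmd"
fi
```

Higher levels (up to zip_9) trade more CPU time for smaller files.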
One line to rule them all
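The next step is to apply the cdo command to all files. To run this as a parallel batch job, we first use the find command with a (more or less) clever pattern to get a list of all the NetCDF files in the directory. Note that -name takes a shell glob pattern, not a regular expression, so each [0-9] matches exactly one digit and the dots are literal.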
find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc"
./air.2m.gauss.2019.nc
./air.2m.gauss.2018.nc
./air.2m.gauss.2020.nc
./air.2m.gauss.2015.nc
./air.2m.gauss.2017.nc
./air.2m.gauss.2016.nc
As you can see, this is sufficient to match the six files we downloaded.
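To see what the pattern accepts and rejects, it can be tried on a few empty placeholder files. This is a small sketch in a throwaway directory; the two extra file names (an already processed _box file and a file named mean) are made up for the test.

```shell
# Work in a temporary directory with empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/air.2m.gauss.2015.nc" \
      "$demo_dir/air.2m.gauss.2016.nc" \
      "$demo_dir/air.2m.gauss.2015.nc_box.nc" \
      "$demo_dir/air.2m.gauss.mean.nc"

# Only names consisting of the prefix, exactly four digits, and ".nc" match.
matches=$(cd "$demo_dir" && find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc" | sort)
printf '%s\n' "$matches"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Only the 2015 and 2016 files are listed. Because the _box output files do not match, the same find command can be run again later without picking up its own results.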
We can now use this list to process the files. To do so, I use the pipe symbol | to hand it over to the xargs command. xargs will then call cdo once per file until all files are done. I used the following command line.
find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc" | xargs -P 3 -I "%" cdo sellonlatbox,97,172,-47,-2 % %_box.nc
We already know the find and cdo parts of this command line, so let’s look at the xargs part. Here, I use only two options for xargs. The -P 3 option tells xargs to run up to three cdo processes at the same time; as soon as one finishes, the next one starts. The -I "%" option defines % as a placeholder, which is replaced in the cdo part with each filename coming from the find command. Note that I use it twice in the cdo command: the first one is the input file, and the second one is suffixed with _box.nc for the output file, so ./air.2m.gauss.2015.nc produces ./air.2m.gauss.2015.nc_box.nc.
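A simple way to check how xargs fills in the placeholder is to put echo in front of cdo, so that the commands are printed instead of executed. In this sketch, two file names written with printf stand in for the output of find.

```shell
# Two example file names stand in for the list produced by find.
# echo prints each command that xargs would otherwise run.
expanded=$(printf '%s\n' ./air.2m.gauss.2015.nc ./air.2m.gauss.2016.nc |
    xargs -I "%" echo cdo sellonlatbox,97,172,-47,-2 % %_box.nc)
printf '%s\n' "$expanded"
```

Each printed line shows the input file followed by the same name with _box.nc appended. Removing echo turns the dry run back into the real command.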
That’s it. If you can spare more processor cores, you can increase the number passed to the -P flag. If you want to do something different with the files, replace the cdo command with one of your choice. Indeed, this method can be used to process any file type with any command.
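As one example of swapping in a different command, the sketch below wraps cdo in a short sh -c script. This is a variation and not the command used above: it strips the trailing .nc before appending _box.nc, and it passes file names with -print0 and -0 (supported by GNU and BSD find/xargs) so that unusual names are handled safely. The echo keeps it a dry run on empty placeholder files.

```shell
# Empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/air.2m.gauss.2015.nc" "$demo_dir/air.2m.gauss.2016.nc"

# xargs passes one file name per call as "$1" to the inline script;
# ${1%.nc} removes the trailing ".nc" before "_box.nc" is appended.
# Remove "echo" to actually run cdo.
planned=$(cd "$demo_dir" &&
    find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc" -print0 |
    xargs -0 -n 1 -P 3 sh -c \
        'echo cdo sellonlatbox,97,172,-47,-2 "$1" "${1%.nc}_box.nc"' sh |
    sort)
printf '%s\n' "$planned"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

With this variant, the output for 2015 would be named ./air.2m.gauss.2015_box.nc, matching the name used in the single-file example.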
Performance
We can easily test the performance gain. To do so, we run the command line from above with -P 1 instead of -P 3. This calls only one instance of cdo at a time, which is like a simple batch job.
time find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc" | xargs -P 1 -I "%" cdo sellonlatbox,97,172,-47,-2 % %_box.nc
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 63MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 64MB].
cdo sellonlatbox: Processed 6605568 values from 1 variable over 366 timesteps [0.15s 64MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 63MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 64MB].
cdo sellonlatbox: Processed 6605568 values from 1 variable over 366 timesteps [0.15s 63MB].
real 0m1.048s
In total, it took about one second to process all files. Now with -P 3.
time find . -name "air.2m.gauss.[0-9][0-9][0-9][0-9].nc" | xargs -P 3 -I "%" cdo sellonlatbox,97,172,-47,-2 % %_box.nc
cdo sellonlatbox: Processed 6605568 values from 1 variable over 366 timesteps [0.15s 63MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 63MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 62MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.15s 64MB].
cdo sellonlatbox: Processed 6587520 values from 1 variable over 365 timesteps [0.16s 64MB].
cdo sellonlatbox: Processed 6605568 values from 1 variable over 366 timesteps [0.16s 63MB].
real 0m0.363s
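You can easily see that this was much faster: about 0.36 seconds instead of about 1.05 seconds, so roughly three times as fast with three parallel processes. We will benefit even more from this reduced execution time if we have to deal with much larger and/or many more input files.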
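The speedup can be computed directly from the two "real" times reported above. A minimal sketch using awk for the floating-point division:

```shell
serial=1.048     # "real" time in seconds with -P 1
parallel=0.363   # "real" time in seconds with -P 3

# Speedup = serial time / parallel time; LC_ALL=C forces "." as the decimal separator.
speedup=$(LC_ALL=C awk -v s="$serial" -v p="$parallel" 'BEGIN { printf "%.2f", s / p }')
echo "speedup: ${speedup}x"
```

This prints a speedup of 2.89, close to the ideal factor of 3 for three processes. These are single measurements on small files, so the exact value will vary between runs and machines.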