Deciphering Cloud Optimized GeoTIFFs
In this blog, we will discuss Cloud Optimized GeoTIFFs (COGs), but the same concepts apply to any pyramidal TIFFs.
A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on a HTTP file server, with an internal organization that enables more efficient workflows on the cloud.
That description is from cogeo.org, which provides a comprehensive overview of COGs, their specification, the technologies that make them possible, and their applicability. Their website is highly approachable, and if you are unfamiliar with COGs, we recommend visiting their site first.
While the specification for COGs is well documented, COGs can be a bit opaque, and it is often difficult to figure out if the GeoTIFF you have is cloud optimized or not. We have been working more broadly with COGs and large collections of satellite imagery with the ResonantGeoData platform and our dynamic tile serving app in Django: django-large-image. In our experience, the opaqueness of COGs can be a problem when ingesting large amounts of GeoTIFFs into a data catalog where we need to be able to efficiently serve bytes of the image for visualization and analysis. So we’d like to shed a bit of light on COGs and how you can tell whether or not a GeoTIFF is cloud optimized.
What makes a GeoTIFF cloud optimized?
There are two crucial aspects of a COG that enable it to be “cloud optimized”:
- Tiling: The bytes of the image data are arranged in tiles so that geographically close data are adjacent within the file. This is opposed to typical striping patterns.
- Overviews: Embedded in the image are “zoomed-out,” lower-resolution versions of the image down to 256×256 pixels (or 512×512), effectively creating a pyramid of resolutions.
Again, cogeo.org does a wonderful job explaining these concepts. For further details, please refer to their in-depth explanation.
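To make the pyramid idea concrete, here is a minimal sketch of the overview arithmetic. It assumes each level is half the previous one, rounded up, and that levels stop once both sides fit within 256 pixels. The function name and the 7701×7811 example size are illustrative, and real writers may round differently.

```python
def overview_sizes(width, height, min_size=256):
    """Return the (width, height) of each overview level.

    Assumption: each level halves the previous one, rounding up, and the
    pyramid stops once both sides are at most min_size pixels.
    """
    sizes = []
    while max(width, height) > min_size:
        # Integer ceiling division keeps a trailing odd row or column.
        width, height = (width + 1) // 2, (height + 1) // 2
        sizes.append((width, height))
    return sizes


for level, (w, h) in enumerate(overview_sizes(7701, 7811), start=1):
    print(f"overview {level}: {w} x {h}")
```

An image that already fits within 256×256 yields an empty list, meaning no overviews are needed under this rule.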
Why are COGs important?
The Cloud Optimized GeoTIFF format provides a data structure for imagery that enables clients to make HTTP range requests to the server to fetch desired subregions and subresolutions on the fly. Having data properly formatted as a COG enables performant dynamic tile servers or real-time streaming of the data for processing without preprocessing or pre-tiling the imagery. These use cases are especially relevant in cloud environments and have been prevalent in some of our workflows to catalog and serve collections of satellite imagery for on-the-fly visualization and analysis.
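As a rough illustration of how tiling and range requests fit together, the sketch below works out which tiles a pixel window touches and builds the Range header value for each. A TIFF records tile positions in its TileOffsets and TileByteCounts tags. The offsets, byte counts, and image size used here are made-up values, and the helper names are hypothetical.

```python
def tiles_for_window(x0, y0, x1, y1, image_width, tile_size=256):
    """Return row-major indices of tiles touched by the window [x0, x1) x [y0, y1)."""
    tiles_across = -(-image_width // tile_size)  # ceiling division
    cols = range(x0 // tile_size, (x1 - 1) // tile_size + 1)
    rows = range(y0 // tile_size, (y1 - 1) // tile_size + 1)
    return [row * tiles_across + col for row in rows for col in cols]


def range_header(offset, byte_count):
    """Build an HTTP Range header value; the end byte is inclusive."""
    return f"bytes={offset}-{offset + byte_count - 1}"


# Hypothetical 1024 x 512 image: 4 x 2 tiles of 1000 bytes each, starting at byte 4096.
tile_offsets = [4096 + 1000 * i for i in range(8)]
tile_byte_counts = [1000] * 8

for index in tiles_for_window(300, 100, 600, 300, image_width=1024):
    print(index, range_header(tile_offsets[index], tile_byte_counts[index]))
```

Only four of the eight tiles are requested for this window. A striped file would instead force the client to read every strip that crosses the window at full width.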
How to verify if a GeoTIFF is a valid COG?
Conveniently, GDAL has a script that will check the tiling and overviews of a GeoTIFF to let you know if the image is a valid COG. This script can be executed on any TIFF image to determine whether it has the proper tiling and overviews to be cloud optimized.
This script from GDAL is a great comprehensive check for both the overviews and the tiling structure of a TIFF, but GDAL can be a heavy dependency. For a lighter-weight method to check if a COG has the right overviews, we can use tifftools: a pure Python package developed by Kitware for extracting metadata from TIFF images.
$ pip install tifftools
$ tiffdump raster.tif
…
Directory 0: offset 8 (0x8) next 8084 (0x1f94)
ImageWidth (256) SHORT (3) 1<7701>
ImageLength (257) SHORT (3) 1<7811>
…
Directory 1: offset 8084 (0x1f94) next 15958 (0x3e56)
SubFileType (254) LONG (4) 1<1>
ImageWidth (256) SHORT (3) 1<3851>
ImageLength (257) SHORT (3) 1<3906>
…
Directory 2: offset 15958 (0x3e56) next 18192 (0x4710)
SubFileType (254) LONG (4) 1<1>
ImageWidth (256) SHORT (3) 1<1926>
ImageLength (257) SHORT (3) 1<1953>
…
Directory 3: offset 18192 (0x4710) next 18890 (0x49ca)
SubFileType (254) LONG (4) 1<1>
ImageWidth (256) SHORT (3) 1<963>
ImageLength (257) SHORT (3) 1<977>
…
Directory 4: offset 18890 (0x49ca) next 19204 (0x4b04)
SubFileType (254) LONG (4) 1<1>
ImageWidth (256) SHORT (3) 1<482>
ImageLength (257) SHORT (3) 1<489>
…
Directory 5: offset 19204 (0x4b04) next 19422 (0x4bde)
SubFileType (254) LONG (4) 1<1>
ImageWidth (256) SHORT (3) 1<241>
ImageLength (257) SHORT (3) 1<245>
…
The output of tiffdump will show the Image File Directories (IFDs) in the TIFF. For a COG, the first IFD is the full-resolution image and the remaining IFDs are the overviews, each of which should be half the resolution of the previous IFD. The above example demonstrates a 7701×7811 source image with five overviews (Directories 1 through 5), each half the size of the previous level, until the size falls under 256×256. If an overview level is skipped or there are not enough overviews, the TIFF will not be efficient in cloud environments. This is a great surface-level check to see if an image could be a COG without having GDAL installed.
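The same check can be scripted. The sketch below is one possible approach, not part of tifftools itself: is_halving_pyramid is a hypothetical helper that assumes each level is half the previous one, rounded up, and ifd_sizes assumes tifftools.read_tiff returns a dictionary with an "ifds" list whose tags are keyed by tag number (256 for ImageWidth, 257 for ImageLength). The sizes are copied from the dump above.

```python
def is_halving_pyramid(sizes, min_size=256):
    """Check that each level is half the previous one and the last fits in min_size.

    Assumption: halving rounds up; a writer that rounds down would fail this check.
    """
    if not sizes:
        return False
    for (prev_w, prev_h), (w, h) in zip(sizes, sizes[1:]):
        if (w, h) != ((prev_w + 1) // 2, (prev_h + 1) // 2):
            return False
    last_w, last_h = sizes[-1]
    return max(last_w, last_h) <= min_size


def ifd_sizes(path):
    """Read (ImageWidth, ImageLength) for every top-level IFD using tifftools."""
    import tifftools  # imported lazily so the check above runs without it

    info = tifftools.read_tiff(path)
    return [
        (ifd["tags"][256]["data"][0], ifd["tags"][257]["data"][0])
        for ifd in info["ifds"]
    ]


# Sizes from Directories 0 through 5 in the tiffdump output.
sizes_from_dump = [(7701, 7811), (3851, 3906), (1926, 1953),
                   (963, 977), (482, 489), (241, 245)]
print(is_halving_pyramid(sizes_from_dump))
```

With tifftools installed, is_halving_pyramid(ifd_sizes("raster.tif")) would run the same check against a file on disk.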
Caveats
On some occasions, we have been handed imagery that was supposed to be a COG but in fact was not. This can happen when the image is not properly tiled (the data bytes aren’t properly aligned) or when the image is missing some or all of its overviews.
We downloaded such an example of Landsat 8 imagery from the USGS Earth Explorer (LC08_L1TP_016039_20170829_20170914_01_T1_B7.TIF), and we have hosted the data here for anyone to inspect.
Running the GDAL COG validation script for this image shows no errors but does warn about the missing overviews:
$ python validate_cloud_optimized_geotiff.py --full-check=yes LC08_L1TP_016039_20170829_20170914_01_T1_B7.TIF
The following warnings were found:
- The file is greater than 512xH or Wx512, it is recommended to include internal overviews
LC08_L1TP_016039_20170829_20170914_01_T1_B7.TIF is a valid cloud optimized GeoTIFF
Although the image is tiled, without overviews it is quite inefficient for streaming in cloud environments, because we cannot take advantage of HTTP range requests when requesting tiles at subresolutions (e.g., for viewing as a WMTS source on a slippy map).
To demonstrate this inefficiency, we can create a GDAL Virtual File System (VSI) path for this asset stored on our server as:
/vsicurl?url=https%3A%2F%2Fdata.kitware.com%2Fapi%2Fv1%2Ffile%2F626733ce4acac99f42fa5894%2Fdownload&use_head=no&list_dir=no
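A path like this can be assembled programmatically. The following is a small sketch using only the Python standard library; vsicurl_path is a hypothetical helper name, and the URL is the one encoded in the path above.

```python
from urllib.parse import urlencode


def vsicurl_path(url, use_head="no", list_dir="no"):
    """Build a /vsicurl path, percent-encoding the URL as a query parameter."""
    query = urlencode({"url": url, "use_head": use_head, "list_dir": list_dir})
    return f"/vsicurl?{query}"


path = vsicurl_path(
    "https://data.kitware.com/api/v1/file/626733ce4acac99f42fa5894/download"
)
print(path)
```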
Then we can attempt to access tiles of the image with GDAL, which uses HTTP range requests under the hood. Since this image does not have proper overviews, GDAL must download the entire image to disk rather than stream only the bytes it needs. We can see this happen when using the gdalwarp command to extract a region for a single WMTS tile:
Metric | Original COG | Corrected COG |
Image File Size | 80.4 MB | 109.8 MB |
Memory Usage | ~155 MB | ~25 MB |
Execution Time | ~17200 ms | ~14 ms |
# of HTTP Requests | 171 | 3 |
The script to perform this comparison, along with the original data and a COG-converted version of the image are available here.
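For readers who want to try something similar, here is a hedged sketch of how a single-tile extraction could be set up. It is not the script linked above. It computes the Web Mercator bounds of an XYZ tile and builds a gdalwarp argument list with the -t_srs, -te, and -ts options. The tile indices and file names are placeholders and do not correspond to the Landsat scene.

```python
import math

# Half the width of the Web Mercator (EPSG:3857) extent, in metres.
ORIGIN = math.pi * 6378137.0


def xyz_tile_bounds(x, y, z):
    """Return (min_x, min_y, max_x, max_y) in EPSG:3857 for an XYZ tile."""
    size = 2 * ORIGIN / 2 ** z
    min_x = -ORIGIN + x * size
    max_y = ORIGIN - y * size
    return (min_x, max_y - size, min_x + size, max_y)


def gdalwarp_tile_args(src, dst, x, y, z, tile_size=256):
    """Build a gdalwarp command that renders one tile from src into dst."""
    bounds = [str(value) for value in xyz_tile_bounds(x, y, z)]
    return ["gdalwarp", "-t_srs", "EPSG:3857", "-te", *bounds,
            "-ts", str(tile_size), str(tile_size), src, dst]


# Placeholder tile indices and paths, for illustration only.
print(" ".join(gdalwarp_tile_args("/vsicurl/https://example.com/image.tif",
                                  "tile.tif", x=1, y=1, z=1)))
```

The resulting list can be passed to subprocess.run on a machine where GDAL is installed.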
How do we convert a GeoTIFF to a COG?
The gdal_translate utility from GDAL is a robust way to convert a GeoTIFF to a COG, with the following example options being sufficient for most use cases:
gdal_translate input.tif output.tif -of COG -co BLOCKSIZE=256 -co BIGTIFF=IF_SAFER -co COMPRESS=LZW -co PREDICTOR=YES
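When converting many files, it can help to build this command in Python. The sketch below assembles the same options as an argument list; cog_translate_command is a hypothetical helper, and it only constructs the command rather than running it.

```python
def cog_translate_command(src, dst, blocksize=256):
    """Build the gdal_translate argument list for writing a COG."""
    return [
        "gdal_translate",
        "-of", "COG",
        "-co", f"BLOCKSIZE={blocksize}",
        "-co", "BIGTIFF=IF_SAFER",
        "-co", "COMPRESS=LZW",
        "-co", "PREDICTOR=YES",
        src,
        dst,
    ]


command = cog_translate_command("input.tif", "output.tif")
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) would perform the conversion on a machine where GDAL is installed.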
To make this operation easier and accessible within our large-image group of tools, we have created the large-image-converter package, which will run this command on GeoTIFFs to create COGs or use libvips to create a pyramidal TIFF for non-geospatial images.
from large_image_converter import convert
convert("input.tif", "output.tif")
Partner with Kitware
We work with large-image data in both the geospatial and medical capacities and are experienced with data standards for high performance in cloud environments. If you have questions about these technologies, or you would like to discuss your own geospatial and medical image problems and learn how we can help, please reach out at kitware@kitware.com. We look forward to the conversation!
Software
Here are the links to the software highlighted in this blog:
- Tifftools (Kitware): https://github.com/DigitalSlideArchive/tifftools
- Large-image (Kitware): https://github.com/girder/large_image
- GDAL: https://github.com/OSGeo/gdal
Thanks for sharing. How did you count the get requests?
Hi Daniele! We can set up a mock web server locally that supports HTTP range requests and then count the requests in the logs for that server while running the provided test script here. I specifically used rangehttpserver for this. Let me know if you have any further questions.
I’ve published the code and steps here as well.
Cool, I need to run some benchmarks, this is going to be very useful, thank you for sharing!