
ETL: Extract Data with Node.js

The first step in a Node ETL (Extract, Transform, Load) pipeline is to extract the data that we will be transforming and loading. In this tutorial, we will create the extract step of our ETL pipeline: choosing the data we need from the data source and writing the query that extracts it.

By the end of this tutorial, you should be able to create an extract step for a Node ETL pipeline.

Goal

Learn how to Extract data from our data source, the first step in a Node ETL pipeline.

Overview

The first step in an ETL pipeline is to extract the data that we will be transforming and loading. The extract phase is where you decide which data sources you’ll be extracting data from, and how exactly you’ll query the data.

In our example, we are going to extract all planet data from the NASA Exoplanet API. Your data and access pattern will likely differ from what we do here, and that’s okay, because the same basic concepts still apply.

We are going to create 2 functions: one to extract the data that we need from the data source, and another to orchestrate the different ETL stages.

For our data source, we have chosen a public API to extract data from because it’s easier for you to try out on your own. Your own data source is likely a database you’re connecting to directly. You would still follow this same approach, but would have to deal with connecting to and querying the database through whatever adapter or interface is available to you.
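
For example, if your source were a PostgreSQL database, the extract function might look something like this sketch, using the node-postgres package; the connection string and the planets table are assumptions for illustration:

const { Pool } = require("pg");

// Hypothetical connection; supply your own credentials or connection string.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Extracting from a database instead of an API: same idea, different adapter.
const getAllPlanetsFromDb = async () => {
  const result = await pool.query("SELECT * FROM planets");
  return result.rows;
};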

Differences aside, the important part is that you:

  • Select the data sources you’re extracting from
  • Decide what data to extract
  • Query for, or retrieve, the data through the method that is available

We have chosen to do a “full extraction” of the data available to us, meaning we are extracting all the records. You could do a “partial extraction” where you only extract certain records, e.g. records which have been updated since a specific time, or meet some other criteria. This comes down to how you write the query for extraction. Again, this is a case where your own data and needs will dictate how and what you extract from your data source.

Query the data source

This step requires the request and request-promise packages; if you haven’t already, install them with npm i request request-promise.

For our example, we’ll create a function to retrieve all data from the data source. To extract all the planets from the Exoplanet API, we’re using the following API request:

const rp = require("request-promise");

const EXO_API_URL =
  "https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI";

const getAllPlanets = () =>
  rp({
    uri: EXO_API_URL,
    method: "GET",
    qs: {
      table: "exoplanets",
      format: "json"
    },
    json: true
  });

The format of this API request isn’t really relevant to the topic of ETL, but for the sake of understanding, here’s what’s going on.

We are creating a request to the Exoplanet API by passing request-promise an options object that:

  • sets uri to the API endpoint
  • specifies that we are making a GET request
  • passes a qs object of querystrings to append to the URI, which in this case tells the Exoplanet API which table we are querying and the format to return the data in
  • tells request-promise to parse the returned JSON by passing json: true
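
That qs object is also where a partial extraction would be expressed. As a rough sketch, assuming the Exoplanet API accepts a where filter parameter (check the API documentation for the exact filtering syntax your source supports), a narrowed query might look like this:

// Sketch of a partial extraction: only fetch records matching a condition.
// The "where" parameter and its value are assumptions for illustration.
const getFilteredPlanets = () =>
  rp({
    uri: EXO_API_URL,
    method: "GET",
    qs: {
      table: "exoplanets",
      where: "pl_kepflag=1", // hypothetical filter condition
      format: "json"
    },
    json: true
  });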

If you want to learn more about making API requests, you can check out our tutorial Make API Requests with Request-Promise.

What you need to know about this request is that it returns a promise that will resolve with an array of all the planet records that we need to transform and then load into the destination. Because it returns a promise, we can use async/await to await the data before passing it off to the Transform step where we will iterate over each record and transform the data into a new structure.

Set up the pipeline

Next, we’ll create a function, startEtlPipeline, to orchestrate the different steps in our ETL pipeline. This function will coordinate each of the 3 steps: extracting data, transforming it, and loading it into the destination. We are using async/await here to await the data from the extract step, so our startEtlPipeline function needs to be an async function in order to allow us to use the await keyword.

It’s also important to catch any errors thrown when using async/await, so we wrap the await call in a try/catch block to handle any thrown errors or promise rejections. If the API call to the Exoplanet API fails, the catch handler will handle and log the error.

const startEtlPipeline = async () => {
  try {
    const planets = await getAllPlanets();
    console.log(`Extracted ${planets.length} planets from the API`);
    // TODO Transform step
    // TODO Load step
  } catch (err) {
    console.error(err);
  }
};
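
With the extract step wired in, kicking off the pipeline is just a matter of calling the function:

// Start the pipeline; errors are already handled inside startEtlPipeline.
startEtlPipeline();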

Wait for multiple requests (optional)

In our example, we are only making a single request to extract data from one data source. If you need to extract data from multiple data sources at once, or need to make multiple requests to the same data source, you can use Promise.all to wait for all the data to be extracted before proceeding to the next steps. Be careful, however: all of the requests will be in flight at the same time, which can overwhelm some resources!

An example would look something like this, assuming all the extractFromSource functions return a Promise:

const startEtlPipeline = async () => {
  try {
    const allExtractPromises = Promise.all([
      extractFromSourceA(),
      extractFromSourceB(),
      extractFromSourceC()
    ]);
    const [dataA, dataB, dataC] = await allExtractPromises;
    console.log(`Extracted data from sources A, B, and C`);
  } catch (err) {
    console.error(err);
  }
};

Above, we’re using array destructuring (an ES6 feature) in order to pull out the data returned from the different sources from the array returned by Promise.all.
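
If firing every request at the same time is too aggressive for your data sources, a simple alternative is to await each extraction in sequence. Here is a sketch using the same hypothetical extractFromSource functions:

const startEtlPipeline = async () => {
  try {
    // Sequential alternative: each extraction waits for the previous one,
    // trading overall speed for a gentler load on the data sources.
    const dataA = await extractFromSourceA();
    const dataB = await extractFromSourceB();
    const dataC = await extractFromSourceC();
    console.log(`Extracted data from sources A, B, and C`);
  } catch (err) {
    console.error(err);
  }
};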

Once you’ve extracted your data, the next step is to transform it.

Recap

In this tutorial, we discussed the Extract step of an ETL pipeline, and created a function to get the records from our data source. Later we will transform the data and load it into a new destination.

Further your understanding

  • How would you access the data from your own data source?
  • If you need to first connect to a database, how would you make sure the extract step only runs after the connection is established?
  • What would happen in the Promise.all example if one of the extract functions threw an error? How would the pipeline be affected?
