Scaling Up the Data Lake with Coordinating Nodes

You must have Root scope to use this feature.

You can scale a DP deployment by adding Data Lake and Data Analyzer worker nodes to achieve the higher data ingestion rates, greater storage capacity, and longer data retention in warm storage described in Capacity Planning for Data Replication and Clustering.

Once you've built out your cluster to the point where it includes six or more data nodes, Stellar Cyber recommends that you enhance cluster performance by adding a Data Lake Coordinating Node.

  • A Data Lake Coordinating Node (DL-CN) is a special type of Data Lake worker that does not store any ElasticSearch data itself, but instead focuses on coordinating data searches and requests from different DP services to the data nodes in the cluster. The DL-CN offloads that responsibility from the data nodes themselves, improving overall Data Lake performance. 

The figure below shows a cluster with six Data Nodes that has been provisioned with an additional Data Lake Worker operating as a Coordinating Node.

Data Nodes Defined

A data node is any Data Lake component that stores ElasticSearch data, as follows:

  • Data Lake Master nodes with the Maximized Data Storage (MDS) option enabled.

    You can use the show mode command to see the current setting for the MDS option. The MDS option specifies whether the DL node stores data itself (enabled) or only manages storage and ElasticSearch operations (disabled). As you scale up to a cluster deployment with three or more DL-worker nodes, you disable MDS on the DL-Master and provision it with less disk space. The DL-worker nodes have MDS enabled and handle the actual storage, while the DL-Master provides storage management and search.

  • Data Lake Worker nodes that are not operating as Coordinating Nodes. Coordinating Nodes are not data nodes because they do not store ElasticSearch data.

Once your cluster includes six or more of the node types listed above, Stellar Cyber recommends adding a Data Lake Coordinating Node to improve performance.
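
The data node definition and the six-node threshold above can be restated as a short check. The following Python sketch is illustrative only and is not part of the product: the LakeNode class, its field names, and the role labels "dl-master" and "dl-worker" are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LakeNode:
    """Hypothetical description of one Data Lake node (not a product API)."""
    name: str
    role: str                      # "dl-master" or "dl-worker" (labels assumed)
    mds_enabled: bool = True       # Maximized Data Storage option
    is_coordinating: bool = False  # True if running as a Coordinating Node


def is_data_node(node: LakeNode) -> bool:
    """A data node is any Data Lake component that stores ElasticSearch data."""
    if node.role == "dl-master":
        # A master stores data only when MDS is enabled.
        return node.mds_enabled
    if node.role == "dl-worker":
        # A worker stores data unless it operates as a Coordinating Node.
        return not node.is_coordinating
    return False


def coordinating_node_recommended(nodes: list[LakeNode], threshold: int = 6) -> bool:
    """True once the cluster contains `threshold` or more data nodes."""
    return sum(is_data_node(n) for n in nodes) >= threshold


# Example: a master with MDS disabled plus six ordinary workers.
cluster = [LakeNode("dl-master", "dl-master", mds_enabled=False)]
cluster += [LakeNode(f"dl-w{i}", "dl-worker") for i in range(1, 7)]
print(coordinating_node_recommended(cluster))
```

In this example the master does not count as a data node because MDS is disabled, so the six workers alone reach the threshold.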

Provisioning a Data Lake Coordinating Node

You provision a Data Lake Coordinating Node with less memory and disk space than a normal Data Lake worker. There are two main reasons for this:

  • The DL-CN doesn't require the same disk space as a typical DL-w because it doesn't store any ElasticSearch data itself.

  • A DL-w is only considered as a candidate for Coordinating Node status if it is provisioned with less memory and disk space than a standard DL-w.

With this in mind, Stellar Cyber recommends that you provision a Data Lake Coordinating Node using the same minimum specifications as those for a Data Analyzer node:

  • vCPUs – 16 (same as DL specification)

  • RAM – 64 GB (half of DL specification)

  • OS Disk – 500 GB (same as DL specification)

  • DL Disk Space – Not required

Eligibility for Coordinating Node Status – The Details

A DL-w's eligibility for Coordinating Node status is determined both by its provisioning and the value of the set mode coordinate option in its CLI:

  • A DL-w is only considered a candidate to run as a Coordinating Node if its provisioned disk space and memory are less than 80% of those of the largest existing DL-w in the cluster. The cluster always gives priority to creating new Data Lake Workers as data nodes rather than coordinating nodes, so Stellar Cyber recommends provisioning a Coordinating Node using the DA specifications listed above, which use less memory and do not provision secondary DL storage.

  • A DL-w is only considered as a candidate to run as a Coordinating Node if the set mode coordinate option is set to a value other than disable (either dynamic or an integer value). You can see the current setting of the mode option using show mode in the CLI. Adding a Data Lake Coordinating Node provides details on the different set mode coordinate options.

    Once you've used set mode coordinate to specify either dynamic or an integer value, the node appears in the System | Data Lake | Node List with true in its Coordinate Candidate column, as shown in the example below:
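
As a separate illustration from the Node List screenshot, the two eligibility conditions in this section can be combined into one check. This Python sketch is a simplified model, not the cluster's actual selection logic. It assumes that both memory and DL disk space must fall below the 80% threshold, that "largest" is evaluated per resource, and that at least one DL-w already exists. The Worker class, its fields, and the 2000 GB disk size are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of a DL-w's provisioning and CLI mode setting."""
    name: str
    memory_gb: int
    dl_disk_gb: int
    coordinate_mode: str = "disable"  # "disable", "dynamic", or an integer as text


def is_coordinate_candidate(worker: Worker, existing: list[Worker],
                            ratio: float = 0.8) -> bool:
    """Apply both conditions: mode is not disable, and resources are under 80%."""
    if worker.coordinate_mode == "disable":
        return False
    # Assumption: compare against the largest value of each resource separately.
    max_memory = max(w.memory_gb for w in existing)
    max_disk = max(w.dl_disk_gb for w in existing)
    return (worker.memory_gb < ratio * max_memory
            and worker.dl_disk_gb < ratio * max_disk)


# Existing data nodes: 128 GB RAM (twice the 64 GB DL-CN figure), disk size assumed.
existing = [Worker("dl-w1", memory_gb=128, dl_disk_gb=2000)]

small = Worker("dl-cn1", memory_gb=64, dl_disk_gb=0, coordinate_mode="dynamic")
large = Worker("dl-w2", memory_gb=128, dl_disk_gb=2000, coordinate_mode="dynamic")
unset = Worker("dl-w3", memory_gb=64, dl_disk_gb=0)  # mode left at disable

for w in (small, large, unset):
    print(w.name, is_coordinate_candidate(w, existing))
```

Only the first worker qualifies here: the second is provisioned like a data node, and the third still has its coordinate mode set to disable.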

Adding a Data Lake Coordinating Node

You add a Data Lake Coordinating Node using the same general procedure you use to add any new worker node:

  1. Launch and configure the VM for the DL-w, ensuring that it is provisioned using the specifications described in Provisioning a Data Lake Coordinating Node.

  2. Configure the node in the CLI as a resource.

  3. Add the node in the user interface, converting its resource role to Data Lake Worker.

  4. Connect to the node's CLI and use the set mode coordinate option to specify either dynamic or an integer value (refer to About Coordinating Mode Options).

You can find detailed examples of each of these steps for a cloud-based deployment starting with the procedures in this section. You can do the same with a physical appliance by changing its role to resource, adding it to a cluster as a resource, and then reconfiguring it as a DL-w in the user interface.

About Coordinating Mode Options

You use the set mode coordinate [dynamic | <1..x>] command to make a node eligible for selection as a DL Coordinating Node:

  • Dynamic – If you select this option, the system creates one Coordinating Node for every three data nodes in the cluster. If there are not sufficient nodes available for selection based on their provisioning, the system creates as many as it can up to the maximum of one for every three data nodes.

  • Integer <1..x> – If you state a specific number of Coordinating Nodes to create, the system attempts to create the number specified. If there are not sufficient nodes available for selection based on their provisioning, the system creates as many as it can up to the number specified.
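
The arithmetic behind the two options can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the product's implementation: it treats the mode as a single cluster-wide value, and it rounds "one for every three data nodes" down with integer division. Both function names are hypothetical.

```python
def requested_coordinating_nodes(mode: str, data_nodes: int) -> int:
    """Number of Coordinating Nodes the system tries to create for a mode value."""
    if mode == "disable":
        return 0
    if mode == "dynamic":
        # One Coordinating Node per three data nodes (rounding down is assumed).
        return data_nodes // 3
    # Otherwise the mode is an explicit integer given as text, for example "2".
    return int(mode)


def created_coordinating_nodes(mode: str, data_nodes: int,
                               eligible_candidates: int) -> int:
    """The system creates as many as it can, up to the requested number."""
    return min(requested_coordinating_nodes(mode, data_nodes), eligible_candidates)


# Six data nodes in dynamic mode with plenty of candidates.
print(created_coordinating_nodes("dynamic", data_nodes=6, eligible_candidates=5))
# Nine data nodes in dynamic mode but only one eligible candidate.
print(created_coordinating_nodes("dynamic", data_nodes=9, eligible_candidates=1))
# Explicit request for four with only two eligible candidates.
print(created_coordinating_nodes("4", data_nodes=6, eligible_candidates=2))
```

In every case the result is capped by the number of candidates whose provisioning makes them eligible.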

Note the following:

  • The system always gives priority to creating data nodes. If a node has set mode coordinate enabled (that is, set to either dynamic or an integer) but is provisioned with sufficient resources to run as a data node, it will still run as a data node and not be selected as a coordinating node. Re-provision the node with the resources listed in Provisioning a Data Lake Coordinating Node and try again.

  • If you have added more coordinating node candidates to the cluster than are requested by the set mode coordinate option, the excess DL-w candidates will be idle. For example, if you have added three DL Coordinating Node candidates to the cluster but set mode coordinate is set to 2, one of the candidates will be idle.
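
The two notes above can be pictured as a role assignment over the worker nodes. The Python sketch below is a simplified model rather than the system's actual algorithm. The assign_roles function and the tuple layout are hypothetical, and selecting candidates in list order is an assumption, since the order the cluster uses is not described here.

```python
def assign_roles(workers: list[tuple[str, bool, bool]], requested: int) -> dict[str, str]:
    """Assign each worker a role.

    Each worker is (name, mode_enabled, below_threshold), where mode_enabled
    means set mode coordinate is dynamic or an integer, and below_threshold
    means the node is provisioned below the data node resource level.
    """
    roles: dict[str, str] = {}
    selected = 0
    for name, mode_enabled, below_threshold in workers:
        if not (mode_enabled and below_threshold):
            # Data nodes always take priority.
            roles[name] = "data"
        elif selected < requested:
            roles[name] = "coordinating"
            selected += 1
        else:
            # More candidates than requested: the excess stay idle.
            roles[name] = "idle"
    return roles


workers = [
    ("dl-w1", False, False),   # ordinary data node
    ("dl-w7", True, True),     # candidate
    ("dl-w8", True, True),     # candidate
    ("dl-w9", True, True),     # candidate beyond the requested count
    ("dl-w10", True, False),   # mode enabled, but provisioned like a data node
]
for name, role in assign_roles(workers, requested=2).items():
    print(name, role)
```

With three candidates and a requested count of 2, one candidate ends up idle, matching the example in the note above. The node with data-node-level resources runs as a data node even though its coordinate mode is enabled.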

Viewing Data Lake Coordinating Nodes in the User Interface

The System | Data Lake page lists Data Lake Coordinating Nodes in the following locations:

  • The Data Lake Configuration table includes a Running Coordinating Nodes column that indicates whether the Data Lake is running coordinating nodes.

  • You can click the entry for the Data Lake in the Node List column to see the actual nodes in the lake, including columns indicating whether each node is a candidate for Coordinating Node or is currently running as a Coordinating Node (see the illustration in Eligibility for Coordinating Node Status – The Details).