Adjusting Baseline Provisioning for Actual Configuration
This topic explains how to fine-tune the provisioning of the Stellar Cyber Platform based on the specifics of your deployment's configuration, from data retention needs to the number of sensors managed.
Use this topic as follows:
- Start with the baseline sizing guidance in Baseline Cluster Node VM System Requirements to size your deployment.
- Adjust your deployment's provisioning based on your platform's configuration and load using the instructions in the sections below.
These sizing adjustments are cumulative unless stated otherwise. If you enable multiple features, add the resource requirements from each relevant section.
If you are deploying the Stellar Cyber Platform on a dedicated server running different virtual machines, make sure you guarantee platform performance by observing the rules in Preparing a Server for Stellar Cyber Cluster Deployment.
Platform Sizing Summary
The table below summarizes the considerations you must apply when provisioning Stellar Cyber Platform components. In all cases, start with the baseline sizing rules. Then, adjust your provisioning as summarized in the table below. See the sections in this topic for details.
| Platform Component | Sizing Calculation |
|---|---|
| DL-Master Memory and CPU | Start with the baseline. Then, increase provisioning based on the adjustments described in the sections below. |
| DL-Worker Memory and CPU | Start with the baseline. Then, increase provisioning based on retention time for disk space considerations. |
| DL Cluster Size (Overall Cluster Memory/CPU) | Start with the baseline. Then, increase provisioning based on the adjustments described in the sections below. |
| DA Cluster Size (Overall Cluster Memory/CPU) | Start with the baseline. Then, increase provisioning based on the adjustments described in the sections below. |
Applying Additional Capacity Margin
The guidance in this topic assumes reservation of 30% additional capacity to account for product growth, feature changes, and real-world workload variation. If your deployment includes heavy query activity, many optional features, or unpredictable usage patterns, reserve additional capacity.
Adjusting for Data Retention Needs
The baseline configuration assumes 30 days of raw retention with three retention groups configured. If you increase retention time or configure more retention groups, you must either add more DL-Worker nodes or increase the storage of your existing DL-Worker nodes.
Data Retention Calculator
Use this calculator to estimate the number of DL workers required for your deployment. The result is based on daily ingestion, retention time, and the number of retention groups. The calculator automatically accounts for both retention overhead and storage growth and provides sizing using both baseline DL-Workers and maximum-capacity DL-Workers.
You can increase storage for a DL-Worker up to 16 TB. Each additional 250 GB of storage provides one additional day of retention beyond 30 days, assuming standard ingestion. Each increment also increases memory and CPU requirements. At maximum capacity, a DL-Worker supports approximately 50 days of retention with 16 TB storage, 160 GB memory, and 44 CPU cores (the maximum-capacity DL-Worker build shown in the calculator below).
The calculator uses a maximum daily ingestion of 1500 GB and a maximum retention time of 365 days. If you need additional ingestion or retention time, contact Stellar Cyber for assistance or use the equations in Data Retention Adjustments: The Math Behind the Calculator.
Enter your daily ingestion, retention time, and number of retention groups. The calculator shows both supported sizing models so you can compare standard workers against maximum-capacity workers.
Inputs
- Daily Ingestion (GB/day)
- Retention Time (days)
- Retention Groups
Valid ranges: 1–1500 GB/day, 1–365 days, and 1–5 retention groups.
Data Retention Adjustments: The Math Behind the Calculator
You must adjust Data Lake provisioning based on retention groups and retention time. Both increase storage demand and system overhead, which affects the number of DL-Workers and the resources each worker requires.
Retention groups increase compute overhead. Retention time increases storage requirements. Always size your DL-Workers based on the larger of the compute-bound and storage-bound results.
Retention Group Correction
The system supports up to five retention groups. Each additional group increases overhead. The sizing baseline assumes three retention groups.
RG = configured retention groups / 3
Example:
- If you configure five retention groups, RG = 5 / 3 = 1.67.
Retention Time Correction
The sizing baseline assumes 30 days of raw data retention. Longer retention requires additional storage, and in some cases additional compute resources.
RT = configured retention days / 30
Example:
- If you configure 90 days of retention, RT = 90 / 30 = 3.
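The two correction factors above can be sketched as small helper functions. This is an illustrative sketch, not a product API; the function names are our own.

```python
def retention_group_factor(groups: int) -> float:
    """RG = configured retention groups / 3 (the baseline assumes 3 groups)."""
    return groups / 3

def retention_time_factor(days: int) -> float:
    """RT = configured retention days / 30 (the baseline assumes 30 days)."""
    return days / 30

print(round(retention_group_factor(5), 2))  # 1.67, matching the example above
print(retention_time_factor(90))            # 3.0
```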
Baseline DL-Worker Capacity
A standard DL-Worker node provides the following baseline resources:
- 128 GB memory
- 32 CPU cores
- 9.5 TB storage
This baseline configuration is designed to support:
- 250 GB daily ingestion
- 30 days retention
Option 1: Add More DL-Workers at Baseline Spec
If you keep each DL-Worker at the baseline specification, calculate the number of workers as follows:
DL-Workers = ((x / 250) + 1) × RG × RT
- x = daily ingestion in GB/day
- RG = retention group factor
- RT = retention time factor
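A minimal sketch of the Option 1 calculation (the function name is illustrative). Passing the configured group and day counts directly, rather than pre-computed RG and RT factors, avoids floating-point rounding pushing the result past a whole node:

```python
import math

def dl_workers_baseline(daily_ingestion_gb: float, retention_groups: int,
                        retention_days: int) -> int:
    """DL-Workers = ((x / 250) + 1) * RG * RT, rounded up to whole nodes."""
    raw = ((daily_ingestion_gb / 250) + 1) * retention_groups * retention_days
    return math.ceil(raw / (3 * 30))  # divide out the RG and RT baselines

# 500 GB/day, five retention groups, 90 days retention
print(dl_workers_baseline(500, 5, 90))  # 15, matching the example below
```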
Option 2: Increase DL-Worker Resources
You can increase storage for a DL-Worker up to 16 TB. Each additional 250 GB of storage provides one additional day of retention beyond 30 days, assuming standard ingestion. Each increment also increases memory and CPU requirements.
DL-worker memory = 128 GB + ((Storage(TB) - 9.5) / 0.312) × 1.5 GB
DL-worker CPU = 32 + ((Storage(TB) - 9.5) / 0.312) × 0.5
At maximum capacity, a DL-Worker supports approximately 50 days of retention with 16 TB storage, 160 GB memory, and 44 CPU cores.
When you increase DL-Worker size, you must evaluate both compute capacity and storage capacity. The larger requirement determines the final sizing.
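The memory and CPU formulas above can be checked with a short sketch (function names are illustrative). Note that the raw formula results for a 16 TB worker land slightly below the provisioned figures, which are rounded up:

```python
def dl_worker_memory_gb(storage_tb: float) -> float:
    """Memory = 128 GB + ((Storage(TB) - 9.5) / 0.312) * 1.5 GB."""
    return 128 + ((storage_tb - 9.5) / 0.312) * 1.5

def dl_worker_cpu(storage_tb: float) -> float:
    """CPU = 32 + ((Storage(TB) - 9.5) / 0.312) * 0.5."""
    return 32 + ((storage_tb - 9.5) / 0.312) * 0.5

# Maximum-capacity worker with 16 TB storage
print(round(dl_worker_memory_gb(16)))  # ~159 GB; the guide provisions 160 GB
print(round(dl_worker_cpu(16)))        # ~42 cores; the guide provisions 44
```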
Maximum DL-Worker Calculation
Compute-bound = ((x / 250) + 1) × RG
Storage-bound = (x × 30 × RT) / (1000 × 12.8)
DL-Workers = max[compute-bound, storage-bound]
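The compute-bound versus storage-bound comparison can be sketched as follows (illustrative names, not a product API):

```python
import math

def dl_workers_max_spec(daily_ingestion_gb: float, rg: float, rt: float) -> int:
    """Size maximum-capacity DL-Workers by the larger of compute and storage needs."""
    compute_bound = ((daily_ingestion_gb / 250) + 1) * rg
    storage_bound = (daily_ingestion_gb * 30 * rt) / (1000 * 12.8)
    return math.ceil(max(compute_bound, storage_bound))

# 500 GB/day, three retention groups (RG = 1), 90 days retention (RT = 3)
print(dl_workers_max_spec(500, 1, 3))  # 4 (storage-bound, ~3.5, dominates)
```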
Examples
Baseline Case
- Ingestion: 500 GB/day
- RT = 1
- RG = 1
DL-Workers = ((500 / 250) + 1) = 3
Result: 3 DL-Workers
High Retention and Groups
- Ingestion: 500 GB/day
- RT = 3 (90 days)
- RG = 1.67 (five retention groups)
DL-Workers = ((500 / 250) + 1) × 1.67 × 3 ≈ 15
Result: 15 DL-Workers
Max DL-Worker Size
Compute-bound = ((500 / 250) + 1) × 1 = 3
Storage-bound = (500 × 30 × 3) / (1000 × 12.8) ≈ 4
Max of Compute-bound vs. Storage-bound = 4
Result: 4 DL-Workers
If you configure retention longer than 60 days and have more than two retention groups, disable the Maximized Data Storage (MDS) option on the DL-Master node, as described in Best Practices for High Availability.
For additional system-wide sizing considerations, see Platform Sizing Summary.
Adjusting DA Provisioning for Data Sinks
Each enabled data sink increases resource usage on the DA-Master and DA-Worker nodes. Apply these increases in addition to any other sizing changes.
CPU Requirement
For each 250 GB/day of ingestion, add 4 CPU cores to each DA node for each data sink. This requirement applies to both the DA-Master and DA-Worker nodes.
Memory Requirement
For each 250 GB/day of ingestion, add 16 GB memory to each DA node for each data sink.
| Ingestion | Base Memory without Data Sinks | Memory with x Data Sinks | Base CPU without Data Sinks | CPU with x Data Sinks |
|---|---|---|---|---|
| 250 GB/day | 64 GB | 64 + 16×x GB | 32 | 32 + 4×x |
| 500 GB/day | 128 GB | 128 + 32×x GB | 64 | 64 + 8×x |
| 1250 GB/day | 320 GB | 320 + 80×x GB | 160 | 160 + 20×x |
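The rows in the table above follow a single pattern: base memory and CPU scale with ingestion in 250 GB/day units, and each data sink adds 16 GB memory and 4 CPU cores per unit. A minimal sketch (illustrative function name):

```python
def da_node_with_sinks(daily_ingestion_gb: float, sinks: int):
    """Return (memory GB, CPU cores) per DA node, including data-sink overhead."""
    units = daily_ingestion_gb / 250      # ingestion in 250 GB/day units
    memory_gb = 64 * units + 16 * units * sinks
    cpu = 32 * units + 4 * units * sinks
    return memory_gb, cpu

print(da_node_with_sinks(500, 2))  # (192.0, 80.0), matching the 500 GB/day row
```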
You can meet these requirements by scaling out with more nodes or by increasing the size of existing nodes.
For other DA adjustments, see Adjusting for Connectors and Adjusting for Additional Features.
Adjusting for Connectors
The baseline model assumes 25 connector units for every 250 GB/day of ingestion. One large connector can consume multiple connector units.
Each connector uses an average of 500 MB memory. For every 25 additional connectors beyond the baseline, reserve 12 GB more memory and four more CPU cores.
For example, if ingestion is 250 GB or less, you can configure up to 25 connectors with the baseline provisioning. For each additional set of 25 connectors, you must provision an additional 12 GB of memory and four CPU cores.
| Ingestion | Included Connectors | Additional Memory | Additional CPU |
|---|---|---|---|
| 250 GB/day | 25 | 12 GB per additional 25 connectors | 4 CPU per additional 25 connectors |
| 500 GB/day | 50 | 12 GB per additional 25 connectors | 4 CPU per additional 25 connectors |
| 1250 GB/day | 125 | 12 GB per additional 25 connectors | 4 CPU per additional 25 connectors |
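The connector rule above can be sketched as a small helper (illustrative name, not a product API): subtract the connectors included with your ingestion level, then charge 12 GB and 4 cores per additional set of 25.

```python
import math

def connector_overhead(daily_ingestion_gb: float, connectors: int):
    """Return (extra memory GB, extra CPU cores) beyond the included connectors."""
    included = 25 * (daily_ingestion_gb / 250)   # 25 connectors per 250 GB/day
    extra_sets = math.ceil(max(0, connectors - included) / 25)
    return 12 * extra_sets, 4 * extra_sets

print(connector_overhead(250, 60))  # (24, 8): 35 extra connectors -> 2 sets of 25
```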
If your deployment uses many connectors, review connector count together with data sink and feature overhead. See Adjusting DA Provisioning for Data Sinks and Adjusting for Additional Features.
Adjusting for External API Activity
If the Stellar Cyber Platform is expected to handle a high volume of external API traffic (for example from scripts or other automation), you must increase system resources accordingly. The required adjustment depends on the API type and usage pattern.
ElasticSearch Query APIs
ElasticSearch query performance depends heavily on query characteristics, including the data being queried, aggregation depth, and overall query complexity.
Because these variables can differ significantly across environments, a fixed sizing model is not practical. Use average query behavior as a baseline and adjust resources further based on observed workload and performance.
Alert Index (SER) Queries
Queries to the Alerts (aella-ser-*) index typically have relatively low overhead and scale predictably across query rate and data volume.
Query Rate
- Baseline: 1 SER query per second
- For each additional SER query per second, increase system capacity by 5%
Data Volume
- Baseline: 1 day of SER data across all tenants queried per second
- For each additional unit of SER data queried per second, increase system capacity by 5%
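Under the stated 5%-per-unit guidance, the extra capacity for SER query load can be estimated as follows. This is a sketch of the rule of thumb above, with an illustrative function name:

```python
def ser_capacity_increase_pct(queries_per_sec: float,
                              data_units_per_sec: float) -> float:
    """Percent extra capacity over baseline (1 query/s and 1 day-unit/s): 5% each."""
    extra_rate = max(0.0, queries_per_sec - 1)
    extra_data = max(0.0, data_units_per_sec - 1)
    return 5 * extra_rate + 5 * extra_data

print(ser_capacity_increase_pct(3, 2))  # 15.0 -> provision 15% more capacity
```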
Raw Data Queries
Raw data queries are resource-intensive and require more aggressive scaling.
A query that retrieves one month of raw data without aggregation can occupy system resources for approximately 60 seconds.
External raw data query capacity is measured by data retrieval rate:
- One unit equals 15 minutes of data retrieved per second
- For example, querying one month of data across all tenants requires a minimum interval of 2,880 seconds under baseline conditions.
For each additional unit of raw data query throughput, increase system capacity by 20%.
Raw data queries that include aggregation require additional resources beyond this baseline.
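The retrieval-rate unit above implies a minimum spacing between raw data queries: a query spanning N days covers N × 24 × 4 fifteen-minute chunks, and one capacity unit retrieves one chunk per second. A sketch (illustrative name):

```python
def min_query_interval_seconds(days_of_data: float,
                               capacity_units: float = 1.0) -> float:
    """Minimum seconds between raw-data queries to stay within retrieval capacity."""
    fifteen_min_chunks = days_of_data * 24 * 60 / 15
    return fifteen_min_chunks / capacity_units

print(min_query_interval_seconds(30))  # 2880.0 seconds, matching the example
```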
Case APIs
Case API overhead is primarily driven by the total number of cases stored in the system. As case volume increases, allocate additional resources to maintain performance and responsiveness.
Size the Stellar Cyber Platform based on the highest expected external API demand across these categories.
Adjusting for Concurrent User Sessions
Concurrent user sessions primarily affect the CPU and memory required to support interactive access to the Stellar Cyber Platform.
The first 25 concurrent UI sessions do not require additional capacity from the Platform's baseline configuration. For every additional set of 25 concurrent UI sessions, add:
- 8 GB memory
- 4 CPU cores
Examples
- Up to 25 concurrent sessions – no additional resources required
- 26 to 50 concurrent sessions – add 8 GB memory and 4 CPU cores
- 51 to 75 concurrent sessions – add 16 GB memory and 8 CPU cores
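The session rule above can be sketched as a helper (illustrative name): the first 25 sessions are free, and each further set of 25, rounded up, adds 8 GB and 4 cores.

```python
import math

def session_overhead(concurrent_sessions: int):
    """Return (extra memory GB, extra CPU cores) beyond the 25-session baseline."""
    extra_sets = math.ceil(max(0, concurrent_sessions - 25) / 25)
    return 8 * extra_sets, 4 * extra_sets

print(session_overhead(51))  # (16, 8), matching the 51-75 session example
```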
In addition to the session count, you must also consider user-created queries. Query traffic generated by users in the user interface requires the same resource adjustments as external API query traffic.
When session concurrency and query volume are both high, size the system based on the combined resource requirements.
Adjusting for ATH Playbooks
Automated Threat Hunting (ATH) Playbooks consume system resources based on the complexity and frequency of their queries. Resource usage varies significantly depending on how each playbook is defined and how often it runs.
To estimate the impact of your playbooks, evaluate each playbook based on its execution time relative to its run interval.
Playbook Weight
The relative impact of an ATH Playbook is expressed as its weight:
Weight = ES query execution time / playbook run interval
For example, if a playbook runs every hour and requires 36 seconds to complete:
Weight = 36 / 3600 = 0.01
This value represents the proportion of system query capacity consumed by that playbook.
Capacity Guidelines
The Stellar Cyber Platform can process multiple queries in parallel. To maintain stable performance, the combined weight of all ATH Playbooks must remain within a defined limit.
- The total weight of all ATH Playbooks should not exceed 0.4.
- If the total weight exceeds this threshold, increase system capacity.
You can reduce playbook weight by optimizing queries or increasing available compute resources.
Resource Scaling
If ATH Playbook demand exceeds system capacity, adjust provisioning as follows:
- Increase Data Lake capacity to support additional query load.
- Optimize or reduce playbook execution time where possible.
Memory Requirements
ATH Playbooks also increase memory requirements based on the total number of configured playbooks:
- For every 300 ATH Playbooks, add 10 GB of memory.
- If the total exceeds 1000 ATH Playbooks, you must add a Data Lake Coordinating node to your cluster to run the playbooks.
Example
Assume you have the following ATH Playbooks configured:
- Playbook A runs every 60 minutes and requires 36 seconds to complete.
- Playbook B runs every 30 minutes and requires 18 seconds to complete.
- Playbook C runs every 15 minutes and requires 9 seconds to complete.
Calculate the weight of each playbook:
Playbook A = 36 / 3600 = 0.01
Playbook B = 18 / 1800 = 0.01
Playbook C = 9 / 900 = 0.01
Total playbook weight:
Total weight = 0.01 + 0.01 + 0.01 = 0.03
In this example, the combined playbook weight is 0.03, which is below the recommended limit of 0.4. No additional query-related capacity is required.
Now assume the system has 650 ATH Playbooks configured in total.
Memory requirement:
650 / 300 = 2.17
Round up to the next full increment and add memory for 3 groups of 300 playbooks:
3 × 10 GB = 30 GB additional memory
In this example, the system remains within the recommended playbook weight limit, but still requires 30 GB of additional memory based on total ATH Playbook count.
When both playbook count and playbook complexity are high, size the system based on the combined impact of total weight and memory requirements.
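The worked example above can be reproduced with a short sketch (illustrative function name): sum each playbook's execution-time-to-interval ratio, and round the total playbook count up to full groups of 300 for memory.

```python
import math

def ath_sizing(playbooks, total_playbook_count: int):
    """Return (total weight, extra memory GB, within 0.4 weight limit).

    `playbooks` is a list of (execution_seconds, interval_seconds) pairs.
    """
    total_weight = sum(exec_s / interval_s for exec_s, interval_s in playbooks)
    extra_memory_gb = math.ceil(total_playbook_count / 300) * 10
    return total_weight, extra_memory_gb, total_weight <= 0.4

# Playbooks A, B, and C from the example, in a system with 650 playbooks total
weight, mem, ok = ath_sizing([(36, 3600), (18, 1800), (9, 900)], 650)
print(round(weight, 2), mem, ok)  # 0.03 30 True
```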
Adjusting for Managed Sensor Volume
The number of managed sensors has a direct effect on system performance, particularly for Data Lake processing. As the quantity of managed sensors increases, you must provision additional CPU and memory to maintain consistent ingestion and coordination performance.
Capacity Guidelines for Sensors
The baseline configuration supports up to 1000 sensors without additional resource requirements. Beyond this baseline, you must scale system resources incrementally based on the total number of sensors using the following rule of thumb:
- For every additional 1000 sensors, add:
- 20 GB of memory
- 4 CPU cores
Example
For example, a deployment with 2500 sensors has 1500 sensors more than the baseline value of 1000 and would require an additional 30 GB of memory and 6 CPU cores.
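As in the example above, the rule scales linearly with the sensor count beyond the baseline. A minimal sketch (illustrative name):

```python
def sensor_overhead(sensor_count: int):
    """Return (extra memory GB, extra CPU cores) beyond the 1000-sensor baseline."""
    extra = max(0, sensor_count - 1000)
    return extra / 1000 * 20, extra / 1000 * 4

print(sensor_overhead(2500))  # (30.0, 6.0), matching the example above
```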
Adjusting for Additional Features
Some optional features add resource overhead beyond baseline sizing. Adjust your Platform's provisioning based on your usage of the features listed below.
Entity-Based Asset Licensing
If your deployment uses the new entity-based asset counting model (as opposed to the classic asset counting model), adjust your provisioning as follows:
For every 250 GB/day of ingestion, add the following to each DA-Master and DA-Worker node:
- 2 CPU cores
- 1 GB memory
InSyncs
If your deployment is integrated with ServiceNow using InSyncs, you must adjust your Data Lake provisioning by adding the following to all DL-Master and DL-Worker nodes:
- 2 GB memory
- 2 CPU cores
Actual overhead depends on the ServiceNow table update rate.
Webhook Ingestion
If your deployment ingests webhooks using the XDR connector, webhook ingestion adds overhead to the DA cluster. Adjust the provisioning of all DA-Master and DA-Worker nodes as follows:
- Each webhook-ingest-fluentd replica requires 500 MB memory
- Adding memory does not improve performance
| Workers per Replica | CPU (millicores) | Approximate EPS |
|---|---|---|
| 1 | 1000 | 1000 |
| 2 | 2000 | 1950 |
| 6 | 6000 | 5300 |
Horizontal scaling usually provides the best efficiency.
TIPv2
TIPv2 increases load on the DL cluster and Elasticsearch. You might need a dedicated coordinating node or a separate Elasticsearch service design.
Understanding Factors That Are Difficult to Model
Some workloads cannot be predicted precisely and can increase resource usage significantly:
- Complex dashboard queries
- Machine learning job memory usage
If you expect heavy dashboard use or large ML workloads, plan additional capacity beyond the baseline sizing model.
Guidance for Existing Appliance Deployments
If you use an existing appliance deployment, older specifications can still be supported in some cases. However, older appliance sizing leaves little or no room for additional features such as TIPv2 or Webhook Ingest.
As deployments grow, the DL-Master and DA-Master can require separate larger systems. In larger environments, a combined role on the same appliance may no longer be sufficient.
