AI in Cybersecurity: Offensive AI, Defensive AI & the Crucial Data Foundation, Part 3 of 3
Powering the AI Shield: Data Foundations from Cloud to Edge
The third in a series of three guest blog posts by Roy Chua, Founder and Principal at AvidThink
Welcome to the concluding post in our series on Artificial Intelligence in Cybersecurity. In Part 1, we explored how AI is empowering cyber attackers. In Part 2, we examined AI-powered cyber defenses, including strategies like behavioral analysis and the promise of autonomous response systems.
In this final post, we will expand our discussion on the importance of the data foundation. As we pointed out in our second blog post, sophisticated algorithms are only as good as the information they have access to—for training or during AI/ML inferencing. We will explore data quality challenges, the complexities of gathering data across modern hybrid environments, and considerations for enabling observability that can feed AI.
The Data Quality Imperative for AI Security
If we feed AI models erroneous or incomplete data, we should expect unreliable detection, excessive false positives, and a false sense of security. Cybersecurity data presents unique quality challenges that we can summarize by borrowing from the four “Vs” of big data:
- Volume: Security systems generate terabytes or petabytes of data daily (and perhaps exabytes before long) from network flows, endpoint and security device logs, cloud events, application logs, and more.
- Velocity: Much of this data will arrive across multiple real-time streams that must be processed immediately for timely threat detection.
- Variety: Data comes from diverse sources in myriad formats: structured logs with varying schemas, unstructured text (threat reports, emails), packet captures and network intelligence, responses from external APIs, and more.
- Veracity: Ensuring the accuracy, completeness, and trustworthiness of security data is critical, as malicious actors will often tamper with logs or telemetry to hide their tracks.
To address these challenges, we will need to build robust data preparation pipelines. This involves cleansing data (handling errors, missing values, inconsistencies), normalizing it into standard formats suitable for analysis, and enriching it with additional context — for example, correlating external IP addresses with threat intelligence feeds, mapping internal asset information, or identifying applications using Deep Packet Inspection (DPI).
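To make this concrete, below is a minimal Python sketch of such a preparation step. It assumes raw log records arrive as dictionaries and uses a hypothetical in-memory threat-intelligence table; a production pipeline would substitute real feeds, schema validation, and error queues.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical threat-intel feed: external IP -> reputation tag.
# In practice this would query a commercial or open-source feed.
THREAT_INTEL = {"203.0.113.50": "known-c2", "198.51.100.7": "scanner"}

@dataclass
class NormalizedEvent:
    timestamp: datetime
    src_ip: str
    dst_ip: str
    bytes_sent: int
    threat_tag: Optional[str]  # enrichment result, None if no match

def prepare(raw: dict) -> Optional[NormalizedEvent]:
    """Cleanse, normalize, and enrich one raw log record.
    Returns None for records too malformed to repair."""
    # Cleansing: reject records missing mandatory fields.
    if not raw.get("src_ip") or not raw.get("dst_ip"):
        return None
    # Normalizing: coerce heterogeneous timestamps to UTC.
    try:
        ts = datetime.fromisoformat(raw["ts"]).astimezone(timezone.utc)
    except (KeyError, ValueError):
        ts = datetime.now(timezone.utc)  # fallback; flag for review in real systems
    # Normalizing: handle missing or string-typed numeric values.
    bytes_sent = int(raw.get("bytes", 0) or 0)
    # Enriching: correlate the external peer with threat intelligence.
    tag = THREAT_INTEL.get(raw["dst_ip"])
    return NormalizedEvent(ts, raw["src_ip"], raw["dst_ip"], bytes_sent, tag)

if __name__ == "__main__":
    raw_logs = [
        {"ts": "2024-05-01T12:00:00+02:00", "src_ip": "10.0.0.5",
         "dst_ip": "203.0.113.50", "bytes": "4096"},
        {"ts": "not-a-date", "src_ip": "10.0.0.9", "dst_ip": "198.51.100.7"},
        {"src_ip": "", "dst_ip": "192.0.2.1"},  # malformed: dropped
    ]
    for event in filter(None, map(prepare, raw_logs)):
        print(event)
```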
Access to properly labeled datasets, or a workflow to label data accurately at scale, is crucial for supervised machine learning models. Labeling network traffic as benign or malicious, or classifying malware samples by family, requires significant effort and domain expertise, but it is essential for training effective classifiers.
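As an illustration of where those labels end up, the short sketch below trains a classifier on a handful of hand-labeled flow records. The features, values, and labels are invented for illustration, and it assumes scikit-learn is available; real training sets would contain many thousands of expert-labeled examples.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each row: [bytes_per_second, packets_per_second, distinct_dst_ports]
X = [
    [1_200,   10,   2], [900,     8,   1],   # typical user traffic
    [50_000, 400, 150], [42_000, 380, 120],  # scan-like behavior
    [800,     6,   1], [60_000, 500, 200],
]
y = ["benign", "benign", "malicious", "malicious", "benign", "malicious"]

# Hold out a stratified test split so both classes are represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```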
An added complexity in security is that sensitive personally identifiable information (PII) or confidential business data will invariably be present in any data stream. Since we’re operating in the security domain, data privacy and compliance are non-negotiable. Therefore, organizations must implement strong governance, adhere to regulations like GDPR or CCPA, and align with frameworks such as the NIST AI Risk Management Framework to ensure sensitive data used for AI training and operation is handled responsibly and securely — this can be a significant in-house effort. Finally, continuous data quality validation and testing will be needed to ensure the ongoing reliability of AI-driven security insights.
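One common building block for such governance is pseudonymizing PII before records enter a training pipeline. The sketch below is illustrative only: it assumes a keyed hash (with the key held in a secrets manager, not in code) so that identities remain consistent across records without being directly reversible by analysts.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # placeholder secret
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible token."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"pii_{digest[:12]}"

def scrub(record: dict) -> dict:
    """Pseudonymize known PII fields and redact emails in free text."""
    out = dict(record)
    for field in ("username", "src_ip"):  # fields treated as PII here
        if field in out:
            out[field] = pseudonymize(out[field])
    if "message" in out:
        out["message"] = EMAIL_RE.sub(
            lambda m: pseudonymize(m.group()), out["message"])
    return out

print(scrub({"username": "alice", "src_ip": "10.0.0.5",
             "message": "login failed for alice@example.com"}))
```

Because the same input always maps to the same token, analysts and models can still correlate activity across records (for example, repeated failed logins by one pseudonymized user) without ever seeing the underlying identity.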
Data Gathering and Processing: Spanning the Cloud-to-Edge Continuum
Another complexity is the multi-domain nature of cybersecurity. Today’s enterprise environment is not confined to a single campus or data center. Organizations sprawl across hybrid and multi-cloud platforms, enterprise campuses, branch locations, remote work endpoints, SaaS applications, and increasingly, Internet of Things (IoT) devices and edge computing infrastructure. This highly distributed architecture expands the attack surface and complicates achieving consistent security visibility.
Gathering telemetry from these diverse locations presents many challenges:
- Consistency: While difficult, ensuring uniform monitoring and data collection policies across on-premises, cloud, and edge environments is necessary to achieve a holistic view.
- Edge Requirements: Edge computing will introduce new data sources (sensors, industrial controls, local compute nodes). Sending all this raw data back to a central analytics platform can be infeasible due to bandwidth limitations and latency requirements. This will drive the need for intelligent data processing at the edge: local filtering, aggregation, and initial analysis (see the sketch after this list). Organizations will want to ensure that their visibility tools and network intelligence solutions are edge-friendly and capable of efficient local operation.
- Privacy in Distributed Settings: As we discussed, collecting sensitive data from endpoints or remote locations raises privacy concerns. However, if local decisions can be made using distributed AI/ML models without sending data to a data center under a different legal jurisdiction, such systems can provide security while remaining compliant.
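To illustrate the edge processing mentioned above, here is a minimal Python sketch of local reduction: per-packet events are aggregated into compact per-flow summaries, and only traffic matching locally defined criteria is escalated upstream at full fidelity. The port list and record shapes are placeholders, not drawn from any real product.

```python
from collections import defaultdict

# Example ports an operator might deem worth reporting at full fidelity.
INTERESTING_PORTS = {23, 445, 3389}

def summarize(packets):
    """Aggregate packet events into per-flow counters; keep raw copies
    only for traffic touching ports flagged as interesting."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    escalated = []
    for p in packets:
        key = (p["src_ip"], p["dst_ip"], p["dst_port"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
        if p["dst_port"] in INTERESTING_PORTS:
            escalated.append(p)  # send upstream at full fidelity
    return flows, escalated

packets = [
    {"src_ip": "10.1.0.4", "dst_ip": "10.1.0.9", "dst_port": 443,  "size": 1400},
    {"src_ip": "10.1.0.4", "dst_ip": "10.1.0.9", "dst_port": 443,  "size": 600},
    {"src_ip": "10.1.0.7", "dst_ip": "10.1.0.2", "dst_port": 3389, "size": 120},
]
flows, escalated = summarize(packets)
print(len(flows), "flow summaries,", len(escalated), "escalated packets")
```

The design choice here is the essence of edge-friendliness: most traffic leaves the site as a few counters per flow, while the rare records that matter retain full detail for central analysis.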
As enterprises evolve their distributed IT setups, they will need a flexible and scalable architecture to gather and process data across the edge-to-cloud continuum. This will entail pushing intelligence outward and adopting AI/ML solutions suited to distributed deployments.
The Data-Centric Future of Cybersecurity
Throughout this series, we’ve seen how AI has reshaped, and will continue to reshape, cybersecurity, empowering both attackers and defenders. While sophisticated algorithms and autonomous capabilities capture news headlines, the foundation for effective AI defense is high-quality, comprehensive, and context-rich data.
To achieve this foundation, enterprises must make a strategic commitment to observability across the hybrid IT landscape, from the central cloud to the distributed edge. They must also implement robust processes for data preparation and quality management. Similarly, security solution providers serving these enterprises need to architect and build platforms that can accommodate data feeds across the IT cloud to edge continuum.
Across the three articles, we’ve touched on many data sources pertinent to AI-enabled cybersecurity defense, from the network to the application and across disparate systems (user identity, endpoints, cloud, etc.). Of these sources, one key stream in our networked economy is network traffic data.
Gaining visibility into network data requires technologies like Deep Packet Inspection (DPI) and Encrypted Traffic Classification (ETC) to extract essential insights from network traffic, even when packet payloads are encrypted. Enterprises will want to choose security solutions that incorporate such capabilities to provide deeper visibility into their data. Furthermore, such solutions must ensure a secure data/analytics pipeline and incorporate AI/ML models that support distributed deployments.
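To see why encrypted traffic remains classifiable without decryption, consider that per-flow metadata such as packet sizes and inter-arrival times still forms a usable signature. The toy heuristic below is purely illustrative, and is not how Qosmos ixEngine or any commercial ETC engine works; real engines combine far richer features with trained models.

```python
from statistics import mean, pstdev

def flow_features(sizes, gaps_ms):
    """Derive simple statistical features from observable metadata:
    packet sizes (bytes) and inter-arrival gaps (milliseconds)."""
    return {
        "mean_size": mean(sizes),
        "size_stdev": pstdev(sizes),
        "mean_gap_ms": mean(gaps_ms),
    }

def guess_class(f):
    """Toy rule: steady, large packets suggest streaming video;
    small, frequent packets suggest interactive traffic."""
    if f["mean_size"] > 1000 and f["size_stdev"] < 200:
        return "video-streaming"
    if f["mean_size"] < 200 and f["mean_gap_ms"] < 50:
        return "interactive"
    return "unknown"

video = flow_features([1350, 1380, 1360, 1370], [20, 21, 19, 22])
shell = flow_features([90, 120, 80, 110], [15, 30, 10, 25])
print(guess_class(video), guess_class(shell))
```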
At its heart, the AI arms race in cybersecurity is a data arms race. Enterprises and solution vendors who can master the art and science of collecting, refining, securing, and analyzing data will be ideally positioned to build resilient AI-powered defenses capable of protecting against the threats of today and tomorrow.
Take the Next Step with Enhanced Traffic Intelligence
Achieving the data quality needed for effective AI defense requires specialized tools that extract deep insights from network traffic.
The sponsor of this blog series, Enea, is a leader in embedded network intelligence, providing solutions like the Enea Qosmos Threat Detection SDK and the Enea Qosmos ixEngine, a market-leading DPI and ETC engine. Their technology delivers granular, real-time traffic classification and metadata extraction that cybersecurity market leaders already trust to power their advanced AI-based platforms. Qosmos ixEngine can provide visibility within the challenging landscape of encrypted traffic. It also saves vendors valuable data preparation time as Qosmos ixEngine-generated data is automatically cleansed, validated, organized, documented, labeled, and ready for vendors to use in AI applications.
For more information about Enea’s technology for AI — click here.