Entity Resolution for the Cyber Domain

The foundational level of situation awareness lies in perception of the surrounding environment. In cyberspace, this relates to an ability to enumerate and identify elements of the cyber terrain, particularly the network-connected devices that are employed to accomplish a user’s goals. These devices emit a plethora of signals in network traffic and server logs as they negotiate for services, but those signals do not share consistent features that would make it straightforward to uniquely identify which hosts are active over a period of interest. We present a cyber Entity Resolution (ER) approach and blocking technique designed to bridge this gap. The technique is based on the construction and comparison of periodic snapshots of collections of “host segments” that are built up from multiple log files. We present results from an open dataset in a cyber situation awareness prototype, BitBook, that visualizes the host segments generated from parsed network traffic over an observation window, and we briefly discuss its efficacy at discovering hosts in that experiment. We also describe a supervised machine learning approach to assist in ER of hosts between two snapshot collections of host segments. We perform an experiment using server logs from a larger enterprise network to resolve hosts between snapshots taken 15 minutes apart and one day apart; in the worst case, there were only 4 false positives and 24 false negatives among nearly 12K correctly labeled hosts.

1. Introduction

Properly resolving entities in the cyber domain is a necessary condition for enumerating and identifying elements of the cyber terrain. An enduring assumption in cybersecurity research is that if we could obtain full visibility into the network and efficiently sift through the resulting data, we would be able to identify evidence of unexpected or unwanted activity. Full visibility requires a wide array of sensors deployed all over the network and spanning the physical and logical layers of the system. Data from different sensors can only be used in concert if we are able to associate occurrences of common events or entities.

2. Entity Resolution

Entity Resolution is the practice of finding and linking records of the same underlying entity across data sets. This problem is widely recognized and actively researched in other domains such as Homeland Security and epidemiology but has been less formally acknowledged in cybersecurity. It is a problem of considerable importance because without ER, the number of entities is artificially magnified. In cybersecurity, this translates to a perceived increase of the attack surface. More importantly, there is no way to associate observations made at different times and places on different sensors, limiting the available evidence to a strictly local set of observations. Entity Resolution is what enables multi-modal analysis of disparate data sources.

The contributions of this paper are:

  • A formal problem statement for ER in the cyber domain.
  • An approach to incorporate ER into a larger security process.
  • A preliminary implementation to demonstrate the blocking concepts of host segments.
  • A roadmap for refining the techniques for future deployment.

3. The Entity Resolution Problem

Entity Resolution is the process of disambiguating and linking data records to real-world individuals or entities. In the intelligence community, for example, tracking a potential threat actor requires identifying aliases used by the individual in the real and online worlds (e.g., passports and usernames), and attributing the activities and assets assumed under these aliases back to the individual (e.g., messages, bank accounts, and cell phones). Because names are often not unique, ER must also be able to distinguish observations of multiple individuals with similar or identical names. Finally, real-world applications of ER must be able to tolerate clerical errors and missing data.

Similarly, in the cyber domain, a host may manifest in various logs identified by its hostname, IP address, MAC address, or another form of identification (in this work, we assume that the sensors generating the logs do not have credentialed access to every host). Furthermore, these identities are not fixed. When a host connects to a network, it likely gets a dynamically assigned IP address different from previous connections. IP addresses are reused by different hosts. Host names and MAC addresses can also change for legitimate reasons, albeit less frequently. It is not uncommon to have collisions of host names and MAC addresses on a network. The volume of records along with the rate of change makes ER in the cyber domain particularly challenging. In summary, the problem of ER in the cyber domain is to associate a set of timestamped events extracted from different data streams to the real-world entities that generated them.

4. Background and Related Work

To maintain situation awareness of their assets, network owners often run scanners such as Nmap to enumerate active hosts and their services. This provides a snapshot of the network at a particular point in time that is useful for static operating environments where hosts have fixed, pre-allocated IP addresses, rarely disconnect from the network, are never shared among users, and have no aliases. Additionally, many enterprise networks deploy Endpoint Security and Systems Management (ESSM) software such as Tanium on hosts that act like trackers. They provide reliable host information, but they typically only support the most popular operating systems, missing many variants and embedded systems.

When dealing with a dynamic operating environment, however, it is impractical to run scanners and agents continuously at a high rate to achieve good coverage over time. Hosts in this environment often connect and disconnect from the network, connect via different interfaces (e.g., Ethernet, Wi-Fi), have shared interfaces (e.g., NAT, Ethernet dongles), have dynamically assigned IP addresses, are shared among multiple users, and carry various aliases. This creates an added layer of complexity, as the same host could be represented by multiple identifiers. Thus, Entity Resolution must be applied for data cleaning, a prerequisite of data mining, to correctly de-duplicate and correlate hosts across multiple data streams.

Entity Resolution techniques must be able to overcome the challenges of working with raw cyber data, namely: large quantity (i.e., millions of events per day generated by a typical medium-sized enterprise network), lack of common fields among data sources, and no safe assumption about the rate of change for host identifiers (unlike people or organizations). Recent advances in ER methods have attempted to overcome these challenges by applying comprehensive, data-agnostic approaches to host enumeration.

4.1. Entities Based on a Time Attribute 

ER strategies fall into two categories: deterministic (matching specific fields) and probabilistic (using secondary fields as evidence and building a probabilistic model for assigning a match). One example from prior work applies a deterministic approach to ER of network devices using deterministic graph features (in order to resolve network paths between entities). Deterministic approaches bucket records into two mutually exclusive groups: ‘matched’ if linkage fields agree, and ‘unmatched’ otherwise. Although it is the simpler and more accurate of the two approaches, the deterministic approach is typically quite sensitive to match rules and fails to capture matches for linkage fields that only partially agree. Probabilistic approaches, in contrast, allow for fuzzy matching of linkage fields that partially agree, based on a particular threshold. A seminal paper on probabilistic ER provides a helpful contrast of deterministic and probabilistic record linkage approaches using lists of names for epidemiology. Subsequent work improves upon these results by relaxing two assumptions: (1) individual fields common to both datasets are completely observed, and (2) the field agreement indicators are conditionally independent within the subsets of record pairs.

Related work presents a probabilistic ER technique using Similarity Flooding that extends previous work on deterministic rule-based ER to resolve duplicate entities, applied to network path estimation. Our strategy combines both deterministic (e.g., construction of segments for reducing the dimensionality of the data) and probabilistic (e.g., snapshot-based and segment-based enrichment to extract features to be used in machine learning classification tasks) approaches to ER in an effort to capture all possible record matches.
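To make the contrast between the two families concrete, the following is a minimal sketch, not the method of any work discussed here. The deterministic rule requires exact agreement on every linkage field, while the probabilistic rule sums per-field log-likelihood weights in the style of Fellegi-Sunter match weights and compares the total to a threshold. The field names, the m/u probabilities, and the threshold are assumptions chosen only for illustration.

```python
import math

# Hypothetical per-field probabilities, for illustration only:
# m = P(field agrees | same host), u = P(field agrees | different hosts).
FIELD_PROBS = {
    "mac": (0.95, 0.001),
    "hostname": (0.90, 0.01),
    "os": (0.98, 0.30),
}


def deterministic_match(a, b, fields=("mac", "hostname")):
    """Match only if every linkage field is present and agrees exactly."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)


def probabilistic_score(a, b, probs=FIELD_PROBS):
    """Sum of log2 likelihood ratios; fields missing on either side are skipped."""
    score = 0.0
    for name, (m, u) in probs.items():
        va, vb = a.get(name), b.get(name)
        if va is None or vb is None:
            continue  # missing data contributes no evidence either way
        if va == vb:
            score += math.log2(m / u)              # agreement raises the score
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return score


def probabilistic_match(a, b, threshold=5.0):
    """Declare a match when the accumulated weight reaches the (assumed) threshold."""
    return probabilistic_score(a, b) >= threshold


rec_a = {"mac": "aa:bb:cc:00:00:01", "hostname": "lab-pc-7", "os": "Windows 10"}
rec_b = {"mac": "aa:bb:cc:00:00:01", "hostname": None, "os": "Windows 10"}
rec_c = {"mac": "aa:bb:cc:00:00:02", "hostname": "lab-pc-9", "os": "Windows 10"}

print(deterministic_match(rec_a, rec_b))  # hostname is missing, so the rule fails
print(probabilistic_match(rec_a, rec_b))  # MAC and OS agreement is enough evidence
print(probabilistic_match(rec_a, rec_c))  # MAC and hostname disagree
```

In this toy case the missing hostname defeats the deterministic rule, whereas the probabilistic score still accumulates enough evidence from the remaining fields, which is the sensitivity to partially observed linkage fields described above.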

5. Data for Host Entity Resolution

The cyber domain is characterized by many disparate sources of information about real-world hosts. When selecting data sources for host cataloging and ER, it is useful to evaluate a data source’s utility by four criteria:

  • Authenticity – How truthful is the information?
  • Quality – How easy is it to extract the information? Is there a lot of noise?
  • Coverage – What fraction of the population does the data source cover? What fraction of time?
  • Cost – Is the data readily accessible? Does it require deploying new sensors? Does it have side effects on the network’s operation?

There are several dimensions across which the available data sources can complement one another. The most apparent is that different data sources contain different attributes or field types. Different sensors generating data streams target hosts of different operating systems, hardware platforms, and locations. Data streams are also collected at different characteristic intervals. Data streams collected from multiple sensors can therefore complement one another, providing a more continuous and complete view of the host population over time.

Different sources of data are reliable over different time intervals, and the reliability of individual fields can vary within a single data source. Combining data sources is challenging because a useful overlap must have a matching field type, time interval, and visibility to the same host. The number of fields in a data source extracted from a log file can get very large, while individual records are sparsely populated. Even within a single log file, there are often different types of events reported, each with a different subset of fields. Overcoming these challenges to provide an accurate timeline of user and host activities is the primary goal of this research.

6. Approach

We have developed a prototype system to catalog unique hosts (including virtual machines) on a network and build a timeline of network events associated with each host. We begin by describing how we reduce the dimensionality of the raw data collected. In contrast to ER in medical, intelligence, or law enforcement communities, equating an entity in one record to a corresponding entity in another record typically does not have a lot of support from overlapping values in the data. We experiment with different mechanisms for combining evidence from multiple records to bolster the support.

We define a host segment as a coalesced timeline of log events describing a network connection session. A host segment is defined by four variables (tstart, tend, IP, MAC). The values of these variables are pulled from different data sources. Ideally, tstart and tend are extracted from special data sources that directly indicate the initiation and termination of a network session.

The challenge of ER lies in aligning different network connection sessions (host segments) to the same underlying host in the physical world. We make extensive use of the fields in records that fall into a host segment’s time interval to provide support for ER between host segments. Primarily, a record is associated to a segment if it contains that segment’s IP and/or MAC. Evidence for ER comes from coincidence of additional fields found in all records associated with a segment.
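The host segment and its association rule can be sketched as a small data structure. This is an illustrative sketch under stated assumptions, not the prototype's actual code: the class name HostSegment, the evidence dictionary, and the record layout (a dict with a "ts" timestamp plus optional "ip", "mac", and enrichment fields) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HostSegment:
    """A network connection session described by (t_start, t_end, IP, MAC)."""
    t_start: float
    t_end: float
    ip: Optional[str]
    mac: Optional[str]
    # Hypothetical container: enrichment field name -> set of values observed.
    evidence: dict = field(default_factory=dict)

    def is_associated(self, record: dict) -> bool:
        """A record belongs to the segment if it falls in the time interval
        and contains the segment's IP and/or MAC."""
        in_window = self.t_start <= record["ts"] <= self.t_end
        same_ip = self.ip is not None and record.get("ip") == self.ip
        same_mac = self.mac is not None and record.get("mac") == self.mac
        return in_window and (same_ip or same_mac)

    def absorb(self, record: dict) -> bool:
        """Store the record's additional fields as evidence; return False if unrelated."""
        if not self.is_associated(record):
            return False
        for name, value in record.items():
            if name not in ("ts", "ip", "mac") and value is not None:
                self.evidence.setdefault(name, set()).add(value)
        return True


def coincident_fields(a: HostSegment, b: HostSegment) -> dict:
    """Enrichment fields on which two segments share at least one value."""
    shared = {}
    for name in a.evidence.keys() & b.evidence.keys():
        common = a.evidence[name] & b.evidence[name]
        if common:
            shared[name] = common
    return shared


morning = HostSegment(100.0, 200.0, "10.0.0.5", "aa:bb:cc:00:00:01")
evening = HostSegment(900.0, 950.0, "10.0.0.9", "aa:bb:cc:00:00:01")
morning.absorb({"ts": 150.0, "ip": "10.0.0.5", "hostname": "lab-pc-7", "user": "alice"})
evening.absorb({"ts": 910.0, "mac": "aa:bb:cc:00:00:01", "hostname": "lab-pc-7"})
print(coincident_fields(morning, evening))
```

Here the two segments have different IP addresses, and the shared hostname found in their associated records is the kind of coincidence that would support resolving them to the same host.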

6.1. Pipeline

We have designed a pipeline to guide the development of the cyber ER prototype. We will refer to a single entry in a log file as an event or a record. The pipeline takes events as input and produces timelines of unique hosts as output. The pipeline consists of seven stages operating on a continuous stream of events. Initially, new events are ingested and parsed; useful attributes are extracted and stored into a database.

Segments are constructed by chaining events given the following constraints:

  • A segment consists of one or more events with identical IP and MAC.
  • The first event is a connect event, or a poll event indicating the network interface is “live.”
  • The last event is a disconnect event or a poll event.
  • No two segments with the same IP and/or MAC can overlap in time.
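The chaining constraints above can be sketched as a single pass over time-ordered events. This is a simplified illustration, not the prototype's implementation. It assumes each event is a dict with hypothetical keys "ts", "ip", "mac", and "kind" (one of "connect", "poll", or "disconnect"), and it makes two assumptions of its own: a repeated connect for an already open (IP, MAC) pair starts a new segment, and an open segment is closed at its last event when a conflicting IP or MAC appears.

```python
def build_segments(events):
    """Chain events into segments keyed by identical (IP, MAC)."""
    open_segs = {}  # (ip, mac) -> segment still under construction
    closed = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        key = (ev["ip"], ev["mac"])
        # Non-overlap: close open segments that share this IP or this MAC
        # but belong to a different (IP, MAC) pair.
        conflicts = [k for k in open_segs
                     if k != key and (k[0] == key[0] or k[1] == key[1])]
        for other in conflicts:
            closed.append(open_segs.pop(other))
        # Assumption: a new connect for an open pair ends the previous session.
        if key in open_segs and ev["kind"] == "connect":
            closed.append(open_segs.pop(key))
        seg = open_segs.get(key)
        if seg is None:
            if ev["kind"] == "disconnect":
                continue  # a segment must begin with a connect or poll event
            seg = {"ip": ev["ip"], "mac": ev["mac"],
                   "t_start": ev["ts"], "t_end": ev["ts"], "events": 0}
            open_segs[key] = seg
        seg["t_end"] = ev["ts"]
        seg["events"] += 1
        if ev["kind"] == "disconnect":
            closed.append(open_segs.pop(key))
    closed.extend(open_segs.values())  # segments still live at end of input
    return sorted(closed, key=lambda s: s["t_start"])


events = [
    {"ts": 0, "ip": "10.0.0.5", "mac": "m1", "kind": "connect"},
    {"ts": 10, "ip": "10.0.0.5", "mac": "m1", "kind": "poll"},
    {"ts": 20, "ip": "10.0.0.5", "mac": "m1", "kind": "disconnect"},
    {"ts": 30, "ip": "10.0.0.5", "mac": "m2", "kind": "poll"},        # IP reused
    {"ts": 40, "ip": "10.0.0.6", "mac": "m2", "kind": "poll"},        # MAC moves
    {"ts": 50, "ip": "10.0.0.7", "mac": "m3", "kind": "disconnect"},  # ignored
]
segments = build_segments(events)
for seg in segments:
    print(seg)
```

The sketch does not verify that a segment closed by a conflict ends on a poll or disconnect event; a fuller implementation would need to decide how to treat a session for which only a connect event was seen.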

6.2. Mechanism for Combining Evidence

Given that cyber events happen at a specific time, whereas host segments are defined by a time interval, there are two strategies that could be used to enrich the segments with additional event data. The first is to sample the collection of host segments at discrete intervals, called snapshots, and estimate the host state variables at the time of these snapshots. The second is to integrate or aggregate all events that occur within the segment interval and enrich the host with all this information. In both cases, events can be used to enrich a segment if they occur within the segment’s time interval and share a key with the segment, either the IP or MAC address.
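The second strategy, aggregating every event that falls inside the segment interval, might look like the following sketch. The dict-based segment and event layouts and the function names are assumptions made for illustration. Value counts are kept so that a later step could weigh frequently observed values more heavily.

```python
from collections import Counter, defaultdict

KEY_FIELDS = ("ts", "ip", "mac")


def shares_key(segment, event):
    """True if the event carries the segment's IP or MAC address."""
    same_ip = event.get("ip") is not None and event.get("ip") == segment["ip"]
    same_mac = event.get("mac") is not None and event.get("mac") == segment["mac"]
    return same_ip or same_mac


def enrich_by_aggregation(segment, events):
    """Count every enrichment value seen inside the segment's time interval."""
    features = defaultdict(Counter)
    for ev in events:
        if not (segment["t_start"] <= ev["ts"] <= segment["t_end"]):
            continue  # outside the segment interval
        if not shares_key(segment, ev):
            continue  # belongs to some other host segment
        for name, value in ev.items():
            if name not in KEY_FIELDS and value is not None:
                features[name][value] += 1
    return dict(features)


segment = {"t_start": 100, "t_end": 200, "ip": "10.0.0.5", "mac": "aa:bb:cc:00:00:01"}
events = [
    {"ts": 110, "ip": "10.0.0.5", "user": "alice"},
    {"ts": 120, "mac": "aa:bb:cc:00:00:01", "hostname": "lab-pc-7"},
    {"ts": 130, "ip": "10.0.0.5", "user": "alice"},
    {"ts": 140, "ip": "10.0.0.5", "user": "bob"},
    {"ts": 300, "ip": "10.0.0.5", "user": "carol"},  # after the segment ended
    {"ts": 150, "ip": "10.0.0.8", "user": "dave"},   # different key
]
enriched = enrich_by_aggregation(segment, events)
print(enriched)
```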

Snapshot-based Enrichment: A snapshot represents an instantaneous state of the network, consisting of a set of active segments and their features at snapshot time.
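A snapshot can be sketched as a filter over segments followed by a state estimate for each active segment. In this illustrative sketch, the state of a field at snapshot time is assumed to be its most recent value observed between the segment start and the snapshot time; the source does not specify the estimator, so that choice, along with the function and field names, is an assumption.

```python
def take_snapshot(segments, events, t_snap):
    """Return one record per segment active at t_snap, with its estimated state."""
    ordered = sorted(events, key=lambda e: e["ts"])
    snapshot = []
    for seg in segments:
        if not (seg["t_start"] <= t_snap <= seg["t_end"]):
            continue  # segment is not active at snapshot time
        state = {"ip": seg["ip"], "mac": seg["mac"], "snapshot_time": t_snap}
        for ev in ordered:
            if not (seg["t_start"] <= ev["ts"] <= t_snap):
                continue  # ignore events before the segment or after the snapshot
            same_ip = ev.get("ip") is not None and ev.get("ip") == seg["ip"]
            same_mac = ev.get("mac") is not None and ev.get("mac") == seg["mac"]
            if not (same_ip or same_mac):
                continue
            for name, value in ev.items():
                if name not in ("ts", "ip", "mac") and value is not None:
                    state[name] = value  # later events overwrite earlier ones
        snapshot.append(state)
    return snapshot


segments = [
    {"t_start": 100, "t_end": 200, "ip": "10.0.0.5", "mac": "m1"},
    {"t_start": 300, "t_end": 400, "ip": "10.0.0.6", "mac": "m2"},
]
events = [
    {"ts": 110, "ip": "10.0.0.5", "user": "alice", "os": "Windows 10"},
    {"ts": 140, "ip": "10.0.0.5", "user": "bob"},
    {"ts": 160, "ip": "10.0.0.5", "user": "carol"},  # after the snapshot time
]
snap = take_snapshot(segments, events, t_snap=150)
print(snap)
```

Unlike aggregation over the whole interval, the snapshot keeps a single value per field, so two snapshots taken at different times can be compared record by record.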

7. Results

From the perspective of a cyber defense analyst, a key capability enabled by this ER research is the ability to have access to a timeline of host segments on a network in the form of a searchable repository. We have implemented a prototype of such a repository, called BitBook, that continuously ingests available data streams. The tool is written with a Python back-end to handle log ingestion, segment construction, and snapshot construction. The records are then stored in an SQL-compliant database that is accessible by the front-end web application.

7.1. BitBook

To illustrate the tool and its performance, we consider two experiments:

(1) BitBook’s ability to identify multiple host segments from data logged by Zeek, a widely used monitoring tool, during a cyber exercise, as compared with an alternative blocking technique, Node2Bits, and

(2) A proof-of-concept of end-to-end ER on hosts across two snapshots using server logs from an enterprise network.

7.2. NCCDC Data Experiments

We consider four Zeek logs from day two of the National Collegiate Cyber Defense Competition Championship (NCCDC) in 2017. In the NCCDC competition, each team was provided a core network consisting of 7 servers and 6 workstations with various operating systems. For our evaluation, we chose to compare traffic from 0930-1030 that is recorded in the Zeek logs as occurring Sept. 10, 2019. Unfortunately, we do not have uniquely identifiable ground truth data for the exercise, which precludes us from using it for a complete ER pipeline. Additionally, it did not appear that Kerberos was the primary authentication protocol, meaning that very few rows of that log file were relevant for host segment enrichment with username/IP information.
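As an illustration of the ingestion step for this kind of data, the sketch below reads Zeek's tab-separated ASCII log format, in which lines beginning with "#" carry metadata and the "#fields" line names the columns. It is a simplified reader written for this example, not BitBook's parser: it assumes the default tab separator, and the sample rows and the subset of column names are made up for illustration.

```python
def parse_zeek_log(lines):
    """Yield one dict per data row of a tab-separated Zeek ASCII log."""
    fields = None
    unset, empty = "-", "(empty)"  # Zeek's default placeholder strings
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split("\t")
            if parts[0] == "#fields":
                fields = parts[1:]
            elif parts[0] == "#unset_field" and len(parts) > 1:
                unset = parts[1]
            elif parts[0] == "#empty_field" and len(parts) > 1:
                empty = parts[1]
            continue  # other metadata lines are ignored in this sketch
        if fields is None:
            raise ValueError("data row encountered before the #fields header")
        values = line.split("\t")
        yield {name: (None if value in (unset, empty) else value)
               for name, value in zip(fields, values)}


# Made-up sample in the shape of a Zeek log; not taken from the NCCDC data.
SAMPLE = "\n".join([
    "#separator \\x09",
    "#unset_field\t-",
    "#fields\tts\tuid\tid.orig_h\tclient",
    "#types\ttime\tstring\taddr\tstring",
    "1000.000000\tCabc123\t10.0.0.5\talice/EXAMPLE.LOCAL",
    "1060.000000\tCdef456\t10.0.0.6\t-",
])

rows = list(parse_zeek_log(SAMPLE.splitlines()))
for row in rows:
    print(row)
```

Rows whose username column is unset, like the second sample row, carry no username/IP pairing and so contribute nothing to that form of enrichment.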

7.3. Node2Bits

We found that Node2Bits, though promising in its generality, did not easily extend to discovering unique hosts from the log data over the same period as was used for BitBook.

7.4. Machine Learning

The machine learning algorithm that we used to classify host record pairs in this initial experiment was the Generic Entity Resolution Algorithm. Each record consisted of the segment IP, MAC, start time, and end time, enriched with hostname, user, operating system, and location. In the short-span experiment, two additional features were mistakenly left in the feature set: a unique identifier field for each record and the snapshot time of the record (which was the same for all records in a snapshot).
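One way to keep such fields out of the feature set is to build the pairwise comparison vector from an explicit list of fields and to use the unique identifier only as the training label. The sketch below illustrates that idea; it is not the Generic Entity Resolution Algorithm, it omits the start and end time features, and the field names, record layout, and binary agreement encoding are assumptions.

```python
from itertools import product

# Only these fields are compared; names are hypothetical.
COMPARE_FIELDS = ("ip", "mac", "hostname", "user", "os", "location")
# Fields that must never reach the classifier as features.
LEAKY_FIELDS = ("uid", "snapshot_time")


def pair_features(a, b):
    """Binary agreement vector; a missing value counts as disagreement."""
    return [int(a.get(f) is not None and a.get(f) == b.get(f))
            for f in COMPARE_FIELDS]


def build_pairs(snap_a, snap_b):
    """Yield (features, label) for every cross-snapshot record pair.
    The uid serves only as ground truth for the label."""
    for a, b in product(snap_a, snap_b):
        yield pair_features(a, b), int(a["uid"] == b["uid"])


snap_a = [
    {"uid": "h1", "snapshot_time": 0, "ip": "10.0.0.5", "mac": "m1",
     "hostname": "lab-pc-7", "user": "alice", "os": "Windows 10", "location": "B1"},
    {"uid": "h2", "snapshot_time": 0, "ip": "10.0.0.6", "mac": "m2",
     "hostname": "lab-pc-8", "user": "bob", "os": "Ubuntu 20.04", "location": "B1"},
]
snap_b = [
    {"uid": "h1", "snapshot_time": 900, "ip": "10.0.0.9", "mac": "m1",
     "hostname": "lab-pc-7", "user": "alice", "os": "Windows 10", "location": "B1"},
]

pairs = list(build_pairs(snap_a, snap_b))
for features, label in pairs:
    print(features, label)
```

Because every record in one snapshot is paired with every record in the other, the number of comparisons grows with the product of the two snapshot sizes.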

While our preliminary results are quite promising and demonstrate that constructing a timeline of host activities may realistically be approached with ER technology, we do not wish to overstate the significance of these results. It is important to recognize that the data used in these experiments were selected to include hosts for which there existed clean endpoint management records, to circumvent the necessity to manually label hosts with unique IDs. This has resulted in a collection of hosts that is more homogeneous and has a more stable collection of enrichment properties than would generally be found in a complete network environment. This ER methodology should be tested against an ecosystem of hosts that includes servers, routers, embedded computing infrastructure, mobile platforms, and a variety of other host types.

8. Future Work

Having demonstrated the utility of organizing disparate log data into host segments and snapshots, as well as performing ER between the snapshots, the next step should be to perform ER between the host segments. This would hopefully ameliorate the long run times due to the complexity of comparisons in the supervised learning problem. Additionally, the machine-learning step should be tightly integrated with BitBook (or other ER tool) to supplement the snapshot information with comparisons to a running catalog of discovered devices, gradually enhancing cyber situation awareness the longer the tool is employed. This would make it easier for an analyst to confidently use the tool in connection with other cybersecurity tools, such as alerting tools (e.g., Snort or Suricata) or vulnerability scanners.

9. Conclusion

We have presented a cyber ER approach to discover and resolve hosts on an enterprise network given access to either server logs or network logs produced by Zeek. We use a blocking technique that defines “host segments” as intermediate entities; these explicitly encode relevant features for cyber situation awareness within an active time interval. We also introduced an intermediate entity for collections of host segments at discrete points in time. We additionally attempted to use Node2Bits to automatically discover hosts on the same dataset by way of comparison but were unable to produce meaningful results. We presented an end-to-end ER experiment for two snapshots separated by 15 minutes and by one day, using the generalized ER machine learning approach to categorize the hosts. This was highly successful, returning only 4 false positives and 24 false negatives out of about 12,000 correctly labeled hosts, although the runtime required to do the training was prohibitive. Future work should consider ER on the host segments themselves and embedding this step into a prototype tool for end-to-end ER.

Acknowledgments

The authors would like to thank Ethan Aubin, Chris Moir, David O’Gwynn, and Jeremy Mineweaser. DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Department of the Army under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of the Army.

