The National Institute of Standards and Technology Software Assurance Metrics and Tool Evaluation team conducts research in static analysis tools that find security-relevant weaknesses in source code. This article discusses our experiences with Static Analysis Tool Expositions (SATEs) and how we are using that experience to plan SATE VI. Specifically, we address challenges in the development of adequate test cases, the metrics to evaluate tool performance, and the interplay between the test cases and the metrics. SATE V used three types of test cases directed towards realism, statistical significance, and ground truth. SATE VI will use a different approach for producing test cases to get us closer to our goals.
Software assurance is a set of methods and processes to prevent, mitigate, or remove vulnerabilities and ensure that the software functions as intended. Multiple techniques and tools should be used for software assurance. One technique that has grown in acceptance is static analysis, which examines software for weaknesses without executing it. The National Institute of Standards and Technology (NIST) Software Assurance Metrics and Tool Evaluation (SAMATE) project has organized five Static Analysis Tool Expositions (SATEs), designed to advance research in static analysis tools that find security-relevant weaknesses in source code. An analysis of SATE V in preparation for the upcoming SATE VI is reported here.
We first discuss our experiences with SATE V, including the selection of test cases, how to analyze the warnings from static analysis tools, and our results. Three selection criteria for the test cases were used: 1) code realism, 2) statistical significance, and 3) knowledge of the weakness locations in code (ground truth). SATE V used test cases satisfying any two out of the three criteria: 1) production test cases with real code and statistical significance, 2) CVE-selected test cases, with real code and ground truth, and 3) synthetic test cases with ground truth and statistical significance. We describe metrics that can be used for evaluating tool effectiveness. Metrics, such as precision, recall, discrimination, coverage and overlap, are discussed in the context of the three types of test cases.
Although our results from the different types of test cases in SATE V bring different perspectives on static analysis tool performance, this article shows that combining such perspectives does not adequately describe real-world use of such tools. Therefore, in SATE VI, we plan to produce test cases incorporating all three criteria, so the results will better reflect real-world use of tools. We discuss the approach we will use: injecting a large number of known, realistic vulnerabilities into real production software. Thus, we will have statistical significance, real code, and ground truth.
Providing metrics and large amounts of test material to help address the need for static analysis tool evaluation is a goal of the National Institute of Standards and Technology (NIST) Software Assurance Metrics and Tool Evaluation (SAMATE) project’s Static Analysis Tool Exposition (SATE). Starting in 2008, we have conducted five SATEs.
SATE, as well as this article, is focused on static analysis tools that find security-relevant weaknesses in source code. These weaknesses, unless avoided or removed early, could lead to security vulnerabilities in the executable software.
SATE is designed for sharing, rather than competing, to advance research in static analysis tools. Briefly, a team led by NIST researchers provides a test set to toolmakers and invites them to run their tools and return the tool outputs to us. We then perform partial analysis of tool outputs. Participating toolmakers and organizers share their experiences and observations at a workshop.
The first SATE used open source, production programs as test cases. We learned that not knowing the locations of weaknesses in the programs complicates the analysis task. Over the years, we added other types of test cases.
One type, CVE-selected test cases, is based on the Common Vulnerabilities and Exposures (CVE), a database of publicly reported security vulnerabilities. The CVE-selected test cases are pairs of programs: an older bad version with publicly reported vulnerabilities (CVEs) and a good version, that is, a newer version where the CVEs were fixed. For the CVE-selected test cases, we focused on tool warnings that correspond to the CVEs.
A different approach is computer-assisted generation of test cases. In SATE IV and V, we used the Juliet test suite, which contains tens of thousands of synthetic test cases with precisely characterized weaknesses. This makes tool warnings amenable to mechanical analysis. As with the CVE-selected test cases, each test case has both a bad version (code that should contain a weakness) and a good version (code that should not contain any weakness).
Initially, we had two language tracks: C/C++ and Java. We added the PHP track for SATE IV. In SATE V, we introduced the Ockham Criteria to exhibit sound static analysis tools. Table 1 presents toolmaker participation over the years. The PHP track and the Ockham Criteria had one participant each in SATE V. Note that because SATE analyses grew in complexity and length, we changed from yearly SATEs (2008, 2009, and 2010) to the current nomenclature (IV, V, and VI).
Table 1: Number of tools participating per track over SATEs
Software weaknesses can lead to vulnerabilities, which can be exploited by hackers. Definition and classification of security weaknesses in software is necessary to communicate and analyze tool findings. While many classifications have been proposed, Common Weakness Enumeration (CWE) is the most prominent effort [6, 7]. The Common Vulnerabilities and Exposures (CVE) database, comprised of publicly reported security vulnerabilities, was discussed in the Background section. While the CVE database includes specific vulnerabilities in production software, the CWE classification system lists software weakness types, providing a common nomenclature for describing the type and functionality of CVEs to the IT and security communities.
For example, CVE-2009-2559 is a buffer overflow vulnerability in Wireshark, which can be used by hackers to cause denial of service (DoS). CVE-2009-2559 is associated with two CWEs: CWE-126: Buffer Over-read, which is caused by CWE-834: Excessive Iteration. The NIST National Vulnerability Database (NVD) described it using CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer [11, 12], which is a parent of CWE-126. We describe our use of CVEs and CWEs in our Methodology section.
Researchers have collected test suites and evaluated static analysis tools. Far from attempting a comprehensive review, we list some of the relevant studies here.
Kratkiewicz and Lippmann developed a comprehensive taxonomy of buffer overflows and created 291 test cases – small C programs – to evaluate tools for detecting buffer overflows. Each test case has three vulnerable versions with buffer overflows just outside, moderately outside, and far outside the buffer, and a fourth, fixed, version. Their taxonomy lists different attributes, or code complexities, including aliasing, control flow, and loops, which may complicate analysis by the tools.
The largest synthetic test suite in the NIST Software Assurance Reference Dataset (SARD) was created by the U.S. National Security Agency’s (NSA) Center for Assured Software (CAS). Juliet 1.0 consists of about 60 000 synthetic test cases, covering 177 CWEs and a wide range of code complexities. CAS ran nine tools on the test suite and found that static analysis tools differed significantly with respect to precision and recall. Also, tools’ precision and recall ordering varied for different weaknesses. CAS concluded that sophisticated use of multiple tools would increase the rate of finding weaknesses and decrease the false positive rate. A newer version of the test suite, Juliet 1.2, correcting several errors and covering a wider range of CWEs and code constructs, was used in SATE V.
Rutar et al. ran five static analysis tools on five open source Java programs, including Apache Tomcat, of varying size and functionality. Because the tools produced so many warnings, they did not categorize every false positive and false negative reported by the tools. Instead, the tool outputs were cross-checked with each other. Additionally, a subset of warnings was examined manually. One of the conclusions by Rutar et al. was that there was little overlap among warnings from different tools. Another conclusion was that a meta-tool combining and cross-referencing output from multiple tools could be used to prioritize warnings.
Kupsch and Miller evaluated the effectiveness of static analysis tools by comparing their results with the results of an in-depth manual vulnerability assessment. Of the vulnerabilities found by manual assessment, the tools found simple implementation bugs, but did not find any of the vulnerabilities requiring a deep understanding of the code or design.
Developing test cases is difficult, and many approaches have been tried. Zhen Li et al. developed VulPecker, an automated vulnerability detection system, based on code similarity analysis. Their recent study focused on the creation of a Vulnerability Patch Database (VPD), comprised of over 1700 CVEs from nineteen C/C++ open source software products. Their CVE-IDs are mapped to diff hunks, which are small files tracking the location of a given weakness and changes in source code across versions.
Instead of extracting CVEs from programs, some studies have looked at injecting vulnerabilities for static analysis tool studies. The Intelligence Advanced Research Projects Activity (IARPA) developed the Securely Taking On New Executable Software of Uncertain Provenance (STONESOUP) program to inject realistic bugs into production software. The injected vulnerabilities were embedded in real control flow and data flow. These seeded vulnerabilities were snippets of code showcasing a specific vulnerability. However, these embedded snippets were unrelated to the original source program, limiting realism in injected weaknesses. These test cases can be downloaded from the SARD.
In preparation for SATE VI, the SATE team looked extensively at related approaches. One important project was from the MIT Lincoln Laboratory, which developed a large-scale automated vulnerability addition (LAVA) technique to automatically inject bugs into real programs. The program uses a “taint analysis-based technique” to dynamically identify sites that can potentially hold a bug, and user-controlled data that can be used at those vulnerable locations to trigger the weakness. Thus, the triggering input and the vulnerability are both known. LAVA can inject thousands of bugs in minutes. However, the tool alters the program data flow and only supports a small subset of CWE classes related to buffer overflow, which limits the realism of the injected weaknesses.
Another automated bug insertion technique is EvilCoder, developed by the Horst Görtz Institut, Germany. Using a static approach, EvilCoder computes code property graphs from C/C++ programs to create a graph database, containing information about types, control flows and data flows. The program identifies paths that could be vulnerable, but are currently safe. Bug insertion is accomplished by breaking or removing security checks, making a path insecure. The limitation of this static analysis-based approach is that it does not produce triggering inputs to demonstrate the injected bugs.
II Test cases
Tool users want to understand how effective tools are in finding weaknesses in source code. Based on our SATE experiences, a perfect test case satisfies three criteria.
First, for tool results to be generally applicable, test cases should be representative of real, existing software. In other words, they should be similar in complexity to real software.
Second, for tool results to be statistically significant, the test cases must contain many different weakness instances of various weakness types. Since CWE has hundreds of weakness classes and the weaknesses can occur in a wide variety of code constructs, large numbers of test cases are needed.
Finally, to recognize tools’ blind spots, we need the ground truth – knowledge of all weakness locations in the software. In other words, without the ground truth we cannot know which weaknesses remain undetected by tools. Additionally, it greatly simplifies analysis of tool outputs by enabling mechanical matching, based on code locations and weakness types.
In summary, the three selection criteria for test cases are 1) realistic, existing code, 2) large amounts of test data to yield statistical significance, and 3) ground truth. Figure 1 illustrates these criteria. So far, we do not have test cases that satisfy all three criteria simultaneously. For SATE V, we have produced test cases satisfying any two out of the three criteria (Figure 1). We chose the following three types of test cases:
First, production software large enough for statistical significance and, by definition, representative of real software. However, the weaknesses in it are at best only partially known.
Second, a set of test cases (i.e., a test suite) mechanically generated, so that each test case contains one weakness instance embedded in a set of code complexities. We used the Juliet test suite, a diverse set of clearly identified weakness instances, for this set. This approach has ground truth and produces statistically significant results. However, the synthetic test cases may not be representative of real code.
Finally, CVE-selected test cases that contain vulnerabilities that were deemed important to be included in the CVE database. These test cases are real software and have ground truth. However, the determination of CVE locations in code is a time-consuming task, which makes it hard to achieve statistical significance.
Figure 1: Types of test cases
To measure the value of static analysis tools, we need to define metrics to decide which attributes and characteristics should be considered. For SATE analyses, we established a universal way of measuring the tools’ output objectively. The following metrics address several questions about tool performance.
First, what types of weaknesses can a tool find? Coverage is measured by the number of unique weakness types reported over the total number of weakness types included in the test set.
Second, what proportion of weaknesses can a tool find? Recall is calculated by dividing the number of correct findings (true positives) by the total number of weaknesses present in the test set, i.e., the sum of the number of true positives (TP) and the number of false negatives (FN). Recall = TP / (TP + FN)
Third, what proportion of covered flaws can a tool find? Applicable recall (App.Recall) is recall reduced to the types of weaknesses a tool can find. It is calculated by dividing the number of true positives (TP) by the number of weaknesses in the test set, which are covered by a tool. In other words, a tool’s performance is not penalized if it does not report weaknesses that it does not look for (App.FN). App.Recall = TP / (TP + App.FN)
Fourth, how much can I trust a tool? Precision is the proportion of correct warnings produced by a tool and is calculated by dividing the number of true positives by the total number of warnings. The total number of warnings is the sum of the number of true positives (TP) and the number of false positives (FP). Precision = TP / (TP + FP)
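The coverage, recall, applicable recall, and precision formulas above can be written out directly. The following is a minimal sketch, not SATE's actual tooling; the function names and the counts in the usage note are hypothetical and chosen only for illustration.

```python
def coverage(reported_types, test_set_types):
    """Share of the weakness types in the test set that the tool reported at least once."""
    test_set_types = set(test_set_types)
    return len(set(reported_types) & test_set_types) / len(test_set_types)


def recall(tp, fn):
    """Recall = TP / (TP + FN), over every weakness in the test set."""
    return tp / (tp + fn)


def applicable_recall(tp, app_fn):
    """App.Recall = TP / (TP + App.FN), counting only misses of types the tool covers."""
    return tp / (tp + app_fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP), the share of warnings that are correct."""
    return tp / (tp + fp)
```

For instance, with hypothetical counts of 30 true positives, 70 false negatives (20 of them in weakness types the tool looks for), and 10 false positives, `recall(30, 70)` is 0.3, `applicable_recall(30, 20)` is 0.6, and `precision(30, 10)` is 0.75. The gap between 0.3 and 0.6 shows how much of the missed material lies outside the tool's stated scope.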
Fifth, how smart is a tool? Bad and good code often look similar. It is useful to determine whether the tools can differentiate between the two. Although precision captures that aspect of tool efficiency, it is relevant only when good sites are prevalent over bad sites. When there is parity in the number of good and bad sites, e.g., in some synthetic test suites, a tool could indiscriminately flag both good and bad test cases as having a weakness and still achieve a precision of 50 %. Discrimination, however, recognizes a true positive on a particular bad test case only if a tool did not report a false positive on the corresponding good test case. A tool that flags every test case as flawed would achieve a discrimination rate of 0 %.
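The contrast between precision and discrimination on a balanced test set can be checked with a small calculation. This sketch assumes each synthetic test case is a bad/good pair and that the discrimination rate is taken over all bad test cases; the function name and data layout are hypothetical.

```python
def precision_and_discrimination(results):
    """Compute both metrics from per-pair tool verdicts.

    results: list of (flagged_bad, flagged_good) booleans, one tuple per
    bad/good test case pair (hypothetical layout).
    """
    tp = sum(1 for bad, _ in results if bad)    # warnings on bad versions
    fp = sum(1 for _, good in results if good)  # warnings on good versions
    # A discrimination needs a hit on the bad version and silence on the good one.
    discriminated = sum(1 for bad, good in results if bad and not good)
    precision = tp / (tp + fp) if tp + fp else 0.0
    discrimination = discriminated / len(results)
    return precision, discrimination


# A tool that flags everything: every bad and every good version gets a warning.
indiscriminate = [(True, True)] * 4
# A more selective tool: two clean discriminations, one miss, one pair flagged twice.
selective = [(True, False), (True, False), (False, False), (True, True)]
```

Here `precision_and_discrimination(indiscriminate)` gives a precision of 0.5 and a discrimination of 0.0, matching the 50 % and 0 % figures in the text, while the selective tool scores 0.75 and 0.5.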
Finally, can tool findings be confirmed by other tools? Overlap represents the proportion of weaknesses found by more than one tool. The use of independent tools would find more weaknesses (higher recall), whereas the use of similar tools would provide a better confidence in the common warnings’ accuracy.
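Overlap can be computed once tool findings have been mapped to common weakness identifiers. The sketch below assumes the denominator is the set of weaknesses found by at least one tool, which the text does not state explicitly; the tool names and weakness identifiers in the usage note are hypothetical.

```python
from collections import Counter


def overlap(findings_by_tool):
    """Proportion of found weaknesses that were reported by more than one tool.

    findings_by_tool: dict mapping a tool name to an iterable of weakness
    identifiers (hypothetical layout).
    """
    counts = Counter()
    for found in findings_by_tool.values():
        counts.update(set(found))  # each tool counts once per weakness
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)
```

With `{"A": {"w1", "w2"}, "B": {"w2", "w3"}, "C": {"w4"}}`, only `w2` is confirmed by a second tool, so the overlap is 1 out of 4, or 0.25.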
Table 2 summarizes the applicability of the metrics on the three types of test cases.
Table 2: Mapping metrics to test case types (columns: Production Software, Software w/ CVEs, Synthetic Test Cases)
Figure 1 summarizes the types of test cases. The mapping of their metrics is clearly delineated in Table 2. Production software has realism and statistical significance, but no ground truth. CVE-selected test cases have realism and ground truth, but no statistical significance. Synthetic test cases have statistical significance and ground truth, but no realism.
Precision and overlap can be calculated for production software test cases. However, due to the lack of ground truths, recall and discrimination cannot be determined, and only limited results for coverage can be obtained. In contrast, because the CVE-selected test cases are real software with ground truth, both recall and overlap can be calculated. However, because locating vulnerabilities is both difficult and time-consuming, precision cannot be determined, and limited results can be obtained for coverage and discrimination. Although these metrics are applicable to synthetic test cases (i.e., can be calculated), these cases may not generalize to real-world software.
IV Test Case Results
This section focuses on SATE V test case results from the C/C++ track. For this track, we selected two common open source programs for the production software analyses: Asterisk version 10.2.0, an IP PBX platform, and Wireshark version 1.8.0, a network traffic analyzer. Asterisk comprises over 500,000 lines of code; Wireshark contains more than 2 million lines of code. These test cases can be downloaded from the NIST Software Assurance Reference Dataset (SARD). For the CVE-selected test cases, we also asked toolmakers to run their tools on later, fixed versions of these test cases, using Asterisk version 10.12.2 and Wireshark version 1.8.7. We used the NSA CAS Juliet test set for the synthetic test cases.
Different methods were used to evaluate tool warnings depending upon the type of test case. As we discussed in Section II, synthetic test cases contain precisely characterized weaknesses. Metadata includes the locations where vulnerabilities occur, good and bad blocks of code, and CWEs. Consequently, the analysis of all warnings generated by tools is possible. For each test case, we selected a tool finding if its CWE matched the corresponding test case’s CWE group.
As pointed out in Section II, finding the locations of CVEs in pairs of good and bad code was a time-consuming process. The metadata from production software is rich enough to demonstrate whether a tool found a CVE through automatic analysis. However, because CVEs were few in number and tools did not uniformly report vulnerabilities, we also conducted manual analyses. For each CVE, we selected the tool finding reported at the corresponding lines of code, only considering the finding if its CWE and the CVE’s CWE belonged to the same CWE group. Once found, an expert would confirm whether the automated analysis was correct. In addition to extracting CVE test cases this way, our experts also manually checked the code for matches missed by the algorithm. Our experts would rate the CVEs as having been precisely identified or coincidentally (indirectly) identified.
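The automated part of this matching, by code location and CWE group, can be sketched as follows. This is an illustrative outline only: the `CWE_GROUPS` table, the class names, and the file name in the usage note are hypothetical, since the text does not list the actual SATE CWE groups or data formats.

```python
from dataclasses import dataclass

# Hypothetical grouping of CWE IDs; the real SATE CWE groups are not given in the text.
CWE_GROUPS = {119: "buffer", 121: "buffer", 126: "buffer",
              476: "pointer", 134: "format-string"}


@dataclass(frozen=True)
class Finding:
    """One tool warning."""
    file: str
    line: int
    cwe: int


@dataclass(frozen=True)
class KnownWeakness:
    """One ground-truth weakness, e.g. a located CVE."""
    file: str
    lines: frozenset
    cwe: int


def matches(finding, weakness):
    """True if the finding sits on the weakness's lines and shares its CWE group."""
    same_place = finding.file == weakness.file and finding.line in weakness.lines
    group_f = CWE_GROUPS.get(finding.cwe)
    group_w = CWE_GROUPS.get(weakness.cwe)
    return same_place and group_f is not None and group_f == group_w


def match_findings(findings, weaknesses):
    """Map each known weakness to the tool findings that match it."""
    return {w: [f for f in findings if matches(f, w)] for w in weaknesses}
```

For a weakness `KnownWeakness("dissector.c", frozenset({120, 121}), 126)`, a finding at line 121 tagged CWE-119 matches (same location, same group), while a CWE-476 finding on the same line or a CWE-126 finding in another file does not. Candidate matches produced this way would still go to an expert for confirmation, as described above.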
The analysis of production test cases was different. Analyses of tool warnings and reporting were often labor-intensive and required a high level of expertise. A simple binary true/false positive verdict on tool warnings did not provide adequate resolution to communicate the relationship of the warning to the underlying weakness. Because of the large number of tool warnings and the lack of ground truth, we randomly selected warnings from each tool report, based on the weakness category and the security rating. After sampling 879 warnings and manually reviewing their correctness, we assigned each warning to a warning category. A security warning was related to an exploitable security vulnerability. A quality warning was not directly related to security, but it required a software developer’s attention. An insignificant warning was technically true, but its claim was insignificant. A false warning corresponded to a false positive, and an unknown warning was one whose correctness could not be determined.
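One way to select warnings at random while respecting weakness category and security rating is stratified sampling. The sketch below is an assumption about how such a selection could be done, not a description of the SATE procedure; the dictionary keys `category` and `severity`, the function name, and the per-stratum quota are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(warnings, per_stratum, seed=0):
    """Draw up to per_stratum warnings from each (category, severity) group.

    warnings: list of dicts with hypothetical keys "category" and "severity".
    A fixed seed keeps the selection reproducible for later review.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for w in warnings:
        strata[(w["category"], w["severity"])].append(w)
    sample = []
    for key in sorted(strata):  # deterministic order over strata
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Sampling per stratum keeps rare categories represented in the manual review, which a uniform draw over all warnings would not guarantee.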
SATE is not a competition. To prevent endorsement of the participating toolmakers, we anonymized data. The results generated from Tools A through H are reported here.
Figure 2 shows the precision vs. discrimination tool results for the synthetic test cases. The precision results are similar across all tools, whereas discrimination results are not. This is because the number of buggy sites is similar to the number of safe sites, as is the case for synthetic and CVE-selected test cases. Thus, discrimination is a better metric to differentiate tools. Note that for real software, most sites are safe and only a small proportion of sites are buggy, so precision would be very low if a tool reports a warning for every site, flawed or not.
Figure 2: Precision vs. discrimination tool results for the Synthetic test cases – Source: Author(s)
The synthetic test cases offer an excellent demonstration of tool efficiency. Table 3 combines metric results from testing of the Juliet synthetic test suite. Tool F demonstrated the highest applicable recall and discrimination, but displayed the lowest coverage. Tool B, on the other hand, exhibited the broadest coverage and lower discrimination than that of Tool F.
Table 3: Applicable recall, coverage, and discrimination for the Synthetic test cases – Source: Author(s)
Figures 4 to 6 display the results for two metrics: recall and precision. The figures on the left provide a comparison of synthetic and CVE-selected test cases. The figures on the right provide a comparison of synthetic and production test cases. As examples, we use Tools B, H, and A to demonstrate the discrepancies between the results on different types of test cases. Recall was generally higher on synthetic test cases than on the CVE-selected test cases. However, Tool A performed better with respect to CVEs in this case. Similarly, a comparison of the precision results indicates that the tools generated fewer false positives on the synthetic test cases than on the production test cases, leading to higher precision. Lower code complexity may account for the better recall and precision on the synthetic test cases compared to the CVE-selected and production test cases.
Figure 4: Recall for Synthetic vs. CVE test cases and precision for Synthetic vs. Production test cases – Source: Author(s)
Figure 5: Recall for Synthetic vs. CVE test cases and precision for Synthetic vs. Production test cases – Source: Author(s)
Figure 6: Recall for Synthetic vs. CVE test cases and precision for Synthetic vs. Production test cases – Source: Author(s)
Our examples illustrate the differences between the three types of test cases, making generalization challenging. For the production test cases, there was no ground truth, so tool recall could not be determined. Tools mostly reported different defects, so there was low overlap. Also, the results from synthetic cases may not generalize to real-world software. Clearly, characterizing a large set of CVE-selected test cases is very time consuming, so there was not enough test data collected for statistical significance. We will discuss a different approach in the context of our next SATE, SATE VI.
Future SATE VI Plans
The lack of vulnerability corpora has always hampered researchers’ work in software assurance, because high quality test data is essential to achieve meaningful studies applicable to real-world software development. The real challenge lies not solely in having test cases at our disposal, but in having them meet specific criteria: ground truth, bug realism, and statistical significance.
Our main goal for SATE VI is to improve the quality of our test suites by producing test cases satisfying these three criteria. Time is a critical factor in the development or selection of new test cases, their use by toolmakers, and the subsequent analysis and reporting of results. CVE extraction yields real bugs; however, there are too few CVEs to showcase numerous bugs in a single version of software. Having to run tools on multiple versions of large test cases is time consuming and can be problematic for SATE.
Manual bug injection enables a greater number and diversity in real bugs, but also takes time and effort. To prepare test cases for SATE VI, our team is using a semi-automated process. For each class of weaknesses that we want to insert, the first step is to automatically identify sites that are currently safe, but could become vulnerable with manual transformation, as in EvilCoder. A site is a conceptual place in a program where an operation is performed and a weakness might occur. For example, for C programs, every buffer access is a site where a buffer overflow might occur.
The next step is to find execution paths leading to those sites. We will use guided fuzzing techniques to produce user inputs. Then, we will perform manual source code transformations, where the injected (or seeded) vulnerabilities will use the data flow and control flow of the original program. Finally, we will implement triggering and regression tests to demonstrate the injected bugs and check for conflicts between different injected bugs.
It is essential to understand that finding safe sites is much easier than finding vulnerable sites. Missing a safe site only represents the loss of one potential injected bug. To identify those sites, we must analyze our program the way a compiler does. To achieve this, we are analyzing the abstract syntax tree (AST) and extracting specific patterns. Ultimately, we want to use those sites to guide manual bug injection.
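The idea of walking an AST and extracting a pattern can be illustrated with a small stand-in. SATE VI targets C/C++ programs, and the text does not name the parser used; the sketch below instead applies the same approach to Python source with the standard-library `ast` module (Python 3.9 or later, for `ast.unparse`), treating every subscript expression as a candidate site. The function name and sample source are hypothetical.

```python
import ast


def find_subscript_sites(source):
    """Return (line, container, index) for every subscript expression in the source.

    Stand-in for site discovery: each indexed access is a place where an
    out-of-bounds access could be seeded.
    """
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if isinstance(n, ast.Subscript)]
    nodes.sort(key=lambda n: (n.lineno, n.col_offset))  # ast.walk is not in source order
    return [(n.lineno, ast.unparse(n.value), ast.unparse(n.slice)) for n in nodes]


SAMPLE = "buf = [0] * 4\nfor i in range(n):\n    buf[i] = src[i]\n"
```

Calling `find_subscript_sites(SAMPLE)` reports two sites on line 3, one for `buf[i]` and one for `src[i]`; the list literal on line 1 is not an access and is not reported. A C/C++ version would need a real C front end and would also have to decide which of the reported sites are currently safe.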
Identifying a site does not provide the input leading to it. We plan to use fuzzing tools to determine such input.
Our team will gather a set of CVEs and extract real-world insecure patterns to mimic production software vulnerabilities. Source transformations will be performed manually to reproduce common industry practices and yield realistic injected bugs. To achieve this, we will verify that the seeded vulnerabilities do not significantly alter the original data flow and control flow of the target program.
We must demonstrate that a given input leads to a real vulnerability. Manual bug injection requires much effort and high-level analysis to produce exploits. In fact, demonstrating exploitability is very challenging for static analyzers. Therefore, it is sufficient to demonstrate that our program exhibits abnormal behavior due to injected bugs. For example, an off-by-one buffer overflow will not always cause a program to crash, but it can be validated using an assert statement.
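The off-by-one case can be simulated to show why an assertion is useful as a triggering test. The sketch below models a C buffer and the byte that follows it with a Python `bytearray`; it is an illustration of the idea only, and all names and sizes are hypothetical. Note that Python drops `assert` statements when run with `-O`.

```python
BUF_SIZE = 8          # logical size of the buffer
NEIGHBOUR = BUF_SIZE  # index of the byte located right after the buffer


def fill_buffer(memory, value, checked=False):
    """Fill the buffer with value; contains a seeded off-by-one bug."""
    # Seeded bug: the bound should be BUF_SIZE, not BUF_SIZE + 1.
    for i in range(BUF_SIZE + 1):
        if checked:
            assert i < BUF_SIZE, f"write at index {i} is outside the {BUF_SIZE}-byte buffer"
        memory[i] = value


def run_demo():
    """Return (silently_corrupted, detected) for the unchecked and checked runs."""
    memory = bytearray(BUF_SIZE + 1)
    memory[NEIGHBOUR] = 0x7F
    fill_buffer(memory, 0x41)  # no crash, but the neighbouring byte is overwritten
    silently_corrupted = memory[NEIGHBOUR] == 0x41
    try:
        fill_buffer(bytearray(BUF_SIZE + 1), 0x41, checked=True)
        detected = False
    except AssertionError:
        detected = True
    return silently_corrupted, detected
```

`run_demo()` returns `(True, True)`: the unchecked run finishes normally while corrupting the adjacent byte, and the checked run stops at the first out-of-bounds write, which is the abnormal behavior a triggering test needs to observe.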
In this article, we have discussed our experiences with SATE that can be useful for the software assurance community. Specifically, the article focused on the selection of test cases and how to analyze the output warnings from tools. We described metrics that could be used for evaluating tool effectiveness. Because tools report different weaknesses, there is little overlap in results.
SATE V covered three types of test cases: 1) production test cases, which had real code and statistical significance, 2) CVE-selected test cases, which had real code and ground truth, and 3) synthetic test cases, which had both ground truth and statistical significance. Although synthetic test cases cover a broad range of weaknesses, results from such test cases cannot be generalized to real-world software, such as the production cases. CVE extraction yields real bugs in production software, but it is time-consuming and does not yield statistical significance. Finally, static analysis tools can produce a large number of warnings on production software, which is real code. However, we do not know the location of all vulnerabilities, i.e., the ground truth. Therefore, we require a better test suite, covering all three criteria for test cases.
Our main goal for future SATEs is to improve the quality of our analyses by producing test cases satisfying all three criteria. We believe inserting security-relevant vulnerabilities into real-world software can help us achieve this goal.
We learned through the study of three sophisticated and fully-automated injection techniques that the injected bugs are either insufficiently realistic [18, 20] or lack triggering inputs. Purely manual injection has the benefit of yielding more realistic bugs; however, it is time-consuming. Our team is considering a semi-automated process, speeding the discovery of potential sites, so we can perform manual source code transformations. In particular, we want to make sure that the seeded vulnerabilities do not significantly alter the data flow and control flow of the original program, and that the code follows common development practices. Since demonstrating the injected bugs is essential, we will ensure that the injected bugs trigger abnormal program behavior.
- Larsen, G., Fong, E. K. H., Wheeler, D. A., & Moorthy, R. S. (2014, July). State-of-the-art resources (SOAR) for software vulnerability detection, test, and evaluation. Institute for Defense Analyses IDA Paper P-5061. Retrieved from http://www.acq.osd.mil/se/docs/P-5061-software-soar-mobility-Final-Full-Doc-20140716.pdf
- SAMATE. (2017). Source code security analyzers (SAMATE list of static analysis tools). Retrieved from https://samate.nist.gov/index.php/Source_Code_Security_Analyzers.html
- MITRE. (2017, July 20). Common vulnerabilities and exposures. Retrieved from https://cve.mitre.org/
- Center for Assured Software, U.S. National Security Agency. (2011, December). CAS static analysis tool study – Methodology. Retrieved from http://samate.nist.gov/docs/CAS_2011_SA_Tool_Method.pdf
- Black, P. E., & Ribeiro, A. (2016, March). SATE V Ockham sound analysis criteria. NISTIR 8113. https://dx.doi.org/10.6028/NIST.IR.8113. Retrieved from http://nvlpubs.nist.gov/nistpubs/ir/2016/NIST.IR.8113.pdf
- MITRE. (2017, June 6). Common weakness enumeration: Process: Approach. Retrieved from https://cwe.mitre.org/about/process.html#approach
- MITRE. (2017, June 7). Common weakness enumeration: About CWE. Retrieved from https://cwe.mitre.org/about/index.html
- MITRE. (2017). CVE-2009-2559. Retrieved from http://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2009-2559
- MITRE. (2017, May 5). CWE-126: Buffer over-read. Retrieved from http://cwe.mitre.org/data/definitions/126.html
- MITRE. (2017, May 5). CWE-834: Excessive iteration. Retrieved from http://cwe.mitre.org/data/definitions/834.html
- MITRE. (2017). CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer. Retrieved from http://cwe.mitre.org/data/definitions/119.html
- National Vulnerability Database, National Institute of Standards and Technology. (2010, August 21). CVE-2009-2559 Detail. Retrieved from https://nvd.nist.gov/vuln/detail/CVE-2009-2559
- Kratkiewicz, K., & Lippmann, R. (2005). Using a diagnostic corpus of C programs to evaluate buffer overflow detection by static analysis tools. Proceedings of the Workshop on the Evaluation of Software Defect Detection Tools, 2005. Retrieved from https://www.ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/050610_Kratkiewicz.pdf
- SAMATE, National Institute of Standards and Technology. (2017). Software Assurance Reference Dataset. Retrieved from https://samate.nist.gov/SARD/
- Rutar, N., Almazan, C. B., & Foster, J. S. (2004). A comparison of bug finding tools for Java. Proceedings of the 15th IEEE International Symposium on Software Reliability Engineering (ISSRE’04), France, November 2004. https://dx.doi.org/10.1109/ISSRE.2004.1
- Kupsch, J. A., & Miller, B. P. (2009). Manual vs. automated vulnerability assessment: A case study. In Proceedings of the 1st International Workshop on Managing Insider Security Threats (MIST-2009), Purdue University, West Lafayette, IN, June 15-19, 2009.
- Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., & Hu, J. (2016). VulPecker: An automated vulnerability detection system based on code similarity analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pp. 201-213. https://dx.doi.org/10.1145/2991079.2991102
- De Oliveira, C., & Boland, F. (2015). Real world software assurance test suite: STONESOUP (Presentation). IEEE 27th Software Technology Conference (STC ‘2015) October 12-15, 2015.
- De Oliveira, C. D., Fong, E., & Black, P. E. (2017, February). Impact of code complexity on software analysis. NISTIR 8165. https://dx.doi.org/10.6028/NIST.IR.8165. Retrieved from http://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8165.pdf
- Dolan-Gavitt, B., Hulin, P., Kirda, E., Leek, T., Mambretti, A., Robertson, W., Ulrich, F., & Whelan, R. (2016). LAVA: Large-scale automated vulnerability addition. In Proceedings of the 2016 IEEE Symposium on Security and Privacy, pp. 110-121. https://dx.doi.org/10.1109/SP.2016.15
- Pewny, J., & Holz, T. (2016). EvilCoder: Automated bug insertion. In Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC’16), pp. 214-255. https://dx.doi.org/10.1145/2991079.2991103
- Black, P. E. (2012). Static analyzers: Seat belts for your code. IEEE Security & Privacy, 10(2), 48-52. https://dx.doi.org/10.1109/MSP.2012.2