Cyber Profiling: Using Instant Messaging Author Writeprints for Cybercrime Investigations

Posted: February 9, 2016 | By: Dr. Angela Orebaugh, Dr. Jason Kinser, Dr. Jeremy Allnutt

The explosive growth of instant messaging (IM) in both personal and professional environments has increased the risk to proprietary, sensitive, and personal information, and to personal safety, through an influx of IM-assisted cybercrimes such as phishing, social engineering, threats, cyber bullying, hate speech and hate crimes, child exploitation, sexual harassment, and the illegal sale and distribution of software. IM-assisted cybercrimes continue to make the news, with child exploitation, cyber bullying, and scamming leading last month’s headlines. Instant messaging’s anonymity and use of virtual identities hinder social accountability and present a critical challenge for cybercrime investigation. Cyber forensic techniques are needed to assist cybercrime decision support tools in collecting and analyzing digital evidence, discovering characteristics of the cyber criminal, and identifying cyber criminal suspects.

Introduction

The anonymous nature of the Internet allows online criminals to hide their true identities behind virtual identities and so facilitate cybercrimes. Although central IM servers authenticate users upon login, there is no means of authenticating or validating peers (buddies). Current IM products do not address the anonymity and ease of impersonation that instant messaging allows. Author writeprints can provide cybercrime investigators with a unique tool for analyzing IM-assisted cybercrimes. Writeprints are based on behavioral biometrics, which are persistent personal traits and patterns of behavior that may be collected and analyzed to aid a cybercrime investigation (Li et al., 2006). Instant messaging behavioral biometrics include online writing habits, known as stylometric features, which may be used to create an author writeprint that helps identify the author, or characteristics of the author, of a set of instant messages. The writeprint is a digital fingerprint that represents the distinguishing stylometric features occurring in an author’s computer-mediated communications. Writeprints may be used as input to a criminal cyberprofile and as an element of a multimodal system for cybercrime investigations. In conjunction with other evidence, criminal investigation techniques, and biometric techniques, they can be used to reduce the potential suspect space to a certain subset of suspects; identify the most plausible author of an IM conversation from a group of suspects; link related crimes; develop an interview and interrogation strategy; and gather convincing digital evidence to justify search and seizure and provide probable cause.

Instant Messaging and Cybercrime

Instant messaging’s anonymity hinders social accountability and leads to IM-assisted cybercrime facilitated by the following:

  • Users can create any virtual identity,
  • Users can log in from anywhere,
  • Files can be transmitted, and
  • Communication is often transmitted unencrypted.

In IM communications, criminals use virtual identities to hide their true identities. They can use multiple screen names or impersonate other users with the intention of harassing or deceiving unsuspecting victims. Criminals may also supply false information in their virtual identities; for example, a male user may configure his virtual identity to appear female. Since most IM systems use the public Internet, there is a high risk that usernames and passwords will be intercepted, or that an attacker will hijack a connection or launch a man-in-the-middle (MITM) attack. In hijacking and MITM attacks, the victim believes he or she is communicating with a buddy but is really communicating with an attacker masquerading as that buddy. Instant messaging’s anonymity allows cyber criminals such as pedophiles, scam artists, and stalkers to make contact with their victims and get to know those they target for their crimes (Cross, 2008). IM-assisted cybercrimes, such as phishing, social engineering, threats, cyber bullying, hate speech and hate crimes, child exploitation, sexual harassment, and the illegal sale and distribution of software, continue to increase (Moores and Dhillon, 2000). Additionally, criminals such as terrorist groups, gangs, and cyber intruders use IM to communicate (Abbasi and Chen, 2005). Criminals also use IM to transmit worms, viruses, Trojan horses, and other malware over the Internet.

With increasing IM cybercrime, there is a growing need for techniques to assist in identifying online criminal suspects as part of the criminal investigation. Cyber forensics is the application of investigation and analysis techniques to gather evidence suitable for presentation in a court of law with the goal of discovering the crime that took place and who was responsible (Bassett et al., 2006). With IM communications, it is necessary to have cyber forensics techniques to assist in determining the IM user’s real identity and collect digital evidence for investigators and law enforcement.

Behavioral Biometrics Writeprints for Authorship Analysis

Determining an IM user’s real identity relies on the fact that humans are creatures of habit and have certain persistent personal traits and patterns of behavior, known as behavioral biometrics (Revett, 2008). Online writing habits, known as stylometric features, include composition syntax and layout, vocabulary patterns, unique language usage, and other stylistic traits. Thus, certain stylometric features may be used to create an author writeprint to help identify an author of a particular piece of work (De Vel et al., 2001). A writeprint represents an author’s distinguishing stylometric features that occur in his/her instant messaging communications. These stylometric features may include average word length, use of punctuation and special characters, use of abbreviations, and other stylistic traits. Writeprints can provide cybercrime investigators a unique behavioral biometric tool for analyzing IM-assisted cybercrimes. Writeprints can be used as input to a criminal cyberprofile and as an element of a multimodal system to perform cyber forensics and cybercrime investigations.

Instant messaging communications contain several stylometric features for authorship analysis research. Certain IM-specific features, such as message structure, unusual language usage, and special stylistic markers, are useful in forming a suitable writeprint feature set for authorship analysis (Zheng et al., 2006). The style of IM messages is very different from that of traditional literature or other forms of computer-mediated communication. The real-time, casual nature of IM messages produces text that is conversational in style and reflects the author’s true writing style and vocabulary (Kucukyilmaz et al., 2008). Significant characteristics of IM include special linguistic elements such as abbreviations and computer and Internet terms, known as netlingo. The textual nature of IM also creates a need to exhibit emotions. Emotion icons, called emoticons, are sequences of punctuation marks commonly used to represent feelings within computer-mediated text (Kucukyilmaz et al., 2008). An author’s IM writeprint may be derived from network packet captures or application data logged during an instant messaging conversation. Although some types of digital evidence, such as source IP addresses, file timestamps, and metadata, may be easily manipulated, author writeprints based on behavioral biometrics are unique to an individual and difficult to imitate.

Creating IM Writeprints

A stylometric feature set is composed of a predefined set of measurable writing style attributes. Given t predefined features, each set of IM messages for a given author can be represented as a t-dimensional vector, called a writeprint. Figure 1 presents a stylometric feature set for a 356-dimensional vector writeprint with lexical, syntactic, and structural features (Orebaugh et al., 2014). The number of features in each category is shown in parentheses.

Lexical features mainly consist of count totals and are further broken down into emoticon, abbreviation, word-based, and character-based features. Syntactic features include punctuation and function words and capture an author’s habits of organizing sentences. Function words include conjunctions, prepositions, and other words that carry little meaning when used alone, such as “the” or “of”. They relate the content words in a sentence, such as “ball” or “bounce”, to one another. Analyzing function words rather than content words yields topic-independent results that reflect an author’s preferred ways of expressing himself or herself and forming sentences. Structural features capture the way an author organizes the layout of text. IM communications have no standard headers, greetings, farewells, or signatures, which leaves only the average characters and words per message as structural layout features. Lists of function words, abbreviations, and emoticons are included in Appendix A.
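The three feature categories described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the feature extractor used in the cited work: the word, abbreviation, emoticon, and punctuation lists are tiny hypothetical stand-ins for the full lists in Appendix A, and the function name `extract_features` and the feature key names are invented for this example.

```python
import string

# Hypothetical, abbreviated stand-ins for the much longer lists in Appendix A.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")
ABBREVIATIONS = ("lol", "brb", "omg")
EMOTICONS = (":)", ":(", ";)", ":D")
PUNCTUATION = ("?", "!", ",", ".")


def extract_features(messages):
    """Return a dict of stylometric feature values for one author's messages."""
    tokens = [tok for msg in messages for tok in msg.split()]
    # Words exclude emoticon tokens; surrounding punctuation is stripped and
    # case is folded so that "The," matches "the".
    words = [tok.strip(string.punctuation).lower()
             for tok in tokens if tok not in EMOTICONS]
    words = [w for w in words if w]
    text = " ".join(messages)
    n_msgs = max(len(messages), 1)   # guard against empty input
    n_words = max(len(words), 1)

    features = {}
    # Lexical features: emoticon, abbreviation, word-based, character-based.
    for emo in EMOTICONS:
        features["emoticon_" + emo] = tokens.count(emo)
    for abbr in ABBREVIATIONS:
        features["abbrev_" + abbr] = words.count(abbr)
    features["word_count"] = len(words)
    features["char_count"] = sum(len(msg) for msg in messages)
    features["avg_word_length"] = sum(len(w) for w in words) / n_words
    # Syntactic features: punctuation marks and function words.
    for mark in PUNCTUATION:
        features["punct_" + mark] = text.count(mark)
    for fw in FUNCTION_WORDS:
        features["func_" + fw] = words.count(fw)
    # Structural features: average characters and words per message.
    features["chars_per_message"] = features["char_count"] / n_msgs
    features["words_per_message"] = len(words) / n_msgs
    return features
```

Because the function-word counts ignore content words, two conversations by the same author on different topics should still produce similar values for the `func_` features, which is the topic independence the paragraph above describes.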

[Figure 1. Stylometric feature set with lexical, syntactic, and structural features]

Writeprints are created by generating totals for each stylometric feature, resulting in a writeprint (Wx) for a set of messages {M1,…,Mp} from an author (An) or author category (Cm). A writeprint may be viewed in comma-separated value (CSV) format, where each value represents the total for a specific feature. An example writeprint for an author An using a selected feature set {F1,…,Fq}, where q = 100, over a set of messages {M1,…,Mp} looks like the following:

[Image: example writeprint shown as comma-separated feature totals]
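To make the construction concrete, the following minimal sketch totals each feature over a message set and prints the resulting vector as one CSV row. The six-entry `FEATURE_SET` is a hypothetical stand-in for a real selected feature set {F1,…,Fq}, and the function names `writeprint` and `to_csv_row` are invented here; the counting rules are one reasonable choice, not necessarily those of the original study.

```python
# Hypothetical ordered feature set {F1,...,Fq}; here q = 6 rather than 100.
FEATURE_SET = ("the", "of", "lol", "brb", ":)", "!")


def writeprint(messages, feature_set=FEATURE_SET):
    """Total each feature over the messages {M1,...,Mp}; return a q-dimensional vector."""
    totals = [0] * len(feature_set)
    for msg in messages:
        lowered = msg.lower()
        tokens = lowered.split()
        for i, feat in enumerate(feature_set):
            if len(feat) == 1 and not feat.isalnum():
                # Single punctuation marks are counted as characters.
                totals[i] += lowered.count(feat)
            else:
                # Words, abbreviations, and emoticons are counted as whole tokens.
                totals[i] += tokens.count(feat)
    return totals


def to_csv_row(vector):
    """Render a writeprint vector as a single comma-separated line."""
    return ",".join(str(v) for v in vector)


sample = ["lol the end of the game!", "brb :)", "The best :) !!"]
print(to_csv_row(writeprint(sample)))
```

The position of each value in the row is fixed by the order of the feature set, so writeprints built with the same feature set can be compared column by column.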

After writeprints are generated, they may be normalized, standardized, and input into various statistical models for analysis. Figure 2 shows the output of the Principal Component Analysis (PCA) model for the writeprints of seven authors (Orebaugh et al., 2014). The figure shows the first three principal components for multiple author conversations, mapped in three-dimensional space. In this example, each author has a relatively well-defined cluster representing his or her writeprint. Different authors separate from each other, while multiple conversations by the same author cluster together. This type of analysis may be used in an investigation to show that sample evidentiary writeprints do or do not overlap with certain suspect writeprints, thus helping investigators narrow the suspect space, develop an interrogation strategy, link related crimes, or justify probable cause.

Figure 2. IM Writeprint PCA Output
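The standardize-then-PCA step can be illustrated with a small pure-Python sketch. It is not the analysis pipeline of the cited study: the z-score standardization, the power-iteration PCA, the function names, and the toy data (two hypothetical authors, three conversations each, four features) are all assumptions made for illustration. In practice a numerical library would normally be used instead.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def standardize(rows):
    """Z-score each feature column (sample std); constant columns become zeros."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] if stds[j] else 0.0 for j in range(d)]
            for r in rows]


def covariance(centred):
    """Sample covariance matrix of rows whose columns already have zero mean."""
    n, d = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_components(cov, k, iters=500):
    """Return k leading eigenvectors via power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]           # work on a copy
    comps = []
    for c in range(k):
        v = [1.0 / (i + 1 + c) for i in range(d)]   # deterministic start vector
        norm = math.sqrt(dot(v, v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [dot(cov[i], v) for i in range(d)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(cov[i], v) for i in range(d)])
        comps.append(v)
        # Deflate: remove the component just found from the matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return comps


def project(rows, comps):
    """Coordinates of each row on the given principal components."""
    return [[dot(r, c) for c in comps] for r in rows]


# Toy writeprints: rows 0-2 are one hypothetical author, rows 3-5 another.
writeprints = [[10, 9, 1, 2], [11, 10, 2, 1], [9, 11, 1, 1],
               [1, 2, 10, 9], [2, 1, 11, 10], [1, 1, 9, 11]]
z = standardize(writeprints)
scores = project(z, top_components(covariance(z), k=2))
for row in scores:
    print([round(x, 2) for x in row])
```

On this toy data the two authors fall on opposite sides of the first principal component, which mirrors the clustering behaviour described for Figure 2. The sign of each component is arbitrary, so only the relative positions of the points are meaningful.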
