ef7c95c87c4c889729ef34b2f943205a.ppt
- Number of slides: 25
ACTIONABLE KNOWLEDGE DISCOVERY FOR THREATS INTELLIGENCE SUPPORT ~ A MULTI-DIMENSIONAL DATA MINING METHODOLOGY
2nd Int. Workshop on Domain Driven Data Mining, Pisa, Dec 15th, 2008
- Olivier Thonnard, Royal Military Academy, Polytechnic Faculty, Belgium, olivier.thonnard@rma.ac.be
- Marc Dacier, Symantec Research Labs, Sophia Antipolis, France, marc_dacier@symantec.com
Outline
1. Introduction
2. A multi-dimensional & domain-driven approach for mining network traffic (e.g. malicious)
3. Experimental environment
4. A real-world example
5. Conclusions
Introduction
According to the security community, today’s cybercrime:
- Is increasingly organized
- Involves the commoditization of various activities:
  - Selling 0-days and new (undetected) malware
  - Selling or renting compromised hosts or entire botnets
- Seems to be specialized in certain countries
- Coordination patterns …
Threats intelligence
- What is the prevalence of emerging coordinated malicious activities?
- Which countries / IP blocks seem to be more affected?
- Can we observe various “communities” of machines coordinating their efforts?
- How to discover knowledge about:
  1. The modus operandi of attack phenomena
  2. The underlying root causes of attacks
- How to analyze Internet threats from a global strategic level?
- Can we enable some sort of Internet threat “situational awareness”?
Our « multi-dimensional KDD » approach to analyze network threats
- Collect real-world attack traces from a number of (worldwide) distributed sensors
  - Network of honeypots = “Honeynet”
- Threats analysis (semi-automated):
  - Collect “attack events” from each sensor
  - Multi-dimensional KDD:
    1) Extract relevant nuggets of knowledge with expert-defined features (DDDM), using clique algorithms (clique-based clustering): extraction of maximal weighted cliques
    2) Synthesize those pieces of knowledge to create “concepts” describing the attack phenomena, using clique combinations (DDDM)
Leurré.com Project: +/- 40 sensors, 30 countries, 5 continents
Leurre.com / SGNET Honeynet
- Global distributed honeynet (http://www.leurrecom.org)
- 50+ sensors distributed in more than 30 countries worldwide
- Ongoing effort of EURECOM since 2003
- Same configuration for all sensors:
  - (V1.0): low-interaction honeypots based on honeyd
  - (V2.0): high-interaction honeypots based on ScriptGen
- Data enrichment:
  - Dataset enriched with contextual information: geolocation, reverse DNS, ASN, external blacklists (Spamhaus, Shadowserver, DShield, EmergingThreats, etc.)
  - Parsed and uploaded into an Oracle DB
- All partners have full access (for free) to the whole DB
Research context: WOMBAT
- Worldwide Observatory of Malicious Behaviors And Threats
- EU-FP7 project (http://www.wombat-project.eu)
- Joint effort in collecting, sharing and analyzing data on global Internet threats
Definition 1: Attack profiles
In our honeynet:
- A source = an IP address that targets a honeypot platform on a given day, with a certain port sequence.
- All sources are clustered into “attack profiles” based on certain network characteristics (*): targeted port sequence, #packets, attack duration, packet payload, …
- Attack profile: attack tool fingerprint(s)
(*) F. Pouget, M. Dacier, Honeypot-Based Forensics. AusCERT Asia Pacific Information Technology Security Conference 2004.
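To make the definition of a source concrete, here is a minimal sketch in Python. The field names, the sample values and the grouping function are all hypothetical. Grouping by port sequence alone is only a coarse first step: the clustering cited above also uses the number of packets, attack duration and payload.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    """One source as defined on the slide (field names are hypothetical)."""
    ip: str
    platform: str
    day: date
    port_sequence: tuple  # e.g. ("I", "445T", "139T")


def group_by_port_sequence(sources):
    """Coarse first grouping: sources sharing the same targeted port sequence."""
    groups = defaultdict(list)
    for src in sources:
        groups[src.port_sequence].append(src)
    return dict(groups)


# Made-up observations (documentation IP ranges), for illustration only.
sources = [
    Source("192.0.2.10", "sensor_x", date(2007, 3, 1), ("I", "445T")),
    Source("198.51.100.7", "sensor_x", date(2007, 3, 1), ("I", "445T")),
    Source("203.0.113.5", "sensor_y", date(2007, 3, 2), ("1433T",)),
]
groups = group_by_port_sequence(sources)
print({seq: len(members) for seq, members in groups.items()})
```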
Definition 2: Attack event on sensor ‘x’
[Figure labels: Event 1, Event 2, Event 3]
Dimensions used to create “attack cliques”
We need to identify salient features for the creation of meaningful cliques (“viewpoints”): expert-defined characteristics for each dimension
- Geolocation
  - Botnets located in specific regions
  - So-called “safe harbors” for the hackers
- IP netblocks / ISPs of origin
  - Bias in worm propagation (e.g. malware coding strategies)
  - “Uncleanliness” of certain networks (e.g. clusters of zombie machines)
- Time series
  - Synchronized activities targeting different sensors
- Targeted sensors
- Many others
Remark: distances used for distributions: Kullback-Leibler, Chi-2, and Kolmogorov-Smirnov
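The three distances named in the remark can be sketched for two discrete distributions, for example the per-country source counts of two attack events. This is an illustration under assumptions: the slides do not say which variant of each measure was used, the symmetric chi-square form below is only one common choice, and the counts are made up. The Kolmogorov-Smirnov statistic depends on the ordering of the bins, so it suits ordered data better than unordered categories.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); eps guards against q_i = 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def chi2_distance(p, q):
    """Symmetric chi-square distance (one common variant, assumed here)."""
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)


def ks_statistic(p, q):
    """Largest gap between the two cumulative distributions."""
    cum_p = cum_q = gap = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        gap = max(gap, abs(cum_p - cum_q))
    return gap


# Hypothetical source counts over three countries for two attack events.
geo_a = normalize([60, 30, 10])
geo_b = normalize([10, 30, 60])
print(kl_divergence(geo_a, geo_b), chi2_distance(geo_a, geo_b), ks_statistic(geo_a, geo_b))
```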
Cliques combination: Creating multi-dimensional “concepts”
[Figure labels: Geographical cliques of attack events; Temporal cliques of attack events; Dimension 2-concept + time]
Remark: for each dimension, we extract maximal weighted cliques using the « dominant sets » approximation (note: this needs a full similarity matrix)
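The dominant sets approximation is usually computed with replicator dynamics on the similarity matrix. The sketch below assumes that standard iteration, which the slides do not spell out, and peels off one clique at a time. The function names, thresholds and the toy similarity matrix are hypothetical.

```python
def dominant_set(A, tol=1e-8, max_iter=1000, cutoff=1e-4):
    """Run replicator dynamics x_i <- x_i * (A x)_i / (x^T A x) from the
    uniform vector and return the indices that keep a non-negligible weight."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = sum(x[i] * Ax[i] for i in range(n))
        if denom <= 0.0:
            return []  # no similarity left among these nodes
        new_x = [x[i] * Ax[i] / denom for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    return [i for i in range(n) if x[i] > cutoff]


def extract_cliques(A, min_size=2):
    """Peel dominant sets one by one until too few nodes remain."""
    remaining = list(range(len(A)))
    cliques = []
    while len(remaining) >= min_size:
        sub = [[A[i][j] for j in remaining] for i in remaining]
        members = dominant_set(sub)
        if len(members) < min_size:
            break
        cliques.append([remaining[k] for k in members])
        remaining = [node for k, node in enumerate(remaining) if k not in members]
    return cliques


# Toy full similarity matrix (symmetric, zero diagonal): events 0-2 are
# strongly similar to each other, and so are events 3-4.
S = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.0],
]
print(extract_cliques(S))
```

The full similarity matrix is needed because every update multiplies the weight vector by all pairwise similarities.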
Dynamic creation of Concept lattices
[Figure: lattice organised by dimensional level: initial set of attack events; cliques = D1-concepts; D2-concepts; D3-concepts; D4-concept]
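One way to read the lattice levels is that a Dk-concept groups the attack events shared by k cliques taken from k different dimensions. The sketch below assumes that plain set intersection is the combination rule, which the slides do not state. The clique labels follow the style of the real-world example, but the memberships and the `min_events` threshold are made up.

```python
from itertools import combinations, product


def combine(cliques_by_dim, level, min_events=2):
    """Build level-k concepts by intersecting one clique from each of k dimensions."""
    concepts = {}
    for dims in combinations(sorted(cliques_by_dim), level):
        label_lists = [sorted(cliques_by_dim[d]) for d in dims]
        for labels in product(*label_lists):
            events = set.intersection(
                *(cliques_by_dim[d][lab] for d, lab in zip(dims, labels))
            )
            if len(events) >= min_events:
                concepts[labels] = events
    return concepts


# Hypothetical D1-concepts (cliques): dimension -> clique label -> attack event ids.
cliques_by_dim = {
    "geo": {"g1": {1, 2, 3, 4}, "g9": {5, 6, 7}},
    "time": {"ts1": {1, 2, 5}, "ts4": {3, 4, 6, 7}},
    "subnet": {"s4": {1, 2, 3}, "s12": {6, 7}},
}
d2 = combine(cliques_by_dim, level=2)
d3 = combine(cliques_by_dim, level=3)
print(d2)
print(d3)
```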
Some experiments
Some analysis details:
- Timeframe: Sep 2006 to June 2008
- Network traffic volume: 282,363 IP sources (grouped into 351 attack events)
- Nr of targeted sensors: 36, in 20 different countries and 18 different subnets
- 136 different attack profiles (i.e. attack clusters)
Experimental results: Cliques overview
Attack dimension | Nr of cliques | Volume of sources (%) | Most targeted port sequences
Geolocation | 45 | 66.4 | 1027U, I, 1433T, 1026U, I-445T, 5900T, 1028U, 9763T, I-445T-80T, 15264T, 29188T, 6134T, 6769T, 1755T, 64264T, 1028U-1027U-1026U, 32878T, 64783T, 4152T, 25083T, 9661T, 25618T, …
IP Subnets (Class A) | 30 | 56.0 | 1027U, I, 1433T, 1026U, I-445T, 5900T, 1028U, 9763T, 15264T, 29188T, 6134T, 6769T, 1755T, 50656T, 64264T, 1028U-1027U-1026U, 32878T, 64783T, 18462T, 4152T, 25083T, 9661T, 25618T, 7690T, …
Targeted sensors | 17 | 70.1 | I, 1433T, I-445T, 1025T, 5900T, 1026U, I-445T-139T-445T, 4662T, 9763T, 1008T, 6211T, I-445T-80T, 15264T, 29188T, 12293T, 33018T, 6134T, 6769T, 1755T, 2968T, 26912T, 50656T, 64264T, 32878T, …
Attack time series | 82 | 92.2 | 135T, I, 1433T, I-445T, 5900T, 1026U, I-445T-139T-445T, I-445T-80T, 6769T, 1028U-1027U-1026U, 50286T, 2967T, …
Visualizing Cliques using Multi-dimensional Scaling
- ‘Dimensionality reduction’: from a high-dimensional dataset to a low-dimensional map retaining the global and local structure
- Build a matrix with e.g.:
  - Rows = attack events
  - Columns = feature vectors
  - Example: geolocation vector of 226 country variables
- MDS techniques:
  - Linear: PCA
  - Non-linear: Sammon mapping, Isomap, LLE, (t-)SNE
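As an illustration of the linear case, the sketch below projects an events-by-features matrix onto its first two principal components using power iteration with deflation. It is a didactic, dependency-free sketch: real work would use a numerical library, the toy matrix has 3 country variables instead of 226, and all names and values are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(X):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(X), len(X[0])
    return [[sum(r[a] * r[b] for r in X) / (n - 1) for b in range(d)]
            for a in range(d)]


def matvec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def top_eigenpair(C, iters=500):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    v = [1.0 / (j + 1) for j in range(len(C))]
    for _ in range(iters):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, matvec(C, v)))
    return eigval, v


def pca_2d(rows):
    """Return 2-D coordinates per row and the two principal directions."""
    X = center(rows)
    C = covariance(X)
    lam1, v1 = top_eigenpair(C)
    # Deflation: remove the first component, then find the next one.
    C2 = [[C[a][b] - lam1 * v1[a] * v1[b] for b in range(len(C))]
          for a in range(len(C))]
    _, v2 = top_eigenpair(C2)
    coords = [(sum(x * a for x, a in zip(r, v1)),
               sum(x * b for x, b in zip(r, v2))) for r in X]
    return coords, (v1, v2)


# Rows = attack events, columns = share of sources per country (toy data).
events = [
    [0.80, 0.10, 0.10],
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.75, 0.15],
]
coords, components = pca_2d(events)
for point in coords:
    print(point)
```

Events with similar geographical distributions land close together on the resulting map, which is what makes clique membership visible when the points are coloured by clique number.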
Visualizing Cliques using MDS and Country labels
[Figure legend: clique number]
Combining Cliques: Real-world example
Botnet scans on ports: I, I-445T-139T, I-445T-80T
[Figure: cliques per dimension for attack events {1, 2, 3, …, 67}, along a time axis. Cliques of time series: ts1, ts4, ts2, ts6. Geo cliques: g1, g9, g16, g12, g32. Subnets cliques: s4, s12, s19, s26, s28, s30, s24. Platform cliques: p7 (labelled “superclique”). Annotations: “Only scanners! (ICMP)”, “Only attackers! (I-445T-139T…)”]
Visualizing Cliques using Multi-dimensional Scaling
[Figure labels: scanners, attackers; legend: clique number]
Real-world example: Botnet attack waves
Inferred facts:
- Different waves in time
- Those 4 botnet waves have hit the same group of platforms
- Dynamic evolution of the botnet population (IP blocks) between each attack wave
- Separation of attackers and scanners
Scanners vs Attackers …
[Figure panels: Scanning bots, Attacking bots]
Conclusions
This KDD methodology can produce concise, high-level summaries of attack traffic:
- Attack cliques deliver insights into global attack phenomena
It facilitates the interpretation of traffic correlations:
- Attack concepts are semantically rich
- They help to uncover certain modus operandi
It is flexible and open to additional correlation « viewpoints »:
- New clique dimensions can be added easily when experts find them relevant (i.e. domain-driven)
Future work
- Integration of other relevant attack features:
  - Botnet / worm patterns separation
  - Malware characteristics (e.g. from high-interaction traffic)
- Find appropriate combinations of attack dimensions:
  - Generation of higher-level “concepts” describing real-world phenomena
- Knowledge engineering:
  - Exploit attack concepts in a “reasoning system”
  - Decision tree, expert system, kNN, … ?
Thank you. Any questions?
Note: If you’d like to participate in the WOMBAT project (*), please do not hesitate to contact us:
- Engin Kirda: engin.kirda@eurecom.fr
- Marc Dacier: marc_dacier@symantec.com
- Olivier Thonnard: olivier.thonnard@rma.ac.be
(*) Leita, C.; Pham, V.; Thonnard, O.; Ramirez-Silva, E.; Pouget, F.; Kirda, E.; Dacier, M. The Leurre.com Project: Collecting Internet Threats Information Using a Worldwide Distributed Honeynet. 1st WOMBAT workshop, April 21st-22nd, Amsterdam.
Leurre.com V2.0: SGNET (*)
- Novel high-interaction honeypots
- SGNET = ScriptGen honeypots + Argos emulator + Nepenthes
- Malware analysis: VirusTotal + Anubis Sandbox
[Figure labels: ScriptGen, Anubis, “0-day”, Malware repository, Automated submissions]
(*) Corrado Leita and Marc Dacier. SGNET: a worldwide deployable framework to support the analysis of malware threat models. (EDCC 2008, Lithuania)