
- Number of slides: 29
A High Throughput String Matching Architecture for Intrusion Detection and Prevention
Lin Tan, University of Illinois, Urbana-Champaign
Tim Sherwood, UC Santa Barbara

Outline
• Why String Matching
  – Matching against multiple strings
• The Aho-Corasick Algorithm
  – The Devil in the Constants
• A Bit-Split Algorithm
• Hardware Design and Analysis
• Conclusions

To Protect and Serve
• Our machines are constantly under attack.
• We cannot rely on end users; we need networks that actively defend themselves.
• IDS/IPS are promising ways of providing protection.
  – Market for such systems: $918.9 million by the end of 2007.
  – Snort: a widely accepted open source IDS.
• This requires the protection system to operate at 10 to 40 Gb/s (we aim at current and next generation networks).

Our Contributions
• String Matching Architecture
  – 0.4 MB and 10 Gbps for the Snort rule set (>10,000 characters)
• Bit-Split String Matching Algorithm
  – Reduces out edges per state from 256 to 2
• Performance/area beats the best techniques we examined by a factor of 10 or more.

Scanning for Intrusions
• Most IDS define a set of rules; a string defines a suspicious transmission.
• Example (CodeRed worm): web flow established, uricontent with "/root.exe"
• [Figure: an IDS performing a software scan between Traffic In and Traffic Out]
• We are not building a full IDS; rather, we are building the primitives from which full systems can be built.

Multiple String Matching
• The multiple string matching problem:
  – Input: a set of strings/patterns S, and a buffer b
  – Output: every occurrence of an element of S in b
  – Extra constraint: b is really a stream
• A string can be anywhere in the payload of a packet.
• [Figure: Input: A B D FC A B; Strings: A B A CA B]
• How to implement:
  – Option 1) search for each string independently
  – Option 2) combine the strings together and search for all of them at once

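As a point of reference, Option 1 can be sketched in a few lines of Python. This is only an illustration of the problem statement, not the approach the talk advocates: the function name `find_all` is hypothetical, and the pattern set is borrowed from the {he, she, his, hers} example used later in the deck. Each pattern triggers its own pass over the buffer, which is the cost that Option 2 avoids.

```python
def find_all(patterns, buf):
    """Option 1: search for each pattern independently (one pass per pattern)."""
    hits = []
    for pat in patterns:
        start = buf.find(pat)
        while start != -1:
            hits.append((start, pat))          # (start offset, pattern)
            start = buf.find(pat, start + 1)   # allow overlapping occurrences
    return sorted(hits)


# Every occurrence of every pattern is reported, including overlapping ones.
print(find_all(["he", "she", "his", "hers"], "ushers"))
```

The work grows with the number of patterns, and the buffer must be revisited for each one, which sits poorly with the constraint that b is really a stream.
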
Why hardware
• Snort: >1,000 rules, growing at 1 rule/day or more
• Active research into automated rule building
• Strings are not limited to just [a-z]+
• We need a high-speed string matching technique with stringent worst-case performance.
• Many algorithms target average-case performance.
• Aho-Corasick can scan the input once and output all matches, but it is too big to fit on-chip.

The Aho-Corasick Algorithm • Given a finite set P of patterns, build a deterministic finite automaton G accepting the set of all patterns in P.
An AC Automaton Example
• Example: P = {he, she, his, hers}
• [Figure: the AC automaton for P, states 0 through 9, with legend: Initial State, Transition Function, State, Accepting State. Edges pointing back to State 0 are not shown.]
• Construction: linear time.
• Search for all patterns in P: linear time.

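A minimal Python sketch of the textbook Aho-Corasick construction (goto, failure, and output functions) for this pattern set follows. The function names `build` and `search` are illustrative, not taken from the talk. States are numbered in the order they are created while inserting the patterns, which for this example makes states 2, 5, 7, and 9 the accepting ones.

```python
from collections import deque


def build(patterns):
    """Build goto/fail/out tables; state 0 is the initial state."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:                      # phase 1: trie of all patterns
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())           # phase 2: failure links, breadth-first
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]    # inherit matches that end here as suffixes
            queue.append(t)
    return goto, fail, out


def search(text, goto, fail, out):
    """Scan the text once; report (end offset, pattern) for every match."""
    s, hits = 0, []
    for pos, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((pos, pat) for pat in out[s])
    return hits


goto, fail, out = build(["he", "she", "his", "hers"])
print(search("ushers", goto, fail, out))
```

This variant follows failure links at search time. A hardware table instead stores a resolved next-state pointer for every input character, which is where the 256 out edges per state come from.
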
Linear Time: So what's the problem
• How to implement it on chip?
• [Figure: state table with up to 16,384 states, each holding 256 next-state pointers of 14 bits (<14>)]
• Problem: size too big to be on-chip
  – ~10,000 nodes
  – 256 out edges per node
  – Requires 16,384*256*14 bits, on the order of 10 MB
• Solution: partition into small state machines
  – Fewer strings per machine
  – Fewer out edges per machine

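The size estimate can be checked with a few lines of arithmetic. The constant names below are illustrative. The sketch counts only the next-state pointers: 14 bits is exactly what is needed to address 16,384 states, and the pointer table alone comes to 7 MiB. The slide's "~10 MB" is of the same order; whether it also includes per-state match information is an assumption, since the slide does not say.

```python
STATES = 16_384        # table depth shown on the slide (2**14)
OUT_EDGES = 256        # one next-state pointer per possible input byte
POINTER_BITS = 14      # bits needed to name one of 16,384 states

assert 2 ** POINTER_BITS == STATES

table_bits = STATES * OUT_EDGES * POINTER_BITS
table_mib = table_bits / 8 / 2 ** 20

print(f"{table_bits:,} bits = {table_mib:.1f} MiB of next-state pointers")
```
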
Our Main Idea: Bit-Split
• Partition the rules (P) into smaller sets (P0 to Pn)
• Build an AC state machine for each subset
• For each DFA Pi, rip the state machine apart into 8 tiny state machines (Bi0 through Bi7)
• Each tiny machine searches for 1 bit in the 8-bit encoding of an input character
  – Only if all the different B machines agree can there actually be a match

Binary Encoding
• P0 = {he, she, his, hers}

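Assuming the characters are encoded as standard 8-bit ASCII, which agrees with the bit patterns 0110 1000 (h) and 0111 0011 (s) visible elsewhere in the deck, the encoding and the per-bit view of a string can be sketched as follows. The helper names `encode` and `bit_column` are hypothetical, and numbering bits from the most significant end (bit 0 is the leftmost) is an assumption.

```python
def encode(text):
    """8-bit binary encoding of each character, most significant bit first."""
    return [format(b, "08b") for b in text.encode("ascii")]


def bit_column(text, bit):
    """The stream of bits seen by the machine responsible for one bit position."""
    return "".join(code[bit] for code in encode(text))


for pattern in ["he", "she", "his", "hers"]:
    print(pattern, encode(pattern))

# Each bit-split machine only ever sees one such column of the input.
print(bit_column("hxhe", 3), bit_column("hxhe", 4))
```
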
An example of Bit-Split
• P0 = {he, she, his, hers}
• Bit patterns shown: 0001 0000, 0110 1000 (h), 0111 0011 (s)
• [Figure: the AC automaton P0 next to the bit-split machine B03. Edges pointing back to State 0 are not shown.]
• States of B03, each a set of P0 states: b0 {0}, b1 {0, 1}, b2 {0, 3}, b3 {0, 1, 2, 6}, b4 {0, 1, 4}, b5 {0, 3, 7, 8}, b6 {0, 1, 2, 5, 6}, b7 {0, 3, 9}

Compact State Set
• P0 = {he, she, his, hers}
• [Figure: P0 and B03 as on the previous slide. Edges pointing back to State 0 are not shown.]
• Keeping only the accepting P0 states in each B03 state: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2, 5}, b7 {9}

An example of Bit-Split
• P0 = {he, she, his, hers}
• [Figure: P0 with two of its bit-split machines, B03 and B04. Edges pointing back to State 0 are not shown.]
• B03 (compact state sets): b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2, 5}, b7 {9}
• B04 (compact state sets): b0 {}, b1 {}, b2 {}, b3 {}, b4 {2}, b5 {}, b6 {2, 5}, b7 {}, b8 {2, 7}, b9 {9}

Nice Properties
• The number of states in Bij is rigorously bounded by the number of states in Pi
• No exponential blow-up in the number of states
• Linear construction time
• Possible to traverse multiple edges at a time to multiply throughput

Matching on the example
• [Figure: the AC automaton for P0 = {he, she, his, hers}]
• Input stream: h x h e r s
• The input stream is scanned only once.

Matching on the example
• [Figure: P0 with bit-split machines B03 and B04 processing the input "hxhe"; the bit streams shown are 0100 and 1110]
• How do you "combine" the results from the different state machines?
  – Only if all the state machines agree is there actually a match.

How to Implement
• The AC state machine is equivalent to the 8 tiny state machines.
• The 8 tiny state machines can run independently, and therefore in parallel.
• Intersection is done with a bit-wise AND.
• 8 is intuitive but not optimal.
• How to build a system to implement this algorithm?
  – Our algorithm makes it feasible to fit on-chip.

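The construction and the matching step can be sketched together in Python. This is a reading of the slides, not the authors' code: the function names are illustrative, bits are numbered from the most significant end (an assumption that reproduces the B03 state sets listed on the earlier slide), and sets of pattern indices stand in for hardware bit vectors, so set intersection plays the role of the bit-wise AND. A split machine is obtained by a subset construction over the AC transition table, grouping input bytes by the bits the machine is allowed to see.

```python
from collections import deque


def build_ac_dfa(patterns):
    """Full AC table: delta[state][byte] and out[state] = set of pattern indices."""
    goto, out = [{}], [set()]
    for idx, pat in enumerate(patterns):
        s = 0
        for c in pat.encode("ascii"):
            if c not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(idx)
    delta = [[0] * 256 for _ in goto]
    fail = [0] * len(goto)
    queue = deque()
    for c, t in goto[0].items():
        delta[0][c] = t
        queue.append(t)
    while queue:  # breadth-first, so the row of fail[s] is complete before s is expanded
        s = queue.popleft()
        for c in range(256):
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                out[t] |= out[fail[t]]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, out


def split_machine(delta, out, mask):
    """Subset-construct the machine that sees only the input bits selected by mask."""
    start = frozenset([0])
    index, sets, trans = {start: 0}, [start], []
    k = 0
    while k < len(sets):
        groups = {}
        for c in range(256):  # bytes that look identical through the mask share an edge
            groups.setdefault(c & mask, set()).update(delta[s][c] for s in sets[k])
        row = {}
        for sym, members in groups.items():
            nxt = frozenset(members)
            if nxt not in index:
                index[nxt] = len(sets)
                sets.append(nxt)
            row[sym] = index[nxt]
        trans.append(row)
        k += 1
    # Partial match vector: patterns accepted by any AC state in the set.
    pmv = [frozenset().union(*(out[s] for s in st)) for st in sets]
    return trans, pmv, sets


def match(patterns, machines, masks, text):
    """Step all machines on each byte; a pattern matches only if every machine agrees."""
    states = [0] * len(machines)
    hits = []
    for pos, c in enumerate(text.encode("ascii")):
        agreed = frozenset(range(len(patterns)))
        for m, (trans, pmv, _) in enumerate(machines):
            states[m] = trans[states[m]][c & masks[m]]
            agreed &= pmv[states[m]]  # the bit-wise AND of the slides
        hits.extend((pos, patterns[i]) for i in sorted(agreed))
    return hits


PATTERNS = ["he", "she", "his", "hers"]
delta, out = build_ac_dfa(PATTERNS)
BIT_MASKS = [0x80 >> i for i in range(8)]          # 8 machines, 1 bit each
TILE_MASKS = [0xC0 >> (2 * t) for t in range(4)]   # 4 machines, 2 bits each
one_bit = [split_machine(delta, out, m) for m in BIT_MASKS]
two_bit = [split_machine(delta, out, m) for m in TILE_MASKS]
print(match(PATTERNS, one_bit, BIT_MASKS, "hxhers"))
```

The same `split_machine` function covers both the 8 one-bit machines and a grouping into 4 two-bit machines, which illustrates the remark that 8 is intuitive but not optimal: wider groups mean fewer machines with more out edges each.
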
A Hardware Implementation
• [Figure: String Match Engine. A byte from the payload is fed to Rule Module 0 through Rule Module N. Each rule module contains four state machine tiles (Tile 0 to Tile 3); each tile takes 2 bits from each byte (bits [0:1], [2:3], [4:5], [6:7]) as its 2-bit input. Labels inside a tile: states 0 through 255, 4 next-state pointers (<8>), a 16-bit partial match vector, an 8-bit current state, decoder, 4:1 mux, output latch, control block, and config data. The partial match vectors combine into a 16-bit full match vector per rule module, giving the complete set of matches for all rules.]
• A rule module is equivalent to an AC state machine
• Rule modules and tiles are structurally equivalent
• All full match vectors are concatenated to indicate which strings are matched
• One tile stores one tiny bit-split state machine

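A rough storage estimate per tile can be derived from the labels in the figure. The widths used below (256 states per tile, four 8-bit next-state pointers for the 2-bit input, a 16-bit partial match vector, four tiles per rule module) are read off the figure labels and are therefore assumptions; the slide does not state a per-tile size, and the number of rule modules is not given, so no total is computed.

```python
STATES_PER_TILE = 256      # states 0..255, addressed by an 8-bit current state
POINTERS_PER_STATE = 4     # one per value of the 2-bit input
POINTER_BITS = 8
PMV_BITS = 16              # partial match vector width
TILES_PER_MODULE = 4       # bits [0:1], [2:3], [4:5], [6:7]

tile_bits = STATES_PER_TILE * (POINTERS_PER_STATE * POINTER_BITS + PMV_BITS)
module_bytes = TILES_PER_MODULE * tile_bits // 8

print(f"tile: {tile_bits} bits, rule module: {module_bytes} bytes")
```

Under these assumptions a state costs 48 bits instead of the 256 x 14 bits of the monolithic table, which is the effect of cutting both the number of states and the number of out edges per machine.
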
An efficient Implementation
• [Figure: cycle-by-cycle trace (Cycle 0 through Cycle 3) of the input h, x, h, e passing through Tile 0 to Tile 3. Each tile table has columns 00, 01, 10, 11 (next state for each 2-bit input) and PMV (partial match vector). The combined match vectors appear at Cycle 0 + P through Cycle 3 + P.]

Performance of Hardware
• Key metric: Throughput*Character/Area

Related Work
• Software-based
  – Good for ~100 Mb/s, common case
• FPGA-based
  – Many schemes map rules down to a specialized circuit
    • Near-optimal utilization of hardware resources
  – Implementing state machines on block RAMs [Cho and Mangione-Smith]
  – Concurrent to our work: mapping state machines to on-chip SRAM [Aldwairi et al.]
  – Bloom filters [Dharmapurikar et al.]
    • Excellent filter in the common case
• TCAM-based
  – Requires all patterns to be shorter than or equal to the TCAM width
  – Cutting long patterns: 2 Gbps with 295 KB TCAM [Yu et al.]

Conclusions
• New Tile-based Architecture
  – 0.4 MB and 10 Gbps for the Snort rule set (>10,000 characters)
  – Can be used for other applications, e.g. IP lookups and packet classification
• New Bit-split Algorithm
  – General-purpose enough for many other applications, e.g. spam detection, peephole optimization, IP lookups, and packet classification
  – Feasible to implement on other tile-based architectures

Thank you! Questions?