  • Number of slides: 42

GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T. J. Watson Research Center
Presented by Chao Liu

Motivations
  • Blossoming of open-source projects
      - SourceForge.net: 125,090 projects as of July 2006
  • Convenience for software plagiarism?
      - You can always find something online
  • Core-part plagiarism
      - Stripping away GUIs and other irrelevant parts
      - (Illegally) reusing the implementations of core algorithms
  • Our goal
      - Efficient detection of core-part plagiarism

Challenges
  • Effectiveness
      - Professional plagiarists
      - Automated plagiarism
  • Efficiency
      - Only a small part of the code is plagiarized; how can it be detected efficiently?

Outline
  • Plagiarism Disguises
  • Review of Plagiarism Detection
  • GPLAG: PDG-based Plagiarism Detection
  • Efficiency and Scalability
  • Experiments
  • Conclusions

Original Program
A procedure in a program called join

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Disguise 1: Format Alteration
Insert comments and blanks

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count; // initialization
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Disguise 2: Identifier Renaming
Rename variables consistently

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->nfields = num; // initialization
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->fields = (struct field *) xmalloc (sizeof (struct field) * num);
      for (i = 0; i < num; i++){
        ...
      }
    }

Disguise 3: Statement Reordering
Reorder non-dependent statements. Here the allocation of fill->fields moves to the top of the body and the fill->nfields assignment moves to just before the loop.

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      for (i = 0; i < num; i++){
        ...
      }
    }

Disguise 4: Control Replacement
Use an equivalent control structure

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        i++;
      }
    }

Disguise 5: Code Insertion
Insert immaterial code

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        for (int j = 0; j < i; j++);
        i++;
      }
    }

Fully Disguised
[Figure]

Review of Plagiarism Detection
  • String-based [Baker 1995]
      - A program is represented as a string
      - Blanks and comments are ignored
  • AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
      - A program is represented as an Abstract Syntax Tree (AST)
      - Fragile to statement reordering, control replacement and code insertion
  • Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
      - Variables of the same type are mapped to the same token
      - A program is represented as a token string
      - Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
      - Partially robust to statement reordering, control replacement and code insertion
      - Representatives: Moss and JPlag
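
To make the token-based idea concrete, here is a minimal sketch, not the Moss or JPlag implementation. It assumes a deliberately tiny keyword list and maps every non-keyword identifier to a single token `ID`, which is coarser than the per-type mapping described above. The names `normalize_tokens`, `KEYWORDS` and `emit` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Deliberately incomplete keyword list (an assumption of this sketch). */
static const char *const KEYWORDS[] = {
    "int", "char", "unsigned", "struct", "static", "void",
    "for", "while", "return", "sizeof"
};

static int is_keyword(const char *s, size_t len)
{
    for (size_t k = 0; k < sizeof KEYWORDS / sizeof KEYWORDS[0]; k++)
        if (strlen(KEYWORDS[k]) == len && strncmp(KEYWORDS[k], s, len) == 0)
            return 1;
    return 0;
}

/* Append len bytes of tok plus a separating space; returns -1 if out is full. */
static int emit(char *out, size_t cap, size_t *pos, const char *tok, size_t len)
{
    if (*pos + len + 2 > cap)
        return -1;
    memcpy(out + *pos, tok, len);
    *pos += len;
    out[(*pos)++] = ' ';
    return 0;
}

/* Turn C source text into a token string in which blanks and // comments
 * vanish and every identifier becomes "ID". Returns 0 on success. */
int normalize_tokens(const char *src, char *out, size_t cap)
{
    size_t pos = 0, i = 0;
    assert(src != NULL && out != NULL && cap > 0);
    while (src[i] != '\0') {
        unsigned char c = (unsigned char)src[i];
        if (isspace(c)) {
            i++;
        } else if (c == '/' && src[i + 1] == '/') {
            while (src[i] != '\0' && src[i] != '\n')
                i++;                      /* skip the line comment */
        } else if (isalpha(c) || c == '_') {
            size_t start = i;
            while (isalnum((unsigned char)src[i]) || src[i] == '_')
                i++;
            int kw = is_keyword(src + start, i - start);
            if (emit(out, cap, &pos, kw ? src + start : "ID",
                     kw ? i - start : 2) != 0)
                return -1;
        } else {
            if (emit(out, cap, &pos, src + i, 1) != 0)
                return -1;                /* punctuation and digits, one char each */
            i++;
        }
    }
    out[pos] = '\0';
    return 0;
}
```

Under this normalization, `blank->nfields = count; // initialization` and `fill->nfields = num;` produce the same token string, so format alteration and identifier renaming are neutralized. Reordered or inserted statements still change the string, which is why this family of tools is only partially robust to those disguises.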

Graphic representation of source code

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Control Dependency
[Figure: control-dependency edges drawn over the sum and add procedures]

Data Dependency
[Figure: data-dependency edges drawn over the sum and add procedures]
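
The two kinds of dependency can also be written down as a typed edge list. The sketch below encodes a hand-derived dependence graph for `sum`. It is an illustration only: the vertex numbering and the names `DepEdge`, `SUM_EDGES` and `count_edges` are assumptions, and it omits parameter vertices and loop-carried self-dependencies that a real PDG generator would also report.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CONTROL_DEP, DATA_DEP } DepKind;

typedef struct {
    int from;      /* statement that is depended upon */
    int to;        /* statement that depends on it */
    DepKind kind;
} DepEdge;

/* Statements of sum(), indexed by vertex number. */
static const char *const SUM_VERTICES[] = {
    "entry: sum(array, count)",   /* 0 */
    "sum = 0",                    /* 1 */
    "i = 0",                      /* 2 */
    "i < count",                  /* 3 */
    "sum = add(sum, array[i])",   /* 4 */
    "i++",                        /* 5 */
    "return sum",                 /* 6 */
};

static const DepEdge SUM_EDGES[] = {
    /* control: top-level statements run whenever the procedure is entered */
    {0, 1, CONTROL_DEP}, {0, 2, CONTROL_DEP},
    {0, 3, CONTROL_DEP}, {0, 6, CONTROL_DEP},
    /* control: the loop body and increment run only if i < count holds */
    {3, 4, CONTROL_DEP}, {3, 5, CONTROL_DEP},
    /* data: definitions of sum reach its later uses */
    {1, 4, DATA_DEP}, {1, 6, DATA_DEP}, {4, 6, DATA_DEP},
    /* data: definitions of i reach the test, the body and the increment */
    {2, 3, DATA_DEP}, {2, 4, DATA_DEP}, {2, 5, DATA_DEP},
    {5, 3, DATA_DEP}, {5, 4, DATA_DEP},
};

#define NUM_SUM_EDGES (sizeof SUM_EDGES / sizeof SUM_EDGES[0])
#define NUM_SUM_VERTICES (sizeof SUM_VERTICES / sizeof SUM_VERTICES[0])

/* Human-readable label of a vertex. */
const char *vertex_label(size_t v)
{
    assert(v < NUM_SUM_VERTICES);
    return SUM_VERTICES[v];
}

/* Count the edges of one dependency kind. */
size_t count_edges(const DepEdge *edges, size_t n, DepKind kind)
{
    size_t count = 0;
    for (size_t e = 0; e < n; e++)
        if (edges[e].kind == kind)
            count++;
    return count;
}
```

Nothing in this edge list mentions variable names, statement order or the choice between `for` and `while`, which is the property the PDG-based approach relies on.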

Plagiarism Detectable?
[Figure]

Corresponding PDGs
[Figure: PDG for the Original Code; PDG for the Plagiarized Code]

PDG-based Plagiarism Detection
  • A program is represented as a set of PDGs
      - Let g be a PDG of procedure P in the original program
      - Let g’ be a PDG of procedure P’ in the plagiarism suspect
  • Subgraph isomorphism implies plagiarism
      - If g is subgraph isomorphic to g’, P’ is likely plagiarized from P
      - γ-isomorphism: graph g is γ-isomorphic to g’ if there exists a subgraph s of g such that s is subgraph isomorphic to g’, and |s| ≥ γ|g|
      - If g is γ-isomorphic to g’, the PDG pair (g, g’) is regarded as a plagiarized PDG pair and is returned to a human for examination
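
The γ-isomorphism test can be illustrated with a brute-force search on toy graphs. This is a sketch under stated assumptions, not the matcher GPLAG uses: |g| is taken to be the vertex count, s is an induced subgraph of g, every edge of s must also exist in g’ (g’ may have extra edges), and matched vertices must have the same kind. `Pdg`, `MAX_V`, `largest_embeddable` and `gamma_isomorphic` are hypothetical names, and the exponential search is only viable for very small graphs.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 16   /* hypothetical cap; brute force only suits tiny graphs */

/* Toy PDG: vertex kinds plus a directed adjacency matrix. */
typedef struct {
    int n;                     /* number of vertices */
    int kind[MAX_V];           /* vertex kind, e.g. 0 = assignment, 1 = call */
    bool edge[MAX_V][MAX_V];   /* edge[u][v]: v depends on u */
} Pdg;

/* Decide vertex u of g (skip it or map it into gp), then recurse.
 * Returns the largest number of g's vertices that can be embedded. */
static int extend(const Pdg *g, const Pdg *gp, int u, int mapped,
                  int map[], bool used[])
{
    if (u == g->n)
        return mapped;

    map[u] = -1;                               /* option 1: leave u out of s */
    int best = extend(g, gp, u + 1, mapped, map, used);

    for (int w = 0; w < gp->n; w++) {          /* option 2: map u to w */
        if (used[w] || g->kind[u] != gp->kind[w])
            continue;
        bool ok = true;
        for (int p = 0; p < u && ok; p++) {
            if (map[p] < 0)
                continue;
            /* every edge between matched vertices of g must exist in gp */
            if (g->edge[p][u] && !gp->edge[map[p]][w]) ok = false;
            if (g->edge[u][p] && !gp->edge[w][map[p]]) ok = false;
        }
        if (!ok)
            continue;
        map[u] = w;
        used[w] = true;
        int r = extend(g, gp, u + 1, mapped + 1, map, used);
        if (r > best)
            best = r;
        used[w] = false;
    }
    map[u] = -1;
    return best;
}

/* Size of the largest subgraph s of g that embeds into gp. */
int largest_embeddable(const Pdg *g, const Pdg *gp)
{
    int map[MAX_V];
    bool used[MAX_V] = { false };
    assert(g->n <= MAX_V && gp->n <= MAX_V);
    return extend(g, gp, 0, 0, map, used);
}

/* g is gamma-isomorphic to gp if some embeddable s has |s| >= gamma * |g|. */
bool gamma_isomorphic(const Pdg *g, const Pdg *gp, double gamma)
{
    return (double)largest_embeddable(g, gp) >= gamma * (double)g->n;
}
```

With γ below 1, a suspect that drops or rewrites a small fraction of the original procedure is still flagged, while inserted code only adds vertices and edges to g’ and so does not prevent the match.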

Advantages
  • Robust because it is hard to overhaul PDGs
      - Dependencies encode program logic
      - Reusing that logic is the incentive for plagiarism

Efficiency and Scalability
  • Search space
      - If the original program has n procedures and the plagiarism suspect has m procedures, n*m subgraph isomorphism tests are needed
  • Pruning the search space
      - Lossless filter
      - Statistical lossy filter

Lossless filter
  • Interestingness
      - PDGs smaller than an interesting size K are excluded from both sides
  • γ-isomorphism definition
      - A PDG pair (g, g’) is discarded if |g’| < γ|g|, because a subgraph s with |s| ≥ γ|g| cannot be embedded in a smaller g’
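
Both lossless rules reduce to size comparisons, so they can be sketched in a few lines. The function names and the choice to measure PDG size as a plain count are assumptions of this sketch; K and γ are the parameters named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the pair (g, g') must still be tested for gamma-isomorphism. */
bool survives_lossless_filter(size_t size_g, size_t size_gp,
                              size_t min_size_k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    /* Interestingness: drop PDGs smaller than K on either side. */
    if (size_g < min_size_k || size_gp < min_size_k)
        return false;
    /* g' is too small to contain any subgraph s with |s| >= gamma * |g|. */
    if ((double)size_gp < gamma * (double)size_g)
        return false;
    return true;
}

/* Count how many of the n*m candidate pairs remain after filtering. */
size_t count_surviving_pairs(const size_t *orig, size_t n,
                             const size_t *susp, size_t m,
                             size_t min_size_k, double gamma)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            if (survives_lossless_filter(orig[i], susp[j], min_size_k, gamma))
                kept++;
    return kept;
}
```

For example, with original PDG sizes {3, 20, 40}, suspect sizes {5, 19, 50}, K = 10 and γ = 0.9, only 3 of the 9 candidate pairs survive: (20, 19), (20, 50) and (40, 50). No truly γ-isomorphic pair of interesting size is lost, which is what makes the filter lossless.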

Lossy Filter
  • Observation
      - If procedure P’ is plagiarized from procedure P, its PDG g’ should look similar to g
      - So dissimilar PDG pairs are discarded
  • Requirement
      - This filter must be lightweight

Vertex Histogram
  • Represent PDG g by h(g) = (n_1, n_2, …, n_k), where n_i is the frequency of the i-th kind of vertex
  • Similarly, represent PDG g’ by h(g’) = (m_1, m_2, …, m_k)
  • Direct similarity measurement?
      - How should a proper similarity threshold be defined?
      - Is a threshold defined this way program-independent?
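
Computing h(g) is a single pass over the vertices. In this sketch the number of vertex kinds, `NUM_KINDS`, and the encoding of kinds as small integers are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_KINDS = 4 };   /* hypothetical number of vertex kinds, k */

/* Fill hist[i] with the number of vertices whose kind is i. */
void vertex_histogram(const int *kinds, size_t n, unsigned hist[NUM_KINDS])
{
    for (size_t i = 0; i < NUM_KINDS; i++)
        hist[i] = 0;
    for (size_t v = 0; v < n; v++) {
        assert(kinds[v] >= 0 && kinds[v] < NUM_KINDS);
        hist[kinds[v]]++;
    }
}
```

A PDG whose vertices have kinds {0, 0, 1, 3, 3, 3} yields h(g) = (2, 1, 0, 3). The histogram discards all edge information, which keeps it cheap but also means it can only be used to rule pairs out, not to confirm plagiarism.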

Hypothesis Testing-based Approach
  • Basic idea
      - Estimate a k-dimensional multinomial distribution from h(g)
      - Test whether h(g’) is likely an observation from that distribution
      - If it is, g’ looks similar to g, and an isomorphism test is needed
      - Otherwise, (g, g’) is discarded
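
The test statistic itself is not preserved in this transcript, so the following sketch swaps in a standard choice, Pearson's chi-square goodness-of-fit statistic, purely to illustrate the idea. The add-one smoothing of the estimated probabilities, the function names, and passing the critical value in as a parameter are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pearson chi-square statistic of h(g') against the multinomial
 * distribution estimated from h(g). Add-one smoothing keeps every
 * estimated probability positive (an assumption of this sketch). */
double chi_square_statistic(const unsigned *h_g, const unsigned *h_gp, size_t k)
{
    assert(k > 0);
    double n = 0.0, m = 0.0;
    for (size_t i = 0; i < k; i++) {
        n += h_g[i];
        m += h_gp[i];
    }
    double stat = 0.0;
    for (size_t i = 0; i < k; i++) {
        double p = (h_g[i] + 1.0) / (n + (double)k);  /* estimated p_i */
        double expected = m * p;                      /* expected count in g' */
        double diff = h_gp[i] - expected;
        if (expected > 0.0)
            stat += diff * diff / expected;
    }
    return stat;
}

/* Keep the pair for isomorphism testing only if h(g') is plausible
 * under the estimated distribution. */
bool looks_similar(const unsigned *h_g, const unsigned *h_gp, size_t k,
                   double critical_value)
{
    return chi_square_statistic(h_g, h_gp, k) <= critical_value;
}
```

With k = 3 the statistic has k - 1 = 2 degrees of freedom, and the 5% critical value of the chi-square distribution is 5.991. A significance level gives a cutoff that does not have to be tuned per program, which addresses the threshold question raised on the vertex histogram slide.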

Technical Details
[Figure]

Technical Details (cont’d)
[Figure]

Workflow of GPLAG
  • PDGs are generated with CodeSurfer
  • Isomorphism testing is implemented with VFLib

Experiment Design
  • Subject programs
  • Effectiveness
  • Filter efficiency
  • Core-part plagiarism detection

Effectiveness
  • The plagiarism took 2 hours by hand; could it be automated?
  • GPLAG detects all plagiarized PDG pairs within 1 second
  • PDG isomorphism also reveals which plagiarism disguises were applied

Efficiency
  • Subject programs
      - bc, less and tar
      - An exact copy serves as the plagiarized version
  • Lossless and lossy filters
      - Pruning of PDG pairs
      - Implications for overall time cost

Pruning Uninteresting PDG Pairs
[Figure: pruning with the lossless filter only; pruning with the lossless and lossy filters]

Implications for Overall Time Cost
  • A time-out is set for subgraph isomorphism tests; tests that hit it are "time hogs"
  • The lossless filter does not save much time
  • The lossy filter significantly reduces the time cost
  • The major time saving comes from avoiding time hogs

Detection of Core-part Plagiarism
  • Lower time cost with the lossy filter
  • Fewer false positives with the lossy filter

Conclusions
  • We developed a new algorithm, GPLAG, for software plagiarism detection
  • It is more effective against "professional" plagiarists
  • We developed a statistical lossy filter, which improves the efficiency of GPLAG
  • We experimentally verified the effectiveness and efficiency of GPLAG

Q & A
Thank You!

References
[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. De Mori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for "backtrace" of noncrashing bugs. In SDM, 2005.