AST Analysis
Overview
Section titled “Overview”similarity-go uses Go’s Abstract Syntax Tree (AST) analysis to detect code similarities at a semantic level. Unlike text-based comparison tools, AST analysis understands the structure and meaning of code, enabling detection of functionally similar code even when surface-level differences exist.
Abstract Syntax Trees
Section titled “Abstract Syntax Trees”An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Each node represents a construct occurring in the programming language.
Example Go Code to AST
Section titled “Example Go Code to AST”Consider this simple Go function:
func add(a, b int) int { return a + b}The AST representation includes:
- FuncDecl (Function Declaration)
- Ident “add” (Function name)
- FieldList (Parameters)
- Field “a” (Parameter 1)
- Field “b” (Parameter 2)
- Ident “int” (Parameter type)
- FieldList (Results)
- Field (Return type)
- Ident “int”
- Field (Return type)
- BlockStmt (Function body)
- ReturnStmt (Return statement)
- BinaryExpr (Binary expression)
- Ident “a” (Left operand)
- Token ”+” (Operator)
- Ident “b” (Right operand)
- BinaryExpr (Binary expression)
- ReturnStmt (Return statement)
Similarity Detection Algorithm
Section titled “Similarity Detection Algorithm”1. AST Parsing
Section titled “1. AST Parsing”For each Go file, similarity-go:
- Parses the source code using
go/parser - Extracts function declarations from the AST
- Normalizes the AST structure for comparison
2. Tree Comparison
Section titled “2. Tree Comparison”The algorithm compares AST trees using:
Structural Similarity
Section titled “Structural Similarity”- Node types: Comparing the types of AST nodes
- Tree depth: Analyzing the depth and breadth of structures
- Control flow: Identifying similar branching patterns
Semantic Similarity
Section titled “Semantic Similarity”- Operations: Identifying similar operations regardless of variable names
- Patterns: Recognizing common programming patterns
- Data flow: Understanding how data moves through functions
3. Normalization Techniques
Section titled “3. Normalization Techniques”To improve accuracy, similarity-go normalizes ASTs by:
Variable Name Abstraction
Section titled “Variable Name Abstraction”// Original functionsfunc processUser(name string) { fmt.Println(name) }func handleData(value string) { fmt.Println(value) }
// Normalized representationfunc FUNC(VAR1 string) { fmt.Println(VAR1) }Constant Value Abstraction
Section titled “Constant Value Abstraction”// Original functionsfunc checkAge() bool { return age > 18 }func validateScore() bool { return score > 85 }
// Normalized representationfunc FUNC() bool { return VAR1 > CONST1 }Comment and Whitespace Removal
Section titled “Comment and Whitespace Removal”// Original with commentsfunc calculate(x int) int { // Add ten to input result := x + 10 return result}
// Normalized (comments and extra whitespace removed)func FUNC(VAR1 int) int { VAR2 := VAR1 + 10 return VAR2}AST Node Analysis
Section titled “AST Node Analysis”Supported Node Types
Section titled “Supported Node Types”similarity-go analyzes these key AST node types:
| Node Type | Description | Weight |
|---|---|---|
FuncDecl | Function declarations | High |
IfStmt | If statements | High |
ForStmt | For loops | High |
SwitchStmt | Switch statements | High |
BinaryExpr | Binary expressions | Medium |
CallExpr | Function calls | Medium |
AssignStmt | Assignments | Medium |
ReturnStmt | Return statements | Medium |
Ident | Identifiers | Low |
BasicLit | Literals | Low |
Weighting System
Section titled “Weighting System”Different node types contribute different weights to the similarity score:
- Control flow structures (if, for, switch): 30%
- Function calls and operations: 25%
- Variable assignments: 20%
- Expression patterns: 15%
- Literal values: 10%
Algorithm Implementation
Section titled “Algorithm Implementation”Tree Traversal
Section titled “Tree Traversal”type ASTComparer struct { threshold float64 weights map[ast.Node]float64}
func (c *ASTComparer) Compare(tree1, tree2 ast.Node) float64 { // Recursive tree comparison similarity := c.compareNodes(tree1, tree2) return similarity}
func (c *ASTComparer) compareNodes(n1, n2 ast.Node) float64 { // Type comparison if reflect.TypeOf(n1) != reflect.TypeOf(n2) { return 0.0 }
// Content comparison based on node type switch node1 := n1.(type) { case *ast.FuncDecl: return c.compareFuncDecl(node1, n2.(*ast.FuncDecl)) case *ast.IfStmt: return c.compareIfStmt(node1, n2.(*ast.IfStmt)) // ... more cases }}Similarity Scoring
Section titled “Similarity Scoring”The final similarity score is calculated using:
Similarity = Σ(NodeWeight × NodeSimilarity) / TotalWeightWhere:
NodeWeight: Predefined weight for each node typeNodeSimilarity: Similarity score for individual nodes (0.0 - 1.0)TotalWeight: Sum of all node weights
Advanced Features
Section titled “Advanced Features”Fuzzy Matching
Section titled “Fuzzy Matching”For improved detection, similarity-go implements fuzzy matching:
- Edit distance: Measures structural differences
- Subsequence matching: Identifies common code patterns
- Approximate matching: Handles minor variations
Context Awareness
Section titled “Context Awareness”The algorithm considers:
- Function signatures: Parameter types and return values
- Variable scopes: Local vs. package-level variables
- Import dependencies: Used packages and libraries
Performance Optimizations
Section titled “Performance Optimizations”- Early termination: Stops comparison when similarity drops below threshold
- Caching: Stores results for previously compared AST subtrees
- Parallel processing: Compares multiple functions concurrently
Limitations and Considerations
Section titled “Limitations and Considerations”Current Limitations
Section titled “Current Limitations”- Cross-package analysis: Limited to single package context
- Generic types: Basic support for Go generics
- Reflection: Cannot analyze dynamically generated code
- Build constraints: Doesn’t process build-tag-specific code
Future Improvements
Section titled “Future Improvements”- Enhanced generic type support
- Cross-module analysis
- Integration with Go modules
- Support for cgo code
Tuning AST Analysis
Section titled “Tuning AST Analysis”Configuration Options
Section titled “Configuration Options”algorithms: ast_comparison: enabled: true weight: 0.7
# AST-specific options normalize_variables: true normalize_constants: true ignore_comments: true
# Node weights node_weights: control_flow: 0.3 function_calls: 0.25 assignments: 0.2 expressions: 0.15 literals: 0.1Best Practices
Section titled “Best Practices”- Higher thresholds for strict similarity requirements
- Lower thresholds for refactoring opportunity detection
- Enable normalization for better cross-style matching
- Adjust node weights based on code characteristics
Integration with Other Algorithms
Section titled “Integration with Other Algorithms”AST analysis works alongside other similarity detection methods:
- Token comparison: For lexical similarity
- Structure comparison: For high-level architectural patterns
- Semantic analysis: For understanding code intent
The combined approach provides comprehensive similarity detection across multiple dimensions of code similarity.