AST Analysis

Overview

similarity-go uses Go’s Abstract Syntax Tree (AST) analysis to detect code similarities at a semantic level. Unlike text-based comparison tools, AST analysis understands the structure and meaning of code, enabling detection of functionally similar code even when surface-level differences exist.

Abstract Syntax Trees

An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Each node represents a construct occurring in the programming language.

Example Go Code to AST

Consider this simple Go function:

func add(a, b int) int {
    return a + b
}

The AST representation includes:

FuncDecl (Function Declaration)
- Ident “add” (Function name)
- FieldList (Parameters)
  - Field “a” (Parameter 1)
  - Field “b” (Parameter 2)
  - Ident “int” (Parameter type)
- FieldList (Results)
  - Field (Return type)
    - Ident “int”
- BlockStmt (Function body)
  - ReturnStmt (Return statement)
    - BinaryExpr (Binary expression)
      - Ident “a” (Left operand)
      - Token ”+” (Operator)
      - Ident “b” (Right operand)

Similarity Detection Algorithm

1. AST Parsing

For each Go file, similarity-go:

Parses the source code using go/parser
Extracts function declarations from the AST
Normalizes the AST structure for comparison

2. Tree Comparison

The algorithm compares AST trees using:

Structural Similarity

Node types: Comparing the types of AST nodes
Tree depth: Analyzing the depth and breadth of structures
Control flow: Identifying similar branching patterns

Semantic Similarity

Operations: Identifying similar operations regardless of variable names
Patterns: Recognizing common programming patterns
Data flow: Understanding how data moves through functions

3. Normalization Techniques

To improve accuracy, similarity-go normalizes ASTs by:

Variable Name Abstraction

// Original functions
func processUser(name string) { fmt.Println(name) }
func handleData(value string) { fmt.Println(value) }

// Normalized representation
func FUNC(VAR1 string) { fmt.Println(VAR1) }

Constant Value Abstraction

// Original functions
func checkAge() bool { return age > 18 }
func validateScore() bool { return score > 85 }

// Normalized representation
func FUNC() bool { return VAR1 > CONST1 }

Comment and Whitespace Removal

// Original with comments
func calculate(x int) int {
    // Add ten to input
    result := x + 10
    return result
}

// Normalized (comments and extra whitespace removed)
func FUNC(VAR1 int) int {
    VAR2 := VAR1 + 10
    return VAR2
}

AST Node Analysis

Supported Node Types

similarity-go analyzes these key AST node types:

Node Type	Description	Weight
`FuncDecl`	Function declarations	High
`IfStmt`	If statements	High
`ForStmt`	For loops	High
`SwitchStmt`	Switch statements	High
`BinaryExpr`	Binary expressions	Medium
`CallExpr`	Function calls	Medium
`AssignStmt`	Assignments	Medium
`ReturnStmt`	Return statements	Medium
`Ident`	Identifiers	Low
`BasicLit`	Literals	Low

Weighting System

Different node types contribute different weights to the similarity score:

Control flow structures (if, for, switch): 30%
Function calls and operations: 25%
Variable assignments: 20%
Expression patterns: 15%
Literal values: 10%

Algorithm Implementation

Tree Traversal

type ASTComparer struct {
    threshold float64
    weights   map[ast.Node]float64
}

func (c *ASTComparer) Compare(tree1, tree2 ast.Node) float64 {
    // Recursive tree comparison
    similarity := c.compareNodes(tree1, tree2)
    return similarity
}

func (c *ASTComparer) compareNodes(n1, n2 ast.Node) float64 {
    // Type comparison
    if reflect.TypeOf(n1) != reflect.TypeOf(n2) {
        return 0.0
    }

    // Content comparison based on node type
    switch node1 := n1.(type) {
    case *ast.FuncDecl:
        return c.compareFuncDecl(node1, n2.(*ast.FuncDecl))
    case *ast.IfStmt:
        return c.compareIfStmt(node1, n2.(*ast.IfStmt))
    // ... more cases
    }
}

Similarity Scoring

The final similarity score is calculated using:

Similarity = Σ(NodeWeight × NodeSimilarity) / TotalWeight

Where:

NodeWeight: Predefined weight for each node type
NodeSimilarity: Similarity score for individual nodes (0.0 - 1.0)
TotalWeight: Sum of all node weights

Advanced Features

Fuzzy Matching

For improved detection, similarity-go implements fuzzy matching:

Edit distance: Measures structural differences
Subsequence matching: Identifies common code patterns
Approximate matching: Handles minor variations

Context Awareness

The algorithm considers:

Function signatures: Parameter types and return values
Variable scopes: Local vs. package-level variables
Import dependencies: Used packages and libraries

Performance Optimizations

Early termination: Stops comparison when similarity drops below threshold
Caching: Stores results for previously compared AST subtrees
Parallel processing: Compares multiple functions concurrently

Limitations and Considerations

Current Limitations

Cross-package analysis: Limited to single package context
Generic types: Basic support for Go generics
Reflection: Cannot analyze dynamically generated code
Build constraints: Doesn’t process build-tag-specific code

Future Improvements

Enhanced generic type support
Cross-module analysis
Integration with Go modules
Support for cgo code

Tuning AST Analysis

Configuration Options

algorithms:
  ast_comparison:
    enabled: true
    weight: 0.7

    # AST-specific options
    normalize_variables: true
    normalize_constants: true
    ignore_comments: true

    # Node weights
    node_weights:
      control_flow: 0.3
      function_calls: 0.25
      assignments: 0.2
      expressions: 0.15
      literals: 0.1

Best Practices

Higher thresholds for strict similarity requirements
Lower thresholds for refactoring opportunity detection
Enable normalization for better cross-style matching
Adjust node weights based on code characteristics

Integration with Other Algorithms

AST analysis works alongside other similarity detection methods:

Token comparison: For lexical similarity
Structure comparison: For high-level architectural patterns
Semantic analysis: For understanding code intent

The combined approach provides comprehensive similarity detection across multiple dimensions of code similarity.