Skip to content

AST Analysis

similarity-go uses Go’s Abstract Syntax Tree (AST) analysis to detect code similarities at a semantic level. Unlike text-based comparison tools, AST analysis understands the structure and meaning of code, enabling detection of functionally similar code even when surface-level differences exist.

An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Each node represents a construct occurring in the programming language.

Consider this simple Go function:

func add(a, b int) int {
return a + b
}

The AST representation includes:

  • FuncDecl (Function Declaration)
    • Ident “add” (Function name)
    • FieldList (Parameters)
      • Field “a” (Parameter 1)
      • Field “b” (Parameter 2)
      • Ident “int” (Parameter type)
    • FieldList (Results)
      • Field (Return type)
        • Ident “int”
    • BlockStmt (Function body)
      • ReturnStmt (Return statement)
        • BinaryExpr (Binary expression)
          • Ident “a” (Left operand)
          • Token ”+” (Operator)
          • Ident “b” (Right operand)

For each Go file, similarity-go:

  1. Parses the source code using go/parser
  2. Extracts function declarations from the AST
  3. Normalizes the AST structure for comparison

The algorithm compares AST trees using:

  • Node types: Comparing the types of AST nodes
  • Tree depth: Analyzing the depth and breadth of structures
  • Control flow: Identifying similar branching patterns
  • Operations: Identifying similar operations regardless of variable names
  • Patterns: Recognizing common programming patterns
  • Data flow: Understanding how data moves through functions

To improve accuracy, similarity-go normalizes ASTs by:

// Original functions
func processUser(name string) { fmt.Println(name) }
func handleData(value string) { fmt.Println(value) }
// Normalized representation
func FUNC(VAR1 string) { fmt.Println(VAR1) }
// Original functions
func checkAge() bool { return age > 18 }
func validateScore() bool { return score > 85 }
// Normalized representation
func FUNC() bool { return VAR1 > CONST1 }
// Original with comments
func calculate(x int) int {
// Add ten to input
result := x + 10
return result
}
// Normalized (comments and extra whitespace removed)
func FUNC(VAR1 int) int {
VAR2 := VAR1 + 10
return VAR2
}

similarity-go analyzes these key AST node types:

Node TypeDescriptionWeight
FuncDeclFunction declarationsHigh
IfStmtIf statementsHigh
ForStmtFor loopsHigh
SwitchStmtSwitch statementsHigh
BinaryExprBinary expressionsMedium
CallExprFunction callsMedium
AssignStmtAssignmentsMedium
ReturnStmtReturn statementsMedium
IdentIdentifiersLow
BasicLitLiteralsLow

Different node types contribute different weights to the similarity score:

  • Control flow structures (if, for, switch): 30%
  • Function calls and operations: 25%
  • Variable assignments: 20%
  • Expression patterns: 15%
  • Literal values: 10%
type ASTComparer struct {
threshold float64
weights map[ast.Node]float64
}
func (c *ASTComparer) Compare(tree1, tree2 ast.Node) float64 {
// Recursive tree comparison
similarity := c.compareNodes(tree1, tree2)
return similarity
}
func (c *ASTComparer) compareNodes(n1, n2 ast.Node) float64 {
// Type comparison
if reflect.TypeOf(n1) != reflect.TypeOf(n2) {
return 0.0
}
// Content comparison based on node type
switch node1 := n1.(type) {
case *ast.FuncDecl:
return c.compareFuncDecl(node1, n2.(*ast.FuncDecl))
case *ast.IfStmt:
return c.compareIfStmt(node1, n2.(*ast.IfStmt))
// ... more cases
}
}

The final similarity score is calculated using:

Similarity = Σ(NodeWeight × NodeSimilarity) / TotalWeight

Where:

  • NodeWeight: Predefined weight for each node type
  • NodeSimilarity: Similarity score for individual nodes (0.0 - 1.0)
  • TotalWeight: Sum of all node weights

For improved detection, similarity-go implements fuzzy matching:

  • Edit distance: Measures structural differences
  • Subsequence matching: Identifies common code patterns
  • Approximate matching: Handles minor variations

The algorithm considers:

  • Function signatures: Parameter types and return values
  • Variable scopes: Local vs. package-level variables
  • Import dependencies: Used packages and libraries
  • Early termination: Stops comparison when similarity drops below threshold
  • Caching: Stores results for previously compared AST subtrees
  • Parallel processing: Compares multiple functions concurrently
  1. Cross-package analysis: Limited to single package context
  2. Generic types: Basic support for Go generics
  3. Reflection: Cannot analyze dynamically generated code
  4. Build constraints: Doesn’t process build-tag-specific code
  • Enhanced generic type support
  • Cross-module analysis
  • Integration with Go modules
  • Support for cgo code
algorithms:
ast_comparison:
enabled: true
weight: 0.7
# AST-specific options
normalize_variables: true
normalize_constants: true
ignore_comments: true
# Node weights
node_weights:
control_flow: 0.3
function_calls: 0.25
assignments: 0.2
expressions: 0.15
literals: 0.1
  1. Higher thresholds for strict similarity requirements
  2. Lower thresholds for refactoring opportunity detection
  3. Enable normalization for better cross-style matching
  4. Adjust node weights based on code characteristics

AST analysis works alongside other similarity detection methods:

  • Token comparison: For lexical similarity
  • Structure comparison: For high-level architectural patterns
  • Semantic analysis: For understanding code intent

The combined approach provides comprehensive similarity detection across multiple dimensions of code similarity.