Similarity Detection

similarity-go employs a multi-layered approach to detect code similarities, combining several algorithms to provide comprehensive and accurate results. This methodology ensures that similar code is identified regardless of superficial differences in formatting, variable names, or minor structural variations.

  1. Syntactic Layer - AST structure analysis
  2. Lexical Layer - Token sequence comparison
  3. Semantic Layer - Functional behavior analysis

Each layer contributes to the final similarity score, with configurable weights to adapt to different use cases.

The syntactic layer analyzes the abstract syntax tree (AST) structure of Go functions.

  • Structure comparison: Compares tree topology and node relationships
  • Pattern recognition: Identifies common programming patterns
  • Control flow analysis: Examines branching and looping structures
type SyntacticAnalyzer struct {
    nodeWeights map[ast.Node]float64
    normalizer  *ASTNormalizer
}

func (s *SyntacticAnalyzer) Compare(func1, func2 *ast.FuncDecl) float64 {
    // Normalize ASTs for comparison
    norm1 := s.normalizer.Normalize(func1)
    norm2 := s.normalizer.Normalize(func2)

    // Calculate structural similarity
    return s.calculateStructuralSimilarity(norm1, norm2)
}
| Element | Weight | Rationale |
| --- | --- | --- |
| Control structures | 35% | Define program logic |
| Function calls | 25% | Indicate operations |
| Variable operations | 20% | Show data manipulation |
| Expressions | 15% | Computational patterns |
| Literals | 5% | Minor implementation details |
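
As a rough illustration of how such weights might be applied per node, the sketch below maps node categories to the values in the table. The function name and the exact category boundaries are assumptions for illustration, not similarity-go's actual implementation; it assumes the standard go/ast package is imported.

// nodeCategoryWeight is a hypothetical helper: it returns the weight of the
// category an AST node belongs to, following the table above.
func nodeCategoryWeight(n ast.Node) float64 {
    switch n.(type) {
    case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt, *ast.SwitchStmt, *ast.TypeSwitchStmt:
        return 0.35 // control structures
    case *ast.CallExpr:
        return 0.25 // function calls
    case *ast.AssignStmt, *ast.IncDecStmt:
        return 0.20 // variable operations
    case *ast.BinaryExpr, *ast.UnaryExpr:
        return 0.15 // expressions
    case *ast.BasicLit, *ast.CompositeLit:
        return 0.05 // literals
    default:
        return 0
    }
}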

The lexical layer compares sequences of tokens extracted from the source code.

type TokenAnalyzer struct {
    ignoreComments   bool
    ignoreWhitespace bool
    normalizeNames   bool
}

func (t *TokenAnalyzer) ExtractTokens(src string) []Token {
    fset := token.NewFileSet()
    file := fset.AddFile("", fset.Base(), len(src))

    // Use a dedicated variable so the scanner package is not shadowed.
    var s scanner.Scanner
    s.Init(file, []byte(src), nil, 0)

    tokens := []Token{}
    for {
        pos, tok, lit := s.Scan()
        if tok == token.EOF {
            break
        }
        tokens = append(tokens, Token{pos, tok, lit})
    }
    return t.normalizeTokens(tokens)
}

Longest common subsequence (LCS): finds the longest sequence of tokens that appears in both functions in the same order.

Similarity_LCS = 2 × LCS(tokens1, tokens2) / (len(tokens1) + len(tokens2))
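
A minimal sketch of the LCS score, assuming each token has already been reduced to a comparable string key (for example, its kind plus normalized literal); the helper name is illustrative, not part of similarity-go's API.

// lcsSimilarity implements the formula above with a classic dynamic-
// programming LCS table over two normalized token sequences.
func lcsSimilarity(tokens1, tokens2 []string) float64 {
    if len(tokens1)+len(tokens2) == 0 {
        return 1.0
    }
    dp := make([][]int, len(tokens1)+1)
    for i := range dp {
        dp[i] = make([]int, len(tokens2)+1)
    }
    for i := 1; i <= len(tokens1); i++ {
        for j := 1; j <= len(tokens2); j++ {
            switch {
            case tokens1[i-1] == tokens2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            case dp[i-1][j] >= dp[i][j-1]:
                dp[i][j] = dp[i-1][j]
            default:
                dp[i][j] = dp[i][j-1]
            }
        }
    }
    lcs := dp[len(tokens1)][len(tokens2)]
    return 2.0 * float64(lcs) / float64(len(tokens1)+len(tokens2))
}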

Jaccard similarity: measures the overlap between the two functions' token sets.

Jaccard = |tokens1 ∩ tokens2| / |tokens1 ∪ tokens2|
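
A corresponding sketch for the Jaccard score, under the same assumption that tokens are normalized string keys; duplicates are collapsed because the formula operates on sets.

// jaccardSimilarity returns |intersection| / |union| of the two token sets.
func jaccardSimilarity(tokens1, tokens2 []string) float64 {
    set1 := make(map[string]struct{}, len(tokens1))
    for _, t := range tokens1 {
        set1[t] = struct{}{}
    }
    set2 := make(map[string]struct{}, len(tokens2))
    for _, t := range tokens2 {
        set2[t] = struct{}{}
    }
    intersection := 0
    for t := range set1 {
        if _, ok := set2[t]; ok {
            intersection++
        }
    }
    union := len(set1) + len(set2) - intersection
    if union == 0 {
        return 1.0
    }
    return float64(intersection) / float64(union)
}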

Edit distance: calculates the minimum number of operations needed to transform one token sequence into the other.

Similarity_Edit = 1 - (EditDistance / max(len(tokens1), len(tokens2)))
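
And a sketch of the edit-distance score: a Levenshtein distance over the normalized token keys, converted to a similarity in [0, 1] as in the formula above.

// editSimilarity computes 1 - EditDistance / max(len(tokens1), len(tokens2))
// using a two-row Levenshtein table.
func editSimilarity(tokens1, tokens2 []string) float64 {
    if len(tokens1) == 0 && len(tokens2) == 0 {
        return 1.0
    }
    prev := make([]int, len(tokens2)+1)
    curr := make([]int, len(tokens2)+1)
    for j := range prev {
        prev[j] = j
    }
    for i := 1; i <= len(tokens1); i++ {
        curr[0] = i
        for j := 1; j <= len(tokens2); j++ {
            cost := 1
            if tokens1[i-1] == tokens2[j-1] {
                cost = 0
            }
            d := prev[j-1] + cost // substitution or match
            if del := prev[j] + 1; del < d {
                d = del // deletion
            }
            if ins := curr[j-1] + 1; ins < d {
                d = ins // insertion
            }
            curr[j] = d
        }
        prev, curr = curr, prev
    }
    dist := prev[len(tokens2)]
    maxLen := len(tokens1)
    if len(tokens2) > maxLen {
        maxLen = len(tokens2)
    }
    return 1.0 - float64(dist)/float64(maxLen)
}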

The semantic layer attempts to understand the functional behavior of code.

Data flow analysis tracks how data moves through functions:

type DataFlowAnalyzer struct {
    cfg *ControlFlowGraph
}

func (d *DataFlowAnalyzer) AnalyzeDataFlow(fn *ast.FuncDecl) *DataFlowGraph {
    // Build data flow graph
    dfg := &DataFlowGraph{}

    // Track variable definitions and uses
    ast.Inspect(fn, func(n ast.Node) bool {
        switch node := n.(type) {
        case *ast.AssignStmt:
            d.processAssignment(dfg, node)
        case *ast.CallExpr:
            d.processFunctionCall(dfg, node)
        }
        return true
    })

    return dfg
}

Control flow analysis examines the execution paths through functions; a minimal sketch of locating branch points follows the list below:

  • Basic blocks: Sequences of instructions without branches
  • Branch points: Conditional statements and loops
  • Dominance relationships: Control dependencies between blocks
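
The sketch below shows the first step of such an analysis: locating branch points with go/ast. A full implementation would go on to split statements into basic blocks and compute dominance; the function name here is illustrative, not part of similarity-go.

// collectBranchPoints gathers the nodes where control flow can diverge:
// conditionals, loops, and switch/select statements.
func collectBranchPoints(fn *ast.FuncDecl) []ast.Node {
    var branches []ast.Node
    ast.Inspect(fn, func(n ast.Node) bool {
        switch n.(type) {
        case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt,
            *ast.SwitchStmt, *ast.TypeSwitchStmt, *ast.SelectStmt:
            branches = append(branches, n)
        }
        return true
    })
    return branches
}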

Functional equivalence detection identifies functions that produce the same output for the same input:

func detectFunctionalEquivalence(f1, f2 *Function) float64 {
    // Compare input/output signatures
    sigSimilarity := compareSignatures(f1.Signature, f2.Signature)

    // Compare side effects
    effectSimilarity := compareSideEffects(f1.Effects, f2.Effects)

    // Compare computational complexity
    complexitySimilarity := compareComplexity(f1.Complexity, f2.Complexity)

    return weightedAverage(sigSimilarity, effectSimilarity, complexitySimilarity)
}

The final similarity score combines all layers:

FinalScore = w1×SyntacticScore + w2×LexicalScore + w3×SemanticScore

Where weights (w1, w2, w3) are configurable and sum to 1.0.

| Layer | Default Weight | Use Case |
| --- | --- | --- |
| Syntactic | 0.5 | General code structure |
| Lexical | 0.3 | Token-level patterns |
| Semantic | 0.2 | Functional behavior |
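
A minimal sketch of this combination with the default weights; the LayerScores type is hypothetical and stands in for whatever per-layer results the detector produces.

// LayerScores holds one score per analysis layer, each in [0, 1].
type LayerScores struct {
    Syntactic, Lexical, Semantic float64
}

// combineScores applies FinalScore = w1×Syntactic + w2×Lexical + w3×Semantic.
// The weights are expected to sum to 1.0 so the result stays in [0, 1].
func combineScores(s LayerScores, w1, w2, w3 float64) float64 {
    return w1*s.Syntactic + w2*s.Lexical + w3*s.Semantic
}

// With the defaults: combineScores(scores, 0.5, 0.3, 0.2)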

Weights can be adjusted based on:

  • Code characteristics: Algorithm-heavy vs. CRUD operations
  • Project phase: Development vs. maintenance
  • Detection goals: Refactoring vs. plagiarism detection

similarity-go can adjust thresholds based on context:

func (d *Detector) calculateAdaptiveThreshold(context *AnalysisContext) float64 {
    baseThreshold := d.config.Threshold

    // Adjust based on function size
    sizeAdjustment := d.calculateSizeAdjustment(context.FunctionSize)

    // Adjust based on complexity
    complexityAdjustment := d.calculateComplexityAdjustment(context.Complexity)

    // Adjust based on domain
    domainAdjustment := d.calculateDomainAdjustment(context.Domain)

    return baseThreshold + sizeAdjustment + complexityAdjustment + domainAdjustment
}
| Use Case | Recommended Threshold | Rationale |
| --- | --- | --- |
| Duplicate detection | 0.85-0.95 | High precision needed |
| Refactoring opportunities | 0.70-0.85 | Balance precision/recall |
| Code review assistance | 0.60-0.75 | Broader coverage |
| Learning from examples | 0.50-0.70 | Show variations |

Comparisons can terminate early once the maximum achievable score can no longer reach the configured threshold:

func (c *Comparator) compareWithEarlyTermination(f1, f2 *Function) float64 {
    // Quick signature check: clearly dissimilar signatures exit immediately.
    if sigSimilarity := compareSignatures(f1.Signature, f2.Signature); sigSimilarity < 0.3 {
        return 0.0 // Early exit
    }

    // Incremental scoring with threshold checks.
    score := 0.0
    weights := []float64{0.5, 0.3, 0.2}
    remaining := 1.0 // the weights sum to 1.0

    for i, analyzer := range c.analyzers {
        partialScore := analyzer.Compare(f1, f2)
        score += weights[i] * partialScore
        remaining -= weights[i]

        // Terminate early if even perfect scores from the remaining
        // layers could not reach the threshold.
        if score+remaining < c.threshold {
            return 0.0
        }
    }
    return score
}

Candidate pairs are screened at several levels of increasing precision:

  1. Quick filters: Fast, rough similarity estimates
  2. Medium precision: Moderate computational cost
  3. High precision: Expensive, accurate analysis

Only candidates passing lower levels proceed to higher precision analysis.
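
A minimal sketch of this staged screening, assuming each level wraps a comparison function and a pass threshold; the stage type and Function usage follow the earlier examples but are not similarity-go's exact API.

// stage pairs a similarity estimate of a given precision with the minimum
// score a candidate must reach to continue to the next, more expensive level.
type stage struct {
    compare   func(f1, f2 *Function) float64
    threshold float64
}

// screenCandidate runs the stages in order and rejects the pair as soon as
// one level falls below its threshold.
func screenCandidate(stages []stage, f1, f2 *Function) (float64, bool) {
    score := 0.0
    for _, st := range stages {
        score = st.compare(f1, f2)
        if score < st.threshold {
            return 0, false
        }
    }
    return score, true
}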

Several caching strategies avoid repeated work (a result-cache sketch follows this list):

  • Result caching: Store comparison results
  • Intermediate caching: Cache normalized ASTs and token sequences
  • Signature caching: Quick lookups for function signatures
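
A hedged sketch of the result cache, keyed by an order-independent pair of function identifiers; the Function.ID field and the cache type are assumptions for illustration, and sync is assumed to be imported.

// resultCache stores pairwise similarity scores so repeated comparisons of
// the same two functions are free.
type resultCache struct {
    mu      sync.RWMutex
    results map[[2]string]float64
}

func newResultCache() *resultCache {
    return &resultCache{results: make(map[[2]string]float64)}
}

// key normalizes the order so (f1, f2) and (f2, f1) share one entry.
func (c *resultCache) key(f1, f2 *Function) [2]string {
    a, b := f1.ID, f2.ID
    if b < a {
        a, b = b, a
    }
    return [2]string{a, b}
}

func (c *resultCache) get(f1, f2 *Function) (float64, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    score, ok := c.results[c.key(f1, f2)]
    return score, ok
}

func (c *resultCache) put(f1, f2 *Function, score float64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.results[c.key(f1, f2)] = score
}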

similarity-go uses standard metrics for evaluation (a short computation sketch follows the list):

  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
  • AUC: Area under the ROC curve
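
A small sketch of how the first three metrics are computed from labeled comparison results; the counts come from whatever evaluation harness produced the ground truth.

// evaluationMetrics derives precision, recall, and F1 from true-positive,
// false-positive, and false-negative counts.
func evaluationMetrics(tp, fp, fn int) (precision, recall, f1 float64) {
    if tp+fp > 0 {
        precision = float64(tp) / float64(tp+fp)
    }
    if tp+fn > 0 {
        recall = float64(tp) / float64(tp+fn)
    }
    if precision+recall > 0 {
        f1 = 2 * precision * recall / (precision + recall)
    }
    return precision, recall, f1
}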

Based on evaluation against manually labeled datasets:

| Dataset | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Open source Go projects | 0.92 | 0.88 | 0.90 |
| Enterprise codebases | 0.89 | 0.91 | 0.90 |
| Tutorial/educational code | 0.95 | 0.93 | 0.94 |

Example configurations for two of the use cases above. A stricter profile (e.g., for duplicate detection):

algorithms:
  syntactic:
    weight: 0.6
    normalize_variables: true
    normalize_constants: true
  lexical:
    weight: 0.3
    use_lcs: true
    use_jaccard: true
  semantic:
    weight: 0.1
    analyze_dataflow: true

detection:
  threshold: 0.90
  min_function_lines: 10

A looser profile (e.g., for finding refactoring opportunities):

algorithms:
  syntactic:
    weight: 0.4
    normalize_variables: true
    normalize_constants: false
  lexical:
    weight: 0.4
    use_edit_distance: true
  semantic:
    weight: 0.2
    analyze_control_flow: true

detection:
  threshold: 0.75
  min_function_lines: 5

Possible future directions include:

  • Neural embeddings: Code representations in vector space
  • Deep learning: Pattern recognition in large codebases
  • Transfer learning: Adapt models to specific domains
  • Type inference: Better understanding of Go’s type system
  • Interface analysis: Similarity based on implemented interfaces
  • Dependency analysis: Consider external library usage patterns

While currently Go-focused, the architecture supports extension to other languages with similar AST-based analysis capabilities.