Similarity Detection

similarity-go employs a multi-layered approach to detect code similarities, combining several algorithms to provide comprehensive and accurate results. This methodology ensures that similar code is identified regardless of superficial differences in formatting, variable names, or minor structural variations.

  1. Syntactic Layer - AST structure analysis
  2. Lexical Layer - Token sequence comparison
  3. Semantic Layer - Functional behavior analysis

Each layer contributes to the final similarity score, with configurable weights to adapt to different use cases.

The syntactic layer analyzes the abstract syntax tree (AST) structure of Go functions.

  • Structure comparison: Compares tree topology and node relationships
  • Pattern recognition: Identifies common programming patterns
  • Control flow analysis: Examines branching and looping structures
type SyntacticAnalyzer struct {
    nodeWeights map[ast.Node]float64
    normalizer  *ASTNormalizer
}

func (s *SyntacticAnalyzer) Compare(func1, func2 *ast.FuncDecl) float64 {
    // Normalize ASTs for comparison
    norm1 := s.normalizer.Normalize(func1)
    norm2 := s.normalizer.Normalize(func2)

    // Calculate structural similarity
    return s.calculateStructuralSimilarity(norm1, norm2)
}
| Element | Weight | Rationale |
| --- | --- | --- |
| Control structures | 35% | Define program logic |
| Function calls | 25% | Indicate operations |
| Variable operations | 20% | Show data manipulation |
| Expressions | 15% | Computational patterns |
| Literals | 5% | Minor implementation details |
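
As a rough illustration of how such weights might be applied per node, the sketch below maps node categories to the values in the table. The function name and the exact category boundaries are assumptions for illustration, not similarity-go's actual implementation; it assumes the standard go/ast package is imported.

// nodeCategoryWeight is a hypothetical helper: it returns the weight of the
// category an AST node belongs to, following the table above.
func nodeCategoryWeight(n ast.Node) float64 {
    switch n.(type) {
    case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt, *ast.SwitchStmt, *ast.TypeSwitchStmt:
        return 0.35 // control structures
    case *ast.CallExpr:
        return 0.25 // function calls
    case *ast.AssignStmt, *ast.IncDecStmt:
        return 0.20 // variable operations
    case *ast.BinaryExpr, *ast.UnaryExpr:
        return 0.15 // expressions
    case *ast.BasicLit, *ast.CompositeLit:
        return 0.05 // literals
    default:
        return 0
    }
}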

The lexical layer compares sequences of tokens extracted from the source code.

type TokenAnalyzer struct {
    ignoreComments   bool
    ignoreWhitespace bool
    normalizeNames   bool
}

func (t *TokenAnalyzer) ExtractTokens(src string) []Token {
    fset := token.NewFileSet()
    file := fset.AddFile("", fset.Base(), len(src))

    // Use a dedicated variable so the scanner package is not shadowed.
    var s scanner.Scanner
    s.Init(file, []byte(src), nil, 0)

    tokens := []Token{}
    for {
        pos, tok, lit := s.Scan()
        if tok == token.EOF {
            break
        }
        tokens = append(tokens, Token{pos, tok, lit})
    }
    return t.normalizeTokens(tokens)
}

Longest common subsequence (LCS): finds the longest sequence of tokens that appears in both functions in the same order.

Similarity_LCS = 2 × LCS(tokens1, tokens2) / (len(tokens1) + len(tokens2))
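
A minimal sketch of the LCS score, assuming each token has already been reduced to a comparable string key (for example, its kind plus normalized literal); the helper name is illustrative, not part of similarity-go's API.

// lcsSimilarity implements the formula above with a classic dynamic-
// programming LCS table over two normalized token sequences.
func lcsSimilarity(tokens1, tokens2 []string) float64 {
    if len(tokens1)+len(tokens2) == 0 {
        return 1.0
    }
    dp := make([][]int, len(tokens1)+1)
    for i := range dp {
        dp[i] = make([]int, len(tokens2)+1)
    }
    for i := 1; i <= len(tokens1); i++ {
        for j := 1; j <= len(tokens2); j++ {
            switch {
            case tokens1[i-1] == tokens2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            case dp[i-1][j] >= dp[i][j-1]:
                dp[i][j] = dp[i-1][j]
            default:
                dp[i][j] = dp[i][j-1]
            }
        }
    }
    lcs := dp[len(tokens1)][len(tokens2)]
    return 2.0 * float64(lcs) / float64(len(tokens1)+len(tokens2))
}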

Jaccard similarity: measures the overlap between the two functions' token sets.

Jaccard = |tokens1 ∩ tokens2| / |tokens1 ∪ tokens2|
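
A corresponding sketch for the Jaccard score, under the same assumption that tokens are normalized string keys; duplicates are collapsed because the formula operates on sets.

// jaccardSimilarity returns |intersection| / |union| of the two token sets.
func jaccardSimilarity(tokens1, tokens2 []string) float64 {
    set1 := make(map[string]struct{}, len(tokens1))
    for _, t := range tokens1 {
        set1[t] = struct{}{}
    }
    set2 := make(map[string]struct{}, len(tokens2))
    for _, t := range tokens2 {
        set2[t] = struct{}{}
    }
    intersection := 0
    for t := range set1 {
        if _, ok := set2[t]; ok {
            intersection++
        }
    }
    union := len(set1) + len(set2) - intersection
    if union == 0 {
        return 1.0
    }
    return float64(intersection) / float64(union)
}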

Edit distance: calculates the minimum number of operations needed to transform one token sequence into the other.

Similarity_Edit = 1 - (EditDistance / max(len(tokens1), len(tokens2)))
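
And a sketch of the edit-distance score: a Levenshtein distance over the normalized token keys, converted to a similarity in [0, 1] as in the formula above.

// editSimilarity computes 1 - EditDistance / max(len(tokens1), len(tokens2))
// using a two-row Levenshtein table.
func editSimilarity(tokens1, tokens2 []string) float64 {
    if len(tokens1) == 0 && len(tokens2) == 0 {
        return 1.0
    }
    prev := make([]int, len(tokens2)+1)
    curr := make([]int, len(tokens2)+1)
    for j := range prev {
        prev[j] = j
    }
    for i := 1; i <= len(tokens1); i++ {
        curr[0] = i
        for j := 1; j <= len(tokens2); j++ {
            cost := 1
            if tokens1[i-1] == tokens2[j-1] {
                cost = 0
            }
            d := prev[j-1] + cost // substitution or match
            if del := prev[j] + 1; del < d {
                d = del // deletion
            }
            if ins := curr[j-1] + 1; ins < d {
                d = ins // insertion
            }
            curr[j] = d
        }
        prev, curr = curr, prev
    }
    dist := prev[len(tokens2)]
    maxLen := len(tokens1)
    if len(tokens2) > maxLen {
        maxLen = len(tokens2)
    }
    return 1.0 - float64(dist)/float64(maxLen)
}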

The semantic layer attempts to understand the functional behavior of code.

Data flow analysis tracks how data moves through functions:

type DataFlowAnalyzer struct {
    cfg *ControlFlowGraph
}

func (d *DataFlowAnalyzer) AnalyzeDataFlow(fn *ast.FuncDecl) *DataFlowGraph {
    // Build data flow graph
    dfg := &DataFlowGraph{}

    // Track variable definitions and uses
    ast.Inspect(fn, func(n ast.Node) bool {
        switch node := n.(type) {
        case *ast.AssignStmt:
            d.processAssignment(dfg, node)
        case *ast.CallExpr:
            d.processFunctionCall(dfg, node)
        }
        return true
    })

    return dfg
}

Control flow analysis examines the execution paths through functions; a minimal sketch of locating branch points follows the list below:

  • Basic blocks: Sequences of instructions without branches
  • Branch points: Conditional statements and loops
  • Dominance relationships: Control dependencies between blocks
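
The sketch below shows the first step of such an analysis: locating branch points with go/ast. A full implementation would go on to split statements into basic blocks and compute dominance; the function name here is illustrative, not part of similarity-go.

// collectBranchPoints gathers the nodes where control flow can diverge:
// conditionals, loops, and switch/select statements.
func collectBranchPoints(fn *ast.FuncDecl) []ast.Node {
    var branches []ast.Node
    ast.Inspect(fn, func(n ast.Node) bool {
        switch n.(type) {
        case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt,
            *ast.SwitchStmt, *ast.TypeSwitchStmt, *ast.SelectStmt:
            branches = append(branches, n)
        }
        return true
    })
    return branches
}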

Functional equivalence detection identifies functions that produce the same output for the same input:

func detectFunctionalEquivalence(f1, f2 *Function) float64 {
    // Compare input/output signatures
    sigSimilarity := compareSignatures(f1.Signature, f2.Signature)

    // Compare side effects
    effectSimilarity := compareSideEffects(f1.Effects, f2.Effects)

    // Compare computational complexity
    complexitySimilarity := compareComplexity(f1.Complexity, f2.Complexity)

    return weightedAverage(sigSimilarity, effectSimilarity, complexitySimilarity)
}

The final similarity score combines all layers:

FinalScore = w1×SyntacticScore + w2×LexicalScore + w3×SemanticScore

Where weights (w1, w2, w3) are configurable and sum to 1.0.

| Layer | Default Weight | Use Case |
| --- | --- | --- |
| Syntactic | 0.5 | General code structure |
| Lexical | 0.3 | Token-level patterns |
| Semantic | 0.2 | Functional behavior |
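
A minimal sketch of this combination with the default weights; the LayerScores type is hypothetical and stands in for whatever per-layer results the detector produces.

// LayerScores holds one score per analysis layer, each in [0, 1].
type LayerScores struct {
    Syntactic, Lexical, Semantic float64
}

// combineScores applies FinalScore = w1×Syntactic + w2×Lexical + w3×Semantic.
// The weights are expected to sum to 1.0 so the result stays in [0, 1].
func combineScores(s LayerScores, w1, w2, w3 float64) float64 {
    return w1*s.Syntactic + w2*s.Lexical + w3*s.Semantic
}

// With the defaults: combineScores(scores, 0.5, 0.3, 0.2)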

Weights can be adjusted based on:

  • Code characteristics: Algorithm-heavy vs. CRUD operations
  • Project phase: Development vs. maintenance
  • Detection goals: Refactoring vs. plagiarism detection

similarity-go can adjust thresholds based on context:

func (d *Detector) calculateAdaptiveThreshold(context *AnalysisContext) float64 {
    baseThreshold := d.config.Threshold

    // Adjust based on function size
    sizeAdjustment := d.calculateSizeAdjustment(context.FunctionSize)

    // Adjust based on complexity
    complexityAdjustment := d.calculateComplexityAdjustment(context.Complexity)

    // Adjust based on domain
    domainAdjustment := d.calculateDomainAdjustment(context.Domain)

    return baseThreshold + sizeAdjustment + complexityAdjustment + domainAdjustment
}
| Use Case | Recommended Threshold | Rationale |
| --- | --- | --- |
| Duplicate detection | 0.85-0.95 | High precision needed |
| Refactoring opportunities | 0.70-0.85 | Balance precision/recall |
| Code review assistance | 0.60-0.75 | Broader coverage |
| Learning from examples | 0.50-0.70 | Show variations |

Comparisons can terminate early once the maximum achievable score can no longer reach the configured threshold:

func (c *Comparator) compareWithEarlyTermination(f1, f2 *Function) float64 {
    // Quick signature check: clearly dissimilar signatures exit immediately.
    if sigSimilarity := compareSignatures(f1.Signature, f2.Signature); sigSimilarity < 0.3 {
        return 0.0 // Early exit
    }

    // Incremental scoring with threshold checks.
    score := 0.0
    weights := []float64{0.5, 0.3, 0.2}
    remaining := 1.0 // the weights sum to 1.0

    for i, analyzer := range c.analyzers {
        partialScore := analyzer.Compare(f1, f2)
        score += weights[i] * partialScore
        remaining -= weights[i]

        // Terminate early if even perfect scores from the remaining
        // layers could not reach the threshold.
        if score+remaining < c.threshold {
            return 0.0
        }
    }
    return score
}

Candidate pairs are screened at several levels of increasing precision:

  1. Quick filters: Fast, rough similarity estimates
  2. Medium precision: Moderate computational cost
  3. High precision: Expensive, accurate analysis

Only candidates passing lower levels proceed to higher precision analysis.
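
A minimal sketch of this staged screening, assuming each level wraps a comparison function and a pass threshold; the stage type and Function usage follow the earlier examples but are not similarity-go's exact API.

// stage pairs a similarity estimate of a given precision with the minimum
// score a candidate must reach to continue to the next, more expensive level.
type stage struct {
    compare   func(f1, f2 *Function) float64
    threshold float64
}

// screenCandidate runs the stages in order and rejects the pair as soon as
// one level falls below its threshold.
func screenCandidate(stages []stage, f1, f2 *Function) (float64, bool) {
    score := 0.0
    for _, st := range stages {
        score = st.compare(f1, f2)
        if score < st.threshold {
            return 0, false
        }
    }
    return score, true
}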

Several caching strategies avoid repeated work (a result-cache sketch follows this list):

  • Result caching: Store comparison results
  • Intermediate caching: Cache normalized ASTs and token sequences
  • Signature caching: Quick lookups for function signatures
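
A hedged sketch of the result cache, keyed by an order-independent pair of function identifiers; the Function.ID field and the cache type are assumptions for illustration, and sync is assumed to be imported.

// resultCache stores pairwise similarity scores so repeated comparisons of
// the same two functions are free.
type resultCache struct {
    mu      sync.RWMutex
    results map[[2]string]float64
}

func newResultCache() *resultCache {
    return &resultCache{results: make(map[[2]string]float64)}
}

// key normalizes the order so (f1, f2) and (f2, f1) share one entry.
func (c *resultCache) key(f1, f2 *Function) [2]string {
    a, b := f1.ID, f2.ID
    if b < a {
        a, b = b, a
    }
    return [2]string{a, b}
}

func (c *resultCache) get(f1, f2 *Function) (float64, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    score, ok := c.results[c.key(f1, f2)]
    return score, ok
}

func (c *resultCache) put(f1, f2 *Function, score float64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.results[c.key(f1, f2)] = score
}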

similarity-go uses standard metrics for evaluation (a short computation sketch follows the list):

  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
  • AUC: Area under the ROC curve
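
A small sketch of how the first three metrics are computed from labeled comparison results; the counts come from whatever evaluation harness produced the ground truth.

// evaluationMetrics derives precision, recall, and F1 from true-positive,
// false-positive, and false-negative counts.
func evaluationMetrics(tp, fp, fn int) (precision, recall, f1 float64) {
    if tp+fp > 0 {
        precision = float64(tp) / float64(tp+fp)
    }
    if tp+fn > 0 {
        recall = float64(tp) / float64(tp+fn)
    }
    if precision+recall > 0 {
        f1 = 2 * precision * recall / (precision + recall)
    }
    return precision, recall, f1
}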

Based on evaluation against manually labeled datasets:

| Dataset | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Open source Go projects | 0.92 | 0.88 | 0.90 |
| Enterprise codebases | 0.89 | 0.91 | 0.90 |
| Tutorial/educational code | 0.95 | 0.93 | 0.94 |

Example configurations for two of the use cases above. A stricter profile (e.g., for duplicate detection):

algorithms:
  syntactic:
    weight: 0.6
    normalize_variables: true
    normalize_constants: true
  lexical:
    weight: 0.3
    use_lcs: true
    use_jaccard: true
  semantic:
    weight: 0.1
    analyze_dataflow: true

detection:
  threshold: 0.90
  min_function_lines: 10

A looser profile (e.g., for finding refactoring opportunities):

algorithms:
  syntactic:
    weight: 0.4
    normalize_variables: true
    normalize_constants: false
  lexical:
    weight: 0.4
    use_edit_distance: true
  semantic:
    weight: 0.2
    analyze_control_flow: true

detection:
  threshold: 0.75
  min_function_lines: 5

Possible future directions include:

  • Neural embeddings: Code representations in vector space
  • Deep learning: Pattern recognition in large codebases
  • Transfer learning: Adapt models to specific domains
  • Type inference: Better understanding of Go’s type system
  • Interface analysis: Similarity based on implemented interfaces
  • Dependency analysis: Consider external library usage patterns

While currently Go-focused, the architecture supports extension to other languages with similar AST-based analysis capabilities.