Really simple Uniq in Scala

I don’t know about you, but I use the standard uniq tool available on *NIX systems all the time. It only removes adjacent duplicate lines from its input stream, though, so to retain only unique lines across your entire input you’d usually pipe the input through sort first. On larger datasets, that sort can become time-consuming.
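
To make the adjacent-only behaviour concrete, here is a small sketch in Scala. The object name `AdjacentVsGlobal`, the helper `adjacentUniq`, and the sample data are all made up for illustration; the helper only mimics what plain uniq does when run with no options.

```scala
object AdjacentVsGlobal {
  // Hypothetical helper: drop a line only when it equals the line
  // directly before it, mimicking plain `uniq` with no options.
  def adjacentUniq(lines: List[String]): List[String] =
    lines.foldLeft(List.empty[String]) { (acc, line) =>
      // acc is built in reverse, so its head is the previously kept line
      if (acc.headOption.contains(line)) acc else line :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    val input = List("a", "a", "b", "a")
    // The final "a" survives because it is not adjacent to the first two
    println(adjacentUniq(input))        // List(a, b, a)
    // Sorting first makes every duplicate adjacent
    println(adjacentUniq(input.sorted)) // List(a, b)
  }
}
```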

It is, however, dirt simple to build a uniq tool in Scala which outputs the unique lines from an entire input without having to sort the input first. And I really do mean dirt-simple; it took only a few moments to write. Here it is:

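package com.sdstrowes.util

import scala.io.Source
import scala.collection.immutable.HashSet

object Uniq {
  def main(args: Array[String]): Unit = {
    // getLines() yields each input line with its line terminator stripped
    val lines = Source.fromInputStream(System.in).getLines()
    var uniq = HashSet[String]()

    // Adding a line that is already in the set leaves the set unchanged
    while (lines.hasNext)
      uniq = uniq + lines.next()

    // println restores the newline that getLines() removed.
    // HashSet iteration order is unspecified, so output order may differ from input order.
    uniq.foreach(l => println(l))
  }
}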

This accepts input on stdin and writes to stdout, so you can drop it into your pipeline in place of "sort | uniq". Rudimentary testing on my inputs suggests roughly a 2x speedup. Of course, your mileage may vary.
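
One difference to be aware of: "sort | uniq" emits lines in sorted order, whereas a hash set makes no ordering guarantee. If order matters, one variation is to print each line the first time it is seen. This preserves input order and produces output as the input arrives. The following is a minimal sketch of that idea, not the version timed above; the names `UniqOrdered` and `firstOccurrences` are made up for illustration.

```scala
import scala.collection.mutable
import scala.io.Source

object UniqOrdered {
  // Hypothetical helper: lazily keep only the first occurrence of each line.
  // mutable.HashSet#add returns true only if the element was not already present.
  def firstOccurrences(lines: Iterator[String]): Iterator[String] = {
    val seen = mutable.HashSet.empty[String]
    lines.filter(line => seen.add(line))
  }

  def main(args: Array[String]): Unit =
    // Reads stdin and prints each distinct line once, in first-seen order
    firstOccurrences(Source.fromInputStream(System.in).getLines())
      .foreach(println)
}
```

Memory use still grows with the number of distinct lines, just as it does with the HashSet version.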

1 comment

1 On Deriving AS Links from BGP Snapshots — You shot the invisible swordsman. { 05.03.10 at 11:04 pm }

[...] $bgpdata | cut -d " " -f 2- | scala com.sdstrowes.util.Uniq > [...]