Really simple Uniq in Scala

I don’t know about you, but I use the standard uniq tool available on *NIX systems all the time. It only removes duplicate adjacent lines from its input stream though, so to retain only unique lines across your entire input you’d usually pipe the input through sort first. On larger datasets, that sort can become time-consuming.
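
To make the adjacent-only behaviour concrete, here is a minimal sketch that mimics it on an in-memory list. The names AdjacentVsGlobal and dropAdjacentDuplicates are made up for illustration; this is not how uniq itself is implemented.

```scala
object AdjacentVsGlobal {
  // Mimics uniq: drop a line only when it equals the line immediately before it.
  // (Hypothetical helper, for illustration only.)
  def dropAdjacentDuplicates(lines: List[String]): List[String] =
    lines.foldLeft(List.empty[String]) { (kept, line) =>
      // kept is built in reverse, so its head is the most recently kept line
      if (kept.headOption.contains(line)) kept else line :: kept
    }.reverse

  def main(args: Array[String]): Unit = {
    val input = List("a", "b", "a", "a", "c")
    println(dropAdjacentDuplicates(input))        // List(a, b, a, c)
    println(dropAdjacentDuplicates(input.sorted)) // List(a, b, c)
  }
}
```

Without the sort, the second "a" survives because a "b" sits between it and the first one.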

It is, however, dirt simple to build a uniq tool in Scala that outputs the unique lines from an entire input without having to sort the input first. And I really do mean dirt simple; it took only a few moments to write. Here it is:

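package com.sdstrowes.util

import scala.io.Source
import scala.collection.immutable.HashSet

object Uniq {
  def main(args: Array[String]): Unit = {
    // getLines() yields each input line with its line terminator stripped
    val lines = Source.fromInputStream(System.in).getLines()
    var uniq = HashSet[String]()

    // Adding a line that is already in the set leaves the set unchanged
    while (lines.hasNext)
      uniq = uniq + lines.next()

    // println restores the newline that getLines() removed
    uniq.foreach(l => println(l))
  }
}
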
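This reads its input on stdin and writes to stdout, so you can drop it into your pipeline in place of “sort | uniq”. Bear in mind that the lines come out in the set’s iteration order, which is neither sorted nor the order they arrived in. Rudimentary testing on my inputs suggests I get roughly a 2x speedup. Of course, your mileage may vary.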
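
If you care about the order of the output, one possible variation is to print each line the first time it is seen rather than collecting everything and printing at the end. The sketch below is an assumption about how you might do that, not the version I tested above; UniqOrdered and firstOccurrences are names invented for this example.

```scala
import scala.collection.mutable
import scala.io.Source

object UniqOrdered {
  // Keep only the first occurrence of each line, preserving input order.
  // (Hypothetical helper, for illustration only.)
  def firstOccurrences(lines: Iterator[String]): Iterator[String] = {
    val seen = mutable.HashSet[String]()
    // add returns true only when the line was not already in the set
    lines.filter(line => seen.add(line))
  }

  def main(args: Array[String]): Unit =
    firstOccurrences(Source.fromInputStream(System.in).getLines()).foreach(println)
}
```

Because the iterator is filtered lazily, output starts before the input has been fully read. Memory use still grows with the number of distinct lines, since every one of them has to be remembered.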

Footnote

Posted by Stephen Strowes on Thursday, April 22nd, 2010. You can follow me on twitter.

All content, including images, © Stephen D. Strowes, 2000–2016. Hosted by Digital Ocean.