Finding minimum size distance of a group of keywords in a document

22 June 2017

An interesting problem that happens to be very common in search engine realms is to find the shortest snippet of text containing a certain group of keywords in a document.

The solution for the problem is to find the min Window from indexes i to j in text such that all target keywords are present and distance from i to j is minimal.

One approach is to start a sliding window mechanism with the help of a Queue and for every keyword found check if head of queue (first inserted keyword in window) is equal to new found keyword, if so remove the keyword from head and place it in the tail of queue. This way will allow the algorithm to test for all windows of matches found for every keyword, and in the end returning the min. Window size.

in-depth algorithm explanation

Implementation of the above algorithm in Scala

As an example suppose we have a big string representing the document text in this example we will fetch all occurances of the search term "Hello World" (keywords: [Hello, World})

Source code

object MinKWindowSum {
  def main(args: Array[String]): Unit = {
    val document = "This Hello is a huge text with thousands of Hello words and other lines and World and many other Hello docs Words of World in many langs and features"
    println(getMinWindowSize(document, "Hello World"))

def getMinWindowSize(doc:String, s:String): Int = { val keywords = s.split(" ").toSet val idxs = => (k -> ("(?i)\Q" + k + "\E").r.findAllMatchIn(doc).map(.start))) .map{ case (keyword,itr) => itr.foldLeft(List(String,Int))((result, num) => result :+ (keyword, num))} .foldLeft(List(String,Int))((res, list) => res ++ list) .sortBy(.2) var min = Int.MaxValue var minI = 0 var minJ = 0 var currWindow = ListBuffer(String,Int) for( tuple <- idxs ) { if (!currWindow.isEmpty && currWindow.head.1.equals(tuple.1)) currWindow.remove(0) currWindow += tuple if (keywords.subsetOf( { val currMin = currWindow.last.2 - currWindow.head.2 + currWindow.last.1.length if (min > currMin) { min = currMin minI = currWindow.head.2 minJ = currWindow.last.2
} } } println("min = " + min + " ,i = " + minI + " j = " + minJ ) min } }

This block:

    val idxs = => (k -> ("(?i)\Q" + k + "\E").r.findAllMatchIn(doc).map(.start)))
    .map{ case (keyword,itr) => itr.foldLeft(List(String,Int))((result, num) => result :+ (keyword, num))}
    .foldLeft(List(String,Int))((res, list) => res ++ list)

Creates a list of tuples List[(String, Int)] which contains each keyword and its respective index found in the text sorted in order by index. This makes it easier for calculating the min window on the next step.

Source code on github

comments powered by Disqus