22 June 2017
An interesting problem that happens to be very common in search engine realms is to find the shortest snippet of text containing a certain group of keywords in a document.
The solution for the problem is to find the min Window from indexes i to j in text such that all target keywords are present and distance from i to j is minimal.
One approach is to start a sliding window mechanism with the help of a Queue and for every keyword found check if head of queue (first inserted keyword in window) is equal to new found keyword, if so remove the keyword from head and place it in the tail of queue. This way will allow the algorithm to test for all windows of matches found for every keyword, and in the end returning the min. Window size.
in-depth algorithm explanation
Implementation of the above algorithm in Scala
As an example suppose we have a big string representing the document text in this example we will fetch all occurances of the search term "Hello World" (keywords: [Hello, World})
Source code
def getMinWindowSize(doc:String, s:String): Int = {
val keywords = s.split(" ").toSet
val idxs = keywords.map(k => (k -> ("(?i)\Q" + k + "\E").r.findAllMatchIn(doc).map(.start)))
.map{ case (keyword,itr) => itr.foldLeft(List(String,Int))((result, num) => result :+ (keyword, num))}
.foldLeft(List(String,Int))((res, list) => res ++ list)
.sortBy(.2)
var min = Int.MaxValue
var minI = 0
var minJ = 0
var currWindow = ListBuffer(String,Int)
for( tuple <- idxs ) {
if (!currWindow.isEmpty && currWindow.head.1.equals(tuple.1)) currWindow.remove(0)
currWindow += tuple
if (keywords.subsetOf(currWindow.map(.1).toSet)) {
val currMin = currWindow.last.2 - currWindow.head.2 + currWindow.last.1.length
if (min > currMin) {
min = currMin
minI = currWindow.head.2
minJ = currWindow.last.2
}
}
}
println("min = " + min + " ,i = " + minI + " j = " + minJ )
min
}
}
This block:
Creates a list of tuples List[(String, Int)] which contains each keyword and its respective index found in the text sorted in order by index. This makes it easier for calculating the min window on the next step.