Web Search and Sense-Making
Assignment 5
Task: Build the Wikipedia Link Graph
Introduction:
In this assignment, we will extract the links from the Wikipedia dump and build a link graph from
them.
Requirements:
At least 100 GB of free disk space on your machine.
Instructions:
Write a LinkGraph.scala file to extract the links from the English Wikipedia dump and build the
link graph, by taking the following steps:
1. Read in the output files of your last assignment, in which you have obtained the articles. The
format should be:
• One page (of type article) per line
• Each line has two fields, the title and the text, separated by a tab
• An example page looks like:
[Example figure: one line with a page title, a tab, and the page text]
COSC 589 - Web Search and Sense-Making
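The input format above can be read defensively. The following is a minimal sketch, assuming each line may be wrapped in "(" and ")" as the code template in step 6 assumes; the object PageLine and its parse method are hypothetical names, not part of the assignment.

```scala
object PageLine {
  // Hypothetical helper (not part of the assignment template): splits one
  // input line into (title, text). Returns None when the line has no tab,
  // so a malformed line can be skipped instead of crashing the job.
  def parse(line: String): Option[(String, String)] = {
    // Assumption: lines may be wrapped in "(" and ")", as in the template.
    val body = line.stripPrefix("(").stripSuffix(")")
    // Limit 2 keeps any further tabs inside the text field.
    body.split("\t", 2) match {
      case Array(title, text) => Some((title, text))
      case _                  => None
    }
  }
}
```

With an RDD of lines, something like input.flatMap(l => PageLine.parse(l)) should keep only the well-formed pages. Note that stripSuffix(")") also removes a closing parenthesis that belongs to the text when a line is not wrapped, which does not affect link extraction.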
2. For each page, extract its outlinks’ titles. An outlink appears in the Wikipedia dump in the
following format:
[[the title of an outlink page]]
For instance, the page titled “Alvin Toffler” has an outlink to another page titled “Future
Shock”.
(Alvin Toffler {{Use mdy dates|date=September 2013}}{{Infobox person| name = Alvin
Toffler| image = Alvin Toffler 02.jpg| image_size = 210px| caption = Alvin Toffler
(2006)| birth_name =| birth_date = {{Birth date and age |1928|10|4}}| birth_place =
New York City<ref>{{cite web|last=The European Graduate School|title=ALVIN TOFFLER -
BIOGRAPHY|url=http://www.egs.edu/library/alvin-toffler/biography/|accessdate=January 7,
2014}}</ref>| death_date = <!-- {{Death date and age|YYYY|MM|DD|YYYY|MM|DD}} -->|
death_place =| death_cause =| resting_place =| residence = Los Angeles,
California| nationality = United States| other_names =| known_for = ''[[Future
Shock]]''
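One possible way to find the raw contents of every [[...]] is a regular expression. This is only a sketch: the object RawLinks and the pattern are illustrative choices, and the pattern assumes that a link's contents contain no square brackets.

```scala
import scala.util.matching.Regex

object RawLinks {
  // Matches [[...]] where the inside contains no square brackets.
  // Group 1 captures the raw text between the double brackets.
  val LinkPattern: Regex = """\[\[([^\[\]]+)\]\]""".r

  // Returns the raw contents of every [[...]] in the text, in order.
  def find(text: String): List[String] =
    LinkPattern.findAllMatchIn(text).map(_.group(1)).toList
}
```

For the text ''[[Future Shock]]'' this returns List("Future Shock"). Because brackets are excluded from the captured group, a nested construct such as [[File:x.jpg|thumb|[[Inner]]]] yields only the inner link. The raw contents still need the clean-up described in step 3.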
3. However, not everything inside [[]] is a good outlink title. We need to do the following:
3.1. First, ignore any outlink title that contains a colon “:”. Such titles do not name an
article but something else. For instance, “WP:CSD#R3D3” is not the title of an article.
3.2. Second, if an outlink title contains “|”, “#”, or “,”, keep only the part before that
symbol. If a title contains “|”, it has multiple variations of the title; we keep only the first one.
For instance, for [[The Third Wave (book)|The Third Wave]], we keep only [[The Third Wave
(book)]]. If a title contains “#”, it has both the title and a section name; again, we keep only
the former. For instance, for [[Uncial script#Half-uncial|semi-uncial]], we keep only [[Uncial
script]]. If a title contains “,”, it conflicts with Spark’s default delimiter. So that we can
match an outlink page to its own entry, we keep only the part before the comma. In summary,
we extract
• the part before “|” in an outlink title with “|” (title name variations),
• the part before “#” in an outlink title with “#” (bookmarked sections), and
• the part before “,” in an outlink title with “,” (Spark’s default delimiter in saved files)
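The rules in step 3 can be sketched as one small function applied to the raw text inside each [[...]]. The object TitleCleaner is a hypothetical name. Three choices here are assumptions rather than requirements: the colon check runs on the whole raw text before truncation, surrounding whitespace is trimmed, and a link that is empty after cleaning is dropped.

```scala
object TitleCleaner {
  // Hypothetical helper: applies the rules from step 3 to the raw text
  // found inside one [[...]]. Returns None when the link should be ignored.
  def clean(raw: String): Option[String] = {
    if (raw.contains(":")) None // rule 3.1: not an article title
    else {
      // rule 3.2: keep only the part before the first "|", "#", or ","
      val title = raw.takeWhile(c => c != '|' && c != '#' && c != ',').trim
      // Assumption: a link that is empty after cleaning (e.g. "#Section") is dropped.
      if (title.isEmpty) None else Some(title)
    }
  }
}
```

For example, clean("Uncial script#Half-uncial|semi-uncial") gives Some("Uncial script"), and clean("WP:CSD#R3D3") gives None.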
4. Save the title and the outlink titles for each page in the Wikipedia dump. Optionally, you can
save your files in compressed format by using saveAsTextFile(filename, classOf[GzipCodec]). The
output format is as follows:
• One page per line
• In each line, you have the title of a page and a list of the titles of the outlinks in the page
• Each outlink title is inside [[]], and the titles are separated by a tab “\t”.
• The title and the list of links are separated by “,” (this is the default in Spark).
For instance, for page titled “Alvin Toffler”, we expect the following as the output:
5. Save the number of outlinks for each page separately.
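The link-list format from step 4 and the count from step 5 could be produced as follows. This is a sketch only; the object LinkLine and its method names are hypothetical. The count treats an empty list as a special case, because splitting an empty string on a tab yields one empty element rather than none.

```scala
object LinkLine {
  // Wraps each title in [[ ]] and joins the results with tabs
  // (the link-list format from step 4).
  def format(titles: Seq[String]): String =
    titles.map(t => s"[[$t]]").mkString("\t")

  // Counts the links in a formatted list. An empty list is handled
  // explicitly because "".split("\t") yields one empty element, not zero.
  def count(formatted: String): Int =
    if (formatted.isEmpty) 0 else formatted.split("\t").length
}
```

Counting the titles before formatting them (titles.length) would avoid the split altogether; the version above matches a pipeline that stores the links as a single string.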
6. You are welcome to use the following code template:
import scala.util.matching.Regex
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.hadoop.io.compress.GzipCodec

object LinkGraph {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Wiki LinkGraph")
    val sc = new SparkContext(sparkConf)
    val input = sc.textFile("./wikiarticles") // your output directory from the last assignment
    val page = input.map { l =>
      val pair = l.stripPrefix("(").stripSuffix(")").split("\t", 2)
      (pair(0), pair(1)) // get the two fields: title and text
    }
    val links = page.map(r => (r._1, extractLinks(r._2))) // extract links from text
    // count number of links; an empty link string counts as 0, since "".split("\t") has length 1
    val linkcounts = links.map(r => (r._1, if (r._2.isEmpty) 0 else r._2.split("\t").length))
    // save the links and the counts in compressed format (save your disk space)
    links.saveAsTextFile("./links", classOf[GzipCodec])
    linkcounts.saveAsTextFile("./links-counts", classOf[GzipCodec])
  }

  def extractLinks(text: String): String = {
    // you will need to work on a way to extract the links
    ??? // placeholder so the template compiles; replace with your implementation
  }
}
What to Submit:
- Your code
- Screen captures of the beginning of your saved link graphs (e.g., the first 20 lines on the
screen. Hint: use ‘gunzip part-00000.gz’ to unzip, then use ‘less yourfile’ to view the
documents and take a screen capture)
- Screen captures of the beginning of the saved counts of links (e.g., the first 20 lines on the
screen. Hint: use ‘gunzip part-00000.gz’ to unzip, then use ‘less yourfile’ to view the
documents and take a screen capture)
What NOT to Submit:
- Your input or output files
Where to Submit:
- Canvas