Q1  _____ are the basic constructs you use to process data using Pig

A1 Pig Latin statements

Q2  A _______ is an operator that takes a relation as input and produces another relation as output. 

A2 Pig Latin statement

Q3 A Pig Latin statement is an operator that takes a _____ as input and produces another relation as output.  

A3 relation

Q4 A Pig Latin statement is an operator that takes a relation as input and produces ________ as output.

A4 another relation 

Q5 Pig Latin statements are generally organized as follows: A _____ statement to read data from the file system. A series of “transformation” statements to process the data. A DUMP statement to view results or a STORE statement to save the results.

A5 LOAD


Q6 Pig Latin statements are generally organized as follows: A LOAD statement to read data from the file system. A series of _____ statements to process the data. A DUMP statement to view results or a STORE statement to save the results.

A6 “transformation”

Q7 Pig Latin statements are generally organized as follows: A LOAD statement to read data from the file system. A series of “transformation” statements to process the data. A ______ statement to view results or a _______ statement to save the results.

A7 DUMP, STORE


Q8 Use the _____ operator to work with tuples or rows of data.

A8 FILTER


Q9 Use the ______ operator to work with columns of data.

A9 FOREACH


Q10 Use the _____ operator to group data in a single relation.

A10 GROUP


Q11 Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in _______.

A11  two or more relations.

Q12 Use the ____ operator to partition the contents of a relation into multiple relations.

A12 SPLIT


Q13 ______ stores data as tuples in human-readable UTF-8 format.

A13 PigDump

e.g. STORE X INTO 'output' USING PigDump();

Q14 Loads and stores data as structured text files.

A14 PigStorage

Q15 A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age: int, gpa: float); What does ('\t') do?

A15 separates values with a tab
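
To make the role of the delimiter concrete, here is a minimal Python sketch (not Pig itself) of what splitting a tab-delimited record and applying the (name, age, gpa) schema amounts to. The function name `parse_student` and the sample record are hypothetical, chosen only for illustration.

```python
def parse_student(line, delimiter="\t"):
    """Split one delimited record and cast the fields to (name, age, gpa)."""
    name, age, gpa = line.rstrip("\n").split(delimiter)
    return (name, int(age), float(gpa))


# Hypothetical sample record with tab-separated fields.
record = parse_student("John\t18\t4.0\n")
print(record)
```

Passing a different `delimiter` (for example a comma) plays the same role as changing the argument given to PigStorage.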

Q16 Loads unstructured data in UTF-8 format.

A16 TextLoader() e.g. A = LOAD 'data' USING TextLoader();




Q1 ./bin/spark-shell (in the Spark directory)

A1 Starts the spark shell


A2 Resilient Distributed Dataset

Q3 RDDs have _____ which return values

A3 actions

Q4 RDDs have _____ which return pointers to new RDDs

A4 transformations
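
The difference between the two previous cards can be illustrated with a toy Python class. This is only a sketch of the idea, not Spark's API: `ToyRDD` and its methods are hypothetical stand-ins. Transformations hand back another `ToyRDD` without doing any work, while actions run the recorded pipeline and return ordinary values.

```python
class ToyRDD:
    """Toy stand-in for an RDD that records its pipeline lazily (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data
        self._steps = steps    # pending transformations, not yet run

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the pipeline and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(isinstance(evens_squared, ToyRDD))  # a transformation yields another ToyRDD
print(evens_squared.collect())            # an action yields plain values
```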


A5 Directed Acyclic Graphs

Q6 The research team behind Spark founded which company?

A6 Databricks



Q1 val

A1 indicates an unmodifiable variable

Q2 var

A2 indicates a modifiable variable

Q3 internally in Scala, both arrays and functions are conceptualized as kinds of _______ from one object to another

A3 mathematical mappings

Q4 Scala has no static variables or methods. Instead it has __ which are essentially classes with only one object in the class

A4 singleton objects

Q5 Singleton objects are declared using ________

A5 object (the lowercase keyword)

Q6 It is common to place static variables and methods in a singleton object with the same name as the class name which is then known as a ______

A6 Companion object

Q7 Scala strings are implemented by

A7 Java’s String class (java.lang.String)

Q8 Scala’s ability to figure out types is called?

A8 type inference

Q9 Although unnecessary, if you want to specify a type, how do you do it?

A9 After its name separated by a colon.

Q10 Since java.lang types are visible with their simple names, java.lang.String can be written

A10 String

Q11 Output the variable msg

A11 println(msg)

Q12 Function definitions start with ?

A12 def

Q13 The Scala compiler does not infer ____ ____ types

A13 Function parameter

Q14 Functional programming is a programming paradigm that treats computation as the evaluation of ______

A14 mathematical functions

Q15 the opposite of functional programming is

A15 imperative programming



Q1 _____ can be viewed as doing a matrix multiplication of the term-document matrix by the query vector (giving a vector over documents where the components are the relevance score)

A1 Full-text search
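
A small worked example may help here. The sketch below uses plain Python lists and assumes the matrix is laid out with documents as rows and terms as columns; the vocabulary, the counts, and the function name `search` are all hypothetical toy values.

```python
# Vocabulary order fixes the meaning of each column (hypothetical toy data).
vocabulary = ["pig", "spark", "scala"]

# Rows are documents, columns are term counts.
doc_term_matrix = [
    [2, 0, 0],  # doc 0 mentions "pig" twice
    [0, 1, 3],  # doc 1 mentions "spark" once and "scala" three times
    [1, 1, 0],  # doc 2 mentions "pig" and "spark" once each
]


def search(matrix, query_vector):
    """Multiply the matrix by the query vector: one relevance score per document."""
    return [
        sum(count * weight for count, weight in zip(row, query_vector))
        for row in matrix
    ]


query = [0, 1, 1]  # query terms: "spark scala"
scores = search(doc_term_matrix, query)
print(scores)
```

Each entry of `scores` is the dot product of one document's row with the query vector, so the highest entry marks the best-matching document under this simple scoring.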

Q2 One of the more useful approaches to dealing with huge sparse data sets is the concept of _______, where a lower dimensional space of the original column (feature) space of your data is found/constructed and your rows are mapped into the subspace (or ‘sub-manifold’)

A2 dimensionality reduction

Q3 One of the most straightforward techniques for dimensionality reduction is the ________

A3 matrix decomposition
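
As a rough illustration of the idea, the sketch below finds a single dominant direction in the column space by power iteration and maps each row onto it. This is a simplified stand-in for a full decomposition such as SVD, written in plain Python under the assumption that the starting vector is not orthogonal to the dominant direction; the function names and data are hypothetical.

```python
import math


def top_direction(rows, iterations=50):
    """Power iteration on A^T A: approximates the leading right singular vector of A."""
    n_cols = len(rows[0])
    v = [1.0] * n_cols  # assumed starting vector
    for _ in range(iterations):
        # Compute A v, then A^T (A v), then renormalise.
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(row[j] * av[i] for i, row in enumerate(rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, direction):
    """Map each row to its single coordinate along the chosen direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]


# Hypothetical two-column data that really varies along one direction only.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction = top_direction(data)
reduced = project(data, direction)
print(direction)
print(reduced)
```

Here every two-column row is replaced by one number, which is the sense in which rows are mapped into a lower-dimensional subspace.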



Q1 ______ provides a simple way to run any existing system, unmodified on YARN by merely providing details such as required resources (CPU/memory per container, number of containers, software, start/stop commands etc.)

A1 Apache Slider

Linear Algebra

Q1 The pandas documentation recommends learning _________ first.

A1 NumPy

Q2 In R: mydata <- read.csv("filename.txt")

A2 imports a CSV file to the variable mydata

Q3 If your file has a header row, row1 is the _______

A3 name of each column

Q4 mydata <- read.csv("filename.txt", header=FALSE) What does header=FALSE do?

A4 indicates the first row is not the name of each column but data

Q5 mydata <- read.table("filename.txt", sep="\t", header=TRUE) What does this do?

A5 reads a tab-separated file, whose first row holds the column names, into mydata.
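
For comparison, the effect of the sep and header arguments can be sketched in Python with the standard csv module. This is not R's implementation: the function `read_table` below is a hypothetical helper that only mimics the two arguments discussed in these cards, including R's default V1, V2, ... column names when there is no header row.

```python
import csv
import io


def read_table(text, sep="\t", header=True):
    """Parse delimited text; with header=True the first row names the columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: every row is data, columns get default names V1, V2, ...
        names, data = [f"V{i + 1}" for i in range(len(rows[0]))], rows
    return [dict(zip(names, row)) for row in data]


# Hypothetical tab-separated content standing in for filename.txt.
sample = "name\tage\nAda\t36\nAlan\t41\n"
print(read_table(sample))
print(read_table(sample, header=False)[0])
```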

Q6 How can you use a GUI to import data to R?

A6 RStudio



Q1 T/F: Scala does not require semi-colons

A1 T

Q2 T/F Scala is a statically typed language

A2 T

Q3 T/F Scala is a functional programming language

A3 T

Q4 Scala value types are _______

A4 Capitalized (Int, Double, Boolean)

Q5 Parameter and return types _______ as in Pascal, rather than ________ as in C

A5 Follow, Precede

Q7 ________ is the ability of a program to inspect and possibly even modify itself

A7 reflection

Q8 Reflection involves the ability to _______ (i.e. make explicit) otherwise implicit elements of a program

A8 reify



Q1 Mahout has utilities that allow one to easily produce ____ from Lucene and Solr

A1 Mahout Vector Representations

Q2 Before creating the vectors, you need to convert the documents to ____ format.

A2 SequenceFile

Q3 A Hadoop class which allows us to write arbitrary key, value pairs into it

A3 SequenceFile

Q4 --minSupport is the min frequency for a word to be considered as a ____

A4 Feature

Q5 --minDF is the min number of documents the word needs to be in to be considered as a _____?

A5 feature

Q6 --maxDFPercent is the max value of the expression (document frequency of a word / total number of documents) for the word to be considered a good feature. This helps ______

A6 remove high frequency features like stop words
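
The three thresholds from the cards above can be sketched together in plain Python. This is not Mahout's code: `select_features` is a hypothetical helper, and it assumes the maximum document frequency is given as a percentage between 0 and 100.

```python
from collections import Counter


def select_features(documents, min_support=1, min_df=1, max_df_percent=100):
    """Keep words that pass the total-frequency and document-frequency thresholds."""
    term_freq = Counter(word for doc in documents for word in doc)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    return sorted(
        word
        for word in term_freq
        if term_freq[word] >= min_support
        and doc_freq[word] >= min_df
        and 100.0 * doc_freq[word] / n_docs <= max_df_percent
    )


# Hypothetical tokenised documents; "the" appears in every one of them.
docs = [
    ["the", "pig", "latin"],
    ["the", "spark", "shell"],
    ["the", "pig", "spark"],
    ["the", "scala"],
]
print(select_features(docs, min_support=2, min_df=2, max_df_percent=75))
```

With these settings "the" is dropped because it occurs in all documents, which is how an upper bound on document frequency filters out stop words.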

Q7 A document processing pipeline must be converted to ?

A7 Mahout Vector Format

Q8 A powerful learning algorithm for automatically and jointly clustering words into “topics” and documents into mixtures of topics

A8 Latent Dirichlet Allocation

Q9 a hierarchical Bayesian model that associates with each document a probability distribution over “topics” which are in turn distributions over words

A9 A topic model


What are we searching for?