CS 510 Software Engineering
Project 3, Version 1
Submit: An electronic copy on BLACKBOARD
Project 3 is worth 22% of your final grade. We plan to have the following breakdown for Project 3:
part (I): 30%, part (II): 40%, part (III): 30%, and part (IV): bonus.
You will work in a group of 1-2 members for this project. We expect that you will be in the same group as
Project 2. But if you need to make changes, please notify Yi Sun by the Group Sign-Up Due Date above.
Please download proj-skeleton.tar.gz from the course repo, which you will need for this project. The
skeleton contains the necessary source code and test cases for the project.
Training deep learning models may take hours or longer. Please start early; otherwise, there may
not be enough machine time to finish the experiments. Also, the servers get busier/slower when
many groups use them at the end when the project is due.
You are expected to use mc18.cs.purdue.edu or cuda.cs.purdue.edu machines to work on your project. Your
home directory may not have enough space for the project. Use /scratch instead, which has enough space
for the project. Remember to remove your data if you no longer need it. Several of the resources required for
this project are already installed on these servers, but can also be downloaded independently.
I expect each group to work on the project independently (discussion and collaboration within a group are allowed).
Submission Instructions:
Go to Blackboard → Project 3 to submit your answer. Submit only one file in .tar.gz format. Please name your
file
<FirstName>-<LastName>-<Username>.tar.gz
For example, use John-Smith-jsmith.tar.gz if you are John Smith with username jsmith. The .tar.gz file should contain
the following items:
• a single PDF file “proj3_sub.pdf”. The first page must include your full name and your Purdue email address. Your
PDF file should contain your results for question I and a report of the improvements you tried for questions II,
III, and IV, as well as your results.
• a directory “q2” that contains your code (source code only; no binaries, datasets or trained models) for Question
II
• a directory “q3” that contains your code (source code only; no binaries, datasets or trained models) for Question
III
• (optional) a directory “q4” that contains your code (source code only; no binaries, datasets or trained models) for
the competition for Question IV
If you use new libraries for questions II, III, and IV, also include a requirements.txt that contains the list of libraries
used. You can use pip freeze > requirements.txt and include it with your source code.
You can submit multiple times. After submission, please view your submissions to make sure you have uploaded
the right files/versions.
Building Line-level Defect Detection Models
In this project, you are expected to learn how to build a defect prediction model for software source code from scratch.
You are required to apply deep-learning techniques, e.g., classification, tokenization, embedding, etc., to build more
accurate prediction models with the dataset provided.
Background
Line-level Defect classifiers predict which lines in a file are likely to be buggy.
A typical line-level defect prediction using deep-learning consists of the following steps:
• Data extraction and labeling: Mining buggy and clean lines from a large dataset of software changes (usually
GitHub).
• Tokenization and pre-processing: Deep learning algorithms take a vector as input. Since source code is text, it
needs to be tokenized and transformed into a vector before being fed to the model.
• Model Building: Using the tokenized data and labels to train a deep learning classifier. Many different classifiers
have been shown to work for text input (RNNs and CNNs). Most of these models can be built using TensorFlow.
• Defect Detection: Unlabelled instances (i.e., line of codes or files) are fed to the trained model that will classify
them as buggy or clean.
Evaluation Metrics
Metrics, i.e., Precision, Recall, and F1, are widely used to measure the performance of defect prediction models. Here
is a brief introduction:

Precision = true positive / (true positive + false positive)    (1)

Recall = true positive / (true positive + false negative)    (2)

F1 = 2 * Precision * Recall / (Precision + Recall)    (3)
These metrics rely on four main numbers: true positive, false positive, true negative, and false negative. True positive
is the number of predicted defective instances that are truly defective, while false positive is the number of predicted
defective instances that are actually not defective. True negative is the number of predicted non-defective instances that
are actually non-defective, while false negative is the number of predicted non-defective instances that are actually defective.
F1 is the harmonic mean of precision and recall.
These metrics are threshold-dependent and are not the best way to evaluate binary classifiers. In this project, we will
also use the Receiver operating characteristic curve (ROC curve) and its associated metric, Area under the ROC curve
(AUC), to evaluate our trained models independently of any threshold. The ROC curve is created by plotting the
true positive rate (or recall, see definition above) against the false positive rate at various threshold settings.

False positive rate = false positive / (false positive + true negative)    (4)
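As a concrete illustration, here is a small worked example with made-up counts (for illustration only, not results from the project data) showing how these metrics are computed:

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 40, 10, 900, 50

precision = tp / (tp + fp)                            # 40 / 50  = 0.80
recall = tp / (tp + fn)                               # 40 / 90  ~ 0.44
f1 = 2 * precision * recall / (precision + recall)    # ~ 0.57
false_positive_rate = fp / (fp + tn)                  # 10 / 910 ~ 0.011
print(precision, recall, f1, false_positive_rate)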
(I)- Using TensorFlow to build a simple classification model
Part I will guide you through building a simple bidirectional LSTM model, while part II and III will let you explore
different ways to improve it.
CS Linux Servers have the environment ready to use. The following instructions assume using one of these machines
unless stated otherwise.
(We’ve tested on mc18.cs.purdue.edu and cuda.cs.purdue.edu. Other mc machines may or may not work)
The environment uses Python 3 and virtualenv. For more information on how to use virtualenv, please look at the
virtualenv documentation (https://virtualenv.pypa.io/en/latest/userguide/)
source /homes/cs510/project-3/venv/bin/activate
*If you work on your own machine, after you have created your virtualenv and activated it, you can install the required
libraries using the requirements.txt file we provided:
library using the requirements.txt file we provided:
pip install --upgrade pip
pip install -r requirements.txt
0.1 Load the Input Data:
Since the dataset is quite large (9GB uncompressed), we put it in /homes/cs510/project-3/data folder on the servers.
You can also download it from https://drive.google.com/file/d/1MTBAQ-Nw2yPr8drU-cQPae17eSHvz-4j if you want
to work on your own machine.
If you prefer to work on your own machine, you will need to download the data and update the path in tokenization.py
The training, validation and test data are made available in pickled Pandas dataframes, respectively in train.pickle,
valid.pickle, and test.pickle
The pandas dataframes consist of 4 columns:
• instance: the line under test
• context_before: the context of the line under test right before the line. In this question, the context_before
consists of all the lines in the function before the tested line.
• context_after: the context of the line under test right after the line. In this question, the context_after consists
of all the lines in the function after the tested line.
• is_buggy: the label of the line tested. 0 means the line is not buggy, 1 means the line is buggy.
The first step is to load the data and tokenize it. To load the data, use the following code (modify the paths if necessary):
import pickle
import javalang   # used by the tokenizer below

# Load the data:
with open('data/train.pickle', 'rb') as handle:
    train = pickle.load(handle)
with open('data/valid.pickle', 'rb') as handle:
    valid = pickle.load(handle)
with open('data/test.pickle', 'rb') as handle:
    test = pickle.load(handle)
The custom tokenizer implemented in tokenization.py is a basic java tokenizer from the javalang library (https://
github.com/c2nes/javalang) that is enhanced to also abstract string literals and numbers different from 0 and 1.
# Tokenize and shape our input:
def custom_tokenize(string):
    try:
        tokens = list(javalang.tokenizer.tokenize(string))
    except:
        return []
    values = []
    for token in tokens:
        # Abstract strings
        if '"' in token.value or "'" in token.value:
            values.append('$STRING$')
        # Abstract numbers (except 0 and 1)
        elif token.value.isdigit() and int(token.value) > 1:
            values.append('$NUMBER$')
        # otherwise: get the value
        else:
            values.append(token.value)
    return values

def tokenize_df(df):
    df['instance'] = df['instance'].apply(lambda x: custom_tokenize(x))
    df['context_before'] = df['context_before'].apply(lambda x: custom_tokenize(x))
    df['context_after'] = df['context_after'].apply(lambda x: custom_tokenize(x))
    return df

test = tokenize_df(test)
train = tokenize_df(train)
valid = tokenize_df(valid)

with open('data/tokenized_train.pickle', 'wb') as handle:
    pickle.dump(train, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('data/tokenized_valid.pickle', 'wb') as handle:
    pickle.dump(valid, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('data/tokenized_test.pickle', 'wb') as handle:
    pickle.dump(test, handle, protocol=pickle.HIGHEST_PROTOCOL)
Loading the data and tokenizing it can be done by running the script:
python tokenization.py
The tokenized dataset will be saved in the data folder under proj-skeleton (not the data folder under /homes/cs510/project-3). You can change this if necessary. The tokenization should take about 80 minutes.
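If you want to sanity-check the tokenization output before moving on, a minimal sketch (assuming the default output paths above) is:

import pickle

# Quick sanity check of the tokenized training data
with open('data/tokenized_train.pickle', 'rb') as handle:
    train = pickle.load(handle)
print(train.shape)                        # number of rows and columns
print(train['instance'].iloc[0])          # first tokenized line under test
print(train['is_buggy'].value_counts())   # label distribution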
0.2 Preprocessing data
Once we have the tokenized data, we need to transform them into vectors before feeding them to the deep learning model.
This part can be done by running the script:
python preprocess.py
It will do the transformation and save the transformed data (x_train.pickle, etc.) under the data folder.
For this question, we represent each instance as one vector of tokens:

tokenized context_before, <START>, tokenized line under test, <END>, tokenized context_after

The tokens <START> and <END> mark where the line under test starts and ends.
For this question, we will only keep 50,000 training instances to save time. You can try to use a larger dataset (1 million
instances or more) in parts II-IV.
Loading tokenized data and reshaping the input:

import pickle

# Loading tokenized data
with open('data/tokenized_train.pickle', 'rb') as handle:
    train = pickle.load(handle)
with open('data/tokenized_valid.pickle', 'rb') as handle:
    valid = pickle.load(handle)
with open('data/tokenized_test.pickle', 'rb') as handle:
    test = pickle.load(handle)

# Reshape instances: context_before <START> instance <END> context_after
def reshape_instances(df):
    df["input"] = df["context_before"].apply(lambda x: " ".join(x)) + " <START> " + \
                  df["instance"].apply(lambda x: " ".join(x)) + " <END> " + \
                  df["context_after"].apply(lambda x: " ".join(x))
    X_df = []
    Y_df = []
    for index, rows in df.iterrows():
        X_df.append(rows.input)
        Y_df.append(rows.is_buggy)
    return X_df, Y_df

X_train, Y_train = reshape_instances(train)
X_test, Y_test = reshape_instances(test)
X_valid, Y_valid = reshape_instances(valid)

X_train = X_train[:50000]
Y_train = Y_train[:50000]
X_test = X_test[:25000]
Y_test = Y_test[:25000]
X_valid = X_valid[:25000]
Y_valid = Y_valid[:25000]
Since the deep learning model takes a fixed-length vector of numbers as input, we use the training set to build a vocabulary
that maps each token to a number. Then we encode our training, testing and validation instances and create vectors
of fixed length representing the encoded instances. We limit the size of an instance to 1,000 tokens. In parts II-IV, you
might want to experiment with different vector sizes.
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build vocabulary and encoder from the training instances
maxlen = 1000
vocabulary_set = set()
for data in X_train:
    vocabulary_set.update(data.split())
vocab_size = len(vocabulary_set)
print(vocab_size)

# Encode training, valid and test instances
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

def encode(text):
    encoded_text = encoder.encode(text)
    return encoded_text

X_train = list(map(lambda x: encode(x), X_train))
X_test = list(map(lambda x: encode(x), X_test))
X_valid = list(map(lambda x: encode(x), X_valid))

# Pad/truncate every instance to a fixed length of maxlen tokens
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
X_valid = pad_sequences(X_valid, maxlen=maxlen)
0.3 Training the model
Training and evaluation of the model is done by train_and_test.py
For our first model, we will train a two-layer bidirectional RNN using LSTM layers. RNNs are known
to work well with text data. A tutorial showing how to create a basic RNN model with TensorFlow is available
at https://www.tensorflow.org/tutorials/text/text_classification_rnn
Our model is defined as follows:
import tensorflow as tf

# Model Definition
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

model.summary()
Since the data is pretty large, we might not be able to fit an embedding for the entire dataset in memory. Therefore, we
need to build a batch generator to generate the embedding for the input data on the fly.
import numpy as np
from tensorflow.keras.utils import Sequence

# Building generators
class CustomGenerator(Sequence):
    def __init__(self, text, labels, batch_size, num_steps=None):
        self.text, self.labels = text, labels
        self.batch_size = batch_size
        self.len = np.ceil(len(self.text) / float(self.batch_size)).astype(np.int64)
        if num_steps:
            self.len = min(num_steps, self.len)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        batch_x = self.text[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

batch_size = 32  # example value; pick a batch size that fits in memory
train_gen = CustomGenerator(X_train, Y_train, batch_size)
valid_gen = CustomGenerator(X_valid, Y_valid, batch_size)
test_gen = CustomGenerator(X_test, Y_test, batch_size)
We feed this data generator and start training the model as shown below:
from tensorflow.keras.callbacks import ModelCheckpoint

# Training the model
checkpointer = ModelCheckpoint('data/models/model-{epoch:02d}-{val_loss:.5f}.hdf5',
                               monitor='val_loss',
                               verbose=1,
                               save_best_only=True,
                               mode='min')

callback_list = [checkpointer]  # , reduce_lr

his1 = model.fit_generator(
    generator=train_gen,
    epochs=1,
    validation_data=valid_gen,
    callbacks=callback_list)
0.4 Evaluating the model
Once the model is trained, we evaluate it on the test set. predict_generator will output, for each instance, the probability
that it is buggy.
Traditionally, instances are then classified in class 0 (i.e., clean) if the probability is lower than 50%, and in
class 1 (i.e., buggy) if the probability is higher. However, the 50% threshold might not be the best choice, and
a different threshold might provide better results. Therefore, to take the impact of the threshold into consideration, we
draw the ROC curve and use the AUC (area under the curve) metric to measure the correctness of our classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

predIdxs = model.predict_generator(test_gen, verbose=1)

fpr, tpr, _ = roc_curve(Y_test, predIdxs)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.savefig('auc_model.png')
For Part (I), please include auc_model.png in your report, and measure the buggy rate (i.e., % of instances
labeled 1) in the training, validation and test instances.
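One possible way to measure the buggy rate is the following sketch, which assumes the Y_train, Y_valid, and Y_test label lists produced by the preprocessing step above:

# Buggy rate = fraction of instances labeled 1 (assumes Y_train, Y_valid, Y_test from preprocess.py)
for name, labels in [('train', Y_train), ('valid', Y_valid), ('test', Y_test)]:
    buggy_rate = sum(labels) / len(labels)
    print('%s buggy rate: %.2f%%' % (name, 100.0 * buggy_rate))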
(II)- Improving the results by using a better deep-learning algorithm
The model trained in part (I) is simple and does not perform very well. In the past few years, many different models
to classify text inputs for diverse tasks (content tagging, sentiment analysis, translation, etc.) have been proposed in
the literature. In part (II), you will look at the literature and apply a different deep-learning algorithm to do defect
prediction. You can, and are encouraged to, use or adapt models that have been proposed by other people for other tasks.
Please cite your source and provide a link to a paper and/or GitHub repository showing that this algorithm has been
applied successfully for text classification, modeling or generation tasks.
Examples of models to try:
Hierarchical Attention Networks for Document Classification:
https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf and https://github.com/richliao/textClassifier
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN: https://arxiv.org/pdf/1803.04831.pdf and https://github.com/titu1994/Keras-IndRNN
Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems: https://arxiv.org/pdf/1512.08756.pdf and https://github.com/ShawnyXiao/TextClassification-Keras
You can also look at more complex models like BERT, ELMo or XLNet.
You can search GitHub for text-classification models and pick the one you like!
We strongly recommend that you do not implement the CNN-only models from this paper: https://arxiv.org/abs/1408.5882. We have extensively tested this model for our specific task and we already know it does
not work well.
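As a purely illustrative sketch (not one of the published models listed above, and not a tested recommendation), a different architecture can be tried by swapping out the model definition from Part (I) while keeping the rest of the pipeline, for example a stacked bidirectional GRU with global max pooling:

import tensorflow as tf

# Sketch of an alternative architecture: bidirectional GRU + global max pooling.
# encoder.vocab_size and the data pipeline are assumed to be the same as in Part (I).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64, return_sequences=True)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

Whichever model you choose, remember that you must cite a paper and/or repository showing it has been applied to text tasks, as described in the report requirements below.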
Report: For this question, please put in your report the model you chose, a link to the paper and/or GitHub repository
where you got the model, a small discussion why you chose to try this model, your source code, and an evaluation of
your trained model on the test set (AUC and ROC curve). If you get any improvement compared to the model used in
part I, please report it too.
If the model you pick is too complex and takes too long to train on the entire training set, please provide an explanation
indicating how much time the model would take to train on the entire dataset and only train your model on a sample of
the dataset.
(III)- Other ways to improve the results
In this question, you will try to improve the model you worked with in Part II using different methods. Chose at least
two of the methods below to try to improve the results you got in part II: Report which methods you use and its
impact on the results and training time.
Use more training data: In part (I), we only used 50,000 instances to train our model. You can try to train your
model with the entire training set instead. Based on our experience, using 1 million instances produces a much higher AUC
than using 50,000 instances. Generally, the more training instances, the higher the AUC, until it saturates. The constraint
is machine time.
Data cleaning: The input data we provided is automatically extracted from GitHub and likely contains a lot of noise.
To improve the results, one possibility is to clean the datasets. You can investigate the raw data a bit more and try to
clean the input data; a sketch of such checks follows the list below. Examples (non-exhaustive) of challenges to investigate and solve are:
• Duplicate instances: Are there any instances that are labeled both buggy and clean?
• Length of the input: What is the average length of an instance, and are there any outliers? Does removing outliers
improve the results?
• Quality of the input: Comments have not been removed from the inputs. Does removing comments help to
improve the results?
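A minimal sketch of such checks, assuming the raw (un-tokenized) pandas dataframes and the column names described earlier, could look like:

import pickle

# Load the raw training data (path as in the loading code above)
with open('data/train.pickle', 'rb') as handle:
    train = pickle.load(handle)

# Conflicting duplicates: identical instances labeled both buggy (1) and clean (0)
labels_per_instance = train.groupby('instance')['is_buggy'].nunique()
print('instances with conflicting labels:', (labels_per_instance > 1).sum())

# Length outliers: distribution of instance length (in characters here; token counts also work)
lengths = train['instance'].str.len()
print(lengths.describe())
cutoff = lengths.quantile(0.99)   # one possible (arbitrary) outlier cutoff
cleaned = train[lengths <= cutoff]
print('kept %d of %d rows after removing length outliers' % (len(cleaned), len(train)))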
Tokenization and input abstraction:
In this project, we use a simple tokenization using a Java tokenizer and a basic abstraction of strings and numbers. This
has the inconvenience of creating a gigantic vocabulary that might be difficult for a deep learning network to learn. Many
different tokenizers or abstractions can be tried (a subword-tokenization sketch follows this list):
• Source code contains structured information that could help abstract data to reduce the vocabulary size. For
example, all variables could be abstracted to the same token variable, all method calls to the token methodcall,
types to type, etc. You can also distinguish between different variables in the same instance by abstracting different
variables with slightly different tokens (e.g., var1, var2, etc.). Such information can be extracted from an AST or
a Java parser (the javalang library contains a basic AST parser that could be used). Using such an abstraction will
significantly reduce the vocabulary and might help the algorithm to learn.
• Subword tokenizers have been used in NLP. You can try tokenizers like SentencePiece (https://github.com/google/sentencepiece) or word pieces (https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder).
• You can also build your own tokenizer.
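For example, a subword vocabulary can be built with the SubwordTextEncoder mentioned above in place of the TokenTextEncoder used in Part (I). This is only a sketch: it assumes the X_train list of instance strings from the preprocessing step, and the target vocabulary size is an arbitrary example value.

import tensorflow_datasets as tfds

# Sketch: build a subword encoder from the training instances (replaces TokenTextEncoder)
subword_encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (text for text in X_train), target_vocab_size=2**13)
print('subword vocabulary size:', subword_encoder.vocab_size)

# Encode instances with the subword encoder (padding is still needed afterwards)
X_train_encoded = [subword_encoder.encode(text) for text in X_train]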
Context representation: In part (I), the context is represented as a sequence of tokens from the entire function. In
addition, both the context and the line under test are represented similarly and fed as one input. This might not be the
best way to represent the context of a bug. You can propose a different approach to represent the context of a bug (a two-input sketch follows this list):
• You can try to represent the context differently (e.g., use a higher-level abstraction, only use a set of tokens instead
of a sequence)
• In this project, we use the entire function as context. This provides a lot of information, but it likely also contains
a lot of noise (e.g. irrelevant statements). You can try to use a different context (e.g., reduce the context to only
consider the basic block surrounding the line under test).
• You can try to feed the context and the instance under test as different inputs.
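As an illustration of the last point, here is a minimal two-input sketch using the Keras functional API. It assumes you encode and pad the context and the line under test separately into arrays of length maxlen_context and maxlen_line, and that vocab_size comes from your own vocabulary; these names are placeholders, not part of the provided skeleton.

import tensorflow as tf

# Sketch: feed the context and the line under test as two separate inputs
context_in = tf.keras.Input(shape=(maxlen_context,), name='context')
line_in = tf.keras.Input(shape=(maxlen_line,), name='line')

# A shared embedding layer, followed by one bidirectional LSTM per input
embedding = tf.keras.layers.Embedding(vocab_size, 64)
context_vec = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedding(context_in))
line_vec = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedding(line_in))

# Merge both representations and classify
merged = tf.keras.layers.concatenate([context_vec, line_vec])
hidden = tf.keras.layers.Dense(64, activation='relu')(merged)
output = tf.keras.layers.Dense(1, activation='sigmoid')(hidden)

model = tf.keras.Model(inputs=[context_in, line_in], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])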
Tuning and Building Deeper models: Deep learning models contain a lot of hyper-parameters that can be tuned (e.g.,
number of epochs trained, number and size of layers, dropout rate, learning rate, etc.). Using different hyper-parameters
can lead to very different results. One way to improve the results of a classifier is to pick the "best" hyper-parameters
by tuning the model.
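One very simple (and slow) way to explore a few hyper-parameters is a small grid search around the Part (I) pipeline. This sketch assumes a hypothetical build_model(units, dropout) helper that you write yourself to return a compiled model for the given settings, plus the generators and validation labels defined earlier.

from sklearn.metrics import roc_curve, auc

# Sketch: naive grid search over two hyper-parameters (build_model is a hypothetical helper)
best_auc, best_config = 0.0, None
for units in [32, 64, 128]:
    for dropout in [0.3, 0.5]:
        model = build_model(units=units, dropout=dropout)
        model.fit_generator(generator=train_gen, epochs=1, validation_data=valid_gen)
        preds = model.predict_generator(valid_gen, verbose=0)
        fpr, tpr, _ = roc_curve(Y_valid[:len(preds)], preds)
        score = auc(fpr, tpr)
        if score > best_auc:
            best_auc, best_config = score, (units, dropout)
print('best validation AUC %.3f with (units, dropout) = %s' % (best_auc, str(best_config)))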
Using different Learning methods: Sometimes, learning from one model and one dataset is not enough to achieve
good results. There are several possibilities to improve the models (an ensemble sketch follows this list):
• Use pre-trained embeddings to have a better source code representation. Much work has been done to represent
source code from a very large corpus. Instead of training our embedding layers from our limited training data, you
could use a pre-trained embedding (e.g., such as the ones proposed in code2seq https://github.com/tech-srl/code2seq) or train your own embedding (e.g., GloVe or Word2Vec) before training the classifier.
• It is easier to learn from simple instances first. Curriculum Learning has been proposed to help learn easier
instances first (https://arxiv.org/abs/1904.03626).
• Use ensemble learning. One model might not be enough to learn all buggy lines. Instead of building one single
model, a combination of several smaller models (trained with different training data or using different hyper-parameters) might provide better performance.
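A minimal ensemble sketch, assuming you have already trained several models with the Part (I) pipeline (here called model_a, model_b, and model_c, hypothetical names), is to average their predicted probabilities:

import numpy as np
from sklearn.metrics import roc_curve, auc

# Sketch: average the predicted probabilities of several trained models (hypothetical names)
preds_a = model_a.predict_generator(test_gen, verbose=0)
preds_b = model_b.predict_generator(test_gen, verbose=0)
preds_c = model_c.predict_generator(test_gen, verbose=0)
ensemble_preds = np.mean([preds_a, preds_b, preds_c], axis=0)

fpr, tpr, _ = roc_curve(Y_test[:len(ensemble_preds)], ensemble_preds)
print('ensemble AUC: %.3f' % auc(fpr, tpr))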
(IV)- Further improvements (competition) - for Bonus
You are also highly encouraged to improve the defect prediction models by using other techniques beyond the ones we
recommended or to try to combine all of them to further improve your model.
(Optional) Use GPU for your training
GPUs can drastically accelerate training. In this part, we will guide you to use tensorflow-gpu to train the
model.
Server cuda.cs.purdue.edu is equipped with 6 GPUs capable of deep learning. You should have access to this server.
However, most of the time its GPUs are occupied by others, which is out of our control. We highly recommend that
you consider using a GPU if you have access to one.
Use GPU of cuda.cs.purdue.edu
We have an environment ready to use on this server. Run nvidia-smi to check the availability of GPUs before you start. To
run training on this server using a GPU, follow these steps:
module load cuda/10.0
source /homes/cs510/project-3/venv-gpu/bin/activate
python train_and_test.py
You may get an OUT OF MEMORY error if no GPU is available at that time. Since we don't have control over the
server, we cannot guarantee your access to the GPU. You may try at a different time.
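If you want to confirm from Python that TensorFlow can see a GPU before launching a long training run, a quick check (the exact API may vary with your TensorFlow version) is:

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training will fall back to the CPU
print(tf.config.experimental.list_physical_devices('GPU'))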
Use your own GPU
If you have control over a machine with an Nvidia GPU, you may use tensorflow-gpu to accelerate your training (the
performance varies depending on the model).
Prerequisites
• Python3
• CUDA Toolkit 10.0 (https://developer.nvidia.com/cuda-10.0-download-archive)
• cuDNN (any version that is compatible with CUDA 10.0: https://developer.nvidia.com/cudnn)
Once you have met the prerequisites, you can create a virtualenv and use the provided requirements-gpu.txt to set up
your environment.
python3 -m venv path/to/venv
source path/to/venv/bin/activate
pip install --upgrade pip
pip install -r requirements-gpu.txt
Then, you should be ready to train your model on a large dataset faster.