This document shows where to get the test data set, and explains how the judging will be done for the "Best Classification" system.
Read the Step 1: Get Data tutorial for information on where to get the training and test data sets.
Your job is to create a CSV file that holds the scores for each of the signals found in the test data set. Each line of your CSV "scorecard" will contain the signal's UUID, followed by a score for each signal class. (The signal class with the highest score is your model's class estimate. Typically, these scores are the probability estimates for each class.)
The test data files are nearly the same as the training sets. The only difference is that the JSON header in each file does not contain the signal class. You can use the ibmseti Python package to read each file, just as you would with the training data. See Step_2_reading_SETI_code_challenge_data.ipynb for examples.
For each data file in the test set, generate the appropriate spectrogram, and then pass that to your signal classifier to calculate the scores for each class.
For example, each line in the CSV file should look like:
abdefg-adbc12-23234-123123-cvaf, 0.1, 0.023, 0.451, 0.232, 0.001, 0.07, 0.0083
THE COLUMN ORDER OF THE NUMBERS IS ABSOLUTELY CRITICAL! THEY MUST BE THE SCORES FOR EACH CLASS, IN THIS ORDER:
uuid, brightpixel, narrowband, narrowbanddrd, noise, squarepulsednarrowband, squiggle, squigglesquarepulsednarrowband
import csv
import zipfile

import ibmseti
# ...plus whatever libraries your classifier needs (tensorflow, Watson, etc.)

my_model = ...       # placeholder: your trained signal classification model
mydatafolder = ...   # placeholder: path to the folder holding the test set zip

my_output_results = mydatafolder + '/signal_class_results.csv'

zz = zipfile.ZipFile(mydatafolder + '/' + 'primary_testset.zip')

with open(my_output_results, 'w') as csvfile:
    fwriter = csv.writer(csvfile, delimiter=',')

    for fn in zz.namelist():
        data = zz.open(fn).read()
        aca = ibmseti.compamp.SimCompamp(data)
        uuid = aca.header()['uuid']

        # Whatever signal processing code you need would go in your
        # `draw_spectrogram` code.
        spectrogram = draw_spectrogram(aca)

        # cr = class results. In this example, it's a dictionary. But in your
        # code it could be something else, like a simple list.
        cr = my_model.classify(spectrogram)

        fwriter.writerow([uuid, cr['brightpixel'], cr['narrowband'], cr['narrowbanddrd'],
                          cr['noise'], cr['squarepulsednarrowband'], cr['squiggle'],
                          cr['squigglesquarepulsednarrowband']])
In this contest, because we are using all of the scores, we'll use the log-loss function as the measure of your model's performance. The details will be found in an upcoming notebook. There will be a website that you use to submit the CSV file containing the scores for your team; this website will also contain a scoreboard. It will be implemented after the hackathon on June 11th -- more details to come.
You can download an example scorecard from Object Storage.
The scores in this example file are random values between 0 and 1. Typically, your scores will be your classification model's probability estimates for each class and, as such, should sum to 1.0. However, to be safe, the log-loss calculator in our Scoreboard will normalize your scores to ensure the values sum to 1.0.
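For reference, here is a minimal sketch of the standard multi-class log loss that such a calculator computes. The log_loss_row helper name and the true class used in the example are illustrative only; the official Scoreboard code will be in the upcoming notebook.

import numpy as np

def log_loss_row(scores, true_class_index):
    # Per-signal multi-class log loss: the negative log of the (normalized)
    # score assigned to the true class.
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()              # normalize, as the Scoreboard will do
    p = np.clip(p, 1e-15, 1.0)   # guard against log(0)
    return -np.log(p[true_class_index])

# Example: the scores from the sample line above, supposing the true class
# were narrowbanddrd (column index 2, not counting the UUID).
scores = [0.1, 0.023, 0.451, 0.232, 0.001, 0.07, 0.0083]
print(log_loss_row(scores, 2))

The overall score would then typically be the average of this quantity over all signals in the test set, with smaller values being better.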
If you are running this on IBM DSX (IBM Apache Spark), you'll need to get your .csv file to your local working directory in order to submit your results to the Scoreboard website. The best way to do this is to move your .csv file to the Object Storage account that is provided in DSX, then move the .csv file from your Object Storage to your local directory.
This tutorial shows you the basic steps to move data to and from your Object Storage instance.
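As a rough sketch of those upload and download steps, assuming you use the python-swiftclient package: all credential values and the container name below are hypothetical placeholders, so take the real ones from your Object Storage service credentials in DSX.

import swiftclient

# Hypothetical placeholder credentials; copy the real values from your
# Object Storage service credentials.
conn = swiftclient.Connection(
    authurl='https://identity.open.softlayer.com/v3',
    user='YOUR_USER_ID',
    key='YOUR_PASSWORD',
    auth_version='3',
    os_options={'project_id': 'YOUR_PROJECT_ID',
                'user_id': 'YOUR_USER_ID',
                'region_name': 'dallas'})

# In your DSX notebook: upload the scorecard to a container.
with open('signal_class_results.csv', 'rb') as f:
    conn.put_object('my_container', 'signal_class_results.csv', contents=f.read())

# On your local machine: download it from the same container.
hdrs, body = conn.get_object('my_container', 'signal_class_results.csv')
with open('signal_class_results.csv', 'wb') as f:
    f.write(body)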
We are currently constructing a Team Scoreboard. It will be used first at the hackathon. After the hackathon, we will make the Scoreboard available to code challenge participants and send you a notification when it is ready. We expect the Scoreboard to go live on June 15th! You will use this Scoreboard to submit your team's CSV results file.
The SETI Institute reserves the right to change the rules or add more rules at any time. This list will be updated as needed.
The code challenge ends at 00:00 on August 1, 2017, US West Coast time. All submissions to the Scoreboard must be made before then.
All results are subject to verification by the SETI Institute. The results of the top 3 teams will undergo further scrutiny to ensure classification accuracy. In the event of a tie, more test data will be provided. A further tie-breaking measure will be the time it takes to perform the classifications: the faster classifier will be chosen.
Your code must be open-sourced and licensed under the Apache License 2.0. This is required because the plan is to install the winning team's code in the SETI Institute's real-time data analysis pipeline.