# Welcome to the SETI Institute Code Challenge!¶

This first tutorial briefly explains what the data are and where to get them.

# Update 23 January 2018¶

This project has been a huge success, and we'd like to thank all of the participants, the winning team Effsubsee, and the scientific members of the SETI Institute and IBM.

We are beginning to decommission this project. However, it will still be useful as a learning tool. The only real change is that the primary full data set will be removed. The basic, primary small, and primary medium data sets will remain.

# Update 21 June 2017¶

We learned a lot at the hackathon on June 10-11th and decided to regenerate the primary data set. This is called the v3 primary data set. The changes, compared with v2, are:

* the noise background is Gaussian white noise instead of noise from the Sun
* the signal amplitudes are higher, and their characteristics should make them more distinguishable
* there are only 140k simulations in the full set (20k per signal type), compared with 350k previously (50k per signal type)

The basic data set remains unchanged from before.

# Introduction¶

For the Code Challenge, you will be using what we've called the "primary" data set. The primary data set is

* a labeled data set of 35,000 simulated signals
* 7 different labels, or "signal classifications"
* about 10 GB of data in total



This data set should be used to train your models.

As stated above, we no longer have the full 140,000-file data set (51 GB). All of the data are found in the primary medium data set below. Additionally, there are the basic4 data set and the primary small subset. They are explained below.

## Simple Data Format¶

Each data file has a simple format:

* file name = <UUID>.dat
* a JSON header in the first line that contains:
  * UUID
  * signal_classification (label)
* followed by a stream of complex-valued time-series data



The ibmseti Python package is available to assist in reading this data and performing some basic operations for you.
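Because the format is just a one-line JSON header followed by raw bytes, you can peek at a file with nothing but the standard library. A minimal sketch (the `read_sim_file` helper below is hypothetical and for illustration only; the ibmseti package does this for you, plus decoding the time series):

```python
import json

def read_sim_file(path):
    """Read a SETI simulation .dat file: a one-line JSON header
    followed by raw complex-valued time-series bytes."""
    with open(path, 'rb') as f:
        header = json.loads(f.readline().decode('utf-8'))
        raw_data = f.read()  # undecoded time-series bytes
    return header, raw_data
```

For training files, `header` would hold both the UUID and the signal_classification label; for test files, only the UUID.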

## Basic Warmup Data Set¶

There is also a second, simple and clean data set that you may use for warmup, which we call the "basic" data set. This basic set should be used as a sanity check and for very early-stage prototyping. We recommend that everybody start with this.

* Only 4 different signal classifications
* 1000 simulation files for each class: 4000 files total
* Available as single zip file
* ~1 GB in total.



### Basic Set versus Primary Set¶

The difference between the basic and primary data sets is that the signals simulated in the basic set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable, so you should be able to get very high signal classification accuracy with the basic data set. The primary data set has smaller-amplitude signals whose classes can look more similar to one another, making high classification accuracy more difficult to achieve. There are also only 4 classes in the basic data set, versus 7 classes in the primary set.

## Primary Data Sets¶

### Primary Small¶

The primary small data set is a subset of the full primary data set. Use it for early-stage prototyping.

* All 7 signal classifications
* 1000 simulations / class (7 classes = 7000 files)
* Available as single zip file
* ~2 GB in total

### Primary Medium¶

The primary medium was a subset of the full primary data set, but it now constitutes the entire data set. You may want to consider ways to augment this data set in order to create more training samples. Additionally, you could consider splitting each file into 4 or 5 smaller files and simply build models that accept smaller files. You wouldn't be able to use this approach to post scores to the Scoreboards, but it would be one way to generate more data. Finally, we hope to one day release the simulation code, which would allow you to generate your own data sets.

* All 7 signal classifications
* 5000 simulations / class (7 classes = 35000 files)
* Large enough for relatively robust model construction
* Available in 5 separate zip files
* ~10 GB in total
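The augmentation idea mentioned above, splitting each simulation into smaller segments, can be sketched in plain Python. The `split_series` helper is hypothetical, and it assumes you have already decoded a file's time series into a sequence of samples:

```python
def split_series(samples, n_parts=4):
    """Split one decoded time series into n_parts contiguous,
    equal-length segments (any remainder samples are dropped).
    `samples` can be any sequence, e.g. a list or array of
    complex samples."""
    seg = len(samples) // n_parts
    return [samples[i * seg:(i + 1) * seg] for i in range(n_parts)]
```

Each segment keeps the label of its parent file, turning one training sample into four or five shorter ones.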

## Index Files¶

For each data set there exists an index file. That file is a CSV file in which each row holds the UUID and signal_classification (label) for one simulation file in the data set. You can use these index files in a few different ways, from keeping track of your downloads to facilitating parallelization of your analysis on Spark.
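Loading an index file with the standard csv module can be sketched as follows. The `load_index` helper is hypothetical, and it assumes the two-column UUID, label layout described above:

```python
import csv

def load_index(path):
    """Parse an index CSV of `uuid,signal_classification` rows
    into a dict mapping UUID -> label."""
    labels = {}
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                labels[row[0]] = row[1]
    return labels
```

The resulting dict makes it easy to look up the label for a given .dat file name, or to shard the UUID list across Spark workers.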

## Direct Data URLs if you are working from outside of IBM Data Science Experience¶

### Basic¶

Data (1.1 GB)

Index File

### Primary Small¶

Data (1.9 GB)

Index File

### Primary Medium¶

Data Zip File 1 (1.9 GB)

Data Zip File 2 (1.9 GB)

Data Zip File 3 (1.9 GB)

Data Zip File 4 (1.9 GB)

Data Zip File 5 (1.9 GB)

Index File

It's probably easiest to download these zip files, unzip them separately, and then move the contents into a single folder.
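That merge step can be scripted with the standard zipfile module. The `merge_zips` helper below is hypothetical; it assumes each archive may wrap its files in its own top-level directory, so members are flattened by basename (file names are UUIDs, so collisions are not expected):

```python
import os
import zipfile

def merge_zips(zip_paths, dest_dir):
    """Extract every file from several zip archives into one
    flat destination folder."""
    os.makedirs(dest_dir, exist_ok=True)
    for zp in zip_paths:
        with zipfile.ZipFile(zp) as z:
            for member in z.namelist():
                name = os.path.basename(member)
                if not name:  # skip directory entries
                    continue
                with z.open(member) as src, \
                        open(os.path.join(dest_dir, name), 'wb') as dst:
                    dst.write(src.read())
```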

# Test Data Sets¶

Once you've trained your model, done all of your cross-validation testing, and are ready to submit an entry to the contest, you'll need to download the test data set and score the test set data with your model.

The test data files are nearly the same as the training files. The only difference is that the JSON header in each file does not contain the signal class. You can use the ibmseti Python package to read each file, just as you would the training data. See Step_2_reading_SETI_code_challenge_data.ipynb for examples.

## Preview Test Set¶

The primary_testset_preview_v3 data set contains 2414 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key.

* All 7 classes
* Roughly 340 simulations per class
* JSON header with UUID only
* Available as single zip file
* 665 MB in total

Preview Test Set Zip File

Preview Test Set Index File

## Final Test Set¶

The primary_testset_final_v3 data set contains 2496 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key.

* All 7 classes
* Roughly 350 simulations per class
* JSON header with UUID only
* Available as single zip file
* 687 MB in total

Final Test Set Zip File

Final Test Set Index File

### Submitting Classification Results¶

See the Judging Criteria notebook for information on submitting your test-set classifications.

# Getting Data from IBM Spark service¶

If you're working with IBM Watson Data Platform (or Data Science Experience), you can use either wget or curl from a Jupyter notebook cell, or you can use the requests library, or similar, to download the files programmatically. (This should work for both the IBM Spark service backend and the IBM Analytics Engine backend.) Simply call wget from the shell using the appropriate shell command syntax; the syntax differs between Python kernels and Scala kernels. Below we show the Python kernel way, assuming that the vast majority of participants will use Python.
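If you'd rather download programmatically than shell out to wget, a streaming download can be sketched with only the standard library. The `download` helper below is hypothetical (the requests library works similarly); streaming avoids holding a multi-GB zip in memory:

```python
import shutil
import urllib.request

def download(url, dest):
    """Stream the file at `url` to the local path `dest`
    without loading the whole response into memory."""
    with urllib.request.urlopen(url) as resp, open(dest, 'wb') as out:
        shutil.copyfileobj(resp, out)
```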

In [33]:
#copy link from above.
#make sure to use the -O <filename.zip> to redirect the output

!wget https://ibm.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip -O primary_testset_preview_v3.zip

--2018-01-23 15:14:33--  https://ibm.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip
Resolving ibm.box.com... 107.152.25.197, 107.152.24.197
Connecting to ibm.box.com|107.152.25.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip [following]
--2018-01-23 15:14:34--  https://ibm.ent.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip
Resolving ibm.ent.box.com... 107.152.24.211
Connecting to ibm.ent.box.com|107.152.24.211|:443... connected.
HTTP request sent, awaiting response... 302 Found
Resolving public.boxcloud.com... 107.152.24.200, 107.152.25.200
Connecting to public.boxcloud.com|107.152.24.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 697495727 (665M) [application/zip]
Saving to: ‘primary_testset_preview_v3.zip’

primary_testset_pre 100%[===================>] 665.18M  4.13MB/s    in 2m 58s

2018-01-23 15:17:33 (3.73 MB/s) - ‘primary_testset_preview_v3.zip’ saved [697495727/697495727]


In [34]:
!ls -al primary_testset_preview_v3.zip

-rw-r--r--@ 1 adamcox  staff  697495727 Jan 23 15:17 primary_testset_preview_v3.zip

In [35]:
import zipfile
zz = zipfile.ZipFile('primary_testset_preview_v3.zip')

In [36]:
zz.namelist()[:10]

Out[36]:
['primary_testset_preview_v3/',
'primary_testset_preview_v3/00cf2d57b2794eb650cd516d9fd602f3.dat']