This first tutorial explains what the data is and where to get it.
This project has been a huge success and we'd like to thank all of the participants, the winning team
Effsubsee, and the scientific members of the SETI Institute and IBM.
We are beginning to decommission this project. However, it will still be useful as a learning tool. The only real change is that the primary full data set will be removed; the
primary small and
primary medium data sets will remain.
We learned a lot at the hackathon on June 10-11 and decided to regenerate the primary data set. This is called the
v3 primary data set. The changes, compared with v2, are:

* the noise background is Gaussian white noise instead of noise from the Sun
* the signal amplitudes are higher and their characteristics should make the signals more distinguishable
* there are only 140k signals in the full set (20k per signal type), compared with 350k previously (50k per signal type)
The basic data set remains unchanged from before.
For the Code Challenge, you will be using what we've called the "primary" data set. The primary data set is:

* a labeled data set of 35,000 simulated signals
* 7 different labels, or "signal classifications"
* about 10 GB of data in total
This data set should be used to train your models.
As stated above, we no longer host the full 140,000-file data set (51 GB). All of the available data are found in the
primary medium data set below. Additionally, there are the
basic4 data set and the
primary small subset. They are explained below.
Each data file has a simple format:

* file name: <UUID>.dat
* a JSON header in the first line that contains:
  * UUID
  * signal_classification (label)
* followed by a stream of complex-valued time-series data
The ibmseti Python package is available to assist in reading this data and performing some basic operations for you.
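As a minimal illustration of the layout described above (not a replacement for the ibmseti package), the sketch below splits a simulation file into its JSON header and the raw payload. It assumes only what the format description states: the first line is JSON; the binary encoding of the complex samples that follow is something ibmseti knows how to decode, so here they are kept as opaque bytes.

```python
import json

def read_sim_file(path):
    """Split a simulation file into its JSON header and raw payload.

    Assumes the layout described above: a JSON header on the first
    line, followed by complex-valued time-series data. The payload's
    binary encoding is left to the ibmseti package to interpret.
    """
    with open(path, 'rb') as f:
        header_line = f.readline()   # first line: JSON header
        payload = f.read()           # remainder: raw time-series bytes
    header = json.loads(header_line.decode('utf-8'))
    return header, payload

# Hypothetical usage on a training file:
# header, payload = read_sim_file('some_uuid.dat')
# print(header['uuid'], header['signal_classification'])
```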
There is also a second, simple and clean data set that you may use for warmup, which we call the "basic" data set. This basic set should be used as a sanity check and for very early-stage prototyping. We recommend that everybody starts with this.
* only 4 different signal classifications
* 1000 simulation files for each class: 4000 files total
* available as a single zip file
* ~1 GB in total
The difference between the
primary and
basic data sets is that the signals simulated in the basic set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable, so you should be able to reach very high classification accuracy with the basic data set. The primary data set has smaller-amplitude signals whose classes can look more similar to each other, making high classification accuracy more difficult. There are also only 4 classes in the basic data set versus 7 in the primary set.
primary small is a subset of the full primary data set. Use it for early-stage prototyping.
primary medium was a subset of the full primary data set but now constitutes the entire available data set. You may want to consider ways to augment this data set in order to create more training samples. For example, you could split each file into 4 or 5 smaller files and build models that accept smaller inputs. You wouldn't be able to use such models to post scores to the Scoreboards, but it would be one way to generate more data. Finally, we hope to one day release the simulation code, which would allow you to generate your own data sets.
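The splitting idea can be sketched as follows, assuming the complex samples have already been decoded into a NumPy array (the function name and chunk counts here are illustrative, not part of any official tooling):

```python
import numpy as np

def split_time_series(samples, n_chunks=4):
    """Split one complex time series into n_chunks equal-length segments.

    Trailing samples that don't divide evenly are dropped so every
    segment has the same length -- a simple choice, not the only one.
    """
    chunk_len = len(samples) // n_chunks
    return [samples[i * chunk_len:(i + 1) * chunk_len]
            for i in range(n_chunks)]

# Example: a fake 10-sample complex signal split into 5 chunks of 2.
fake = np.arange(10, dtype=np.complex64)
chunks = split_time_series(fake, n_chunks=5)
```

Each chunk can then be treated as an independent (shorter) training sample for a model that accepts smaller inputs.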
For every data set there is an index file. That file is a CSV file in which each row holds the UUID and signal_classification (label) for one simulation file in the data set. You can use these index files in a few different ways, from keeping track of your downloads to facilitating parallelization of your analysis on Spark.
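For instance, a small sketch that loads an index file into a `{uuid: label}` dictionary, assuming each row is `uuid,signal_classification` as described above (adjust if the actual index file carries a header row):

```python
import csv

def load_index(csv_path):
    """Build a {uuid: label} mapping from an index CSV file.

    Assumes each row is `uuid,signal_classification`, per the
    description above.
    """
    labels = {}
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                labels[row[0]] = row[1]
    return labels
```

Such a mapping makes it easy to attach the correct label to each `<UUID>.dat` file as you read it.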
It's probably easiest to download these zip files, unzip them separately, and then move their contents into a single folder.
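One way to sketch that step with the standard library (the destination folder name is illustrative; pass whichever zip files you downloaded):

```python
import zipfile
from pathlib import Path

def unzip_all(zip_paths, dest='primary_dataset'):
    """Extract several zip archives into one destination folder.

    extractall() preserves any subdirectories stored inside each
    archive, so all contents end up under a single folder tree.
    """
    Path(dest).mkdir(parents=True, exist_ok=True)
    for zp in zip_paths:
        with zipfile.ZipFile(zp) as zz:
            zz.extractall(dest)
    return dest
```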
Once you've trained your model, done all of your cross-validation testing, and are ready to submit an entry to the contest, you'll need to download the test data set and score the test set data with your model.
The test data files are nearly the same as the training sets. The only difference is that the JSON header in each file does not contain the signal class. You can use the
ibmseti Python package to read each file, just as you would the training data. See Step_2_reading_SETI_code_challenge_data.ipynb for examples.
The primary_testset_preview_v3 data set contains 2414 test simulation files. Each data file is the same as the training data above except that the JSON header does NOT contain the 'signal_classification' key.
The primary_testset_final_v3 data set contains 2496 test simulation files, in the same format.
See the Judging Criteria notebook for information on submitting your test-set classifications.
If you're working with the IBM Watson Data Platform (or Data Science Experience), you can download the files programmatically with the
requests library, or similar, or simply call
curl or
wget from a Jupyter notebook cell using the appropriate shell command syntax. (This should work for both the IBM Spark service backend and the IBM Analytics Engine backend.) The shell command syntax is different for Python kernels versus Scala kernels; below we show the Python kernel way, assuming that the vast majority will use Python.
```python
# copy the link from above
# make sure to use -O <filename.zip> to name the output file
!wget https://ibm.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip -O primary_testset_preview_v3.zip
```
```python
!ls -al primary_testset_preview_v3.zip
```
```python
import zipfile

zz = zipfile.ZipFile('primary_testset_preview_v3.zip')
zz.namelist()[:10]  # peek at the first few archive members
```
```
['primary_testset_preview_v3/', 'primary_testset_preview_v3/0024bc94adff8627d661682329beacbe.dat', 'primary_testset_preview_v3/0037faaba996c1ed34d8e8fa51f649c3.dat', 'primary_testset_preview_v3/005a3820f3baedce982a2969d6098c5a.dat', 'primary_testset_preview_v3/0061997107af9768ad50cbd83b413021.dat', 'primary_testset_preview_v3/0067e25d70b0fcba15d45418bb00f0fb.dat', 'primary_testset_preview_v3/006c06174ff66b4a09fc40cfc202dbc9.dat', 'primary_testset_preview_v3/00754078d00d13f3be678057942fd89e.dat', 'primary_testset_preview_v3/00b3b8fdb14ce41f341dbe251f476093.dat', 'primary_testset_preview_v3/00cf2d57b2794eb650cd516d9fd602f3.dat']
```