In this notebook we'll be covering examples of using qsv's count
command.
This notebook uses qsv, an open-source CSV data wrangling toolkit available as a command line tool. You may learn more at https://github.com/jqnatividad/qsv.
!
in this Jupyter notebook environment to execute them, but should be removed when using Bash on a terminal.First, let's download qsv into our notebook from the releases page. We'll use qsv 0.111.0:
# Downloading the .zip file that contains qsv
!curl -LO https://github.com/jqnatividad/qsv/releases/download/0.111.0/qsv-0.111.0-x86_64-unknown-linux-gnu.zip
# Unzipping the .zip file into a folder
!unzip -o qsv-0.111.0-x86_64-unknown-linux-gnu.zip -d qsv-0.111.0-files
# Moving the qsv binary file from the folder into /bin to use the qsv command anywhere on our system
!cp qsv-0.111.0-files/qsv /bin
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 73.3M 100 73.3M 0 0 42.5M 0 0:00:01 0:00:01 --:--:-- 76.7M Archive: qsv-0.111.0-x86_64-unknown-linux-gnu.zip inflating: qsv-0.111.0-files/README inflating: qsv-0.111.0-files/qsv inflating: qsv-0.111.0-files/qsv_glibc-2.31 inflating: qsv-0.111.0-files/qsv_glibc-2.31_rust_version_info.txt inflating: qsv-0.111.0-files/qsv_nightly inflating: qsv-0.111.0-files/qsv_nightly_rust_version_info.txt inflating: qsv-0.111.0-files/qsvdp inflating: qsv-0.111.0-files/qsvdp_glibc-2.31 inflating: qsv-0.111.0-files/qsvdp_nightly inflating: qsv-0.111.0-files/qsvlite inflating: qsv-0.111.0-files/qsvlite_glibc-2.31 inflating: qsv-0.111.0-files/qsvlite_nightly
Here is the main CSV data set I'll be using:
Data set | Source | Download Link | Rounded size |
---|---|---|---|
Indicators of Anxiety or Depression Based on Reported Frequency of Symptoms During Last 7 Days | https://catalog.data.gov/dataset/indicators-of-anxiety-or-depression-based-on-reported-frequency-of-symptoms-during-last-7- | https://data.cdc.gov/api/views/8pt5-q6wp/rows.csv?accessType=DOWNLOAD | 2.1 MB |
Let's download the data set into our notebook as data.csv
.
# Downloading the .csv file as data.csv
!curl https://data.cdc.gov/api/views/8pt5-q6wp/rows.csv?accessType=DOWNLOAD -o data.csv
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2144k 0 2144k 0 0 1894k 0 --:--:-- 0:00:01 --:--:-- 1895k
qsv count
!¶Time to explore with qsv count
! Let's start by simply getting the help message for qsv count
.
--help
¶As with any qsv command, we'll use the --help
option to get the help message:
!qsv count --help
Prints a count of the number of records in the CSV data. Note that the count will not include the header row (unless --no-headers is given). For examples, see https://github.com/jqnatividad/qsv/blob/master/tests/test_count.rs. Usage: qsv count [options] [<input>] qsv count --help count options: -H, --human-readable Comma separate row count. --width Also return the length of the longest record. The count and width are separated by a semicolon. Common options: -h, --help Display this message -n, --no-headers When set, the first row will be included in the count.
qsv count
On Our CSV¶We may start by getting the default output for qsv count
by running it on our data set. This should get us the number of non-header records (rows) in our CSV:
!qsv count data.csv
13671
That's 13,671 non-header rows of data!
--human-readable
, -H
¶With the --human-readable
option (or its alias -H
), qsv should automatically add commas in the appropriate places to help us read the number better.
!qsv count data.csv --human-readable
13,671
--no-headers
, -n
¶What if we we want to also include the header row in the count (therefore counting all the rows in the CSV)?
We can use the --no-headers
(or -n
) option to include the header row in the count.
We should expect 13,672
as our output, including the commas by also using -H
.
!qsv count data.csv -n -H
13,672
--width
¶There's one more option that you might not expect.
What if we wanted to find out how long the longest row is in our data set, based on the number of characters it has?
The --width
option should return the length of the longest record
. The count and width are separated by a semicolon.
!qsv count data.csv --width
13671;237
--width
Option's Output¶The longest record has 237 characters. But you may have some questions about this width output:
--no-headers
option?Let's find out with this simple CSV file we'll name sample.csv
:
# Write our data to sample.csv
!echo 'letter,number' > sample.csv
!echo 'alpha,13' >> sample.csv
!echo 'beta,24' >> sample.csv
# Display the data from sample.csv
!cat sample.csv
letter,number alpha,13 beta,24
First let's use --width
without --no-headers
.
Our initial assumption is that if the headers are not included then we should get 8
as the width because there are 8
total characters in the row alpha,13
when you also include the comma ,
.
!qsv count sample.csv --width
2;9
Hmm... We get 9. Why is that?
In our CSV data, there's a sort of hidden character at the end of each row: the newline character \n
. This is included as a character in the width for our rows, so we simply add 1 to our estimate of 8. We can also see that the comma is included in the width output.
To further verify both of these claims, let's run the command with --no-headers
to try and include the header row in the width output. Based on what we've learned so far, we can expect that all characters including the commas ,
between field values and the newline \n
at the end of the longest row are included in the width output. So for the header row letter,number
we should expect a width of 12 + 1 + 1 = 14
:
!qsv count sample.csv --width --no-headers
3;14
Awesome! Now that you have a better understanding of qsv count
, try it out for yourself!
Let's say I want to write a sentence that dynamically includes the count of a CSV file within it. For example, I want to print out:
There are 1,000,000 non-header rows of data in the data set!
The 1,000,000
is arbitrary, that is, it should be the output from using qsv count
on a CSV file. Here's a Bash script using the echo
command we can use to achieve this:
!echo "There are $(qsv count data.csv -H) non-header rows of data in the data set!"
There are 13,671 non-header rows of data in the data set!
qsv count
on User's Input File Path¶Let's say we want to run a Python script where the user can simply enter the path to the CSV file (in our case we can just write data.csv
) and then get the output of running qsv count
on it. We can use the subprocess
module to run qsv
commands and print the output. Here's a sample script with comments to help understand how it works:
import subprocess
# Get user input for the CSV path
csv_path = input('Enter the path to your CSV file: ')
# Run qsv count on the CSV file with the -H option
command = ["qsv", "count", csv_path, "-H"]
# Get the qsv count output
subprocess_output = subprocess.run(command, capture_output=True)
# stdout - standard stream of output from our subprocess that runs count
# decode - convert the stdout output from bytes to string
# strip - remove any spaces/newline characters in the output
count = subprocess_output.stdout.decode().strip()
# Print the output of qsv count within a sentence
print(f"There are {count} non-header rows of data in the data set!")
Enter the path to your CSV file: data.csv There are 13,671 non-header rows of data in the data set!
With this script we can now simply provide the file path and then get the qsv count
output for it! Of course you may expand on this script with improvements such as:
stderr
.In this notebook we covered example usage of qsv count
for tallying the number of rows in a CSV file. We discussed all the options that are available for qsv count
, and we also went further to discover how qsv count
can be integrated in Bash and Python.