Downloading Datasets

Some students want to download notebooks and datasets and work on their personal computers. We do not recommend using a personal computer for your normal course work. The JupyterHub environment and our containers have been tested with the course materials and provide a standardized environment. Our faculty, TAs, and staff have access to your JupyterHub servers to debug issues. We do not have the staff resources to help students debug local installations, which vary considerably with your equipment, OS, versions of Python, R, and Jupyter, and any additional libraries you have installed.

However, if you do choose to do extra experimentation on your local computer, there are several options for downloading the datasets. Please be aware that we do not provide TA or instructor support for this.

WARNING: Do not try to download the entire dataset folder - it is 343 GB. Instead, download individual datasets as desired for your current work.
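
To see how large a dataset is before copying it, you can check its size from a JupyterHub terminal. The following is a minimal sketch; dataset_size is a hypothetical helper name, not a command provided by the course environment, and it relies only on the standard du utility.

```shell
# dataset_size is a hypothetical helper name used for illustration.
# It prints the total size of one dataset folder (or file).
dataset_size() {
    # -s: print a single total, -h: human-readable units (K, M, G)
    du -sh -- "$1"
}

# Example (run in a JupyterHub terminal):
#   dataset_size /dsa/data/all_datasets/baby-names
```

If the reported size is larger than you expected, consider copying only the individual files you need from that folder.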

Below we explain three methods you can use to download datasets.

Method 1: Copy and download
Probably the easiest way for beginners is to open a terminal in JupyterHub, execute a copy command, and then use the Jupyter interface to download the dataset. The copy command follows this pattern:
cp <path to the dataset> <path to the copy>
For instance, to copy the baby names dataset:
cp /dsa/data/all_datasets/baby-names/NationalNames1.csv NationalNames1.csv
Assuming you are in the root Jupyter directory when executing that command, you will find NationalNames1.csv in your root directory. Select it with the checkbox next to it and click the download button at the top. After downloading the dataset to your personal computer, remove the dataset from your Jupyter directory using the following command:
rm NationalNames1.csv
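
The copy and clean-up steps of Method 1 can also be wrapped in two small shell functions. This is only a sketch: stage_dataset and unstage_dataset are hypothetical names chosen for illustration, and the refusal to overwrite an existing file is an added safety check that is not part of the steps above.

```shell
# stage_dataset and unstage_dataset are hypothetical helper names.

# Copy one dataset file into the current directory so that it appears
# in the Jupyter file browser, then report its size.
stage_dataset() {
    src="$1"
    name=$(basename -- "$src")
    if [ -e "$name" ]; then
        # Added safety check: never overwrite a file that is already here.
        echo "refusing to overwrite existing file: $name" >&2
        return 1
    fi
    cp -- "$src" "$name" && ls -lh -- "$name"
}

# Remove the staged copy from the current directory once the download
# to your personal computer has finished.
unstage_dataset() {
    rm -- "$(basename -- "$1")"
}

# Example (run from your root Jupyter directory):
#   stage_dataset /dsa/data/all_datasets/baby-names/NationalNames1.csv
#   ... download NationalNames1.csv through the Jupyter interface ...
#   unstage_dataset NationalNames1.csv
```

Because unstage_dataset keeps only the file name from its argument, it removes the copy in the current directory and never touches the original under /dsa/data.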

Method 2: Terminal commands

  • This method requires that you are connected to the University VPN through Cisco Secure Client (AnyConnect): https://wiki.dsa.missouri.edu/installing-the-university-anyconnect-vpn/
  • Open a terminal on your local machine (desktop or laptop). On Windows, use PowerShell.
  • Execute a secure copy (scp) command that pulls the file from the lz jump server:
    scp yourpawprint@lz.dsa.missouri.edu:<path to dataset>  <path on your local machine>
    For example, to copy the baby names file to a folder called datasets in my Documents folder on my Windows machine, I would enter (all on one line):
    scp sebcq5@lz.dsa.missouri.edu:/dsa/data/all_datasets/baby-names/NationalNames1.csv C:/users/sebcq5/Documents/datasets
  • Enter your university password when prompted. The cursor will not move or show what you are typing. Press Enter when you are done.
  • The terminal will show the copy progress.

[Image: scpCopyProgress.png, the terminal showing scp copy progress]

Note: other protocols, such as sftp or rsync, are also available.
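
On macOS or Linux, the scp pattern from Method 2 could be wrapped in a small shell function. This is a sketch under assumptions: fetch_dataset and the DRY_RUN variable are hypothetical names, and the function assumes a POSIX shell such as bash, so it does not apply to PowerShell, where you would type the scp command directly.

```shell
# fetch_dataset and DRY_RUN are hypothetical names used for illustration.
# Usage: fetch_dataset <pawprint> <remote path to dataset> <local folder>
fetch_dataset() {
    # Build the remote source in the form pawprint@server:path
    remote="$1@lz.dsa.missouri.edu:$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command; do not connect to the server.
        echo "scp $remote $3"
    else
        # Requires the VPN connection; scp prompts for your password.
        scp "$remote" "$3"
    fi
}

# Preview the command without running it:
#   DRY_RUN=1 fetch_dataset yourpawprint /dsa/data/all_datasets/baby-names/NationalNames1.csv ./datasets
# Run the copy:
#   fetch_dataset yourpawprint /dsa/data/all_datasets/baby-names/NationalNames1.csv ./datasets
```

The dry-run mode is there so you can check the pawprint and paths before starting a transfer.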

Method 3: Client application
Various client applications can help with moving files. The example here uses the WinSCP client on a Windows machine, which is what I personally use; you can research others.

  • This method requires that you are connected to the University VPN through Cisco Secure Client (AnyConnect): https://wiki.dsa.missouri.edu/installing-the-university-anyconnect-vpn/
  • Download and install WinSCP: https://winscp.net/eng/index.php
  • In WinSCP, use the New Site option to set up a session for the lz.dsa.missouri.edu server with the SFTP protocol, port 22, and your university pawprint and password. Save the session under a name of your choosing for future use. Then highlight that named session and click the Login button to connect.
  • In the left-hand panel, navigate to the folder location on your machine where you wish to store your dataset.
  • In the right-hand panel, a drop-down box above the visible folders and files lets you navigate to the correct folder. Starting from the root folder, navigate to /dsa/data/all_datasets and then to the folder of the dataset you wish to download (for instance, /dsa/data/all_datasets/baby-names/ for the baby names dataset).
  • Drag and drop the dataset from the right-hand panel to the left-hand panel to download it.