Are you trying to figure out HOW to transfer and manipulate large Kaggle datasets for analysis?
Manipulating a large dataset can be constraining and time-consuming if it's not approached the right way.
Importing 100 MB into your Python notebook is fine: you can simply download the data to your local machine and start analyzing. But when the size grows to 1 GB or 2 GB, you need to proceed differently.
I’m going to share with you a step-by-step guide to transfer your data into a Google Cloud Storage bucket.
1- You need your Google Cloud project ID.
2- You also need your Kaggle API key; you can download your kaggle.json file from your Kaggle account settings (its format is shown below).
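For reference, the kaggle.json file you download from your account settings is a small JSON credential file that looks like this (the username and key below are placeholders):

{"username": "your-kaggle-username", "key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}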
How to get the desired dataset?
1- Connect to the Google Cloud Console and activate Cloud Shell.
2- Download the following bash script from GitHub.
3- Upload the two files (the bash script and your kaggle.json) to the instance provisioned by Cloud Shell; a sample command is shown after this list.
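If you prefer the command line over the Cloud Shell "Upload file" menu, you can push the two files from your local machine with the gcloud SDK. This is only a sketch, assuming the files sit in your Downloads folder and that the cloud-shell commands are available in your gcloud version:

# Run these on your local machine (not in Cloud Shell).
# The paths below are placeholders; adjust them to where you saved the files.
gcloud cloud-shell scp localhost:~/Downloads/download_from_kaggle.sh cloudshell:~/download_from_kaggle.sh
gcloud cloud-shell scp localhost:~/Downloads/kaggle.json cloudshell:~/kaggle.json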
The bash script is as follows:
#!/bin/bash
# usage: ./download_from_kaggle.sh dataset_name gcs_project_id

# Installing python3, pip and the dependencies
sudo apt update
sudo apt install -y python3 python3-dev python3-venv
sudo apt install -y wget unzip
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py

# Installing the Kaggle API
pip3 install kaggle

# Updating $PATH
export PATH=$PATH:/home/$USER/.local/bin

# Import kaggle.json to the VM and copy it to ~/.kaggle
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
echo "downloading dataset from: $1..."
kaggle datasets download -d "$1"

# Unzip it
echo "unzipping file..."
for file in *.zip
do
  unzip "$file" -d ./
done

# Creating the GCS bucket (with default setup)
echo "setting project and creating bucket on GCS..."
bucket_name="$2-dataset"
gcloud config set project "$2"
gsutil mb gs://$bucket_name/
echo "bucket created..."

# Upload it to GCS
echo "uploading to GCS..."
gsutil cp *.csv gs://$bucket_name
echo "upload complete..."

# Note that the VM needs to have GCS write access
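Once both files are in your Cloud Shell home directory, a typical invocation looks like this (the dataset slug and project ID below are placeholders; use your own):

# Make the script executable and run it with a Kaggle dataset slug and your project ID.
chmod +x download_from_kaggle.sh
./download_from_kaggle.sh owner/dataset-name my-gcp-project-id

# Verify that the CSV files landed in the bucket created by the script
# (the script names the bucket "<project-id>-dataset").
gsutil ls gs://my-gcp-project-id-dataset/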
A demo video is also available below.
I hope this is helpful and saves you time.