How to transfer Kaggle dataset to Google Cloud Storage

Sharing is caring!

Are you trying to figure out HOW to manipulate and transfer large Kaggle datasets for analysis?

Manipulating a large dataset can be constraining and time-consuming if it isn’t approached the right way.

For a 100 MB file, it may be fine to download it to your local machine and start analyzing it in a Python notebook, but when the size reaches 1 GB or 2 GB you need to proceed differently.

I’m going to share with you a step-by-step guide to transferring your data into a Google Cloud Storage bucket.


1- You need your Google Cloud project ID.

2- You also need your Kaggle API key: download the kaggle.json file from your Kaggle account settings.
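
For reference, kaggle.json is a small JSON credential file of this shape (the values below are placeholders, not real credentials):

```json
{"username": "your_kaggle_username", "key": "xxxxxxxxxxxxxxxxxxxx"}
```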

How to get the desired dataset?

1- Connect to the Google Cloud Console and activate Cloud Shell.

2- Download the following bash script from GitHub.

3- Upload the two files (the bash script and kaggle.json) to the instance provisioned by Cloud Shell.
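
Once the upload finishes, a quick check in the Cloud Shell terminal confirms both files landed in your home directory. This is just a sketch: the script name transfer.sh is a placeholder, so match it to whatever you named your copy.

```shell
# Check that the credential file and the script were both uploaded
for f in kaggle.json transfer.sh; do
  if [ -e "$f" ]; then
    echo "$f: ok"
  else
    echo "$f: missing"
  fi
done
```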

The bash script is as follows:

#!/bin/bash
# Usage: ./<script_name>.sh <kaggle_dataset> <gcs_project_id>

# Install Python 3, pip, and the other dependencies
sudo apt update
sudo apt install -y python3 python3-dev python3-venv python3-pip
sudo apt install -y wget unzip

# Install the Kaggle API client
pip3 install kaggle

# Update $PATH so the kaggle command is found
export PATH=$PATH:/home/$USER/.local/bin

# Copy kaggle.json (uploaded to the VM beforehand) into ~/.kaggle
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
echo "downloading dataset from: $1..."
kaggle datasets download -d "$1"

# Unzip it
echo "unzipping file..."
for file in *.zip; do
    unzip "$file" -d ./
done

# Create a GCS bucket (with default settings); bucket names must be
# globally unique, so here one is derived from the dataset slug
bucket_name="${1#*/}-data"
echo "setting project and creating bucket on GCS..."
gcloud config set project "$2"
gsutil mb "gs://$bucket_name/"
echo "bucket created..."

# Upload the CSV files to GCS
echo "uploading to GCS..."
gsutil cp *.csv "gs://$bucket_name"

echo "upload complete..."
# Note that the VM needs GCS write access
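
To make the two positional arguments concrete, here is a small sketch of how they flow through the script. The dataset slug and project ID below are placeholders; one detail to watch is that the bucket name must be globally unique, and deriving it from the dataset slug is just one reasonable convention.

```shell
# Sketch: how the two arguments map to the names used in the script
dataset="owner/some-dataset"        # would be $1, the Kaggle dataset slug
project_id="my-gcp-project"         # would be $2, your GCP project ID
# Derive a bucket name from the slug; bucket names must be globally unique
bucket_name="${dataset#*/}-data"
echo "$bucket_name"                 # -> some-dataset-data
```

With the script saved as, say, transfer.sh, a run would then look like ./transfer.sh owner/some-dataset my-gcp-project (again, both values are placeholders).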

And a demo video is also available below:

I hope it’s helpful for you and saves you time.