Training Projects

From DeepSense Docs
Revision as of 15:20, 26 August 2021 by Bgeetika (talk | contribs)

DeepSense has compiled a few datasets for students and others interested in the ocean and AI, so that they have the opportunity to complete AI projects independently. We hope participants will learn about a specific type of ocean-related data and gain experience with a concrete AI project. Participants are expected to work on each project on their own, but we have provided guidance in the form of notebooks, data, outputs, and models to try to improve upon.

We have found that the data cleaning step can take a long time, so we hope these datasets are clean enough to let participants focus on exploring ocean AI.

Beginner Level

1. Python Basics

Before exploring the AI training projects, we recommend going through the Python libraries required for machine learning projects. This will help you follow the other projects more effectively. This section introduces the essential Python libraries for an AI project, and the attached notebook shows how they are used in practice.


Link for Notebook

Introduction to Python Libraries

Intermediate Level

2. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach to analyzing or investigating a dataset using statistical graphics and other data visualization methods. The analysis may include handling missing values, maximizing insight into the dataset, discovering patterns, extracting important variables, detecting outliers and anomalies, finding interesting relationships among the variables, testing hypotheses, and checking assumptions. Drawing reliable conclusions from a massive quantity of data by just glancing over it is very difficult or almost impossible; instead, you have to look at it carefully through an analytical lens.
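As a rough illustration of two of the checks listed above (counting missing values and flagging outliers), here is a minimal sketch using only the Python standard library. It is not taken from the linked notebook. The function name summarize, the 1.5 × IQR outlier rule, and the sample values are assumptions made for this example.

```python
import statistics

def summarize(values):
    """Summarize one numeric column in which missing entries are None."""
    present = [v for v in values if v is not None]
    # Quartiles of the observed values (default "exclusive" method)
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Common rule of thumb: flag points more than 1.5 * IQR outside the quartiles
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "outliers": [v for v in present if v < low or v > high],
    }

# Made-up wave heights in metres, for illustration only (not real buoy data)
wave_height_m = [1.2, 1.4, None, 1.3, 1.5, 9.8, 1.1, None, 1.6]
summary = summarize(wave_height_m)
print(summary)
```

In a real project you would more likely use pandas and a plotting library for these steps; the point here is only to show what "missing values" and "outliers" mean in code.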


Download Instructions

Steps to do EDA

Link for Notebook

3. Text Cleaning

The purpose of this project is to show you how to clean text before using it for NLP problems. In this project, I use tweets posted by users on World Oceans Day, June 8, 2021. I created a Python script to pull the tweets and then show how to clean the raw data. Notebooks are attached for your reference.
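To give a feel for what tweet cleaning typically involves, here is a minimal sketch using Python's built-in re module. It is not the script from the linked notebook. The function name clean_tweet, the particular cleaning steps, and the sample tweet are assumptions chosen for illustration.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"\brt\b", " ", text)                 # remove the standalone retweet marker "rt"
    text = text.replace("#", "")                        # keep the hashtag word, drop the symbol
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop punctuation, emoji, other symbols
    return " ".join(text.split())                       # collapse repeated whitespace

# Made-up example tweet, for illustration only
raw = "RT @OceanFan: Happy #WorldOceansDay!! Protect our seas 🌊 https://example.org/abc"
cleaned = clean_tweet(raw)
print(cleaned)  # happy worldoceansday protect our seas
```

Which steps you keep depends on the downstream task; for example, emoji and hashtags can carry sentiment, so you may prefer to keep them for sentiment analysis.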

Steps for Tweets Extraction

Steps for Text Cleaning

Link for Notebook


You can practice data cleaning on the same dataset, which is attached here. If you want to see how to pull tweets from Twitter, follow the instructions in the Steps for Tweets Extraction and Steps for Text Cleaning links.

Advanced Level

4. Object Detection

We used the Google Open Images database to obtain approximately 650 images of starfish. The images were already separated into train, test, and validation sets. The metadata linked below covers only the starfish images, not the entire dataset. It includes the coordinates of bounding boxes around the starfish.

Google Drive Directory

This contains the files:

Starfish Dataset

Training metadata

Test metadata

Validation metadata

If you want to download other categories of images from the Open Images database, you can do so by following the instructions here:

Download Instructions

After you have the datasets, you can download and install YOLO v4 using the following instructions:

Installation Instructions

Configuration Instructions

Metadata Conversion Script
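The linked conversion script is not reproduced here. As a rough sketch of what such a conversion involves, the example below assumes the metadata follows the usual Open Images convention (XMin, XMax, YMin, YMax, each normalized to the range 0 to 1) and that the target is the Darknet/YOLO label format (class index, box centre x, box centre y, width, height, all normalized). The function name open_images_to_yolo and the sample coordinates are hypothetical.

```python
def open_images_to_yolo(x_min, x_max, y_min, y_max, class_id=0):
    """Convert one normalized corner-style box to a YOLO label line.

    Assumes all four coordinates are already normalized to [0, 1].
    """
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width = x_max - x_min
    height = y_max - y_min
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Made-up box, for illustration only; class 0 stands for "starfish" here
line = open_images_to_yolo(0.25, 0.75, 0.1, 0.5)
print(line)  # 0 0.500000 0.300000 0.500000 0.400000
```

Each image would get one text file of such lines, one line per bounding box. Check the column names in your metadata files before relying on this layout.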

If you want to run this on Google Colab, check out the following wiki: Darknet Wiki. At the top there is a link to a Colab notebook and a video tutorial.

5. Regression

The buoy collects environmental measurements including wind speed and direction, surface temperature, current speed, wave height, and peak wave period. These wind and wave data are used to decide whether conditions allow the safe transfer of pilots and passage of vessels, which require a minimum depth of water that may not be met if the waves are too large. The Red Island Shoal Buoy is currently under maintenance. A prolonged period without accurate environmental measurements would significantly impair the ability to guide vessels safely. In this project, we try to predict the environmental measurements of the buoy under maintenance from the values of another operational buoy, so that the authorities can continue to allow the safe passage of vessels.

The task is to predict the values of one buoy using the parameters of another. In this project, we use the datasets of the Mouth of Placentia Bay Buoy, the Pilot Boarding Station / Red Island Shoal Buoy, and Placentia Bay: Ragged Islands – KLUMI (land station), all located in Newfoundland and Labrador.
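To make the idea of predicting one buoy from another concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. It is not the model from the linked notebooks, which may use different features and algorithms. The function name fit_line and all numbers are made up for illustration and are not real buoy readings.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic wave heights (m) recorded at the same timestamps by two buoys
reference_buoy = [1.0, 2.0, 3.0, 4.0]   # buoy that stays operational
target_buoy = [0.8, 1.6, 2.4, 3.2]      # buoy we want to estimate

slope, intercept = fit_line(reference_buoy, target_buoy)

# Estimate the target buoy's wave height from a new reference reading
predicted = slope * 2.5 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

In practice you would fit on historical periods when both buoys were reporting, use several input variables, and evaluate the model on held-out data before trusting its predictions.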


Mouth of Placentia Bay Buoy

Pilot Boarding Station / Red Island Shoal Buoy

Placentia Bay: Ragged Islands – KLUMI (Land station)

The dataset available here runs until April 19, 2021. You can get the latest data from SmartAtlantic.

Download Instructions

Data Dictionary

Visual Representation of data

Buoy Location

You can find instructions for cleaning the dataset, merging the files, and training the ML models at the link below:

Instructions for Cleaning/Merging/Training

We implemented the code on IBM Watson Cloud and encourage you to use it to gain experience with cloud platforms. The link below provides instructions for using IBM Watson Cloud. The Lite version is free and provides 25 GB of storage, which is enough for this project.

Instructions for using IBM Watson Cloud

We have created notebooks with the code for your reference at the link below.

Links for Notebooks



6. Natural Language Processing (NLP)

The NLP project is about sentiment analysis on climate change. We used the dataset linked below and applied BERT to perform the sentiment analysis. BERT has become a standard for Natural Language Processing (NLP). When it was introduced, it achieved state-of-the-art results on eleven NLP tasks, including text classification, sentiment analysis, sequence labeling, question answering, and many more.


Instructions for using Google Colab and downloading the dataset

Steps to do Sentiment Analysis

Link to Notebook

7. Time Series

A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable, and the goal is usually to forecast future values. If you plot the points on a graph, one of the axes is always time. This project covers analyzing, plotting, and building a machine learning model for time series data. The project was done on IBM Watson Cloud, and instructions are given in the attached links. The Lite version is free and provides 25 GB of storage, which is enough for this project.
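One common way to turn a time series into a machine learning problem is to use the previous few values as inputs (lag features) and to split the data chronologically so that the model is always tested on later points than it was trained on. The sketch below illustrates this in plain Python. It is not taken from the linked notebook. The function names make_lag_features and chronological_split, the choice of three lags, and the values are assumptions for this example.

```python
def make_lag_features(series, n_lags):
    """Pair each value with the n_lags values that precede it."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling so the test set lies in the future."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Made-up, evenly spaced measurements, for illustration only
series = [10, 12, 13, 15, 16, 18, 19, 21]

rows = make_lag_features(series, n_lags=3)
train, test = chronological_split(rows)

print(rows[0])               # ([10, 12, 13], 15)
print(len(train), len(test)) # 4 1
```

Shuffling before splitting, as is common for other kinds of data, would leak future information into training, which is why the split here keeps the original order.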

Instructions for setting up an account on IBM Watson Cloud

Steps to download dataset


Steps to handle Time Series data

Link to notebook