DeepSense has compiled a few datasets for students and others interested in the ocean and AI, so that they can complete AI projects independently. We hope participants will learn about a specific type of ocean-related data and work through a concrete AI project. Participants are expected to work on the project alone, but we have provided guidance that includes notebooks, data, outputs, and models to try to improve upon.
We have found that the data cleaning step can take a long time, so we hope these datasets are clean enough for participants to focus on exploring ocean AI.
1. Python Basics
Before exploring the AI training projects, we recommend reviewing the Python libraries required for machine learning projects. This will help you follow the other projects. This section covers the essential Python libraries for an AI project, and the attached notebook shows them in practical use.
2. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to analyzing or investigating a dataset using statistical graphics and other data visualization methods. The analysis may include handling missing values, maximizing insight into the dataset, discovering patterns, extracting important variables, detecting outliers and anomalies, finding interesting relations among the variables, testing hypotheses, and checking assumptions. Drawing reliable conclusions from a massive quantity of data by just glancing over it is very difficult or almost impossible; instead, you have to look at it carefully through an analytical lens.
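A few of the EDA steps listed above (counting missing values, computing summary statistics, and flagging outliers) can be sketched with only the Python standard library. This is a minimal illustration, not the method used in the attached notebooks: the temperature readings are made up, and the 1.5 × IQR rule is one common outlier heuristic among several.

```python
import statistics

# Hypothetical sea-surface temperature readings in °C; None marks a missing value.
readings = [4.1, 4.3, None, 4.0, 4.4, 12.9, 4.2, None, 4.5, 4.3]

def summarize(values):
    """Return basic EDA facts: missing count, summary statistics, and IQR outliers."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (default "exclusive" method).
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "outliers": [v for v in present if v < low or v > high],
    }

summary = summarize(readings)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the reading of 12.9 is flagged as an outlier and two values are reported missing. Whether an outlier is a sensor fault or a real event is a judgment that EDA helps you make; the code only points at candidates.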
3. Text Cleaning
The purpose of this project is to show you how to clean text before using it for NLP problems. In this project, I use tweets posted by users on World Oceans Day, June 8, 2021. I created a Python script to pull the tweets and then showed how to clean the raw data. Notebooks are attached for your reference.
You can practice data cleaning on the same dataset, which I have attached here. If you want to see how to pull tweets from Twitter and clean them, follow the instructions in the Steps for Tweets Extraction and Steps for Text Cleaning links.
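As a rough idea of what tweet cleaning involves, here is a minimal sketch using only the standard `re` module. The function name `clean_tweet` and the sample tweet are illustrative assumptions, and the steps in the attached notebooks may differ (for example, they may also remove stop words or handle emoji differently).

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, hashtag symbols, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = text.replace("#", " ")                       # keep hashtag words, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

# Hypothetical tweet, not taken from the attached dataset.
raw = "Happy #WorldOceansDay! 🌊 Protect our seas @UN https://example.org/oceans"
print(clean_tweet(raw))  # happy worldoceansday protect our seas
```

The order matters: URLs and mentions are removed before punctuation is stripped, otherwise fragments such as "https" or "un" would survive as ordinary words.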
4. Object Detection
We used the Google Open Images database to obtain approximately 650 images of starfish. The images were already separated into train, test, and validation sets. The metadata linked below covers only the starfish images, not the entire dataset. It includes the coordinates of the bounding boxes around the starfish.
This contains the files:
If you want to download other categories of images from the Open Images database, you can do so by following the instructions here:
After you have the datasets, you can download and install YOLO v4 using the following instructions:
If you want to run this on Google Colab, check out the following wiki: Darknet Wiki. At the top there is a link to a Colab notebook and a video tutorial.
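Before training, the bounding-box metadata usually has to be converted into the label format that Darknet YOLO reads: one line per box with a class index, then the box center and size, all normalized to the image dimensions. The sketch below assumes the metadata follows the usual Open Images layout, with corner coordinates (XMin, XMax, YMin, YMax) already normalized to the range 0 to 1. The function name `to_yolo` and the class index 0 for starfish are assumptions made for illustration.

```python
def to_yolo(x_min, x_max, y_min, y_max, class_id=0):
    """Convert normalized corner coordinates to a YOLO label line.

    The output is "<class> <x_center> <y_center> <width> <height>",
    with every coordinate expressed as a fraction of the image size.
    """
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width = x_max - x_min
    height = y_max - y_min
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical box covering the middle of an image.
print(to_yolo(0.25, 0.75, 0.10, 0.50))  # 0 0.500000 0.300000 0.500000 0.400000
```

If your metadata stores pixel coordinates instead, divide the x values by the image width and the y values by the image height first.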
The buoy collects environmental measurements including wind speed and direction, surface temperature, current speed, wave height, and peak wave period. These wind and wave data are used to decide whether conditions allow the safe transfer of pilots and passage of vessels, as vessels require a minimum depth of water which may not be met if the waves are too large. The Red Shoal Buoy is currently under maintenance. A prolonged period without accurate environmental measurements would significantly impair the ability to ensure the safe guidance of vessels. In this project, we try to predict the environmental measurements of the buoy under maintenance using the values from other operational buoys, so that the authorities can allow the safe passage of vessels.
The task is to predict the values of one buoy using the parameters of another. In this project, we use datasets from the Mouth of Placentia Bay Buoy, the Pilot Boarding Station / Red Island Shoal Buoy, and Placentia Bay: Ragged Islands – KLUMI (land station), all located in Newfoundland and Labrador.
The dataset available here runs through April 19, 2021. You can get the latest data from SmartAtlantic.
You can find instructions for cleaning the dataset, merging the files, and training the ML models at the link below:
We implemented the code on IBM Watson Cloud and encourage you to use it to gain experience with cloud platforms. The link below provides instructions for using IBM Watson Cloud. The Lite version is free and provides 25GB of storage, which is enough for this project.
We have created notebooks with the code for your reference at the link below.
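The core idea of this project, estimating one buoy's reading from another buoy's reading taken at the same time, can be illustrated with a one-variable least-squares fit. This is only a sketch under stated assumptions: the wave heights below are made up, the function `fit_line` is hypothetical, and the models in the linked notebooks may use more features and different algorithms.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical paired wave heights in metres, not taken from the real dataset.
active_buoy = [0.8, 1.2, 1.9, 2.5, 3.1]   # buoy that is still operating
target_buoy = [0.6, 0.9, 1.5, 2.0, 2.4]   # buoy to be estimated

slope, intercept = fit_line(active_buoy, target_buoy)

# Estimate the target buoy's wave height from a new reading at the active buoy.
estimate = slope * 2.2 + intercept
print(f"estimated wave height at the target buoy: {estimate:.2f} m")
```

The fit is only meaningful when both series are aligned on the same timestamps, which is why cleaning and merging the files comes before any modelling.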
6. Natural Language Processing (NLP)
The NLP project covers sentiment analysis on climate change. We used the dataset available on data.world (link provided below) and applied BERT to perform the sentiment analysis. BERT has become a standard model for Natural Language Processing (NLP). It achieved state-of-the-art results on eleven NLP tasks, including text classification, sentiment analysis, sequence labeling, question answering, and many more.
7. Time Series
A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable, and the goal is usually to forecast future values. If you plot the points on a graph, one of the axes is always time. This project covers analyzing, plotting, and building a machine learning model for time series data. It was done on IBM Watson Cloud, and instructions are given in the attached links. The Lite version is free and provides 25GB of storage, which is enough for this project.
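Two habits set time series work apart from other machine learning tasks: features are built from past values (lags), and the train/test split is chronological rather than random. The sketch below illustrates both with a moving-average baseline. The wind speeds are made up, and the helper names `make_lagged` and `moving_average_forecast` are hypothetical; the project notebooks may use different models.

```python
def make_lagged(series, n_lags):
    """Turn a series into (previous n_lags values, next value) pairs."""
    return [(series[i - n_lags:i], series[i]) for i in range(n_lags, len(series))]

def moving_average_forecast(window):
    """Baseline forecast: the mean of the most recent values."""
    return sum(window) / len(window)

# Hypothetical hourly wind speeds in knots, ordered in time.
wind = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]

pairs = make_lagged(wind, n_lags=3)
split = int(len(pairs) * 0.7)            # chronological split, no shuffling
train, test = pairs[:split], pairs[split:]

# The baseline needs no fitting; a learned model would be fitted on `train`.
errors = [abs(moving_average_forecast(lags) - actual) for lags, actual in test]
mae = sum(errors) / len(errors)
print(f"{len(train)} training pairs, {len(test)} test pairs, MAE = {mae:.2f}")
```

A simple baseline like this gives you a number to beat: a machine learning model is only worth its complexity if its error on the later, held-out part of the series is lower.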