Identifying Wildlife in Camera Trap Images

As someone with a lifelong passion for ecology and wildlife conservation, working with data focused on wildlife populations was an obvious choice for my final project at Metis. Early in exploring what data science is and what it can do, I became interested in computer vision. Much of animal science, in my experience, revolves around visual observation, from identifying an animal and its species to assessing body condition and behavioral cues, which is why I wanted to use this project as a starting point for exploring data science applications in wildlife conservation. I noticed that much of the labeling of image data for building usable datasets is done manually; using deep learning image classification, I want to at least semi-automate this process. The more quickly and efficiently these images can be curated into usable datasets, the more quickly they can be used in further modeling toward actionable insights. I also thought this would be a great project to keep building on: if I can build a model that successfully identifies species, perhaps in the future I could extend it, or train a new one, to perform more complex identification, such as inferring behavior from body positioning.

More …

Exploration of Topics and Characterization in Starfinder Adventure Paths

As a longtime fan of tabletop roleplaying games, choosing campaign/adventure path books as the data source for my Natural Language Processing project was an easy decision. I chose Starfinder because of my personal interest in its intersection of science fiction and fantasy elements. Throughout this project, I wanted to explore the overlap of topics and characterization (particularly of antagonists) among different adventure paths, looking for emerging archetypes or themes. I used 7 adventure paths for a total of 33 individual books. The Starfinder adventure paths can be viewed here. A summary of my findings can be found in this presentation. This Tableau dashboard is a deeper dive into my topic modeling, showing the prevalence of different topics within the different adventure paths.

More …

Predicting Response to a Bank Telemarketing Campaign

My third project at Metis used supervised machine learning algorithms to predict whether a potential bank client would sign up for a term deposit account, using this UCI dataset. By predicting which clients are more likely to sign up and focusing efforts on those clients, there is the potential to increase campaign success rates.

More …

Exploring Transformers and OpenAI's DALL·E

As part of Metis’s data science bootcamp, we were tasked with investigating a topic related to data science not covered in the bootcamp; I chose to learn about OpenAI’s DALL·E. DALL·E is a version of OpenAI’s GPT-3, which makes it a class of deep learning model called a transformer. While reading about DALL·E I became curious: what is a transformer, and what sets it apart from previous models used in natural language processing?

Some background on models before transformers

Let’s first discuss the precursors to transformers: RNNs and CNNs. Recurrent Neural Networks (RNNs) process each word within a text separately, passing information from the previous step forward so the network can process the next word in context. You could visualize this as building a chain, and issues arise when that chain of words becomes too long. When the ‘distance’ between relevant information and the place it is needed grows very large, RNNs begin to struggle. The creation of Long Short-Term Memory (LSTM) models, a type of RNN, was an attempt to resolve this problem, but these models still do not perform as well on longer texts. The sequential processing of both RNNs and LSTMs also prevents parallelization, since they process word by word.
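
To make the sequential bottleneck concrete, here is a minimal sketch (NumPy, with random weights and a toy six-word sentence, all invented for illustration): each hidden state depends on the previous one, so the loop over words cannot be parallelized.

```python
import numpy as np

# Toy RNN: each hidden state depends on the previous one,
# so the time steps must be processed one after another.
rng = np.random.default_rng(0)
hidden_dim, embed_dim = 8, 4
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input-to-hidden weights

word_embeddings = rng.normal(size=(6, embed_dim))  # a 6-word sentence
h = np.zeros(hidden_dim)                           # initial hidden state

for x in word_embeddings:           # inherently sequential loop
    h = np.tanh(W_h @ h + W_x @ x)  # step t needs h from step t-1
print(h)  # final hidden state summarizing the whole sentence
```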

To resolve some of these problems, researchers developed a technique called attention. Instead of encoding the whole sentence in a single hidden state, each word has its own hidden state that is passed to the decoding stage. During decoding, these hidden states are used to figure out where the neural network should pay attention. While attention resolves some of the context-based issues of RNNs, CNNs address parallelization.
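
A minimal sketch of the attention idea, assuming simple dot-product scoring rather than any particular paper’s exact formulation: the decoder scores its current state against every encoder hidden state, then builds a weighted context vector from them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))   # one hidden state per input word
decoder_state = rng.normal(size=8)         # current state of the decoder

# Score every input word against the decoder state, then build a
# context vector weighted toward the words that matter right now.
scores = encoder_states @ decoder_state    # one score per input word
weights = softmax(scores)                  # where the network "pays attention"
context = weights @ encoder_states         # weighted sum of the hidden states
print(weights.round(3), context.shape)
```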

Convolutional Neural Networks can work in parallel because each word of the input can be processed at the same time, without depending on previous words having been translated. While the ‘distance’ between output and input in RNNs is on the order of N, for CNNs it is on the order of log(N). However, CNNs still do not resolve the issue of dependencies when translating.
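
To see where the log(N) figure comes from, assume a simplified stack of convolutions in which each added layer multiplies the receptive field by the kernel size k (as with dilated convolutions); connecting two positions N apart then takes roughly log base k of N layers:

```python
import math

# Assumption: each added convolution layer multiplies the receptive field
# by the kernel size k (as with dilated convolutions), so connecting two
# positions N apart takes roughly log base k of N layers.
def layers_needed(n_positions, kernel_size=3):
    return math.ceil(math.log(n_positions, kernel_size))

for n in (10, 100, 1000):
    print(n, "positions apart ->", layers_needed(n), "layers")
```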

Finally, Transformers!

Transformers combine CNN-style parallel processing with attention models, specifically self-attention. Each encoder in a transformer is made of two layers: a self-attention layer and a feed-forward neural network (FFNN). Each decoder has the same layers, but with an encoder-decoder attention layer between them. The self-attention layer looks at the current token alongside the entire sequence, attaches understanding from the other words relevant to that token, and then passes the result to the FFNN, which is where the model creates its transformed output (a translation, the next predicted word, etc.).

Transformers also use multi-head attention. When translating a word you may pay different attention to each other word depending on the question you are asking. In a sentence like ‘The dog ran outside because it was scared’, multi-head attention helps the model assess which word ‘it’ refers to. Positional encoding accounts for the order of the words in the input sequence. Compared to RNNs, transformers allow for much more parallelization, reducing training times, which has allowed for training on much larger datasets than were previously possible.
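
Here is a bare-bones sketch of the scaled dot-product self-attention at the heart of that layer (single head, NumPy, random weights purely for illustration; real transformers learn these projections and add positional encodings):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16  # e.g. the 8 tokens of "The dog ran outside because it was scared"
X = rng.normal(size=(seq_len, d_model))    # token embeddings
W_q = rng.normal(size=(d_model, d_model))  # learned projections in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)        # how strongly each token attends to every other
attn = softmax(scores, axis=-1)            # each row sums to 1
output = attn @ V                          # context-aware representation of each token
print(attn.shape, output.shape)            # (8, 8), (8, 16)
```

Multi-head attention simply runs several such projections in parallel and concatenates the results, letting different heads specialize in different relationships, such as resolving what ‘it’ refers to.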

What’s so interesting about DALL·E?

What’s so interesting and exciting about DALL·E is that instead of creating text from text or a new image from an existing image, it creates images from text. While the concept of generating an image from text may not be new, the sheer scope of DALL·E’s ability is nothing short of impressive. OpenAI’s blog post on DALL·E describes how the model works: “a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer.”
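
As a rough illustration of the stream described in that quote (shapes only; the token values here are placeholders, and real DALL·E replaces this dense lower-triangular mask with sparse row/column/convolutional attention patterns for the image tokens):

```python
import numpy as np

TEXT_TOKENS, IMAGE_TOKENS = 256, 1024            # per the OpenAI blog post
stream_len = TEXT_TOKENS + IMAGE_TOKENS          # one stream of 1280 tokens

# Placeholder ids standing in for text codes followed by image codes.
stream = np.zeros(stream_len, dtype=np.int64)

# A standard causal mask: token i may attend only to positions 0..i.
causal_mask = np.tril(np.ones((stream_len, stream_len), dtype=bool))
print(stream.shape, causal_mask.sum(axis=1)[:3])  # row i allows i + 1 positions
```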

But what are DALL·E’s capabilities? While the OpenAI blog post has a comprehensive list, there are a few capabilities I find particularly interesting. DALL·E has some ability to infer contextual details: it can render details that were not explicitly mentioned in the text prompt, such as shadows.

[Image: “a painting of a capybara sitting in a field at sunrise”]

DALL·E’s ability to combine unrelated concepts into new images ranges from surrealist combinations of an animal and an instrument to more realistic images of products that take inspiration from an unrelated concept.

[Images: “a snail made of harp, a snail with the texture of a harp” and “an armchair in the shape of an avocado, an armchair imitating an avocado”]

DALL·E also exhibits some of the zero-shot reasoning capability present in OpenAI’s GPT-3 model, on which it is based. Zero-shot reasoning is a model’s ability to perform a task solely from a description and a cue to generate the answer supplied in its prompt, without additional training.

[Image: “the exact same cat on the top as a sketch on the bottom”]

While DALL·E is an exciting advancement in transformers, it is not without flaws or concerns. As seen in the “a collection of glasses is sitting on a table” section of the DALL·E blog post, DALL·E produces images of both drinking glasses and reading glasses, showing difficulty with nouns that have multiple meanings.

[Image: “a collection of glasses is sitting on a table”]

It also shows difficulty with prompts that involve a lot of variable binding, where variables may become mixed up or be excluded entirely from the resulting image.

[Image: “an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants”]

Given how cutting-edge DALL·E is at the time of writing (February 2021; the DALL·E blog post was released in January 2021), it isn’t surprising that even the limited selection of material presented shows obvious areas where the model needs improvement. No model is perfect, and one as advanced as DALL·E is no exception. What I cannot help but wonder about are the potential implications of models like DALL·E. I find it easy to imagine DALL·E being used to create what are essentially e-commerce deepfakes. Some of the images could easily be mistaken for real-world photographs; for example, some of the avocado chairs look like images of a chair you might see on a furniture website. It is not out of the realm of possibility that someone could scam shoppers by listing realistic-looking items for sale that were generated by DALL·E and do not in fact exist. DALL·E is not available for public use at the time of this post; I’m curious to see the additional research OpenAI will conduct and implement to mitigate bias and help prevent unethical usage of DALL·E.

Sources

DALL·E: Creating Images from Text
How Transformers work
The Illustrated Transformer
What is a Transformer?
Transformer (machine learning model)

Predicting Horror Movie IMDB Scores for a Female Audience

For my Metis linear regression project, I created models to predict the IMDB score of horror movies for a female audience. As a lifelong fan of horror films, I knew I wanted horror to be the genre of focus for my predictive modeling, but it wasn’t until I began looking closely at IMDB’s demographic breakdown that I realized the majority of voting users identified as male. Across the over 600 movies I scraped from IMDB, each with a minimum of 10,000 votes, only about 18% of the total votes came from users who identified as women. With this target demographic in mind, I focused on three potential use cases: predicting the score of a film already released, a film that has not yet been released, and a film that is still in the pitching or pre-production stages. My final models used GridSearchCV with Ridge regression; I also used GridSearchCV with Lasso regression to view the most important features for each model. A summary of my modeling results can be found here.
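
As an illustrative sketch of that setup (scikit-learn, with placeholder data and parameter grids; the actual features and grids I used differ):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

# Placeholder features and IMDB scores standing in for the scraped data.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = rng.uniform(1, 10, size=600)

# Grid search over the regularization strength for the final Ridge models.
ridge_search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
ridge_search.fit(X, y)
print(ridge_search.best_params_)

# Lasso drives unimportant coefficients to zero, surfacing the key features.
lasso_search = GridSearchCV(Lasso(max_iter=10_000), {"alpha": [0.001, 0.01, 0.1, 1]}, cv=5)
lasso_search.fit(X, y)
print(lasso_search.best_estimator_.coef_)
```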

More …