
Data for AI: How to make it work

Updated: Aug 30, 2021


The most important step in creating an AI is getting the data to train it. There are many public datasets on websites like Kaggle, and these can be great to use. Sometimes, though, you cannot find the right dataset on these websites, for example if you need data from a certain place or for a very specific problem. In that case, you have to collect the data another way.


This is what happened to us. We were looking at the problem of middle school stress, and because most online datasets are about adult stress, they were not very helpful for our problem. Therefore, we collected our own data by sending out a survey and by creating some synthetic data. Collecting our own data was a long and tedious process, and this is a shortened version of it. We hope it helps other kids who are looking for data for AI projects.


The Survey


The first method we used to gather data was a survey. When using a survey to collect data, the questions being asked are extremely important. Asking the right questions, and making sure the answers will help train your AI, is crucial. For example, when we were looking for questions that would help us predict stress, a question like “What’s your favorite food?” would be irrelevant and unnecessary, while a question like “How much sleep did you get?” is relevant, because sleep is a factor that can impact stress. To help create questions, we met with a child psychologist who helped us put the right questions in our survey.


Privacy and Responsible Data Handling


Next, you must make sure that the participants’ privacy is protected, which also tends to get honest answers. If people worry they can be identified, they are more likely to give false answers, the data will be biased, and the AI will be less accurate. For example, we made sure that in our survey there was no real way to identify a person from their answers, and we assured participants that their answers would not be shared with anyone. We also put a note at the top of the survey so everyone knew they could not be identified by their answers. If we had not protected privacy, much of our data could have been biased by people giving less than truthful answers.


Also, since ours was a survey about stress, we worked with our school teachers to put a helpful message at the end of the survey about how to get help if you are feeling stressed.


Distribution


Next, we brainstormed ways to send out our survey so we could get the data we needed. The more responses we got, the more data there would be for the AI to train with, and just like human beings, an AI tends to learn better with more information. We had two main ways of spreading our survey. First, we designed a poster with a QR code that people could scan to access the survey. Here is the poster we created:


This method proved to be the least helpful to us, but it still gathered some data. Second, we asked our school principal for permission to share the survey with our peers, because we figured more students would answer a survey sent by the school than one sent by a few sixth graders. This was one of the most challenging steps in the process and took a lot of time, but once it was done we got over 100 responses from 6th and 7th graders! Even though it was hard to send out the survey, it turned out to be very successful.

Analysis


We analyzed the 100+ responses and created the pie charts below.





These pie charts helped us see what was really causing stress and what was reducing it. This was key to our app, because we would know what was causing the problem and which solutions could help. For example, we found that kids who did sports reported slightly less stress than those who didn’t, and kids who did performing arts reported more stress than those who didn’t. The amount of sleep also played a big role: people who got less than 7 hours of sleep daily had more stress than kids who got more than 7 hours. We used these pie charts to learn more and make our app better.
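If you are curious how an analysis like this could be done in code, here is a rough sketch in Python using pandas and matplotlib. This is not the exact notebook we used; the file name and the column names “sleep_hours” and “stressed” are placeholders just for illustration.

```python
# Rough sketch of tallying survey responses and drawing a pie chart.
# "survey_responses.csv", "sleep_hours", and "stressed" are placeholder names.
import pandas as pd
import matplotlib.pyplot as plt

responses = pd.read_csv("survey_responses.csv")

# Compare the share of stressed kids in each sleep group ("stressed" is 1 or 0).
print(responses.groupby("sleep_hours")["stressed"].mean())

# Pie chart of stressed vs. not stressed respondents.
responses["stressed"].value_counts().plot.pie(autopct="%1.0f%%")
plt.title("Stressed vs. not stressed")
plt.show()
```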


Adding Synthetic Data


Larger datasets tend to produce more accurate AIs. The dataset is the source of your training, so the more detail it has, the better the accuracy, and the more phrases the AI can detect. From the survey we sent out, we got a good amount of data but not enough, so we created our own synthetic data as well.

We built comma-separated files in Google Sheets for the questions on the survey. For the open-ended questions, like “How was your day?” or “How are your friends?”, we created one column for the text entered and an adjacent column for the “stress score”. If the answer indicated stress, we assigned a stress score of 1 (yes); if not, we assigned 0 (no). Questions like “How much sleep did you have?” were multiple-choice, which made that part of the dataset a lot easier to make because there were only so many options. For the open-ended questions we had to fill in each column manually, and each of us wrote around 30+ of these answer-and-score combinations. We tried our best to match the answers to how we believed our friends would respond, so we had some fun creating this. We kept roughly the same number of stress examples and non-stress examples, so our data was well balanced.
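To make the labeling scheme concrete, here is a small sketch of what one of these files could look like, written with pandas instead of Google Sheets. The example answers and the file name are made up for illustration; the real spreadsheets had many more rows.

```python
# Sketch of the "text answer + stress score" layout described above.
# A score of 1 means the answer indicates stress, 0 means it does not.
import pandas as pd

synthetic = pd.DataFrame({
    "text": [
        "My day was terrible, I have three tests tomorrow",
        "My day was great, we played soccer after school",
        "It was okay I guess",
    ],
    "stress_score": [1, 0, 0],
})

# Save it as a comma-separated file, like the ones we made in Google Sheets.
synthetic.to_csv("synthetic_day.csv", index=False)

# Check that stress (1) and non-stress (0) examples are roughly balanced.
print(synthetic["stress_score"].value_counts())
```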


(These are examples of the synthetic data we created. As you can see, we expected some people to answer “random”, so we had to make sure the AI would not get confused by this.)



Training the AI — How well did it do?


Once we had analyzed the data, we trained our AIs. We used two different types of AI: one for text and one for numbers and categorical data. As more and more responses were used to train the AI, the accuracy of its predictions got better. To make sure our AI could predict Stress, Non-stress, and Random correctly, we checked the confusion matrix, which told us what the AI was getting wrong or had not learned properly. Right now the accuracy is around 80%, which is a pretty good level for the data we collected, but we are trying to make it even better.
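Here is a sketch, in Python with scikit-learn, of how a text model could be trained and then checked with a confusion matrix. This is not the exact tool we used; the file “labeled_answers.csv” and its “text” and “label” columns are assumptions for illustration.

```python
# Sketch: train a simple text classifier and look at its confusion matrix.
# "labeled_answers.csv" with "text" and "label" columns is a placeholder.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("labeled_answers.csv")  # labels: "stress", "non-stress", "random"
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42
)

# Turn each answer into word features, then fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
# Each row is the true class and each column is the predicted class,
# so the off-diagonal numbers show what the model mixes up most.
print(confusion_matrix(y_test, pred))
```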


In closing…


You can see more about the app we created with our AI — Calmzilla here.


We hope this helps you as you create datasets to train AIs to solve problems. If you have any questions, please reach out at aithatcounts@gmail.com.
