Data for AI: How to make it work

Updated: Mar 15


The most important step in creating an AI is getting the data to train it. There are many public datasets on websites like Kaggle, which can be great to use. However, sometimes you cannot find the right dataset on these websites. For example, if you want data about a certain place or a very specific problem, these websites may not have the dataset you are looking for. In that case, you have to collect the data yourself.


This is what happened to us. We were looking at the problem of middle school stress. Most online datasets are about adult stress, so they were not very helpful for our problem. Therefore, we collected our own data by sending out a survey and by creating some synthetic data. Collecting our own data was a very long and tedious process, and this is a shortened version of it. We hope it helps other kids who are looking for data for AI projects.


The Survey


The first method we used to gather data was a survey. When using a survey to collect data, the questions being asked are extremely important. It is crucial to ask the right questions and to make sure that the answers will help train your AI. For example, when we were looking for questions that would help us predict stress, questions like “What’s your favorite food?” proved to be irrelevant and unnecessary. However, questions like “How much sleep did you get?” can be relevant, as sleep is a factor that can impact stress. To help create our questions, we met with a child psychologist, who helped us put the right questions in our survey.


Privacy and Responsible Data Handling


Next, you must protect the privacy of the people taking the survey. People who know they cannot be identified tend to give more serious answers; if they think they can be identified, they are more likely to give false ones. False answers make the data biased, which makes the AI less accurate. For example, we made sure that there was no way to identify a person from their answers, and we assured our participants that their answers would not be shared with anyone. We also put a note at the top of the survey so that everyone knew they could not be identified by their answers. If we had not maintained privacy, much of our data could have been biased by less-than-truthful answers.
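If a survey tool adds identifying fields to its export, one way to keep the answers anonymous is to drop those fields before anyone looks at the data. The sketch below is only an illustration of that idea, not a description of how our survey was set up: the field names (name, email, timestamp), the helper strip_identifiers, and the sample rows are all made up for the example.

```python
# Hypothetical field names; a real survey export will use different ones.
IDENTIFYING_FIELDS = {"name", "email", "timestamp"}


def strip_identifiers(responses, identifying_fields=IDENTIFYING_FIELDS):
    """Return copies of the responses with the identifying fields removed."""
    return [
        {key: value for key, value in row.items() if key not in identifying_fields}
        for row in responses
    ]


# Made-up example rows, not real survey data.
sample = [
    {"name": "Student A", "email": "a@example.com", "hours_of_sleep": "7", "grade": "6"},
    {"name": "Student B", "email": "b@example.com", "hours_of_sleep": "9", "grade": "7"},
]

cleaned = strip_identifiers(sample)
for row in cleaned:
    print(row)
```

The function builds new dictionaries instead of changing the originals, so the raw export can be deleted separately once the cleaned copy has been checked.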


Also, since ours was a survey about stress, we worked with our school teachers to include a supportive message at the end of the survey about how to get help if you are feeling stressed.


Distribution


Next, we brainstormed ways to send out our survey so that we could get as many responses as possible. More responses would mean more data for the AI to train with. Just like human beings, an AI tends to learn better with more information. We had two main ways of spreading our survey. First, we designed a poster and added a QR code that people could scan to access our survey. Here is the poster we created:


The poster proved to be the less helpful of our two methods, but it still helped us gather data. Second, we asked our school principal for permission to share the survey with our peers. We figured that more students would answer a survey sent by the school than one sent by a few sixth graders. This step was one of the most challenging in the process, and it took a lot of time to complete. Once it was done, though, we got over 100 responses from 6th and 7th graders! Even though sending out the survey was very challenging, it turned out to be very successful.