Before You Jump into the World of Machine Learning
Our world is home to a very large number of people and personalities. It’s not always easy to get along with a person the first time you meet them. Your first impression could be something as simple as a basic attraction, or you might just think this person is boring. However, the more time you spend with them, the more their characteristics surprise you. I liken it to this phrase:
You can’t judge a person without knowing the whole story. You may think you understand but you don’t.
You have no way of knowing if one day that person might save your life.
Let me tell you my story. I feel I am working with the most difficult people I could ever possibly get along with. They don’t talk the way we talk. If they do talk, I might not understand the way they speak. I can’t physically see all of them because they’re too large for my brain and eyes to comprehend, so I can see only part of them. And yet, the more time I spend with them, the more addicted I become to being around them. What kind of job do I have? You might be thinking I write riddles for a living. Sometimes I think that’s the case, but not quite.
I am a data scientist at StackPros.
I am here to understand, communicate with and have fun with data, to find problems in and ask questions of data, and to finally get the right answers from data. I see the data as almost evolving into a good friend I can rely on. It can get tiring and even boring to hear the same buzzwords about Big Data and Machine Learning every day. Why don’t we try a different approach? I would like to share a different way of communicating with data so that you too can hopefully see that data can be a good friend. Below I’ve posted a few tips that will hopefully help you to enjoy your whole journey in the machine learning world, and hey, you never know: data might save your life someday.
Tip 1. Create question lists that you are interested in.
It is definitely worth taking several days, or even weeks if you have the time, to think carefully about what exactly you want from the data. When I feel I’ve lost track of my end goal, this is the place I always return to. Jot down a list of all the things you are interested in, then order them by which you need to ask first, working your way down. By doing so, you give yourself at least a general picture of the flow you are going to explore. Does it sound obvious? Or conversely too vague? That’s OK, let’s go with some examples. The very first thing to ask yourself is ‘what is the final answer I want to get?’
Let’s take the online ad marketing industry as an example.
- Are you interested in predicting the number of ad clicks on a given day in the future?
- Or, in predicting what types of users click on what types of ads?
Once you decide on the final goal, list out some sub-questions.
- What does the data look like?
- Is there missing information?
- Do you have too many or too few columns?
- Are all the columns meaningful?
- Can you generate other features that can help the model predict well?
- Are there any relationships among the predictors, and between the predictors and the outcome?
Once you’ve done this – when you decide to jump into the machine learning algorithm itself – you should have a bunch of question lists that will continue to help you understand the data. Remember – the end prediction of a machine learning model is merely a long list of answers to a long list of questions.
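To make a few of those sub-questions concrete, here is a minimal sketch of how you might put them to a tiny dataset in plain Python. Everything in it is an assumption for illustration: the column names, the values, and the helper functions `missing_counts`, `constant_columns` and `pearson` are all made up, and in practice you would probably reach for a data library rather than hand-written loops.

```python
from math import sqrt

# Hypothetical ad-click records; column names and values are invented for illustration.
rows = [
    {"impressions": 1000, "hour": 9,  "campaign": "A", "clicks": 20},
    {"impressions": 1500, "hour": 12, "campaign": "A", "clicks": 31},
    {"impressions": None, "hour": 15, "campaign": "A", "clicks": 12},
    {"impressions": 2000, "hour": 18, "campaign": "A", "clicks": 40},
    {"impressions": 500,  "hour": 21, "campaign": "A", "clicks": 9},
]

def missing_counts(rows):
    """Is there missing information? Count None values per column."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return counts

def constant_columns(rows):
    """Are all the columns meaningful? A column with a single distinct value carries no signal."""
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]

def pearson(xs, ys):
    """Is a predictor related to the outcome? Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print("missing values per column:", missing_counts(rows))
print("columns with no variation:", constant_columns(rows))

# Only rows with a known impression count can be correlated with clicks.
complete = [row for row in rows if row["impressions"] is not None]
corr = pearson(
    [row["impressions"] for row in complete],
    [row["clicks"] for row in complete],
)
print("impressions vs clicks correlation:", round(corr, 3))
```

Each function answers exactly one question from the list, which is the point: the code stays as small as the question it serves.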
Tip 2. Get the right data or sample.
If you have a small data sample size, consider yourself lucky! Of course, there are many things you need to take into account. In the marketing industry, though, there is a very high chance that your data will be extremely large. Think about it this way. According to an IBM Marketing Cloud study, 90% of the data on the internet has been created since 2016, and the number of internet users is approximately 3.8 billion in 2017. I cannot even imagine how much data is created on the internet each day. Here’s a real eye-opener for you: check out this Google Search Statistics page to see today’s Google search numbers. The speed at which those numbers increase hurt my eyes.
It’s safe to assume that you have big data. The good news is that there’s a higher chance of having useful information. The bad news is that you might be easily distracted by noise in the data. Once you have a basic indication of how big the data is and what kind of elements it contains, it is a good idea to work with a sample that represents the original data rather than with all of the data itself. Picture it as making a faithful miniature of the original data. This step can get complicated, as there are many strategies and procedures to choose from when it comes to sampling methods. As this is a crucial step, I will post a separate article fully focused on this subject. Keep your eyes out for it!
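Until then, here is a minimal sketch of one such strategy, stratified sampling, where you draw the same fraction from every group so the sample keeps roughly the same mix as the original. It is only one of many options, not a recommendation for every dataset. The `stratified_sample` helper, the `device` field and the numbers are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group so group proportions are roughly preserved."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Keep at least one record per group so rare groups do not vanish.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 900 desktop and 100 mobile ad impressions.
population = (
    [{"device": "desktop", "id": i} for i in range(900)]
    + [{"device": "mobile", "id": i} for i in range(900, 1000)]
)

sample = stratified_sample(population, key="device", fraction=0.1)
mobile_share = sum(record["device"] == "mobile" for record in sample) / len(sample)
print(len(sample), "records sampled; mobile share:", mobile_share)
```

With these made-up numbers the sample holds 100 records, 10 of them mobile, so the 10% mobile share of the original survives in the miniature.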
Tip 3. Communicate with the data using its language.
Congratulations! Now the FUN begins. Go ahead and look at the data and ask yourself: what does it look like? Can you understand it right away? If you can’t, then we need to help it tell you its story, reshape it, visualize it or manipulate it.
You might be wondering how we can do this, and what language to use. There are a few options: R, Python, and Julia are open-source languages you can begin with. There’s also SAS, which is commercial software, but you can try the SAS University Edition for free. When selecting a language, consider that each has its own pros and cons. All that really matters is what you want to do with the data, not which language you are going to use.
Once you’ve chosen the language, you’re going to need a good amount of coding skill. If you don’t have it yet, that’s OK! If you really want to talk to the data, you need to learn. There are plenty of helpful websites out there like Udemy, Lynda, StackExchange, Analytics Vidhya and more. I can guarantee that once you ask the right questions and see the responses from the data, you will want to ask the next question right away. And the next, and the next, all the way until you almost understand the data.
Here is my all-time favourite quote from Andrew Ng, a professor in Stanford University’s Department of Computer Science and Department of Electrical Engineering.
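As a small taste of that question-after-question rhythm, here is a minimal sketch in Python (chosen only because it is one of the languages mentioned above). The click log is invented for illustration, and the text "chart" is a stand-in for a real plotting tool.

```python
from collections import Counter

# Hypothetical click log of (day, ad_type) pairs, invented for illustration.
clicks = [
    ("Mon", "banner"), ("Mon", "video"), ("Tue", "banner"),
    ("Tue", "banner"), ("Tue", "video"), ("Wed", "banner"),
]

# Question 1: how many clicks happened on each day?
clicks_per_day = Counter(day for day, _ in clicks)

# Question 2, the natural follow-up: which ad type gets those clicks?
clicks_per_type = Counter(ad_type for _, ad_type in clicks)

# A crude text chart: one '#' per click, one row per day.
for day, count in clicks_per_day.items():
    print(f"{day:<4}{'#' * count}")

print("most clicked ad type:", clicks_per_type.most_common(1)[0][0])
```

The same raw log is reshaped twice, once per question, which is usually how a conversation with data goes.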
“So, ask yourself: If what you’re working on succeeds beyond your wildest dreams, would you have significantly helped other people? If not, then keep searching for something else to work on. Otherwise you’re not living up to your full potential.”
Research from IDC shows that 90% of the data produced is unstructured, meaning that the majority of data does not follow a predefined data model. Many people, myself included, believe we can help others with all this data. I have always thought of myself as a communicator. I always think of how I’d explain this to my daughter when she’s old enough, or to a classroom full of teenagers: let me be a friend to your data. I’ll talk to it and I’ll let you know what kinds of stories it has hidden away, and what kinds of amazing answers it can provide.