In the last post I wrote about the importance of having a business case and investing in core data science skills instead of technologies. In this post I will continue with the idea of a good business case, then move on to building a high-performance team to deliver on it and provide value.
**WARNING** It’s a long read, also very opinionated **WARNING**
Here are the other posts in case you have missed them:
- Part I: Where should I start?
- Part II: Spare Time and Getting the Job.
- Part III: Building the Right Team and Breaking the Business Model.
- Part IV: Programming and Statistics
- Part V: Data Munging and Feature Engineering. (Coming soon)
- Part VI: Yippee! Machine Learning. (Coming soon)
- Part VIII: How to Interview a Data Scientist. (Coming soon)
Months have gone by since you joined and the CEO asked you to extract insights from the company’s data. While you were gathering your thoughts in your familiar nine-to-five cubicle, a cold chill suddenly ran down your spine as you noticed a reflection on the screen and heard a soft voice: “What can you show me?”
You scurried to close the browser containing your favorite morning news sites and struggled to find the file with your analysis. “Yes, got it!” you thought as you double-clicked the file. You smiled awkwardly at your boss as the progress bar took another step.
The CEO stood silently.
You went through the sales pitch, displayed trends and graph after graph, and explained each one of them. You paused, looking for some feedback. The CEO said: “It is interesting that you can identify all the right-handed customers based on data analysis and machine learning, but how is that information going to help our banking business? Are you also going to spend another few months finding the left-handed customers too?”
Insights Do Not Mean Value
The story might be silly and simple, but I hope it delivered the idea. Just because you can find insights in data doesn’t mean they are useful. That is why I need to emphasize the importance of a business case. Having a business case will guide you to identify and generate the correct insights, keeping in mind the end goal as well as all the processes in between.
For example: assume you have a dataset of customer transactions. Without a business case you would look at the data, generate some features, play with charts and find correlations between features without knowing what you wanted. On the other hand, when you do have a business case such as “I want to prevent credit card fraud” or “I want to predict the next customer purchase”, you start thinking differently. You ask things like “Oh, I wonder how many times the customer went to that specific shop?”, “What are the characteristics of credit card fraud?” or “What are the previous sequences of items that the customer bought?”. Answering these questions generates specific features that might help solve the problem. That is why the definition of a business case is super important.
This may sound rather silly and redundant and you may think: “Of course you need a business case! Everybody knows that! Duh!?”. You would be surprised how many people want big data / data science without a business case and think that machine learning is the silver bullet to improve their business (yep, the hype is real). Sometimes you will need to communicate with stakeholders multiple times to clarify the business case, because honestly, sometimes they don’t even know what they want.
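To make this a bit more concrete, here is a minimal sketch of how two of those questions could turn into features. Everything in it is made up for illustration: the `transactions` records, their field names (`customer`, `shop`, `item`, `day`) and the two helper functions are my own assumptions, not a real dataset or library.

```python
from collections import Counter, defaultdict

# Hypothetical transaction records; the field names are assumptions for illustration.
transactions = [
    {"customer": "alice", "shop": "grocer", "item": "milk", "day": 1},
    {"customer": "alice", "shop": "grocer", "item": "bread", "day": 2},
    {"customer": "alice", "shop": "pharmacy", "item": "aspirin", "day": 3},
    {"customer": "bob", "shop": "grocer", "item": "eggs", "day": 1},
]


def visits_per_shop(rows):
    """Count how many times each customer transacted at each shop."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["customer"]][row["shop"]] += 1
    return {customer: dict(shops) for customer, shops in counts.items()}


def purchase_sequences(rows):
    """List each customer's purchased items in chronological order."""
    sequences = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["day"]):
        sequences[row["customer"]].append(row["item"])
    return dict(sequences)


shop_visits = visits_per_shop(transactions)
sequences = purchase_sequences(transactions)
print(shop_visits["alice"])  # {'grocer': 2, 'pharmacy': 1}
print(sequences["alice"])    # ['milk', 'bread', 'aspirin']
```

The point is not the code itself: the business question (“how often does this customer visit that shop?”) decided which feature to compute, not the other way round.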
Value Does Not Mean Insights
Here you will need to distinguish between big data and data science. So far I have been using the two terms interchangeably (my bad). Simply put, big data can be regarded as the technology stack and its implementation; data science is the analysis of data and the extraction of insights using statistical methods and machine learning. When used well together, they can make black magic happen (because you know, it’s a black box… haha…).
Just by using big data you could already cut down licensing costs, streamline processes, deduplicate data, and add streaming, distributed processing, monitoring and much more. These are examples of use cases that don’t require data science and could potentially transform your business already. Who needs data science anyway!
So What Exactly is Value?
Think of it like this: you go to McDonald’s, and you order and pay for 10 pieces of Chicken McNuggets. You open the box and discover 11 pieces inside. You get more than you put in, and you get that warm happy feeling on the inside. More concretely, value doesn’t always mean money; it could be time saved, convenience, quality of life improvements etc. It can be summed up as:
A product or service that the users will love and is feasible for the engineers to produce.
Remember, the business case comes first, then you determine the Minimum Viable Product (MVP) to be built. This concept is very important: it allows you to determine the resources you need, get the product to the customer fast, get quick feedback, and fail fast. My boss once said: “Perfection in software engineering is just too expensive”, hence define the end goal of your MVP first. My recommendation is to go for many small wins instead of one big win. Eventually the small wins will build up enough foundation / pipeline to achieve the big wins continuously.
Even though data science can produce insights and models that could potentially change the business, data science projects generally take a very long time to complete, and most of the time the result just sits on the laptop of your resident data scientist. If you cannot productionise your data product, it is not helpful to the business and is only a cost. Sometimes it is also not feasible for the engineers to productionise the data product. For instance, even though Apache Spark is growing rapidly, it is still missing quite a bit of machine learning functionality and cannot produce complex models. On top of that, crunching big data with parallel processing (another topic for discussion later) is rather complicated and tricky compared to working with the sample datasets that data scientists generally use.
I am oversimplifying this, because in reality value generation is actually quite difficult to accomplish. You will need strong management with a clearly defined strategy and end goal, and a strong team to deliver the result. There are forces beyond technology that will prevent value generation. The stages of building a data product are also quite complex and deserve a topic of their own, which I will not discuss here.
Breaking the Business Model
If we are talking about value generation, we also need to talk about the obstacles that prevent it. With every new technology comes a healthy or not so healthy dose of scepticism and cynicism. The hardest aspect of implementing big data is the mind-set, and the bigger the organisation, the harder it is to change. Here are a few examples of obstacles. They might be obvious but are still worth mentioning:
1. Misunderstanding of the Technology (can or cannot do)
How many times have you read that machine learning is going to cure cancer and create world peace, or that the robots are coming to take your job and kill all humans, or that big data is going to solve all your business problems? These same articles don’t tell you about the limitations of machine learning and big data, do they? So it is important that you understand the business strategy and how to choose and implement the correct big data technology to drive that strategy. Nothing breaks my heart more than seeing a technology being used improperly and then hearing that big data does not work, like people trying to use HDFS / HBase as a traditional relational database, finding that it doesn’t work for them, and then blaming big data for providing no value and being a waste of money.
2. Bad Communication between Team Members
If you ever have spare time, go watch your data scientists and engineers argue; it’s quite entertaining (they are such different people). Both have their point of view, but sooner or later they will reach an impasse. This is because the data pipeline is long and complicated and can be regarded as a big circle. The end product needs to provide a feedback loop back to the source, and the source needs to provide data to the end product. At any given point in the pipeline, when one party in the project is not communicating with the other parties, the project will come to a halt. Some examples: when the data scientists cannot get the data they need from the engineers to work their black magic; when the engineers cannot get any explanation of what the model does from the data scientists; or when one data scientist makes one business assumption about the data and another data scientist makes a different one without consulting the domain expert. That last one can spell big trouble when the system goes live, e.g. think about interest rate adjustments for a bank, or credit checks.
3. Fear of Losing Jobs
With all the news around how big data and artificial intelligence will replace jobs, why would any human want to lose their job? Fear brings about uncertainty and resistance to implementing new strategies and technologies. Imagine you are a data analyst and your job is to deliver reports and present findings. Now that big data can do it in real-time, why do they even need you? It’s simply cheaper and faster when a machine does it.
4. Office Politics
Office politics is one of my biggest headaches in a large corporation, and a game that I refuse to play. Needless to say, everyone working in the company has their own agenda, and if your department is the spanking new big data hot-shot, you will also be the spanking new hot target to shoot down, with everyone waiting for you to fail. Imagine your company fails to improve its sales: the big data team gets blamed for not having a recommender system that personalises to the customer’s needs. There are many more political scenarios that go on in a corporation that I am not aware of. In the end, politics is not a good thing. Healthy competition between teams is okay, but politics is a no-no because it drives down morale and creates hostility among colleagues, which means no teamwork.
If your organisation has all of the above issues, then you don’t need big data, because you already have big problems that are far more costly than big data. Now what do I mean by breaking the business model? Big data implementation requires new ways of thinking:
- Instead of batching, think streaming.
- Instead of overnight processing, think real-time processing.
- Instead of rule based decisions, think predictive / generative models.
- Instead of customisation, think personalisation.
- Instead of firing and retrenching people, think about re-assigning and freeing up resources to do more valuable projects that machines cannot do.
- Instead of thinking “use all the data we’ve got”, think business cases and hand-pick the relevant data to use.
These new ways of thinking will surely change the business model over time, improve the processes and hopefully lead to profitability. One other main concern that I need to talk about is ethics and privacy, which I cannot emphasise enough. Just because you CAN collect data from your customers doesn’t mean you SHOULD. Just imagine that data getting out (internally or externally): what is the chance of the company getting sued or suffering reputation loss? Sometimes you even need to exclude a determining feature from your model because it is private and sensitive. For example: take a simple predictive model that decides whether to grant or reject a home loan. What if the key determining feature is the applicant’s gender or race? How would you explain to the customers why their application got rejected? Would you be able to defend yourself in a court of law? Also, if you ever have doubts about whether you should be collecting or using specific data, chances are that it is not ethical and you shouldn’t collect or use it.
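As a rough illustration of excluding sensitive features, here is a small sketch that strips them from a record before it ever reaches a model. The `SENSITIVE_FIELDS` set, the `strip_sensitive` helper and the sample applicant are all hypothetical; what counts as sensitive is a decision for your domain expert and legal team, not for this snippet.

```python
# Hypothetical list of fields that must never be used as model features.
SENSITIVE_FIELDS = {"gender", "race"}


def strip_sensitive(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the record without the sensitive fields."""
    return {key: value for key, value in record.items() if key not in sensitive}


# Made-up home loan applicant, for illustration only.
applicant = {
    "income": 85000,
    "loan_amount": 300000,
    "gender": "F",
    "race": "undisclosed",
}

features = strip_sensitive(applicant)
print(features)  # {'income': 85000, 'loan_amount': 300000}
```

Doing this as an explicit step in the pipeline at least makes the decision visible and reviewable, instead of leaving it to whoever happens to build the model.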
Building a High Performance Team
Now that you have everything in place (a business case, strong management support and a strategy), you are ready to build a team to deliver results. Different stages of the data pipeline require different team compositions. Generally, planning a big data team is like planning a heist or baking a cake. As a general rule of thumb, keep it small at about 5 ~ 10 people:
- 1 Data Architect
- 1 ~ 2 Data Admins
- 1 ~ 2 Data Scientists
- 2 ~ 4 Data Engineers
- 1 Domain Expert / Product Manager
- 2 Eggs
- 1 Cup of Flour
- Bake for 15 min at 180 °C
- A sense of humour
Yes, yes, skills shortages blah blah blah. Allow me to be brutally honest: I think the skills shortage is a myth, partly because some of the hiring criteria / requirements are rather restrictive, much like putting people into a shoe box made for small feet. Big data is so new and the tools out there are still developing, so if you want to create value you will need to be creative to get around the limitations of the technology, or build your own. So why put people in a shoe box by requiring very specific knowledge of a specific technology or machine learning technique? It’s just not practical because the landscape is changing so fast. This is especially true when you don’t have a business case but are looking for people with HDFS / Kafka / Spark / HBase etc. These technologies might not even be suitable for your business case, even though they are popular. What you need is someone who is passionate, flexible and creative, with a willingness to adapt and learn.
It’s true that you need specific skillsets for specific roles, but what really works well is a team with overlapping skills. For instance, an engineer with some data science knowledge, an architect with machine learning, an admin with statistics, a product manager with a bit of everything etc. Passionate people are driven: you don’t need to force them to learn or research big data technology, they will come to you with many solutions. The key factor is a focus on the business case and constant communication. This drives ideas and implementation, especially when the overlap of skillsets kicks in, because people with different skill backgrounds think in different, fantastic ways.
Sum It Up!
In case you haven’t got it: you will need a business case, and you will need to define the minimum viable product with clearly defined end goals. With strong management support and a passionate team you can create value through constant communication.
Next, I will go through the importance of programming and statistics in the data pipeline. I will continue the theme of value generation and look at how these two aspects play a role in the stages of data development, as well as the ratio between the two. Hope you enjoyed this, any questions and comments are welcome!
Until next time, have a nice day!