Creating or Finding Value as a Data Scientist



By Jose Quesada

Hi, this is Jose Quesada for Data Science Retreat.

Today let's talk about a word with no clear definition but a lot of meaning: value. On its own, the word sounds fluffy and vague, so let's give it a more concrete definition.

Everyone has ideas, and everyone wants to turn those ideas into products. What makes an idea stand out is the value it creates. That value is what differentiates one entrepreneur from the rest.

So, what is value? To me, it means something that solves a real problem. Ideally, there are several audiences with problems they do not know how to solve. Machine learning gives a data scientist a tremendous amount of leverage to solve exactly those problems.

I have made a video on this topic for those of you who prefer watching to reading; see below.


So, what can you do with this unique leverage that creates value? Let’s navigate this issue through some examples.

  1. Suppose you have a child who is dyslexic. Dyslexia is a condition that affects about 25% of the population, according to some studies. Dyslexic children have a hard time learning to read. For a parent this is worrying: the child suffers, and the parent has to spend more time and money trying to help. Since this is a problem that affects a large population, if you find a solution to it using machine learning, you create value.
  2. Another example: suppose you yourself have a rare condition. It is not an issue most of the population faces, but building a solution for it will generate a lot of value for you and for everyone else with the same condition.
  3. There are also smaller problems that companies face, like wanting to catalog the things they purchase in order to keep track of them and reduce costs. If you use machine learning to solve this, a lot of companies can cut their costs with the value you have generated.

Now, if you are a data scientist who is comfortable in your company and you don't feel like looking out for problems to solve, you may be able to survive in the long run, but it is not something I recommend. Why? Let me elaborate with an example:

Suppose there are two data scientists, Andreas and Vlad. Andreas is very good with people and wants to find solutions to the problems the various departments in his company face. He is well liked, because he actively looks for problems and solves them with machine learning, problems other departments did not even know could be solved. Vlad, on the other hand, is a very productive employee too. If someone gives him a problem, he will find a solution to it. Everyone knows Vlad will get to the bottom of any situation.

But who do you think makes more money? Andreas. Why? He finds problems and produces value for the company. He is an irreplaceable asset, so they will pay him more to keep him. He also has the ideal product/market fit, which matters in every company.

How can you find problems worth solving? Simply by talking to people, be it on LinkedIn or in real life.
In his book 'The Lessons School Forgot', Steve Sammartino introduces a concept called '10 meetings'.
 

Suppose you are entirely new to a market and you want to know more about it. You reach out to someone from the industry and ask to meet them over a cup of coffee to learn more. By the end of the conversation, you will have picked up certain keywords that you can go home and look up. When you meet the next person, your questions will be much more specific. By the time you reach the tenth meeting, you have good knowledge not just of the industry but also of the problems in the industry, some of which you may be able to solve with machine learning. Then you can approach someone with a valuable solution to their problem, and you've secured a job as well!

The next idea is from the book 'The Mom Test' by Rob Fitzpatrick. Suppose you have an idea for a solution, and you present it to your mom. Most probably, she will tell you it is a great idea, even if it isn't. Why? Because your mom loves you and wants to keep you in good spirits. The lesson: people will lie to you.

So how do you get a reasonable opinion about what you think is a valuable idea? There are three things which you can do here.

1. Talk about their life instead of your idea:

Suppose you have an idea for an app that suggests recipes based on what you have in your pantry. You think it would be successful, so you test it with your mom, who has been cooking for a long time. When you tell her, she seems enthusiastic at first and gives it a try. Eventually, you realize she isn't using the app, because it isn't really a necessity for her. Her way of cooking is a process: she plans far ahead, makes a list, buys the ingredients, and cooks. Such an app may not be a necessity for her, or for many others.

2. Ask about specifics in their past instead of generics or opinions about the future:

If you ask your mom a generic question like 'Do people like recipes?', she will definitely say yes! If you ask her something like 'How often do you check your recipe book?', she may say the last time was five years ago, even though she has been cooking for twenty years.

Since what you want to know is whether people would be interested in an app that automatically generates recipes, you can get ideas from your own circle by asking them specific questions, which gives you a glimpse of their lives and their needs.

3. Talk less, Listen more.

You can go from guessing the problem to being in a spot where you know the problem really well. You could be the person who knows the problem better than anyone else in the world, and you have the solution. That is a wonderful place to be, so that is the direction to move in.

This is very important when you want to know whether your idea is good enough. You may be so consumed by your idea that you have forgotten the issues that come with it, whether it is a necessity, or whether it is practical. When you talk to someone, just listen, and you will learn more about the actual issue and how often the product would be used, which tells you what you actually need to do rather than what you had planned to do.

 

To conclude, what makes every idea a class apart is the value that it creates.

To create more value, you need to actively look out for problems that require solutions and work towards creating a solution better than the rest.

 

Transcription credit: Neeraja Nair


What do you want to achieve with your Machine Learning project?


By Jose Quesada

Hi, this is Jose Quesada for Data Science Retreat.

Today I'm going to talk about something I've been thinking about for a while: what do you want to achieve with a project in AI, machine learning, or data science? Everybody cares about having projects, about having something to show. But there are different levels, and that is what I want to go deeper into.

Most people only want to learn a skill set with that project, which is fine.

There are different levels, which we will discuss here, and I would personally opt for the last option: create a product you can charge for. If you fail at that, you still have the others to fall back on.

I have made a video on this topic for those of you who prefer watching to reading; see below.


Learning a skillset is a given. No matter what project you do, you are going to learn something; that's not a big deal. One other goal would be to look competent, where the real goal is to find a job. This is also very common: 'I'm doing this just because I want to put it on my LinkedIn and my GitHub and get the attention of recruiters or hiring managers.' You want to get a job; that's totally fine. So there's a difference between these two, right? The thing you do to learn a skillset is not necessarily the same thing you do when you want to look competent.

 

Now we're getting into deeper goals. You could want to solve a real problem with your project; that's not very common. I would like it to be more common, but it's not. I see people saying 'I want to have a social impact, I want to help people, I want to solve real problems', but it's very unlikely that you find people actually doing something about it with machine learning. And it's totally doable: there are plenty of real problems that real people have which you can solve with machine learning today. It's kind of a shame that we are not doing more of it. So that is one more goal, a deeper one than looking competent to get a job.

 

There is one more level, which is to create a product. A product is not only a project; it's a product because you can charge for it. It is something packaged in such a way that the people who have the problem can, maybe, pay for it. They are willing to do something about it; that is a product. Most people, I would say 99% of the people I deal with today, never get to the point where they take a project and build it up to where they can charge for it. This is not a criticism; there's a real reason for it. Traditional companies all want this too, and they often don't get there either, because of how much money and effort even really big companies spend trying to figure out what will work and, if they build it, how to market it, and so on.

 

It's totally fine if you, as an individual, cannot get to this point and build something you can charge for. But it's a big part. All these companies spend a lot of money on market research, focus groups, trying to get data from their users and using that data to sell something back to them, like Netflix does. Netflix produces series that are influenced by the data it collects. Sometimes you see a series and you can tell: 'this was targeted at overweight teenage women, and they would love this series.' How do they find out there is a market that would want to consume that series? Because they have data; they know what each person is watching. The big difference between you and these big corporations is that they need to solve problems that affect millions, ideally billions, of people.

 

That's good news for you: Google, for example, will never start a new project if it doesn't affect billions of people. That limits their options quite a bit. They are the king of the jungle in machine learning; no other company is doing better with machine learning so far than Google. They are probably good at products as well (you could question that, since they give a lot of products away), but they have a constraint: their products need to serve billions of people. You don't have that problem. You can go to a tiny, narrow group of people that you know very well and help them. You may know them so well that you meet one of them, they identify with your idea, and you identify with their problems. It will be a beautiful match between the two of you.

 

So here are some ways to think about it if you are going to go for the product route, where you want to create a product you can charge for. I'm not saying you need to do that, but if you want to go that way, there are several scales to move along.

You pay People >>>>>>>>>>> They pay you

Sometimes you come up with an idea and you want people to use it, but they don't, and they give you reasons such as 'I don't have time right now.' Then you pay people to use it, or you promise them a favour. Imagine your product is a questionnaire: you tell people that if they fill it in, they get a chance to win a MacBook. Paying people to do something for you is one extreme. The other extreme is where they pay you, and they would be sad if your product stopped existing. You want to move from you paying them towards them paying you. I'm not saying this is easy; it is something companies big and small, startups included, struggle with right now.

You are undifferentiated >>>>>>>>> You are the first solution they think about


You are guessing >>>>>>>>> You have validated the problem and solution

The last scale is about consuming versus producing. You read a lot; you see blog posts about companies being built, about people making money with products; maybe somebody came up with something as simple as a newsletter and they're making money with it, and you read that newsletter, you consume. You should aspire to get to the point where you produce, where you write. People should follow you because of what you write.

You consume, read >>>>>>>>> You produce, write

You can go from guessing the problem to being in a spot where you know the problem really well. You could be the person who knows the problem better than anyone else in the world, and you have the solution. That is a wonderful place to be, so that is the direction to move in.

What is the one hack that moves you along all these scales much faster, from one extreme to the other?

It is picking a niche.

Let's look at the scales again, assuming you are in a niche. It's easier to get people to pay you if you are targeting the right people and they have a problem that is very specific to them, a problem that's worth paying for.

You're probably the first solution they think about if you are the one person talking about this. Nobody else cares about it, but you do, and you care about it very much. You know the problem very well, and you're building a solution for that niche. You can be producing and writing about that niche, and then it's easy for people to follow you and to find you. Basically, all these scales become easier to move along if you have a niche.

This is actually the big advantage you have over Google. You don't need to create a product for a billion people, just a product for a few thousand. Along the way you're going to look competent. So, as I mentioned earlier, with this product you're going to learn a lot, solve a real problem, and you may be able to charge for it.

Therefore, the most important message I want to convey is that it's much easier for you, in every possible way, to aim to create a product you could charge for and then, if needed, fall back to solving a problem, finding a job, and lastly learning a skill. Companies are going to be impressed if you show up with a product that you built with your own hands, that solves a problem in the market, and that you could charge for.

If two data scientists with similar profiles showed up on my doorstep and one of them had built a bankable product, he or she would be way out of the league of anybody else, and that is who I would hire.

To conclude: "Aim to work on building a product rather than a mere project."

How to pick a successful AI project, part 3: Working with models

This post is part of a series on ‘how to pick a successful AI project’. Part 1, Part 2.

Here I'll cover what I've learned working with models and how it matters when picking an AI project that will succeed. I have mentored over 165 of them (and counting) at Data Science Retreat.

Not much theory

If you ask a meetup presenter ‘how did you pick your architecture?’ the most likely answer is something similar to ‘I copied it from a blog post or paper, then tweaked it.’ There’s little theory that guides how to pick an architecture. The field seems to be at the stage of medieval apprenticeships, where apprentices copy the work of masters. Although the field produces 1000s of papers per year, there’s very little in terms of theory. Practitioners generate ‘rules of thumb’ to pick architectures and train models. One good example is Andrew Ng’s book ‘machine learning yearning.’

This book is dense with the closest thing we have to 'theory' for picking architectures and fine-tuning models.

You need to have a gold standard

Your model must improve a KPI that matters. That means that there’s something observable, measurable that the model does better than the baseline (which often means no model at all).

Supervised learning is better than unsupervised in that you can justify what your model does.

If you use cluster analysis, your boss can always say, 'You show me 3 clusters. Why not 5? I think there are 5.' There's no correct answer to this. A supervised model has clear performance measures, plus it can often be checked 'by eye' (hey, your dog classifier missed this dog that looks like a cranberry cupcake).
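As a minimal sketch of what 'better than the baseline' looks like in practice (scikit-learn with synthetic data; in a real project the baseline might be no model at all, or a simple business rule), you compare your supervised model against a no-skill baseline on the same held-out set:

```python
# Sketch: a supervised model must beat a dumb baseline on a measurable KPI.
# Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
```

If the second number isn't clearly above the first, you don't have a story to tell your boss yet.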

Use a pretrained model

With transfer learning, instead of starting the learning process from scratch, you start from patterns learned when solving a different problem. This way, you leverage previous learning and avoid starting from scratch.

When you're repurposing a pre-trained model for your own needs, you start by removing the original classifier, and then you add a new classifier that fits your purposes. You save time (weeks, when dealing with big deep networks with millions of parameters).
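A rough sketch of that recipe, here with PyTorch/torchvision (details vary by framework, and the number of classes below is just an example value):

```python
# Sketch: reuse a pretrained backbone and train only a new classifier head.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5                              # replace with your own number of classes

model = models.resnet18(pretrained=True)     # patterns learned on ImageNet
for param in model.parameters():
    param.requires_grad = False              # freeze the pretrained layers

# Drop the original 1000-class ImageNet classifier and attach your own head.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head gets trained, which needs far less data and compute.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```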

There are repositories of models in the public domain, for example:

https://modelzoo.co/

https://github.com/Cadene/pretrained-models.pytorch

Try also https://paperswithcode.com. As the perfect name indicates, it’s a searchable collection of papers that have a public implementation, an excellent place to start.

If you have done fast.ai or many of the other ML courses out there, you know more than enough to start reusing pretrained models. Even if you cannot find a pretrained model that matches your problem, using one that is barely related usually works better than starting from scratch. More so if your data is complex and your architecture will be more than a dozen layers. It takes a long time (and big hardware!) to train big architectures.

It’s good to stay somewhat on top of new models; tracking state of the art is not necessary, but for sure, it’s now easier than ever. Twitter will tell you if anything significant has just popped up. If someone made a great demo, Twitter would catch fire. Just follow a few people who post about these things often.

To navigate arXiv, try Arxiv Sanity (good for picking up trends; I don't recommend making paper reading a priority if you want to be an ML practitioner, since you will likely need to move so fast to deliver results that reading papers becomes a luxury you cannot afford). About talk videos: https://nips.cc now has videos for most talks. 'Processing' NeurIPS is a giant job, so it's easier to read summaries from people soon after they attend.

Most projects I supervised (at least in the last year or two) used transfer learning. Think about your former self, 10 years ago. Would your past self be surprised to hear that in the future anyone could download a state-of-the-art ML model that took weeks to train, and use it to build anything they want?

Published papers in all scientific disciplines use ML, but their models are not very strong; improving them is a quick win

Take, for example, this paper on how to predict battery life from discharge patterns, published in Nature, one of the best scientific journals. Their machine learning is rudimentary at best; this is to be expected, as the authors are in electrical engineering, not machine learning. The team focused more on their domain knowledge in electrical engineering than on the machine learning part. A very astute team of participants at Data Science Retreat batch 18 (Hannes Knobloch, Adem Frenk, and Wendy Chang) saw an opportunity: what if we make better predictions with more sophisticated models? They not only managed to beat the performance of the model in the paper; they got an offer from Bosch to continue working on it for 6 months (paid! no equity stake). They declined the offer because they all had better plans after graduation.

There’s an entire world of opportunity doing what Hannes, Adem, and Wendy did; so many papers out there provide data and a (low) benchmark to beat. Forget about doing Kaggle competitions; there’s more opportunity in these high profile papers!

Avoid the gorilla problem

What follows only applies to models that produce results that a user sees. If your model's end-user is another machine (for example, you produce an API that other machines consume), you can skip this section.

Your ML model provides value to your users, but only as long as they trust the results. That trust is fragile, as you will see.

In 2015, Google Photos used machine learning to tag the contents of pictures and improve search. While the algorithm had accuracy levels that led Google execs to approve it for production, it produced catastrophic mislabelings: it tagged photos of Black people as 'gorillas.' You can imagine the PR disaster this was, both for Google and for machine learning as a field. Google issued a fix, but the first fix was not sufficient, so Google ultimately decided not to give any photos a "gorilla" tag.

What do we learn from this? If your problem depends on an algo that has any chance of misclassifying something that breaks trust: pick another problem.

In the 200 projects I supervised, when a team brought up an idea that had the ‘gorilla problem,’ I steered them away from it. You can spend months doing stellar ML work that is invalidated by the gorilla problem. Another example is tagging ‘fake news’: if your algo tags one of my opinion leaders (one I trust blindly) as ‘fake news,’ you have lost me forever.

Multiple models doing a single task (example: picking up cigarette butts)

Making the self-driving car work is a constellation of engineering problems. Many different ML models work in unison to take you to where you want to go (pardon the pun).

An example from our labs: Emily, the self-driving toy car that finds and picks up cigarette butts (mentioned before), performs three subtasks:

– Identify cigarette butts

– Move the car close enough so that the cigarette butts are within reach

– Pick the cigarette butt (stabbing)

Each subtask is a model.

Note that cigarette butts are incredibly poisonous (one can contaminate 40 liters of water) and hard to pick up with a broom because they are so light. As a result, they accumulate in public areas. A swarm of these robots could have a serious ecological impact. Of course, it's still early days, and plenty of practical problems remain: would people steal the cars for parts? Even if they don't, would they share the street with autonomous robots that will make plenty of mistakes and may block their path at crucial times?

One lesson to learn is that combining 3 models lets you solve a problem that was unreachable otherwise. Each problem in isolation may not be that tough; in fact, it might be a solved problem.

Understanding context

What problem is this model trying to solve? You know these ‘product guys’? They think people “hire” products and services to get a job done. What is the job that your model is getting done?

This might be obvious at times, but not so obvious some other times, and there lies opportunity.

Imagine that you work for a hospital. After lots of deliberation, your boss has decided that your next task will be to build a model that predicts when an intensive care patient is going to crash. What is the ‘job’ of this model?

One way to look at it: the job is to save lives.

One other way to look at it: the job is to use the hospital's resources optimally. When a patient crashes, it takes a lot of people to get her stable again. Every nurse and doctor who has anything to do with this patient rushes to the room and abandons whatever task they were doing. Retaking a task is costly; task switching is very inefficient. Chronometer in hand, try doing two tasks A and B multiple times, AAAABBBB versus ABABABAB: the second takes longer, for pretty much any tasks A and B. This is why getting distracted by a notification is so damaging for productivity.

In any case, whether you think your model is saving lives (period) or allocating hospital resources optimally (to save more lives!) makes all the difference.

Because ‘bosses’ who are not ‘close to the metal’ cannot really estimate what the right job for the model is, you will have to do it. It’s a good fit for the analytical mind of the data scientist.

And there you have it; a complete manual to pick a successful AI project, in three installments. I hope this was useful, and that it helps you solve problems real people have with machine learning. There’s never been a better time to be alive. So much low hanging fruit, so much leverage thanks to this flavor of technology, so much productivity gains if we manage to educate a few more thousand people in AI. If this manual has helped, I’d love to hear from you and witness the project you built. Send it to me on twitter at @quesada, my DMs are open.

How to pick a successful AI project, 2: working with data

This post is part of a series on ‘how to pick a successful AI project’. Part 1, Part 3.

Here I cover hacks and tricks concerning working with data when selecting an AI project.

Your minimum viable dataset is smaller than you think

How can we double food production by 2050 to feed 9 billion people? Could the solution start with two people walking around taking pictures with a smartphone?

For an example of a bootstrapped dataset, take Blue River Technology. They do precision agriculture: herbicide usage can be reduced by 90% by spraying precisely, only at the right spots.

See & Spray machines use deep learning to identify a greater variety of plants with better accuracy and then make crop management decisions on the spot. Custom nozzle designs enable <1-inch spray resolution. They focus on cotton and soybeans.

In September 2017, John Deere acquired Blue River for $300 million.

What was the dataset Blue River started with? It was collected by a handful of people with phones, taking pictures while walking down entire crop fields.

Often the amount of data you need for proof of concept is ‘not much’ (thanks to pretrained models!).

You can collect and label the data yourself, with tiny, tiny resources. Maybe 50% of the portfolio projects I have directed started with data that didn’t exist and the team generated them. Jeremy Howard, the founder of fast.ai, is also adamant about destroying the myth that ‘you need google-sized datasets’ to get value out of AI today.

What a great opportunity; it's exciting to be alive now, with so many problems solvable with off-the-shelf tech. A farming robot that can crunch mobile camera data opens a new path. Next time you think up AI projects, use Blue River as a reference for what's possible with data you create.

Knowing “your minimum viable dataset is smaller than you think” opens up the range of projects you can tackle tremendously.

Because of pretrained models, you don’t need as much data

Model zoos are collections of pretrained networks. Each network there saves a tremendous amount of time AND data for anyone solving a problem similar to one already solved (see the repositories listed in part 3 of this series, such as modelzoo.co).

Curators at model zoos make your life far easier. With the recent success of ML, researchers are getting bigger grants and publishing models from academia that have enjoyed powerful machines and weeks of computation. Industry leaders publish their models often, in the hope of attracting talent or nullifying a competitor's advantage.

80% of the time of a data scientist is cleaning data; the other 20% is bitching about cleaning data

It's a joke, but it's not far from the truth. Don't be surprised if you spend most of your time on data preparation.

Data augmentation works well on images

Don't do it 'by hand'; today there are libraries for the most common data augmentation tasks (horizontal and vertical shift, horizontal and vertical flip, random rotation, etc.). It's a pretty standard preprocessing step.
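A small sketch with torchvision transforms (Keras and Albumentations offer equivalents; 'photo.jpg' is a placeholder path):

```python
# Sketch: standard image augmentations as a single transform pipeline.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # horizontal/vertical shift
    transforms.ToTensor(),
])

image = Image.open("photo.jpg")
augmented_tensor = augment(image)  # a different random variant on every call
```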

Get unique data by asking companies/people

At Data Science Retreat, one team once needed bird songs. It turns out there's an association of bird watchers with a giant dataset of songs, and they gave it to our team, no questions asked.

Companies may have data they don’t care much about. Some other times, they do care about them but still would give them to you if you offer something of value in exchange, such as a predictive model they can use. Governments often have to give you data if you request it.

If you run out of ideas to get data with more traditional means, try asking for it. Sending a few emails may be a good use of your time. Academics should share their data if they published a paper about it and you asked. It doesn’t always work, but it’s worth trying. Small companies looking for any advantage may want to partner with you as long as they get some benefit.

Compute on the edge (federated learning), avoid privacy problems

This is our life in 2019: Huawei's new 'superzoom' P30 Pro smartphone camera can identify people from far away, neural networks are being applied to lip reading, and progress with computer vision systems that can re-identify the same person when they change location (for example, emerging from a subway) all indicates that mass surveillance is growing in technical sophistication. Privacy is 'top of mind.'

Corporations get access to more and more of citizens' private data; regulators try to protect our privacy and limit access to such data. Privacy protection doesn't come without side effects: it often produces situations where scientific progress suffers. For example, GDPR in Europe seems to be a severe obstacle for both researchers and companies applying machine learning to lots of interesting problems. At the same time, datasets such as health data benefit from privacy protection; imagine if you had a severe illness and an employer would not hire you because of it.

A solution: What if, instead of bringing the corpus of training data to one place to train a model, you could bring the model to the data wherever it’s generated? That is called federated learning, first presented by Google in 2017.

This way, even if you don’t have access to an entire dataset at once, you can still learn from it.
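A toy sketch of the core idea, federated averaging, using plain NumPy and fabricated 'local' datasets (real frameworks such as TensorFlow Federated or PySyft add secure aggregation and differential privacy on top): each data owner computes an update locally, and only the averaged weights travel.

```python
# Toy sketch of federated averaging: the model travels, the data stays put.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three 'hospitals', each keeping its own data locally.
local_datasets = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    local_datasets.append((X, y))

w = np.zeros(2)                              # global model weights
for _ in range(50):                          # communication rounds
    local_updates = []
    for X, y in local_datasets:              # in reality this loop runs on each device
        w_local = w.copy()
        grad = X.T @ (X @ w_local - y) / len(y)
        w_local -= 0.1 * grad                # one local gradient step
        local_updates.append(w_local)
    w = np.mean(local_updates, axis=0)       # only averaged weights leave the devices

print("recovered weights:", w)               # close to true_w without pooling the data
```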

Andrew Trask, harbinger of federated learning

If you are interested in this topic, follow Andrew Trask. He has a coursera course on federated learning and a handy jupyter notebook with a worked-out example.

Why is this important in a conversation about picking AI projects? Because if your project uses federated learning, you may have a much easier time getting people to give you data and use your product. It opens the door to a different class of projects.

Data reuse and partnerships

Data has excellent secondary value.

That is, often, you can find uses for data that the entity collecting it didn’t think about. Often through partnerships, new uses of data can produce a secondary revenue stream. For example, you can integrate:

– data on fraud

– data from credit scoring,

– data from churn,

– data about purchases (from different sources)

The organisation that published these data (containing personal data) might use a license that restricts your usage. You need to check if they explicitly have a license for data reuse, otherwise it is best to contact them and agree on a license for your project.

Beware of problems though:

1. If your product depends on data that only one single partner can produce, you are in their hands. The moment they decide to end the partnership, they end your business.

2. Data integration will be difficult, more so if the only shared variable in different datasets is a person. Data that would help identify a person is subject to regulation in some parts of the world.

Even if you can legally integrate these different data sources, remember there are entire teams dedicated to integration in big companies. Never assume this will go smoothly.

Using public data is not the ‘only option.’ Some people will complain that for every dataset out in the open, there are plenty of projects already. It’s hard to stand out. Maybe it’s worth it to email people in industry to get some data nobody else has (data partnerships). If you offer the result of your models in exchange for access, some companies may be persuaded.

How about doing Kaggle? Not a great idea for a portfolio project, because 1/ the hard part of finding the problem and the data is already done, and 2/ it's hard to stand out from the crowd of competitors who probably spent more time than you fitting models and have better performance.

Finding secondary use in data is a fantastic skill to have. Coming up with project ideas trains this skill.

Use unstructured data

For decades, all data ML could consume was in the form of tables. Those excel files flying around as attachments, those SQL databases… tabular data was the only thing that could benefit from ML.

Since the 2010s, that changed.

Unstructured data are:

  1. Images
  2. Free text written by real people, language that doesn't follow a pre-defined model and is riddled with nuances
  3. Video
  4. Audio (including speech)
  5. Sometimes, sensor data (streams)

Data that is defined as unstructured is growing at 55-65 percent each year. Emails, social media posts, call center transcripts,… all excellent examples of unstructured datasets that can provide value to a business.

You may think that ‘everyone knows this, so why mention it ‘… but in my experience, there are large companies left and right that didn’t receive the memo. If you work for one of these and happen to find a use case for unstructured data that they may have, you are onto something that could be a career changer.

Take, for example, banks. For them, data traditionally means numerical information from markets and security prices. Now satellite images of night light intensity, oil tank shadows, and the number of cars in parking lots can be used to estimate economic activity.

In my experience at Data Science Retreat, most people chose unstructured data for portfolio projects. And it’s easy to pass ‘the eyebrow test’ with these. Plus, they are abundant in the wild. Everyone has access to pictures, text… in contrast to tabular data such as money transactions.

One downside: unstructured data can trigger compliance issues. You never know what is lurking on giant piles of text. Is there confidential information in these emails? Is our users’ personal data leaking, even when we tried to anonymize them?

You may remember the AOL fiasco. On August 4, 2006, AOL Research released a compressed text file on one of its websites containing twenty million search keywords for over 650,000 users over a 3-month period intended for research purposes. AOL deleted the search data on their site by August 7, but not before it had been mirrored and distributed on the Internet. AOL did not identify users in the report; however, personally identifiable information was present in many of the queries, creating a privacy nightmare for the users in the dataset.

There you go; things I’ve learned about picking a good project that have to do with collecting and cleaning data.

In the last part of this series, I’ll cover what I’ve learned on model building that affects how you pick an AI project.

How to pick a successful AI project, part 1: Finding the problem and collecting data

This post is part of a series on ‘how to pick a successful AI project’. Part 2, Part 3.

Imagine that you have 200 hrs of your life to improve your career prospects in ML. What is the best use of this time? I’m betting on doing a portfolio project.

Let’s compare different options:

1. You could go to meetups

2. You could do tutorials, follow template code and be the Github user # 2245 who has this same exercise

3. You could take classes or MOOCs

Option 1 relies on serendipity. Assume you meet a potential job opportunity every 5 meetups (which could be a bit optimistic!). You may well spend a year going to meetups and getting exactly zero job opportunities if:

  • you are not looking amazing on paper/online, and
  • you don’t communicate very well in person

Assuming commuting to and attending a meetup takes 5 hours, how many meetups can you do? 40. That's 20 conversations worth having. Still, without some strong signal that you can perform (such as work experience, or a tangible ML project you built from scratch), you will not convert these random conversations into job offers.

Options 2 and 3 (tutorials and MOOCs) don't differentiate you enough from the masses. They are a prerequisite for reaching the level that would make a hiring manager take notice. Because work experience is hard to get when you are trying to get into a new field (a chicken-and-egg problem), that leaves you with my preferred option, and the reason I wrote this series of posts: show the world what you can do by having an AI portfolio project.

A substantial project strongly dominates the other options

Once I asked Ted Dunning: “What is the one thing you care about as an interviewer?”. His answer: “I only care about one thing: What you have done when nobody told you what to do.”

Having a portfolio of projects shows creativity, which is extremely important in data problems.

Data cleaning, even model fitting, is somewhat grunt work. We will likely automate this in the future.

What is not automatable (and where you want to excel) is at:

• finding a problem

• That is worth solving (produces business value; helps someone)

• and is solvable with current tech

Ok, so you want to do a good AI project

The rest of this post is what I’ve learned after mentoring more than 150 AI projects over five years at the companies I’ve founded.

I’ve grouped what I know into four classes:

• Finding the problem

• Collecting data

• Working with data

• Working with models

Finding the Problem

What human task is your project helping or displacing? If the task’s decisions take longer than a second for a human, pick another one

Machines are good at helping humans with boring tasks. The rule of thumb (from Andrew Ng) is that you want to tackle tasks that take a human less than one second. Driving, for example, is a long series of sub-second decisions. There’s little deep thinking there. Writing a novel is not like that. While NLP progress may look like we are getting closer and closer, writing a novel with AI is not a good project.

Myth: you need a lot of data; you need to be Google to get value out of machine learning

When looking for problems, you are looking for data too. Often you feel there’s no data to address the problem you found. Or not enough data. Data is valuable, and those who have data don’t make them public. There’s plenty of public data, but nothing that matches what you need.

You come up with hacks, of course. Maybe you can combine different sources? Maybe you generate and label the data yourself?

Then you read online that you need giant datasets. You are screwed. You are not going to label millions of images just by yourself.

It’s true that for very complex deep learning models, you need those giant datasets. An excellent boost for ML in the 21st century is that now we have lots more data, and compute power, to train more complex models.

Many of the revolutionary results in computer vision and NLP come from having a lot of data.

Does this mean that you cannot do productive machine learning if you have no big compute clusters or datasets in the Terabyte range? Of course not.

Libraries contain pretrained models (like Inception V3, ResNet, AlexNet). You can use them to save computation time.

Take, for example, YOLO (You Only Look Once). Anyone with an off-the-shelf GPU can do real-time object detection and segmentation in video. That is amazing. How many projects can you conceive with this feature? Make it an exercise: list 10 projects that you could build based on YOLO. By the time you read this, it's likely there's a new algorithm that does the task better. It's a beautiful time to be alive. You don't need terabyte-sized datasets to get value out of machine learning.
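As a hedged sketch of how little code an off-the-shelf detector takes (here torchvision's pretrained Faster R-CNN as a stand-in for a YOLO-style detector; 'street.jpg' is a placeholder path):

```python
# Sketch: object detection with a pretrained torchvision model, no training needed.
import torch
from PIL import Image
from torchvision import models, transforms

detector = models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

image = transforms.ToTensor()(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    prediction = detector([image])[0]        # dict with boxes, labels and scores

for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.8:                          # keep only confident detections
        print(int(label), [round(c, 1) for c in box.tolist()], float(score))
```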

Pick a problem that passes ‘the eyebrow test’

The skill of picking a problem is as useful as the skill to solve it. Some problems will capture your imagination, and some will leave you unmoved. Pick the first.

And then there are those projects that make people think that what you are saying is impossible. You want one of those.

How do you know you picked the right problem? Use ‘the eyebrow test.’ You want to see the eyebrows of the other person going up when they hear it. If they don’t, keep looking.

Some problems are mere curiosities and can suck your resources. Don’t go for those. There’s tremendous potential for impact using ML right now. Why spend your time creating a translation algorithm from Klingon to Dothraki (both invented languages) when you can make people’s lives easier?

At Data Science Retreat we believe that because there are plenty of ‘good problems,’ there’s no excuse to pick one that doesn’t pass the eyebrow test. It takes time to find a good problem. How long should you spend? As much as needed to pass the eyebrow test.

It helps if you are in the surroundings of people who have found a good problem before. Sometimes, those with deep domain knowledge in ML are not as useful as you may think to pick up ideas.

You are selling the value of AI. When you do ‘eyebrow test’ projects, you help not only the people-who-have-that-problem (always keep them in mind) but… your fellow data scientists. The more impressive your projects are, the more the general public will appreciate how transformative AI is for society. And the more opportunities you create for everyone else in the field.

You only need to be in the ballpark of a good idea

One participant came to me saying that he was running late in picking an idea and that he had exhausted all his possibilities. He had nothing.

“Ok, let’s look at what you are passionate about. What’s alive in you?”

“Well,… I hate waste.”

So we googled ‘waste trash machine learning.’ Nothing too obvious, nor too exciting, came about. After some more search, we found a Stanford project that categorized trash. The students had a trash dataset, painfully labeled. Still, this was not exactly an idea that passes the ‘eyebrow test.’

“What if instead of categorizing trash, we could build something that picks up trash?” (iterating on the idea)

“You mean something that goes around autonomously?”

“Yes, a self-driving toy car. There was one project in batch 08 of DSR that did exactly that. The car ran laps on a circuit. You would have to modify it to identify trash, get close to it, and pick it.”

“That sounds amazing!” (eyebrow test passed)

With time, this ballpark idea morphed from general trash to picking up cigarette butts, which are terrible for the environment. Details about how to pick the cigarette butts improved with time: stabbing them, instead of trying to grab them with a robotic arm. We will talk more about this project later in this series.

Pick a problem that moves you. If nothing does, use ‘watering holes’ to listen to problems that move people

If you have the problem you want to tackle, you have a significant advantage. You understand the need. You can build the solution and know how well it works by applying it to yourself. You can tell between ‘nice to have’ and ‘pain point.’ At this point, what we are doing is no different from what startup founders and product managers do.

You have in your hands the closest thing to a shortcut: having the problem yourself. You will save time by not going into detours that don't help. You will have a keen sense of what to build.

In 2014 I was building a company, which eventually failed, that did customer lifetime value (CLV) predictions for e-commerce stores. The product, CLV predictions, was something I could deliver myself as a consultant, so I became a 'CLV consultant.' One of my clients was hiring a data scientist, and they hired me full time, so I magically transitioned into data science.

Many others had the same problem: they had tech skills, maybe a Ph.D., and they wanted to become data scientists, but they didn’t know how. Remember, this was 2014, before the web started boiling with advice on how to become a data scientist. I built a business around helping others solving this problem, and it’s been working well for the last five years.

I knew the problem well: too much information, unclear guidelines, interviewers who don't know how to recognize talent. Every step of the way, I felt I knew what I was doing with this business, a feeling that is extremely valuable. Transitioning to data science is an excellent problem; 5 years later, people still struggle with it.

I don't play golf. If I wanted to build a product for golfers, I'd be entirely lost. I would build features nobody needs; I would miss the pain points. Even if I ran interviews and listened to the market, I would be at a disadvantage compared to a golfer.

So my advice to pick a portfolio project: pick a problem you know well. Even better if it’s a problem that moves you. If you lost three friends to suicide, build something to prevent suicide. In this case, you don’t have the problem yourself, but you have a strong motivation to solve it.

What if you have no problems whatsoever? You have been in the same industry forever (say Oil and Gas), and all valuable problems there got taken care of!

I don’t believe you. There’s no industry so mature that all problems are solved. But, ok, you cannot come up with something that moves you, some problem that you have.

Then observe what problems other people have. Large groups of people. They tend to congregate in public spaces and bitch about their problems; every time you see people bitching about something… turn that into an opportunity to do a project.

Which public spaces? I call these ‘watering holes’ (HT to Amy Hoy). Online, you can have obscure forums, but Reddit and twitter are the easiest. Just sit there and ‘listen’ to people discuss the problem. Learn every detail of it. Is it a real problem, or a ‘nice to have’?

For example, you may join a gaming subreddit to see if gamers care about having stronger AI in videogames. Or if they care for VR. These ideas are too abstract to be good project ideas, but you get where I’m going.

Collecting data

Integrate distinct data sources

Companies are often so focused on getting value out of the data they have that they forget they can increase value by using data that is not inside the company but publicly available.

There’s plenty of open access data. And APIs for data that changes frequently. There’s no reason not to use multiple data sources. You can solve a more interesting problem (one that was not obvious before) by integrating APIs.

For a collection of APIs, check https://www.programmableweb.com/.

Problems that seemed impossible with a single source of data become solvable when you add a new data source. Boring projects come alive. Thankless tasks become a joy to work with if you manage to find a twist that shows more value.

When using multiple data sources, you have to stitch them together using a shared key (a column that is present in both datasets.) You cannot combine data sources that don’t have a shared key, and that tends to be a showstopper for many ideas.
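As a tiny illustration with pandas (the column names and values are made up), the stitching is usually a join on that shared key:

```python
# Sketch: combining two data sources on a shared key.
import pandas as pd

purchases = pd.DataFrame({"user_id": [1, 2, 3], "amount": [20.0, 35.5, 12.0]})
weather = pd.DataFrame({"user_id": [1, 2, 3], "rainy_days": [4, 0, 9]})

combined = purchases.merge(weather, on="user_id", how="inner")  # shared key: user_id
print(combined)
```

Without a column like `user_id` present in both tables, there is nothing to merge on, which is the showstopper mentioned above.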

Instead of collecting or reusing data, produce your own data

You don't need to find data, or own it. Thanks to pretrained models (see the later section on them), you don't need all that much data, which means you can produce it yourself. Removing 'I don't have much/any data' as an obstacle opens up the space of problems you can tackle.

To produce original data, I found one big hack: use hardware. Sensors are cheap, and they give you data that you own.

JD.com's Shanghai fulfillment center uses automated warehouse robotics to organize, pick, and ship 200k orders per day; 4 human workers tend the facility. JD.com grew its warehouse count and surface area 45% year over year. AI affects manufacturing too. It's easier than ever to produce things en masse, and this means there are a lot more hardware 'toys' on the market. Things you would never have considered affordable, like microscopes and spectroscopes, are reaching the mass consumer market and are eminently hackable. These are wonderful data sources!

Because of Shenzhen, Kickstarter, etc., hardware is evolving far faster than before. It's never going to be as fast to iterate on as software, but we are getting there. Have you checked what's available on AliExpress? There are multiple sensors you can buy for under 100 bucks. Attach one of these to a phone running your code, and you have a portable, purpose-specific machine.

For example, you can buy a microscope and use it to detect malaria without a human doctor. Deep learning running on a phone is good enough to count parasites in blood.

AliExpress is full of cheap hardware that you can attach to a phone. You can add a lot of value to someone's life using a mixture of phones (that run the ML code) and cheap sensors.

Example: our Malaria Microscope.

Eduardo, AIscope’s founder, after reaching 1000x the first time

Malaria kills about 400k people per year, mostly children. It’s curable, but detecting it is not trivial, and it happens in parts of the world where hospitals and doctors are not very accessible. Malaria parasites are quite big, and a simple microscope can show them; the standard diagnostic method involves a doctor counting them.

It turns out you can attach a USB microscope to a mobile phone and run DL code on the phone that counts parasites with accuracy comparable to a human. Eduardo Peire, DSR alumnus, started with a public malaria dataset. While small, it was enough to demonstrate value to people, and his crowdfunding campaign got him enough funds to fly to the Amazon and collect more samples. You can follow their progress here: http://aiscope.net.

For another example, you can buy a spectroscope that, pointed at any material, tells you its composition. It's small enough to attach to a phone as a hand-held scanner. Can you detect traces of peanuts in food? Yes! There you go, a solution to a problem real people have. If you are allergic to peanuts, this buys you real quality of life.

Sensors are cheap nowadays, and they will help you get unique data. You can turn a phone into a microscope, a spectroscope, or any other tool. The built-in camera and accelerometer are excellent sources of data too.

Next on this series: what I’ve learned about working with data and how this can help you pick successful AI projects.

Understanding a Machine Learning workflow through food

Photo by Cel Lisboa on Unsplash

Originally posted on Towards Data Science.

Through food?!

Yes, you got that right, through food! :-)

Imagine yourself ordering a pizza and, after a short while, getting that nice, warm and delicious pizza delivered to your home.

Have you ever wondered about the workflow behind getting such a pizza delivered to your home? I mean the full workflow, from the sowing of tomato seeds to the bike rider buzzing at your door! It turns out it is not so different from a Machine Learning workflow.

Really! Let’s check it out!

This post draws inspiration from a talk given by Cassie Kozyrkov, Chief Decision Scientist at Google, at the Data Natives Conference in Berlin.

Photo by SwapnIl Dwivedi on Unsplash

1. Sowing

The farmer sows the seeds that will grow to become some of the ingredients to our pizza, like the tomatoes.

This is equivalent to the data generating process, be it a user action, be it movement, heat or noise triggering a sensor, for instance.

Photo by no one cares on Unsplash

2. Harvesting

Then comes the time for the harvest, that is, when the vegetables or fruits are ripe.

This is equivalent to the data collection, meaning the browser or sensor will translate the user action or the event that triggered the sensor into actual data.

Photo by Matthew T Rader on Unsplash

3. Transporting

After the harvest, the products must be transported to their destination to be used as ingredients in our pizza.

This is equivalent to ingesting the data into a repository from which it is going to be fetched later, like a database or data lake.

Photo by Nicolas Gras on Unsplash

4. Choosing Appliances and Utensils

For every ingredient, there is the most appropriate utensil for handling it. If you need to slice, use a knife. If you need to stir, a spoon. The same reasoning is valid for the appliances: if you need to bake, use an oven. If you need to fry, a stove. You can also use a more sophisticated appliance like a microwave, with many, many more available options for setting it up.

Sometimes, it is even better to use a simpler appliance — have you ever seen a restaurant advertise “microwaved pizzas”?! I haven’t!

In Machine Learning, utensils are techniques for preprocessing the data, while the appliances are the algorithms, like a Linear Regression or a Random Forest. You can also use a microwave, I mean, Deep Learning. The different options available are the hyper-parameters. There are only a few in simple appliances, I mean, algorithms. But there are many, many more in a sophisticated one. Besides, there is no guarantee a sophisticated algorithm will deliver a better performance (or do you like microwaved pizzas better?!). So, choose your algorithms wisely.
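A small sketch of that difference in scikit-learn (with MLPRegressor standing in for the 'microwave'; the exact counts depend on your library version): simply count how many settings each appliance exposes.

```python
# Sketch: simple appliances expose few settings, sophisticated ones expose many.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

for appliance in (LinearRegression(), RandomForestRegressor(), MLPRegressor()):
    n_settings = len(appliance.get_params())
    print(f"{type(appliance).__name__}: {n_settings} hyper-parameters")
```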

Photo by S O C I A L . C U T on Unsplash

5. Choosing a Recipe

It is not enough to have ingredients and appliances. You also need a recipe, which has all the steps you need to follow to prepare your dish.

This is your model. And no, your model is not the same as your algorithm: the model includes all the pre- and postprocessing required by your algorithm, as in the sketch below. And, speaking of pre-processing…
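A minimal illustration of that distinction, using scikit-learn (the step names are just for illustration): the 'recipe' bundles preprocessing and the algorithm into one object.

```python
# Sketch: the model (recipe) = preprocessing + algorithm, not the algorithm alone.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ("scale", StandardScaler()),         # preprocessing: the washing and slicing
    ("classify", LogisticRegression()),  # the algorithm: the appliance
])
# model.fit(X_train, y_train) would train the whole recipe, not just the algorithm.
```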

Photo by Caroline Attwood on Unsplash

6. Preparing the Ingredients

I bet you the first instructions in most recipes are like: “slice this”, “peel that” and so on. They don’t tell you to wash the vegetables, because that’s a given — no one wants to eat dirty vegetables, right?

Well, the same holds true for data. No one wants dirty data. You have to clean it, that is, handle missing values and outliers. And then you have to peel it and slice it, I mean, pre-process it, like encoding categorical variables (male or female, for instance) into numeric ones (0 or 1).
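As a tiny sketch of that step (pandas, with made-up data), cleaning and encoding could look like this:

```python
# Sketch: handling missing values and encoding a categorical column.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40],
    "sex": ["male", "female", "female"],
})

df["age"] = df["age"].fillna(df["age"].median())           # handle missing values
df = pd.get_dummies(df, columns=["sex"], drop_first=True)  # male/female -> 0/1
print(df)
```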

No one likes that part. Neither the data scientists nor the cooks (I guess).

Photo by Bonnie Kittle on Unsplash

7. Special Preparations

Sometimes you can get creative with your ingredients to achieve either a better taste or a more sophisticated presentation.

You can dry-age a steak for a different flavor or carve a carrot to look like a rose and place it on top of your dish :-)

This is feature engineering! It is an important step that may substantially improve the performance of your model, if done in a clever way.

Pretty much every data scientist enjoys that part. I guess the cooks like it too.

Photo by Clem Onojeghuo on Unsplash

8. Cooking

The fundamental step — without actually cooking, there is no dish. Obviously. You put the prepared ingredients into the appliance, adjust the heat and wait a while before checking it again.

This is the training of your model. You feed the data to your algorithm, adjust its hyper-parameters and wait a while before checking it again.

Photo by Icons8 team on Unsplash

9. Tasting

Even if you follow a recipe to the letter, you cannot guarantee everything is exactly right. So, how do you know if you got it right? You taste it! If it is not good, you may add more salt to try and fix it. You may also change the temperature. But you keep on cooking!

Unfortunately, sometimes your pizza is going to burn, or taste horribly no matter what you do to try to salvage it. You throw it in the garbage, learn from your mistakes and start over.

Hopefully, persistence and a bit of luck will produce a delicious pizza :-)

Tasting is evaluating. You need to evaluate your model to check if it is doing alright. If not, you may need to add more features. You may also change a hyper-parameter. But you keep on training!

Unfortunately, sometimes your model is not going to converge to a solution, or make horrible predictions no matter what you do to try to salvage it. You discard your model, learn from your mistakes and start over.

Hopefully, persistence and a bit of luck will result in a high-performing model :-)

Photo by Kai Pilger on Unsplash

10. Delivering

From the point of view of the cook, his/her work is done. He/she cooked a delicious pizza. Period.

But if the pizza does not get delivered nicely and in time to the customer, the pizzeria is going out of business and the cook is losing his/her job.

After the pizza is cooked, it must be promptly packaged to keep it warm and carefully handled so it doesn't look all squishy when it reaches the hungry customer. If the bike rider doesn't reach his/her destination, loses the pizza along the way or shakes it beyond recognition, all the cooking effort is good for nothing.

Delivering is deployment. Not pizzas, but predictions. Predictions, like pizzas, must be packaged, not in boxes, but as data products, so they can be delivered to the eager customers. If the pipeline fails, breaks along the way or modifies the predictions in any way, all model training and evaluation is good for nothing.


That’s it! Machine Learning is like cooking food — there are several people involved in the process and it takes a lot of effort, but the final result can be delicious!

Just a few takeaways:

    • if the ingredients are bad, the dish is going to be bad; no recipe can fix that, and certainly no appliance either;
    • if you are a cook, never forget that, without delivering, there is no point in cooking, as no one will ever taste your delicious food;
    • if you are a restaurant owner, don't try to impose appliances on your cook (sometimes microwaves are not the best choice), and you'll get a very unhappy cook if he/she spends all his/her time washing and slicing ingredients.

I don’t know about you, but I feel like ordering a pizza now! :-)

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.