Note: The code for this post can be found here
Understanding how the variables are distributed in the data is an important step and should happen early in the Exploratory Data Analysis (EDA) process. There are a number of tools available to analyze the distribution of data. Visualization aids are likely the most popular because a well-constructed chart can quickly answer important questions about the data. For example:
Note: The code for this post can be found here
In this article, we’re going to build a simple Rock Paper Scissors game in Python with two different approaches: Rules-Based System vs. Machine Learning. Through this comparison, I hope to convey how Machine Learning works and what motivates it. To be clear, this is not an illustration of Image Recognition or Pattern Recognition (which hand will a player choose next) but rather of Machine Learning concepts in general. As automation continues to revolutionize the future of work across industries, companies must explore different ways to streamline their operations. …
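To make the rules-based side of that comparison concrete, here is a minimal sketch of how the outcome of a round could be hard-coded. The names `BEATS` and `judge` are hypothetical and are not taken from the post's own code.

```python
# Hypothetical rule table: each key beats the hand it maps to.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def judge(player_one: str, player_two: str) -> str:
    """Decide a round with explicit, hand-written rules (no learning involved)."""
    if player_one == player_two:
        return "tie"
    if BEATS[player_one] == player_two:
        return "player one"
    return "player two"


if __name__ == "__main__":
    print(judge("rock", "scissors"))
```

Every rule here is written by hand, which is the part a Machine Learning approach would instead try to infer from examples.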
Note: The code for this post can be found here
The last step in the Machine Learning Life Cycle is to put the model into production, also known as “operationalizing” the model. This usually means enabling the model to generate outputs from new data. In the context of a real-world application, deploying the Machine Learning model into production means integrating it into the existing environment so that other systems can call it to make inferences. …
Feature Engineering is an important step in the Data Science workflow. It is the process of extracting features from raw data using data mining techniques and domain knowledge. This can involve performing transformations or univariate, bivariate, and multivariate statistical analysis on existing data. These derived values can make data more intuitive for analysts and their algorithms. An experienced practitioner can quickly assess the problem and brainstorm new features to create based on existing data. These features, ranging in complexity, may require calculations that are easily done (and more readable) using Lambda functions. …
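As a rough illustration of deriving a feature with a Lambda function, the sketch below computes a body mass index from two existing fields. The records and field names (`height_m`, `weight_kg`, `bmi`) are invented for this example and are not from the post.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
records = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": 51.2},
]

# Derive a new feature from two existing ones with an inline lambda.
with_bmi = list(
    map(
        lambda row: {**row, "bmi": round(row["weight_kg"] / row["height_m"] ** 2, 1)},
        records,
    )
)

if __name__ == "__main__":
    for row in with_bmi:
        print(row)
```

The same lambda could likely be passed to a DataFrame-style `apply` call, but plain dictionaries keep the sketch free of dependencies.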
Note: The code for this post can be found here
API stands for Application Programming Interface. It is a software intermediary that allows systems to communicate with each other. Most businesses online have likely built APIs for customers and/or for internal use. For example, when a user enters a URL into their browser, e.g. www.medium.com, they are making a request to Medium’s server. Medium will then give back a response to be interpreted and displayed on the user’s browser. Modern client-to-server communications are mostly handled by APIs. The type and response will be dependent on a set of dedicated URLs…
Regular Expressions (Regex) is an essential tool for text analytics. It is powerful in searching and manipulating text strings. Compared to the traditional approach for processing strings with a combination of loops and conditionals, one line of regex can replace many lines of code. Some well known use cases for such text processing include:
In this article, we’ll go through the basics of how to use Regex. The focus is primarily on defining the correct patterns for the…
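The claim that one line of regex can replace many lines of loops and conditionals can be sketched as follows. The sample text and the helper name `numbers_with_loop` are assumptions made up for this comparison.

```python
import re

# Hypothetical sample text for the comparison.
text = "Order 66 shipped 3 items on day 12"


def numbers_with_loop(s: str) -> list[str]:
    """Collect runs of digits using a loop and conditionals."""
    found, current = [], ""
    for ch in s:
        if ch.isdigit():
            current += ch
        elif current:
            found.append(current)
            current = ""
    if current:  # flush a number that ends the string
        found.append(current)
    return found


# The same extraction as a single regex call: \d+ matches one or more digits.
numbers_with_regex = re.findall(r"\d+", text)

if __name__ == "__main__":
    print(numbers_with_loop(text))
    print(numbers_with_regex)
```

Both approaches return the same list for this ASCII input; the regex version simply states the pattern instead of the procedure.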
Making the Most out of Decision Trees
<You can find the code used for demonstration here>
Tree-based classification models are a type of supervised machine learning algorithm that uses a series of conditional statements to partition training data into subsets. Each successive split adds some complexity to the model, which can be used to make predictions. The resulting model can be visualized as a roadmap of logical tests that describes the data set. Decision trees are popular for small-to-medium-sized data sets because they are easy to implement and even easier to interpret. However, they are not without challenges. …
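To show what a "roadmap of logical tests" looks like once a tree is in place, here is a hand-written sketch of a two-split tree. The feature names, thresholds, and class labels are illustrative assumptions, not values learned from data or taken from the post.

```python
def classify(petal_length_cm: float, petal_width_cm: float) -> str:
    """Walk a tiny, hand-written decision tree.

    The thresholds below are made up for illustration; a real tree
    would learn them from training data.
    """
    # First split: partitions the data on petal length.
    if petal_length_cm < 2.5:
        return "setosa"
    # Second split: refines the remaining subset on petal width.
    if petal_width_cm < 1.8:
        return "versicolor"
    return "virginica"


if __name__ == "__main__":
    print(classify(1.4, 0.2))
```

Each additional `if` corresponds to one more split, which is how the model gains complexity as the tree grows deeper.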
<Code for this article can be found here>
Are GPUs faster than CPUs? It’s a very loaded question, but the short answer is no, not always. In fact, for most general-purpose computing, a CPU performs much better than a GPU. That’s because CPUs are designed with fewer processor cores that have higher clock speeds than the ones found on GPUs, allowing them to complete a series of tasks very quickly. GPUs, on the other hand, have a much greater number of cores and are designed for a different purpose. The GPU was originally designed to accelerate the performance of graphics…
With more people than ever relying on social media to stay updated on current events, hosting companies have an ethical responsibility to defend against false information. Disinformation, which is a type of misinformation that is intended to manipulate and mislead, can create unrest and panic. Other types of misinformation, such as rumors and hoaxes, also have the potential to bring mental and physical harm to unwary readers if left unchecked. The key to stopping the spread of misinformation is taking swift action against it, since it tends to travel very quickly. In fact, studies show…
What are the Most Frequently Discussed Topics in the News by Category?
<Complete code for this demonstration can be found here>
This is a continuation of the Fun with NLP series. Natural Language Processing is a fast-growing field in Machine Learning that makes it possible for computers to read, hear, speak, interpret, and generate human language. In the Fun with NLP series, I demonstrate simple tricks that I use to analyze text data. For demonstration purposes, I’m using data from the NYT Archive API to explore the question:
What are the most frequently discussed topics in the news…