Strategic technology projects are a double-edged sword for any business. On the one hand, technology represents an opportunity to improve productivity. On the other hand, new technology also creates new threats. Executives are therefore under constant pressure to keep up with the stream of innovation to stay relevant.
When it comes to choosing the technology you want to invest in, the key question that arises is: where do I place my “bets”? If you bet on the wrong horse, you might lose the race. If you bet on the right one, you will find yourself among the forerunners of the innovation wave.
To answer this question with data, I have conducted an analysis of paid-traffic information to identify the current state and the dynamics of technology trends. And the findings send a clear message for strategic technology projects: there are new threats, there are opportunities, there are lost bets, and there are winners.
The figure presents the main results of the analysis I have conducted. It has the following components:
Additionally, I have scaled all values so that they fit on a scale from 0 to 1, where 0 is assigned to the technology with the lowest value in a given dimension and 1 to the technology with the highest value in that dimension.
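In code, this min-max scaling is a one-liner per dimension. The sketch below illustrates the idea; the search-volume figures are made up for illustration and are not the values from my analysis:

```python
def min_max_scale(values):
    """Rescale raw metric values to [0, 1]: the minimum maps to 0,
    the maximum maps to 1, everything else lands in between."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all technologies tied on this metric
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical monthly search volumes for three technologies
volumes = {"RPA": 90_500, "Deepfakes": 60_500, "Blockchain": 201_000}
scaled = dict(zip(volumes, min_max_scale(list(volumes.values()))))
# Blockchain (the largest) maps to 1.0, Deepfakes (the smallest) to 0.0
```

Applying the same transformation to every dimension makes technologies comparable even when the raw metrics live on wildly different scales (prices vs. search volumes).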
Technologies are at the core of any digital transformation. 93% of companies consider innovative technologies necessary for reaching their digital transformation goals (SAP, 2019). Using the results, you will be able to understand where to place your bets and which technologies to keep in mind for your digital journey.
With a closer look at the figure, we can identify six key strategic insights based on the clusters identified. These insights can help you design your strategic technology projects.
RPA remains a highly dynamic field on both the demand side and the supplier side, and we can anticipate more breakthroughs in this key technology. It is mature enough to produce substantial benefits for companies while still developing constantly. Gartner has already detected this trend, coining the term “hyperautomation” for 2020.
Action: This is not just a safe bet; it is a must if you want to keep pace with technological advancement. Find out more here about how to kick off your digital transformation with RPA.
Deepfakes emerged from advances in neural networks and their application in generative adversarial networks (GANs) to produce realistic yet fake photos and videos. Deepfakes have already proven to be a threat to politics, with the Financial Times writing “How deepfakes are coming for politics” (2019), and it is just a matter of time until the technology is used against companies. The public interest is high, which explains the demand side.
Action: There is still a void on the supplier side, which you should expect to be filled with new offerings. Keep an eye on the supplier side until it is mature enough for strategic technology projects.
The decliners comprise various technologies that were expected to bring substantial business benefits and were heavily promoted, but the strategic technology projects built on them failed to meet expectations. One example is blockchain, which has grown to a substantial size. However, its limited scalability and arguably limited maturity have caused the demand side to slow down drastically.
Action: These technologies are not yet mature enough for mass implementation or have a very narrow focus. Redirect your focus to other, more promising technologies and keep an eye on the decliners for possible new breakthroughs or a new level of maturity in the area.
New offerings are constantly emerging for the challengers. These technologies have almost reached the maturity needed to make a significant impact on your organization, or will very likely reach it in the future. However, few businesses have realized this yet, which gives you a chance to become one of the first movers.
Action: Carefully consider the technologies in this cluster and select the most relevant ones. Place your bets on technologies in this area to reap the first-mover advantage.
The visionaries are experiencing strong demand growth, and there may be very good reasons for that. However, there is little development on the supplier side to meet this demand. Either these technologies have reached maturity and are unlikely to develop further, or they reflect a sudden shift in demand. In the latter case, you should expect businesses to become heavily involved with these technologies to meet the demand and develop new offerings.
Action: These technologies can become very relevant in the future if the supplier side starts reacting to the high demand. Evaluate their potential impact on your organization early on, and keep a very close eye on them.
The laggards tend to be very mature already, with little supplier-side development and moderate demand-side activity as the adoption rate of these technologies increases. They are unlikely to help you become a digital leader in your industry; however, they often serve as the basis for many other strategic technology projects.
Action: Analyze the technologies you currently use and see whether any of the laggards can strengthen your digital backbone. The laggards are safe and sound investments, as they have already been implemented in several businesses and are sufficiently mature.
To conduct the analysis, I used various SEO metrics underlying the technology trends: the average paid-traffic price, the trend on Google Trends, the SEO difficulty, the search volume, and the number of search results available on Google.
Additionally, I made sure to include the top 10 related keywords for each trend to improve validity via the law of large numbers; e.g., to evaluate “Virtual Reality”, we would also evaluate “Virtual Reality Examples”, “VR development”, “VR”, and so on.
The main underlying assumption is that activities in the real market are reflected in online search data. This holds for most of the data but can be inaccurate in some cases. Therefore, one should interpret the chart and its conclusions for strategic technology projects with care.
So, what are your thoughts on my analysis? I would be very excited to hear more from you now!
Accenture (2019). The Post-Digital Era is Upon Us: Are You Ready for What’s Next? Retrieved from https://images.idgesg.net/assets/2018/01/state_of_the_cio_01_ciod_winter_final.pdf.
CIO (2018). State of the CIO. Retrieved from https://images.idgesg.net/assets/2018/01/state_of_the_cio_01_ciod_winter_final.pdf.
Deloitte (2019). Executive Summary: Tech Trends 2019. Retrieved from https://www2.deloitte.com/us/en/insights/focus/tech-trends/2019/executive-summary.html#endnote-4.
Financial Times (2019). Can you believe your eyes? How deepfakes are coming for politics. Retrieved from https://www.ft.com/content/4bf4277c-f527-11e9-a79c-bc9acae3b654.
PwC (2019). Technology Trends 2019. Retrieved from https://www.pwc.com/gx/en/ceo-survey/2019/Theme-assets/reports/technology-trends-report-2019.pdf.
SAP (2019). SAP Study Says Up to 93 Percent of Companies Consider Intelligent Technology Key to Digital Transformation. Retrieved from https://news.sap.com/2019/03/forrester-survey-intelligent-technology-digital-transformation/.
Are you aware of how inaccurate marketing forecasting influences your business outcomes? Inaccurate forecasting can be extremely expensive for your company. Let me give you a few examples from real occurrences:
Whether you are trying to forecast the market size, future market share, next month’s sales figures, or some other financial outcome, accuracy matters. Forecasting problems can be solved more easily today than ever before with the right machine learning methods. In this article I will outline the current status of forecasting methods and how you can leverage machine learning for the following marketing forecasting problems:
A well-working machine learning pipeline for marketing forecasting offers advantages in many areas within the company. While there are clear cost-savings associated with implementing machine learning, it also has spillover effects on many other areas: shareholder confidence, staffing, manufacturing, logistics and more.
The message for management is clear. Improving only marketing forecasting, a single process at the core of the corporate web, will substantially improve many business outcomes, because many other corporate processes depend on it. There is a clear ROI with little investment.
On top of that, many cutting-edge technologies such as neural networks exist today that you can leverage to improve your marketing forecasting. In the following section, I will give you an overview of state-of-the-art marketing forecasting techniques. I can bet that your business will identify with the methods I outline there.
Armstrong and Brodie, two US researchers, have investigated the various forecasting methods used in the forecasting world. Their analysis shows that there are two main families of methods used in marketing forecasting:
However, there is one fact that businesses have neglected so far. Since computational possibilities, the amount of available data, and the quality of that data have improved substantially, a third powerful family of forecasting methods is emerging: machine learning methods.
Machine learning methods bring a third powerful toolbox for building marketing forecasting pipelines by maximizing the precision of forecasts. While machine learning models maximize precision, they sacrifice explainability, i.e. a person cannot explain how the model arrived at its predictions.
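As a minimal illustration of such a method (a sketch, not a production pipeline), here is a k-nearest-neighbour forecaster that learns repeating patterns directly from lagged sales windows instead of fitting an explicit statistical model. The sales figures are made up for the example:

```python
def knn_forecast(series, window=3, k=2):
    """Forecast the next value of `series` with k-nearest-neighbour
    regression over lagged windows: find the k past windows most
    similar to the most recent one and average what followed them."""
    # Training pairs: (window of past values) -> the value that followed
    pairs = [(series[i:i + window], series[i + window])
             for i in range(len(series) - window)]
    query = series[-window:]
    # Euclidean distance between the query window and each training window
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(pairs, key=lambda p: dist(p[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical monthly sales with a repeating three-month pattern
sales = [100, 120, 140, 100, 121, 139, 99, 120, 141, 101, 119]
print(round(knn_forecast(sales), 1))  # prints 140.0
```

The model "explains" nothing about why sales behave this way; it only exploits similarity to the past, which is exactly the precision-versus-explainability trade-off described above.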
Machine learning methods will not defeat qualitative and statistical methods, but will rather augment the other two families, leading to the trend of augmentation through machine learning. Here are the three strategic shifts that machine learning methods will introduce to marketing forecasting:
Armstrong, J. C., Brodie, R. J. (1999). Forecasting for Marketing.
Loeb, W. (2013). Why Are Walmart Stores Such a Mess? Retrieved from https://www.forbes.com/sites/walterloeb/2013/07/17/why-are-walmart-stores-such-a-mess/#3837007973da.
Koch, C. (2004). Nike Rebounds: How (and Why) Nike Recovered from Its Supply Chain Disaster. Retrieved from https://www.cio.com/article/2439601/nike-rebounds–how–and-why–nike-recovered-from-its-supply-chain-disaster.html.
In times of rapid digital evolution, today I will show you how you can prepare your business for the threats of tomorrow with an IoT cyber security use case. There is no doubt that IoT will be a critical success factor for many companies. Internet of Things (IoT) services and implementations are growing rapidly and steadily, reaching new all-time highs in sophisticated analytics, service deployments, and IoT applications:
The pace of global IoT growth is tremendous, but it leaves a critical vulnerability unaddressed: how safe and reliable are IoT applications? What is the IoT cyber security use case? How do we make sure that all IoT applications fulfill their functions as expected? Within an IoT system, if a small part performs differently than expected or has been hacked, the consequences for the whole system can be devastating.
A lack of security systems to monitor your IoT applications leaves your organization exposed to substantial financial and reputational risks: a small failure within the system can cause the whole system to react in unexpected ways. Alongside classical cyber security problems such as intrusion and denial-of-service (DoS) attacks, there are also simple internal failures that occur unexpectedly, such as software bugs and hardware failures.
In conclusion, a lack of security means that you cannot ensure the performance and reliability of your IoT applications, and that you also miss out on an opportunity to learn from the past by applying analytics to your IoT environment. This keeps your IoT initiatives at high risk and prevents them from unleashing their full potential.
Two researchers from Berlin, Witzig & Gulenko, have investigated how real-time monitoring can be applied to the new, emerging, and challenging field of Internet of Things applications, and found a simple but powerful monitoring method. In particular, they investigated how half-space tree algorithms can be used to implement such reliability checks and detect abnormal behavior in IoT contexts in real time, so that preventive measures can be taken. Their research reveals an interesting IoT cyber security use case.
In their study, they test their approach on a real-world IoT example and achieve impressive results with the unsupervised method: the detection rate of dangerous behaviors was as high as 99.4%, with a false-alarm rate below 3%.
In computer science and machine learning, half-space tree learning is an algorithm that builds a decision tree from the observed data to classify events in the IoT environment as “normal” or “abnormal”. There are three key challenges imposed by IoT applications that this algorithm overcomes:
The half-space tree learning algorithm overcomes these three key challenges in the area of IoT cyber security better than other machine learning algorithms such as random forests, neural networks, or k-means clustering. Therefore, it provides a simple and implementable IoT cyber security use case using modern analytics capabilities.
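To make the idea concrete, here is a heavily simplified, batch-trained sketch of the half-space tree principle (based on the general algorithm of Tan, Ting & Liu, 2011, not on the researchers' exact implementation): random axis-aligned splits partition the value space in advance, "normal" reference readings accumulate mass along their paths, and a new reading that lands in a low-mass region at shallow depth scores low and is flagged as abnormal. The sensor data is synthetic; production systems use streaming windows, e.g. via the `river` library.

```python
import random

class Node:
    """One node of a half-space tree over the unit hypercube."""
    def __init__(self, depth, max_depth, mins, maxs, rng):
        self.mass = 0                      # reference points seen at this node
        self.left = self.right = None
        if depth < max_depth:
            self.dim = rng.randrange(len(mins))
            self.split = (mins[self.dim] + maxs[self.dim]) / 2
            l_maxs, r_mins = maxs[:], mins[:]
            l_maxs[self.dim] = r_mins[self.dim] = self.split
            self.left = Node(depth + 1, max_depth, mins, l_maxs, rng)
            self.right = Node(depth + 1, max_depth, r_mins, maxs, rng)

def insert(node, x):
    """Record one normal reference reading along its path."""
    while node is not None:
        node.mass += 1
        if node.left is None:
            return
        node = node.left if x[node.dim] < node.split else node.right

def score(node, x):
    """Mass-weighted depth score; low values flag anomalies."""
    depth = 0
    while node.left is not None and node.mass > 1:
        node = node.left if x[node.dim] < node.split else node.right
        depth += 1
    return node.mass * 2 ** depth

rng = random.Random(42)
dims, n_trees, max_depth = 2, 25, 6
trees = [Node(0, max_depth, [0.0] * dims, [1.0] * dims, rng)
         for _ in range(n_trees)]

# "Normal" sensor readings cluster near the centre of the unit square
normal = [[0.4 + 0.2 * rng.random() for _ in range(dims)] for _ in range(200)]
for t in trees:
    for x in normal:
        insert(t, x)

ensemble = lambda x: sum(score(t, x) for t in trees)
# A reading far outside the normal region scores much lower
assert ensemble([0.95, 0.95]) < ensemble([0.5, 0.5])
```

Note how cheap scoring is: a single root-to-leaf walk per tree, which is what makes the method suitable for real-time, per-device monitoring.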
The approach presented by the two researchers Witzig & Gulenko represents a new way of tackling security and compliance monitoring problems within the business sphere, which other IoT cyber security use cases cannot provide. It differs from classical monitoring approaches in four ways, which contribute to its advantages:
1) Unsupervised learning approach instead of a supervised learning approach
Classical detection approaches follow a supervised learning approach, which first trains machine learning models on already labeled data. However, this has several disadvantages in IoT. First, companies usually do not have enough labeled data. Second, once a model is trained, it cannot quickly adapt to the changing interaction patterns of IoT devices in real time. The half-space tree algorithm is an unsupervised approach that overcomes these hurdles by giving the application flexibility.
2) Component-level monitoring instead of system-level monitoring
The half-space tree algorithm is implemented for each IoT device individually, meaning that the computational burden is spread across all IoT devices and the trees can be tailored to each device. Compared to system-level monitoring, where a central instance surveils the interaction of devices, the component-level approach makes detection faster, more accurate, and easier to scale. While a system-level center quickly reaches its computational limits, a component-level approach can scale up and down with your number of IoT devices without problems.
3) Accountability instead of black-box predictions
Half-space trees are simple algorithms that can be easily interpreted and understood. Even for non-technical business users, it is easy to understand how abnormal behaviors are detected. While classical machine learning approaches lose the “explainability” of their predictions, half-space trees can be easily understood and investigated. This increases user acceptance, enables you to learn from the trees, and always provides accountability.
4) Generalizability
While other machine learning algorithms are very context-specific, need labeled training data, and sometimes only work with quantitative variables, half-space trees work well with both quantitative and categorical variables and without pre-existing data. This makes them applicable to many more scenarios.
Forbes (2018). How IoT Is Impacting 7 Key Industries Today. Retrieved from https://www.forbes.com/sites/insights-inteliot/2018/08/24/how-iot-is-impacting-7-key-industries-today/#6c72130f1a84.
Fortune Business Insights (2019). Internet of Things (IoT) Market Size, Share and Industry Analysis By Platform (Device Management, Application Management, Network Management), By Software & Services (Software Solution, Services), By End-Use Industry (BFSI, Retail Governments, Healthcare, Others) And Regional Forecast, 2019 – 2026. Retrieved from https://www.fortunebusinessinsights.com/industry-reports/internet-of-things-iot-market-100307.
Gartner (2018). Early adopters of IoT are working through the challenges of implementation to deliver compelling business value. Retrieved from https://www.gartner.com/smarterwithgartner/lessons-from-iot-early-adopters/.
Gartner (2018). Gartner Identifies Top 10 Strategic IoT Technologies and Trends. Retrieved from https://www.gartner.com/en/newsroom/press-releases/2018-11-07-gartner-identifies-top-10-strategic-iot-technologies-and-trends.
Gulenko, A., Schmidt, F. (2019). Unsupervised Anomaly Alerting for IoT-Gateway Monitoring using Adaptive Thresholds and Half-Space Trees.
Netscout (2019). NETSCOUT Threat Intelligence Report: Dawn of the Terrorbit Era. Retrieved from https://www.netscout.com/sites/default/files/2019-02/SECR_001_EN-1901%20-%20NETSCOUT%20Threat%20Intelligence%20Report%202H%202018.pdf.
Ponemon Institute (2019). Third Party IoT Risk: Companies don’t know what they don’t know. Retrieved from https://sharedassessments.org/2019-iotstudy/.
SonicWall (2019). SonicWall 2019 Mid-Year Threat Report show worldwide malware decrease of 20%, rise in ransomware-as-a-service, IoT attacks and cryptojacking. Retrieved from https://www.sonicwall.com/news/sonicwall-2019-mid-year-threat-report/.
Slade, R. (2017). The Internet of Things (IoT): A New Era of Third-Party Risk. Retrieved from https://sharedassessments.org/the-internet-of-things/.
Have you ever worked in a position where you had to do highly manual, repetitive work on a computer? Let me give you some examples: entering more than 100 invoices into your system every day, performing dozens of bookings, or gathering information from the same pages by the same means. These are the daily work routines of many workers, probably in your work environment too. Such tasks are difficult to automate with conventional technologies: the required investment would not pay off, and the tasks may change regularly, so there often seems to be no way around keeping them manual.
However, this way your company fails to progress in the race of digital transformation and remains prone to avoidable shortcomings of missing automation:
Robotic process automation (RPA) enables you to automate tasks that cannot be automated through traditional means, without complex IT knowledge:
As you can see in the figure, some activities are simply too complex and would be very costly to automate through traditional means. These activities, for instance, require the entry of data across systems without a common interface. Although the name RPA might suggest that physical robots will be moving around your office, it actually refers to the automation of manual computer activities, like filling out a spreadsheet or entering information into the enterprise resource planning (ERP) system, by an RPA solution. After a successful RPA implementation, all the mouse moves and keyboard entries are performed automatically by the software. Here is an example of a task that was automated by RPA:
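In code terms, the kind of rule-based "swivel-chair" work such a bot performs can be sketched as follows. This is a plain Python illustration, not an actual RPA tool; the field names and the validation rule are hypothetical:

```python
import csv, io

def invoices_to_erp_entries(csv_text):
    """Mimic a software robot's rule-based work: read invoice lines
    from a CSV export and map each one to a (hypothetical) ERP entry.
    Rows failing validation are routed to a human for review, just as
    an RPA bot raises an exception for a clerk to handle."""
    entries, for_review = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            amount = float(row["amount"])
        except ValueError:
            for_review.append(row)        # malformed amount -> human review
            continue
        entries.append({
            "vendor_id": row["vendor"].strip().upper(),
            "amount_cents": round(amount * 100),
            "currency": row.get("currency", "EUR"),
        })
    return entries, for_review

sample = "vendor,amount,currency\nacme,100.50,EUR\nglobex,n/a,USD\n"
entries, review = invoices_to_erp_entries(sample)
# One clean entry is created; the malformed row goes to a human
```

A real RPA tool adds the part that code alone cannot easily do: driving the existing user interfaces (clicks, keystrokes, screen reads) of systems that have no such programmable interface.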
In this section, I will explain how RPA is a great way to start a digital transformation and walk you step by step through how you can use RPA within your organization. RPA is a great first step for several reasons:
The results that can be achieved through RPA are impressive. Telefónica O2 indicated an ROI between 650% and 800% three years after automating roughly 160 processes (Lacity et al, 2014). Additionally, Accenture reports that with the implementation of RPA, companies can reduce process costs by up to 80%, reduce processing time by 40%, and improve compliance substantially (Khalaf, 2017).
In the following section, I will explain how you can achieve these results and which pitfalls to avoid along the way.
The first step of your RPA transformation journey is to understand the context that will drive your activities. What brought you to the point of wanting to implement RPA? Here are a few examples of common starting points:
Understanding the context will help you in several ways: it will help you define the scope of the project and the target group, and formulate the right goal. Here are a few examples of goals:
In the next step, you will need to make a few important tactical and strategic decisions. Remember that for now you are only designing a plan for the first wave of your RPA implementation. In the first wave, you will automate at most perhaps five processes and use the lessons learned to adjust your strategy for subsequent implementations.
Within this phase, there are a few questions you will need to clarify:
After you have devised a general approach, it is time to get more concrete. Ideally, you will first automate the process that is predicted to yield the highest return on investment, so that the returns can fund further automation.
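Ranking candidates by expected ROI can be as simple as the sketch below. All cost and saving figures are hypothetical placeholders; plug in your own estimates:

```python
def roi(annual_saving, implementation_cost, years=3):
    """Simple ROI over a given horizon: net gain / cost, as a percentage."""
    return (annual_saving * years - implementation_cost) / implementation_cost * 100

# Hypothetical candidate processes: (name, annual saving in EUR, build cost in EUR)
candidates = [
    ("Invoice entry",    80_000, 30_000),
    ("Report assembly",  25_000, 20_000),
    ("Order bookings",  120_000, 60_000),
]

# Automate the highest-ROI process first so its returns fund the next one
ranked = sorted(candidates, key=lambda c: roi(c[1], c[2]), reverse=True)
```

With these made-up numbers, "Invoice entry" comes out on top at a three-year ROI of 700%, which incidentally sits in the 650-800% range Telefónica O2 reported; real estimates should of course also price in maintenance and exception handling.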
In this step, you should analyze your current existing processes to choose the ones that are ideal candidates for your first implementation wave. Ideal RPA processes will fulfill the following criteria:
The key is to focus on processes that already form an ideal basis for RPA and do not need refinement first. However, in your future RPA endeavors, it may be important to introduce Business Process Management as a foundation for ideal RPA.
Based on the strategy you choose, you should decide on the concrete implementation. That involves two questions:
These are two key strategic questions, which for now you will answer only for the first implementation wave. Here it can be very insightful and efficient to involve experienced consultants who are not specialized in just one tool but understand several. That way, the consultant will be able to understand your situation better and support you with all of the following activities:
This step involves implementing the first wave by automating the most mature and promising processes using RPA. The results of this implementation will provide the basis for estimating further implementations and for improving the whole process.
The goal is to take small, fast steps to reap quick wins. Several projects show that a process can be automated and deployed within two weeks or even less (Lacity et al, 2014).
The goal of the whole endeavor should not only be to automate a few processes quickly, but also to integrate RPA as a constant part of the organization. This is crucial for achieving long-term success with RPA and involves building internal expertise and capabilities.
That can mean that, at first, staff members attend regular training on the software and consultants are occasionally brought in for advanced problems and further training. It also involves building up process documentation and process design expertise, as well as guidelines for successful RPA experiences.
In the end, this will enable you to implement your RPA projects even faster, establish a strong support center to improve the stability of your RPA services, and drive innovation from within the company.
Using the experience from the first implementation wave, it is time now to continue with the second wave. The experience should make the implementation faster, more reliable and efficient.
For instance, Telefónica O2 reports that its employees needed 3 months to complete the first wave and build up enough expertise to tackle more challenging process automations. In the second wave, the company even managed to develop 75 additional robots that handled 35% of all back-office transactions (Lacity et al, 2014).
There is an emerging trend of gradually outsourcing the RPA automation process and robot maintenance to shared delivery centers, where labor costs are lower. This usually enables the company to scale up the implementation faster at lower costs and to expand its RPA automation activities internationally.
However, one must not forget that outsourcing is only an option if internal expertise and RPA capabilities already exist, because the business people will still be the ones doing the quality assurance and holding the process expertise. This means that even with a shared delivery center, the expertise gathered in the first two waves remains within the organization. Only then can this optional seventh step work.
The RPA market has experienced strong growth for good reasons. These are the advantages of RPA:
A study shows that 30 to 50% of RPA projects initially fail to deliver the desired outcome (Das, 2018), which is why I strongly recommend engaging a consultant at the beginning of your RPA journey. In the following, I will list some of the disadvantages of the technology and how you can minimize the downsides.
Das, G. (2018). Robotic process automation failure rate is 30-50%. Retrieved from https://www.businesstoday.in/sectors/bpo/robotic-process-automation-failure-rate-indian-bpo-business-process-outsourcing-exl-ceo-rohit-kapoor/story/267187.html.
Forrester Research (2014). Building a Center of Expertise to Support Robotic Automation.
Khalaf, A. (2017). The benefits (and limitations) of RPA implementation. Retrieved from https://financialservicesblog.accenture.com/the-benefits-and-limitations-of-rpa-implementation.
Lacity, M., Willcocks, L., Craig, A. (2015). Robotic Process Automation at Telefónica O2. Retrieved October 2, 2019.
Welcome to the Machine Learning Project Calculator!
This tool is based on more than 50 real machine learning projects and their outcomes. As you will see, this calculator currently focuses primarily on projects where you try to predict binary outcomes (e.g. whether a stock increases or declines). If you would like to see more features, please contact me via my contacts page or leave a comment below! All your feedback is highly appreciated and will be taken into account for the development of this tool.
Copyright Andrej Pivcevic 2019
What gives your company a sustainable competitive advantage in the era of increasing digitalization, where new technologies keep arising, where the walls that kept new entrants away are collapsing, where transaction and communication costs are decreasing, where computational power is available to almost everyone, and where ever more powerful algorithms are forged? The answer is straightforward: data is the main factor that will determine whether you can keep up with your competitors. The more data you have that your competitors cannot access, the stronger your competitive advantage.
You wonder why? It may be true that more powerful algorithms are emerging and that computational power is becoming accessible to everyone, but the important question is: who owns the data used to make decisions and improve the company? If everyone has access to the algorithms and the computational power, then the only thing your competitors won’t have access to is your data. And one way to stay ahead is to understand how you can harvest additional data with web scraping. This article will show you how to use web scraping and crawling to gather further data for your company.
Web scraping is the process of automating the data extraction from the World Wide Web in an efficient and fast way. This is at the heart of market research and business strategy, for instance when you want to compare the prices of your online-store to the prices of the competitors regularly.
In this article, we will go through the advantages of web scraping, its applications, and finally the possible forms of web scraping for your company. Depending on your company’s strategy, the goal of the scraping, and the complexity of the website to be scraped, different forms of web scraping may be preferable. And if you are an individual data scientist looking for a good introduction to the web scraping world, this article will also give you a first solid insight into how to proceed.
There is hardly an area where web scraping does not have a profound influence. As data increasingly becomes a main resource to compete on, acquiring it has become especially important.
In order to show you the advantages and disadvantages of each method, we will look at the categories mentioned below. For each category, we assign a score ranging from 1 (poor performance) to 5 (very good performance).
Almost every programming language has a library that lets you scrape web pages, or at least send GET requests over the internet. For Python it would be, for instance, Scrapy; for R it would be rvest. This is the simplest coding approach and can let you extract a large amount of data in a short time. However, it is also the least powerful coding-based approach: you will only be able to scrape static homepages. As soon as the structure of a homepage becomes more complex or interaction with it is required, the approach fails.
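In Python, the minimal static approach needs nothing beyond the standard library: fetch the HTML and pull the fields out with a small parser. In the sketch below, the HTML is an inline sample so the code is self-contained, and the tag structure and class names are hypothetical; in a real scraper the markup would come from `urllib.request.urlopen(url)` (or the requests library), and tools like BeautifulSoup or Scrapy make the extraction far more convenient:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text chunk belongs to a price span
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# Stand-in for the competitor page you would fetch over HTTP
sample_html = """
<ul>
  <li><span class="name">Widget A</span><span class="price">19.99</span></li>
  <li><span class="name">Widget B</span><span class="price">24.50</span></li>
</ul>
"""
parser = PriceParser()
parser.feed(sample_html)
# parser.prices now holds the competitor prices for comparison
```

This is exactly where the approach's limits show: the parser only sees what is in the delivered HTML, so any price rendered later by JavaScript is invisible to it.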
Automated browsing is also based on a programming language. The programmer writes down, in a language that supports Selenium (Python, R, Java, and more), the instructions for what should be done in a browser. In the backend, you automate all the steps that you would usually perform manually in your browser (for example: type in the URL and press enter, click the first link in the navigation, copy the values from a certain area, and paste them into a local Excel sheet). The script then executes all your instructions by opening a browser and simulating each step as if a human were behind it. This is a more complex approach than simple static web scraping, but at the same time a much more powerful one, because you can scrape AJAX-based homepages and interact with homepages to retrieve information that would not be accessible otherwise.
Many homepages and internet-based companies provide their own APIs to let you access their data. This makes the scraping process much easier and faster, as the data can be retrieved with little coding and is provided in a format that is ready for use. However, the disadvantage of official APIs is that they are usually not free, and their cost depends on usage.
Even if the homepage you want to scrape does not provide an official API, there is a chance that a “hidden API” exists, especially if the homepage works with AJAX calls. A proficient programmer can access the AJAX interface, send requests with little code, and extract all the necessary information in an easily interpretable format like JSON. While this approach can give you access to large amounts of data, it is generally less flexible and requires advanced knowledge of how homepages are built. If you want to know more about hidden APIs and how to use them, I suggest you consult the following two homepages:
There is a vast variety of web scraping tools that will suit your needs and help you implement your web scraper with little coding. They range from very powerful ones that regularly change the IP address and can overcome even captchas, to simple ones that can only scrape static homepages. There are tools that help you scrape data regularly on a continuous basis and tools for a one-time scraping. Many tools additionally offer customer support. The main disadvantage of this approach is that it can be very costly, depending on the capabilities of the tool. Some tools, like Octoparse, let you scrape data for free up to a certain limit. Here is a description of the abilities of Octoparse:
“Octoparse is a fantastic tool for people who want to extract data from websites without having to code. It includes a point and
In case you want to dive further into this approach, here is a homepage that compares 10 web scraping tools.
This is the approach to go for if you plan to outsource the scraping completely. All that is required from your side is to hire a web scraping service and explain exactly what information you need; the rest will be taken care of by the service. This approach is especially useful for one-time scraping. However, it can also be quite costly. A popular and regularly recommended web scraping service is DataHen. To get more information on the
When choosing the right approach, you should first consider whether you want to outsource the web scraping process or develop it internally. For your web scraping project, try to keep it as simple as possible: only use powerful tools if they are really necessary. If you settle for a more complex approach than required, you will overspend on maintenance and on features you do not need.
Web scraping offers several advantages including the following ones:
While web scraping can provide the company with tremendous benefits, there are also a few downsides and assumptions it rests on:
A switching regression model is used either to classify unobservable states or to estimate the transition probabilities between these unobservable states in a time series. It can be considered a clustering algorithm for time series, which gives you the estimated equation for each cluster and the probability that the time series falls into that cluster at a given point in time. A switching regression can be applied in any business area where you have a time series, and has already been successfully applied by economists to analyze business cycles, by mutual fund managers to assess mutual funds and by investment bankers to evaluate stock returns.
Let me explain what a switching regression can do on the basis of an example. A time series is a collection of data where you followed an individual over a longer period and recorded specific variables at several points in time. A simple time series is, for instance, the price of gold on the stock market. Here you can see the development of the gold price from 1995 until today.
When you look at the figure, you will realize that fitting a simple linear regression might not be a good idea, because the time series does not grow in a straight line. Ideally, you would hypothesize that the first part, until approximately 1970, fits a very flat regression line; the parts from 1970 until 1983 and from 2000 until 2015 fit a steeply increasing regression line; and the part from 1983 until 2000 fits a mildly decreasing regression line. A switching regression model helps you identify how many different unobservable phases there are, what their estimated equations are, how the influence of certain variables differs depending on the state, and what the probability is that the time series is in any of the phases at any point in time. Here is an example of the states a switching regression model would identify for the gold price time series:
A switching regression analysis can be applied in practically any field where you want to analyze different unobservable states in a time series. It has already been successfully applied in finance and economics to understand business cycles, asset allocation, stock returns, interest rates, portfolio management, and exchange rates. However, there are also other possible applications in various areas. Here are a few examples:
In general, you should consider using a switching regression model for the following five purposes:
You can use a switching regression model when the underlying process is a Markov process. This means that your time series is believed to transition over a finite set of unobservable states, where the time of transition from one state to another and the duration of a state are random. It is not difficult to use a switching regression, and you can do it in four simple steps. I will show you how to compute and interpret your own switching regression model based on the gold data from the introduction.
First of all, I need to load the data and make sure that all the variables have the right data type. In this case, when you load the data set, you will see that the variable Date is still a character. Therefore, I will convert it to a Date type using the function as.Date().
############# Library
# install.packages("MSwM")
# install.packages("ggplot2")
library(MSwM)
library(ggplot2)
############# Step 1: Set up Data
Gold <- read.csv("C:/Users/apivcevic/Desktop/Privat/Switching Regressions/monthly_csv.csv")
Gold$Date <- as.Date(paste(Gold$Date,"01",sep="-"), format="%Y-%m-%d")
ggplot(Gold, aes(Date, Price)) + geom_line()
In the second step, you will need to decide on the number of states that you expect. In the context of switching regressions and Markov processes, you usually say regimes instead of states; however, I will continue using the word states. Your decision on the number of states should be theory-driven. That means that you have a clear theory of how many states are possible and how many states you want to estimate. If you analyze a stock, you might expect only two states: the stock goes up or it goes down. Therefore, you would assume only two hidden states. Now let’s have a look at our example:
In our example, I expect three different hidden states. The first one is a stagnating state, the second one is a sharply increasing state that we can observe after 2000, and the third is a volatile stagnating state that we can mostly observe before 2000. Therefore, I assume that there should be three different states. Keep in mind that you do not want to specify too many states, for two reasons. First, the more states you have, the more complex the interpretation gets. Second, the estimation of a switching regression model is computationally complex, which means the more data and the more states you have, the longer it will take to compute.
############# Step 2: Decide on States
nstates <- 3
The switching regression will now estimate a different linear equation for each state that we specified. Furthermore, it will calculate the transition probabilities for each state according to the following overview, where p_{ab} stands for the transition probability from state a to state b:
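With k = 3 states, such an overview can be written as a matrix of transition probabilities, where each row sums to one:

```latex
P =
\begin{pmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{pmatrix},
\qquad
\sum_{b=1}^{3} p_{ab} = 1 \quad \text{for each state } a .
```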
Since I have an economic background, here is a small question for you: why was the price of gold so stable until 1970 (there is a pretty logical explanation ;))?
We will use the msmFit()-function from the MSwM-package to estimate the switching regression. The msmFit()-function needs as input a regression model produced by the lm()-function.
############# Step 3: Estimate Switching Model
olsGold <- lm(Price~Date, Gold)
msmGold <- msmFit(olsGold, k = nstates, sw = c(FALSE, TRUE, FALSE))
At this point, I should mention that there are various types of Markov-switching regression models, each with its advantages and disadvantages. You can basically apply all the statistical tools you know from time series analysis. Here are two examples:
If you followed these examples, you will realize that I applied the simplest switching regression model here: a univariate first-order switching regression with fixed transition probabilities. Furthermore, there are two general families of switching regression models:
We can interpret a switching regression model in two ways: first by looking at the coefficients, and second graphically.
############# Step 4: Interpret & Evaluate Switching Model
summary(msmGold)
The code will give us the following results:
Markov Switching Model
Call: msmFit(object = olsGold, k = nstates, sw = c(FALSE, TRUE, FALSE))
AIC BIC logLik
10010.85 10056.59 -5001.427
Coefficients:
Regime 1
---------
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.4889 0.0008 151861.125 < 2.2e-16 ***
Date(S) 0.0209 0.0013 16.077 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.68257
Multiple R-squared: 0.8681
Standardized Residuals:
Min Q1 Med Q3 Max
-9.909234e+01 -1.857703e+01 7.932031e-04 1.963472e+01 1.555179e+02
Regime 2
---------
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.4889 0.0008 151861.125 < 2.2e-16 ***
Date(S) 0.0772 0.0013 59.385 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.68257
Multiple R-squared: 0.8948
Standardized Residuals:
Min Q1 Med Q3 Max
-2.803412e+02 -2.134766e+00 -2.736030e-04 3.681998e-04 4.849814e+02
Regime 3
---------
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.4889 0.0008 151861.125 < 2.2e-16 ***
Date(S) 0.0502 0.0013 38.615 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.68257
Multiple R-squared: 0.92
Standardized Residuals:
Min Q1 Med Q3 Max
-208.87791580 -4.74304709 -0.07559093 1.23991461 197.72230992
Transition probabilities:
Regime 1 Regime 2 Regime 3
Regime 1 9.955638e-01 1.712038e-08 0.01485319
Regime 2 5.173007e-09 9.717252e-01 0.02020467
Regime 3 4.436244e-03 2.827476e-02 0.96494213
You will see that it gives us a different equation for each state.
You can see that none of the regimes has a negative estimate. Apparently, the price of gold has been increasing through all three states it went through. The effect size is highest in state 2, so this state probably represents extreme growth. State 3 has a more moderate effect size, so it makes sense to call it moderate growth. Finally, state 1 has the lowest effect size, so I would suggest naming it slow growth. In a further analysis, it might be interesting to include additional independent variables to see, for instance, how much they drove the growth in each phase.
Another thing we can look at are the transition probabilities, which are summarized at the very bottom of the output.
What you will see is that the states are pretty stable, which means that the underlying states rarely change from one month to the next. Furthermore, you will see that the transition probabilities for switching into the “moderate growth” state (regime 3) are generally higher than for any other state. Of course you can go into greater depth in your analysis, but I will leave that to you.
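One hedged way to quantify this stability: for a Markov chain, the expected duration of a state with self-transition probability p is 1 / (1 − p) periods. A small Python sketch using the diagonal entries from the output above (months as the period):

```python
# Diagonal (stay) probabilities taken from the transition matrix above
stay = {"Regime 1": 0.9955638, "Regime 2": 0.9717252, "Regime 3": 0.96494213}

# For a Markov chain, the expected duration of a state with
# self-transition probability p is 1 / (1 - p) periods (months here).
durations = {state: round(1 / (1 - p), 1) for state, p in stay.items()}
print(durations)
```

So once the series enters the slow-growth regime it is expected to stay there for well over a decade, while the other two regimes persist for roughly three years on average.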
I will use the following code to produce the relevant graphs. I will have one graph for each state. You will see that each graph consists of two figures. The upper one displays the gold time series and grey-highlighted areas. The grey-highlighted areas are where the switching regression model estimated that the time-series was in the respective state. The lower figure displays the probability that the time-series was in the respective state for any point in time.
# Graphical Overview of Probability and predictions
plotProb(msmGold, which=2)
plotProb(msmGold, which=3)
plotProb(msmGold, which=4)
When we look at the upper figure, we can see that this state most likely describes slow growth. The probabilities also seem to be very clear, with little chance for misinterpretation.
The second regime apparently is the high-growth one, or the one with the highest volatility, as the gold price increases rapidly, peaks, and then falls back to a slightly higher price than before it started to soar. Only the increase around 2000 has a lower probability, as this part does not seem to fit that well into this state.
Finally, the third state seems to be the one of moderate growth. Here too the probabilities are not that clear for the one cluster around 2000. Regardless of that, it looks relatively reasonable. I will not dig deeper into the interpretation here; I will leave that to you.
Switching-regression models have a few advantages compared to other regression models. Here is a short overview.
If you are still interested in the topic, I can recommend the following readings to dive deeper:
MSwM examples – Jose A. Sanchez-Espigares, Alberto Lopez-Moreno, Dept. of Statistics and Operations Research
Hamilton, J. D. 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57: 357–384.
Hamilton, J. D. 1993. Estimation, inference and forecasting of time series subject to changes in regime. In Handbook of Statistics 11: Econometrics, ed. G. S. Maddala, C. R. Rao, and H. D. Vinod, 231–260. San Diego, CA: Elsevier.
Kim, C.-J. 1994. Dynamic linear models with Markov-switching. Journal of Econometrics 60: 1–22.
Hamilton, J. D. 1994. Time Series Analysis. Princeton, NJ: Princeton University Press. (Chapter 22)
Qualitative Content Analysis is a method that helps you summarize the meaning of qualitative data using a coding frame. It helps you boil down a large amount of information to its most important core while picking up only distinct concepts (mutually exclusive) and covering all aspects present in the data (collectively exhaustive). Qualitative Content Analysis reduces and summarizes the data, which makes it different from other qualitative methods that aim at enriching or interpreting data.
The first main advantage of Qualitative Content Analysis is that it can be applied to a wide range of data: songs, speeches, social media posts, pictures, interviews, newspaper articles, journal entries and others. The second main advantage is that it can help you to translate qualitative data into quantitative data so that you can apply statistical analysis. It is particularly useful to solve specific problems where only the core information is needed especially in areas such as Marketing & Sales, Human Resources, Brand Management, Product Development, Quality Management and Qualitative Benchmarking.
Qualitative Content Analysis can be used in all areas where a large amount of information needs to be summarized to its core, which can range from analyzing technologies to analyzing how your own company is perceived in public. The main limitation of Qualitative Content Analysis is the availability of data, as you ideally need a wide range and a large amount of data to apply it consistently. Here are a few examples where it might be applied:
Conducting a Qualitative Content Analysis involves seven steps, roughly: making use of a coding frame, generating category definitions, segmenting the material into coding units, and distinguishing between a pilot phase and a main phase of analysis. It is very important that you have a clearly defined research question before you start. Furthermore, keep in mind that for qualitative methods it is not as straightforward to assess reliability and validity as for quantitative methods. Therefore, the quality of the result is assessed by other means, for instance by consistency and a systematic approach. These ensure that your results are valid, reliable and, especially, credible.
After you have formulated a clear research question, the first step is to gather data that ideally reflects the full diversity of your research topic. You will also need to clarify whether you want to use one form of data, for instance only newspaper articles, or several, for instance videos and books. My general recommendation is to focus on one form of data, as it will be more consistent and easier to segment later on. However, depending on your focus, you might also prioritize the diversity of your data sources over consistency. To give you one example: if I had to analyze product reviews on Amazon for the Apple iPhone X, I would simply select the material I can find on Amazon and not look at other sources, as the research question would be how the iPhone X is perceived in the Amazon reviews.
Building a valid and reliable coding frame is the most critical and trickiest part of the whole analysis. The coding frame is at the heart of any Qualitative Content Analysis and specifies all the different meanings that you want to capture and distinguish in your analysis. Therefore, it is especially important to be cautious here and to make the right decisions.
Coding frames consist of at least one main category and at least two subcategories. They can vary in complexity: they may consist of any number of main categories, contain several hierarchical levels and even subcategories within subcategories. However, since the main goal is usually to understand and communicate the results of the Qualitative Content Analysis, I recommend keeping it simple and avoiding subcategories of subcategories.
The main categories represent the more abstract aspects of your material that are of interest to you. The subcategories cover concretely what is actually mentioned within that specific main category. You can think of a main category as being like a variable in statistics, and of subcategories as the values that this variable can take on. For instance, hair color could be a main category when we want to summarize the physical aspects of people, and the corresponding subcategories might be brunette, blond or black.
There are three further requirements for a coding frame to work.
If you meet these requirements, you have laid the foundation for a good coding frame. A good coding frame is reliable and valid. Reliable means that others can understand and apply your coding frame and ideally recreate your results. Valid means that your coding frame captures all important aspects of your material, so that every relevant section of your material can be assigned to a main category or subcategory. Constructing a coding frame is not difficult, and it entails three essential steps:
Step 1 – Selecting the material: You select a part of your material (for instance around 50%), ideally the most diverse parts, to build the coding frame on. That way you ensure that your coding frame ideally covers all important aspects present in 100% of your material. My recommendation is not to try to build your whole coding frame at once; it might make sense to focus on one aspect at a time. That way you will make sure that you do not miss any important aspects and that you build a consistent coding frame.
Step 2 – Creating the categories: When you create the categories, you have three possibilities for how to start, depending on whether you work in a data-driven way, a concept-driven way or a mix of both:
In case you decide to work in a data-driven or mixed way, you again have several strategies for developing the categories from the material. The two most often practiced strategies that help you derive main categories as well as subcategories systematically are subsumption and progressive summarizing.
Step 3 – Defining the categories:
In Qualitative Content Analysis, it is very important that your categories are clear even to other people, and it should be obvious what you mean by a given category. This is crucial for the reliability of your coding frame: when it is not clear what you mean by a category, people will not be able to apply your coding frame and will tend to assign passages to different subcategories than you would. This is a big problem, not only because it will be difficult to present your results, but also because the results will simply be less credible if they are repeatedly misunderstood.
This is why you write a definition for each main category and subcategory. For main categories, the definitions can be short, but for subcategories they should be more extensive. A definition should always include the following elements:
Here you should make sure that subcategories within one main category are indeed mutually exclusive. Especially for this requirement, decision rules can be very useful.
Remember that the development of a coding frame is not a linear process. You may need to go back to earlier steps if you are not successful in summarizing the material. From my personal experience, it is of crucial importance that you know what your research question is. If you, for instance, analyze how digital transformation influences a company, you will find that the same research question can lead to very different coding frames depending on how you interpret the question. See the following two examples:
Here you answered the very same question, with probably the very same material, with two entirely different coding frames. Therefore, you should have a clear idea of what your goal is.
When you segment your material, you divide it into several chunks, also called units of coding. Segmentation is especially important for coding, because you will need to assign a subcategory from each main category to every unit of coding, and it is the basis for comparison later when two different people code. You will need to choose the units of coding in such a way that they can be interpreted meaningfully with respect to the subcategories.
<overview text passage, codes, coding sheet>
To segment several images, for instance, you can simply take each image as a unit of coding. If your material consists of newspapers, you can decide that each newspaper article is a segment. To segment the material properly, it might be necessary to define criteria specifying when a segment should start and when it should end. There are two types of such criteria.
Is there an advantage of formal criteria over thematic criteria, and vice versa? Definitely. Thematic criteria have the advantage that one unit of coding corresponds to one particular topic. Depending on the structure of the material, this can make your coding more valid: formal criteria might segment the material in such a way that one unit of coding covers several topics, which produces a conflict because that unit would fit two subcategories within a main category equally well. Thematic criteria avoid this information loss and make sure that your coding is more representative of the material. Furthermore, sometimes your research question simply favors thematic criteria. If you are interested in conflicts within a specific book, it simply does not make sense to structure that book according to formal criteria.
On the other side, formal criteria have the advantage that they are clear, understandable and fast. It is very easy to segment your material according to formal criteria, and it is hardly ambiguous. That means that even if other people were to segment your material, they would derive the same units of coding. As with thematic criteria, some research questions already favor formal criteria, such as how chapters within a book differ from each other, or what the most frequent aspects mentioned by customers in product reviews are.
When you segment the material, you usually assign a consecutive number to each unit of coding. If you use formal criteria, you can skip the extra step of segmenting your material, as it can be done in parallel with coding. If you use thematic criteria, you will need to segment the material before you code it. At the end, you should derive a coding sheet that you will use to code your material. The columns contain the main categories, the rows contain the units of coding, and in each cell you write down the subcategory for the respective main category and unit of coding. Your coding sheet should look like this:
|  | Main Category 1 | Main Category 2 | Main Category 3 | … |
| --- | --- | --- | --- | --- |
| Unit 1 |  |  |  |  |
| Unit 2 |  |  |  |  |
| … |  |  |  |  |
If you have a quantitative background, you will realize that your coding sheet resembles a dataset consisting only of categorical variables. This is the way Qualitative Content Analysis can help you translate qualitative information into quantitative information, so that you can run statistical analyses on the data. After you have completed all the following steps and filled out your coding sheet, you could for instance compute frequencies or correlation coefficients between different categories.
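As a small, hypothetical illustration of this translation step, the following Python sketch computes subcategory frequencies from a filled-out coding sheet (the categories and values here are made up):

```python
from collections import Counter

# A hypothetical filled-out coding sheet: rows are units of coding,
# keys are main categories, values are the assigned subcategories.
coding_sheet = [
    {"Complaints": "Long delivery", "Likes": "Great quality"},
    {"Complaints": "Long delivery", "Likes": "Great quality"},
    {"Complaints": "Too expensive", "Likes": "Great quality"},
]

# Frequency of subcategories within one main category
complaint_freq = Counter(row["Complaints"] for row in coding_sheet)
print(complaint_freq.most_common())  # [('Long delivery', 2), ('Too expensive', 1)]
```

The same structure feeds directly into cross-tabulations or correlation measures between categories.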
After you have developed your coding frame, you need to test it on a part of the material that is ideally different from the material you developed the coding frame on. The selected material should again cover all types of data and all aspects you anticipate finding in your data. In this step, you want to assess the quality of your coding frame before you apply it to all your data. Therefore, you want to check the reliability and validity of your coding frame. This you will do in the next step.
Assess the reliability of the coding frame
Reliability describes to what extent your coding frame is reproducible and generalizable. To evaluate reliability, you need to double-code the material using the same coding frame. That means you first find a second person to help you, and you both apply the same coding frame to the same data at the same time, independently of each other. If you need to work alone, you can also code the material twice yourself, with a two-week break between the two runs. If the definitions of the subcategories are clear enough and the subcategories are indeed mutually exclusive, then both coders should assign the units of coding to the same subcategories.
After you have completed the double-coding, you assess the reliability of your coding frame. This can be done in two ways:
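Two commonly used checks are simple percentage agreement and a chance-corrected coefficient such as Cohen's kappa (whether these match the two ways the author has in mind is an assumption). A minimal Python sketch with made-up codings from two coders:

```python
from collections import Counter

# Subcategories assigned to the same ten units by two coders (made-up data)
coder_a = ["Quality", "Delivery", "Quality", "Price", "Quality",
           "Delivery", "Price", "Quality", "Delivery", "Quality"]
coder_b = ["Quality", "Delivery", "Price", "Price", "Quality",
           "Delivery", "Price", "Quality", "Quality", "Quality"]

n = len(coder_a)
# Percentage agreement: share of units both coders assigned identically
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement for Cohen's kappa: sum over categories of the
# product of each coder's marginal proportions
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
kappa = (observed - expected) / (1 - expected)

print(round(observed, 2), round(kappa, 2))
```

Kappa corrects for the agreement you would expect by chance alone, which is why it is usually reported alongside raw agreement.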
To achieve reliability, it is very important to keep the complexity and scope of your coding frame as small as necessary. Coding frames that consist of more than 200 categories are more likely to lead to errors by both coders as well as to disagreement. One possibility for handling larger coding frames is to not code all main categories at once, but consecutively.
Assess the validity of the coding frame
Validity tells you the degree to which your categories cover the material and the relevant concepts present in it. For the data-driven parts of your coding frame, you can assess validity first by checking whether every unit of coding fits into one subcategory of every main category. Second, you check whether you needed to introduce residual categories; too many residual categories tell you that your coding frame is not valid enough. Third, you check whether you have used one subcategory much more often than the others and whether certain subcategories have not been used at all. If this is the case, it might be better to split the most frequent categories into smaller, more precise ones.
After you have adapted and revised your coding frame, it is time to code all your material. If your coding frame already proved sufficiently valid and reliable, you can code all your material alone without double-coding. If you needed to adapt your frame, it is important to double-code again to make sure that your revised coding frame is reliable and valid. However, it is not necessary to double-code everything; it might be enough to double-code around one fourth of your material. Generally, the more changes you had to make in Step 5, the more you should double-code.
Finally, there is one more trick. If you have shown that your coding frame is reliable, you can divide the units of coding among several people and split the work. This works because a reliable coding frame produces the same results regardless of who is coding.
After you have completed Step 6, you will ideally end up with a completed coding sheet. If your goal was to analyze customer reviews for your products, your coding sheet might look like this:
|  | Complaints | Suggestions | Likes |
| --- | --- | --- | --- |
| Customer 1 | Long Delivery | Open shop | Great quality |
| Customer 2 | Long Delivery | Open shop | Great quality |
| Customer 3 | Too expensive | Offer loyalty bonus |  |
At this point, depending on your goal and research question, there are several possibilities for how to deal with the results. Generally, there are three:
The Qualitative Content Analysis is a very flexible method that has these advantages:
Despite these advantages, Qualitative Content Analysis also has some clear disadvantages:
In case you want to read more theoretical articles on Qualitative Content Analysis, I recommend that you have a look at the books and articles listed in the references. Here are two further articles on how Qualitative Content Analysis is applied in marketing.
The first one is rather theoretical and aims at giving you an overview, while the second one is more practical and shows how Qualitative Content Analysis can complement a quantitative Analysis.
If you have further questions or criticism, if you need help, or if you have other ideas on how one could apply Qualitative Content Analysis, feel free to leave a comment or to drop me a line.
Flick, Uwe (ed.) The SAGE Handbook of Qualitative Data Analysis. London: Sage.
Mayring, Philipp (2010) Qualitative Inhaltsanalyse: Grundlagen und Techniken.
Berger, Arthur A. (2000) Content Analysis, in Arthur A. Berger (ed.) Media and Communications Research Methods. Thousand Oaks, CA: Sage. pp. 173-85.
Hsieh, Hsiu-Fang and Shannon, Sarah E. (2005) “Three approaches to qualitative content analysis”, Qualitative Health Research, 15: 1277-88.
Krippendorff, Klaus (2004) Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage (1st edition, 1980).
Schreier, Margrit (2012) Qualitative Content Analysis In Practice. London: Sage.
In this small case study, I will show you how you can understand your customers through their actual underlying utilities and preferences, using a concrete example of a conjoint analysis. The case is fictional. Conjoint analysis is a set of methods that enables you to derive the underlying utilities and preferences of consumers by looking at their decisions. In contrast to classical methods, you do not need to run after customers and ask them what they like; rather, you just observe their actual choices or judgements. Based on the customers’ choices, you then derive the most likely set of preferences, here called a utility function. If you want to know more about conjoint analysis, check out my in-depth article about conjoint analysis. If you want to know how you can build your own conjoint analysis, check out my detailed step-by-step guide for constructing your own conjoint analysis.
In today’s small case, I will help a laptop startup company named Ethos understand its primary target customers: students at a university. Ethos wants to sell its laptops mainly online through platforms and is excited to bring its vision into reality. However, they know that they have to make the right decisions, and they have three main questions they want answered.
In this section, I will briefly go through the seven steps from my step-by-step guide on how to construct your own conjoint analysis.
After talking to the product manager of Ethos, it is clear that we want to examine the following attributes, with the following expectations:
Attribute | Levels | Expected Influence | Expected Interactions |
Brand | 6: Apple, Lenovo, Acer, Asus, Ethos, Other | – | – |
Cores | 2: Dual Core, Quad Core | linearly increasing | RAM |
RAM | 3: 4 GB, 8 GB, 16 GB | linearly increasing | Cores |
Hard Drive | 3: 256 GB, 512 GB, 1024 GB | logarithmic | – |
Display Size | 3: 12 Inch, 14 Inch, 15.2 Inch | quadratic with optimum | – |
Display Quality | 2: Normal, HD | – | – |
Touch Screen | 2: Yes, No | – | – |
These are the variables thought to be the most important ones, because consumers base their decisions on them. Ideally, the variables result from a qualitative investigation such as focus groups and interviews. One interesting point is that we might expect an interaction between the variables Cores and RAM, since many cores with little RAM is thought to be much less attractive to a consumer than many cores with lots of RAM. If the concept of interactions is new to you, I recommend the two articles linked in the introduction, which provide the theoretical background.
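To make the idea of an interaction concrete, here is a minimal sketch of how a Cores × RAM interaction could be checked on rating data. The data frame `toy` and all its values are invented for illustration and are not part of the study:

```r
# Toy ratings (invented): each row is one rated laptop profile
toy <- data.frame(
  rating = c(4, 5, 5, 9, 3, 4, 6, 8),
  Cores  = factor(rep(c("Dual Core", "Quad Core"), 4)),
  RAM    = factor(rep(c("4 GB", "4 GB", "16 GB", "16 GB"), 2))
)
# "Cores * RAM" expands to both main effects plus their interaction term
m_int <- lm(rating ~ Cores * RAM, data = toy)
summary(m_int)$coefficients
# If the interaction coefficient is close to zero, a fractional
# factorial design without interaction runs remains defensible.
```

This is only a sketch of the check one would run on real pre-study data; in the case study, the interviews play this role.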
In our case the problem is relatively clear: we want to understand the potential customer. Therefore, a vector model or a mixed model cannot help us further. The ideal-point model, on the other hand, offers an interesting map for each person, but it is less useful for answering the second and third questions that Ethos posed. The ideal model in our case is a part-worth model. A part-worth model fits very well with a fractional factorial design, we can use it to answer all three questions, and we can even visualize the results with clear graphs. This makes it the ideal model for understanding the customer. For predicting the market share, the mixed model would be preferred over the part-worth model. However, since we mostly have categorical variables, and for the sake of simplicity, we will use the part-worth model to predict the market share as well.
If we use a part-worth model, it makes most sense to use the concept evaluation method. Since Ethos wants to sell its laptop online, the goal is to make the conjoint analysis as similar to this situation as possible. On a platform like Amazon, laptops are indeed usually presented as concepts, so concept evaluation seems to be the best fit. By making the setup similar, we increase the probability that we can later generalize the results to the real case, e.g. students buying laptops from Ethos on an online platform like Amazon one day. Another consideration is that when customers search for laptops on online platforms, they do not buy them immediately.
A second important aspect is that, according to the interviews conducted with potential customers prior to constructing the conjoint analysis, customers do not make immediate purchase decisions about laptops. They first go through the laptops they can find online and make an initial evaluation of them. Then, in most cases, they decide for the one they consider best given their preferences. This leads us to believe that it makes sense to ask our customers to rate each alternative rather than make immediate choices.
Since there are no interaction effects, we will use a fractional factorial design that we can generate simply with the package “DoE.base” in R. Using this package, it is possible to find a suitable design for our numbers of levels and variables. There are many other packages available, but “DoE.base” is the simplest and most straightforward, as the other packages require more in-depth knowledge. We use the following code to generate a fractional factorial design and insert our level descriptions:
####################### Preparation
#### Step 4: Experimental Design
# Creating a fractional design
install.packages("DoE.base")
library(DoE.base)
test.design <- oa.design(nlevels = c(6, 2, 3, 3, 3, 2, 2))
FracDesign <- as.data.frame(test.design)
names(FracDesign) <- c("Brand", "Cores", "RAM", "HardDrive", "DSize", "DQuality", "TouchScreen")
levels(FracDesign$Brand) <- c("Apple", "Lenovo", "Acer", "Asus", "Ethos", "Other")
levels(FracDesign$Cores) <- c("Dual Core", "Quad Core")
levels(FracDesign$RAM) <- c("4 GB", "8 GB", "16 GB")
levels(FracDesign$HardDrive) <- c("256 GB", "512 GB", "1024 GB")
levels(FracDesign$DSize) <- c("12 Inch", "14 Inch", "15.2 Inch")
levels(FracDesign$DQuality) <- c("Normal", "HD")
levels(FracDesign$TouchScreen) <- c("Yes", "No")
rm(test.design)
# Save design into an Excel file
install.packages("xlsx")
library(xlsx)
write.xlsx(FracDesign, "C:/Users/Economalytics/Desktop/ExperimentalDesign.xlsx")
Brand | Cores | RAM | HardDrive | DSize | DQuality | TouchScreen | |
1 | Acer | Dual Core | 16 GB | 1024 GB | 12 Inch | Normal | Yes |
2 | Lenovo | Quad Core | 16 GB | 256 GB | 15.2 Inch | Normal | Yes |
3 | Other | Dual Core | 8 GB | 512 GB | 12 Inch | Normal | No |
4 | Asus | Dual Core | 16 GB | 512 GB | 14 Inch | HD | No |
5 | Apple | Quad Core | 8 GB | 256 GB | 12 Inch | Normal | Yes |
6 | Lenovo | Dual Core | 8 GB | 256 GB | 12 Inch | HD | No |
7 | Ethos | Dual Core | 8 GB | 1024 GB | 14 Inch | Normal | No |
8 | Asus | Quad Core | 16 GB | 1024 GB | 12 Inch | HD | Yes |
9 | Other | Quad Core | 4 GB | 256 GB | 14 Inch | HD | Yes |
10 | Apple | Dual Core | 16 GB | 256 GB | 15.2 Inch | Normal | No |
11 | Asus | Dual Core | 4 GB | 256 GB | 15.2 Inch | Normal | No |
12 | Ethos | Dual Core | 16 GB | 256 GB | 12 Inch | HD | Yes |
13 | Ethos | Quad Core | 8 GB | 512 GB | 12 Inch | HD | No |
14 | Ethos | Quad Core | 16 GB | 512 GB | 15.2 Inch | HD | No |
15 | Lenovo | Dual Core | 16 GB | 1024 GB | 14 Inch | Normal | No |
16 | Apple | Quad Core | 4 GB | 1024 GB | 12 Inch | HD | No |
17 | Ethos | Dual Core | 4 GB | 256 GB | 14 Inch | Normal | Yes |
18 | Other | Dual Core | 16 GB | 512 GB | 15.2 Inch | HD | Yes |
19 | Other | Quad Core | 8 GB | 1024 GB | 14 Inch | Normal | Yes |
20 | Asus | Quad Core | 4 GB | 512 GB | 12 Inch | Normal | Yes |
21 | Lenovo | Quad Core | 8 GB | 512 GB | 15.2 Inch | HD | Yes |
22 | Acer | Quad Core | 16 GB | 512 GB | 14 Inch | Normal | Yes |
23 | Asus | Quad Core | 8 GB | 1024 GB | 15.2 Inch | Normal | No |
24 | Acer | Dual Core | 8 GB | 1024 GB | 15.2 Inch | HD | Yes |
25 | Other | Quad Core | 16 GB | 256 GB | 12 Inch | Normal | No |
26 | Lenovo | Quad Core | 4 GB | 512 GB | 14 Inch | Normal | No |
27 | Apple | Dual Core | 8 GB | 512 GB | 15.2 Inch | Normal | Yes |
28 | Asus | Dual Core | 8 GB | 256 GB | 14 Inch | HD | Yes |
29 | Apple | Quad Core | 16 GB | 1024 GB | 14 Inch | HD | No |
30 | Lenovo | Dual Core | 4 GB | 1024 GB | 12 Inch | HD | Yes |
31 | Other | Dual Core | 4 GB | 1024 GB | 15.2 Inch | HD | No |
32 | Apple | Dual Core | 4 GB | 512 GB | 14 Inch | HD | Yes |
33 | Acer | Quad Core | 4 GB | 256 GB | 15.2 Inch | HD | No |
34 | Ethos | Quad Core | 4 GB | 1024 GB | 15.2 Inch | Normal | Yes |
35 | Acer | Quad Core | 8 GB | 256 GB | 14 Inch | HD | No |
36 | Acer | Dual Core | 4 GB | 512 GB | 12 Inch | Normal | No |
The table above shows the fractional design we will use for our conjoint analysis, with one row corresponding to one run. The idea is that each person participating in our conjoint analysis goes through each run and rates the laptop. The more people participate, the more precise the information we have for estimating the market share and understanding our potential customers. Now, let’s have a look at how many runs would be necessary if we ran a full factorial design:
# Example for full factorial design
install.packages("AlgDesign")
library(AlgDesign)
numberlevel <- c(6, 2, 3, 3, 3, 2, 2)
fulldesign <- gen.factorial(numberlevel)
nrow(fulldesign) # Runs full factorial
nrow(FracDesign) # Runs fractional factorial
Here, the advantage of the fractional factorial design becomes evident. With a full factorial design, each person participating in the study would have had to go through 1296 runs, i.e. rate 1296 laptops! Using a fractional factorial design, we reduced this to only 36 runs, an impressive reduction of about 97%. However, this is only possible if there are no interaction effects between our variables. Initially we expected an interaction between the variables Cores and RAM, but the interviews suggested that there is no significant interaction. Therefore, the main prerequisite for a fractional factorial design is met. We will not discuss the disadvantages and further considerations of experimental design here, because we want to keep it simple.
Since it was clear from the very beginning that Ethos would go for an online sales strategy, it was important to design the presentation of an alternative as realistically as possible. While for a physical shop you might showcase prototypes of different products in their real environment and then ask for a rating, you will also want to make the online scenario realistic. Since Ethos considered selling its laptops on Amazon, because it would be difficult to attract customers from scratch, it was necessary to adapt the concept evaluation to the design of Amazon, including the advantages and disadvantages this might bring. Therefore, I constructed an experimental homepage that resembled Amazon for collecting data. Below you can find an example of how the 20th run from the table in Step 4 would look on the homepage:
Another consideration is that it might be useful to add a description of all attributes and why they might matter before the customer starts to rate the laptops. A laptop purchase by a student can be considered an investment: they will spend a considerable amount of time with it and inform themselves prior to the purchase. It cannot be compared to buying a drink in the supermarket or an ice cream. We need to make sure that customers can fully inform themselves before they make decisions. Therefore, we include a description of all attributes, their importance and relevance. For instance, we would explain that a lot of RAM might be important if you edit videos, edit high-resolution images or process large amounts of data. Furthermore, we would add a constraint that participants have to read through the description and that the whole experiment cannot be completed in under 30 minutes. This pushes the customer to think every option through and really engage with the alternatives, in order to achieve realistic and accurate ratings.
Since we want the customer to rate each alternative, we need a metric measurement, in particular a Likert scale. Likert scales are commonly treated as interval scales, which means we only learn by how much the overall utility increases when changing the level of an attribute. We are also restricted to an interval scale because we chose a part-worth model and a fractional factorial design: a continuous or ratio measurement would generally not be possible with a fractional factorial design or a part-worth model unless we made assumptions about linearity and interactions that are simply unrealistic. An advantage of the Likert scale is that it has proven reliable in studies. The rating of a run might look like this:
Finally, there is not much room left to choose from the pool of estimation methods. The best-fitting estimation method for our case is multiple linear regression, used to estimate the utility function for each individual: multiple linear regression can estimate each factor and is well suited to a fractional factorial design. In a nutshell, the method looks at the ratings of a customer and calculates the most likely utility function. It thereby tries to explain the choices and identify which attributes are most important for each individual.
Finally, we create a survey, gather participants from our target group and let them take part. They rate each run from the experimental design with a number from 1 to 9, where a higher number indicates that the laptop suits their preferences better: 9 indicates a perfect fit, 1 a very bad fit. Now that we have prepared the complete conjoint analysis, it is time to collect the data. For our case, we create a simulated dataset using the following code:
####################### Creating Utility Functions
#### Data Collection (Create Dataset)
# Create basis
set.seed(1234)
n <- 89 # number of participants
Data <- data.frame(Participant = 1:n)
Data$Participant <- as.factor(Data$Participant)
for (run in 1:36) {
  Data[, paste("Run", as.character(run), sep = "")] <- sample(c(1:9), n, replace = TRUE)
}
# Shaping the data
Data[, c(6, 11, 17, 28, 33)] <- Data[, c(6, 11, 17, 28, 33)] + 2 # Improve Apple
Data[, c(8, 13, 14, 15, 18, 35)] <- Data[, c(8, 13, 14, 15, 18, 35)] - 2 # Decrease Ethos
Data[, c(2, 4, 5, 7, 8, 11, 12, 13, 16, 18, 19, 25, 28, 29, 31, 32, 33, 37)] <- Data[, c(2, 4, 5, 7, 8, 11, 12, 13, 16, 18, 19, 25, 28, 29, 31, 32, 33, 37)] - 0.6
Data[, c(2, 3, 5, 9, 11, 13, 15, 16, 19, 23, 26, 30)] <- Data[, c(2, 3, 5, 9, 11, 13, 15, 16, 19, 23, 26, 30)] + 0.9
Data[, c(2, 3, 6, 9, 10, 13, 18, 19, 20, 21, 22, 23, 25, 28, 29, 31, 33, 35)] <- Data[, c(2, 3, 6, 9, 10, 13, 18, 19, 20, 21, 22, 23, 25, 28, 29, 31, 33, 35)] + 1
# Keep ratings on the 1-9 scale
Data[, -1] <- round(Data[, -1])
Data[, -1][Data[, -1] < 1] <- 1
Data[, -1][Data[, -1] > 9] <- 9
Now that we have collected the data, it is time to run the analyses. We will run the analysis in four steps and answer the questions that matter to Ethos. First, we estimate the part-worth models and visualize them for a few variables. The part-worth models help us understand the target customers and derive the “ideal” laptop. In a second step, we dig deeper into our customers’ minds and try to understand which variables really matter. For this purpose, we calculate the relative variable importances and compare them. In particular, we want to understand how the brand influences consumers and whether there are any disadvantages for Ethos. Finally, we quickly show how you can estimate your potential future preference share and run simulations. The questions we want to answer are the following:
First of all, we need to merge the results with the design, so that each row represents a laptop with its features, followed by the ratings it received from the 89 participants:
########################## Estimating the Part-Worth Models
# Merging FracDesign and Data
install.packages("data.table")
library(data.table)
Data$Participant <- NULL
Data <- transpose(Data)
rownames(Data) <- c(1:36)
Conjoint <- cbind(FracDesign, Data)
In the next step, we estimate the part-worth values for each person using a multiple linear regression model. At this point, the procedure might differ depending on the purpose, but since we want to estimate the preference share later, we need a model for each person.
# Compute linear regression for each person
install.packages("rlist")
library(rlist)
Regressions <- list()
for (person in 8:ncol(Conjoint)) { # columns 8+ hold the participants' ratings
  model <- lm(Conjoint[, person] ~ factor(Brand) +
                factor(Cores) +
                factor(RAM) +
                factor(HardDrive) +
                factor(DSize) +
                factor(DQuality) +
                factor(TouchScreen), data = Conjoint)
  Regressions <- list.append(Regressions, model)
}
The estimates of the linear regression are our part-worth utilities. Remember that for each categorical variable, one level is used as the reference level: no estimate is shown for it because its value is implicitly 0. This again shows that part-worth utilities are interval-scaled. We need to account for this when we construct a dataframe with all part-worth utilities per person. The following code does exactly that: it creates a dataframe where each row represents a level of a variable and each column represents a participant.
# Create dataframe with part-worth values
vars <- c("Intercept",
          rep("Brand", 6),
          rep("Cores", 2),
          rep("RAM", 3),
          rep("HardDrive", 3),
          rep("DSize", 3),
          rep("DQuality", 2),
          rep("TouchScreen", 2))
lvls <- c("Intercept",
          as.character(levels(Conjoint$Brand)),
          as.character(levels(Conjoint$Cores)),
          as.character(levels(Conjoint$RAM)),
          as.character(levels(Conjoint$HardDrive)),
          as.character(levels(Conjoint$DSize)),
          as.character(levels(Conjoint$DQuality)),
          as.character(levels(Conjoint$TouchScreen)))
Results <- data.frame(Variable = vars, Levels = lvls)
for (person in 1:n) {
  c <- as.vector(Regressions[[person]]$coefficients)
  # Insert a 0 for each reference level so every level appears in the table
  coef <- c(c[1], 0, c[2:6], 0, c[7], 0, c[8:9], 0, c[10:11], 0, c[12:13], 0, c[14], 0, c[15])
  Results[, paste("Person", person, sep = "")] <- round(coef, digits = 1)
}
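As a quick self-contained illustration of the reference-level behavior described above (the numbers here are invented and unrelated to the study):

```r
# One factor level becomes the reference: it gets no coefficient,
# and its part-worth utility is implicitly 0
toy <- data.frame(
  rating = c(3, 4, 5, 6, 7, 8),
  RAM = factor(c("4 GB", "4 GB", "8 GB", "8 GB", "16 GB", "16 GB"),
               levels = c("4 GB", "8 GB", "16 GB"))
)
m <- lm(rating ~ RAM, data = toy)
coef(m)
# -> (Intercept) 3.5, RAM8 GB 2.0, RAM16 GB 4.0
# "4 GB" is absorbed into the intercept; the other levels are measured against it
```

The reported utilities of “8 GB” and “16 GB” are therefore differences relative to “4 GB”, which is exactly why the dataframe above inserts a 0 for each reference level.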
Now that we have the table, we simply calculate the average for each level and plot the results per variable. Optionally, it might also be interesting to add the standard deviation as whiskers for each level. The standard deviation would tell us how homogeneous the target group is with respect to a level and might hint at whether it would even be useful to offer more than one laptop.
# Get averages and visualize them for each variable
Results[, "Average"] <- round(rowMeans(Results[, -c(1, 2)]), digits = 1)
install.packages("ggplot2")
library(ggplot2)
# Helper: plot the average part-worth utility per level of one variable
plotVariable <- function(results, variable) {
  subs <- droplevels(subset(results, Variable == variable))
  subs$Levels <- reorder(subs$Levels, subs$Average)
  if (min(subs$Average) < 0) {
    subs$Average <- subs$Average + abs(min(subs$Average)) # shift so the lowest level sits at 0
  }
  ggplot(data = subs, aes(x = Levels, y = Average, group = 1)) +
    geom_line() +
    geom_point() +
    ggtitle(variable)
}
gg1 <- plotVariable(Results, "Brand")
gg2 <- plotVariable(Results, "Cores")
gg3 <- plotVariable(Results, "DQuality")
gg4 <- plotVariable(Results, "DSize")
gg5 <- plotVariable(Results, "HardDrive")
gg6 <- plotVariable(Results, "RAM")
gg7 <- plotVariable(Results, "TouchScreen")
install.packages("gridExtra")
library(gridExtra)
grid.arrange(gg1, gg2, gg3, gg4, gg5, gg6, gg7, nrow = 4, ncol = 2)
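As mentioned above, standard deviations can be added as whiskers to judge how homogeneous the target group is. A self-contained sketch for one variable; the averages and SDs below are invented placeholders, not the study’s results:

```r
library(ggplot2)
# Invented averages and standard deviations for the Brand levels
toyBrand <- data.frame(
  Levels  = c("Apple", "Lenovo", "Acer", "Asus", "Ethos", "Other"),
  Average = c(1.8, 1.2, 1.0, 0.9, 0.4, 0.2),
  SD      = c(0.5, 0.4, 0.6, 0.3, 0.7, 0.4)
)
ggplot(toyBrand, aes(x = reorder(Levels, Average), y = Average, group = 1)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = Average - SD, ymax = Average + SD), width = 0.2) +
  ggtitle("Brand (with SD whiskers)")
```

A wide whisker on a level would suggest that the target group disagrees about it, which can be a first hint that more than one laptop variant might pay off.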
These figures are the core of every conjoint analysis and give us valuable information about how changing a feature of Ethos’ laptop would change its utility. For instance, increasing the hard drive from 256 GB to 512 GB interestingly decreases the utility substantially, which might be a sign that the target group has a low budget and prefers other features. We might also have included price as a feature, for instance to assess the price sensitivity of our target group. An even more interesting approach would be to use price as the outcome instead of utility, e.g. measuring utility by how much our future customers would be willing to pay for a laptop.
Using these figures we can already answer the first two questions that Ethos had:
So, in building the required laptop, where should Ethos start? What should be the first priority? A simple approach to this question is to look at the relative variable importance, which tells us how important a variable is compared to the others when a consumer makes a purchase decision. The relative importances can be calculated in two steps. First, for each variable, calculate the biggest possible difference by subtracting the utility of the lowest-utility level from the utility of the highest-utility level. Second, the relative importance of a variable A is the ratio of A’s biggest possible difference to the sum of the biggest possible differences across all variables. Luckily, the relaimpo package in R computes a comparable relative importance measure (the lmg metric) for us.
# Compute relative importance
install.packages("relaimpo")
library(relaimpo)
# Variable order must match the order of the predictors in the regression formula
Importances <- data.frame(Variable = c("Brand", "Cores", "RAM", "HardDrive", "DSize", "DQuality", "TouchScreen"))
for (model in 1:n) {
  relImp <- calc.relimp(Regressions[[model]], type = c("lmg"), rela = TRUE)
  relImp <- as.vector(relImp@lmg)
  Importances[, paste("Person", model, sep = "")] <- round(relImp, digits = 3)
}
Importances$Average <- rowMeans(Importances[, -1])
ggplot(Importances, aes(x = reorder(Variable, Average), y = Average)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = function(x) paste(x * 100, "%"))
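For intuition, the classical two-step range calculation described above can also be sketched by hand. The part-worth ranges below are invented for illustration and are not the study’s estimates:

```r
# Step 1 (assumed done): per variable, range = highest-level utility - lowest-level utility
ranges <- c(Brand = 2.5, Cores = 0.5, RAM = 1.0, HardDrive = 0.8,
            DSize = 0.6, DQuality = 0.4, TouchScreen = 0.2)
# Step 2: each range divided by the sum of all ranges
relImportance <- ranges / sum(ranges)
round(relImportance, 3) # shares sum to 1; Brand gets 2.5 / 6 ~ 0.417
```

Note that this range-based measure is the textbook calculation; the lmg metric used above is a regression-based variance decomposition, so the numbers can differ, but both answer the same question of which variable matters most.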
For predicting the market share, we assume that the board of Ethos decided to produce the “ideal” laptop defined in the first step. Now the board wants to know what the potential market share would be, in the best case, if Ethos went to market with this laptop. Ideally, we would have a dataframe with all the available laptops from all brands that our laptop would compete against. For the sake of simplicity, however, I will create an example market of 49 different laptops that Ethos’ laptop competes against. The following code creates the laptop list:
######### Predict potential market share
# Simulate laptops from competing brands
# Note: the level spellings must exactly match the levels used in the regressions
vnames <- c("Brand", "Cores", "RAM", "HardDrive", "DSize", "DQuality", "TouchScreen")
brand <- sample(c("Apple", "Lenovo", "Acer", "Asus", "Other"), 49, replace = TRUE)
cores <- sample(c("Dual Core", "Quad Core"), 49, replace = TRUE)
ram <- sample(c("4 GB", "8 GB", "16 GB"), 49, replace = TRUE)
harddrive <- sample(c("256 GB", "512 GB", "1024 GB"), 49, replace = TRUE)
dsize <- sample(c("12 Inch", "14 Inch", "15.2 Inch"), 49, replace = TRUE)
dquality <- sample(c("Normal", "HD"), 49, replace = TRUE)
touchscreen <- sample(c("Yes", "No"), 49, replace = TRUE)
Market <- data.frame(a = brand, b = cores, c = ram, d = harddrive, e = dsize, f = dquality, g = touchscreen)
names(Market) <- vnames
# Ethos' "ideal" laptop from the part-worth analysis is then appended as the 50th row
Now I compute the fitted (predicted) utility values for each participant for each laptop, using the regression models derived earlier.
# Calculate utility scores for each laptop for each user
for (participant in 1:n) {
  Market[, paste("P", participant, sep = "")] <- predict(Regressions[[participant]], newdata = Market[, 1:7])
}
Finally, I look at which laptop “wins” for each participant and count the brands in order to derive the “market share”. A “win” can be interpreted as the laptop purchase decision this individual would make under neutral and optimal conditions. And here comes one main limitation of the procedure: what I calculate here is not the real market share. It is rather a “preference share”, because you will never find neutral and optimal conditions in the market. Neutral and optimal conditions would mean, for instance, that no brand has a distribution channel advantage and that every consumer had the chance to evaluate all laptops in the basket, which is rather unlikely. Of course, it is possible to enhance the method by correcting the result for these real-market constraints, but the preference share tells you where you would stand if you had no disadvantages (or advantages, depending on the perspective) compared to your competitors.
# Determine the potential market share
purchased <- unlist(apply(Market[, 8:ncol(Market)], 2, function(x) which(x == max(x))))
purchased <- Market$Brand[purchased]
brandcount <- as.data.frame(table(purchased))
brandcount$Freq <- brandcount$Freq / sum(brandcount$Freq)
ggplot(brandcount, aes(x = purchased, y = Freq)) + geom_bar(stat = "identity")
We can now see that in our simulated market the field would be dominated by Acer, Asus and Lenovo, and Ethos would be far behind with a market share of approximately 1%. Is that surprising? No, for two reasons. First, as we found out, Ethos faces significant brand disadvantages. Second, Ethos sells only one laptop, compared to competitors who each sell about 10 laptops on average in our simulated market.
In the end, we managed to answer all three questions that our consulting client Ethos had, and we demonstrated how powerful and informative a conjoint analysis can be. Of course, there are disadvantages that we have not touched upon, such as the difficulty of gathering accurate data. When you conduct a conjoint analysis, you should also build in ways to ensure validity and reliability.
However, the main advantage of a conjoint analysis is its flexibility: you can adapt it to your needs. First, you can use different preference models if you want more realistic results. Second, after you have derived the preferences, you can run further analyses on them. You could conduct a principal component analysis or cluster analysis to find out which customers are similar. You could also calculate how many different laptops you should launch to optimize your market share, or you might even combine conjoint analysis with machine learning methods. Third, instead of survey data, you might also use actual purchase data.
So what is the story now? Ethos could gain a 1% market share if it manages to produce and sell the laptop at a competitive price and if market conditions were ideal. Ethos now knows how the customer thinks and what laptop would fit their needs. However, will that be enough to beat the competition? I would say no: Ethos will need to develop a unique distribution channel if it wants to beat the competition. The reason is simple. You can produce the ideal laptop, but if your customer never finds out about it, they will never buy it. Therefore, you need to be one step ahead of your competition. What do you think?
The post Conjoint Analysis – Understand Your Customer And Beat The Competition appeared first on Economalytics.
What do Netflix, Spotify, Todoist, Evernote and LinkedIn have in common? They all run under a freemium business model. The freemium model became especially popular in the digital startup, newspaper and service scenes with the emergence of Software-as-a-Service (SaaS). The basic idea of the freemium model is that a provider offers a less functional version of its product for free to encourage premium subscriptions: potential buyers can explore and test the free version and avoid a risky big jump. SaaS, in turn, was enabled by cloud computing, which opened up the possibility of migrating the whole software to the cloud. The customer no longer needs to install and update the software; they simply subscribe online and use it as a service.
In this article, I will briefly show you how to analyze the freemium model using survival and hazard models, which assumptions a freemium model rests on, and how you can use the information from these models to derive actions for improving the business model, using the example of a fictional software company called “SeventhCloud”.
SeventhCloud initially developed software that extracts procurement data from its customers’ ERP systems and produces a dashboard showing saving potentials as well as possible compliance risks. The customers are mainly based in northern and western Europe, and the company has faced problems attracting new customers with the software. A review revealed that new, fast-growing startups were offering SaaS products rather than installed software. The board decided to follow the trend, kick off its own digital transformation and develop a SaaS solution. While developing the new product, the question emerged whether a freemium model would be more profitable than a plain subscription model and how the subscription model could be improved.
The Problem: The hidden costs of the freemium model
SeventhCloud has launched a prototype of the SaaS solution as a free version; interested customers could pre-register for the premium version scheduled for release in three months. Due to time constraints, the company has followed its current subscribers for only 30 days. The information gathered can be expressed by the following dataset, simulated in R:
################################ Create Dataset
# Duration & Censored
set.seed(1234)
duration <- round(abs(rnorm(n = 89, mean = 18, sd = 12)) * 24) # in hours
duration[sample(1:89, 10)] <- 0 # some decided immediately
censored <- ifelse(duration > 30 * 24, "yes", "no")
duration[duration > 30 * 24] <- 30 * 24 # Censoring at day 30
# ID
id <- 1:89
# Subscription
subscription <- ifelse((sample(c(TRUE, FALSE, FALSE), 89, replace = TRUE) & (censored == "no")), "yes", "no")
# Time spent on application in days (out of subscription time)
appdays <- c()
for (k in 1:89) {
  if (subscription[k] == "yes") {
    appdays <- c(appdays, sample(1:(duration[k] / 24), 1))
  } else {
    appdays <- c(appdays, round(sample(1:(duration[k] / 24), 1) / 4))
  }
}
# Industry
industries <- c("Manufacturing", "IT", "Telecommunication", "Consulting", "Food", "Automotive", "Health", "Finance")
prob_industries <- c(0.3, 0.3, 0.05, 0.01, 0.04, 0.2, 0.02, 0.08)
industry <- sample(industries, 89, replace = TRUE, prob = prob_industries)
# Size
size <- sample(c("1-50", "51-1000", "1001+"), 89, replace = TRUE, prob = c(0.6, 0.35, 0.05))
# Previous customer
prevcustomer <- sample(c("yes", "no"), 89, replace = TRUE, prob = c(0.1, 0.9))
# Creating the dataset
Subscription <- data.frame(id = id, duration = duration, censored = censored,
                           subscription = subscription, appdays = appdays,
                           industry = industry, prevcustomer = prevcustomer, size = size)
Subscription$id <- as.factor(Subscription$id)
rm(id, duration, censored, subscription, appdays, industry, prevcustomer, size,
   industries, prob_industries, k)
As already indicated, while exploring the freemium model, SeventhCloud had two important questions that needed answering for it to develop its strategic plan.
So the question arises: why do we need a more complex survival model to answer these questions? Wouldn’t simple proportions, averages or a linear regression suffice? The answer is no, due to the nature of the data. First, we have collected spell data, which describes durations. Spells cannot be negative, which is why a simple linear regression is inappropriate. Second, we have right-censored data. Censoring describes the problem that we know some values lie in a certain range without knowing their exact value. In our case, we could only follow subscribers over a 30-day period, but they might still subscribe at some point beyond the 30 days. Hence, where censored is “yes”, we only know that the premium subscription had not happened by day 30; it might happen at any point afterwards. Using plain averages or a linear regression will generally lead to biased results, because censored observations are treated as if the event occurred at the censoring time (see figure below).
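The downward bias from treating censored times as event times can be illustrated with a few lines of simulated base R. The exponential waiting times and the 30-day cutoff are illustrative assumptions, not SeventhCloud’s data:

```r
set.seed(42)
true_time <- rexp(1000, rate = 1 / 20) # true times until subscription, mean 20 days
observed  <- pmin(true_time, 30)       # follow-up stops at day 30 (right-censoring)
cens      <- true_time > 30            # TRUE where only "later than day 30" is known
mean(true_time) # close to the true mean of 20 days
mean(observed)  # naive average treats censored cases as day-30 events: biased downward
```

The naive average of the observed times systematically underestimates the true mean, which is exactly the bias survival models are designed to avoid.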
Censoring is different from truncation. In truncation, an observation that would lie within a specific interval is left out entirely and never observed. This is not applicable in our case. Specific limited dependent variable models such as the Tobit model or the Heckman model have been developed to estimate models under censoring or truncation, and we can differentiate between these general types of censoring and truncation:
Luckily, survival and hazard models are also able to deal with censored data. In order to address the two questions, we will apply survival and hazard models.
The Method & Premises: Does the customer survive, unsubscribe or subscribe?
The main purpose of survival and hazard models is to investigate the time until a certain event occurs. Therefore, the starting event (free subscription) and the end event (premium subscription or unsubscription) need to be clearly defined. The models allow us to assess not only the risk of an event occurring at a specific time, or the probability of surviving until a certain point, but also which factors influence these probabilities and whether groups differ from each other.
We apply these models to spell data, i.e. data about durations. However, there are two types of spell data:
The models we introduce here work only for single-spell data. If we have multiple-spell data, we can only use the single-spell models if (1) the independent variables have constant effects regardless of the period or episode, (2) the duration distribution of an individual depends only on the time since entry into the present state, and (3) successive episodes of an individual are independent. These assumptions, however, rarely hold true. If we study, for instance, the probability of an individual having a heart attack, then the older a person gets, the stronger the effects of other health-related variables become (the first assumption is violated) and the more likely a heart attack is (the second assumption is violated). And people who have already experienced a heart attack are at higher risk of suffering another one (the third assumption is violated). In our case, we have single-spell data.
First, we will compute a survival model for the probability that a user stays a free user and does not unsubscribe. This survival model will help us assess the cost of all subscribers. Then we will calculate a second hazard model to assess the probability that a free user subscribes to the premium offer. This will help us estimate the revenue from premium users. In order to make these estimations, we have received the following information from the SeventhCloud managers:
In the next section, we will answer SeventhCloud’s two questions using the information given and implement survival and hazard models in R. More concretely, I will implement a non-parametric Kaplan-Meier model and a semi-parametric Cox proportional hazards regression model. Neither model assumes an underlying distribution, which makes them suitable for any kind of data and more powerful when the underlying relationship really does not follow a certain shape, but less powerful when it does.
Solution: Survival and hazard models
We have defined the moment that people register, either for a free subscription or directly for a premium subscription, as the starting event for the spell data. Since SeventhCloud was only able to track people for 30 days, there will be no subscription duration longer than 30 days. If the ending event has not occurred by the end of the 30th day, the observation is right-censored. However, there is a small trick in the definition of the ending event, which we will exploit for the cost, revenue, and profit estimation. In our case, we have two ending events. The first ending event is a premium subscription; in that case, we recorded the hours from free subscription until the premium subscription. The second ending event is unsubscription; in that case, we recorded the hours from free subscription until the unsubscription. This is unusual for survival analysis, which normally requires a single clearly defined ending event. Before we go into the estimation of the models, however, we should first have a look at the data.
# install.packages("ggplot2")
library(ggplot2)
summary(Subscription)

# duration
ggplot(Subscription, aes(x = duration/24)) +
  geom_histogram() +
  ggtitle("Free-subscription Duration") +
  xlab("Duration in Days") +
  ylab("Number of Users")
We can clearly see the right peak at 30 days, which confirms that our data is right-censored. Luckily, we have a variable indicating which observations have been censored. Let’s have a closer look at it.
# censored
ggplot(Subscription, aes(x = "", y = censored, fill = censored)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +  # "Dark3" is not a valid Brewer palette
  ggtitle("Censored Data")
Around a fifth of the observations have been censored. That is quite a lot.
# subscription
ggplot(Subscription, aes(x = "", y = subscription, fill = subscription)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +  # "Dark3" is not a valid Brewer palette
  ggtitle("Subscription")
Out of all the people who signed up, around 45% chose a premium subscription within 30 days. That looks like a surprisingly good deal. However, 45% is a biased conversion rate, as illustrated above, because we have right-censored data. The true conversion rate will differ from 45%.
# appdays
ggplot(Subscription, aes(x = appdays)) +
  geom_histogram() +
  ggtitle("Interaction Days with App") +
  xlab("Duration in Days") +
  ylab("Number of Users")
Apparently, there is wide variation in how many days SeventhCloud’s application was used.
# industry
ggplot(Subscription, aes(x = "", y = industry, fill = industry)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +
  ggtitle("Industries")
Interestingly, the SaaS product seems to appeal especially to the IT and manufacturing industries. That is something worth noting and investigating further.
# prevcustomer
ggplot(Subscription, aes(x = "", y = prevcustomer, fill = prevcustomer)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +  # "Dark3" is not a valid Brewer palette
  ggtitle("Previous customer?")
This plot also reveals something unexpected. Apparently, only a small fraction of the subscribers were previous customers. The SaaS product being developed is able to attract a new customer segment that differs from SeventhCloud’s current customers.
# size
ggplot(Subscription, aes(x = "", y = size, fill = size)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +
  ggtitle("Company Size of Customer")
The company size of the customers seems to matter less, even though bigger companies seem more likely to subscribe. This makes sense in the end, because the solution appeals mainly to companies that need, and have already implemented, a good ERP system.
We will also need to prepare some packages and functions that we will need later.
################################
# Survival Analysis

# Adding variable unsub (did the user unsubscribe?)
Subscription$unsub <- ifelse(Subscription$subscription == "no", "yes", "no")

install.packages("survival")
library(survival)    # for survival & hazard models
install.packages("survminer")
library(survminer)   # for ggsurvplot

# function that interpolates mathematical step functions
step_approx <- function(x, y) {
  new_x <- c()
  new_y <- c()
  for (i_x in x[1]:x[length(x)]) {
    if (i_x %in% c(x)) {
      pos <- which(x == i_x)
      i_y <- y[pos]
    }
    new_x <- c(new_x, i_x)
    new_y <- c(new_y, i_y)
  }
  df <- data.frame(x = new_x, y = new_y)
  return(df)
}
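To see what step_approx does, here is a quick check on a tiny hypothetical step function. The helper is restated verbatim so the snippet runs on its own: it carries the last observed y-value forward for every integer x between the observed points.

```r
# Restated helper (identical to step_approx above) so this snippet is self-contained
step_approx <- function(x, y) {
  new_x <- c()
  new_y <- c()
  for (i_x in x[1]:x[length(x)]) {
    if (i_x %in% c(x)) {
      i_y <- y[which(x == i_x)]   # update the carried value at observed points
    }
    new_x <- c(new_x, i_x)
    new_y <- c(new_y, i_y)
  }
  data.frame(x = new_x, y = new_y)
}

# Hypothetical step function observed at x = 1, 3, 6
sf <- step_approx(x = c(1, 3, 6), y = c(1.0, 0.8, 0.5))
sf$x   # 1 2 3 4 5 6
sf$y   # 1.0 1.0 0.8 0.8 0.8 0.5
```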
Now everything is prepared and we understand the data sufficiently. We can start answering SeventhCloud’s two questions.
Question 1: What are the cost and revenue per 100 free and premium subscriptions?
We will answer the first question in 5 steps.
Step 1: Estimate the Kaplan Meier curve for the costs scenario
For the first step, we have to create a survival object using the survival package. Based on the survival object, we can easily compute the Kaplan-Meier curve.
#######
# Question 1: What are the cost and revenue per 100 free and premium subscriptions?

### Step 1
# Creating a survival object for right-censored data
SurvSub <- Surv(Subscription$duration, Subscription$censored == "no")
SurvSub   # a plus sign indicates right censoring

KaplanMeier <- survfit(SurvSub ~ 1)
KaplanMeier
summary(KaplanMeier)
plot(KaplanMeier, main = "Probability of Survival",
     xlab = "time since free version subscription in h",
     ylab = "probability of still being in subscription")
The Kaplan Meier curve is a step-function showing us the estimated probability of survival for a given point in time. The “survival” in our case means the probability that the person has not unsubscribed because we will need to pay the same cost for people regardless of whether they are free subscribers or premium subscribers.
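As a sanity check of how the Kaplan-Meier estimator builds this step function, here is a toy example with hypothetical durations. At each event time, the estimator multiplies the running survival estimate by (1 - events/at-risk):

```r
library(survival)

# Hypothetical toy data: event = 1 means the user unsubscribed,
# event = 0 means the observation was right-censored
time  <- c(5, 8, 8, 12, 15, 20, 20, 20)
event <- c(1, 1, 0, 1,  0,  1,  0,  0)

km <- survfit(Surv(time, event) ~ 1)

# Manual check at the first event time t = 5:
# 8 users at risk, 1 event, so S(5) = 1 - 1/8 = 0.875
summary(km, times = 5)$surv   # 0.875
```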
Step 2: Calculate total monthly costs per 100 free subscriptions based on Kaplan Meier curve
If we multiply the probability for a given point in time that a person is still subscribed with 100, then we get the expected number of survivors of 100 initial subscriptions. If we multiply this number with the variable costs per hour and add these up, we would get the expected variable costs per 100 subscriptions. Now we only need to add the fixed costs.
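The accounting described above can be sketched first with a hypothetical three-hour survival curve, before we apply it to the real Kaplan-Meier estimates; all numbers besides the cost parameters from the case are made up:

```r
# Toy version of the cost logic (hypothetical 3-hour survival curve)
S <- c(1.00, 0.90, 0.75)    # P(still subscribed) at hours 1..3
n_start    <- 100           # initial free subscriptions
cost_per_h <- 20 / 24       # variable cost per subscriber-hour (20 EUR per day)
fixcost    <- 100000        # fixed cost

# Expected survivors per hour, times cost per hour, summed over all hours
expected_var_cost <- sum(S * n_start * cost_per_h)
total_cost <- expected_var_cost + fixcost
```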
### Step 2
str(KaplanMeier)
y <- KaplanMeier$surv
x <- KaplanMeier$time
cost_surv <- step_approx(x, y)   # interpolate because of the stepwise function

tcost_per_h <- (20/24) * 100
fixcost <- 100000
tcost100 <- sum(cost_surv$y) * tcost_per_h + fixcost
tcost100
According to the calculation in this step, the total cost per 100 subscriptions is 126,500.90 €.
Step 3: Estimate the Kaplan Meier curve for the revenue scenario
Now we repeat the first step, with the difference that we do it for the revenue scenario. This time, we only receive revenue from those who have a premium subscription, and here comes the trick mentioned earlier, which is why we can use two ending events.
### Step 3
NotCensored <- subset(Subscription, censored == "no")
SurvSub2 <- Surv(NotCensored$duration, NotCensored$subscription == "yes")
SurvSub2   # a plus sign indicates right censoring

KaplanMeier2 <- survfit(SurvSub2 ~ 1)
KaplanMeier2
summary(KaplanMeier2)
plot(KaplanMeier2, main = "Probability of Survival",
     xlab = "time since free version subscription in h",
     ylab = "probability of not yet having subscribed to premium")
Step 4: Calculate the total monthly revenue per 100 free subscriptions based on Kaplan Meier curve
Using the Kaplan Meier curve from step 3, we can now calculate the expected monthly revenue per 100 free subscriptions.
### Step 4
str(KaplanMeier2)
y2 <- KaplanMeier2$surv
x2 <- KaplanMeier2$time
rev_surv <- step_approx(x2, y2)   # interpolate because of the stepwise function

trev_per_h <- (100/24) * 100
trev100 <- sum(rev_surv$y) * trev_per_h
trev100
The code returns an overall revenue of 198,954.80 €. Now we can compute the profit.
Step 5: Calculate monthly profit per 100 free subscriptions
Finally, we just calculate the profit using the results from step 2 and step 3.
### Step 5
profit <- trev100 - tcost100
profit
Now we can summarize the results for SeventhCloud: the expected costs per 100 free subscriptions are 126,500.90 €, the expected revenue is 198,954.80 €, and the profit is 72,453.82 €. Based on this information, they can conclude that the freemium model is working for them. Of course, the calculation above is simplified. However, it is the basis for more advanced and complex estimations. It would, for instance, be possible to use probability density functions instead of a step function and calculate the total cost, revenue, and profit per 100 subscriptions over an unlimited horizon. It would also be possible to include tax considerations or use more complex cost structures than in this simplified procedure. You can also calculate the net present value based on the very same procedure I just showed you.
Question 2: How can we improve the freemium model and increase the premium subscription rate?
The second question can be answered in two steps. First, we will use the Cox proportional hazards regression model to identify differences between groups and possible causal factors influencing the time a person takes to decide on a premium subscription or unsubscription, and to derive hypotheses for future development. In the second step, we will calculate other possible scenarios based on the procedure from the first question.
Step 1: Choose different variables and compute Cox regression
Besides the necessary variables, we have further variables whose relationship with the duration until subscription or unsubscription we can investigate. We will do this here for the time until premium subscription, to find possible relationships and formulate hypotheses on how to make people sign up faster for the premium offer.
#######
# Question 2: How can we improve the freemium model and increase the premium subscription rate?

## Step 1: How can we increase premium subscriptions?
coxfit2 <- coxph(SurvSub2 ~ NotCensored$prevcustomer + NotCensored$appdays +
                   NotCensored$industry + NotCensored$size,
                 method = "breslow")
coxfit2
This gives us the following output:
Call:
coxph(formula = SurvSub2 ~ NotCensored$prevcustomer + NotCensored$appdays +
NotCensored$industry + NotCensored$size, method = "breslow")
coef exp(coef) se(coef) z p
NotCensored$prevcustomeryes 0.0859 1.0897 0.6661 0.13 0.90
NotCensored$appdays 0.0738 1.0766 0.0525 1.41 0.16
NotCensored$industryConsulting -0.6299 0.5326 1.1414 -0.55 0.58
NotCensored$industryFinance -0.4750 0.6219 1.0779 -0.44 0.66
NotCensored$industryFood -1.0413 0.3530 1.0734 -0.97 0.33
NotCensored$industryHealth -1.3304 0.2644 1.1265 -1.18 0.24
NotCensored$industryIT -0.1096 0.8962 0.5094 -0.22 0.83
NotCensored$industryManufacturing -1.4925 0.2248 0.6873 -2.17 0.03
NotCensored$industryTelecommunication -0.9828 0.3743 1.0980 -0.90 0.37
NotCensored$size1001+ -0.2467 0.7814 0.5421 -0.46 0.65
NotCensored$size51-1000 0.4136 1.5122 0.5390 0.77 0.44
Likelihood ratio test=11.97 on 11 df, p=0.4
n= 79, number of events= 26
Now this is a typical output of the Cox regression model, but before we start to interpret it, we need to remember that it describes not a survival but a hazard function. That means the output of the model is a relative risk of the form h(t;x) = h_{0}(t)e^{βx}, where h_{0}(t) is the baseline hazard, x is a covariate and β its parameter. Because of the natural exponent e^{βx}, we cannot take the estimated coefficients and interpret them directly: each unit increase Δx of x multiplies the baseline hazard by e^{βΔx}. Furthermore, we have the p-values, which quantify the uncertainty or precision of our estimates. The general rule of thumb is that if the p-value is below 0.05, we have a significant effect and can assume that there is a relationship.
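As a concrete reading of the output above, the manufacturing coefficient translates into a hazard ratio like this:

```r
# Reading a Cox coefficient as a hazard ratio, using the manufacturing
# estimate from the output above
beta <- -1.4925
exp(beta)   # about 0.22: roughly a fifth of the reference industry's hazard
```

This matches the exp(coef) column: a manufacturing user subscribes to premium at about 22% of the baseline hazard rate.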
It is also important to differentiate between statistical significance and economic significance. Statistical significance only helps us identify whether there might be any relationship between the y-variable and the x-variables. However, it does not tell us whether the variable is relevant. If the effect of the variable on the y-variable is very small despite the statistical significance, then the economic significance is low. Imagine we had one significant variable that increases the baseline hazard by 0.00001% and another that increases it by 30%. The former would be far less economically significant.
Now in our case, we detect two interesting things. First, we can formulate the hypothesis that the industry matters, as one level differs significantly from the reference level. The more interesting question to address in the future is why it differs. Does it differ because our product covers the needs of certain industries better, or because there is a lack of competition? The second interesting thing is the variable appdays, which is not significant. Nevertheless, we put it on the agenda, because there might be another interesting relationship. We formulate the hypothesis that heavy users are more likely to sign up. For further investigation, we create another plot differentiating between people who used the app only a little (appdays low) and people who used it a lot (appdays high).
# Create two Kaplan-Meier models for the two subgroups
NotCensored$Appdayslabel <- ifelse(NotCensored$appdays >= median(NotCensored$appdays),
                                   "high", "low")
appdaysSurv <- Surv(NotCensored$duration, NotCensored$subscription == "yes")
appdaysFit <- survfit(appdaysSurv ~ Appdayslabel, data = NotCensored)
ggsurvplot(appdaysFit, data = NotCensored)
table(NotCensored$Appdayslabel)
high low
41 38
The shape seems very interesting and the groups seem balanced, which is very important in order to be able to make adequate conclusions and generalizations. The first interesting observation is that there is a sudden drop at day XX for the low-group, while the survival chance of the high-group decreases rather continuously. The question that comes up here is: why is there a sudden drop? A possible hypothesis is that the low-group and the high-group have different needs.
The rough analysis just shown can also be done in greater depth, with further methods and with greater care when picking possible hypotheses. You can apply the same procedure to the hazard function for unsubscription to find out how to either a) increase or b) decrease free subscriptions, depending on your strategy.
Step 2: Calculate different scenarios (profit margin and comparison)
Given the hypothesis and conclusions from the first step, you could now do different simulations to calculate a business case for this scenario using the procedure presented to you in question 1. In fact, we can simplify everything into a simple function:
### Step 2
freemium_profit <- function(df) {
  SurvSub <- Surv(df$duration, df$censored == "no")
  KaplanMeier <- survfit(SurvSub ~ 1)

  ### Step 2 from Q1
  y <- KaplanMeier$surv
  x <- KaplanMeier$time
  cost_surv <- step_approx(x, y)   # interpolate because of the stepwise function
  tcost_per_h <- (20/24) * 100
  fixcost <- 100000
  tcost100 <- sum(cost_surv$y) * tcost_per_h + fixcost
  print(paste("Total cost is", round(tcost100, digits = 2), "EUR"))

  ### Step 3 from Q1
  NotCensored <- subset(df, censored == "no")   # subset df, not the global Subscription data
  SurvSub2 <- Surv(NotCensored$duration, NotCensored$subscription == "yes")
  KaplanMeier2 <- survfit(SurvSub2 ~ 1)

  ### Step 4 from Q1
  y2 <- KaplanMeier2$surv
  x2 <- KaplanMeier2$time
  rev_surv <- step_approx(x2, y2)
  trev_per_h <- (100/24) * 100
  trev100 <- sum(rev_surv$y) * trev_per_h
  print(paste("Total revenue is", round(trev100, digits = 2), "EUR"))

  ### Step 5 from Q1
  profit <- trev100 - tcost100
  print(paste("Total profit is", round(profit, digits = 2), "EUR"))
}
For demonstration purposes, we want to investigate the expected profit if we had only the low-appdays group or only the high-appdays group from step 1.
LowAppdays <- subset(Subscription, appdays < median(appdays))
HighAppdays <- subset(Subscription, appdays >= median(appdays))
freemium_profit(LowAppdays)
freemium_profit(HighAppdays)
> freemium_profit(LowAppdays)
[1] "Total cost is 112905.7 EUR"
[1] "Total revenue is 198954.76 EUR"
[1] "Total profit is 86049.05 EUR"
> freemium_profit(HighAppdays)
[1] "Total cost is 129630.72 EUR"
[1] "Total revenue is 198954.76 EUR"
[1] "Total profit is 69324.04 EUR"
Now we can see that the profit for the low scenario would be 86,049.05 € and for the high scenario 69,324.04 €. Given these results, and if the second hypothesis holds true, we might be able to increase the monthly profit per 100 subscriptions by focusing on the low-appdays users. Of course there are more considerations to be made, but with this methodology it is possible to simulate different scenarios.
Conclusion: Survival & hazard models are powerful tools for the freemium model
There are three general remarks still to be made on the freemium model. First, the methodology shown fits regardless of the strategy. A very important takeaway from the story and methodology shown is that it helps solve a core problem of the freemium model:
The methodology presented for answering the second question helps you to navigate to the “right” balance and further optimize the offers.
Second, you could of course compute the conversion rate in the traditional way, but it will be biased: because of the right censoring in our data, it will systematically deviate from the true conversion rate. You will therefore derive a more accurate conversion rate from the results of a survival model. Sometimes the classical conversion rate will not be far off, but there is no guarantee of that.
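A small toy comparison (with hypothetical durations and outcomes, not the SeventhCloud data) shows how the two rates can diverge; the direction and size of the bias depend on the data at hand:

```r
library(survival)

# Hypothetical toy data: days until decision (capped at 30) and whether
# a premium subscription was observed (1) or the user was censored (0)
days      <- c(3, 5, 10, 30, 7, 30, 14, 30)
converted <- c(1, 0,  1,  0, 1,  0,  1,  0)

naive_rate <- mean(converted)   # 0.5, treats censored users as non-converters

km <- survfit(Surv(days, converted) ~ 1)
adjusted_rate <- 1 - summary(km, times = 30)$surv
adjusted_rate   # 0.5625: here the naive rate understates conversion
```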
Third, a common problem that companies using the freemium model face is that subscriptions start to flatten at some point. At this point, many companies pivot to a limited freemium version with a 30-day free trial, or abandon the freemium model completely. The methodology shown can tell you in advance when a pivot might be needed, simply by tracking the first-month profit per 100 new subscriptions and the number of monthly subscriptions. If either of these metrics starts dropping, you should analyze the reason and take the right actions to sustain the business.
Now, what is the story? SeventhCloud has created a good service that appeals to new clients, especially from manufacturing. Their business will continue to grow and they will make a healthy profit. If SeventhCloud specializes in manufacturing and increases both the quality and the frequency with which subscribers interact with the platform, they can achieve even greater growth with the freemium model.
In our small case study, I will show you how you can understand your customers through their actual underlying utilities and preferences, using a concrete example of a conjoint analysis. The case is fictional. Conjoint analysis is a set of methods (a framework) that enables you to derive the underlying utilities and preferences of consumers by looking at their decisions. In contrast to classical methods, you do not need to run after customers and ask them what they like; you simply observe their actual choices or judgements and then derive the most likely set of preferences, here called a utility function. If you want to know more about conjoint analysis, check out my in-depth article about conjoint analysis here. If you want to know how you can build your own conjoint analysis, check out my detailed step-by-step guide here.
The Problem: Can a laptop startup company compete against Apple, Dell and co.?
In today’s small case, I will help a laptop startup company, which we will call Ethos, understand its primary target customers: students at a university. Ethos wants to sell its laptops mainly through online platforms and is excited to bring its vision to life. However, they know that they have to make the right decisions, and they have three main questions they want to answer.
The Method & Premises
In this section, I will briefly go through the seven steps I presented on how to construct your own conjoint analysis.
Step 1: The Problem & Attributes
After talking to the product manager of Ethos, it is clear that the attributes we want to look at are the following, with the following expectations:
Attribute | Levels | Influence | Interactions |
--- | --- | --- | --- |
Brand | | | |
Cores | | | |
RAM | | | |
Hard Drive | | | |
Display Size | | | |
Graphics | | | |
Touch Screen | | | |
Weight | | | |
These are the variables that are thought to be the most important ones on which the consumers make decisions. Ideally, the variables have resulted from qualitative investigation such as focus groups and interviews.
Step 2: The Preference Model
In our case, the problem is relatively clear: we want to understand the potential customer. The vector model and the mixed model cannot help us much here and are not the ideal solution. The ideal-point model, on the other hand, offers an interesting map for each person, but it is less useful for answering the second and third questions Ethos posed. The ideal choice in our case is the part-worth model. A part-worth model fits very well with a fractional factorial design, we can use it to answer all three questions, and we can even visualize the results in graphs. This makes it the ideal model for understanding the customer.
Step 3: Data Collection
If we want to use a part-worth model, it makes the most sense to use concept evaluation. Since Ethos wants to sell its laptops online, the goal is to make the conjoint analysis as similar to that situation as possible. On a platform like Amazon, laptops are usually presented as concepts, so concept evaluation seems to be the best fit. By making the study similar to the real situation, we increase the probability that we can later generalize the results to the real case, e.g. students buying laptops from Ethos on an online platform like Amazon one day. Another consideration is that when customers browse laptops on online platforms, they do not buy them immediately.
A second important aspect is that, according to the interviews conducted with potential customers prior to constructing the conjoint analysis, customers do not make an immediate purchase decision about laptops. They rather first go through the laptops they can find online and make a first evaluation of them. This makes us believe that it makes sense to ask our respondents to rate each alternative rather than make discrete choices.
Step 4: Experimental Design
Since there are no interaction effects, we will use a fractional factorial design that we can generate simply using the package “” in R. Using this package, it is possible to test out the optimal number of levels and variables for a fractional factorial design. The Python alternative to the package is “”. We use the following code to generate a fractional factorial design and insert our level descriptions.
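To make the size of the savings concrete, here is a hypothetical sizing sketch. The two levels per attribute and the FrF2 call are assumptions for illustration only; the actual levels were defined in Step 1 and the actual package is the one referenced above.

```r
# Hypothetical sizing: suppose each of the 8 attributes had just 2 levels
levels_per_attribute <- c(Brand = 2, Cores = 2, RAM = 2, HardDrive = 2,
                          DisplaySize = 2, Graphics = 2, TouchScreen = 2,
                          Weight = 2)
prod(levels_per_attribute)   # 256 runs for the full factorial

# A fractional design covers 8 two-level factors in far fewer runs,
# e.g. 16 runs with the FrF2 package:
# library(FrF2); design <- FrF2(nruns = 16, nfactors = 8)
```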
CODE R for fractional factorial design
Design
Here the advantage of the fractional factorial design becomes evident. If we had run a full factorial design, we would have needed to let each person go through XXXX runs. Using a fractional factorial design, we managed to reduce this to XXXX runs. However, this is only possible if there are no interaction effects between our variables, which we tested and ensured prior to the study.
Picture of example
Step 5: Presentation of Alternatives
Since it was clear from the very beginning that Ethos would pursue an online sales strategy, it was very important to design the presentation of each alternative to be as close as possible to a realistic purchasing scenario. While for a physical shop you might showcase prototypes of different products in the real environment and then ask for a rating, you will want to make the online scenario equally realistic. Since Ethos considered selling its laptops on Amazon, because it would be difficult to attract customers from scratch, it was necessary to adapt the concept evaluation to the design of Amazon, including the advantages and disadvantages that this entails. Therefore, I constructed an experimental homepage that resembled Amazon for collecting data. Below you can find an example of how run xyz from the table in Step 4 would look on the homepage:
EXAMPLE PICTURE ON AMAZON OF AN RUN
Another consideration is that it might be useful to add a description of all attributes and why they might be important. A laptop purchase by a student can be considered an investment: they will spend a considerable amount of time with it and inform themselves before purchasing. It cannot be compared to buying a drink or an ice cream at the supermarket. We need to make sure that the customers can fully inform themselves before they make decisions. Therefore, we include a description of all attributes, their importance, and their relevance. For instance, we would explain that more RAM might be important if you edit videos, edit high-resolution images, or process large amounts of data. Furthermore, we add the constraints that participants have to read through the descriptions and that the whole experiment cannot be completed in under 30 minutes. This forces the respondents to think every option through and really engage with the alternatives, in order to achieve ratings that are as realistic and accurate as possible.
Step 6: Measurement Scale
Since we want the customer to rate each alternative, we need a metric measurement, in particular a Likert scale. Likert scales are by default interval scales, which means that we only learn by how much the overall utility increases when changing the level of an attribute. We are also restricted to an interval scale by the fact that we chose a part-worth model and a fractional factorial design. A continuous or ratio scale would generally not be possible with a fractional factorial design or a part-worth model unless we made assumptions about linearity and interactions that are simply unrealistic.
Step 7: Estimation Method
Finally, there is not much room left to choose from the pool of estimation methods. The best-fitting estimation method for our case is multiple linear regression to estimate a utility function for each individual, because multiple linear regression can estimate the effect of each factor and is well suited to a fractional factorial design. In a nutshell, the method looks at a customer’s ratings and calculates the most likely utility function given those ratings. Hence it tries to understand your choices and which attributes are most relevant to them, given the assumptions we made.
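As a minimal sketch of this estimation, here is a hypothetical two-attribute mini-design with made-up ratings; lm() recovers the part-worths through R’s automatic dummy coding of the factor levels:

```r
# Sketch: estimating one respondent's part-worth utilities with lm()
# (hypothetical mini-design: 2 attributes, 4 rated profiles)
profiles <- data.frame(
  RAM    = factor(c("8GB", "16GB", "8GB", "16GB")),
  Touch  = factor(c("no", "no", "yes", "yes")),
  rating = c(3, 6, 4, 7)          # this respondent's Likert ratings
)

fit <- lm(rating ~ RAM + Touch, data = profiles)
coef(fit)   # intercept plus the part-worth of each non-reference level
```

Here the reference levels are "16GB" and "no", so the coefficients say this respondent loses 3 utility points when dropping to 8GB RAM and gains 1 point from a touch screen.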
Solution
Reliability & Validity???
What is the story
Conclusion