Using Census, Social Security and Tax data to impute the complete Australian income distribution
Australian National University, Australia
Relevance & Research Question: Economists and governments are deeply interested in the income distribution, the level of movement across it, and how observable characteristics predict a person's position on it. Different countries address these topics using a combination of cross-sectional surveys, panel studies, and administrative data. Australia has been well served by sample surveys on the income distribution, but these are of limited use for relatively small population groups or for precise points on the distribution. Australian researchers have made limited use of administrative data, not because the data does not exist, but because of privacy and practical challenges with linking individuals and making the data available to external researchers. In this paper, we apply machine learning and standard econometric techniques to develop synthetic estimates of the Australian income distribution, validate these estimates against high-quality survey data, use the resulting dataset to measure movement across the income distribution longitudinally, and measure ethnic disparities (by Indigeneity and ancestry).
Methods & Data: The dataset used in this paper has at its core individually linked medical, cross-sectional Census (i.e. survey), social security and tax data for 6 financial years. None of these sources alone is complete for all parts of the income distribution, but combined they can generate high-quality estimates. Broadly, we generate a continuous cross-sectional income estimate from Census bands in 2011, test various machine learning algorithms to predict income using observed tax and social security data in 2011, use parameter estimates from the algorithms to estimate income in the following 5 financial years (based on demographic, tax and social security data for those years), validate the estimates against survey data, and then analyse them.
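The workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the band boundaries, the toy records, the use of band midpoints as the continuous income estimate, and a one-predictor least-squares fit standing in for the machine learning algorithms that the paper actually compares.

```python
from statistics import mean

# Hypothetical Census income bands (weekly dollars); not the real Census bands.
BANDS = {"low": (0, 400), "mid": (400, 1000), "high": (1000, 2000)}


def band_midpoint(band):
    """Turn a banded Census answer into a continuous value (midpoint assumption)."""
    lower, upper = BANDS[band]
    return (lower + upper) / 2


def fit_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict(params, x):
    """Apply fitted parameters to new administrative income values."""
    intercept, slope = params
    return [intercept + slope * xi for xi in x]


# Toy base-year records: (Census band, tax-reported income, social security payments).
base_year = [("low", 100, 150), ("mid", 600, 50), ("high", 1500, 0),
             ("mid", 700, 0), ("low", 0, 200)]
admin_income = [tax + benefits for _, tax, benefits in base_year]
census_income = [band_midpoint(band) for band, _, _ in base_year]

# Step 1: learn the relationship in the base year, where Census income is observed.
params = fit_ols(admin_income, census_income)

# Step 2: reuse the parameters in a later year, where only admin data is observed.
later_year = [(120, 140), (900, 0)]  # (tax, benefits), toy values
estimates = predict(params, [tax + benefits for tax, benefits in later_year])
```

The point of the sketch is the two-step structure: parameters are estimated once in the year with an observed income benchmark and then carried forward to years that have only administrative inputs.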
Results: We show that certain algorithms perform far better than others, and that we are able to generate highly accurate predictions that match survey data at the national level. We then derive new insights into income inequality in Australia.
Added Value: We outline a methodology and set of techniques for when income data needs to be combined across multiple sources, demonstrate a productive link between ML and econometric techniques, and shed new light on the Australian income distribution.
How to find potential customers on district level: Civey's innovative methodology of Small Area Estimation through Multilevel Regression with Poststratification
Relevance & Research Question: Reliable market research is the basis for making the right decisions. Market researchers understand customer interests or the perception of existing products. However, the question of how and where potential customers can be reached is difficult to answer precisely. To solve this problem, Civey has developed Small Area Estimation through multilevel regression with poststratification in a live system. Thus, customers recognize potential leads even in the smallest geographical areas such as districts (“Landkreise”).
Methods & Data: The basis for this is an MRP model (Multilevel Regression with Poststratification), which Civey has implemented for real-time calculations. Data is collected online on over 25,000 websites, yielding over fifteen million opinions each month. With one million verified and active users monthly, Civey has established Germany's largest open access panel.
Based on a two-stage process developed by Civey, which combines hierarchical logistic regression models and poststratification with variable selection by LASSO, real-time applications of MRP are possible to provide Small Area Estimations. In addition to the user-based information, the model also accounts for publicly available auxiliary information on district level.
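The poststratification step of MRP can be sketched as follows. This is a generic illustration, not Civey's implementation: the cell probabilities are assumed to come from an already fitted multilevel model, and the districts, cells, and population counts are invented.

```python
from collections import defaultdict


def poststratify(cell_probs, cell_counts):
    """Aggregate cell-level predictions into district-level estimates.

    cell_probs:  {(district, demographic_cell): predicted probability}
    cell_counts: {(district, demographic_cell): population count in that cell}
    Each district estimate is the population-weighted mean of its cells.
    """
    weighted = defaultdict(float)
    totals = defaultdict(float)
    for cell, count in cell_counts.items():
        district = cell[0]
        weighted[district] += cell_probs[cell] * count
        totals[district] += count
    return {district: weighted[district] / totals[district] for district in totals}


# Hypothetical model output: probability of a given answer per cell.
cell_probs = {
    ("A", "under_40"): 0.2, ("A", "40_plus"): 0.6,
    ("B", "under_40"): 0.4, ("B", "40_plus"): 0.1,
}
# Hypothetical population counts per cell (e.g. from official statistics).
cell_counts = {
    ("A", "under_40"): 100, ("A", "40_plus"): 300,
    ("B", "under_40"): 50, ("B", "40_plus"): 150,
}

district_estimates = poststratify(cell_probs, cell_counts)
```

Because the weights come from the known population structure rather than from the sample, a district with few respondents still receives an estimate that reflects its actual demographic composition.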
Results: The model can be used to predict the probability that a certain person will give a particular answer for any combination of sociodemographic information. The model "learns" based on all information available. This model-based approach enables fast valid results even in the smallest geographical areas.
Added Value: After a brief introduction to the methodology, Civey provides unique insights into their results. This includes interesting evaluations of potential customers in the automotive market, but also amusing examples to show the variety and depth of data that this innovation allows.
Platform moderated data collection: Experiences of combining data sources through a crowd science approach.
TU Berlin, Germany
Relevance & Research Question: The central idea of crowd science is to engage a wide base of potential contributors who are not professional scientists in the process of conducting and/or analyzing research data (e.g. Franzoni & Sauermann, 2014). Crowd science carries the potential to lift data treasures or to analyze data that is too large for a small research team, but at the same time too unstandardized for computational research methods. While such approaches have been used successfully in the natural sciences and the digital humanities, they are rare in the social sciences. Hence, we know very little about the particular challenges of this approach or its fit to certain research questions or types of data (Scheliga et al. 2018).
Methods & Data: In this talk, we report and reflect on the crowd science approach that we used to enrich data on the social relationships among entrepreneurial groups (Ruef 2010). Starting from a core data set based on administrative data (Weinhardt and Stamm 2019), we designed a crowd science task that asks participants to research information on company websites and in news articles on predefined cases of entrepreneurs in order to enrich our overall data set. To implement this task, we set up our own crowd science platform that moderated task distribution and the collection of the researched information. In order to qualify the crowd for this task, in our case students in the social sciences across Germany, we offered a 45-minute online training on the methodology of process-generated data. After completion, participating students could engage in the research task and, by doing so, collect points and win prizes.
Results: We discuss the methodological challenges of extracting and combining information from the different sources, as well as the pragmatic challenges, which range from setting up a multi-purpose online platform to finding and motivating participants.
Added Value: These insights and reflections advance the methodological discussion on crowd science as a digital method and initiate a discourse on the potentials and shortcomings of combining data sources via platform-moderated data collection.
The Combination of Big Data and Online Survey Data: Displaying of Train Utilization on Bahn.de and its Implications
1University of Applied Sciences Europe, Germany; 2DB Fernverkehr AG, Germany; 3exeo Strategic Consulting AG, Germany
Relevance & Research Question:
In Germany, the utilization of long-distance trains has risen over the last 10 years from about 44% (2008) to 55% (2018). The German government has stipulated further demand growth for the coming years: the goal is to double the number of passengers by 2030. While demand has so far primarily been controlled by a Revenue Management system (saver fare and super saver fare), the question arises whether controlling and smoothing demand is also possible through non-price measures.
Methods & Data:
Based on forecast data, capacity utilization for each journey is estimated. Using these data, a display system with 4 icons was developed, which informs customers on bahn.de about the expected utilization of a single train connection. After a concept phase, qualitative research as well as A/B testing was performed. Finally, in April 2019, the display system was introduced on all major distribution channels. Two online surveys were then conducted: one study focused on ticket buyers (Jan.-Oct. 2019, n > 10,000), the other surveyed visitors of bahn.de who did not buy a train ticket (Oct. 2019, n = 2,000).
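How a utilization forecast might be translated into one of four icons can be sketched as below. The cut-off values and icon labels are purely hypothetical, since the abstract does not state the thresholds or wording used on bahn.de.

```python
from bisect import bisect_right

# Hypothetical cut-offs for the forecast share of occupied seats (assumption).
THRESHOLDS = [0.5, 0.75, 0.9]
# Hypothetical labels for the four icons (assumption).
ICONS = ["low", "medium", "high", "exceptionally high"]


def utilization_icon(forecast_passengers, seats):
    """Map a forecast passenger count for one connection to one of four icons."""
    if seats <= 0:
        raise ValueError("seats must be positive")
    utilization = forecast_passengers / seats
    # bisect_right counts how many thresholds the utilization has reached.
    return ICONS[bisect_right(THRESHOLDS, utilization)]


# Toy connections: (forecast passengers, seats available).
connections = {"06:30": (200, 800), "08:00": (760, 800), "10:15": (480, 800)}
icons = {dep: utilization_icon(p, s) for dep, (p, s) in connections.items()}
```

Keeping the thresholds in a single sorted list makes it straightforward to test alternative cut-offs, for example in an A/B test of the kind mentioned above.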
The multi-source, multi-method approach yields clear and consistent indicators of several positive effects of the utilization forecast icons: first, demand shifts towards less utilized trains (thus achieving the goal of demand smoothing); second, the seat reservation rate increases; and third, the information improves comfort for travelers. However, in time windows with overall high train utilization, customers are sometimes lost.
On the one hand, the combination of big data, experimental design and online surveys generates the database for displaying icons (load forecast) at the same level as train connections and fares on bahn.de. On the other hand, during the period of market introduction (as of May 2019), key information can be obtained that leads to a 360-degree perspective and generates deep insights into the effects for Deutsche Bahn as well as for railway customers. Furthermore, starting points for optimizing the displayed icons are identified.