Modeling Science, Technology & Innovation Conference | Washington D.C. | May 17-18, 2016
Wednesday, May 18th 2016 | 8:00 AM – 9:30 AM
Data, Algorithms, and Infrastructure
High-quality predictions require access to high-quality, high-coverage data. Just as local data is of little value for global weather predictions, data for just one institution or country is of limited value when aiming to make STI predictions.
Victor H. Yngve Distinguished Professor of Information Science & Director, Cyberinfrastructure for Network Science Center, Indiana University
Assistant Professor of Information Science, School of Informatics and Computing, Indiana University
Katy Börner is the Victor H. Yngve Distinguished Professor of Information Science in the Department of Information and Library Science, School of Informatics and Computing; Adjunct Professor at the Department of Statistics in the College of Arts and Sciences; Core Faculty of Cognitive Science; Research Affiliate of the Center for Complex Networks and Systems Research and Biocomplexity Institute; Member of the Advanced Visualization Laboratory; Leader of the Information Visualization Lab; and Founding Director of the Cyberinfrastructure for Network Science Center at Indiana University in Bloomington, IN; and Visiting Professor at the Royal Netherlands Academy of Arts and Sciences (KNAW) in The Netherlands. She is a curator of the international Places & Spaces: Mapping Science exhibit. She became an American Association for the Advancement of Science (AAAS) Fellow in 2012.
National Institutes of Health
Improving the Research Portfolio Data Infrastructure at the National Institutes of Health
Abstract: Demonstrating the impact of research investments made years, and sometimes decades, earlier and using that information to predict future trends in science has never been easy. Now, the increased availability of relevant databases, new database technologies, and informatics capabilities creates the potential to more readily establish linkages between federal investments in science and long-term outcomes. This talk describes an effort the NIH is making to create a data infrastructure that will facilitate the analysis of NIH research investments and the development of predictive models.
Bio: James Onken is Senior Advisor to the NIH Deputy Director for Extramural Research and Director of the Office of Data Analysis Tools and Systems within the NIH Office of Extramural Research (OER). He has been conducting portfolio analyses and program evaluations at the NIH for over 27 years, holding positions at the National Institute of Mental Health and National Institute of General Medical Sciences before moving to OER. He previously held positions at AT&T Bell Laboratories, Decisions and Designs, Inc., and the U.S. Government Accountability Office. He holds M.S. and Ph.D. degrees in psychology from Northwestern University, and an MPH with a concentration in biostatistics from George Washington University.
National Institutes of Health
Utilizing citation networks to explore and measure scientific influence
Abstract: The 2013 San Francisco Declaration on Research Assessment decried the widespread and invalid use of Journal Impact Factors for comparing the scientific output of scientists or institutions. The NIH Office of Portfolio Analysis has developed an improved method to quantify the influence of a research article by making novel use of its co-citation network to field-normalize the number of citations it has received. Article citation rates are divided by an expected citation rate that is derived from the performance of articles in the same field and benchmarked to a peer comparison group. The resulting Relative Citation Ratio (RCR) is article-level and field-independent, identifies influential papers independently of their publication venue, and thus provides an alternative to the invalid practice of using Journal Impact Factors. We demonstrate that the values generated by this method strongly correlate with the opinions of subject matter experts in biomedical research, and suggest that the same approach should be generally applicable to articles published in all areas of science. A beta version of iCite, our web tool for calculating the RCRs of articles listed in PubMed, is available at https://icite.od.nih.gov.
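The core of the computation is a ratio of observed to expected citation rates, rescaled against the peer comparison group. A deliberately simplified sketch follows; the function, toy rates, and benchmark below are illustrative assumptions, not the published iCite algorithm:

```python
def relative_citation_ratio(article_rate, field_rate, benchmark):
    """Toy RCR-style normalization (illustrative simplification).

    article_rate -- citations/year received by the article of interest
    field_rate   -- expected citations/year for the article's field,
                    taken as given here (iCite derives it from the
                    article's co-citation network)
    benchmark    -- (rate, field_rate) pairs for a peer comparison group,
                    used to anchor the scale so a typical benchmark
                    article scores about 1.0
    """
    field_normalized = article_rate / field_rate
    benchmark_mean = sum(r / f for r, f in benchmark) / len(benchmark)
    return field_normalized / benchmark_mean

# An article cited twice as often as its field, against a benchmark
# whose articles average 1.5x their own fields:
rcr = relative_citation_ratio(10.0, 5.0, [(4.0, 4.0), (6.0, 3.0)])
```

Here `rcr` works out to about 1.33: after field normalization, the article outperforms the peer group.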
Bio: Ian Hutchins is a Data Scientist in the Office of Portfolio Analysis within the Division of Program Coordination, Planning, and Strategic Initiatives, Office of the Director, National Institutes of Health. He leads development teams to make scientific portfolio analysis tools; he conducts trans-NIH portfolio analysis using bibliometrics, statistical programming, and text mining; and he teaches scientific portfolio analysis courses for agency staff. Prior to this, he investigated the self-assembly of neural circuits in models of neurological disorders and of healthy brain development. He holds a Ph.D. in Neuroscience and a B.S. in Genetics from the University of Wisconsin-Madison.
Harvard University
The Missing Link in How Science and Engineering Affect the Economy
Abstract: Studies of the impact of R&D and the work of scientists and engineers on the economy typically relate some measure of R&D (usually a stock created from flows) or of patents to levels or growth of productivity, sales, or profits. But neither R&D nor a patent produces a final product for sale. They produce knowledge that might contain an idea for a new product or process. They are inputs or intermediate outputs that enter a production or profits equation (valuable indicators of innovative activity) but not measures of actual innovation as defined by Schumpeter or the Oslo Manual, which require implementation or commercialization in the market. Measures of actual innovations are the missing link in our understanding of how science and engineering shape the economy. I examine three different ways to gain insight into actual innovations: through questions about the introduction of new products and processes in the NSF's Business Research and Development and Innovation Survey (BRDIS); through web-scraping the attributes, prices, and quantities sold of goods and services on websites; and through an Innovation Hunter crowd-sourcing activity in which volunteers search announcements and reports of new products and processes, along the lines of the 1982 Small Business Administration study of innovations that provided the data for Audretsch and Feldman's (1996) analysis of the geography of innovation and production. Measures of actual innovations in the form of new products and processes can be produced regularly on a worldwide basis. They have the potential for increasing our understanding of company reports of innovation on BRDIS, the EU Community Innovation Survey, and comparable survey data for China, and of transforming discussions of innovation that rely on aggregate indicators, on the one hand, or on business school case studies of new goods and services introduced in markets, on the other.
As Lord Kelvin famously said, “when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.” There is much that is new to be discovered in modeling science, technology, and innovation, but it requires new micro-data. I will give explicit examples of these modes of measuring innovation and of the models and research and policy questions they can illuminate.
Audretsch, David B., and Maryann P. Feldman. "R&D Spillovers and the Geography of Innovation and Production." The American Economic Review, June 1996.
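The web-scraping mode of measurement described above amounts to pulling product attributes and prices out of page markup. A minimal sketch with Python's standard-library HTML parser; the page structure and class names here are hypothetical, and a real scraper would first fetch live pages and handle far messier markup:

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup of the kind a scraper would fetch.
PAGE = """
<div class="product"><span class="name">Widget A</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Widget B</span><span class="price">4.50</span></div>
"""

class ProductParser(HTMLParser):
    """Collect (name, price) pairs from 'name'/'price' spans."""

    def __init__(self):
        super().__init__()
        self.field = None      # which labeled span we are inside, if any
        self.current = {}      # fields gathered for the product in progress
        self.products = []     # finished (name, price) records

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if "name" in self.current and "price" in self.current:
                self.products.append(
                    (self.current.pop("name"), float(self.current.pop("price"))))

parser = ProductParser()
parser.feed(PAGE)
```

Run regularly across many sites, extraction of this kind is what would let new-product measures be produced on a worldwide basis.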
Bio: Richard B. Freeman is Ascherman Professor of Economics at Harvard University. He directs the Science and Engineering Workforce Project at the National Bureau of Economic Research, is Faculty co-Director of the Labor and Worklife Program at the Harvard Law School, and Co-Director of the Harvard Center for Green Buildings and Cities. His research interests include the job market for scientists and engineers; the transformation of scientific ideas into innovations; Chinese labor markets; income distribution and equity in the marketplace; forms of labor market representation; and shared capitalism.
PayPal
Data, platforms and predictive models to enable data-driven actions
Abstract: Organizations strive to make data-driven decisions and tailor their actions to the individual needs of millions of customers, suppliers, and partners worldwide. Big data, storage and computation infrastructure, and predictive modeling algorithms enable this process. Making them work in harmony, and making the planned actions successful, takes engineering, data science, and business skills. The ingredients are in place: an abundance of data, cost-effective infrastructure, and a variety of algorithms are available and improving at a rapid pace. Computers have always recorded every event, activity, and signal they measure, making data “the exhaust pipe of computing”. These data can now be made available for analysis by a new generation of technologies for collecting, storing, and processing. Streaming technologies shorten the time from data creation to storage. Massively parallel processing systems, like Hadoop and Enterprise Data Warehouses, store petabytes of data; and in-memory processing engines, like Spark, support the interactive massive computations needed for predictive model training.
Predictive modeling algorithms are considered the key to extracting measurable value from big data. Predicting the future activity of each customer based on a wide variety of features is a promising field that is increasingly utilized by business, government, and education. Known cases include credit scoring and customer churn, which predict a customer's likelihood of paying a loan or terminating a service, respectively. Other models predict a customer's future events like growth, success rate, return value, and engagement. The predictions are often made by supervised learning algorithms that are trained on data comprising multiple aspects of the objects, tagged with known results for the metrics of interest. A typical model training fits a function of unknown structure to the available data and scores each new case based on the learned model. The scores affect actions like admit/reject, fee level, and incentives to offer.
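A minimal sketch of that train-then-score loop, using churn as the example; the customer features, labels, and the hand-rolled logistic fit below are illustrative assumptions, not any production system:

```python
import numpy as np

# Hypothetical training data: one row per customer, columns are
# (tenure in years, support tickets filed); label 1 = later churned.
X = np.array([[2.0, 0.0], [0.2, 5.0], [3.0, 1.0],
              [0.1, 4.0], [1.0, 3.0], [2.5, 0.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

def train_logistic(X, y, lr=0.1, steps=2000):
    """Fit a logistic-regression scorer by batch gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted churn probability
        w -= lr * Xb.T @ (p - y) / len(y)       # log-loss gradient step
    return w

def score(w, X):
    """Churn score in [0, 1] for each new customer."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

w = train_logistic(X, y)
# Score two new customers: one churn-like, one loyal-looking.
scores = score(w, np.array([[0.2, 5.0], [3.0, 0.0]]))
```

The scores would then drive the downstream action (retention offer, fee level, incentives); a real pipeline would add train/test splits, regularization, and calibration.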
Though technologies for data movement, storage, and the parallel computation needed to train large-scale models are readily available, making the process successful often takes R&D, e.g., in the design of new models. Big data is often observational, distributed across multiple sources, sparse, redundant, partially unreliable, or skewed. In short, it is messy, which creates a mismatch between the data and what the predictive model expects as input, thereby requiring data exploration, data munging, and feature engineering to transform the data into the right format.
Data and algorithms must be carefully selected to support the score-based actions. Algorithm selection is done under several tradeoffs, like score interpretability vs. accuracy, e.g., a single decision tree vs. an ensemble. Other tradeoffs include precision vs. recall, to match the costs of different errors, and excluding data features to eliminate bias in the actions based on the scores. Technologies supporting large-scale predictive models will be reviewed, and case studies highlighting tradeoffs and approaches for constructing end-to-end models will be described.
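The precision-vs-recall tradeoff can be made concrete: sweeping the score threshold trades one for the other. The scores and labels below are made up for illustration:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule 'act when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
strict = precision_recall(scores, labels, 0.7)   # fewer, surer actions
lenient = precision_recall(scores, labels, 0.3)  # more coverage, more errors
```

A strict threshold buys precision at the cost of recall; which point on the curve to pick depends on the relative cost of false positives vs. false negatives for the action in question.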
Bio: Nachum Shacham is a Director, Data Science at PayPal, where he is constructing models and leading a team of data scientists in identifying actionable patterns in large transactional, behavioral, and system performance datasets. Before that, he was with eBay, analyzing the performance of large data platforms. Prior to that, he was with SRI, leading research in internet technologies, next-generation wireless internet, and real-time voice and video communications over mobile networks. As co-founder and CTO of Metreo, he developed models for B2B pricing and subsequently created revenue models for online display and search advertising. Nachum holds BScEE and MScEE degrees from the Technion and a Ph.D. in EECS from UC Berkeley. Dr. Shacham is a Fellow of the IEEE.
National Institutes of Health - NIBIB
Challenges with Model and Data Sharing in Biomedical, Biological and Behavioral Systems
Abstract: Over the last decade, the number and types of computational models being developed for biomedical research have increased markedly. The biomedical community is beginning to recognize not only the usefulness of models, but also the essential role models play in integrating disparate fields of knowledge, identifying gaps, and presenting testable hypotheses to drive experiments. Multiscale modeling, in particular, is at the forefront of making a significant impact in biomedical discoveries, applied science, and medicine.
Over the last decade, the U.S., Europe, and Japan have promoted several government funding initiatives for modeling the physiome. In the U.S., a confluence of events resulted in the 2003 formation of the Interagency Modeling and Analysis Group (IMAG) and the subsequent release of the first interagency solicitation for multiscale modeling of biomedical, biological, and behavioral systems. That solicitation funded 24 projects, creating the Multiscale Modeling Consortium (MSM) in 2006. The IMAG MSM Consortium, now in its 10th year, has over 100 multiscale modeling related projects. During this time, many other multiscale modeling initiatives have emerged from the 9 government agencies of IMAG, with over 80 program directors managing programs for modeling and analysis of biomedical, biological, and behavioral systems.
One of the main activities of IMAG is to coordinate the MSM Consortium. The MSM Consortium is run by the investigators in the field. Its mission is to grow the field of multiscale modeling in biomedical, biological, and behavioral systems by 1) promoting multidisciplinary scientific collaboration among multiscale modelers; 2) encouraging future generations of multiscale modelers; 3) developing accurate methods and algorithms to cross the interface between multiple spatiotemporal scales; 4) promoting model sharing and the development of reusable multiscale models; and 5) disseminating the models and the insights derived from them to the larger biomedical, biological, and behavioral research community.
The MSM Consortium is actively addressing many pressing issues facing the multiscale modeling community. Of particular focus are the challenges of model sharing and model translation. Some pertinent questions: How do we improve the accessibility of models to the worldwide community? How can we reproduce published simulations? How can we facilitate model reuse? How do we build credible models? How do we integrate models into clinical practice? The presentation will describe some of the MSM activities around these questions and will also present the latest IMAG funding initiative, Predictive Multiscale Models for Biomedical, Biological, Behavioral, Environmental and Clinical Research.
Bio: Grace C.Y. Peng received the B.S. degree in electrical engineering from the University of Illinois at Urbana and the M.S. and Ph.D. degrees in biomedical engineering from Northwestern University. She performed postdoctoral and faculty research in the department of Neurology at the Johns Hopkins University. In 2000 she became the Clare Boothe Luce professor of biomedical engineering at the Catholic University of America. Since 2002, Dr. Peng has been a Program Director in the National Institute of Biomedical Imaging and Bioengineering (NIBIB), at the National Institutes of Health. Her program areas at the NIBIB include mathematical modeling, simulation and analysis methods, and next generation engineering systems for rehabilitation, neuroengineering, and surgical systems. In 2003, she brought together the Neuroprosthesis Group (NPG) of program officers across multiple institutes of the NIH. Also in 2003, Dr. Peng led the creation of the Interagency Modeling and Analysis Group (IMAG), which now consists of program officers from ten federal agencies of the U.S. government and Canada. IMAG has continuously supported funding specifically for multiscale modeling (of biological systems) since 2004. IMAG facilitates the activities of the Multiscale Modeling (MSM) Consortium of investigators (started in 2006). Dr. Peng is interested in promoting the development of intelligent tools and reusable models, and integrating these approaches in engineering systems and multiscale physiological problems.