Modeling Science, Technology & Innovation Conference | Washington D.C. | May 17-18, 2016
Wednesday, May 18th 2016 | 8:00 AM – 9:30 AM
Data, Algorithms, and Infrastructure
High-quality predictions require access to high-quality, high-coverage data. Just as local data is of little value for global weather prediction, data from a single institution or country is of limited value when making STI predictions.
Moderator
Katy Börner
Victor H. Yngve Distinguished Professor of Information Science & Director, Cyberinfrastructure for Network Science Center, School of Informatics and Computing, Indiana University
Katy Börner is the Victor H. Yngve Distinguished Professor of Information Science in the Department of Information and Library Science, School of Informatics and Computing, Adjunct Professor at the Department of Statistics in the College of Arts and Sciences, Core Faculty of Cognitive Science, Research Affiliate of the Center for Complex Networks and Systems Research and Biocomplexity Institute, Member of the Advanced Visualization Laboratory, Leader of the Information Visualization Lab, and Founding Director of the Cyberinfrastructure for Network Science Center at Indiana University in Bloomington, IN and Visiting Professor at the Royal Netherlands Academy of Arts and Sciences (KNAW) in The Netherlands. She is a curator of the international Places & Spaces: Mapping Science exhibit. She became an American Association for the Advancement of Science (AAAS) Fellow in 2012.
Speakers
James Onken
National Institutes of Health
Improving the Research Portfolio Data Infrastructure at the National Institutes of Health
Abstract: Demonstrating the impact of research investments made years, and sometimes decades, earlier and using that information to predict future trends in science has never been easy. Now, the increased availability of relevant databases, new database technologies, and informatics capabilities creates the potential to more readily establish linkages between federal investments in science and long-term outcomes. This talk describes an effort the NIH is making to create a data infrastructure that will facilitate the analysis of NIH research investments and the development of predictive models.
Bio: James Onken is Senior Advisor to the NIH Deputy Director for Extramural Research and Director of the Office of Data Analysis Tools and Systems within the NIH Office of Extramural Research (OER). He has been conducting portfolio analyses and program evaluations at the NIH for over 27 years, holding positions at the National Institute of Mental Health and National Institute of General Medical Sciences before moving to OER. He previously held positions at AT&T Bell Laboratories, Decisions and Designs, Inc., and the U.S. Government Accountability Office. He holds M.S. and Ph.D. degrees in psychology from Northwestern University, and an MPH with a concentration in biostatistics from George Washington University.
Ian Hutchins
National Institutes of Health
Utilizing citation networks to explore and measure scientific influence
Abstract: The 2013 San Francisco Declaration on Research Assessment decried the widespread and invalid use of Journal Impact Factors for comparing the scientific output of scientists or institutions. The NIH Office of Portfolio Analysis has developed an improved method to quantify the influence of a research article by making novel use of its co-citation network to field-normalize the number of citations it has received. Article citation rates are divided by an expected citation rate that is derived from performance of articles in the same field and benchmarked to a peer comparison group. The resulting Relative Citation Ratio (RCR) is article-level and field independent, identifies influential papers independently of their publication venue, and thus provides an alternative to the invalid practice of using Journal Impact Factors. We demonstrate that the values generated by this method strongly correlate with the opinions of subject matter experts in biomedical research, and suggest that the same approach should be generally applicable to articles published in all areas of science. A beta version of iCite, our web tool for calculating the RCRs of articles listed in PubMed, is available at https://icite.od.nih.gov.
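Schematically (using our own shorthand rather than official iCite notation), the ratio described above is

    RCR(a) = ACR(a) / ECR(a),

where ACR(a) is the citation rate of article a and ECR(a) is the expected citation rate derived from the performance of articles in a's field, as delineated by its co-citation network and benchmarked to the peer comparison group; an RCR of 1.0 therefore indicates citation performance at the benchmark level.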
Bio: Ian Hutchins is a Data Scientist in the Office of Portfolio Analysis within the Division of Program Coordination, Planning, and Strategic Initiatives, Office of the Director, National Institutes of Health. He leads development teams to make scientific portfolio analysis tools; he conducts trans-NIH portfolio analysis using bibliometrics, statistical programming, and text mining; and he teaches scientific portfolio analysis courses for agency staff. Prior to this, he investigated the self-assembly of neural circuits in models of neurological disorders and of healthy brain development. He holds a Ph.D. in Neuroscience and a B.S. in Genetics from the University of Wisconsin-Madison.
Richard Freeman
Harvard University
The Missing Link in How Science and Engineering Affect the Economy
Abstract: Studies of the impact of R&D and the work of scientists and engineers on the economy typically relate some measure of R&D (usually a stock created from flows) or of patents to levels or growth of productivity, sales, or profits. But neither R&D nor a patent produces a final product for sale. They produce knowledge that might contain an idea for a new product or process. They are inputs or intermediate outputs that enter a production or profits equation, and thus valuable indicators of innovative activity, but they are not measures of actual innovation as defined by Schumpeter or the Oslo Manual, which require implementation or commercialization in the market. Measures of actual innovations are the missing link in our understanding of how science and engineering shape the economy.

I examine three different ways to gain insight into actual innovations: through questions about the introduction of new products and processes in the NSF's Business Research and Development and Innovation Survey (BRDIS); through web-scraping the attributes, prices, and quantities sold of goods and services on websites; and through an Innovation Hunter crowd-sourcing activity in which volunteers search announcements and reports of new products and processes, along the lines of the 1982 Small Business Administration study of innovations that provided the data for Audretsch and Feldman's (1996) analysis of the geography of innovation and production.[1]

Measures of actual innovations in the form of new products and processes can be produced regularly on a worldwide basis. They have the potential to increase our understanding of company reports of innovation in BRDIS, the EU Community Innovation Survey, and comparable survey data for China, and to transform discussions of innovation that rely on aggregate indicators, on the one hand, or on business school case studies of new goods and services introduced in markets, on the other. As Lord Kelvin famously said, "when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind." There is much that is new to be discovered in modeling science, technology, and innovation, and it requires new micro-data. I will give explicit examples of these modes of measuring innovation and of the models and research and policy questions they can illuminate.
[1] Audretsch, David B., and Maryann P. Feldman. "R&D Spillovers and the Geography of Innovation and Production." The American Economic Review, June 1996.
Bio: Richard B. Freeman is Ascherman Professor of Economics at Harvard University. He directs the Science and Engineering Workforce Project at the National Bureau of Economic Research, is Faculty co-Director of the Labor and Worklife Program at the Harvard Law School, and Co-Director of the Harvard Center for Green Buildings and Cities. His research interests include the job market for scientists and engineers; the transformation of scientific ideas into innovations; Chinese labor markets; income distribution and equity in the marketplace; forms of labor market representation, and shared capitalism.
Nachum Shacham
PayPal
Data, platforms and predictive models to enable data-driven actions
Abstract: Organizations strive to make data-driven decisions and tailor their actions to the individual needs of millions of customers, suppliers, and partners worldwide. Big data, storage and computation infrastructure, and predictive modeling algorithms enable this process. Making them work in harmony, so that the planned actions are successful, takes engineering, data science, and business skills. The ingredients are in place: an abundance of data, cost-effective infrastructure, and a variety of algorithms are available and improving at a rapid pace. Computers have always recorded every event, activity, and signal they measure, making data "the exhaust pipe of computing". These data can now be made available for analysis by a new generation of technologies for collecting, storing, and processing. Streaming technologies shorten the time from data creation to storage; massively parallel processing systems, like Hadoop and enterprise data warehouses, store petabytes of data; and in-memory processing engines, like Spark, support the interactive, massive computations needed for predictive model training.
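To make this pattern concrete, here is a minimal PySpark sketch of interactive, in-memory computation over event data already landed in a parallel store; the paths and column names are hypothetical and not taken from any particular deployment.

    # Minimal sketch: in-memory aggregation over data landed in a parallel store.
    # Paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("feature-prep").getOrCreate()

    # Events previously landed (e.g., by a streaming pipeline) in distributed storage.
    events = spark.read.parquet("hdfs:///data/events/")

    # Interactive aggregation into per-customer features for model training.
    features = (events
                .groupBy("customer_id")
                .agg(F.count("*").alias("event_count"),
                     F.sum("amount").alias("total_amount")))
    features.write.mode("overwrite").parquet("hdfs:///data/features/")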
Predictive modeling algorithms are considered the key to extracting measurable value from big data. Predicting the future activity of each customer based on a wide variety of features is a promising approach that is increasingly used in business, government, and education. Well-known cases include credit scoring and customer churn, which predict a customer's likelihood of repaying a loan or terminating a service, respectively. Other models predict a customer's future events such as growth, success rate, return value, and engagement. The predictions are often made by supervised learning algorithms that are trained on data comprising multiple attributes of each object, tagged with known outcomes for the metrics of interest. A typical model training run fits a function of unknown structure to the available data and then scores each new case based on the learned model. The scores drive actions such as admit/reject decisions, fee levels, and incentive offers.
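As a minimal illustration of the supervised scoring workflow just described (a generic scikit-learn sketch, not PayPal's actual pipeline; the file names, features, and churn label are hypothetical):

    # Sketch of training on tagged history and scoring new cases.
    # Files, feature names, and the "churned" label are hypothetical.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Tagged historical data: customer attributes plus a known outcome.
    history = pd.read_csv("customer_history.csv")
    features = ["tenure_months", "monthly_volume", "disputes", "logins_30d"]
    X, y = history[features], history["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Fit a function of unknown structure to the tagged data.
    model = GradientBoostingClassifier().fit(X_train, y_train)
    print("holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    # Score each new case; downstream actions (admit/reject, fees, incentives)
    # are driven by these scores.
    new_cases = pd.read_csv("new_customers.csv")
    scores = model.predict_proba(new_cases[features])[:, 1]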
Although technologies for data movement, storage, and the parallel computation needed to train large-scale models are readily available, making the process successful often takes R&D, e.g., in the design of new models. Big data is often observational, distributed across multiple sources, sparse, redundant, partially unreliable, or skewed; in short, messy. This represents a mismatch between big data and what the predictive model expects as input, and it requires data exploration, data munging, and feature engineering to transform the data into the right format.
Data and algorithms must be carefully selected to support the score-based actions. Algorithm selection is made under several tradeoffs, such as score interpretability vs. accuracy (e.g., a single decision tree vs. an ensemble). Other tradeoffs include precision vs. recall, to match the costs of different error types, and the exclusion of data features to eliminate bias in the actions based on the scores. Technologies supporting large-scale predictive models will be reviewed, and case studies highlighting tradeoffs and approaches for constructing end-to-end models will be described.
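The two tradeoffs mentioned above can be illustrated with a small sketch on synthetic data (illustrative only): a shallow single tree is easy to inspect but usually less accurate than an ensemble, and moving the score threshold trades precision against recall.

    # Illustrative sketch of two model-selection tradeoffs, on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, recall_score

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Interpretability vs. accuracy: a shallow tree can be read and audited,
    # while an ensemble is usually more accurate but harder to explain.
    tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
    print("tree accuracy:  ", tree.score(X_te, y_te))
    print("forest accuracy:", forest.score(X_te, y_te))

    # Precision vs. recall: moving the score threshold trades one for the other,
    # so it should be set from the relative cost of the two error types.
    probs = forest.predict_proba(X_te)[:, 1]
    for threshold in (0.3, 0.5, 0.7):
        pred = (probs >= threshold).astype(int)
        print(threshold, precision_score(y_te, pred), recall_score(y_te, pred))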
Bio: Nachum Shacham is a Director of Data Science at PayPal, where he constructs models and leads a team of data scientists in identifying actionable patterns in large transactional, behavioral, and system-performance datasets. Before that, he was with eBay, analyzing the performance of large data platforms. Earlier, he was with SRI, leading research in internet technologies, next-generation wireless internet, and real-time voice and video communications over mobile networks. As co-founder and CTO of Metreo, he developed models for B2B pricing and subsequently created revenue models for online display and search advertising. Nachum holds BScEE and MScEE degrees from the Technion and a Ph.D. in EECS from UC Berkeley. Dr. Shacham is a Fellow of the IEEE.
Grace Peng
National Institutes of Health - NIBIB
Challenges with Model and Data Sharing in Biomedical, Biological and Behavioral Systems
Abstract: Over the last decade, the number and types of computational models being developed for biomedical research have experienced a healthy increase. The biomedical community is beginning to recognize not only the usefulness of models, but also the essential role models play in integrating disparate fields of knowledge, identifying gaps, and presenting testable hypotheses to drive experiments. Multiscale modeling, in particular, is at the forefront of making a significant impact on biomedical discoveries, applied science, and medicine.
Over the last decade, the U.S., Europe, and Japan have promoted several government funding initiatives for modeling the physiome. In the U.S., a confluence of events resulted in the 2003 formation of the Interagency Modeling and Analysis Group (IMAG) and the subsequent release of the first interagency solicitation for multiscale modeling of biomedical, biological, and behavioral systems. That solicitation funded 24 projects, creating the Multiscale Modeling Consortium (MSM) in 2006. The IMAG MSM Consortium, now in its 10th year, has over 100 multiscale modeling related projects. During this time, many other multiscale modeling initiatives have emerged from the 9 government agencies of IMAG, with over 80 program directors managing programs for modeling and analysis of biomedical, biological, and behavioral systems.
One of the main activities of IMAG is to coordinate the MSM Consortium. The MSM Consortium is run by the investigators in the field. Its mission is to grow the field of multiscale modeling in biomedical, biological, and behavioral systems by 1) promoting multidisciplinary scientific collaboration among multiscale modelers; 2) encouraging future generations of multiscale modelers; 3) developing accurate methods and algorithms to cross the interfaces between multiple spatiotemporal scales; 4) promoting model sharing and the development of reusable multiscale models; and 5) disseminating the models, and the insights derived from them, to the larger biomedical, biological, and behavioral research community.
The MSM Consortium is actively addressing many pressing issues facing the multiscale modeling community, with a particular focus on the challenges of model sharing and model translation. Some pertinent questions: How do we improve the accessibility of models to the worldwide community? How can we reproduce published simulations? How can we facilitate model reuse? How do we build credible models? How do we integrate models into clinical practice? The presentation will describe some of the MSM activities around these questions, and the latest IMAG funding initiative, Predictive Multiscale Models for Biomedical, Biological, Behavioral, Environmental and Clinical Research, will also be presented.
Bio: Grace C.Y. Peng received the B.S. degree in electrical engineering from the University of Illinois at Urbana and the M.S. and Ph.D. degrees in biomedical engineering from Northwestern University. She performed postdoctoral and faculty research in the Department of Neurology at the Johns Hopkins University. In 2000, she became the Clare Boothe Luce Professor of Biomedical Engineering at the Catholic University of America. Since 2002, Dr. Peng has been a Program Director in the National Institute of Biomedical Imaging and Bioengineering (NIBIB) at the National Institutes of Health. Her program areas at the NIBIB include mathematical modeling, simulation and analysis methods, and next-generation engineering systems for rehabilitation, neuroengineering, and surgical systems. In 2003, she brought together the Neuroprosthesis Group (NPG) of program officers across multiple institutes of the NIH. Also in 2003, Dr. Peng led the creation of the Interagency Modeling and Analysis Group (IMAG), which now consists of program officers from ten federal agencies of the U.S. government and Canada (www.imagwiki.org). IMAG has continuously supported funding specifically for multiscale modeling (of biological systems) since 2004. IMAG facilitates the activities of the Multiscale Modeling (MSM) Consortium of investigators (started in 2006). Dr. Peng is interested in promoting the development of intelligent tools and reusable models, and integrating these approaches in engineering systems and multiscale physiological problems.