% A maximum two-page narrative (as a pdf) that describes the applicant's Major Accomplishments
% and Future Research Direction. Specifically, what are the applicant's past major
% accomplishments, breakthroughs, and barriers overcome? What would be the future research
% direction, with the anticipated impact of your proposed project on advancing scientific
% discovery, taking advantage of the longer-term, flexible, and risk-taking investment provided
% by the Foundation that is not readily available from other sources? Identify dramatic
% potential payoffs, either in discoveries in the natural sciences or improvements in data
% science methodologies, or ideally, both. As an example, enhanced methodologies that enable
% discoveries from actual scientific data sets.
% scalable anomaly detection on massive data streams
% wavefront
% Radio interferometers are useful in astronomy because they simultaneously image large areas
% of the sky and have fine spatial resolution. I am developing interferometers like the Very
% Large Array for data-intensive surveys for millisecond radio transients. At millisecond
% timescales, the brief pulses from neutron stars and other exotic objects are detectable and
% can be used to probe the interstellar and intergalactic media. This observing mode produces
% 1 TB/hour of data that we search for rare, weak signals on compute clusters. Our goal is to
% refine our solution to the ``needle in the haystack'' problem such that we can process this
% massive data stream in real time. Real-time processing will allow dynamic, data-driven
% decisions that can extract previously inaccessible science from large data streams. The
% algorithmic, statistical, and computing infrastructure needed to support this effort will
% have application to other efforts to find anomalies in massive, noise-dominated data streams.
%\vspace{-1cm}

I am an astronomer with an interest in applying radio interferometers to the study of fast transients. Fast (i.e., subsecond duration) radio transients are generated by pulsars and stellar/exoplanet magnetospheres. At timescales faster than one second, propagation through plasma induces a measurable dispersion (a frequency-dependent arrival time). Radio interferometers will be transformative because they simultaneously measure dispersion \emph{and} localize sources orders of magnitude better than traditional single-dish radio telescopes. That localization associates the radio emission with counterparts such as host galaxies or stars, helping us understand the transient and use it to probe the interstellar and intergalactic media.

The localization precision of a radio interferometer comes at the cost of managing a torrential data stream. Through my work with the Very Large Array (VLA), I have commissioned an observing mode that produces data at a rate of 1 TB hour$^{-1}$, and I have written an extensive parallelized software system to search VLA data for transients. My collaborators and I have observed for 100 hours, producing 100 TB of data, in the search for fast radio transients of various types. This new observing mode is pushing the VLA beyond its intended use and finding new, compelling science at those limits.

The challenges of this effort are increasingly common in the sciences. My current efforts are focused on improving the parallelization and robustness of my radio transient search. More broadly, I am interested in developing the concept of \emph{real-time anomaly detection} for massive data streams. In the study of radio transients, real-time detection would allow us to throttle the data stream by saving data only for the brief moments of interest. This process of ``data triage'' will be a key strategy for extracting science in data-intensive fields.

\section{Science}

An exciting new class of radio transients is the ``fast radio burst'' (FRB; Thornton et al. 2013, Science, 341, 53). Discovered in all-sky pulsar surveys with single-dish telescopes, FRBs show dispersion an order of magnitude larger than expected from the Galaxy, consistent with propagation through the intergalactic medium. If FRBs lie at cosmological distances, their dispersion can be used to measure properties of the intergalactic medium. Beyond using FRBs as probes, understanding their origin may have relevance to gamma-ray bursts and sources of gravitational waves.

The most distant pulsar known was recently detected in Andromeda (Rubio-Herrera et al. 2013, MNRAS, 428, 2857). The dispersion of a sample of such transients would directly measure the baryons in the outer fringes (the ``halo'') of the Milky Way and M31. Roughly 50\% of the baryons in the local universe have not been directly detected, and fast radio transients may help solve this ``missing baryon problem''.

Within our own Galaxy, pulsar surveys have discovered the ``rotating radio transient'' (RRAT; McLaughlin et al. 2006, Nature, 439, 817), a spinning neutron star that pulses sporadically. While a few dozen RRATs are now known, it is unclear whether they are tied to extreme objects like magnetars or are simply ordinary pulsars that occasionally emit pulses bright enough to be detected individually.
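For context, the frequency-dependent delay underlying these dispersion measurements follows, to leading order, the standard cold-plasma relation, where the dispersion measure (DM) is the electron column density integrated along the line of sight:
\begin{equation}
\Delta t \approx 4.15~\mathrm{ms} \; \left(\frac{\mathrm{DM}}{\mathrm{pc\,cm^{-3}}}\right) \left[ \left(\frac{\nu_{\mathrm{lo}}}{\mathrm{GHz}}\right)^{-2} - \left(\frac{\nu_{\mathrm{hi}}}{\mathrm{GHz}}\right)^{-2} \right],
\end{equation}
where $\nu_{\mathrm{lo}}$ and $\nu_{\mathrm{hi}}$ are the bottom and top of the observing band. For example, a pulse with DM $\sim 1000$~pc\,cm$^{-3}$, comparable to the FRBs above, arrives at 1.4~GHz roughly two seconds later than it would at arbitrarily high frequency; this delay is the observable that makes these sources useful as plasma probes, and correcting for it (dedispersion) is a central step in any fast transient search.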
Much closer to Earth, Jupiter emits intense radio bursts that make it the brightest astronomical object at low radio frequencies. Coronal mass ejections, like those seen on the Sun, also drive fast, coherent radio flares. These processes could be used to measure the magnetism and plasma properties of other stars, and they should profoundly affect the habitability of orbiting exoplanets. Both mechanisms should be detectable as subsecond transients.

\section{Real-Time Detection as a Solution to the Big Data Challenge}

The technical requirements of our radio transient searches are extreme for astronomy, but they are becoming more common (e.g., see plans for the SKA and LSST). Lessons learned from our project will have increasing relevance to scientists working to solve the ``needle in a haystack'' problem. The transient search problem I am describing here is distinguished from that of the optical transient community in a few ways. First, radio interferometers are computationally dominated by the process of generating images and detecting sources, which is inherently more parallelizable than the source classification that is the limiting step in optical transient searches. Second, radio transients are rarer and fainter than typical optical transients, so while optical surveys focus their efforts on classifying transients from an abundance of information, radio transient searches are more concerned with detection.

Currently, we record data to disk at a rate of 1 TB hour$^{-1}$ and process it on compute clusters near the VLA, at Los Alamos National Lab, and at NERSC. The internet is too slow to transport the 1 TB hour$^{-1}$ data stream, so we ship disks to our computing centers. This approach is complex and not sustainable for the large campaigns needed to find many fast radio transients.

I am interested in how real-time detection can help solve the challenges of big data. By bringing computational support closer to the telescope, real-time detection makes it possible to decide whether a given segment of data is worth saving. This ``data triage'' may cheapen data, but it is necessary to access the science in some high-data-rate streams.
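As a minimal sketch of the data triage pattern (written in Python with entirely hypothetical function names, thresholds, and simulated data, not our actual VLA pipeline), the idea is to search each buffered segment as it arrives and keep the raw data only when a candidate exceeds a significance threshold:

\begin{verbatim}
import numpy as np

# Illustrative sketch of real-time "data triage" on a segmented data stream.
# All names and numbers here are hypothetical placeholders.

SEGMENT_COUNT = 100        # number of segments to process in this demo
DETECTION_THRESHOLD = 7.0  # significance (S/N) required to keep a segment

def read_segment(rng):
    """Stand-in for reading one buffered segment from the telescope.
    Simulates (time, frequency) noise with an occasional injected pulse."""
    data = rng.normal(0.0, 1.0, size=(1000, 256))
    if rng.random() < 0.05:          # rare transient event
        data[500, :] += 10.0
    return data

def search_segment(data):
    """Stand-in for the transient search (dedispersion, imaging, source
    finding). Returns the peak significance of the frequency-averaged
    time series as a crude detection statistic."""
    timeseries = data.mean(axis=1)
    snr = (timeseries - timeseries.mean()) / timeseries.std()
    return snr.max()

def triage(rng):
    """Process segments as they arrive, saving only those that contain a
    candidate above threshold; everything else is discarded immediately."""
    saved = []
    for i in range(SEGMENT_COUNT):
        data = read_segment(rng)
        if search_segment(data) > DETECTION_THRESHOLD:
            saved.append(i)          # in practice: write raw data to disk
        # otherwise the raw segment is dropped, throttling the data volume
    return saved

if __name__ == "__main__":
    kept = triage(np.random.default_rng(0))
    print("kept %d of %d segments: %s" % (len(kept), SEGMENT_COUNT, kept))
\end{verbatim}

The essential design property is that raw data persist only for the brief moments of interest, so the volume written to long-term storage scales with the candidate rate rather than with the observing time.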