Anomaly Detection in Massive Radio Interferometry Data Streams


The study of millisecond radio transients bears on a number of fundamental problems in astrophysics, including characterizing the intergalactic medium, discovering exoplanets, and understanding the lifecycle of neutron stars. These transients are rare and unpredictable, requiring extensive blind surveys for a chance to detect a single event. However, even a single detection can have a huge science payoff, helping us understand exotic states of matter or illuminate distant corners of the universe.

Recent technological advances in radio astronomy, particularly large arrays of antennas known as interferometers, enable data collection at time resolutions sufficient to study these phenomena with exquisite sensitivity, resolution, and flexibility. This power comes at the cost of handling data streams of 1 TB hour\(^{-1}\), produced far faster than transportation and archiving infrastructure can support. Next-generation radio telescopes will increase this data flow and the requisite computing by orders of magnitude. Evolutionary changes to data analysis will not save radio astronomers from this data deluge; a revolutionary approach is needed to do science with massive data streams. I am interested in developing the concepts of real-time anomaly detection and data triage as solutions to this big data challenge.

Figure \label{rratimg}: Image of a millisecond radio transient found in a blind survey with the VLA. This observation is described in more detail in Law et al. 2012 (Astrophysical Journal, 760, 6).


My concept for the study of radio transients and high-data-rate interferometry has developed iteratively over the past five years. The science, algorithms, and hardware we use today have evolved through real-world experience.

I began this effort by leading the construction of the first instrument for millisecond imaging with an interferometer (Law et al. 2011, Astrophysical Journal, 742, 12). This instrument was installed at the Allen Telescope Array, a radio interferometer in northern California, where we used it to observe known millisecond radio transients. Standard radio astronomy software packages were not designed for millisecond timescale data, so I built a new data analysis system in Python.

With our first terabyte of data on disk, we began to think seriously about algorithms for efficiently searching for radio transients. Traditional data analysis systems required human interaction at every stage, which was not feasible when analyzing many millions of images. Our solution was a novel statistical test that automatically found transient candidates; it reduced the candidates to a number small enough for a person to inspect manually (Law et al. 2012, Astrophysical Journal, 749, 7). This algorithm is now being tested at new, powerful radio interferometers under construction around the world.
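To illustrate the flavor of such a test (this is a minimal sketch, not the published algorithm; the function name, threshold logic, and noise estimator are illustrative assumptions), one can flag images whose peak pixel is implausible under Gaussian noise once the number of trials is accounted for:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def find_candidates(images, false_alarm_rate=1e-3):
    """Flag images whose peak is implausible under Gaussian noise.

    images: array of shape (n_images, ny, nx) of image pixel values.
    Returns a list of (image index, peak signal-to-noise) pairs.
    """
    n_images, ny, nx = images.shape
    n_trials = n_images * ny * nx  # treat every pixel as a trial

    # Set the threshold so that pure noise yields ~false_alarm_rate
    # false positives across all trials (normal inverse survival function).
    threshold = norm.isf(false_alarm_rate / n_trials)

    candidates = []
    for i, im in enumerate(images):
        # Robust noise estimate: median absolute deviation scaled to sigma.
        sigma = 1.4826 * np.median(np.abs(im - np.median(im)))
        snr = (im.max() - np.median(im)) / sigma
        if snr > threshold:
            candidates.append((i, snr))
    return candidates
\end{verbatim}

The key property is that the threshold scales with the number of trials, so the human inspection load stays roughly constant no matter how many millions of images are searched.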

Based on that success, I began collaborating with the National Radio Astronomy Observatory to develop the world’s most powerful radio interferometer, the Very Large Array (VLA), for millisecond imaging. After a 3-month residency, our team unveiled the project’s first fruits: the first blind detection of a millisecond radio transient (see Figure \ref{rratimg}; Law et al. 2012, Astrophysical Journal, 760, 6). This transient was a rare kind of neutron star that pulses sporadically and has traditionally been studied with large, single-dish radio telescopes. By using an interferometer, we precisely localized the neutron star and could search for counterparts in optical surveys; the lack of an optical counterpart gave us insight into how the neutron star formed.
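The localization advantage follows from diffraction: an interferometer’s angular resolution is set by its longest baseline rather than by a single dish diameter. As a rough illustration (the observing wavelength and baseline below are assumed for the example, not taken from the observation), at \(\lambda \approx 21\) cm with baselines up to 36 km,
\[
\theta \approx \frac{\lambda}{B_{\mathrm{max}}} \approx \frac{0.21\ \mathrm{m}}{36\ \mathrm{km}} \approx 1.2\ \mathrm{arcsec},
\]
compared to \(\lambda/D \sim\) several arcminutes for a 100-m single dish at the same wavelength — orders of magnitude better localization.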

We have continued to develop the VLA for millisecond imaging and now routinely observe at data rates of 300 MB s\(^{-1}\), or roughly 1 TB hour\(^{-1}\). My software now incorporates new algorithms for high-throughput radio transient searches and runs on compute clusters. I am leading a collaboration to search tens of terabytes of data using clusters at the VLA, at Los Alamos National Laboratory, and at the National Energy Research Scientific Computing Center (NERSC).

Research Direction

My collaborators and I are using this observing mode in the first efforts to detect and image a new class of radio transient called “fast radio bursts” (FRBs; Thornton et al. 2013, Science, 341, 53). FRBs are believed to be cataclysmic events originating far outside our Galaxy. If so, FRBs will be exquisite probes of the tenuous gas believed to reside in the fringes of galaxies (McQuinn 2014, ApJ, 780, L33) and may help in the search for gravitational waves (Zhang 2014, ApJ, 780, L21). However, their great distance has only been inferred; no direct distance measurement exists, since FRBs have so far been detected only with telescopes that localize sources poorly. Our work has made the VLA the ideal platform to find and localize FRBs. We are in the midst of a 150 TB survey designed to find a sample of FRBs and determine what causes them and whether they can be used as cosmic probes.
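The distance inference rests on the frequency-dependent dispersion delay imparted by free electrons along the line of sight. For a dispersion measure DM (the integrated electron column density), the arrival-time delay between two observing frequencies takes the standard form
\[
\Delta t \approx 4.15\ \mathrm{ms}\,\left(\frac{\mathrm{DM}}{\mathrm{pc\ cm^{-3}}}\right)\left[\left(\frac{\nu_{1}}{\mathrm{GHz}}\right)^{-2} - \left(\frac{\nu_{2}}{\mathrm{GHz}}\right)^{-2}\right].
\]
FRB dispersion measures far exceed the expected Galactic contribution along their lines of sight, which is why an extragalactic origin is inferred; a direct localization would let the excess DM be tied to a host galaxy at a known distance.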

The challenges posed by this 1 TB hour\(^{-1}\) observing mode and 150 TB project are increasingly common in the sciences. In our case, the internet is too slow to transport our data, so we ship disks to computing centers running our transient detection code. This approach is too cumbersome to be applied more broadly. However, if we could support continuous surveys for millisecond transients, we could find hundreds of transients for novel statistical tests of their origin and of the composition of the interstellar and intergalactic media.
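To make the transport bottleneck concrete (the link speed here is assumed for illustration): the observing mode produces
\[
1\ \mathrm{TB\ hour^{-1}} = \frac{8\times 10^{12}\ \mathrm{bits}}{3600\ \mathrm{s}} \approx 2.2\ \mathrm{Gb\ s^{-1}},
\]
so even a dedicated 1 Gb s\(^{-1}\) link drains data at less than half the rate it is produced, and transferring the full 150 TB survey would take roughly \(150 \times 8 \times 10^{12} / 10^{9} \approx 1.2 \times 10^{6}\) s, or about two weeks.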

Therefore, a long-term interest of mine is to develop the concept of real-time anomaly detection for massive data streams from radio interferometers. In the study of radio transients, real-time detection lets us throttle the data stream down to only the brief moments of interest. This process of data triage is common in the particle physics community, which has long built instruments that produce a deluge of data. In astronomy and other fields, however, scientists still treat every byte of data as precious, driving huge archiving costs and limiting our access to high-data-rate applications. Data triage will be a key strategy for extracting science in data-intensive searches for “a needle in a haystack”.
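As a minimal sketch of the triage idea (the interfaces, threshold, and buffer sizes are illustrative assumptions, not our production pipeline), a streaming detector can retain only a short window of images around each candidate and discard everything else:

\begin{verbatim}
import collections
import numpy as np

def triage_stream(image_stream, snr_threshold=8.0, pad=50):
    """Stream millisecond images, keeping only windows around candidates.

    image_stream: iterable yielding (timestamp, 2-D image) pairs.
    pad: number of images to retain on either side of a detection.
    Yields only the (timestamp, image) pairs worth archiving.
    """
    before = collections.deque(maxlen=pad)  # rolling pre-detection buffer
    keep_after = 0  # images still to keep after the last detection

    for t, im in image_stream:
        # Same robust peak statistic as in the offline search.
        sigma = 1.4826 * np.median(np.abs(im - np.median(im)))
        snr = (im.max() - np.median(im)) / sigma

        if snr > snr_threshold:
            yield from before  # flush the pre-detection context
            before.clear()
            keep_after = pad
            yield t, im
        elif keep_after > 0:
            keep_after -= 1
            yield t, im
        else:
            before.append((t, im))  # discard unless a detection follows
\end{verbatim}

The design choice is that the detector, not the archive, decides what survives: at a millisecond cadence, retaining 50 images on either side of a candidate archives only about a tenth of a second per event, a reduction of many orders of magnitude over continuous recording.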