Keywords    Differential privacy, Histogram, Randomized quicksort, Dynamic programming
1 Introduction
       The primary task of data query and analysis is to obtain published data precisely, efficiently, quickly, and comprehensively in order to acquire specific results. The histogram, a kind of bar graph, is one of the essential tools for approximate estimation in data publishing and was first introduced by Karl Pearson [1]. To construct a histogram is to bin the range of values, i.e., split the entire range into a series of intervals, then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable; the intervals must be adjacent and are often of equal width.
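As a minimal sketch of this binning step (with made-up ages, not the paper's data), equal-width adjacent bins can be built as follows:

```python
def build_histogram(values, lo, hi, width):
    # Split [lo, hi) into adjacent, non-overlapping, equal-width bins
    # and count how many values fall into each one.
    n_bins = (hi - lo) // width
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[(v - lo) // width] += 1
    return counts

ages = [12, 25, 27, 34, 45, 52, 63, 68]
print(build_histogram(ages, 10, 80, 10))  # [1, 2, 1, 1, 1, 2, 0]
```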
       The general purpose of publishing a histogram is to support aggregate queries, range-count queries, data mining, and so on. However, the bin counts may leak an individual's sensitive information if we directly release a histogram built from the raw data. Suppose Fig. 1a shows imaginary sample records of HIV-positive patients with sensitive information, and Fig. 1b illustrates the age distribution of such patients as the corresponding histogram; an adversary can infer that Jacob is infected with HIV if s/he knows the HIV status of all the other patients (except Jacob) in the histogram. To prevent such privacy breaches, we must ensure privacy protection before publishing.
       Differential privacy, as a new paradigm, has been widely applied to data publishing (e.g., to safeguard histograms), as presented in [2]. There have been several recent works on differentially private histogram publication [3, 4, 5, 6, 7, 8, 9, 10, 11] with the underlying goal of improving the accuracy of the released histograms. Among these works, the grouping-based methods [9, 10, 11] achieve higher accuracy than the others. The general idea of the grouping-based approach is to gather bins at adjacent positions and replace their counts with the group's mean, which partly reduces error. This process produces two errors [12]: 1) the Laplace noise error, caused by the injected Laplace noise, and 2) the grouping error, caused by replacing the count of each bin with its group's mean. The trade-off between these two errors is crucial to the accuracy of the final published histogram.
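To make the grouping error concrete, the following sketch (our own illustration with hypothetical counts and an arbitrary grouping, isolating the grouping step from the noise injection) replaces each bin count with its group's mean and measures the resulting squared error:

```python
def group_and_average(counts, groups):
    # Replace each bin's count with its group's mean; groups are
    # given as lists of bin indices and need not be contiguous.
    out = list(counts)
    for g in groups:
        mean = sum(counts[i] for i in g) / len(g)
        for i in g:
            out[i] = mean
    return out

counts = [5, 6, 20, 21, 22, 4]
# Global grouping: bins with similar counts merge even if far apart.
released = group_and_average(counts, [[0, 1, 5], [2, 3, 4]])
grouping_error = sum((r - c) ** 2 for r, c in zip(released, counts))
print(released)        # [5.0, 5.0, 21.0, 21.0, 21.0, 5.0]
print(grouping_error)  # 4.0
```

The tighter the counts within each group, the smaller this error; larger groups, in turn, average away more of the Laplace noise, which is exactly the trade-off described above.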
       Nevertheless, the existing studies [9, 10, 11] do not handle this trade-off well. Xu et al. propose two mechanisms, NoiseFirst and StructureFirst, in [10]. However, both have the following drawbacks. First, they only consider local grouping, i.e., a bin can only merge with its adjacent bins, even though bins with very similar counts may lie far apart. For example, consider Fig. 1b, a histogram with seven bins 10-20, 20-30, 30-40, and so on. Under this method, bin 10-20 cannot be in the same group as 30-40 and 60-70 unless 20-30, 40-50, and 50-60 are in that group as well. Therefore, either a large grouping error (i.e., merging all bins from 10-20 to 60-70) or a large Laplace noise error (i.e., creating several different groups) is unavoidable. Kellaris and Papadopoulos [11] partially address these shortcomings by using private sorting. The advantage of sorting is to allow global grouping. For instance, the bins 10-20, 30-40, and 60-70 can merge into one group without including 20-30, 40-50, and 50-60, as in Fig. 1c. However, their approach does not explicitly consider the grouping error and merely requires each group to be of the same width. Second, NoiseFirst and StructureFirst both adopt a dynamic programming solution [13] based on a two-dimensional error table to solve the optimal grouping problem and reduce the grouping error. However, a flaw exists: the number of groups must be determined in advance, and each optimal group is then found by tracing the selections of bin boundaries backward, whether NoiseFirst groups the noise-injected data or StructureFirst groups the raw data. This can lead to non-optimal grouping, which hurts both the accuracy of the published histogram and the efficiency of the algorithm.
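For concreteness, a simplified sketch (our own illustration, not the authors' code) of a two-dimensional error-table dynamic program in this style: it must fix the number of groups k in advance, considers only contiguous (local) groups, and minimizes the total within-group squared error:

```python
def sse(counts, i, j):
    # Squared error of replacing counts[i:j] with their mean.
    seg = counts[i:j]
    mean = sum(seg) / len(seg)
    return sum((c - mean) ** 2 for c in seg)

def optimal_grouping_cost(counts, k):
    # dp[g][j]: minimal error of splitting the first j bins into g
    # contiguous groups; the number of groups k is fixed in advance.
    n = len(counts)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                dp[g][j] = min(dp[g][j], dp[g - 1][i] + sse(counts, i, j))
    return dp[k][n]

print(optimal_grouping_cost([5, 5, 20, 20], 2))  # 0.0: split into [5, 5] | [20, 20]
```

Note that because the groups must be contiguous and k is predetermined, this formulation cannot merge similar but distant bins, which is precisely the limitation discussed above.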
       Motivated by the above, we propose a novel mechanism that publishes more accurate histograms while satisfying differential privacy. To begin with, we use the randomized quicksort algorithm to order the noise-injected histogram so as to lower the errors. We choose the randomized method because it can efficiently resist a malicious adversary who knows the sorting algorithm and crafts specific inputs against it. Moreover, for the grouping problem, we propose a minimum-error-cost solution based on the dynamic programming idea to further improve the accuracy of the published histogram.
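Randomized quicksort picks its pivot uniformly at random, so no fixed input ordering can force worst-case behavior; a minimal sketch:

```python
import random

def randomized_quicksort(a):
    # Random pivot choice gives expected O(n log n) comparisons even on
    # adversarial inputs crafted against a fixed pivot rule.
    if len(a) <= 1:
        return list(a)
    pivot = random.choice(a)
    return (randomized_quicksort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + randomized_quicksort([x for x in a if x > pivot]))

print(randomized_quicksort([3.2, -1.5, 0.0, 2.7]))  # [-1.5, 0.0, 2.7, 3.2]
```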
      Finally, we provide an exhaustive experimental evaluation using two real datasets, ranging from thousands to millions of records. The datasets come from various domains on the Internet (e.g., the IPUMS census of Brazil, and the Hillary Clinton and Donald Trump tweets). Our results provide a detailed comparison and confirm the superiority of our mechanism over other similar methods.
      The rest of the paper is organized as follows. Section 2 summarizes the related work. Section 3 gives preliminary information. We analyze NoiseFirst and StructureFirst in Section 4. In Section 5, we present our optimal mechanism in detail. Section 6 shows the experimental results and comparison. In the end, Section 7 concludes our work.
2 Related Work
      Given the importance of histograms, there have been several recent works [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] studying histogram release under differential privacy.
      Dwork et al. [2] propose the Laplace mechanism, which establishes the technical foundation for all subsequent works. As a simple solution, it can release a histogram under differential privacy by adding independent Laplace noise to each bin. However, the resulting utility of this mechanism is often poor.
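For a histogram, adding or removing one record changes a single bin count by at most one, so the sensitivity is 1 and each bin receives independent Laplace(1/ε) noise. A sketch of this baseline (our own illustration):

```python
import random

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    # Add independent Laplace(sensitivity / epsilon) noise to each bin.
    scale = sensitivity / epsilon
    def lap():
        # The difference of two i.i.d. exponential draws is Laplace-distributed.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return [c + lap() for c in counts]

noisy = laplace_mechanism([23, 41, 7, 19], epsilon=1.0)
```

Smaller ε means stronger privacy but a larger noise scale, which is why the utility of this plain mechanism is often poor.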
      Barak et al. [3] introduce the idea of post-processing the output of a differentially private mechanism to ensure consistency: they apply the Laplace mechanism to the Fourier coefficients of an input dataset and employ linear programming to create non-negative contingency tables. However, this post-processing is not shown to improve accuracy.
      Xiao et al. [4] present a histogram publication technique called Privelet, which applies the Haar wavelet transform to the input data, adds polylogarithmic noise to the transformed data, and then releases a noisy version of its coefficient tree. They prove that higher nodes in the tree need relatively smaller noise. Meanwhile, by using deeper tree nodes, it can directly answer count queries with longer ranges.
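As a rough illustration of the transform step only (using the plain average/difference convention; Privelet's exact normalization may differ), a one-dimensional Haar decomposition of a length-2^k vector:

```python
def haar_transform(x):
    # Repeatedly replace adjacent pairs with their average and
    # half-difference; input length must be a power of two.
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        tmp = out[:n]
        for i in range(half):
            tmp[i] = (out[2 * i] + out[2 * i + 1]) / 2         # averages
            tmp[half + i] = (out[2 * i] - out[2 * i + 1]) / 2  # detail coefficients
        out[:n] = tmp
        n = half
    return out

print(haar_transform([4, 2, 6, 8]))  # [5.0, -2.0, 1.0, -1.0]
```

The first coefficient is the overall average (the tree's root); later coefficients correspond to deeper tree nodes covering shorter ranges.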
      Rastogi and Nath [5] exploit an adaptation of the Fourier Perturbation Algorithm, which mainly targets time-series data. This algorithm perturbs the discrete Fourier transform of the query results and improves the expected error from Θ(n) to roughly Θ(k), where k is the number of Fourier coefficients that can reconstruct all n query answers.
      Li et al. [6] propose a two-stage matrix mechanism for answering a workload of predicate counting queries. For a given workload, the mechanism requests answers to a different set of queries (i.e., a query strategy), which are answered using the standard Laplace mechanism. The noisy answers to the strategy queries can then be used to derive noisy answers to the workload queries. The entire process yields a more complex correlated noise distribution that preserves differential privacy but increases accuracy. Recently, Yuan et al. [8] argue that this mechanism suffers from considerable computational overhead and sub-optimal strategies, and introduce an improved mechanism based on the theory of low-rank approximation.
      Hay et al. [7] employ publicly known constraints on the query answers to perform constrained inference as a post-processing step, which can improve the accuracy of histogram queries. Their method, like our solutions, is designed only for one-dimensional data.
      Acs et al. [9] propose two sanitization techniques for publishing histograms. The first scheme is an optimization of the Fourier Perturbation Algorithm presented in [5]; it improves the accuracy of the original algorithm by a factor of 10. The other scheme depends on clustering and exploits the redundancy between bins. They design a hierarchical bisection procedure over a histogram to identify suitable groups.
      Xu et al. [10] introduce two iconic mechanisms, NoiseFirst and StructureFirst, for computing DP-compliant histograms. Their primary difference rests with the relative order of the noise injection and the histogram structure computation steps. NoiseFirst has the additional benefit that it can improve the accuracy of a published DP-compliant histogram computed using a simple method. Xu et al. have shown, through comparative experiments on query accuracy, that NoiseFirst outperforms the Privelet [4] and Boost [7] approaches. In Section 4, we cover NoiseFirst and StructureFirst in more detail. However, our proposed optimal mechanism is superior to NoiseFirst in terms of accuracy.
      Kellaris and Papadopoulos [11] present a method (i.e., GS) that pre-processes the counts by elaborately grouping them and smoothing them via averaging to improve low utility. This diminishes the sensitivity and enables GS to satisfy differential privacy with low Laplace noise injection. The grouping strategy of GS is dictated by a sampling mechanism that minimizes the smoothing perturbation.
      Zhang et al. [12] introduce a clustering framework that features a sophisticated evaluation of the balance between the grouping error and the Laplace noise error. They also propose three clustering strategies with different orders of run-time complexity.
3 Background