casacore
StatisticsAlgorithm.h
1 //# Copyright (C) 2000,2001
2 //# Associated Universities, Inc. Washington DC, USA.
3 //#
4 //# This library is free software; you can redistribute it and/or modify it
5 //# under the terms of the GNU Library General Public License as published by
6 //# the Free Software Foundation; either version 2 of the License, or (at your
7 //# option) any later version.
8 //#
9 //# This library is distributed in the hope that it will be useful, but WITHOUT
10 //# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11 //# FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
12 //# License for more details.
13 //#
14 //# You should have received a copy of the GNU Library General Public License
15 //# along with this library; if not, write to the Free Software Foundation,
16 //# Inc., 675 Massachusetts Ave, Cambridge, MA 02139, USA.
17 //#
18 //# Correspondence concerning AIPS++ should be addressed as follows:
19 //# Internet email: aips2-request@nrao.edu.
20 //# Postal address: AIPS++ Project Office
21 //# National Radio Astronomy Observatory
22 //# 520 Edgemont Road
23 //# Charlottesville, VA 22903-2475 USA
24 //#
25 
26 #ifndef SCIMATH_STATISTICSALGORITHM_H
27 #define SCIMATH_STATISTICSALGORITHM_H
28 
29 #include <casacore/casa/aips.h>
36 
37 #include <map>
38 #include <set>
39 #include <vector>
40 
41 namespace casacore {
42 
43 // Base class of statistics algorithm class hierarchy.
44 
45 // The default implementation is such that statistics are only calculated when
46 // methods that actually compute statistics are called. Until then, the
47 // iterators which point to the beginning of data sets, masks, etc. are held in
48 // memory. Thus, the caller must keep all data sets available for the statistics
49 // object until these methods are called, and of course, if the actual data
50 // values are changed between adding data and calculating statistics, the
51 // updated values are used when calculating statistics. Derived classes may
52 // override this behavior.
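The deferred-evaluation behavior described above can be illustrated with a minimal sketch. `LazyMean` and `meanAfterLateUpdate` are hypothetical names invented for this illustration and are not part of casacore; the sketch only assumes that the statistics object stores where the data begin and reads the values when a statistic is requested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a statistics object; not part of casacore.
// It stores only where the data begin and how many points there are.
class LazyMean {
public:
    void setData(const double* first, std::size_t nr) {
        _first = first;
        _nr = nr;
    }

    // The data are read here, not in setData().
    double getMean() const {
        assert(_first != nullptr && _nr > 0);
        double sum = 0;
        for (std::size_t i = 0; i < _nr; ++i) {
            sum += _first[i];
        }
        return sum / static_cast<double>(_nr);
    }

private:
    const double* _first = nullptr;
    std::size_t _nr = 0;
};

// Returns the mean seen when the caller modifies the data after setData().
double meanAfterLateUpdate() {
    std::vector<double> data{1.0, 2.0, 3.0};
    LazyMean stats;
    stats.setData(data.data(), data.size());
    data[2] = 9.0;           // changed after setData(), before computing
    return stats.getMean();  // computed from the updated values 1, 2, 9
}
```

In this sketch the vector must outlive the call to `getMean()`, which mirrors the requirement that the caller keep all data sets available until statistics are computed.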
53 //
54 // PRECISION CONSIDERATIONS
55 // Many statistics are computed via accumulators. This can lead to precision
56 // issues, especially for large datasets. For this reason, it is highly
57 // recommended that the data type one uses as the AccumType be of higher
58 // precision, if possible, than the data type pointed to by the data iterator.
59 // For example, if one has a data set of Float values (to which the
60 // DataIterator type points), then one should use type Double for the
61 // AccumType. In this case, the Float data values will be converted to Doubles
62 // before they are accumulated.
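The effect of the accumulator type can be sketched without casacore. The helper names `accumulate` and `makeSamples` below are assumptions made for illustration only; the point is that each single-precision datum is converted to `AccumType` before it is added, so a double-precision accumulator loses far less to rounding than a single-precision one over many additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums single-precision data using an accumulator of type AccumType.
// Each datum is converted to AccumType before it is added.
template <class AccumType>
AccumType accumulate(const std::vector<float>& data) {
    AccumType sum = 0;
    for (float v : data) {
        sum += static_cast<AccumType>(v);
    }
    return sum;
}

// Illustrative data set: n copies of the same single-precision value.
std::vector<float> makeSamples(std::size_t n, float value) {
    assert(n > 0);
    return std::vector<float>(n, value);
}
```

Calling `accumulate<float>` and `accumulate<double>` on one million copies of `0.1f` shows the difference: the single-precision sum drifts visibly away from 100000, while the double-precision sum stays close to it.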
63 //
64 // METHODS OF PROVIDING DATA
65 // Data may be provided in one of two mutually exclusive ways. The first way is
66 // simpler, and that is to use the setData()/addData() methods. Calling
67 // setData() will clear any previous data that was added via these methods or
68 // via a data provider (see below). Calling addData() after having called
69 // setData() will add a data set to the set of data sets on which statistics
70 // will be calculated. In order for this to work correctly, the iterators which
71 // are passed into these methods must still be valid when statistics are
72 // calculated (although note that some derived classes allow certain statistics
73 // to be updated as data sets are added via these methods. See specific classes
74 // for details).
75 //
76 // The second way to provide data is via an object derived from class
77 // StatsDataProvider, in which methods are implemented for retrieving various
78 // information about the data sets to be included. Such an interface is
79 // necessary for data structures which do not easily lend themselves to being
80 // provided via the setData()/addData() methods. For example, in the case of
81 // iterating through a Lattice, a lattice iterator will overwrite the memory
82 // location of the previous chunk of data with the current chunk of data.
83 // Therefore, if one does not wish to load data from the entire lattice into
84 // memory (which is why LatticeIterator was designed to have the behavior it
85 // does), one must use the LatticeStatsDataProvider class, which the statistics
86 // framework will use to iterate through the lattice, only keeping one chunk of
87 // the data of the lattice in memory at any given moment.
88 //
89 // STORAGE OF DATA
90 // In order to reduce maintenance costs, the accounting details of the data sets
91 // are maintained in a StatisticsDataset object. This object is held in memory
92 // at the StatisticsAlgorithm level in the _dataset private field of this class
93 // when a derived class is instantiated. A StatisticsDataset object should never
94 // need to be explicitly instantiated by an API developer.
95 //
96 // QUANTILES
97 // A quantile is a value contained in a data set such that it has a zero-based
98 // index of ceil(q*n)-1 in the equivalent ordered dataset, where 0 < q < 1
99 // specifies the fractional location within the ordered dataset and n is the
100 // total number of valid elements. Note that, for a dataset with an odd number
101 // of elements, the median is the same as the quantile value when q = 0.5.
102 // However, there is no such correspondence for a dataset with an even number
103 // of elements, since the median in that case is given by the mean of the
104 // elements of zero-based indices n/2-1 and n/2 in the equivalent ordered
105 // dataset. Thus, in the case of a dataset with an even number of values, the
106 // median may not even exist in the dataset, while a generic quantile value
107 // must exist in the dataset by definition. Note that, when calculating
108 // quantile values, a datum that does not fall in the specified data ranges,
109 // is not included via a stride specification, is masked, or has a weight of
110 // zero is not considered a member of the dataset for the purposes of quantile
111 // calculations.
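The quantile and median definitions above can be written out as a small sketch. The free functions `quantile` and `median` are hypothetical helpers, not casacore API, and they assume plain unmasked, unweighted data held in a `std::vector<double>`; the real framework also honors ranges, strides, masks, and weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile per the definition above: the element at zero-based index
// ceil(q*n) - 1 of the sorted data, for 0 < q < 1.
double quantile(std::vector<double> data, double q) {
    assert(!data.empty() && q > 0.0 && q < 1.0);
    std::sort(data.begin(), data.end());
    const double n = static_cast<double>(data.size());
    const std::size_t idx = static_cast<std::size_t>(std::ceil(q * n)) - 1;
    return data[idx];
}

// Median per the definition above: the middle element for odd n, or the
// mean of the elements at indices n/2 - 1 and n/2 for even n.
double median(std::vector<double> data) {
    assert(!data.empty());
    std::sort(data.begin(), data.end());
    const std::size_t n = data.size();
    if (n % 2 == 1) {
        return data[n / 2];
    }
    return (data[n / 2 - 1] + data[n / 2]) / 2.0;
}
```

For the four values 1, 2, 3, 4, the q = 0.5 quantile is the element 2, whereas the median is 2.5, a value that does not occur in the dataset.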
112 //
113 // CLASS ORGANIZATION
114 // In general, in the StatsFramework class hierarchy, classes derived from
115 // StatisticsAlgorithm and its descendants contain methods which calculate the
116 // relevant statistics which are computed via accumulation. These classes also
117 // contain the top level methods for computing the quantile-like statistics, for
118 // the convenience of the API developer. Derived classes of StatisticsAlgorithm
119 // normally will have a private field which is an object that contains methods
120 // which compute the various quantile-like statistics. These so-called
121 // QuantileComputer classes have been created to reduce maintenance costs,
122 // because putting all the code into single class files was becoming unwieldy.
123 // The concrete QuantileComputer classes are ultimately derived from
124 // StatisticsAlgorithmQuantileComputer, which is the abstract base class of
125 // this hierarchy. StatisticsAlgorithm objects do not contain a
126 // StatisticsAlgorithmQuantileComputer private field, since StatisticsAlgorithm
127 // is also an abstract base class and hence no actual statistics are computed
128 // within it. The design is such that the only classes an API developer should
129 // ever instantiate are the derived classes of StatisticsAlgorithm; the
130 // QuantileComputer classes should never be explicitly instantiated in code
131 // which uses the StatsFramework API.
132 
133 template <
134  class AccumType, class DataIterator, class MaskIterator=const Bool *,
135  class WeightsIterator=DataIterator
136 >
137 class StatisticsAlgorithm {
138 
139 public:
140 
141  virtual ~StatisticsAlgorithm();
142 
143  // Clone this instance
144  virtual StatisticsAlgorithm<CASA_STATP>* clone() const = 0;
145 
146  // <group>
147  // Add a dataset to an existing set of datasets on which statistics are to
148  // be calculated. <src>nr</src> is the number of points to be considered. If
149  // <src>dataStride</src> is greater than 1,
150  // <src>nrAccountsForStride</src>=True indicates that the stride has been
151  // taken into account in the value of <src>nr</src>. Otherwise, it has not,
152  // so that the actual number of points to include is nr/dataStride if
153  // nr % dataStride == 0 or (int)(nr/dataStride) + 1 otherwise. If one calls
154  // this method after a data provider has been set, an exception will be
155  // thrown. In this case, one should call setData(), rather than addData(),
156  // to indicate that the underlying data provider should be removed.
157  // <src>dataRanges</src> provides the ranges of data to include if
158  // <src>isInclude</src> is True, or ranges of data to exclude if
159  // <src>isInclude</src> is False. If a datum equals the end point of a data
160  // range, it is considered good (included) if <src>isInclude</src> is True,
161  // and it is considered bad (excluded) if <src>isInclude</src> is False.
162 
163  void addData(
164  const DataIterator& first, uInt nr, uInt dataStride=1,
165  Bool nrAccountsForStride=False
166  );
167 
168  void addData(
169  const DataIterator& first, uInt nr,
170  const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
171  Bool nrAccountsForStride=False
172  );
173 
174  void addData(
175  const DataIterator& first, const MaskIterator& maskFirst,
176  uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False,
177  uInt maskStride=1
178  );
179 
180  void addData(
181  const DataIterator& first, const MaskIterator& maskFirst,
182  uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
183  uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
184  );
185 
186  void addData(
187  const DataIterator& first, const WeightsIterator& weightFirst,
188  uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False
189  );
190 
191  void addData(
192  const DataIterator& first, const WeightsIterator& weightFirst,
193  uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
194  uInt dataStride=1, Bool nrAccountsForStride=False
195  );
196 
197  void addData(
198  const DataIterator& first, const WeightsIterator& weightFirst,
199  const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
200  Bool nrAccountsForStride=False, uInt maskStride=1
201  );
202 
203  void addData(
204  const DataIterator& first, const WeightsIterator& weightFirst,
205  const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
206  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
207  uInt maskStride=1
208  );
209  // </group>
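The stride and range rules documented for addData() can be sketched as two small helpers. `pointsToInclude`, `isGood`, and the `Ranges` alias are hypothetical names introduced for illustration; they are not casacore functions, and the alias only assumes that a set of data ranges can be modeled as a list of (low, high) pairs.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Number of points actually used, following the rule in the comment above.
unsigned int pointsToInclude(
    unsigned int nr, unsigned int dataStride, bool nrAccountsForStride
) {
    assert(dataStride > 0);
    if (nrAccountsForStride) {
        return nr;  // nr already reflects the stride
    }
    return nr % dataStride == 0 ? nr / dataStride : nr / dataStride + 1;
}

// Assumed model of a set of data ranges: (low, high) pairs.
using Ranges = std::vector<std::pair<double, double>>;

// A datum on a range end point counts as inside the range, so it is good
// when isInclude is true and bad when isInclude is false.
bool isGood(double datum, const Ranges& ranges, bool isInclude) {
    for (const auto& r : ranges) {
        if (datum >= r.first && datum <= r.second) {
            return isInclude;
        }
    }
    return !isInclude;
}
```

With `nr = 10` and `dataStride = 3`, for example, the points at offsets 0, 3, 6, and 9 are used, which gives four points when `nrAccountsForStride` is false.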
210 
211  // get the algorithm that this object uses for computing stats
212  virtual StatisticsData::ALGORITHM algorithm() const = 0;
213 
214  virtual AccumType getMedian(
215  CountedPtr<uInt64> knownNpts=nullptr,
216  CountedPtr<AccumType> knownMin=nullptr,
217  CountedPtr<AccumType> knownMax=nullptr,
218  uInt binningThreshholdSizeBytes=4096*4096,
219  Bool persistSortedArray=False, uInt nBins=10000
220  ) = 0;
221 
222  // The return value is the median; the quantiles are returned in the
223  // <src>quantileToValue</src> map.
224  virtual AccumType getMedianAndQuantiles(
225  std::map<Double, AccumType>& quantileToValue,
226  const std::set<Double>& quantiles,
227  CountedPtr<uInt64> knownNpts=nullptr,
228  CountedPtr<AccumType> knownMin=nullptr,
229  CountedPtr<AccumType> knownMax=nullptr,
230  uInt binningThreshholdSizeBytes=4096*4096,
231  Bool persistSortedArray=False, uInt nBins=10000
232  ) = 0;
233 
234  // get the median of the absolute deviation about the median of the data.
235  virtual AccumType getMedianAbsDevMed(
236  CountedPtr<uInt64> knownNpts=nullptr,
237  CountedPtr<AccumType> knownMin=nullptr,
238  CountedPtr<AccumType> knownMax=nullptr,
239  uInt binningThreshholdSizeBytes=4096*4096,
240  Bool persistSortedArray=False, uInt nBins=10000
241  ) = 0;
242 
243  // Purposefully not virtual. Derived classes should not implement.
244  AccumType getQuantile(
245  Double quantile, CountedPtr<uInt64> knownNpts=nullptr,
246  CountedPtr<AccumType> knownMin=nullptr,
247  CountedPtr<AccumType> knownMax=nullptr,
248  uInt binningThreshholdSizeBytes=4096*4096,
249  Bool persistSortedArray=False, uInt nBins=10000
250  );
251 
252  // get a map of quantiles to values.
253  virtual std::map<Double, AccumType> getQuantiles(
254  const std::set<Double>& quantiles, CountedPtr<uInt64> npts=nullptr,
255  CountedPtr<AccumType> min=nullptr, CountedPtr<AccumType> max=nullptr,
256  uInt binningThreshholdSizeBytes=4096*4096,
257  Bool persistSortedArray=False, uInt nBins=10000
258  ) = 0;
259 
260  // get the value of the specified statistic. Purposefully not virtual.
261  // Derived classes should not implement.
262  AccumType getStatistic(StatisticsData::STATS stat);
263 
264  // certain statistics such as max and min have locations in the dataset
265  // associated with them. This method gets those locations. The first value
266  // in the returned pair is the zero-based dataset number that was set or
267  // added. The second value is the zero-based index in that dataset. A data
268  // stride of greater than one is not accounted for, so the index represents
269  // the actual location in the data set, independent of the dataStride value.
270  virtual LocationType getStatisticIndex(StatisticsData::STATS stat) = 0;
271 
272  // Return statistics. Purposefully not virtual. Derived classes should not
273  // implement.
274  StatsData<AccumType> getStatistics();
275 
276  // reset this object by clearing data.
277  virtual void reset();
278 
279  // <group>
280  // setData() clears any current datasets or data provider and then adds the
281  // specified data set as the first dataset in the (possibly new) set of data
282  // sets for which statistics are to be calculated. See addData() for
283  // parameter meanings. These methods are purposefully not virtual. Derived
284  // classes should not implement.
285  void setData(
286  const DataIterator& first, uInt nr, uInt dataStride=1,
287  Bool nrAccountsForStride=False
288  );
289 
290  void setData(
291  const DataIterator& first, uInt nr, const DataRanges& dataRanges,
292  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False
293  );
294 
295  void setData(
296  const DataIterator& first, const MaskIterator& maskFirst, uInt nr,
297  uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
298  );
299 
300  void setData(
301  const DataIterator& first, const MaskIterator& maskFirst,
302  uInt nr, const DataRanges& dataRanges, Bool isInclude=True,
303  uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
304  );
305 
306  void setData(
307  const DataIterator& first, const WeightsIterator& weightFirst, uInt nr,
308  uInt dataStride=1, Bool nrAccountsForStride=False
309  );
310 
311  void setData(
312  const DataIterator& first, const WeightsIterator& weightFirst, uInt nr,
313  const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
314  Bool nrAccountsForStride=False
315  );
316 
317  void setData(
318  const DataIterator& first, const WeightsIterator& weightFirst,
319  const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
320  Bool nrAccountsForStride=False, uInt maskStride=1
321  );
322 
323  void setData(
324  const DataIterator& first, const WeightsIterator& weightFirst,
325  const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
326  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
327  uInt maskStride=1
328  );
329  // </group>
330 
331  // instead of setting and adding data "by hand", set the data provider
332  // that will provide all the data sets. Calling this method will clear
333  // any other data sets that have previously been set or added. Method
334  // is virtual to allow derived classes to carry out any necessary
335  // specialized accounting when resetting the data provider.
336  virtual void setDataProvider(StatsDataProvider<CASA_STATP> *dataProvider);
337 
338  // Provide guidance to algorithms by specifying a priori which statistics
339  // the caller would like calculated.
340  virtual void setStatsToCalculate(std::set<StatisticsData::STATS>& stats);
341 
342 protected:
344 
345  // use copy semantics, except for the data provider which uses reference
346  // semantics
347  StatisticsAlgorithm(const StatisticsAlgorithm& other);
348 
349  // use copy semantics, except for the data provider which uses reference
350  // semantics
351  StatisticsAlgorithm& operator=(const StatisticsAlgorithm& other);
352 
353  // Allows derived classes to do things after data is set or added.
354  // Default implementation does nothing.
355  virtual void _addData() {}
356 
357  // <group>
358  // These methods are purposefully not virtual. Derived classes should
359  // not implement.
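360  const StatisticsDataset<CASA_STATP>& _getDataset() const {
361  return _dataset;
362  }
363 
364  StatisticsDataset<CASA_STATP>& _getDataset() { return _dataset; }
365  // </group>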
366 
367  virtual AccumType _getStatistic(StatisticsData::STATS stat) = 0;
368 
369  virtual StatsData<AccumType> _getStatistics() = 0;
370 
371  const std::set<StatisticsData::STATS> _getStatsToCalculate() const {
372  return _statsToCalculate;
373  }
374 
375  virtual const std::set<StatisticsData::STATS>&
376  _getUnsupportedStatistics() const {
377  return _unsupportedStats;
378  }
379 
380  // Derived classes should normally call this in their constructors, if
381  // applicable.
382  void _setUnsupportedStatistics(
383  const std::set<StatisticsData::STATS>& stats
384  ) {
385  _unsupportedStats = stats;
386  }
387 
388 private:
389  std::set<StatisticsData::STATS> _statsToCalculate{}, _unsupportedStats{};
390  StatisticsDataset<CASA_STATP> _dataset{};
392 
393  void _resetExceptDataset();
394 
395 };
396 
397 }
398 
399 #ifndef CASACORE_NO_AUTO_TEMPLATES
400 #include <casacore/scimath/StatsFramework/StatisticsAlgorithm.tcc>
401 #endif
402 
403 #endif