Articles on Law, Science, and Engineering

Legal Defenses to the Use of Data Digits to Identify Fabricated Data

Charles F. Walter and Edward P. Richards, III.

Introduction

In the previous article, we explained how the Office of Research Integrity (ORI) at the U.S. Public Health Service (PHS) uses "insignificant" digits in research data to catch researchers falsifying or fabricating data. Basically, ORI assumes that certain digits reported in data are not significant to the measurement they purport to represent. ORI further assumes that when such digits occur in falsified or fabricated data, they will not have been selected randomly. It therefore applies the Chi-Square test to the "insignificant" digits in questionable data and estimates the probability that the digits were generated randomly. Where bias is found in digits that should be random, the questionable data are further suspect, and the researcher who reported them may be convicted of scientific misconduct.

In applying this theory, ORI assumes that any digit to the right of the leading one that is not distributed randomly in the data raises the possibility that the data is falsified. However, any experimental scientist would intuit that the use of significant digits for this purpose is far less probative evidence that data is fabricated than is the use of truly insignificant digits. In what follows, rather than adopt ORI's terminology, we refer to the left-most digit as the "most significant digit," the right-most significant digit as the "least significant digit," the rest of the significant digits (if any) as "significant digits," and any other digits in a datum as "insignificant digits."

Discussion

1. Is Brady, Texas the Center of the Lone Star State?

It is often said that Brady, Texas is the geographical "center" of the Lone Star State. Of course, we know from elementary geometry that Texas does not have a "center." But what if the Brady Chamber of Commerce asked you to prove their boast? How would you do it?

One way is to draw a series of straight lines extending from a Texas border point through Brady and thence to another Texas border point. If Brady is indeed at the "center" of Texas, the two distances from Brady to the two borders along each of these lines should be nearly equal.

Should we need to do so, we can improve on the facts by "forcing" Brady to the midpoints of the lines described above. If we draw enough lines and the differences between the two distances on every line are all zero, Brady is exactly in the center of the Lone Star State. Perhaps we shall need to falsify some data to make it come out just right....

Tables 1 and 2 contain distances from Brady to fifty points on the Texas border (identified by the names of towns nearby).

TABLE 1

Distances from Brady to:

City           Distance   City           Distance   City           Distance
Acala             16.98   Affred            11.97   Author City       12.69
Big Bend NP       12.48   Bloomburg         14.67   Bon Wier          14.77
Brazosport        12.39   Burr Ferry        14.89   Comstock           6.48
Corpus Christi    12.11   De Kalb           14.39   Del Rio            6.84
Dennison          10.42   Eagle Pass         8.00   El Paso           18.66
El Indio           8.50   Fargo              9.73   Farwell           13.47
Follett           16.27   Galveston         13.05   Glen Rio          15.20
High Island       13.66   Hobbs (NM)        10.58   Jal (NM)          10.05
Lajitas           12.83   L. Anistad         7.06   Langtree           7.05
Lindsay            9.53   Logansport        13.83   Matagorda         11.73
McAllen           15.62   McLeod            14.48   McNary            16.59
Milam             14.48   Orange            14.86   Pt Isabell        16.70
Pt Mansfield      14.73   Presidio          13.92   Ringgold           8.89
Rockport          11.45   Ruidosa           14.30   San Ygancio       12.48
Seadrift          11.27   Stafford          17.48   Texhoma           17.14
Texico            13.45   Texline           18.22   Valentine         14.41
Washomi           14.08   Wichita Fls        9.05   Average           12.86

Distances are in inches estimated to the nearest 1/64"

TABLE 2

Distances from Brady to:

City           Distance   City           Distance   City           Distance
Acala             15.96   Affred            11.41   Author City        9.70
Big Bend NP       13.99   Bloomburg         13.59   Bon Wier          15.29
Brazosport        12.78   Burr Ferry        15.61   Comstock           9.65
Corpus Christi    15.48   De Kalb           11.85   Del Rio            7.85
Dennison           9.12   Eagle Pass         8.02   El Paso           16.36
El Indio           8.29   Fargo             10.31   Farwell           12.94
Follett           16.82   Galveston         12.69   Glen Rio          13.06
High Island       12.44   Hobbs (NM)        11.41   Jal (NM)          11.67
Lajitas           13.96   L. Anistad         8.98   Langtree          10.65
Lindsay            8.85   Logansport        13.90   Matagorda         12.93
McAllen           15.76   McLeod            13.48   McNary            15.99
Milam             14.00   Orange            16.27   Pt Isabell        16.33
Pt Mansfield      15.33   Presidio          13.40   Ringgold           8.73
Rockport          13.70   Ruidosa           14.78   San Ygancio       11.99
Seadrift          12.66   Stafford          16.64   Texhoma           17.88
Texico            12.98   Texline           15.57   Valentine         14.09
Washomi           14.46   Wichita Fls        9.97   Average           12.99

Distances are in inches estimated to the nearest 1/64"

The points on the border have been chosen such that it is possible to draw a straight line through Brady connecting one of the border points in the two quadrants to the north of Brady to one of the border points in the two quadrants to the south of Brady. One of the tables contains actual data obtained using a ruler and a map of Texas. Table 2 was constructed without a ruler with whole inches falsified to minimize the differences between the distances from Brady to each pair of border towns and all data to the right of the decimal point completely fabricated.

2. Not Catching the Guilty

Can we identify which table contains the falsified data by examining the distribution of significant digits in the data in both tables? Both tables allegedly contain raw data in inches estimated to the nearest 1/64". According to ORI the distribution of the digits to the right of the leading digit should be uniform in the authentic data, and not uniform in the falsified data.

As described above, ORI applies the distribution test by calculating Chi-Square ("CS") for the least significant digit, the next least significant digit, and so on, but excluding the leading digit from the calculations. ORI also calculates the "aggregate" Chi-Square ("CS(ag)") for all the digits except the leading digit. ORI assumes that there are 9 degrees of freedom, and using a probability test of .05, ORI concludes that sets of digits are uniformly distributed or not. Data containing digit sets that are not uniformly distributed according to this test may be used as evidence that the data was falsified.
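For reference, the statistic described above can be written out as follows. This is the standard Pearson goodness-of-fit form, which we assume is what ORI computes; the symbols O_d, E_d, and N are our own notation and do not come from ORI.

```latex
% O_d : observed count of digit d in the column being tested
% N   : total number of digits in that column
% E_d : expected count of digit d if all ten digits are equally likely
\chi^2 \;=\; \sum_{d=0}^{9} \frac{(O_d - E_d)^2}{E_d},
\qquad E_d = \frac{N}{10}
```

With ten possible digits there are 9 degrees of freedom, and the standard table value at the .05 level is about 16.92. A computed Chi-Square above that value is what leads to the conclusion that the digits are not uniformly distributed.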

As before, we refer to the least significant digit as "R1," the digit to the left of R1 as "R2," the digit to the left of R2 as "R3," etc. Using this convention we find CS(R1) = 8, CS(R2) = 12, CS(R3) = 21.5, and CS(ag) = 12.7 for the data in Table 1. Assuming nine degrees of freedom, the values for the R1, R2, and aggregate digits are all consistent with a random distribution at the .05 level, but CS(R3) = 21.5 for the 1's digit in the data corresponds to a probability that these digits are randomly distributed that is much less than .05.
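The Table 1 figures can be checked with a short script. This is a minimal sketch, assuming the ordinary Pearson statistic with ten equally likely digits and assuming that the leading digit of each entry is skipped; the helper names `digits_at` and `chi_square` are ours, not ORI's.

```python
from collections import Counter

# Table 1 distances in inches, kept as strings so trailing zeros are preserved.
TABLE_1 = """
16.98 11.97 12.69 12.48 14.67 14.77 12.39 14.89  6.48 12.11
14.39  6.84 10.42  8.00 18.66  8.50  9.73 13.47 16.27 13.05
15.20 13.66 10.58 10.05 12.83  7.06  7.05  9.53 13.83 11.73
15.62 14.48 16.59 14.48 14.86 16.70 14.73 13.92  8.89 11.45
14.30 12.48 11.27 17.48 17.14 13.45 18.22 14.41 14.08  9.05
""".split()


def digits_at(values, position):
    """Digits `position` places from the right (1 = R1), leading digit excluded."""
    found = []
    for value in values:
        digits = value.replace(".", "")
        # position == len(digits) would be the leading digit, which is skipped.
        if position < len(digits):
            found.append(int(digits[-position]))
    return found


def chi_square(digits, categories=10):
    """Pearson chi-square against a uniform distribution over `categories` digits."""
    counts = Counter(digits)
    expected = len(digits) / categories
    return sum((counts[d] - expected) ** 2 / expected for d in range(categories))


r1, r2, r3 = (digits_at(TABLE_1, p) for p in (1, 2, 3))
cs_r1 = chi_square(r1)
cs_r2 = chi_square(r2)
cs_r3 = chi_square(r3)
cs_ag = chi_square(r1 + r2 + r3)   # all non-leading digits pooled

for name, value in [("CS(R1)", cs_r1), ("CS(R2)", cs_r2),
                    ("CS(R3)", cs_r3), ("CS(ag)", cs_ag)]:
    print(f"{name} = {value:.1f}")
```

Only the forty entries of 10 inches or more contribute an R3 digit, because in the shorter entries the 1's digit is the leading digit.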

In order to understand what's going on with this digit (and why it can be inappropriate for ORI to use it in its analysis), consider the following example.

Assume that, instead of Texas, Brady is in the center of a state called "Circle-Radius 460." Then Brady, CR460 would be 460 km from every point on its border. If I measured fifty of these distances on a map of Circle wherein 1" equals 36 km with a precision of 0.2", I'd find fifty numbers, all approximately the same, say 12.8". The right-most digit would always be even, so its distribution would not be random. The next digit would nearly always be "2," sometimes "3," so its distribution would not be random either.

What if Brady is the center of a state called "Ellipse-468/432," where the major axis is 936 km and the minor axis is 864 km? Then Brady, E468/432 would be 468 km from its most distant border, and 432 km from its nearest. If I measure fifty of these distances as before, I find fifty numbers, all between 12.0" and 13.0". As before, the right-most digit would always be even, so its distribution would not be random. The next digit would nearly always be "2," but a few times it would be "3" (as before), and it might even be "1" or "4." Thus, the second digit would not be randomly distributed either.

What if Brady is the center of a state called "Ellipse-576/288"? Now if I measure fifty of these distances as before, I find fifty numbers between 8.0" and 16.0". The right-most digit is still always even, so it is not random. The middle digit, if there is one, is 0 through 6, with 6 not appearing much. So these digits are not random either.

What if Brady is the center of a state called "Ellipse-108/1080"? Now the measurements are between 3.0" and 30.0". The right-most digit is still always even, so it is not random. The middle digit, if there is one, can be 0 through 9, but these digits will not be randomly distributed unless I designed the experiment so that they would be. For example, if I made 25 measurements for border points that happened to be evenly spaced between 360 and 540 km from Brady, and another 25 that happened to be evenly spaced between 720 and 900 km from Brady, Chi-Square for the 50 middle digits would be about 50. This corresponds to an infinitesimally small probability that the middle digits are uniformly distributed, and could be used to support an accusation that the data were falsified when in fact they weren't. If, on the other hand, I made the 50 observations at points evenly spaced along the border, Chi-Square would be about .57.
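The two-cluster case can be simulated in a few lines. This is a rough sketch under our own assumptions: the 25 points in each band include both endpoints, each map distance is read to the nearest 0.2", and the middle digit is taken to be the 1's digit of that reading. The exact statistic depends on those choices, so it need not match the figure quoted above, but it lands far above the 9-degrees-of-freedom critical value of about 16.92.

```python
from collections import Counter

KM_PER_INCH = 36        # map scale used in the hypothetical states above
CRITICAL_9DF = 16.919   # chi-square critical value, 9 degrees of freedom, .05 level


def evenly_spaced(start_km, stop_km, n):
    """n distances from start_km to stop_km inclusive (assumed spacing rule)."""
    step = (stop_km - start_km) / (n - 1)
    return [start_km + i * step for i in range(n)]


def middle_digit(km):
    """1's digit of the map distance in inches, read to the nearest 0.2 inch."""
    fifths = round(km / KM_PER_INCH * 5)   # number of 0.2-inch steps
    tenths = 2 * fifths                    # the reading in tenths of an inch
    return (tenths // 10) % 10


distances = evenly_spaced(360, 540, 25) + evenly_spaced(720, 900, 25)
counts = Counter(middle_digit(km) for km in distances)

expected = len(distances) / 10
chi_sq = sum((counts[d] - expected) ** 2 / expected for d in range(10))

print("digit counts:", [counts[d] for d in range(10)])
print(f"Chi-Square = {chi_sq:.1f} (critical value {CRITICAL_9DF})")
```

Every reading falls between 10.0" and 15.0" or between 20.0" and 25.0", so the digits 6 through 9 never appear in the middle position even though nothing has been falsified.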

So of course the R3 digits in Table 1 are not randomly distributed amongst all the possible digits. Texas is limited in size. No town in Table 1 is more than 18.66" from Brady. So there can be no 9's amongst the R3 digits. If we redo the calculation for the nine possible digits, CS(R3) is 15.4. Since there are really only eight degrees of freedom, the digits in R3 meet the ORI criteria for being distributed randomly. This is somewhat surprising, however, because the Texas border has a certain shape which, together with our choice of border points to measure, influences the frequency of digits in R3.
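The effect of dropping the impossible digit can be seen directly from the R3 tallies. In this sketch the counts are our own tally of the 1's digit in the forty Table 1 entries of 10 inches or more, and the critical values are the standard .05-level table entries; neither is taken from ORI.

```python
# Our tally of the R3 (1's) digit in Table 1, for digits 0 through 9.
R3_COUNTS = [3, 4, 6, 6, 11, 2, 4, 2, 2, 0]

# Standard chi-square critical values at the .05 level, keyed by degrees of freedom.
CRITICAL_05 = {8: 15.507, 9: 16.919}


def chi_square(counts):
    """Pearson chi-square against a uniform spread over the listed categories."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


cs_ten = chi_square(R3_COUNTS)        # all ten digits treated as possible
cs_nine = chi_square(R3_COUNTS[:9])   # the digit 9 excluded as impossible

fails_with_ten = cs_ten > CRITICAL_05[9]
passes_with_nine = cs_nine < CRITICAL_05[8]

print(f"ten digits:  CS(R3) = {cs_ten:.2f}, exceeds critical value: {fails_with_ten}")
print(f"nine digits: CS(R3) = {cs_nine:.2f}, below critical value: {passes_with_nine}")
```

The nine-digit value is 15.35, which rounds to the 15.4 reported above and sits just under the eight-degrees-of-freedom threshold. The same data therefore fail or pass depending only on how many digits the analyst treats as possible.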

If we do the same calculation with the digits in Table 2, we find CS(R1) = 6.8, CS(R2) = 12.8, CS(R3) = 22, and CS(ag) = 5.7. If we correct R3 to take into account the experimental fact that there are no 8's or 9's in a state the size and shape of Texas, CS(R3) = 9.6. Even taking into account that there are only seven degrees of freedom for the digits in R3, the probability that the digits are uniformly distributed is substantially greater than .05.

We can't tell from looking at Table 2 that the data are falsified, and ORI can't tell using its Chi-Square test. Since we have never falsified data before, we do not consider ourselves particularly good at it. If we can avoid flunking the test, so can just about anyone. Yet the data in Table 2 is falsified. All of the digits in R1 and R2 are without any basis in measurement, and the digits in R3 have been chosen to minimize the real differences between border points on a straight line through Brady. In fact the sum of the differences between these fifty points for the data in Table 2 is 0.251% of their total.

The data in Table 1 is not falsified. There, the sum of all the differences between the distances from Brady to border points on the same straight line is about 1% of the total distances involved. Thus, Brady is indeed close to the geographical "center" of Texas.

3. Are the Innocent Safe?

If ORI's test will not nail all of the miscreants who falsify data, will it accidentally snare some of the innocent? We have already seen that using the digit immediately to the right of the leading digit can do just that. Next, we illustrate some of the other pitfalls associated with using ORI's uniform distribution test without looking closely at the experimental design and other factors.

One important issue is the distribution of the least significant digit after rounding. What if we are actually unable to estimate distances on our map of Texas with the 1/64" precision we indicated in Table 1? When we estimate the same distances with a precision of 1/4", CS(R1) = 89.6, CS(R2) = 21, and CS(ag) = 50.4. The high CS(R1) and CS(ag) occur because not all digits are possible. Even if we correct the calculations for CS(R1) to take into account that there are six digits excluded from the least significant position and three degrees of freedom, the probability that the digits are uniformly distributed is less than .05. However, taking into account that the 1's column cannot contain a "9" leads to CS(R2) = 14.9. Since, as before, there are only eight degrees of freedom, the probability that the 1's column digits are randomly distributed is greater than .05.

Another important issue is what happens to the Chi-Square test when raw data is massaged. For example, what if we convert the raw data from inches to kilometers? Does the result depend on what the conversion factor happens to be?

The scale for the map we used to obtain the data in Table 1 is about 1,400,000 to 1. If we convert the raw data in Table 1 to kilometers, CS(R1) = 10.8, CS(R2) = 4.0, and CS(ag) = 7.4. In this example, massaging the raw data in Table 1 by converting it to kilometers more than doubles the probability that the data is uniform, as judged by CS(ag).

If we convert raw data estimated with a precision of 1/4" to kilometers, CS(R1) = 12, CS(R2) = 14.4, and CS(ag) = 13.2. In this example, massaging the raw data by converting it to kilometers increases the probability that the data is uniform by several orders of magnitude, but not above .05. Thus, raw data converted to kilometers that was not falsified can be highly suspect according to the ORI Chi-Square test when the real digit frequencies and reduced degrees of freedom apparent in the raw data are masked by the calculation.

These Chi-Square calculations illustrate that, for two sets of data from the same experiment gathered by the same scientist, experimental design alone can make the data from the better experimental protocol appear to be fabricated. For example, the unfalsified raw Brady data recorded with less precision might appear to be falsified when compared to unfalsified data from the same experiment reported with more precision than is justified by the experimental design.

If we convert the falsified data in Table 2 to kilometers, CS(R1) = 7.2, CS(R2) = 9.2, and CS(ag) = 3.6. In this case, massaging the falsified "raw data" increases the probability that it is uniformly distributed in every digit and in their aggregate Chi-Square.

If the scale for the map we used had been 1,500,000 to 1, then converting the raw data in Table 1 to kilometers gives CS(R1) = 21.2, CS(R2) = 10.4, and CS(ag) = 19.0. This illustrates dramatically how the choice of even a constant multiplier can cause the distribution of uniform raw data to change to non-uniform calculated data.

If we use the 1,500,000 to 1 scale for the falsified raw data in Table 2, CS(R1) = 8.4, CS(R2) = 10, and CS(ag) = 13.2. This illustrates dramatically how numbers calculated from falsified raw data can actually pass ORI's Chi-Square test while numbers calculated from authentic raw data by the same means and for the same experiment fail the same test.

Conclusions: Legal Defenses

ORI emphasizes the need to compare "unquestioned" data and "questioned" data from the same laboratory or individuals. Absent such comparison, it is impossible to tell whether a lack of uniformity of digits in experimental data is for some reason other than data falsification. Accordingly, the absence of a clear distinction between the Chi-Squares for unquestioned data and the allegedly falsified data is a legal defense to an analysis indicating that data is falsified.

Sometimes this distinction can be very subtle. It may even be impossible to detect without knowing the outcome of the experiment in question. Consider, for example, the number of sections in a grapefruit. Assume the number of sections in a grapefruit ranges from 10 through 14. In one experiment 130 grapefruit from Texas were counted and the average number of sections was 12.0 plus or minus 1.0. In the other experiment the same number of grapefruit from California were counted and the average number of sections was 12.0 plus or minus 1.3. Should the investigator be charged with falsifying the Texas data on the basis that Chi-Square was 2.46 for the penultimate digits in the California grapefruit data and about 195 for the Texas grapefruit? What could be more clear? Each sampling had 4 degrees of freedom, so ORI would conclude that there was a probability much less than .05 that the penultimate digits in the Texas data were uniform in their permitted range (0-4) and a probability much greater than .05 that these digits in the California data are uniform. However, Texas is an odd state, and so are our grapefruit. The California data were: 22 grapefruit each with 10 and with 14 sections, 30 each with 11 and with 13 sections, and 26 with 12 sections. The Texas data were: no grapefruit with an even number of sections and 65 grapefruit each with 11 and with 13 sections. The bimodal distribution the experiment was designed to observe would result in the Chi-Square "evidence" of data falsification. Obviously, a legal defense to a Chi-Square analysis indicating that data were fabricated is that the data are not supposed to be uniform because, for example, nature favors odd numbers in some circumstances and even numbers in others.
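The grapefruit arithmetic is easy to check. This sketch assumes the tallies are per section count (22 grapefruit with 10 sections and another 22 with 14, and so on), which is the reading that brings each sample to 130 grapefruit; the variable names are ours.

```python
# Counts of grapefruit with 10, 11, 12, 13, and 14 sections (assumed reading).
CALIFORNIA = [22, 30, 26, 30, 22]
TEXAS = [0, 65, 0, 65, 0]

CRITICAL_4DF = 9.488   # standard chi-square critical value, 4 degrees of freedom, .05 level


def chi_square(counts):
    """Pearson chi-square against equal counts in every category."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


cs_california = chi_square(CALIFORNIA)
cs_texas = chi_square(TEXAS)

print(f"California: {cs_california:.2f}")   # well under the critical value
print(f"Texas:      {cs_texas:.1f}")        # far over the critical value
```

On this reading the California tally gives 2.46 and the Texas tally gives 195, essentially the figures quoted above. The Texas sample flunks the uniformity test only because the quantity being counted is genuinely bimodal.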

Clearly, using "inconsequential" digits to test data may work when the raw data is recorded to more significant figures than the individual recording it can actually observe. It is equally clear, however, that observables in scientific experiments are not random numbers. Therefore, properly recorded raw data of observable events may appear as falsified data when compared with data recorded with random insignificant figures. Accordingly, a legal defense to a Chi-Square analysis indicating that data were fabricated is that the data are being compared with data containing more significant figures than are justified by the experiment.

We have also shown that the distribution of "insignificant" digits depends on experimental design in other ways. For example, a relative lack of uniformity may reflect a sampling protocol that uses a less random selection of measurements. This, in turn, may be dictated by the experimental system itself, by constraints on experimental protocol, or by choice of the investigator. Thus, a legal defense to an analysis indicating that data was falsified is that the less significant digits analyzed are not randomly distributed due to the experimental design.

Another legal defense to a finding of falsified data based on inconsequential digits is that the reported data is the result of a calculation or machine interpretation of raw data. For example, using the ANSI/IEEE Standard 854-1987 for rounding (when the number to be rounded is exactly halfway between successive digits, the figure is rounded off to the nearest even digit) causes even terminal digits in the rounded figure to exceed odd terminal digits by 20%. This bias can have a very large effect on the uniformity probabilities estimated from Chi-Square for digits in rounded versus truncated data. In one example, we compared Chi-Square for the terminal digits of falsified data that had been manufactured by averaging real data (1) when the falsified data is rounded using the ANSI/IEEE standard, versus (2) when the mid-point numbers are truncated. Chi-Square for the truncated data was twice the value for the rounded data, and the probability that the data are uniformly distributed was increased about 300-fold.
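The even-digit bias from round-half-to-even can be demonstrated by brute force. This sketch assumes the simplest case, in which every two-decimal value from 0.00 through 9.99 is equally likely and is rounded to one decimal place; Python's `decimal` module stands in for whatever calculator or instrument did the rounding.

```python
from decimal import Decimal, ROUND_HALF_EVEN

ONE_DECIMAL = Decimal("0.1")

even_count = 0
odd_count = 0
for n in range(1000):                      # 0.00, 0.01, ..., 9.99
    value = Decimal(n) / 100
    rounded = value.quantize(ONE_DECIMAL, rounding=ROUND_HALF_EVEN)
    terminal_digit = int(rounded * 10) % 10
    if terminal_digit % 2 == 0:
        even_count += 1
    else:
        odd_count += 1

print("even terminal digits:", even_count)   # 550
print("odd terminal digits: ", odd_count)    # 450
```

Each even terminal digit collects 110 of the 1,000 values and each odd one collects 90, so the gap between them is 20% of the per-digit average of 100. Real data with more trailing digits will land on an exact midpoint less often, so the bias there would be smaller.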

We have also illustrated that the Chi-Square distribution test does not work when an experimental measurement is reported with the correct precision and the observable is a constant, or even when it varies over a modest range. Moreover, we showed that the test as applied by ORI does not work when the observable is varied 100%. Thus, a legal defense to the claim that data lacking uniformity are fabricated is that the range of variation in the experiment precludes application of the test to the experimental data.

In general, a legal defense to the claim that non-uniform "inconsequential" digits indicate that data has been falsified is that the statistical approach cannot indicate the true cause for the lack of uniformity. The cause could be a bias introduced by a calculation, a machine, or an innocent characteristic of the individual collecting the data. It could be due to the nature of the observable, whether it varies in nature, and the accuracy with which it can be observed over that range. It could be due to rounding error or just adding one's favorite digits in places to the right of the significant ones. The possibilities are endless. Except in the very unlikely case where all other sources of non-uniformity can be excluded for the specific data in question, the statistical evidence is not very probative.

The Law, Science & Public Health Law Site
Professor Edward P. Richards, III, JD, MPH - Webmaster