
Detection of determinant genes and diagnostic via Item Response Theory

Abstract

This work presents a method for analyzing the characteristics of a set of genes that may influence a certain anomaly, such as a particular type of cancer. A measure is proposed for diagnosing individuals with respect to the anomaly under study, and some characteristics of the genes are analyzed. Maximum likelihood equations for the general and particular cases are presented.

Key words: Item Response Theory (IRT); predisposition; propensity; logistic model.


BIOINFORMATICS

RESEARCH ARTICLE


Héliton Ribeiro Tavares (I); Dalton Francisco de Andrade (II); Carlos Alberto de Bragança Pereira (III)

(I) Universidade Federal do Pará, Departamento de Estatística, Belém, PA, Brazil

(II) Universidade Federal de Santa Catarina, Departamento de Informática e Estatística, Florianópolis, SC, Brazil

(III) Universidade de São Paulo, Departamento de Estatística, São Paulo, SP, Brazil

Correspondence to: Heliton R. Tavares, Universidade Federal do Pará, Departamento de Estatística, 66075-900 Belém, Pará, Brazil. E-mail: heliton@ufpa.br


Introduction

In many practical situations, decisions have to be made based on individual quantities that cannot be observed directly. These quantities are called latent variables and are given different names according to the area of application: ability or proficiency in education and psychology; purchasing power in marketing; quality of life or predisposition to a certain disease in biology and medicine (see, for example, Andrade et al., 2000; Paas, 1998). Such analyses are, in general, based on the responses to a set of variables, often referred to as items, that make up the measuring instrument. In educational evaluations, for example, items are the questions of a test, whose answers may be categorized as right/wrong, as A/B/C/D/E with only one correct alternative, or in a graded way in which A is the least correct and E the most correct alternative. In the right/wrong case, a weight of 1 (right) or 0 (wrong) is attached to each item. For some time these studies were based on the score of each individual, that is, on the number of items with weight one. However, this approach has many drawbacks, mainly because it does not distinguish among the items, which led to the development of a theory based on the items themselves rather than on the overall results, named Item Response Theory. In this theory each item has a set of well-defined characteristics that are estimated. The estimation of an individual's latent variable takes into account each item of the test and reveals, for example, that individual's level of knowledge in a certain area or purchasing power with respect to a certain product.

Sometimes more than one population is being studied. For instance, in education the interest may lie in estimating the average proficiencies by sex or geographical location.

In a similar situation, a set of genes is studied in order to appraise the predisposition of an individual to a certain illness. A set of items (genes) is taken into account, and their responses can be recorded as activated or deactivated, or categorized as A/B/C/D/E, representing different levels of activity of the genes. Genes have peculiar characteristics that need to be incorporated into a model so that they can be evaluated. Suggestions have been advanced on how to pinpoint genetic influences (Vanyukov and Tarder, 2000), but with some shortcomings. For example, the conclusions reached depend on the sample chosen.

Models for Response Functions

Item Response Theory is based on models that represent the probability that an individual responds in each category of an item as a function of the item parameters and of the individual's predisposition. These functions are called Item Response Functions (IRF) or Item Characteristic Curves (ICC). The different models proposed in the literature depend basically on the type of item.

For ease of exposition, we will consider that there are K populations under study and that the same n genes are analyzed in each of them. The sample from population k is composed of $N_k$ individuals, k = 1,..., K. The model used in this paper is the unidimensional four-parameter logistic model for items with two categories (of the type activated/deactivated). Its expression is given by

$$P(U_{ijk} = 1 \mid \theta_{jk}) = c_i + (\gamma_i - c_i)\,\frac{1}{1 + e^{-D a_i (\theta_{jk} - b_i)}}, \qquad (1)$$

with i = 1, 2,..., n, j = 1, 2,..., $N_k$ and k = 1, 2,..., K, where

$U_{ijk}$ is a dichotomous variable that takes the value 1 when individual j of population k has gene i activated, or 0 when the gene is deactivated.

$\theta_{jk}$ represents the predisposition of the jth individual of population k.

$b_i$ is the inactivity (or position) parameter of gene i, measured on the same scale as the predisposition.

$a_i$ is the discrimination (or slope) parameter of gene i.

$c_i$ is the probability of gene i being active for individuals with low predisposition.

$\gamma_i$ is the probability of gene i being active for individuals with high predisposition, so that $1 - \gamma_i$ is the probability of it being deactivated for these individuals.

D is a scale factor, constant and equal to 1. The value 1.7 is used when the logistic function is required to yield results similar to those of the normal distribution function.

N is the number of individuals involved in the study.
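The role of each parameter can be checked numerically. The following is a minimal sketch, assuming the usual four-parameter logistic form in which $c_i$ and $\gamma_i$ are the lower and upper asymptotes; the function name `irf_4pl` and the gene values are hypothetical and chosen only for illustration.

```python
import math


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


# Hypothetical gene: a = 1.0, b = 0.5, c = 0.2, gamma = 0.9, with D = 1.7.
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(irf_4pl(theta, 1.0, 0.5, 0.2, 0.9, D=1.7), 3))

# With a = 0 the probability no longer depends on the predisposition.
print(irf_4pl(-2.0, 0.0, 0.5, 0.2, 0.9), irf_4pl(2.0, 0.0, 0.5, 0.2, 0.9))
```

At theta = b the probability is halfway between the two asymptotes, and a negative `a` reverses the direction of the curve.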

Defining the Parameters of the Genes

In general terms, the proposed model is based on the assumption that predisposed individuals are more likely to have gene i activated, and that this relation is not linear. Indeed, Figure 1 shows that the IRF has the form of an "S", with slope and displacement defined by the gene parameters. However, only a subset of the genes has to satisfy this situation, which occurs only when $a_i > 0$. Some genes may be deactivated in high-propensity individuals, in which case the IRF should have an inverted form, expressing that individuals with high propensity are less likely to have the gene activated; this is expressed by $a_i < 0$. When $a_i = 0$, $P(U_{ijk} = 1 \mid \theta_{jk})$ is constant for all $\theta_{jk}$, indicating that gene i does not interfere in the occurrence of the anomaly.


Parameter $b_i$ is, perhaps, the most important of the four. The greater this parameter, the less likely it is that a given individual has gene i activated. This conclusion is valid only for $a_i > 0$; the opposite is true for $a_i < 0$.

Individuals with low predisposition can still have gene i active, and this information is conveyed by the parameter $c_i$. On the other hand, high-predisposition individuals can also have gene i inactive, and this information is conveyed by $1 - \gamma_i$. These conclusions are valid only for $a_i > 0$; the opposite holds for $a_i < 0$.

Scale of Measurement/Indetermination

Predisposition can theoretically take any real value between $-\infty$ and $+\infty$. Thus, it is necessary to establish an origin and a unit of measurement to define the scale. When only one population is under study, the scale can be defined so that its origin and unit represent the mean and the standard deviation of the predispositions of the individuals in that population. For the graphs shown earlier, the scale used had a mean of 0 and a standard deviation of 1, which will be referred to from now on as scale (0,1). In practice, it makes no difference whether these or any other values are set. What matters are the order relations between points on the scale. For example, on the scale used above an individual with a predisposition of 1.2 is 1.2 standard deviations above the mean predisposition. Had the scale (80,10) been used for this population, the same individual would have a predisposition of 92 and would therefore still be 1.2 standard deviations above the mean.
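The indeterminacy of the scale can be illustrated with a short sketch. It assumes the four-parameter logistic form and uses the standard linear change of scale, under which the position parameter moves with the predisposition and the discrimination is divided by the new standard deviation; `irf_4pl`, `rescale` and the gene values are hypothetical names and numbers used only for this example.

```python
import math


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


def rescale(theta, a, b, mean, sd):
    """Move a predisposition and gene parameters from scale (0,1) to scale (mean, sd)."""
    return mean + sd * theta, a / sd, mean + sd * b


theta, a, b, c, gamma = 1.2, 1.0, 0.5, 0.2, 0.9
theta_new, a_new, b_new = rescale(theta, a, b, mean=80.0, sd=10.0)

p_old = irf_4pl(theta, a, b, c, gamma)
p_new = irf_4pl(theta_new, a_new, b_new, c, gamma)
print(theta_new, p_old, p_new)
```

The predisposition of 1.2 becomes 92 on the scale (80,10), while the activation probability is unchanged, which is why only the order relations on the scale carry meaning.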

When several populations are present, one of them can be adopted as the Reference Population, and only the scale for this population has to be fixed. The predisposition values obtained for the other populations can then be compared directly with those of the Reference Population. One such example consists of taking healthy individuals as the Reference Population and the individuals with a certain anomaly as the other population. Further populations can also be taken into account.

Local Independence

A hypothesis often used in IRT is local independence (or conditional independence). It states that the probability that a certain gene is active depends only on the individual's predisposition; that is, the predisposition carries all the information needed to determine the activation or deactivation of the gene. This does not mean that the quantities $U_{kji}$ and $U_{kjl}$, $i \neq l$, are independent; rather, given the individual predisposition $\theta_{jk}$, they are considered conditionally independent. There are models for the case in which conditional independence does not hold, but then the possible dependence has to be modeled.
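Under local independence, the probability of a whole vector of gene responses, given the predisposition, is simply a product over genes. The sketch below illustrates this under the assumption of the four-parameter logistic model; the function names and the three genes are hypothetical.

```python
import itertools
import math


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


def response_vector_prob(u, theta, genes, D=1.0):
    """P(U = u | theta) as a product over genes, using local independence."""
    prob = 1.0
    for u_i, (a, b, c, gamma) in zip(u, genes):
        p = irf_4pl(theta, a, b, c, gamma, D)
        prob *= p if u_i == 1 else 1.0 - p
    return prob


# Hypothetical genes given as (a, b, c, gamma).
genes = [(1.0, -0.5, 0.2, 0.9), (0.8, 1.0, 0.2, 0.9), (1.2, 3.0, 0.2, 0.9)]
theta = 0.7

# The probabilities of all 2^n response vectors add up to 1 (up to rounding).
total = sum(response_vector_prob(u, theta, genes, D=1.7)
            for u in itertools.product((0, 1), repeat=len(genes)))
print(total)
```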

Parameter Estimation of the Genes and Predispositions

One of the most important stages of IRT is the estimation of the gene parameters and/or of the individual predispositions. In some cases the gene parameters are already known and only the predispositions need to be estimated; in other, less common, cases the predispositions of the individuals are known and only the gene parameters need to be estimated. The most common situation, however, is the one in which both the gene parameters and the individual predispositions have to be estimated simultaneously. In all these cases the proposed model is assumed to be true and, from the set of responses obtained for a certain number of individuals from one or more populations, parameters and/or predispositions are estimated by either likelihood or Bayesian methods. Both require iterative procedures involving complex calculations and, therefore, specific computer programs. It is important to point out that, in any of these cases, the predisposition values and the gene parameters are all on the same scale of measurement and can therefore be compared.

Before outlining the estimation process, some notation is needed. The genes involved in the analysis are ordered so that they can be indexed by i = 1,..., n. Let $U_{kj.} = (U_{kj1}, U_{kj2},..., U_{kjn})$ be the random vector of responses of individual j from group k; $U_{k..} = (U_{k1.}, U_{k2.},..., U_{kN_k.})$ the random vector of responses of group k; and $U_{...} = (U_{1..}, U_{2..},..., U_{K..})$ the whole vector of responses. Similarly, the observed responses are represented by $u_{kji}$, $u_{kj.}$, $u_{k..}$ and $u_{...}$. This notation and local independence allow us to write the probability associated with the response vector $U_{kj.}$ as

$$P(U_{kj.} = u_{kj.} \mid \theta_{kj}, \zeta) = \prod_{i=1}^{n} P(U_{kji} = u_{kji} \mid \theta_{kj}, \zeta_i),$$

where $\zeta_i = (a_i, b_i, c_i, \gamma_i)$ is the parameter vector of gene i and $\zeta = (\zeta_1,..., \zeta_n)$.

Generally, the predispositions of the individuals of population k, $\theta_{jk}$, j = 1,..., $N_k$, are regarded as realizations of a random variable $\theta_k$ with a continuous distribution and probability density function $g(\theta \mid \eta_k)$, twice differentiable, with the components of $\eta_k$ finite. When $\theta_k$ has a Normal distribution, we have $\eta_k = (\mu_k, \sigma_k^2)$, where $\mu_k$ is the mean and $\sigma_k^2$ the variance of the predispositions of the individuals of population k, k = 1,..., K. This hypothesis carries a great advantage: only the parameters of the genes have to be estimated, as the likelihood will not depend on the individuals' predispositions. The estimation is therefore a two-stage process: in the first stage only the gene parameters are estimated, and these are then treated as known for the estimation of the predispositions.

Estimation of Gene Parameters

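With the notation defined above, the marginal probability of $U_{kj.}$ is given by

$$P(u_{kj.} \mid \zeta, \eta_k) = \int_{\mathbb{R}} P(u_{kj.} \mid \theta, \zeta, \eta_k)\, g(\theta \mid \eta_k)\, d\theta = \int_{\mathbb{R}} P(u_{kj.} \mid \theta, \zeta)\, g(\theta \mid \eta_k)\, d\theta,$$

where $\zeta$ denotes the vector of gene parameters and, in the last equality, we use the fact that, given $\theta$, the distribution of $U_{kj.}$ is not a function of the parameters $\eta_k$.

Using the independence between the responses of different individuals, the probability associated with the whole response vector $U_{...}$ can be written as

$$L(\zeta, \eta) = P(u_{...} \mid \zeta, \eta) = \prod_{k=1}^{K} \prod_{j=1}^{N_k} P(u_{kj.} \mid \zeta, \eta_k), \qquad (2)$$

with $\eta = (\eta_1,..., \eta_K)$.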

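Even though the likelihood can be written as (2), the approach based on response patterns has often been used. As we have n genes, with two possible responses for each (0 or 1), there are $S = 2^n$ possible response vectors (response patterns). Let $r_{kj}$ be the number of occurrences of response pattern j in group k, and let $S_k \le \min(N_k, S)$ be the number of response patterns with $r_{kj} > 0$. It follows that

$$\sum_{j=1}^{S_k} r_{kj} = N_k.$$

By the independence between the responses of different individuals, the data follow a Product-Multinomial distribution, that is,

$$L(\zeta, \eta) = \prod_{k=1}^{K} \left\{ \frac{N_k!}{\prod_{j=1}^{S_k} r_{kj}!} \prod_{j=1}^{S_k} \left[ P(u_{kj.} \mid \zeta, \eta_k) \right]^{r_{kj}} \right\},$$

where $P(u_{kj.} \mid \zeta, \eta_k)$ is now the marginal probability of response pattern j in group k, given the gene parameters $\zeta$ and the population parameters $\eta_k$. Therefore, the log-likelihood is

$$\log L(\zeta, \eta) = \sum_{k=1}^{K} \left\{ \log \frac{N_k!}{\prod_{j=1}^{S_k} r_{kj}!} + \sum_{j=1}^{S_k} r_{kj} \log P(u_{kj.} \mid \zeta, \eta_k) \right\}. \qquad (3)$$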

The estimation equations for the item parameters are given by

with

where

and Pi represents the IRF adopted. The specific equations for each parameter of the vector can thus be obtained from above.
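The response-pattern approach can be sketched for a single group as follows. This is only an illustration under stated assumptions: the four-parameter logistic model, a Normal predisposition distribution, and a plain trapezoidal rule on a fixed grid in place of whatever quadrature the authors used. All function names, the grid settings and the toy data are hypothetical, and the multinomial constant is omitted because it does not depend on the parameters.

```python
import math
from collections import Counter


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_prob(u, genes, mu=0.0, sigma=1.0, D=1.0, n_points=121, width=6.0):
    """Trapezoidal approximation of the integral of P(u | theta) g(theta) d(theta)."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_points - 1)
    total = 0.0
    for q in range(n_points):
        theta = lo + q * step
        like = 1.0
        for u_i, (a, b, c, gamma) in zip(u, genes):
            p = irf_4pl(theta, a, b, c, gamma, D)
            like *= p if u_i == 1 else 1.0 - p
        weight = 0.5 * step if q in (0, n_points - 1) else step
        total += weight * like * normal_pdf(theta, mu, sigma)
    return total


def pattern_counts(responses):
    """Collapse individual response vectors into {pattern: number of occurrences}."""
    return Counter(tuple(u) for u in responses)


def marginal_loglik(counts, genes, mu=0.0, sigma=1.0, D=1.0):
    """Sum of r * log P(pattern) over observed patterns (multinomial constant omitted)."""
    return sum(r * math.log(marginal_prob(u, genes, mu, sigma, D))
               for u, r in counts.items())


# Toy data: 5 individuals, 2 hypothetical genes given as (a, b, c, gamma).
genes = [(1.0, -0.5, 0.2, 0.9), (1.2, 1.0, 0.2, 0.9)]
responses = [(1, 0), (1, 1), (1, 0), (0, 0), (1, 0)]
counts = pattern_counts(responses)
loglik = marginal_loglik(counts, genes, D=1.7)
print(dict(counts), loglik)
```

Working with pattern counts means the integral is evaluated once per distinct pattern rather than once per individual, which matters when N is large and n is small.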

Application to the 4-parameter Logistic Model

For convenience, let

where

In sum, the estimating equations for $a_i$, $b_i$, $c_i$ and $\gamma_i$ are, respectively,

which do not have explicit solutions. Therefore, the estimates are obtained by iterative processes, such as Newton-Raphson, BFGS, Fisher's scoring or the EM algorithm.

Estimation of the Population Parameters

Considering the log-likelihood obtained in (3), the estimation equations for the mean predispositions and population variances are obtained by

and

k = 1,..., K. However,

If we use the $N(\mu_k, \sigma_k^2)$ distribution for $\theta_k$, we have

and

Thus, the final forms of the estimating equations for $\mu_k$ and $\sigma_k^2$ are, respectively,

Estimation of the Predispositions

Once the parameters of the genes are set, individual predispositions can be estimated. Such predispositions can also be estimated for individuals whose data were not used in estimating the gene parameters. The usual methods for estimating the predispositions are maximum likelihood (ML) and Bayesian methods such as maximum a posteriori (MAP) and expected a posteriori (EAP).

Estimation by ML

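In this case, the predispositions are estimated iteratively by the Newton-Raphson algorithm, maximizing the likelihood in (2) or, equivalently, the function

$$\log L(\theta_{kj}) = \sum_{i=1}^{n} \left\{ u_{kji} \log P_{kji} + (1 - u_{kji}) \log Q_{kji} \right\},$$

where $P_{kji} = P(U_{kji} = 1 \mid \theta_{kj})$ and $Q_{kji} = 1 - P_{kji}$.

The Maximum Likelihood Estimator (MLE) of $\theta_{kj}$ is the value that maximizes the likelihood or, equivalently, the solution of the equation

$$\frac{\partial \log L(\theta_{kj})}{\partial \theta_{kj}} = 0. \qquad (5)$$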

Note, from (5), that

where the last equality follows from (4), after substituting the respective quantities. As

we obtain

It follows that the estimating equation (5) for $\theta_{kj}$, j = 1,..., $N_k$, is

Again, this equation does not have an explicit solution for $\theta_{kj}$, and for this reason an iterative method is needed to obtain the desired estimate. The expressions needed to apply the Newton-Raphson iterative process are given next.
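A Newton-Raphson iteration for one individual can be sketched as follows, with the gene parameters treated as known. The sketch assumes the four-parameter logistic model and, as a simplification that is not taken from the text, approximates the second derivative by a central finite difference of the score instead of using a closed-form expression. The function names, the step limit and the four genes are hypothetical.

```python
import math


def irf_and_slope(theta, a, b, c, gamma, D=1.0):
    """Return P(theta) and dP/dtheta for the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (gamma - c) * logistic
    slope = D * a * (gamma - c) * logistic * (1.0 - logistic)
    return p, slope


def score(theta, u, genes, D=1.0):
    """First derivative of the log-likelihood of one response vector."""
    total = 0.0
    for u_i, (a, b, c, gamma) in zip(u, genes):
        p, slope = irf_and_slope(theta, a, b, c, gamma, D)
        total += (u_i - p) * slope / (p * (1.0 - p))
    return total


def ml_theta(u, genes, D=1.0, theta0=0.0, tol=1e-10, max_iter=50, h=1e-5):
    """Newton-Raphson estimate of the predisposition for known gene parameters."""
    theta = theta0
    for _ in range(max_iter):
        s = score(theta, u, genes, D)
        # Finite-difference approximation of the second derivative (assumption).
        hess = (score(theta + h, u, genes, D) - score(theta - h, u, genes, D)) / (2.0 * h)
        if hess >= 0.0:
            raise ArithmeticError("log-likelihood is not locally concave here")
        step = max(-1.0, min(1.0, -s / hess))  # limit the size of each update
        theta += step
        if abs(step) < tol:
            return theta
    raise ArithmeticError("Newton-Raphson did not converge")


# Hypothetical genes (a, b, c, gamma) and one response vector.
genes = [(1.0, -1.0, 0.2, 0.9), (1.0, 0.0, 0.2, 0.9),
         (1.0, 1.0, 0.2, 0.9), (1.0, 2.0, 0.2, 0.9)]
u = (1, 1, 0, 0)
theta_hat = ml_theta(u, genes, D=1.7)
print(theta_hat, score(theta_hat, u, genes, D=1.7))
```

Because the lower and upper asymptotes keep the likelihood bounded away from zero, some response vectors (for example, all genes activated) have no finite maximum; the sketch raises an error in that situation rather than returning a misleading value.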

Given an estimate of $\theta_{kj}$ at iteration t, at iteration t+1 we have

where

with

and

Estimation by MAP

As in the marginal likelihood estimation, the Bayesian estimation of the predispositions is done in the second stage, with the gene parameters held fixed. Under the hypothesis of independence between the predispositions of different individuals, the estimation can be carried out separately for each individual.

Let us assume that the prior distribution for $\theta_{kj}$, j = 1,..., $N_k$, is Normal with known parameter vector $\eta_k = (\mu_k, \sigma_k^2)$. The posterior distribution of the predisposition of individual j of population k can be written as

$$g^*_{kj}(\theta_{kj}) = \frac{P(u_{kj.} \mid \theta_{kj}, \zeta)\, g(\theta_{kj} \mid \eta_k)}{P(u_{kj.} \mid \zeta, \eta_k)}, \qquad (9)$$

where $\zeta$ denotes the (fixed) gene parameters.

Some characteristic of $g^*_{kj}(\theta_{kj})$ can be adopted as the estimator of $\theta_{kj}$; the most frequently adopted are the mean and the mode. We next describe how to obtain each of them.

Estimation of the mode of the posterior distribution - MAP

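Bayesian modal estimation consists of finding the maximum of (9). For convenience, we work with the logarithm of the posterior,

$$\log g^*_{kj}(\theta_{kj}) = C + \log P(u_{kj.} \mid \theta_{kj}, \zeta) + \log g(\theta_{kj} \mid \eta_k),$$

where C is a constant that does not depend on $\theta_{kj}$. It follows that the estimating equation for $\theta_{kj}$ is

$$\frac{\partial \log P(u_{kj.} \mid \theta_{kj}, \zeta)}{\partial \theta_{kj}} + \frac{\partial \log g(\theta_{kj} \mid \eta_k)}{\partial \theta_{kj}} = 0. \qquad (10)$$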

By local independence, we have that

Therefore,

Keeping in mind that and using the development under (5), we have that

As we have adopted a Normal prior distribution for $\theta_{kj}$, the second term of (10) is

$$\frac{\partial \log g(\theta_{kj} \mid \eta_k)}{\partial \theta_{kj}} = -\frac{\theta_{kj} - \mu_k}{\sigma_k^2}. \qquad (12)$$

By (11) and (12), the estimating equation for $\theta_{kj}$ is

As this equation does not have an explicit solution, an iterative method can be used to solve it. This requires the second derivative of $\log g^*_{kj}(\theta_{kj})$ with respect to $\theta_{kj}$, whose expression is

where hkji and Hkji are given by (7) and (8), respectively.
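The modal estimate can be sketched in the same way as the ML estimate, adding the contribution of the Normal prior to the score. The sketch assumes the four-parameter logistic model and a Normal prior with known mean and standard deviation; as a simplification not taken from the text, the second derivative is approximated by a finite difference. Function names and gene values are hypothetical.

```python
import math


def irf_and_slope(theta, a, b, c, gamma, D=1.0):
    """Return P(theta) and dP/dtheta for the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (gamma - c) * logistic
    slope = D * a * (gamma - c) * logistic * (1.0 - logistic)
    return p, slope


def posterior_score(theta, u, genes, mu, sigma, D=1.0):
    """Derivative of the log-posterior: likelihood score plus Normal prior term."""
    total = -(theta - mu) / sigma ** 2
    for u_i, (a, b, c, gamma) in zip(u, genes):
        p, slope = irf_and_slope(theta, a, b, c, gamma, D)
        total += (u_i - p) * slope / (p * (1.0 - p))
    return total


def map_theta(u, genes, mu=0.0, sigma=1.0, D=1.0, tol=1e-10, max_iter=50, h=1e-5):
    """Newton-Raphson search for the posterior mode, starting at the prior mean."""
    theta = mu
    for _ in range(max_iter):
        s = posterior_score(theta, u, genes, mu, sigma, D)
        # Finite-difference approximation of the second derivative (assumption).
        hess = (posterior_score(theta + h, u, genes, mu, sigma, D)
                - posterior_score(theta - h, u, genes, mu, sigma, D)) / (2.0 * h)
        if hess >= 0.0:
            raise ArithmeticError("log-posterior is not locally concave here")
        step = max(-1.0, min(1.0, -s / hess))  # limit the size of each update
        theta += step
        if abs(step) < tol:
            return theta
    raise ArithmeticError("Newton-Raphson did not converge")


# Hypothetical genes (a, b, c, gamma) and one response vector.
genes = [(1.0, -1.0, 0.2, 0.9), (1.0, 0.0, 0.2, 0.9),
         (1.0, 1.0, 0.2, 0.9), (1.0, 2.0, 0.2, 0.9)]
u = (1, 1, 0, 0)
theta_map = map_theta(u, genes, mu=0.0, sigma=1.0, D=1.7)
print(theta_map)
```

The prior term pulls the estimate toward the prior mean, so for the same responses the mode is closer to `mu` than the maximum likelihood estimate would be.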

Estimation by EAP

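The Bayes expected a posteriori (EAP) estimator is the expectation of the posterior distribution, which can be written as

$$g^*_{kj}(\theta) = \frac{P(u_{kj.} \mid \theta, \zeta)\, g(\theta \mid \eta_k)}{\int_{\mathbb{R}} P(u_{kj.} \mid \theta, \zeta)\, g(\theta \mid \eta_k)\, d\theta},$$

with $\zeta$ the gene parameters and $\eta_k$ the parameters of the prior. It follows that the estimator is given by

$$\hat{\theta}_{kj} = E[\theta \mid u_{kj.}, \zeta, \eta_k] = \frac{\int_{\mathbb{R}} \theta\, g(\theta \mid \eta_k)\, P(u_{kj.} \mid \theta, \zeta)\, d\theta}{\int_{\mathbb{R}} g(\theta \mid \eta_k)\, P(u_{kj.} \mid \theta, \zeta)\, d\theta}.$$

This form of estimation has the advantage of being computed directly, without requiring iterative methods.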
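Since the EAP estimate is a ratio of two integrals, it can be approximated by sums over a grid of predisposition values. The sketch below assumes the four-parameter logistic model and a Normal prior, and uses an equally spaced grid as a simple stand-in for a proper quadrature rule; the function names, grid settings and gene values are hypothetical.

```python
import math


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(u, theta, genes, D=1.0):
    """P(u | theta) under local independence."""
    prob = 1.0
    for u_i, (a, b, c, gamma) in zip(u, genes):
        p = irf_4pl(theta, a, b, c, gamma, D)
        prob *= p if u_i == 1 else 1.0 - p
    return prob


def eap_theta(u, genes, mu=0.0, sigma=1.0, D=1.0, n_points=81, width=5.0):
    """Posterior mean of theta, approximated on an equally spaced grid."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for q in range(n_points):
        theta = lo + q * step
        weight = likelihood(u, theta, genes, D) * normal_pdf(theta, mu, sigma)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


# Hypothetical genes (a, b, c, gamma).
genes = [(1.0, -1.0, 0.2, 0.9), (1.0, 0.0, 0.2, 0.9),
         (1.0, 1.0, 0.2, 0.9), (1.0, 2.0, 0.2, 0.9)]
print(eap_theta((1, 1, 1, 1), genes, D=1.7))
print(eap_theta((0, 0, 0, 0), genes, D=1.7))
```

Unlike the ML estimate, the EAP estimate is finite for every response vector, including those with all genes activated or all genes deactivated.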

Simulation Results

In this section we present an application of the proposed methodology to simulated data. The data were generated for N = 5000 individuals and n = 5 genes. The simulation consisted of 1000 replications. The known gene parameters are presented below. All calculations were done with a computer program developed by the authors in the Ox language (see Doornik, 1998), using the BFGS routine for maximization.

The Gene Parameters

To generate the data, the gene parameters were assumed to be those presented in Table 1. The 4-parameter logistic model with D = 1.7 was adopted. The values of parameter a (discrimination) varied from 0.8 (low discrimination) to 1.2 (high discrimination), and the values of parameter b (position, on the predisposition scale) varied from -0.5 to 3.0. For parameter c only one value (0.20) was considered, and for $\gamma$ only 0.9.
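One replication of such a data set could be generated along the following lines. This is a sketch in Python rather than the authors' Ox program. The gene values below are hypothetical: they only respect the ranges quoted above and are not the entries of Table 1. The predispositions are assumed to be drawn from a standard Normal distribution, and the function names and seed are arbitrary.

```python
import math
import random


def irf_4pl(theta, a, b, c, gamma, D=1.0):
    """Probability that a gene is activated under the 4-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (gamma - c) * logistic


def simulate_responses(genes, n_individuals, D=1.7, mu=0.0, sigma=1.0, seed=None):
    """Draw predispositions from N(mu, sigma^2) and 0/1 responses for each gene."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_individuals):
        theta = rng.gauss(mu, sigma)
        row = tuple(1 if rng.random() < irf_4pl(theta, a, b, c, gamma, D) else 0
                    for (a, b, c, gamma) in genes)
        data.append(row)
    return data


# Hypothetical (a, b, c, gamma) values inside the quoted ranges, not Table 1.
genes = [(0.8, -0.5, 0.2, 0.9), (0.9, 0.5, 0.2, 0.9), (1.0, 1.0, 0.2, 0.9),
         (1.1, 2.0, 0.2, 0.9), (1.2, 3.0, 0.2, 0.9)]
data = simulate_responses(genes, n_individuals=5000, seed=2004)

# Observed proportion of individuals with each gene activated.
proportions = [sum(row[i] for row in data) / len(data) for i in range(len(genes))]
print(proportions)
```

Repeating this with different seeds and re-estimating the parameters each time would give the replications from which averages and standard deviations of the estimates are computed.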

From Table 2 we see that the average estimates obtained from the 1000 replicates are very accurate for all genes. The estimation procedure works very well even with a relatively small number of genes. Results obtained with a larger number of genes were also very good. However, we expect estimation problems to appear when the number of genes is too small.

Table 3 presents the standard deviations obtained from the 1000 estimates. The largest values are associated with the parameters a and b. With the exception of gene 2, the values associated with parameter a are larger than those associated with b.

Concluding remarks

We have introduced a new proposal for gene and person diagnostics via Item Response Models. A simulation study showed that the models provide good estimates for several gene configurations. However, other studies and models should be proposed to allow, for example, for different levels of activity of the genes. Longitudinal models, along the lines of Tavares and Andrade (2004) and Andrade and Tavares (2004), should also be considered.

Acknowledgments

This work was partially supported by grants from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Coordenadoria para o Aperfeiçoamento de Pessoal de Nível Superior (CAPES), National Program of Excellency (PRONEX) contract No 76.97.1081.00, Thematic Project of Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) contract No 96/01741-7, Integrated Program for Teaching, Research and Extension (PROINT) of the UFPA, and PARD. It was partially developed when the first author was a doctoral student at the Department of Statistics of the University of São Paulo.

Received: September 22, 2003; Accepted: May 12, 2004

  • Andrade DF and Tavares HR (2004) Item response theory for longitudinal data: Population parameter estimation. Journal of Multivariate Analysis (to appear).
  • Andrade DF, Tavares HR and Valle RC (2000) Item Response Theory: Concepts and applications. Associação Brasileira de Estatística, São Paulo (in Portuguese).
  • Bock RD and Zimowski MF (1997) Multiple Group IRT. In: van der Linden WJ and Hambleton RK (eds) Handbook of Modern Item Response Theory. Springer-Verlag, New York.
  • Chow YS and Teicher H (1978) Probability Theory: Independence, Interchangeability, Martingales. Springer-Verlag, New York.
  • Doornik JA (1998) Object-Oriented Matrix Programming using Ox 2.0. Timberlake Consultants Ltd and Oxford, London. www.nuff.ox.ac.uk/Users/Doornik
  • Hambleton RK, Swaminathan H and Rogers HJ (1991) Fundamentals of Item Response Theory. Sage Publications, Newbury Park.
  • Lord FM (1980) Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates, Inc., Hillsdale.
  • Paas LJ (1998) Mokken scaling characteristic sets and acquisition patterns of durable and financial products. Journal of Economic Psychology 19:353-376.
  • Sanathanan L and Blumenthal N (1978) The logistic model and estimation of latent structure. Journal of the American Statistical Association 73:794-798.
  • Tavares HR and Andrade DF (2004) Item response theory for longitudinal data: Item and population ability parameters estimation. Test (to appear).
Publication in this collection: 14 Jan 2005. Date of issue: 2004.

    Sociedade Brasileira de Genética Rua Cap. Adelmio Norberto da Silva, 736, 14025-670 Ribeirão Preto SP Brazil, Tel.: (55 16) 3911-4130 / Fax.: (55 16) 3621-3552 - Ribeirão Preto - SP - Brazil
    E-mail: editor@gmb.org.br