ideas should be in papers: deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data

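This is the paper for the program we picked to find fusion genes as part of our RNA-Seq analysis. I had planned to read it after running the program all the way through, but the run stopped partway with an error. While tracking down the cause, I found that it was NA values in the input file of run_adaboost.R, so I am reading the paper in a hurry to decide whether those records can safely be deleted.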

Starting with the abstract..
deFuse uses all alignments rather than only uniquely aligned reads, and it looks for fusions anywhere in a gene rather than only at exon ends, which is said to make it more sensitive. To raise specificity, an adaboost classifier is trained with novel features confirmed by RT-PCR, which I can't make sense of until I read the main text. One way or another, they found gene fusions in ovarian cancer.

Gene fusions are said to come from double-stranded DNA breaks combined with DNA repair errors; for that, see this paper.

Anyway.. unlike other programs, deFuse also uses ambiguously aligning reads. It uses both split reads and discordant reads, but in a different order from existing programs: the discordant read analysis comes first, followed by a dynamic programming-based split read analysis. This order is said to be more sensitive.

The deFuse algorithm

First, some terminology.

fragment : a size-selected cDNA sequence from RNA-Seq library construction
read : the sequenced part of a fragment
insert sequence : the unsequenced part of a fragment, i.e. everything except the paired reads
fusion boundary : the nucleotide-level genomic positions of the break points on both sides of a gene fusion, i.e. the bp positions at which the two genes are joined
spanning reads : read pairs whose insert sequence (between the paired ends) contains the fusion boundary
split read : a read that contains the fusion boundary within the read itself
discordant alignment : the alignment of spanning reads and split reads. For spanning reads, the paired ends align to different genes; for a split read, one part aligns to the end portion of a gene and the remaining part does not align
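To keep spanning and split reads straight, here is a minimal Python sketch of my own (not from the paper). The function name and the boundary_offset parameter are hypothetical: boundary_offset is assumed to be the number of bases of the fragment that lie before the fusion boundary, and both reads are assumed to have the same length.

```python
def classify_fragment(fragment_len, read_len, boundary_offset):
    """Classify a paired-end fragment by where the fusion boundary falls.

    boundary_offset (hypothetical parameter): number of fragment bases
    that lie before the fusion boundary, counted from the fragment start.
    """
    if not 0 < boundary_offset < fragment_len:
        return "concordant"  # the boundary is not inside this fragment
    in_read1 = boundary_offset < read_len
    in_read2 = boundary_offset > fragment_len - read_len
    if in_read1 or in_read2:
        return "split"       # boundary is inside one of the sequenced ends
    return "spanning"        # boundary is inside the unsequenced insert
```

For a 300 bp fragment with 50 bp reads, a boundary at offset 150 sits in the insert (spanning), while one at offset 20 cuts through the first read (split).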

The figure below gives an overview of deFuse. It consists of 4 steps..

1. Align the reads to the reference. Both spliced and unspliced gene sequences are used as the reference, because in some cases a gene fusion leads to expression of an unspliced region such as an intron. Discordant alignments that represent the same fusion event are then clustered according to two criteria.
2. Select the most likely fusion event. Figure a below.
3. Find the split reads that feed the dynamic programming based solution used to locate the fusion boundary of each fusion event. Figure b below.
4. Test whether the spanning and split read evidence hold up. Using the fusion boundary found from the split reads, compute the putative fragment length of each paired end and check whether it deviates from the fragment length distribution. Then compute quantitative features and use the adaboost classifier to separate real fusions from false ones. Figure c.

Conditions for considering discordant alignments to have originated from reads spanning the same fusion boundary.
The very first thing the deFuse algorithm does is find the spanning reads that straddle the region where a fusion event occurred. So this step sets the conditions for selecting spanning reads.
The concordant paired ends give the fragment length distribution P(L); only fragments whose length lies within [lmin,lmax] are considered.

lmin : a/2 percentile of P(L)
lmax : (1-a/2) percentile of P(L)
a : proportion of paired end reads that are not guaranteed by the algorithm to be assigned to the correct fusion event
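As a rough illustration of how lmin and lmax could be read off the concordant fragment lengths, here is a small sketch. The function name, the default a=0.05 and the nearest-rank way of taking the quantile are my assumptions; the paper does not say how the percentile is computed.

```python
def fragment_length_bounds(lengths, a=0.05):
    """Return (lmin, lmax) as the a/2 and 1 - a/2 quantiles of the lengths.

    Uses a simple nearest-rank quantile on the sorted lengths (assumption).
    """
    ordered = sorted(lengths)
    n = len(ordered)
    lo_idx = round((a / 2) * (n - 1))
    hi_idx = round((1 - a / 2) * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]
```

With a larger a the window [lmin,lmax] gets narrower, so more read pairs fall outside the range the algorithm accounts for.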

Two conditions decide whether different discordant alignments are spanning reads that contain the same fusion boundary.

c1 : overlapping boundary region condition : the fusion boundary regions of two paired ends that came from the same gene fusion event must overlap. Figure c below
c2 : similar fragment length condition : the difference in length between the two fragments (=dx+dy) must be smaller than lmax-lmin.
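The two conditions can be sketched roughly as follows. This is my own reading, not the deFuse code: the Discordant class, the coordinate convention (fragment start on X, fragment end on Y, equal read lengths) and the way the boundary region is derived from lmax are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Discordant:
    x_start: int  # fragment start on transcript X (read 1 begins here)
    y_end: int    # fragment end on transcript Y (read 2 ends here)


def boundary_regions(aln, read_len, lmax):
    """Intervals on X and Y where the unknown fusion boundary could lie."""
    slack = lmax - 2 * read_len      # longest possible insert (assumption)
    x_lo = aln.x_start + read_len    # boundary lies past the end of read 1
    y_hi = aln.y_end - read_len      # and before the start of read 2
    return (x_lo, x_lo + slack), (y_hi - slack, y_hi)


def overlaps(r1, r2):
    """True if two closed intervals share at least one position."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def same_fusion(a, b, read_len, lmin, lmax):
    """Check conditions c1 and c2 for two discordant alignments."""
    ax, ay = boundary_regions(a, read_len, lmax)
    bx, by = boundary_regions(b, read_len, lmax)
    c1 = overlaps(ax, bx) and overlaps(ay, by)
    dx = b.x_start - a.x_start       # offset between the pairs on X
    dy = a.y_end - b.y_end           # offset between the pairs on Y
    c2 = abs(dx + dy) < lmax - lmin  # fragment lengths must be similar
    return c1 and c2
```

With this convention, for any common boundary the two inferred fragment lengths differ by exactly dx+dy, which is why c2 can be checked before the boundary is known.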

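To explain the figure below a bit more: a and d show the situation when the fusion boundary is known, while b, c and e illustrate the conditions above when the exact fusion boundary is not known. In reality the paired ends came from a model like a and d, but since the boundary is still unknown, what we can draw looks like b, c and e, and conditions c1 and c2 follow from there.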

Assigning a unique discordant alignment to each spanning read.
This step takes all the sets of discordant alignments made possible by the spanning reads (=valid clusters) and picks the valid clusters that become candidates for the split read search.
ambiguous alignment : when mapping against spliced & unspliced gene sequences as the reference, ambiguous alignments can arise from homology between genes, or from the same exons appearing in multiple splice variants (=isoforms) of one gene through alternative splicing.
Either way, the correct alignment has to be picked out of these ambiguous alignments.

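valid cluster : a set of discordant alignments in which every pair of paired ends satisfies conditions c1 and c2. In other words, it is a set of discordantly aligned paired ends whose members all overlap one another (condition c1) and whose fragment length differences stay below lmax-lmin (condition c2).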

Because of ambiguous alignments, the same paired end can belong to several valid clusters. So each ambiguously aligned paired end is assigned to a single valid cluster in a way that minimizes the number of fusion events (this is called the maximum parsimony solution; the supplementary material needs checking for the details). Every read and every valid cluster starts out as unselected; at each step of the algorithm the valid cluster holding the most reads is marked selected and the reads in it are marked assigned. Repeating this step picks out the valid clusters, which are called maximal valid clusters.
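The selection loop described above looks like a greedy set cover, so here is a minimal sketch under that assumption. The function name and the input layout (cluster id mapped to a set of read ids) are hypothetical, and counting only the still-unassigned reads at each step is my interpretation of "the most reads".

```python
def assign_reads(clusters):
    """Greedily assign every read to exactly one valid cluster.

    clusters: dict mapping a cluster id to the set of read ids it contains.
    Returns a dict {cluster_id: reads assigned to that cluster}.
    """
    unassigned = set().union(*clusters.values())
    remaining = dict(clusters)
    selected = {}
    while unassigned and remaining:
        # pick the cluster that covers the most still-unassigned reads
        best = max(remaining, key=lambda c: len(remaining[c] & unassigned))
        covered = remaining.pop(best) & unassigned
        if not covered:
            break  # no remaining cluster explains any unassigned read
        selected[best] = covered
        unassigned -= covered
    return selected
```

In the usage below, read 3 is ambiguous between clusters A and B; it goes to A because A is picked first, and B is kept only for read 4.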

Split read boundary sequence prediction.
This step finds the split reads expected to align to the region covered by the valid clusters found so far (=approximate fusion boundary), then aligns those split reads against the approximate fusion boundary with dynamic programming to locate the fusion boundary.
Once the steps so far have found the region where a fusion event is expected to have occurred, a targeted split read analysis is run to find the fusion boundary at bp resolution.

approximate fusion boundary : the intersection of the fusion boundary regions of the discordant alignments that belong to the same valid cluster. Figure a below
candidate split read : a paired-end read with one end anchored at the approximate fusion boundary
mate alignment region : for a candidate split read only one end of the pair is anchored, and the other, discordant end is the one of interest; this is the region where that unaligned end is expected to align, given where the anchored end aligned. Figure b below
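Since the approximate fusion boundary is defined as an intersection, it can be sketched in a couple of lines. Representing each fusion boundary region as a closed (start, end) interval on one transcript is my assumption, as is the function name.

```python
def approximate_boundary(regions):
    """Intersect (start, end) boundary regions from one valid cluster.

    Returns the shared interval, or None if the regions have no common part.
    """
    start = max(r[0] for r in regions)
    end = min(r[1] for r in regions)
    return (start, end) if start <= end else None
```

The more spanning reads a valid cluster has, the narrower this interval tends to get, which shrinks the region the split read search has to cover.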

The candidate split reads are the reads that fall where the mate alignment region and the approximate fusion boundary overlap.
The split read analysis proceeds by aligning the candidate split reads to the approximate fusion boundary. The read of a candidate split read that is expected to have been split by the fusion boundary (that is, the read opposite the anchored one) is aligned to the approximate fusion boundary of transcript X (=Sx), and aligned in reverse to the approximate fusion boundary of transcript Y (=Sy) (transcripts X and Y are the transcripts between which the gene fusion is expected to have occurred). Since the alignment uses dynamic programming, one matrix is produced for each of X and Y; call them Dx and Dy. A split in this alignment can be written as (ix,iy,j), where ix and iy are the bp positions of the fusion boundary in Sx and Sy, and j is the bp position within the read. This (ix,iy,j) is found with the formula below.

Here Dx and Dy must each exceed the threshold m_anchor, where m_anchor = m*n_anchor (m is the match score and n_anchor is the minimum number of bp that must align to Sx or Sy; this prevents a read from aligning almost entirely to just one of Sx or Sy).
One problem that can come up is several (ix,iy,j) sharing the same maximum. This is resolved by clustering the fusion boundaries (ix,iy) of many split reads (=clustering split reads that have the same (ix,iy)) and picking the cluster with the largest sum of read anchoring scores.
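To get a feel for the split search, here is a simplified sketch and not the deFuse implementation. Assumptions: the alignment is ungapped (the real one is a proper dynamic programming alignment), the split is taken to maximize the sum of the X-side and Y-side scores, the mismatch penalty of -1 and n_anchor=4 are arbitrary, and the function names are made up. The Y side is scored on reversed sequences to mirror the "align in reverse" idea.

```python
NEG = float("-inf")


def prefix_scores(ref, read, match=1, mismatch=-1):
    """D[i][j]: ungapped score of read[:j] aligned so that it ends at ref[:i]."""
    D = [[NEG] * (len(read) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        D[i][0] = 0
        for j in range(1, min(i, len(read)) + 1):
            step = match if ref[i - 1] == read[j - 1] else mismatch
            D[i][j] = D[i - 1][j - 1] + step
    return D


def find_split(sx, sy, read, match=1, mismatch=-1, n_anchor=4):
    """Return (ix, iy, j, score) with the best combined score, or None."""
    n = len(read)
    dx = prefix_scores(sx, read, match, mismatch)
    # reversing both sequences turns the read suffix into a prefix problem
    dy_rev = prefix_scores(sy[::-1], read[::-1], match, mismatch)
    m_anchor = match * n_anchor
    best = None
    for j in range(n + 1):
        # best place on Sx for read[:j] to end
        score_x, ix = max((dx[i][j], i) for i in range(len(sx) + 1))
        # best place on Sy for read[j:] to start (index k is in reversed Sy)
        score_y, k = max((dy_rev[i][n - j], i) for i in range(len(sy) + 1))
        if score_x < m_anchor or score_y < m_anchor:
            continue  # one side is not anchored well enough
        total = score_x + score_y
        if best is None or total > best[3]:
            best = (ix, len(sy) - k, j, total)
    return best
```

Here ix is the number of Sx bases before the boundary, iy is the index in Sy where the rest of the read starts, and j is the split position in the read. Ties between different (ix,iy,j) are not resolved in this sketch; that is the job of the clustering over many split reads described above.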

Corroborating spanning read and split read evidence.
This step confirms the final fusion boundary: based on the fusion boundary found by the split read alignment, the fragment length distribution of the spanning reads is compared with that of the concordant reads to judge whether the difference is significant.
It checks the agreement between the spanning read evidence and the split read evidence. First, the fusion boundary predicted by the split reads is used to infer the fragment length of each spanning read. The resulting spanning read fragment lengths (={li}) are then tested against the fragment length distribution from concordant paired ends (P(L)) with a z-test (under the assumption that {li} was drawn from P(L)).
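A minimal sketch of such a z-test, assuming P(L) is summarized by its mean mu and standard deviation sigma and that the test is two-sided (the notes above do not say which). The function names and the coordinate convention for inferring a fragment length are hypothetical.

```python
import math


def inferred_length(x_start, y_end, bx, by):
    """Fragment length implied by a predicted boundary (bx on X, by on Y)."""
    return (bx - x_start) + (y_end - by)


def fragment_length_z_test(lengths, mu, sigma):
    """Two-sided z-test that the spanning-read lengths {li} come from P(L)."""
    n = len(lengths)
    mean = sum(lengths) / n
    z = (mean - mu) / (sigma / math.sqrt(n))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return z, p
```

A small p-value would mean the spanning reads imply fragment lengths that do not fit the library, i.e. the split read boundary and the spanning read evidence disagree.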

----------------------------reference--------------------------------
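solid tumor : an abnormal mass of tissue that does not contain cysts or liquid. I'm not entirely sure, but it seems to mean a lump in which the tissue itself has grown abnormally, rather than one swollen by fluid. It can be malignant or benign. There are 3 types of solid tumor (sarcoma, carcinoma, lymphoma).
sarcomas are tumors that arise from connective tissue such as bone and muscle.
carcinomas form from glandular and epithelial cells. These are cells associated with the airways and the gastrointestinal tract. lymphomas would be the tumors associated with the lymph nodes.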

Friday, August 5, 2011
