1 00:00:00,539 --> 00:00:03,539 foreign 2 00:00:08,760 --> 00:00:14,580 like to welcome you to the third session 3 00:00:11,540 --> 00:00:16,400 in this uh this block of talks so Nick 4 00:00:14,580 --> 00:00:18,840 Moore will be talking about 5 00:00:16,400 --> 00:00:22,020 bioinformatics and his CountESS project 6 00:00:18,840 --> 00:00:24,539 and Nick Moore has been a contributor to 7 00:00:22,020 --> 00:00:27,000 MicroPython and has spoken at several 8 00:00:24,539 --> 00:00:28,920 conferences before including PyCon and 9 00:00:27,000 --> 00:00:32,840 linux.conf.au so 10 00:00:28,920 --> 00:00:32,840 please uh give a warm welcome to Nick 11 00:00:33,540 --> 00:00:36,140 hi everyone 12 00:00:36,600 --> 00:00:40,200 um yeah good day my name's Nick Moore 13 00:00:38,160 --> 00:00:42,600 I've previously talked at conferences 14 00:00:40,200 --> 00:00:46,079 like this about things like MicroPython 15 00:00:42,600 --> 00:00:47,700 um and about Django and things like that 16 00:00:46,079 --> 00:00:49,020 and now I'm currently working in 17 00:00:47,700 --> 00:00:50,940 bioinformatics 18 00:00:49,020 --> 00:00:53,219 um if you're considering changing around 19 00:00:50,940 --> 00:00:54,840 fields in this way I recommend it's a 20 00:00:53,219 --> 00:00:57,120 great way to constantly feel imposter 21 00:00:54,840 --> 00:00:58,980 syndrome for your entire life it's great 22 00:00:57,120 --> 00:01:02,399 but it's a really interesting project 23 00:00:58,980 --> 00:01:04,619 and I hope you'll find it a lot of fun like 24 00:01:02,399 --> 00:01:05,339 I'm finding it a lot of fun 25 00:01:04,619 --> 00:01:06,540 um 26 00:01:05,339 --> 00:01:07,680 I'm going to start the talk by 27 00:01:06,540 --> 00:01:09,240 introducing some concepts about 28 00:01:07,680 --> 00:01:12,060 bioinformatics just so you have some 29 00:01:09,240 --> 00:01:15,060 idea of what the hell I'm talking about 30 00:01:12,060 --> 00:01:17,040 um this work is funded by Walter and 31 
00:01:15,060 --> 00:01:17,760 Eliza Hall Institute 32 00:01:17,040 --> 00:01:20,400 um 33 00:01:17,760 --> 00:01:22,920 and also I'd like to thank uh some of my 34 00:01:20,400 --> 00:01:24,479 collaborators from Brotman Baty 35 00:01:22,920 --> 00:01:27,180 um Institute and the University of 36 00:01:24,479 --> 00:01:29,700 Washington Genome Sciences uh who are 37 00:01:27,180 --> 00:01:32,340 helping out because I'm a software 38 00:01:29,700 --> 00:01:35,340 developer not a scientist certainly not 39 00:01:32,340 --> 00:01:36,960 a bioinformatician so the only way to 40 00:01:35,340 --> 00:01:38,520 make a project like this work is to be 41 00:01:36,960 --> 00:01:41,159 constantly working hand in hand with 42 00:01:38,520 --> 00:01:42,960 people who actually do the science 43 00:01:41,159 --> 00:01:45,180 um content warnings on this presentation 44 00:01:42,960 --> 00:01:47,759 we briefly mention cancer we briefly 45 00:01:45,180 --> 00:01:49,560 mention COVID-19 we do not go into any 46 00:01:47,759 --> 00:01:51,479 kind of gross medical details this is 47 00:01:49,560 --> 00:01:53,460 about numbers 48 00:01:51,479 --> 00:01:54,960 um not about the actual diseases 49 00:01:53,460 --> 00:01:56,280 themselves 50 00:01:54,960 --> 00:01:58,439 um 51 00:01:56,280 --> 00:02:01,560 so bioinformatics in nine and a half 52 00:01:58,439 --> 00:02:03,180 minutes this will be cool 53 00:02:01,560 --> 00:02:04,619 um we'll move through it fairly quickly 54 00:02:03,180 --> 00:02:06,180 uh 55 00:02:04,619 --> 00:02:08,399 there'll be a chance to chat about this 56 00:02:06,180 --> 00:02:10,140 stuff over lunch or whatever I've got 57 00:02:08,399 --> 00:02:11,400 the green say hi sticker and I really 58 00:02:10,140 --> 00:02:13,020 mean it if you want to talk about this 59 00:02:11,400 --> 00:02:14,520 stuff all lunchtime we can talk about 60 00:02:13,020 --> 00:02:16,620 this stuff also my colleague Alan's 61 00:02:14,520 --> 00:02:19,739 sitting right there so you 
can bug him 62 00:02:16,620 --> 00:02:21,959 about bioinformatics things too 63 00:02:19,739 --> 00:02:23,400 so approximately nine and a half minutes 64 00:02:21,959 --> 00:02:25,800 or we're going to go through fairly 65 00:02:23,400 --> 00:02:27,060 approximately about bioinformatics 66 00:02:25,800 --> 00:02:29,580 starting with what the hell is 67 00:02:27,060 --> 00:02:32,099 bioinformatics it's the analysis of 68 00:02:29,580 --> 00:02:34,080 biological data so we get lots and lots 69 00:02:32,099 --> 00:02:36,120 of data from biological processes how do 70 00:02:34,080 --> 00:02:38,420 we analyze them arguably it got started 71 00:02:36,120 --> 00:02:40,379 in about 1952 72 00:02:38,420 --> 00:02:41,760 when we first realized you could 73 00:02:40,379 --> 00:02:43,500 sequence proteins you could break 74 00:02:41,760 --> 00:02:45,800 proteins down to their building blocks 75 00:02:43,500 --> 00:02:47,540 and sequence them and so Frederick Sanger 76 00:02:45,800 --> 00:02:49,860 sequenced 77 00:02:47,540 --> 00:02:51,959 insulin so we actually properly 78 00:02:49,860 --> 00:02:55,080 understood how insulin worked in the 79 00:02:51,959 --> 00:02:57,060 body etc etc and Alan Turing of all 80 00:02:55,080 --> 00:02:58,739 people decided to take some time off from 81 00:02:57,060 --> 00:03:01,940 inventing computing to also invent 82 00:02:58,739 --> 00:03:05,040 bioinformatics just in his spare time 83 00:03:01,940 --> 00:03:07,319 in a paper that kind of derived the 84 00:03:05,040 --> 00:03:09,720 chemical basis of morphogenesis 85 00:03:07,319 --> 00:03:12,060 before anyone knew how any of the actual 86 00:03:09,720 --> 00:03:13,319 parts worked he'd already written the 87 00:03:12,060 --> 00:03:15,239 paper that turned out to be largely 88 00:03:13,319 --> 00:03:16,739 correct once we found out that we could 89 00:03:15,239 --> 00:03:18,659 read the parts it's a pretty amazing 90 00:03:16,739 --> 00:03:20,159 work it really got Computing once 91 
00:03:18,659 --> 00:03:22,260 computers got invented to do a lot of 92 00:03:20,159 --> 00:03:24,920 the hard work for us Margaret Dayhoff 93 00:03:22,260 --> 00:03:27,840 was a very early pioneer of this stuff 94 00:03:24,920 --> 00:03:30,180 writing software in Fortran to actually 95 00:03:27,840 --> 00:03:32,760 align proteins or subsequences and work 96 00:03:30,180 --> 00:03:34,200 out how they would all fit together and 97 00:03:32,760 --> 00:03:36,599 really interestingly this Protein 98 00:03:34,200 --> 00:03:40,860 Information Resource thing is a very 99 00:03:36,599 --> 00:03:42,060 early free online database 1984 and 100 00:03:40,860 --> 00:03:44,640 people are already going hey information 101 00:03:42,060 --> 00:03:46,260 wants to be free especially if it's 102 00:03:44,640 --> 00:03:47,940 health information it's really useful 103 00:03:46,260 --> 00:03:49,680 information for researchers I think 104 00:03:47,940 --> 00:03:52,019 that's an amazing thing perhaps 105 00:03:49,680 --> 00:03:54,659 unsurprisingly given how close they are 106 00:03:52,019 --> 00:03:56,459 in our time of science there's a lot of 107 00:03:54,659 --> 00:03:57,659 parallels between biological sciences 108 00:03:56,459 --> 00:04:00,060 and computer science and I'll talk a lot 109 00:03:57,659 --> 00:04:02,280 about that here because a lot of you are 110 00:04:00,060 --> 00:04:04,500 software people 111 00:04:02,280 --> 00:04:06,060 cell biology 112 00:04:04,500 --> 00:04:07,920 there's a lot we still don't know about 113 00:04:06,060 --> 00:04:09,120 cell biology even though we're made of 114 00:04:07,920 --> 00:04:10,680 cells 115 00:04:09,120 --> 00:04:12,480 um that's mostly because it happens 116 00:04:10,680 --> 00:04:15,120 inside the cells and what happens inside 117 00:04:12,480 --> 00:04:17,160 the cells stays inside the cells 118 00:04:15,120 --> 00:04:18,840 once you pull them apart to find out how 119 00:04:17,160 --> 00:04:20,459 they work they're generally 
dead and not 120 00:04:18,840 --> 00:04:22,620 so interesting anymore 121 00:04:20,459 --> 00:04:25,320 um but experiments let us make theories 122 00:04:22,620 --> 00:04:27,660 about how they must work internally so 123 00:04:25,320 --> 00:04:29,220 you can change the inputs you can see 124 00:04:27,660 --> 00:04:31,259 what the cell does 125 00:04:29,220 --> 00:04:33,240 generally dies but you know you can 126 00:04:31,259 --> 00:04:35,580 measure that and make some conclusions 127 00:04:33,240 --> 00:04:37,860 about how the cell must be working it's 128 00:04:35,580 --> 00:04:40,199 a lot like black box testing of 129 00:04:37,860 --> 00:04:43,259 software where you don't need to look 130 00:04:40,199 --> 00:04:44,880 inside the box you can still test how 131 00:04:43,259 --> 00:04:48,060 it's behaving just by changing the 132 00:04:44,880 --> 00:04:49,979 inputs and monitoring the outputs 133 00:04:48,060 --> 00:04:52,199 the human genome which I'm sure is a 134 00:04:49,979 --> 00:04:53,759 term you've heard it's the program that 135 00:04:52,199 --> 00:04:55,380 all of your cells run all of them run 136 00:04:53,759 --> 00:04:57,360 the same program it's a bit like having 137 00:04:55,380 --> 00:05:00,180 lots of things running the same 138 00:04:57,360 --> 00:05:02,520 container image but different parts of 139 00:05:00,180 --> 00:05:04,259 the code get run depending on what that 140 00:05:02,520 --> 00:05:06,660 particular cell is doing 141 00:05:04,259 --> 00:05:08,699 there's 23 chromosomes in there give or 142 00:05:06,660 --> 00:05:11,720 take each of them is a very long 143 00:05:08,699 --> 00:05:15,120 molecule like a long long long long tape 144 00:05:11,720 --> 00:05:18,180 of 100 million or so building blocks 145 00:05:15,120 --> 00:05:20,280 called nucleotides and there's four of 146 00:05:18,180 --> 00:05:21,360 those so they make up an enormous long 147 00:05:20,280 --> 00:05:24,360 tape 148 00:05:21,360 --> 00:05:26,039 eerily eerily like 
a Turing machine when 149 00:05:24,360 --> 00:05:27,900 it comes down to it 150 00:05:26,039 --> 00:05:29,820 um all of you so I said that already all 151 00:05:27,900 --> 00:05:31,500 of your cells have the same genome how 152 00:05:29,820 --> 00:05:33,660 it's expressed depends on what the 153 00:05:31,500 --> 00:05:34,440 cell's job is 154 00:05:33,660 --> 00:05:36,900 um 155 00:05:34,440 --> 00:05:38,580 these days we can actually read the DNA 156 00:05:36,900 --> 00:05:41,160 there's a technique called nanopore 157 00:05:38,580 --> 00:05:43,139 electrophoresis which basically sucks 158 00:05:41,160 --> 00:05:45,000 the DNA molecule through a teeny tiny 159 00:05:43,139 --> 00:05:47,060 hole a lot like a small child eating 160 00:05:45,000 --> 00:05:49,919 spaghetti in a disgusting way 161 00:05:47,060 --> 00:05:51,780 and it reads the molecule one base at 162 00:05:49,919 --> 00:05:54,600 a time and so we can actually read the 163 00:05:51,780 --> 00:05:55,800 binary this is pretty exciting and using 164 00:05:54,600 --> 00:05:57,419 some other techniques that you might 165 00:05:55,800 --> 00:05:58,979 have heard of things like CRISPR and 166 00:05:57,419 --> 00:06:00,780 things like that we can actually write 167 00:05:58,979 --> 00:06:03,960 the binary we can hack little changes 168 00:06:00,780 --> 00:06:06,300 into the binaries which is way cool so 169 00:06:03,960 --> 00:06:07,880 we can read and write the binaries so of 170 00:06:06,300 --> 00:06:11,580 course we understand everything 171 00:06:07,880 --> 00:06:13,199 and not so much a bit like in software 172 00:06:11,580 --> 00:06:15,539 reading and writing the binaries is 173 00:06:13,199 --> 00:06:17,160 great but it's not the only thing we do 174 00:06:15,539 --> 00:06:18,120 know a bit about the syntax of the 175 00:06:17,160 --> 00:06:20,880 genome 176 00:06:18,120 --> 00:06:23,100 genes code for proteins they're wrapped 177 00:06:20,880 --> 00:06:24,960 in kind of a header and a footer it's 
178 00:06:23,100 --> 00:06:27,060 getting eerie isn't it that sort of 179 00:06:24,960 --> 00:06:28,979 regulate how much they get run and 180 00:06:27,060 --> 00:06:31,500 things like that some of them are tiny 181 00:06:28,979 --> 00:06:33,360 they have one little protein hundreds of 182 00:06:31,500 --> 00:06:36,180 base pairs in the binary 183 00:06:33,360 --> 00:06:38,100 others of them are huge they have 184 00:06:36,180 --> 00:06:40,440 dozens of proteins millions of bases 185 00:06:38,100 --> 00:06:44,180 they're like those enormous classes you 186 00:06:40,440 --> 00:06:44,180 see in some bits of software 187 00:06:44,220 --> 00:06:46,500 um 188 00:06:44,880 --> 00:06:49,319 and we know that they're sort of wrapped 189 00:06:46,500 --> 00:06:51,539 up in this way within the chromosomes 190 00:06:49,319 --> 00:06:53,699 there are no debug symbols and no 191 00:06:51,539 --> 00:06:55,800 comments it's very inconvenient 192 00:06:53,699 --> 00:06:57,600 um what we can do is we can look at the 193 00:06:55,800 --> 00:06:59,940 genes and we can go we think we know 194 00:06:57,600 --> 00:07:01,800 what this does we'll give it a name we 195 00:06:59,940 --> 00:07:03,060 can just assign them a name it's very 196 00:07:01,800 --> 00:07:05,340 much like what you do if you're ever 197 00:07:03,060 --> 00:07:07,199 reverse engineering a binary you go 198 00:07:05,340 --> 00:07:09,539 through with a pencil if you're very old 199 00:07:07,199 --> 00:07:12,300 or with Ghidra if you're rather younger 200 00:07:09,539 --> 00:07:14,220 and you assign a thing and you go this 201 00:07:12,300 --> 00:07:15,479 function I think does this so I'll give 202 00:07:14,220 --> 00:07:16,160 it a name 203 00:07:15,479 --> 00:07:19,139 um 204 00:07:16,160 --> 00:07:21,419 we know of about 20,000 genes that 205 00:07:19,139 --> 00:07:22,819 encode proteins we think we know what 206 00:07:21,419 --> 00:07:25,919 they do 207 00:07:22,819 --> 00:07:28,020 that totals about three 
percent of the 208 00:07:25,919 --> 00:07:30,180 entire genome 209 00:07:28,020 --> 00:07:31,380 we know from other stuff there's a 210 00:07:30,180 --> 00:07:34,919 reference in my things which I'll send 211 00:07:31,380 --> 00:07:36,419 you later that 80% of it at least does 212 00:07:34,919 --> 00:07:38,580 something 213 00:07:36,419 --> 00:07:40,680 so there's a fair way to go I think it's 214 00:07:38,580 --> 00:07:42,960 fair to say to say we completely 215 00:07:40,680 --> 00:07:45,360 understand the genome well no 216 00:07:42,960 --> 00:07:46,740 people have said that we can download 217 00:07:45,360 --> 00:07:48,720 the whole thing we don't necessarily 218 00:07:46,740 --> 00:07:51,539 understand the whole thing 219 00:07:48,720 --> 00:07:53,099 okay so I'm about to talk a lot about 220 00:07:51,539 --> 00:07:54,419 gene variants so let's just talk about 221 00:07:53,099 --> 00:07:55,199 what that is 222 00:07:54,419 --> 00:07:56,720 um 223 00:07:55,199 --> 00:07:58,740 those little 224 00:07:56,720 --> 00:08:01,400 symbols on the tape are called 225 00:07:58,740 --> 00:08:04,080 nucleotides or bases 226 00:08:01,400 --> 00:08:06,180 and they vary everyone's genome is 227 00:08:04,080 --> 00:08:08,039 different there's this concept of a 228 00:08:06,180 --> 00:08:09,419 reference genome which is like what is 229 00:08:08,039 --> 00:08:12,120 normal 230 00:08:09,419 --> 00:08:14,819 no one has that actual genome that's 231 00:08:12,120 --> 00:08:16,740 just like a reference it's a typical 232 00:08:14,819 --> 00:08:19,139 genome 233 00:08:16,740 --> 00:08:20,400 um at one point one person did have that 234 00:08:19,139 --> 00:08:21,900 genome I think which was the guy who 235 00:08:20,400 --> 00:08:24,720 started the Human Genome Project wasn't 236 00:08:21,900 --> 00:08:26,060 it his personal sequence no anyway 237 00:08:24,720 --> 00:08:29,460 um 238 00:08:26,060 --> 00:08:30,960 so everyone's different you know in some 239 00:08:29,460 --> 
00:08:32,399 way or another the most common kind of 240 00:08:30,960 --> 00:08:34,560 variant is just like a single base 241 00:08:32,399 --> 00:08:36,599 change somewhere there flip a bit on and 242 00:08:34,560 --> 00:08:37,800 off if it's a really common thing that 243 00:08:36,599 --> 00:08:39,360 happens in people we call it a 244 00:08:37,800 --> 00:08:41,760 polymorphism 245 00:08:39,360 --> 00:08:43,500 um because it's you know many shapes if 246 00:08:41,760 --> 00:08:45,440 it's rare we call it a variant because 247 00:08:43,500 --> 00:08:47,760 it's a little bit more scary 248 00:08:45,440 --> 00:08:49,080 some of these variants are really easily 249 00:08:47,760 --> 00:08:52,140 observable 250 00:08:49,080 --> 00:08:54,959 um infant lactose intolerance and adult 251 00:08:52,140 --> 00:08:58,140 lactase persistence are both variants in 252 00:08:54,959 --> 00:09:00,480 LCT and MCM6 genes 253 00:08:58,140 --> 00:09:02,779 um infant lactose intolerance we call a 254 00:09:00,480 --> 00:09:05,459 variant because it's rare and it's scary 255 00:09:02,779 --> 00:09:07,680 adult lactase persistence we call a 256 00:09:05,459 --> 00:09:09,540 polymorphism because a lot of people 257 00:09:07,680 --> 00:09:11,959 have that that's where you can still 258 00:09:09,540 --> 00:09:14,519 drink milk as an adult 259 00:09:11,959 --> 00:09:15,779 or hair color is another example of a 260 00:09:14,519 --> 00:09:18,360 gene variant that we kind of understand 261 00:09:15,779 --> 00:09:20,220 really well well enough that we can look 262 00:09:18,360 --> 00:09:23,640 at the human genome recovered from a 263 00:09:20,220 --> 00:09:26,339 5700-year-old bit of chewing gum and 264 00:09:23,640 --> 00:09:28,140 tell what color hair the woman who 265 00:09:26,339 --> 00:09:30,240 chewed that gum had as well as her 266 00:09:28,140 --> 00:09:32,760 skin color eye color and also she had 267 00:09:30,240 --> 00:09:35,940 duck and I think it was chestnuts for 268 00:09:32,760 --> 
00:09:38,399 lunch because we can read all that DNA 269 00:09:35,940 --> 00:09:39,779 out of a piece of 5700-year-old chewing 270 00:09:38,399 --> 00:09:41,760 gum 271 00:09:39,779 --> 00:09:44,519 we've also found correlations between 272 00:09:41,760 --> 00:09:48,360 people with severe COVID symptoms and 273 00:09:44,519 --> 00:09:50,640 long COVID and a particular variant on 274 00:09:48,360 --> 00:09:53,880 FOXP4 which is another gene that seems 275 00:09:50,640 --> 00:09:55,440 to be involved in lung stuff so that's 276 00:09:53,880 --> 00:09:58,019 really interesting and might lead to 277 00:09:55,440 --> 00:10:00,980 some understanding of how COVID works 278 00:09:58,019 --> 00:10:03,720 and how to help people with long COVID 279 00:10:00,980 --> 00:10:07,200 that's a correlation they found we don't 280 00:10:03,720 --> 00:10:09,660 actually know necessarily why yet 281 00:10:07,200 --> 00:10:12,180 there are more subtle problems as well 282 00:10:09,660 --> 00:10:12,839 in cancers 283 00:10:12,180 --> 00:10:15,120 um 284 00:10:12,839 --> 00:10:16,940 cells are damaged by radiation or 285 00:10:15,120 --> 00:10:19,800 chemicals or just bad luck 286 00:10:16,940 --> 00:10:22,320 cell repair mechanisms then can either 287 00:10:19,800 --> 00:10:24,320 fix that damage or kill the cell off if 288 00:10:22,320 --> 00:10:27,180 it's unfixable 289 00:10:24,320 --> 00:10:31,019 those cell repair mechanisms have genes 290 00:10:27,180 --> 00:10:32,580 they're created by genes those genes 291 00:10:31,019 --> 00:10:34,440 can have variants and some of those 292 00:10:32,580 --> 00:10:36,440 variants can reduce the effectiveness of 293 00:10:34,440 --> 00:10:39,600 those repair mechanisms 294 00:10:36,440 --> 00:10:42,420 it's much harder to find these variants 295 00:10:39,600 --> 00:10:44,100 or how important they are because these 296 00:10:42,420 --> 00:10:45,899 are rare things and people's lives are 297 00:10:44,100 --> 00:10:47,940 complicated and it's a 
little bit like 298 00:10:45,899 --> 00:10:49,200 looking for a bug in an exception 299 00:10:47,940 --> 00:10:52,019 handler 300 00:10:49,200 --> 00:10:54,540 in live code that's already exposed to 301 00:10:52,019 --> 00:10:58,440 the Internet it's very hard to simulate 302 00:10:54,540 --> 00:10:59,760 this sort of stuff in the real world 303 00:10:58,440 --> 00:11:02,279 um 304 00:10:59,760 --> 00:11:04,500 so which brings me to multiplexed assays 305 00:11:02,279 --> 00:11:06,060 of variant effect and that's my rodent 306 00:11:04,500 --> 00:11:08,160 of unusual size which is what the thing 307 00:11:06,060 --> 00:11:09,240 always reminds me of 308 00:11:08,160 --> 00:11:12,959 um 309 00:11:09,240 --> 00:11:14,940 there's a few thousand known variants of 310 00:11:12,959 --> 00:11:16,200 these sorts of cellular repair genes and 311 00:11:14,940 --> 00:11:17,519 there's a lot more we don't know about 312 00:11:16,200 --> 00:11:19,320 because we've never seen them clinically 313 00:11:17,519 --> 00:11:20,399 they've never happened in a patient who 314 00:11:19,320 --> 00:11:23,160 happened 315 00:11:20,399 --> 00:11:24,600 to have a particular cancer of a type 316 00:11:23,160 --> 00:11:26,399 that saw them go to the hospital and 317 00:11:24,600 --> 00:11:28,260 have their genes sequenced and so on and 318 00:11:26,399 --> 00:11:30,540 so forth we haven't seen them clinically 319 00:11:28,260 --> 00:11:31,560 some of them might be serious a lot of 320 00:11:30,540 --> 00:11:34,019 them are going to be just background 321 00:11:31,560 --> 00:11:36,240 noise so they're not that important but 322 00:11:34,019 --> 00:11:38,760 we want to know which ones are which 323 00:11:36,240 --> 00:11:40,740 so multiplexed assays take a different 324 00:11:38,760 --> 00:11:42,720 approach they say well let's make all 325 00:11:40,740 --> 00:11:43,980 the possible variants let's get those 326 00:11:42,720 --> 00:11:46,140 machines that can read and write the 327 00:11:43,980 --> 
00:11:47,640 binary like I was mentioning before and 328 00:11:46,140 --> 00:11:49,800 we'll write all the possible variants 329 00:11:47,640 --> 00:11:51,480 and we'll make an experiment on all the 330 00:11:49,800 --> 00:11:54,600 possible variants and we'll measure 331 00:11:51,480 --> 00:11:56,579 their we'll sort of score them how 332 00:11:54,600 --> 00:11:57,839 dangerous is this basically the outcome 333 00:11:56,579 --> 00:11:59,820 is the same we want to know which 334 00:11:57,839 --> 00:12:01,260 variants you should get worried about 335 00:11:59,820 --> 00:12:03,720 and which variants you shouldn't because 336 00:12:01,260 --> 00:12:05,700 that way you can tailor treatments to 337 00:12:03,720 --> 00:12:07,860 people's individual 338 00:12:05,700 --> 00:12:09,300 genetics 339 00:12:07,860 --> 00:12:11,820 um 340 00:12:09,300 --> 00:12:14,399 there's a resource that Alan talked 341 00:12:11,820 --> 00:12:17,880 about on Friday called MaveDB that exists 342 00:12:14,399 --> 00:12:20,519 to store and kind of retrieve these 343 00:12:17,880 --> 00:12:22,500 sort of result sets so that people can 344 00:12:20,519 --> 00:12:24,540 compare them and use them and so on and 345 00:12:22,500 --> 00:12:26,480 so forth so look that one up on video if 346 00:12:24,540 --> 00:12:28,680 you want that 347 00:12:26,480 --> 00:12:30,000 I'll talk briefly about a couple of 348 00:12:28,680 --> 00:12:31,140 techniques that are used for these 349 00:12:30,000 --> 00:12:33,120 because they're important to the 350 00:12:31,140 --> 00:12:35,000 software we're developing one of them is 351 00:12:33,120 --> 00:12:38,040 a thing called saturation genome editing 352 00:12:35,000 --> 00:12:41,220 this particular paper studies about 353 00:12:38,040 --> 00:12:43,019 4,000 single base variants of BRCA1 looks 354 00:12:41,220 --> 00:12:45,300 at the cell growth assessed after 11 355 00:12:43,019 --> 00:12:47,880 days I think it is and has 356 00:12:45,300 --> 00:12:50,760 a 
it basically tries all these different 357 00:12:47,880 --> 00:12:53,700 possible variants and assesses how much 358 00:12:50,760 --> 00:12:55,139 they have affected the cells it has good 359 00:12:53,700 --> 00:12:56,880 correlation with the variants we know of 360 00:12:55,139 --> 00:12:58,860 clinically so it looks like it's really 361 00:12:56,880 --> 00:13:00,600 valuable work and it identifies some new 362 00:12:58,860 --> 00:13:02,579 pathogenic variants some new dangerous 363 00:13:00,600 --> 00:13:06,060 variants so that could be really really 364 00:13:02,579 --> 00:13:07,440 important for future medicine for 365 00:13:06,060 --> 00:13:09,000 treatment if you happen to have one of 366 00:13:07,440 --> 00:13:11,700 those variants that might be very 367 00:13:09,000 --> 00:13:14,639 important to your life 368 00:13:11,700 --> 00:13:16,620 this shows a little chart of the 369 00:13:14,639 --> 00:13:18,959 outcome of that study where you can see 370 00:13:16,620 --> 00:13:21,300 that a huge number of these variants on 371 00:13:18,959 --> 00:13:23,339 BRCA1 really they have a score 372 00:13:21,300 --> 00:13:25,139 around zero they didn't make much of a 373 00:13:23,339 --> 00:13:26,700 difference 374 00:13:25,139 --> 00:13:28,260 um there's a bunch more that have a very 375 00:13:26,700 --> 00:13:30,720 bad score they've made a big difference 376 00:13:28,260 --> 00:13:34,320 BRCA1 no longer worked sorry it says up 377 00:13:30,720 --> 00:13:37,139 there LOF which means loss of function 378 00:13:34,320 --> 00:13:39,660 um the BRCA1 cellular repair mechanism 379 00:13:37,139 --> 00:13:41,040 doesn't work anymore and there's a bunch 380 00:13:39,660 --> 00:13:43,740 of intermediate ones which is really 381 00:13:41,040 --> 00:13:46,320 interesting those are things that 382 00:13:43,740 --> 00:13:49,519 um it works still but maybe not as well 383 00:13:46,320 --> 00:13:51,779 as it should similar work is underway on 384 00:13:49,519 --> 
00:13:54,779 similar cellular repair mechanisms such 385 00:13:51,779 --> 00:13:56,339 as BRCA1 and PALB2 which are a similar 386 00:13:54,779 --> 00:13:58,800 kind of mechanisms and also very 387 00:13:56,339 --> 00:14:00,240 important for cancer research 388 00:13:58,800 --> 00:14:02,100 another technique I just want to talk 389 00:14:00,240 --> 00:14:04,079 about briefly is a thing called VAMP-seq 390 00:14:02,100 --> 00:14:05,579 which is really cool I mostly want to 391 00:14:04,079 --> 00:14:08,519 talk about it because it takes jellyfish 392 00:14:05,579 --> 00:14:10,500 proteins and attaches them to a genome 393 00:14:08,519 --> 00:14:12,720 basically you make cells of every 394 00:14:10,500 --> 00:14:15,060 possible variant of a particular gene 395 00:14:12,720 --> 00:14:17,940 you attach a fluorescent marker protein 396 00:14:15,060 --> 00:14:19,800 to that um 397 00:14:17,940 --> 00:14:21,959 gene as well so it's expressed along 398 00:14:19,800 --> 00:14:24,600 with the thing that you're doing and 399 00:14:21,959 --> 00:14:26,700 then you see how those cells reproduce 400 00:14:24,600 --> 00:14:28,560 the more the gene is expressed the more 401 00:14:26,700 --> 00:14:30,540 the cell glows and then you actually 402 00:14:28,560 --> 00:14:32,700 physically sort the cells into bins 403 00:14:30,540 --> 00:14:35,040 based on how much they glow and then you 404 00:14:32,700 --> 00:14:36,779 sequence each bin 405 00:14:35,040 --> 00:14:38,579 um sorry then you sequence each bin and 406 00:14:36,779 --> 00:14:41,459 you count the population of each 407 00:14:38,579 --> 00:14:42,720 sequence and from that you can derive a 408 00:14:41,459 --> 00:14:45,120 score 409 00:14:42,720 --> 00:14:47,459 the important bit about this is you can 410 00:14:45,120 --> 00:14:51,540 derive then a heat map of which 411 00:14:47,459 --> 00:14:52,860 variants where on the gene matter the 412 00:14:51,540 --> 00:14:54,420 most 413 00:14:52,860 --> 00:14:55,980 um it's a 
very cost effective technique 414 00:14:54,420 --> 00:14:57,120 because instead of running thousands of 415 00:14:55,980 --> 00:14:57,839 separate 416 00:14:57,120 --> 00:15:00,000 um 417 00:14:57,839 --> 00:15:01,740 sort of little tests you're 418 00:15:00,000 --> 00:15:02,940 putting things in big bins at a time 419 00:15:01,740 --> 00:15:05,160 which makes it a little bit more 420 00:15:02,940 --> 00:15:08,100 effective currently being used to 421 00:15:05,160 --> 00:15:09,540 investigate some Factor 9 hemophilia 422 00:15:08,100 --> 00:15:12,120 which is really interesting under 423 00:15:09,540 --> 00:15:13,560 various conditions 424 00:15:12,120 --> 00:15:15,060 um so that's a completely different 425 00:15:13,560 --> 00:15:16,440 thing now you've probably noticed that 426 00:15:15,060 --> 00:15:17,820 the technique here is completely 427 00:15:16,440 --> 00:15:19,800 different to the technique I talked 428 00:15:17,820 --> 00:15:20,639 about before SGE it's we have 429 00:15:19,800 --> 00:15:22,500 different inputs we have different 430 00:15:20,639 --> 00:15:24,300 outputs we have different maths scoring 431 00:15:22,500 --> 00:15:25,560 is different etc they're using some of 432 00:15:24,300 --> 00:15:28,079 the same tech 433 00:15:25,560 --> 00:15:30,060 but there's some big differences 434 00:15:28,079 --> 00:15:31,680 so the software I'm writing isn't about 435 00:15:30,060 --> 00:15:33,600 that stuff at all the software I'm 436 00:15:31,680 --> 00:15:35,639 writing is about counting things now 437 00:15:33,600 --> 00:15:36,440 computers are really good at counting 438 00:15:35,639 --> 00:15:38,339 things 439 00:15:36,440 --> 00:15:40,260 they've been really good at counting 440 00:15:38,339 --> 00:15:42,300 things ever since they got fed punch 441 00:15:40,260 --> 00:15:44,040 cards and 442 00:15:42,300 --> 00:15:45,839 um we had things called tabulating 443 00:15:44,040 --> 00:15:47,399 machines and so on 444 00:15:45,839 --> 
00:15:49,680 they're really great at counting things 445 00:15:47,399 --> 00:15:51,540 so this is really quite a simple problem 446 00:15:49,680 --> 00:15:53,639 to solve obviously 447 00:15:51,540 --> 00:15:56,519 um from the paper I've just 448 00:15:53,639 --> 00:16:00,180 mentioned on SGE we have data for 13 449 00:15:56,519 --> 00:16:03,540 exons over 163 files we have 350 million 450 00:16:00,180 --> 00:16:06,660 sequences we have 66 billion bases in 451 00:16:03,540 --> 00:16:08,699 those sequences which isn't a ridiculous 452 00:16:06,660 --> 00:16:11,940 amount of data I mean it's a lot 453 00:16:08,699 --> 00:16:13,740 but it's not ridiculous right but those 454 00:16:11,940 --> 00:16:16,019 data sizes are only you know they're 455 00:16:13,740 --> 00:16:17,399 increasing but it's not ridiculous as 456 00:16:16,019 --> 00:16:19,019 such so obviously there's an easy 457 00:16:17,399 --> 00:16:21,060 solution to this we just put them all in 458 00:16:19,019 --> 00:16:22,980 a text oh sorry we put them all in a 459 00:16:21,060 --> 00:16:24,540 text file and we run sort on them and we 460 00:16:22,980 --> 00:16:27,120 run that through uniq and that spits 461 00:16:24,540 --> 00:16:29,880 out the counts it's done from soft 462 00:16:27,120 --> 00:16:31,500 it's already optimized it runs on your 463 00:16:29,880 --> 00:16:32,760 computer it's multi-core it's fine 464 00:16:31,500 --> 00:16:34,800 you've already got it if you've got any 465 00:16:32,760 --> 00:16:36,240 kind of Unix machine it'll work out 466 00:16:34,800 --> 00:16:39,660 fine 467 00:16:36,240 --> 00:16:42,060 um unfortunately the problem is humans 468 00:16:39,660 --> 00:16:44,100 um the actual problem isn't sorting 469 00:16:42,060 --> 00:16:46,380 and counting the actual problem is file 470 00:16:44,100 --> 00:16:48,480 formats and metadata and filtering and 471 00:16:46,380 --> 00:16:50,940 grouping and transformation and 472 00:16:48,480 --> 00:16:52,740 statistics you 
have to do all of these 473 00:16:50,940 --> 00:16:54,480 things to the numbers before and after 474 00:16:52,740 --> 00:16:56,100 you count them statistics is 475 00:16:54,480 --> 00:16:58,259 particularly important because you don't 476 00:16:56,100 --> 00:16:59,940 want to do this where you find a 477 00:16:58,259 --> 00:17:02,579 correlation because you're looking for a 478 00:16:59,940 --> 00:17:04,740 correlation for anyone who can't see it 479 00:17:02,579 --> 00:17:07,020 this is a correlation of the Internet 480 00:17:04,740 --> 00:17:08,819 Explorer market share versus the murder 481 00:17:07,020 --> 00:17:09,780 rate in the United States of America and 482 00:17:08,819 --> 00:17:11,339 you can see they're very nicely 483 00:17:09,780 --> 00:17:12,179 correlated if you graph them in the 484 00:17:11,339 --> 00:17:14,040 right way 485 00:17:12,179 --> 00:17:16,319 you don't want that to happen you need 486 00:17:14,040 --> 00:17:17,520 to run tests as you look at this data to 487 00:17:16,319 --> 00:17:19,740 make sure that you're seeing a real 488 00:17:17,520 --> 00:17:22,679 effect and not some kind of statistical 489 00:17:19,740 --> 00:17:24,179 correlation 490 00:17:22,679 --> 00:17:25,500 previously a lot of this work was done 491 00:17:24,179 --> 00:17:27,000 in spreadsheets there's some pretty 492 00:17:25,500 --> 00:17:28,500 major problems with that for a start 493 00:17:27,000 --> 00:17:30,240 we're heading up into the billions of 494 00:17:28,500 --> 00:17:33,000 rows which doesn't play very well in 495 00:17:30,240 --> 00:17:34,740 Excel also Excel has this real tendency 496 00:17:33,000 --> 00:17:36,419 to want everything to be a date and 497 00:17:34,740 --> 00:17:37,919 genes have names like I mentioned before 498 00:17:36,419 --> 00:17:41,580 and some of those names are things like 499 00:17:37,919 --> 00:17:44,220 SEPT2 and MARCH3 and 500 00:17:41,580 --> 00:17:46,559 Excel thinks they're dates this paper 501 00:17:44,220 --> 
00:17:48,900 estimates that roughly 20 percent of 502 00:17:46,559 --> 00:17:51,539 cell biology papers have errors 503 00:17:48,900 --> 00:17:53,100 introduced by these kinds of things this 504 00:17:51,539 --> 00:17:56,100 is it's a bit out of date now but it's 505 00:17:53,100 --> 00:17:56,940 a ridiculously large number also 506 00:17:56,100 --> 00:17:59,880 just 507 00:17:56,940 --> 00:18:01,320 any series of digits with one e in them 508 00:17:59,880 --> 00:18:03,799 is turned into a floating point number 509 00:18:01,320 --> 00:18:07,080 thanks real helpful Excel 510 00:18:03,799 --> 00:18:09,179 so spreadsheets are not a great thing 511 00:18:07,080 --> 00:18:12,900 for this so a lot of people write 512 00:18:09,179 --> 00:18:14,760 programs which is great Python and R are very 513 00:18:12,900 --> 00:18:16,320 popular to coordinate all of that stuff 514 00:18:14,760 --> 00:18:17,700 you probably need some bash script as 515 00:18:16,320 --> 00:18:19,919 well and maybe a little bit of a 516 00:18:17,700 --> 00:18:22,260 Makefile yeah it's fine we can learn how to 517 00:18:19,919 --> 00:18:24,179 write some make maybe Snakemake as well 518 00:18:22,260 --> 00:18:25,860 we can coordinate all of these pieces 519 00:18:24,179 --> 00:18:27,660 together and all you have to do is take 520 00:18:25,860 --> 00:18:30,840 someone who's a newly minted cancer 521 00:18:27,660 --> 00:18:33,240 researcher keen to do cancer research 522 00:18:30,840 --> 00:18:35,039 and send them off to learn Unix and 523 00:18:33,240 --> 00:18:36,780 Python and R and all of this stuff for 524 00:18:35,039 --> 00:18:38,220 like a year or so because they have got 525 00:18:36,780 --> 00:18:40,679 nothing better to be doing with their 526 00:18:38,220 --> 00:18:41,820 lives actually it would be a lot better 527 00:18:40,679 --> 00:18:44,160 if we could just get them working on 528 00:18:41,820 --> 00:18:45,900 cancer research straight away so 529 00:18:44,160 --> 00:18:47,460 um 530
00:18:45,900 --> 00:18:51,179 previously we've had a tool called 531 00:18:47,460 --> 00:18:52,980 Enrich2 which is a GUI based tool that 532 00:18:51,179 --> 00:18:54,600 people can feed these data files to have 533 00:18:52,980 --> 00:18:58,020 them processed they get spat back out 534 00:18:54,600 --> 00:19:00,960 again it's very nice it's GUI based it's 535 00:18:58,020 --> 00:19:03,059 easy to use it's written in Python 2 536 00:19:00,960 --> 00:19:05,039 um it's specific to some specific 537 00:19:03,059 --> 00:19:07,080 experimental setups though and it's 538 00:19:05,039 --> 00:19:10,440 difficult to kind of grow from where 539 00:19:07,080 --> 00:19:11,880 Enrich2 is to cover all of these 540 00:19:10,440 --> 00:19:14,340 different kinds of experiment that 541 00:19:11,880 --> 00:19:16,320 people were coming up with because 542 00:19:14,340 --> 00:19:18,539 files are always increasing scientists 543 00:19:16,320 --> 00:19:20,640 are always coming up with new kinds of 544 00:19:18,539 --> 00:19:22,380 experiment to run that is part of the 545 00:19:20,640 --> 00:19:23,880 point of doing science is you should be 546 00:19:22,380 --> 00:19:25,860 able to say 547 00:19:23,880 --> 00:19:27,900 I've got a new way to do this I want to 548 00:19:25,860 --> 00:19:29,280 try my new way of running this 549 00:19:27,900 --> 00:19:31,440 experiment 550 00:19:29,280 --> 00:19:33,120 um we want to enable that 551 00:19:31,440 --> 00:19:34,919 so we've started on this new project 552 00:19:33,120 --> 00:19:38,039 called CountESS count-based experiment 553 00:19:34,919 --> 00:19:40,919 scoring and statistics so as the name 554 00:19:38,039 --> 00:19:43,260 suggests it's all about counting things 555 00:19:40,919 --> 00:19:45,360 but it's also about the scoring and the 556 00:19:43,260 --> 00:19:47,460 statistics and how to support 557 00:19:45,360 --> 00:19:49,260 multiple kinds of experiment 558 00:19:47,460 --> 00:19:51,900 it's got a graphical user interface 559
00:19:49,260 --> 00:19:53,940 because our users like to use programs 560 00:19:51,900 --> 00:19:55,799 that look like programs not like 561 00:19:53,940 --> 00:19:56,700 scripts 562 00:19:55,799 --> 00:19:58,740 um 563 00:19:56,700 --> 00:20:01,140 but it's plugin based so you can add and 564 00:19:58,740 --> 00:20:03,480 remove modules from it really easily it 565 00:20:01,140 --> 00:20:05,520 uses flexible data pipelines so the data 566 00:20:03,480 --> 00:20:07,679 kind of flows through it neatly 567 00:20:05,520 --> 00:20:09,419 and rather than having a whole kind of 568 00:20:07,679 --> 00:20:11,520 bunch of here's my program and my other 569 00:20:09,419 --> 00:20:13,020 program and my Makefile and calls this 570 00:20:11,520 --> 00:20:15,059 and calls that it just has a single text 571 00:20:13,020 --> 00:20:17,160 configuration file that binds all that 572 00:20:15,059 --> 00:20:19,260 stuff together 573 00:20:17,160 --> 00:20:21,960 um its architecture looks something like 574 00:20:19,260 --> 00:20:23,880 this to try and keep all the modules a 575 00:20:21,960 --> 00:20:27,500 bit separate there is a GUI 576 00:20:23,880 --> 00:20:27,500 it's written in Tkinter 577 00:20:27,799 --> 00:20:32,520 I suspect it will not always be written 578 00:20:30,120 --> 00:20:34,020 in Tkinter because that has proved not 579 00:20:32,520 --> 00:20:36,120 necessarily a great idea and as per 580 00:20:34,020 --> 00:20:37,440 Russell's talk just before this there 581 00:20:36,120 --> 00:20:39,120 are alternatives that are definitely 582 00:20:37,440 --> 00:20:40,860 worth considering but it's very separate 583 00:20:39,120 --> 00:20:43,260 from the rest of the code so I'm not 584 00:20:40,860 --> 00:20:45,419 actually all that worried about that we 585 00:20:43,260 --> 00:20:47,280 can come back and replace that later 586 00:20:45,419 --> 00:20:49,440 as an overview of what it currently 587 00:20:47,280 --> 00:20:51,780 looks like here is a screenshot of what 588
00:20:49,440 --> 00:20:53,280 it currently looks like the nice thing 589 00:20:51,780 --> 00:20:55,080 about Tkinter is it does actually work 590 00:20:53,280 --> 00:20:56,700 nicely over X11 forwarding if you've got 591 00:20:55,080 --> 00:20:57,660 to do that 592 00:20:56,700 --> 00:20:59,880 um 593 00:20:57,660 --> 00:21:01,740 the window is split into three parts 594 00:20:59,880 --> 00:21:03,539 basically there's this tree of plugins 595 00:21:01,740 --> 00:21:06,620 and how they modularly connect together 596 00:21:03,539 --> 00:21:08,820 to flow data in and out of the system 597 00:21:06,620 --> 00:21:10,200 on the left there things are loading 598 00:21:08,820 --> 00:21:11,880 things up from files on the right 599 00:21:10,200 --> 00:21:13,559 they're writing things out to files and 600 00:21:11,880 --> 00:21:15,600 the data kind of flows through all those 601 00:21:13,559 --> 00:21:17,280 modules on its way through it's fairly 602 00:21:15,600 --> 00:21:19,380 drag and drop state is saved to 603 00:21:17,280 --> 00:21:21,720 configuration files etc 604 00:21:19,380 --> 00:21:23,280 uh each plugin has a configuration 605 00:21:21,720 --> 00:21:24,780 screen so you can just fill in the 606 00:21:23,280 --> 00:21:27,840 blanks in the form all that's saved to 607 00:21:24,780 --> 00:21:29,460 the config file and most essentially as 608 00:21:27,840 --> 00:21:32,100 you're doing this stuff there's live 609 00:21:29,460 --> 00:21:32,940 data previews of stuff not the whole 610 00:21:32,100 --> 00:21:34,679 file 611 00:21:32,940 --> 00:21:35,940 100 million rows or so at the moment 612 00:21:34,679 --> 00:21:38,640 maybe we'll make that a little more 613 00:21:35,940 --> 00:21:40,919 flexible as we go along but the idea is 614 00:21:38,640 --> 00:21:41,940 that it processes a bunch of rows so 615 00:21:40,919 --> 00:21:43,740 that you can see what you're doing if 616 00:21:41,940 --> 00:21:45,720 you're writing a regular expression you 617 00:21:43,740 -->
00:21:48,780 get to see that live that you're making 618 00:21:45,720 --> 00:21:50,820 the changes and that it's working which 619 00:21:48,780 --> 00:21:52,320 is great there's also a CLI which you can 620 00:21:50,820 --> 00:21:54,539 run that just basically runs the same 621 00:21:52,320 --> 00:21:55,980 config file but run it without having to 622 00:21:54,539 --> 00:21:57,360 look at it which is useful if you want 623 00:21:55,980 --> 00:22:00,059 to use some bigger computer somewhere 624 00:21:57,360 --> 00:22:01,679 further away some of these runs to run 625 00:22:00,059 --> 00:22:03,240 the whole file can take hours and hours 626 00:22:01,679 --> 00:22:06,240 to run so it's kind of nice if you can 627 00:22:03,240 --> 00:22:07,200 just SSH somewhere and do that 628 00:22:06,240 --> 00:22:08,880 um 629 00:22:07,200 --> 00:22:10,679 all right the config files are really 630 00:22:08,880 --> 00:22:12,659 simple they're that INI file format that 631 00:22:10,679 --> 00:22:14,940 you see everywhere because it's easy to 632 00:22:12,659 --> 00:22:16,799 diff it works nicely in revision control 633 00:22:14,940 --> 00:22:19,440 it's relatively simple you can hand-edit it 634 00:22:16,799 --> 00:22:21,960 if you have to it's loaded and saved by 635 00:22:19,440 --> 00:22:24,120 the GUI it's loaded by the CLI module 636 00:22:21,960 --> 00:22:26,039 all the fun stuff happens in here this 637 00:22:24,120 --> 00:22:27,740 is the pipeline code is basically 638 00:22:26,039 --> 00:22:30,299 all about 639 00:22:27,740 --> 00:22:31,919 coordinating multiple processes and the 640 00:22:30,299 --> 00:22:33,360 flow of data through the system and all 641 00:22:31,919 --> 00:22:34,860 this and all the tricky stuff happens in 642 00:22:33,360 --> 00:22:36,960 there it's all hidden away nicely in 643 00:22:34,860 --> 00:22:39,720 there 644 00:22:36,960 --> 00:22:41,520 um it's a work in progress it can be 645 00:22:39,720 --> 00:22:42,539 improved there's lots of work to do in 646
00:22:41,520 --> 00:22:45,059 there 647 00:22:42,539 --> 00:22:46,500 um but that's what it's aiming to do it 648 00:22:45,059 --> 00:22:48,299 calls out to the plugins and makes the 649 00:22:46,500 --> 00:22:50,039 plugins do things for it and the plugins 650 00:22:48,299 --> 00:22:51,020 can be written in such a way that they 651 00:22:50,039 --> 00:22:53,520 don't need to think about 652 00:22:51,020 --> 00:22:55,860 multi-processing you write the 653 00:22:53,520 --> 00:22:57,960 plugin in a way that does like a map and 654 00:22:55,860 --> 00:22:59,760 a reduce for example and let the 655 00:22:57,960 --> 00:23:01,320 pipeline worry about how to make that 656 00:22:59,760 --> 00:23:02,039 actually happen 657 00:23:01,320 --> 00:23:03,720 um 658 00:23:02,039 --> 00:23:05,700 so there are basically three kinds 659 00:23:03,720 --> 00:23:08,700 there's reading files there's altering 660 00:23:05,700 --> 00:23:10,740 your data and writing files back they 661 00:23:08,700 --> 00:23:13,140 all subclass one of the many subclasses 662 00:23:10,740 --> 00:23:14,580 that are available that do all of that 663 00:23:13,140 --> 00:23:17,460 kind of structural 664 00:23:14,580 --> 00:23:20,640 stuff about this is a map reducy kind of 665 00:23:17,460 --> 00:23:22,500 plugin versus a very large number of 666 00:23:20,640 --> 00:23:24,900 the plugins literally just take a row at 667 00:23:22,500 --> 00:23:28,799 a time transform it in some way and spit 668 00:23:24,900 --> 00:23:30,299 out another column into that table so if 669 00:23:28,799 --> 00:23:32,159 you want to calculate a score for 670 00:23:30,299 --> 00:23:33,780 example you might take a count and 671 00:23:32,159 --> 00:23:35,700 another count and you might divide one 672 00:23:33,780 --> 00:23:37,919 by the other and take a log of that and 673 00:23:35,700 --> 00:23:40,919 that will give you a score 674 00:23:37,919 --> 00:23:43,140 you don't need to run that all in one 675 00:23:40,919 -->
00:23:46,080 place all on one table or anything it 676 00:23:43,140 --> 00:23:47,880 can run on many rows separately in 677 00:23:46,080 --> 00:23:50,520 different places on many CPUs 678 00:23:47,880 --> 00:23:53,460 concurrently so we allow that to happen 679 00:23:50,520 --> 00:23:55,440 by not making the plugin writer 680 00:23:53,460 --> 00:23:57,840 understand that they can just implement 681 00:23:55,440 --> 00:24:01,200 the function that works on a row and not 682 00:23:57,840 --> 00:24:03,179 worry about anything else so that's nice 683 00:24:01,200 --> 00:24:04,380 anyone can write a plugin this is 684 00:24:03,179 --> 00:24:06,539 actually one of the coolest things in 685 00:24:04,380 --> 00:24:10,860 Python recently is this whole entry 686 00:24:06,539 --> 00:24:13,799 points mechanism if you write a little 687 00:24:10,860 --> 00:24:15,360 clause into your plugin's 688 00:24:13,799 --> 00:24:16,980 pyproject file 689 00:24:15,360 --> 00:24:19,080 all it needs to do is say hey this has 690 00:24:16,980 --> 00:24:21,360 an entry point it's a CountESS plugins 691 00:24:19,080 --> 00:24:22,140 entry point 692 00:24:21,360 --> 00:24:24,120 um 693 00:24:22,140 --> 00:24:26,580 and you declare that basically here 694 00:24:24,120 --> 00:24:28,140 here's my plugin when someone installs 695 00:24:26,580 --> 00:24:29,760 that with pip or whatever 696 00:24:28,140 --> 00:24:32,520 CountESS can actually see that that 697 00:24:29,760 --> 00:24:33,900 plugin exists and it can run it 698 00:24:32,520 --> 00:24:35,880 the code in CountESS looks a little bit 699 00:24:33,900 --> 00:24:37,500 like that it basically just goes through 700 00:24:35,880 --> 00:24:41,400 all the entry points from that 701 00:24:37,500 --> 00:24:42,059 importlib entry points function and it says 702 00:24:41,400 --> 00:24:44,400 um 703 00:24:42,059 --> 00:24:46,860 are these valid plugins do they actually 704 00:24:44,400 --> 00:24:48,240 derive from that base plugin if they
do 705 00:24:46,860 --> 00:24:50,480 then we'll count them we'll make them 706 00:24:48,240 --> 00:24:53,940 available to the user 707 00:24:50,480 --> 00:24:55,320 and lastly the libraries underlying 708 00:24:53,940 --> 00:24:56,700 all this stuff 709 00:24:55,320 --> 00:24:59,039 um 710 00:24:56,700 --> 00:25:00,600 so a lot of the stuff runs on pandas 711 00:24:59,039 --> 00:25:03,600 because pandas is really handy for this 712 00:25:00,600 --> 00:25:05,640 and NumPy there are other libraries too 713 00:25:03,600 --> 00:25:09,059 for example there's one called minimap2 714 00:25:05,640 --> 00:25:10,100 that I've wrapped up to be a CountESS 715 00:25:09,059 --> 00:25:13,260 plugin 716 00:25:10,100 --> 00:25:16,679 minimap2 is an example of a thing that 717 00:25:13,260 --> 00:25:19,080 runs per row it takes a variant and 718 00:25:16,679 --> 00:25:21,120 it turns it into a little kind of a diff 719 00:25:19,080 --> 00:25:22,860 almost so you have a sequence and you 720 00:25:21,120 --> 00:25:24,840 say I want to know where this sequence 721 00:25:22,860 --> 00:25:27,480 comes from and how it's different 722 00:25:24,840 --> 00:25:29,400 from the reference genome so it could 723 00:25:27,480 --> 00:25:32,039 run once per row and wrapping that up 724 00:25:29,400 --> 00:25:34,620 takes a dozen lines of code to just 725 00:25:32,039 --> 00:25:36,600 basically say all right for each line 726 00:25:34,620 --> 00:25:38,340 this function gets called so this function 727 00:25:36,600 --> 00:25:40,380 just needs to look up this particular 728 00:25:38,340 --> 00:25:41,400 variant and compare it and stuff like 729 00:25:40,380 --> 00:25:43,440 that 730 00:25:41,400 --> 00:25:46,039 that's really good just as an example 731 00:25:43,440 --> 00:25:49,200 I've got five minutes so this is perfect 732 00:25:46,039 --> 00:25:52,020 as an example of CountESS used in 733 00:25:49,200 --> 00:25:53,840 anger this is kind of loosely based on 734 00:25:52,020 -->
00:25:56,340 that SGE work before 735 00:25:53,840 --> 00:25:58,620 so the first 736 00:25:56,340 --> 00:26:01,740 as you select the little plugins on the 737 00:25:58,620 --> 00:26:04,200 left pane there it shows you their 738 00:26:01,740 --> 00:26:07,200 config and their own individual output 739 00:26:04,200 --> 00:26:08,940 on the right there so the first thing is 740 00:26:07,200 --> 00:26:11,700 they are loading up all the FASTQ files 741 00:26:08,940 --> 00:26:12,840 163 files get loaded and what it's 742 00:26:11,700 --> 00:26:15,179 actually doing is grabbing the first 743 00:26:12,840 --> 00:26:16,980 little bit out of each file to have 744 00:26:15,179 --> 00:26:19,020 a representative sample of that data and 745 00:26:16,980 --> 00:26:21,179 you can see that data down in the bottom 746 00:26:19,020 --> 00:26:22,620 sort of third of the screen over the 747 00:26:21,179 --> 00:26:25,080 right there 748 00:26:22,620 --> 00:26:26,940 so that shows you the actual contents of 749 00:26:25,080 --> 00:26:28,620 those FASTQ files and FASTQ is a 750 00:26:26,940 --> 00:26:30,299 particular file format used a lot in 751 00:26:28,620 --> 00:26:31,740 this thing 752 00:26:30,299 --> 00:26:33,600 there's another plugin there that 753 00:26:31,740 --> 00:26:35,400 literally is just extracting some data 754 00:26:33,600 --> 00:26:37,740 from those columns that are in the FASTQ 755 00:26:35,400 --> 00:26:39,480 file and altering it a little bit so 756 00:26:37,740 --> 00:26:41,700 that we can use it later 757 00:26:39,480 --> 00:26:43,559 along with those big FASTQ files 758 00:26:41,700 --> 00:26:46,380 comes a little metadata file called 759 00:26:43,559 --> 00:26:48,000 SraRunTable and there's some metadata in 760 00:26:46,380 --> 00:26:50,580 there we need so we also load that up 761 00:26:48,000 --> 00:26:54,299 it's called SraRunTable.txt 762 00:26:50,580 --> 00:26:56,400 it's a CSV file who's surprised so it 763 00:26:54,299 -->
00:26:58,440 basically reads us there's a CSV reader 764 00:26:56,400 --> 00:27:00,120 that reads the CSV file 765 00:26:58,440 --> 00:27:01,559 and then we have to transform that a bit 766 00:27:00,120 --> 00:27:03,120 because it's also not quite what we 767 00:27:01,559 --> 00:27:04,620 wanted so we pulled some different 768 00:27:03,120 --> 00:27:07,260 columns out of it that aren't actually 769 00:27:04,620 --> 00:27:10,080 CSV columns they're just like embedded 770 00:27:07,260 --> 00:27:12,120 within the text strings in the CSV and 771 00:27:10,080 --> 00:27:13,980 now we can do like a database join of 772 00:27:12,120 --> 00:27:15,659 these two data sources using a join 773 00:27:13,980 --> 00:27:17,279 plugin which is one of the built-in 774 00:27:15,659 --> 00:27:18,659 plugins 775 00:27:17,279 --> 00:27:20,760 um 776 00:27:18,659 --> 00:27:22,380 we have variants we want to compare them 777 00:27:20,760 --> 00:27:24,240 to the reference genome so we have a 778 00:27:22,380 --> 00:27:26,460 separate file again with the reference 779 00:27:24,240 --> 00:27:28,500 genome for each of the exons in it 780 00:27:26,460 --> 00:27:29,700 we then join that with the data I think 781 00:27:28,500 --> 00:27:31,740 you're seeing where this is going the data 782 00:27:29,700 --> 00:27:34,260 is all kind of flowing together 783 00:27:31,740 --> 00:27:37,500 we split out data one way for the 784 00:27:34,260 --> 00:27:38,700 things that are unchanged oh no I 785 00:27:37,500 --> 00:27:41,760 pressed a button 786 00:27:38,700 --> 00:27:42,480 we split out the data one way for the 787 00:27:41,760 --> 00:27:44,520 um 788 00:27:42,480 --> 00:27:47,340 the stuff that's unchanged and one way 789 00:27:44,520 --> 00:27:49,740 for the stuff that is changed oh I 790 00:27:47,340 --> 00:27:52,260 pressed the wrong button 791 00:27:49,740 --> 00:27:55,140 sorry about this 792 00:27:52,260 --> 00:27:57,059 and then we finally we call the variants 793 00:27:55,140 --> 00:28:00,000
we say okay they're changed these are 794 00:27:57,059 --> 00:28:02,340 all changed in what way are they changed 795 00:28:00,000 --> 00:28:04,140 um so we go through and we make like 796 00:28:02,340 --> 00:28:06,720 a little diff for each one 797 00:28:04,140 --> 00:28:08,460 and then we take counts of each diff how 798 00:28:06,720 --> 00:28:10,380 likely are these or how common are these 799 00:28:08,460 --> 00:28:12,600 compared to how common are the ones that 800 00:28:10,380 --> 00:28:14,580 are unchanged we calculate a score from 801 00:28:12,600 --> 00:28:16,380 that and finally because this is science 802 00:28:14,580 --> 00:28:18,840 we write it all out to a CSV file 803 00:28:16,380 --> 00:28:20,279 because CSV files make the world go 804 00:28:18,840 --> 00:28:22,200 round 805 00:28:20,279 --> 00:28:23,940 um and all of that's in a configuration 806 00:28:22,200 --> 00:28:26,279 file so you can change any one of those 807 00:28:23,940 --> 00:28:28,220 steps and it's relatively easy to 808 00:28:26,279 --> 00:28:30,299 take someone else's work replicate it 809 00:28:28,220 --> 00:28:33,840 alter it a little bit maybe just change 810 00:28:30,299 --> 00:28:35,700 the way it's scored maybe just 811 00:28:33,840 --> 00:28:37,679 um alter the way you've filtered the 812 00:28:35,700 --> 00:28:39,120 data early on maybe you add in a step 813 00:28:37,679 --> 00:28:40,440 you just go back there somewhere and say 814 00:28:39,120 --> 00:28:41,760 I want to throw out all of these ones 815 00:28:40,440 --> 00:28:44,159 because these ones look like garbage to 816 00:28:41,760 --> 00:28:45,360 me these low quality reads I'm 817 00:28:44,159 --> 00:28:48,960 going to get rid of and I'm going to see 818 00:28:45,360 --> 00:28:51,140 whether my statistics come out nicer 819 00:28:48,960 --> 00:28:53,220 at the other end 820 00:28:51,140 --> 00:28:55,140 the idea is that they should be very 821 00:28:53,220 --> 00:28:57,419 flexible very easy
for people to change 822 00:28:55,140 --> 00:29:00,480 if you change your approach you can 823 00:28:57,419 --> 00:29:03,240 still like your sequencing approach you 824 00:29:00,480 --> 00:29:04,620 could change the first few steps while 825 00:29:03,240 --> 00:29:05,940 leaving the rest of the steps alone and 826 00:29:04,620 --> 00:29:07,500 check that the answers still come out 827 00:29:05,940 --> 00:29:09,000 about the same if you wanted to change 828 00:29:07,500 --> 00:29:11,220 the scoring you can leave the first 829 00:29:09,000 --> 00:29:13,260 steps alone and change the last steps 830 00:29:11,220 --> 00:29:14,340 you could change the middle if you 831 00:29:13,260 --> 00:29:15,900 happen to change the ways you're 832 00:29:14,340 --> 00:29:17,520 thinking about things 833 00:29:15,900 --> 00:29:19,260 so that's CountESS hopefully we've seen 834 00:29:17,520 --> 00:29:21,000 a little bit about bioinformatics we've 835 00:29:19,260 --> 00:29:23,279 seen some interesting problems that we 836 00:29:21,000 --> 00:29:23,940 want to solve and why they're important 837 00:29:23,279 --> 00:29:26,220 um 838 00:29:23,940 --> 00:29:28,559 one of the really lovely things about 839 00:29:26,220 --> 00:29:30,899 this job is at any time you feel like 840 00:29:28,559 --> 00:29:32,940 you know your work is pointless and 841 00:29:30,899 --> 00:29:36,080 you're staring into a video monitor all 842 00:29:32,940 --> 00:29:39,419 day and night for no reason you see 843 00:29:36,080 --> 00:29:41,039 BRCA1 here and ovarian cancer there and 844 00:29:39,419 --> 00:29:43,440 so on and so forth and you realize 845 00:29:41,039 --> 00:29:45,539 oh hell yeah no this actually is this is 846 00:29:43,440 --> 00:29:48,059 quite meaningful at the end of the day 847 00:29:45,539 --> 00:29:50,700 and hopefully we can really help 848 00:29:48,059 --> 00:29:52,860 move people faster into those research 849 00:29:50,700 --> 00:29:54,120 activities hopefully you've seen some 850
00:29:52,860 --> 00:29:55,620 interesting stuff about Python 851 00:29:54,120 --> 00:29:57,480 techniques a little bit about the entry 852 00:29:55,620 --> 00:29:59,220 points and stuff and maybe a nifty 853 00:29:57,480 --> 00:30:00,539 approach to data processing look I think 854 00:29:59,220 --> 00:30:02,340 that this thing is actually useful 855 00:30:00,539 --> 00:30:04,380 beyond bioinformatics potentially I 856 00:30:02,340 --> 00:30:05,760 think that there's potential for this 857 00:30:04,380 --> 00:30:08,460 kind of approach to work in other 858 00:30:05,760 --> 00:30:10,020 sciences as well I hope I'd love to talk 859 00:30:08,460 --> 00:30:12,659 to people about that if you're 860 00:30:10,020 --> 00:30:13,320 working in some other science as well 861 00:30:12,659 --> 00:30:15,419 um 862 00:30:13,320 --> 00:30:17,159 and I think it's been a really 863 00:30:15,419 --> 00:30:19,620 interesting project and I'm very glad to 864 00:30:17,159 --> 00:30:21,840 be working on it at the moment 865 00:30:19,620 --> 00:30:22,980 um so thank you very much I think we 866 00:30:21,840 --> 00:30:24,059 should have time for a couple of 867 00:30:22,980 --> 00:30:25,740 questions 868 00:30:24,059 --> 00:30:27,960 thank you very much Nick okay we have 869 00:30:25,740 --> 00:30:29,940 time for one question uh and if you have 870 00:30:27,960 --> 00:30:32,700 any additional questions please ask Nick 871 00:30:29,940 --> 00:30:36,899 afterwards in the lunch hallway 872 00:30:32,700 --> 00:30:39,299 track all right do we have any questions 873 00:30:36,899 --> 00:30:42,240 yeah I should also mention actually just 874 00:30:39,299 --> 00:30:43,620 while I've got the mic here um that URL 875 00:30:42,240 --> 00:30:45,179 will give you all the slides and all the 876 00:30:43,620 --> 00:30:47,159 notes from all the slides so if you 877 00:30:45,179 --> 00:30:51,440 wanted to see anything closer up 878 00:30:47,159 --> 00:30:51,440 it's there yes we have one
question 879 00:30:58,799 --> 00:31:03,720 so you're putting together a workflow 880 00:31:01,200 --> 00:31:05,100 for crunching lots of stuff you 881 00:31:03,720 --> 00:31:07,440 mentioned briefly that you're spreading 882 00:31:05,100 --> 00:31:09,360 out the work so do you have like a 883 00:31:07,440 --> 00:31:11,460 particular technique you're using or 884 00:31:09,360 --> 00:31:14,220 pickling and sending the jobs around 885 00:31:11,460 --> 00:31:17,159 uh yeah starting with the work 886 00:31:14,220 --> 00:31:20,100 um at the moment it's all running on one 887 00:31:17,159 --> 00:31:22,980 computer rather than distributed 888 00:31:20,100 --> 00:31:25,140 um it could also be the back end of 889 00:31:22,980 --> 00:31:27,080 this thing could also run on Dask for 890 00:31:25,140 --> 00:31:30,059 example it does have distributed sort of 891 00:31:27,080 --> 00:31:31,140 pandas bit distributed 892 00:31:30,059 --> 00:31:33,120 um 893 00:31:31,140 --> 00:31:34,860 and we've worked with that 894 00:31:33,120 --> 00:31:36,059 previously 895 00:31:34,860 --> 00:31:37,080 um I think that's probably a good way to 896 00:31:36,059 --> 00:31:39,000 go 897 00:31:37,080 --> 00:31:40,440 at the moment I've been concentrating on 898 00:31:39,000 --> 00:31:44,340 getting it working on sort of multiple 899 00:31:40,440 --> 00:31:45,179 CPUs and that abstraction layer of 900 00:31:44,340 --> 00:31:47,580 um 901 00:31:45,179 --> 00:31:50,039 how do you write a plugin that knows 902 00:31:47,580 --> 00:31:53,399 enough about what is going on that it 903 00:31:50,039 --> 00:31:55,500 can say hey I want to do a map reduce 904 00:31:53,399 --> 00:31:57,419 or hey I don't care I just want to work 905 00:31:55,500 --> 00:31:59,100 on a row at a time just give me rows I 906 00:31:57,419 --> 00:32:00,779 just want raw rows and I'll give you 907 00:31:59,100 --> 00:32:02,159 back a column 908 00:32:00,779 --> 00:32:04,500 um what I want to do is get that 909 00:32:02,159
--> 00:32:07,100 abstraction right and then in some ways 910 00:32:04,500 --> 00:32:10,260 that will take care of itself because 911 00:32:07,100 --> 00:32:11,640 uh whatever back end you choose to run 912 00:32:10,260 --> 00:32:16,080 it on 913 00:32:11,640 --> 00:32:18,179 um it becomes its problem right uh it's 914 00:32:16,080 --> 00:32:21,000 how does it assign CPUs does it 915 00:32:18,179 --> 00:32:22,620 distribute stuff uh in some 916 00:32:21,000 --> 00:32:24,600 circumstances does it need to the other 917 00:32:22,620 --> 00:32:28,740 thing that I ran into very quickly and 918 00:32:24,600 --> 00:32:30,299 if anyone follows me on um uh thingo 919 00:32:28,740 --> 00:32:31,860 fediverse 920 00:32:30,299 --> 00:32:33,720 um you might have seen me whinging about 921 00:32:31,860 --> 00:32:36,360 this at one point I finally got to the 922 00:32:33,720 --> 00:32:38,580 point where I was using all 16 CPU cores 923 00:32:36,360 --> 00:32:40,140 on my computer and it was great and it 924 00:32:38,580 --> 00:32:41,700 made a lot of loud noise with the fans 925 00:32:40,140 --> 00:32:43,080 all turning on and stuff and I went this 926 00:32:41,700 --> 00:32:45,360 is brilliant and then I ran out of RAM 927 00:32:43,080 --> 00:32:47,760 and it was killed by the OOM killer 928 00:32:45,360 --> 00:32:48,960 um and so that really you know one of 929 00:32:47,760 --> 00:32:50,760 the really important bits is actually 930 00:32:48,960 --> 00:32:53,640 not just how to use all the CPUs but how 931 00:32:50,760 --> 00:32:55,500 do we not read all the data how do we 932 00:32:53,640 --> 00:32:58,140 wait until some data comes out the other 933 00:32:55,500 --> 00:33:00,059 end before we read more data and I 934 00:32:58,140 --> 00:33:03,899 think that's a 935 00:33:00,059 --> 00:33:06,539 a worthy problem of evaluation but also 936 00:33:03,899 --> 00:33:09,179 it's something that the plugins kind of 937 00:33:06,539 --> 00:33:11,640 if we get the API for the
plugins right 938 00:33:09,179 --> 00:33:13,140 we can come back to it in some ways 939 00:33:11,640 --> 00:33:14,399 because 940 00:33:13,140 --> 00:33:16,500 um 941 00:33:14,399 --> 00:33:18,720 because the plugins can then exploit 942 00:33:16,500 --> 00:33:20,760 that new understanding 943 00:33:18,720 --> 00:33:23,700 without changing the plugins or changing 944 00:33:20,760 --> 00:33:25,679 people's experiments like conceivably we 945 00:33:23,700 --> 00:33:28,860 can go we can already run a lot of these 946 00:33:25,679 --> 00:33:31,159 experiments on one computer in one place 947 00:33:28,860 --> 00:33:36,120 um I mentioned before that the research 948 00:33:31,159 --> 00:33:38,159 is underway on PALB2 and I spoke to the 949 00:33:36,120 --> 00:33:40,440 guy who's researching that and I said oh 950 00:33:38,159 --> 00:33:42,659 so have you got a similar amount of data 951 00:33:40,440 --> 00:33:45,480 to this BRCA1 one he says oh yeah it's 952 00:33:42,659 --> 00:33:48,539 similar it's three times as much 953 00:33:45,480 --> 00:33:50,220 um so yeah literally in the space of a 954 00:33:48,539 --> 00:33:51,840 couple of years data has tripled in size 955 00:33:50,220 --> 00:33:54,120 that we're dealing with so yeah I think 956 00:33:51,840 --> 00:33:55,740 it's quite likely that these 957 00:33:54,120 --> 00:33:57,960 problems will get bigger and bigger and 958 00:33:55,740 --> 00:34:00,779 bigger and we'll just be chasing Moore's 959 00:33:57,960 --> 00:34:02,460 Law the whole time so um but that's it's 960 00:34:00,779 --> 00:34:05,159 a really interesting thing to consider 961 00:34:02,460 --> 00:34:06,720 is do we use Dask do we use Spark do we 962 00:34:05,159 --> 00:34:08,639 use a million different possible 963 00:34:06,720 --> 00:34:10,919 clustering things 964 00:34:08,639 --> 00:34:12,300 thank you Nick we'd like to offer you 965 00:34:10,919 --> 00:34:13,679 this gift as a token of our 966 00:34:12,300 --> 00:34:15,240 appreciation
for your interesting talk 967 00:34:13,679 --> 00:34:17,460 thank you very much no worries thank you 968 00:34:15,240 --> 00:34:20,460 all right
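
The count-then-score idea from the talk (count identical sequences the way sort and uniq would, then divide one count by another and take a log, one row at a time) can be sketched roughly as below. This is only an illustration under stated assumptions, not CountESS code: the function names `count_sequences` and `score_row`, the row keys `count_before` and `count_after`, and the 0.5 pseudocount are all hypothetical and do not come from the talk or the project.

```python
import math
from collections import Counter


def count_sequences(sequences):
    """Count identical sequences, the in-memory analogue of `sort | uniq -c`."""
    return Counter(sequences)


def score_row(row, pseudocount=0.5):
    """Score a single row as the log of one count divided by another.

    `row` is a plain dict with the hypothetical keys "count_before" and
    "count_after". The pseudocount is an assumption added here so that a
    zero count does not cause a division by zero or log(0).
    """
    before = row["count_before"] + pseudocount
    after = row["count_after"] + pseudocount
    return math.log(after / before)


if __name__ == "__main__":
    # Toy reads standing in for two sequencing time points.
    before = count_sequences(["ACGT", "ACGT", "ACGA", "ACGA"])
    after = count_sequences(["ACGT", "ACGT", "ACGT", "ACGA"])
    for variant in sorted(before):
        row = {"count_before": before[variant], "count_after": after[variant]}
        print(variant, round(score_row(row), 3))
```

Because `score_row` looks only at its own row, a pipeline is free to apply it to many rows in separate processes, which is the property the talk relies on when it says the plugin author only writes the per-row function.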