2 00:00:08,660 --> 00:00:14,880 today we have uh David Lawrence from 3 00:00:12,120 --> 00:00:17,760 South Australia giving us a talk 4 00:00:14,880 --> 00:00:20,640 Analyzing and Sharing Genetic Data with 5 00:00:17,760 --> 00:00:22,820 Python uh could we give David a warm 6 00:00:20,640 --> 00:00:22,820 welcome 7 00:00:25,280 --> 00:00:29,359 [Applause] 8 00:00:26,820 --> 00:00:29,359 to you David 9 00:00:35,399 --> 00:00:39,420 um good morning 10 00:00:37,860 --> 00:00:42,180 um hi yeah so I have a pretty weird 11 00:00:39,420 --> 00:00:44,340 career um path so I was interested in 12 00:00:42,180 --> 00:00:46,800 computer games at the start and then I 13 00:00:44,340 --> 00:00:48,800 realized I needed more money and not 14 00:00:46,800 --> 00:00:52,200 working an insane amount of hours and then 15 00:00:48,800 --> 00:00:55,620 worked in the corporate world doing Java 16 00:00:52,200 --> 00:00:57,120 programming for a while and then went 17 00:00:55,620 --> 00:00:59,460 into bioinformatics and that's actually 18 00:00:57,120 --> 00:01:00,840 been looking at I looked at it today I'm 19 00:00:59,460 --> 00:01:02,460 like wow I've actually been a 20 00:01:00,840 --> 00:01:04,260 bioinformatician longer than a 21 00:01:02,460 --> 00:01:06,360 programmer so that was 22 00:01:04,260 --> 00:01:08,280 pretty strange 23 00:01:06,360 --> 00:01:10,740 um yeah but there's a saying in science 24 00:01:08,280 --> 00:01:12,180 where the best you know as statisticians 25 00:01:10,740 --> 00:01:15,780 they say the best thing about it is you 26 00:01:12,180 --> 00:01:17,460 get to play in other people's backyards 27 00:01:15,780 --> 00:01:19,860 um and I feel the same about programmers 28 00:01:17,460 --> 00:01:21,180 so you know um if you want to if you're a 29 00:01:19,860 --> 00:01:23,340 programmer you can 30 00:01:21,180 --> 00:01:24,360 um you know you can end up working on a 31 00:01:23,340 --> 00:01:27,060 national 32 00:01:24,360 --> 
00:01:28,500 um electricity grid or genetics you know 33 00:01:27,060 --> 00:01:30,060 so there's a lot of cool things we can 34 00:01:28,500 --> 00:01:32,159 do 35 00:01:30,060 --> 00:01:33,360 um so the place I get to play is in 36 00:01:32,159 --> 00:01:35,220 biology 37 00:01:33,360 --> 00:01:37,439 um which I'm biased but I think it's 38 00:01:35,220 --> 00:01:39,119 one of the most interesting things uh 39 00:01:37,439 --> 00:01:41,220 there is 40 00:01:39,119 --> 00:01:43,439 um so everyone you know because of the 41 00:01:41,220 --> 00:01:48,600 vaccines most people know about mRNA now 42 00:01:43,439 --> 00:01:51,000 so DNA makes RNA which makes protein and 43 00:01:48,600 --> 00:01:53,520 they've actually you know it was only in 44 00:01:51,000 --> 00:01:55,200 my grandparents' era so maybe their sixth 45 00:01:53,520 --> 00:01:57,479 or my parents' actually in the 60s they 46 00:01:55,200 --> 00:01:59,579 discovered how the DNA gets turned into 47 00:01:57,479 --> 00:02:00,860 protein so there's actually a code it's 48 00:01:59,579 --> 00:02:04,680 called codons 49 00:02:00,860 --> 00:02:07,380 and the bases get changed um the U 50 00:02:04,680 --> 00:02:11,180 is the RNA version of a T but yeah 51 00:02:07,380 --> 00:02:14,400 basically this is how it's translated 52 00:02:11,180 --> 00:02:17,400 and basically the DNA gets translated 53 00:02:14,400 --> 00:02:19,319 three at a time into proteins and then 54 00:02:17,400 --> 00:02:20,760 proteins they fold up that's a 55 00:02:19,319 --> 00:02:23,580 computational problem that people are 56 00:02:20,760 --> 00:02:25,500 working on and it folds not just a 57 00:02:23,580 --> 00:02:26,459 single protein it folds into multiple 58 00:02:25,500 --> 00:02:29,000 ones 59 00:02:26,459 --> 00:02:31,680 then into complexes and then it gets 60 00:02:29,000 --> 00:02:33,900 ridiculously complicated and hard so 61 00:02:31,680 --> 00:02:36,540 there's a lot of computational stuff to 62 00:02:33,900 --> 
00:02:38,760 do in biology 63 00:02:36,540 --> 00:02:40,020 um so yeah and bioinformatics when you 64 00:02:38,760 --> 00:02:43,620 think about it the business 65 00:02:40,020 --> 00:02:46,800 requirement is basically to discover all 66 00:02:43,620 --> 00:02:48,420 about this stuff about life and we know 67 00:02:46,800 --> 00:02:50,160 a lot but there's 68 00:02:48,420 --> 00:02:51,900 an astonishing amount that we don't know 69 00:02:50,160 --> 00:02:56,400 there's definitely lifetimes' worth of 70 00:02:51,900 --> 00:02:58,400 work to find it all 71 00:02:56,400 --> 00:03:00,599 um yeah so there's a big demand for 72 00:02:58,400 --> 00:03:03,180 programmers and the more Python 73 00:03:00,599 --> 00:03:05,040 programmers in bioinformatics the better 74 00:03:03,180 --> 00:03:07,200 I think the better it needs 75 00:03:05,040 --> 00:03:08,340 a lot more software skills 76 00:03:07,200 --> 00:03:10,739 a lot of people are sort of self-taught 77 00:03:08,340 --> 00:03:13,500 biologists and um yeah if you can come 78 00:03:10,739 --> 00:03:16,260 in with um the software skills you'd be 79 00:03:13,500 --> 00:03:17,099 very welcome so yeah check out the jobs 80 00:03:16,260 --> 00:03:19,440 um 81 00:03:17,099 --> 00:03:20,700 yeah um and if you're interested my 82 00:03:19,440 --> 00:03:23,640 recommendation for getting into 83 00:03:20,700 --> 00:03:25,140 bioinformatics is this um online web app 84 00:03:23,640 --> 00:03:27,959 it's basically LeetCode for 85 00:03:25,140 --> 00:03:29,760 bioinformatics Project Rosalind 86 00:03:27,959 --> 00:03:33,180 um and yeah basically it starts off 87 00:03:29,760 --> 00:03:34,560 really simple counting nucleotides so 88 00:03:33,180 --> 00:03:36,599 you basically just have to they randomly 89 00:03:34,560 --> 00:03:38,420 generate an input you run your Python 90 00:03:36,599 --> 00:03:40,560 program or whatever program against it 91 00:03:38,420 --> 00:03:42,420 paste the answer in if you got it 
right 92 00:03:40,560 --> 00:03:46,500 you move on to the next one starts off 93 00:03:42,420 --> 00:03:48,900 easy gets really hard so lots of fun so 94 00:03:46,500 --> 00:03:52,260 I'm going to talk about my particular 95 00:03:48,900 --> 00:03:54,120 job so this is I work in pathology so 96 00:03:52,260 --> 00:03:57,000 here's an example of something that 97 00:03:54,120 --> 00:03:59,519 happens to me so basically in well me 98 00:03:57,000 --> 00:04:00,780 and a massive team so in the Women's and 99 00:03:59,519 --> 00:04:01,980 Children's Hospital the other side of 100 00:04:00,780 --> 00:04:03,720 the river 101 00:04:01,980 --> 00:04:05,519 um generally people you know a baby 102 00:04:03,720 --> 00:04:07,319 might be born with heart problems and 103 00:04:05,519 --> 00:04:10,080 the question is is it due to infection 104 00:04:07,319 --> 00:04:12,299 is it due to a genetic predisposition is 105 00:04:10,080 --> 00:04:15,360 it due to birth complications all kinds 106 00:04:12,299 --> 00:04:18,959 of troubles and knowing the reason 107 00:04:15,360 --> 00:04:22,079 um helps the clinician decide what to do 108 00:04:18,959 --> 00:04:24,060 so how pathology works is basically the 109 00:04:22,079 --> 00:04:26,520 clinician sends an order to the 110 00:04:24,060 --> 00:04:29,100 pathology company SA Pathology is the 111 00:04:26,520 --> 00:04:32,220 public service provider for pathology in 112 00:04:29,100 --> 00:04:34,500 South Australia we take blood we take 113 00:04:32,220 --> 00:04:37,020 information and we send back a report 114 00:04:34,500 --> 00:04:39,120 that informs the clinician's decisions 115 00:04:37,020 --> 00:04:41,580 on what to do 116 00:04:39,120 --> 00:04:43,560 so there's that you know that's how it 117 00:04:41,580 --> 00:04:45,780 worked for things like um you know blood 118 00:04:43,560 --> 00:04:47,699 concentrations of some you know calcium 119 00:04:45,780 --> 00:04:50,160 or something it's a bit more complicated 120 
00:04:47,699 --> 00:04:52,440 with DNA sequencing so we still take 121 00:04:50,160 --> 00:04:54,840 blood we extract the blood 122 00:04:52,440 --> 00:04:57,300 um put it in a DNA sequencer process it 123 00:04:54,840 --> 00:05:00,180 on HPC machines work out the difference 124 00:04:57,300 --> 00:05:02,360 between like a reference human genome 125 00:05:00,180 --> 00:05:05,520 and this patient's human genome 126 00:05:02,360 --> 00:05:07,800 analyze it and send back the report so 127 00:05:05,520 --> 00:05:10,440 this is at Frome Road the other side of 128 00:05:07,800 --> 00:05:14,100 um North Terrace this is my colleague 129 00:05:10,440 --> 00:05:17,639 and our expensive new sequencer the way 130 00:05:14,100 --> 00:05:21,419 it works is basically the DNA gets put 131 00:05:17,639 --> 00:05:23,639 on this uh sort of glass tile and they 132 00:05:21,419 --> 00:05:27,259 sort of wash the whole thing in bases 133 00:05:23,639 --> 00:05:29,460 and when the next base at a time gets um 134 00:05:27,259 --> 00:05:30,720 incorporated it flashes a certain 135 00:05:29,460 --> 00:05:32,340 color and there's a camera in the 136 00:05:30,720 --> 00:05:35,340 machine and it just like 137 00:05:32,340 --> 00:05:37,440 um basically sees that you know there's 138 00:05:35,340 --> 00:05:39,840 a million flashes of this color and that 139 00:05:37,440 --> 00:05:43,380 color and it works out what it is and 140 00:05:39,840 --> 00:05:45,660 eventually it turns it into DNA 141 00:05:43,380 --> 00:05:48,240 sequences so this is the raw output of a 142 00:05:45,660 --> 00:05:51,240 sequencer the new one we have is 16 to 143 00:05:48,240 --> 00:05:54,960 20 billion reads in 44 hours and these 144 00:05:51,240 --> 00:05:56,940 are usually like 250 bases long the 145 00:05:54,960 --> 00:06:00,000 other things are like quality score and 146 00:05:56,940 --> 00:06:02,460 how confident the base calling is 147 00:06:00,000 --> 00:06:05,039 so what do we do how do we you know 200 148 
00:06:02,460 --> 00:06:08,220 bases where you know what does that mean 149 00:06:05,039 --> 00:06:10,680 um so the 200 bases we work out where 150 00:06:08,220 --> 00:06:12,539 in the three billion human bases that 151 00:06:10,680 --> 00:06:14,340 came from so there's like a string 152 00:06:12,539 --> 00:06:15,960 matching with a bit of fuzziness in 153 00:06:14,340 --> 00:06:18,900 there where we map to the human 154 00:06:15,960 --> 00:06:21,660 reference genome so 155 00:06:18,900 --> 00:06:25,740 um and here's an example of a mapping 156 00:06:21,660 --> 00:06:28,199 um the uh where there's no 157 00:06:25,740 --> 00:06:31,220 letters in the mapped reads that means it's 158 00:06:28,199 --> 00:06:34,860 a perfect match but the G in the middle 159 00:06:31,220 --> 00:06:37,259 that's a mismatch so the patient differs 160 00:06:34,860 --> 00:06:39,960 from the reference genome the reference 161 00:06:37,259 --> 00:06:42,720 genome has an A the patient has a G half 162 00:06:39,960 --> 00:06:44,840 of the time so you have two copies of 163 00:06:42,720 --> 00:06:47,400 chromosomes one from mum one from dad 164 00:06:44,840 --> 00:06:49,440 and on one of them we don't know 165 00:06:47,400 --> 00:06:51,960 which either the mum's copy or the dad's 166 00:06:49,440 --> 00:06:55,139 copy um there's a G instead of an A so 167 00:06:51,960 --> 00:06:57,240 this is called a heterozygous mutation 168 00:06:55,139 --> 00:06:59,940 so where do these human genomes come 169 00:06:57,240 --> 00:07:02,280 from uh it basically was an extremely 170 00:06:59,940 --> 00:07:04,020 expensive project in the early 2000s and 171 00:07:02,280 --> 00:07:05,220 yeah you can basically download this off 172 00:07:04,020 --> 00:07:07,199 the internet 173 00:07:05,220 --> 00:07:08,340 um you know it cost an insane 174 00:07:07,199 --> 00:07:10,680 amount of money to produce you can 175 00:07:08,340 --> 00:07:12,419 download a text file I use it we use it 176 00:07:10,680 
--> 00:07:14,699 in genomics for everything uh it's 177 00:07:12,419 --> 00:07:17,460 extremely useful thanks for paying for 178 00:07:14,699 --> 00:07:19,500 science back you know 23 years ago I 179 00:07:17,460 --> 00:07:21,120 will use it every day 180 00:07:19,500 --> 00:07:22,979 um and yeah you know it was 181 00:07:21,120 --> 00:07:25,979 insanely expensive to do sequencing in 182 00:07:22,979 --> 00:07:27,780 the past but the cost is actually uh 183 00:07:25,979 --> 00:07:30,120 falling much faster than Moore's Law so if 184 00:07:27,780 --> 00:07:32,940 you think computers are getting uh 185 00:07:30,120 --> 00:07:36,539 faster over time well uh genomics is uh 186 00:07:32,940 --> 00:07:38,819 going even crazier and when things get 187 00:07:36,539 --> 00:07:41,819 cheaper people want more of it this is 188 00:07:38,819 --> 00:07:43,440 our demand for the stuff running through 189 00:07:41,819 --> 00:07:45,900 our software 190 00:07:43,440 --> 00:07:48,060 um that starts around about 2013 and 191 00:07:45,900 --> 00:07:51,180 it's getting nuts right so 192 00:07:48,060 --> 00:07:54,360 good times so how's Python used in 193 00:07:51,180 --> 00:07:56,539 bioinformatics well generally uh the 194 00:07:54,360 --> 00:08:00,419 tools for doing the uh the hardcore 195 00:07:56,539 --> 00:08:03,060 computational stuff are written in C for 196 00:08:00,419 --> 00:08:06,660 speed or sometimes Rust now but Python's 197 00:08:03,060 --> 00:08:09,000 very commonly used to sort of uh join 198 00:08:06,660 --> 00:08:14,220 the multiple tools together so you might 199 00:08:09,000 --> 00:08:15,960 define a workflow and coordinate the 200 00:08:14,220 --> 00:08:18,599 running of all these different jobs in 201 00:08:15,960 --> 00:08:19,860 Python so here's an example I'm just 202 00:08:18,599 --> 00:08:21,240 going to skip over this but there's a 203 00:08:19,860 --> 00:08:24,120 program called Snakemake which is 204 00:08:21,240 --> 00:08:26,879 basically like 
make but it allows Python 205 00:08:24,120 --> 00:08:28,620 so it's pretty cool it's used in bioinformatics 206 00:08:26,879 --> 00:08:32,580 maybe it's useful for you guys as 207 00:08:28,620 --> 00:08:35,599 well for dependency stuff 208 00:08:32,580 --> 00:08:38,219 um so for research typically 209 00:08:35,599 --> 00:08:39,839 researchers work in small teams and the 210 00:08:38,219 --> 00:08:42,360 output is basically a paper so someone 211 00:08:39,839 --> 00:08:44,159 has a biological question and you work 212 00:08:42,360 --> 00:08:46,920 with the biologist for six months to 213 00:08:44,159 --> 00:08:48,959 three years and a lot of the time it's 214 00:08:46,920 --> 00:08:51,899 stuff like we have this data we have a 215 00:08:48,959 --> 00:08:53,880 biological question what can we do and 216 00:08:51,899 --> 00:08:56,519 you often try a whole bunch of stuff you 217 00:08:53,880 --> 00:08:58,860 write one-off programs you investigate 218 00:08:56,519 --> 00:09:02,220 it and Python is amazing for this 219 00:08:58,860 --> 00:09:03,899 because you know I can write Python 10 220 00:09:02,220 --> 00:09:05,880 times faster than C or something I don't 221 00:09:03,899 --> 00:09:07,500 know what it is but it's like it's so 222 00:09:05,880 --> 00:09:10,019 much easier and the thing is we have 223 00:09:07,500 --> 00:09:12,060 HPC we can run we've got hundreds of 224 00:09:10,019 --> 00:09:13,500 cores we can run things in parallel it 225 00:09:12,060 --> 00:09:14,940 doesn't matter how fast it is often we 226 00:09:13,500 --> 00:09:17,580 only run it once we're just 227 00:09:14,940 --> 00:09:19,320 experimenting and if we can experiment 228 00:09:17,580 --> 00:09:21,720 twice in a day rather than once every 229 00:09:19,320 --> 00:09:25,019 few days we converge on the answer 230 00:09:21,720 --> 00:09:26,880 faster and that's great 231 00:09:25,019 --> 00:09:29,880 um so yeah to go back to variant calls 232 00:09:26,880 --> 00:09:31,200 so the original 
sequence is at the 233 00:09:29,880 --> 00:09:33,380 top and the bottom one is at the bottom 234 00:09:31,200 --> 00:09:36,060 that's called a point mutation 235 00:09:33,380 --> 00:09:39,000 we have a file format for this it's 236 00:09:36,060 --> 00:09:42,360 called the VCF format and basically 237 00:09:39,000 --> 00:09:44,880 there's a chromosome location a 238 00:09:42,360 --> 00:09:47,040 chromosome name a position and then the 239 00:09:44,880 --> 00:09:50,459 original sequence and then the new 240 00:09:47,040 --> 00:09:52,440 sequence and a bunch of other stuff so 241 00:09:50,459 --> 00:09:54,959 um I've written some software that 242 00:09:52,440 --> 00:09:58,680 we use at SA Path to analyze this kind 243 00:09:54,959 --> 00:10:01,440 of data and it's called VariantGrid 244 00:09:58,680 --> 00:10:03,060 um so what we do is 245 00:10:01,440 --> 00:10:04,920 um we start with the sequencing which 246 00:10:03,060 --> 00:10:06,779 has a huge amount of data then we 247 00:10:04,920 --> 00:10:10,680 get the diffs and we do all this 248 00:10:06,779 --> 00:10:14,100 on an HPC and then from this 249 00:10:10,680 --> 00:10:16,320 um the VCF for the variant calls then it 250 00:10:14,100 --> 00:10:20,399 gets ingested into our program and I 251 00:10:16,320 --> 00:10:23,160 take over from that point there 252 00:10:20,399 --> 00:10:25,380 um so it's basically a Django app 253 00:10:23,160 --> 00:10:28,100 um it's running on Gunicorn with 254 00:10:25,380 --> 00:10:30,420 Postgres being the central database 255 00:10:28,100 --> 00:10:32,519 and whenever we have anything you know 256 00:10:30,420 --> 00:10:34,980 that takes more than a couple of seconds I farm 257 00:10:32,519 --> 00:10:38,459 it off to a um to a queue and Celery 258 00:10:34,980 --> 00:10:39,899 workers do the work most of the time the 259 00:10:38,459 --> 00:10:42,480 longest running jobs might be say 260 00:10:39,899 --> 00:10:44,279 half an hour which is for annotations 261 
00:10:42,480 --> 00:10:46,440 they run things like computational 262 00:10:44,279 --> 00:10:49,920 prediction so for instance you might 263 00:10:46,440 --> 00:10:53,399 have a sequence and it'll do things like 264 00:10:49,920 --> 00:10:56,579 um look at the charges of the new 265 00:10:53,399 --> 00:10:58,980 modified protein and work out how much 266 00:10:56,579 --> 00:11:00,959 of a sort of how wonky the protein will 267 00:10:58,980 --> 00:11:02,399 go and classify it as pathogenic or 268 00:11:00,959 --> 00:11:04,260 something and so there's all these 269 00:11:02,399 --> 00:11:07,260 different tools and public databases 270 00:11:04,260 --> 00:11:08,700 that we use to work out what a variant 271 00:11:07,260 --> 00:11:11,579 does 272 00:11:08,700 --> 00:11:15,060 so yeah here's the VCF again 273 00:11:11,579 --> 00:11:17,040 um and uh so the way I represent it is 274 00:11:15,060 --> 00:11:18,600 in Django 275 00:11:17,040 --> 00:11:20,399 um yeah because this isn't in the Django 276 00:11:18,600 --> 00:11:21,300 stream not sure if everyone knows Django 277 00:11:20,399 --> 00:11:23,399 but 278 00:11:21,300 --> 00:11:25,019 um basically you can declare classes 279 00:11:23,399 --> 00:11:27,120 that represent 280 00:11:25,019 --> 00:11:30,120 um database tables 281 00:11:27,120 --> 00:11:32,839 um so here's my chrom, contig is just 282 00:11:30,120 --> 00:11:35,160 another biology name for a chromosome 283 00:11:32,839 --> 00:11:37,200 position reference and there might be 284 00:11:35,160 --> 00:11:38,220 multiple variants for a location like 285 00:11:37,200 --> 00:11:40,079 you know 286 00:11:38,220 --> 00:11:42,260 people might not have a T you might have 287 00:11:40,079 --> 00:11:44,940 a C someone else might have a G 288 00:11:42,260 --> 00:11:47,240 and so you can have multiple ones there 289 00:11:44,940 --> 00:11:49,800 and you can write your Python classes 290 00:11:47,240 --> 00:11:52,140 and it sort of handles the SQL for you 291 
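A minimal sketch of the record shape described above (contig, position, reference, and possibly several alternate alleles at one location). It assumes plain Python dataclasses in place of Django models so that it runs without a database. The names `Locus`, `Variant`, and `variants_from_vcf_line` are hypothetical and are not taken from VariantGrid, and the example record is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A position on a contig plus the reference sequence found there."""
    contig: str
    position: int  # 1-based, as in VCF
    ref: str


@dataclass(frozen=True)
class Variant:
    """One alternate allele observed at a locus."""
    locus: Locus
    alt: str


def variants_from_vcf_line(line: str) -> list[Variant]:
    """Split one tab-separated VCF data line into one Variant per ALT allele."""
    # The first five VCF columns are CHROM, POS, ID, REF, ALT.
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    locus = Locus(contig=chrom, position=int(pos), ref=ref)
    # VCF separates multiple alternate alleles with commas.
    return [Variant(locus, allele) for allele in alt.split(",")]


# Made-up record: reference T, with both C and G seen at the same position.
example = "chr1\t12345\t.\tT\tC,G\t50\tPASS\t."
for variant in variants_from_vcf_line(example):
    locus = variant.locus
    print(locus.contig, locus.position, locus.ref, ">", variant.alt)
```

Keeping the locus separate from the alternate allele is one way to let several variants share a location; in an ORM the same split would typically become two tables joined by a foreign key.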
00:11:49,800 --> 00:11:56,160 so instead like on the bottom right here 292 00:11:52,140 --> 00:11:58,740 is the database and on the top left is 293 00:11:56,160 --> 00:12:00,480 how you write the Django code and it 294 00:11:58,740 --> 00:12:02,880 means that instead of writing all the 295 00:12:00,480 --> 00:12:04,320 you know technical computery stuff 296 00:12:02,880 --> 00:12:06,839 you're just sort of working with 297 00:12:04,320 --> 00:12:10,140 variants and samples and you end up 298 00:12:06,839 --> 00:12:12,600 sort of working in your domain which is 299 00:12:10,140 --> 00:12:13,940 really useful instead of computer you 300 00:12:12,600 --> 00:12:15,480 know code 301 00:12:13,940 --> 00:12:17,459 but 302 00:12:15,480 --> 00:12:18,060 um yeah so there's a huge amount of 303 00:12:17,459 --> 00:12:20,399 um 304 00:12:18,060 --> 00:12:23,100 biological data and we get basically 305 00:12:20,399 --> 00:12:24,720 everything that we can find we'll 306 00:12:23,100 --> 00:12:27,480 run it and that you know there's 307 00:12:24,720 --> 00:12:29,700 hundreds we have hundreds of columns of 308 00:12:27,480 --> 00:12:33,420 different tools and public databases 309 00:12:29,700 --> 00:12:36,240 that tell us what a variant does 310 00:12:33,420 --> 00:12:37,740 um so yeah there ends up being uh you 311 00:12:36,240 --> 00:12:39,839 know hundreds of thousands to millions 312 00:12:37,740 --> 00:12:42,300 of variants um and almost all you know 313 00:12:39,839 --> 00:12:44,579 they're things like be a millimeter 314 00:12:42,300 --> 00:12:46,740 taller or shorter or have um a slightly 315 00:12:44,579 --> 00:12:48,000 different color hair or something so you 316 00:12:46,740 --> 00:12:51,240 don't care about that when you're 317 00:12:48,000 --> 00:12:53,399 searching for the um genetic cause of 318 00:12:51,240 --> 00:12:57,300 disease so most of the work is throwing 319 00:12:53,399 --> 00:12:59,100 out all of that stuff and to 320 00:12:57,300 
--> 00:13:00,899 search through this it's not easy you 321 00:12:59,100 --> 00:13:02,339 have to know um it's like a mini 322 00:13:00,899 --> 00:13:04,860 research project you have to know about 323 00:13:02,339 --> 00:13:07,860 the disease how common it is the family 324 00:13:04,860 --> 00:13:10,740 history you end up going and doing uh 325 00:13:07,860 --> 00:13:12,720 looking through the literature and so 326 00:13:10,740 --> 00:13:14,399 there's this team of people called 327 00:13:12,720 --> 00:13:17,279 medical scientists who basically do this 328 00:13:14,399 --> 00:13:19,320 investigation and the goal 329 00:13:17,279 --> 00:13:22,440 of this program is basically to allow those people 330 00:13:19,320 --> 00:13:23,820 to run their own filters on huge amounts 331 00:13:22,440 --> 00:13:26,220 of data because they've got 332 00:13:23,820 --> 00:13:27,360 excellent medical knowledge but not so 333 00:13:26,220 --> 00:13:29,880 you know they're not going to be able to 334 00:13:27,360 --> 00:13:33,480 run Jupyter notebooks or something 335 00:13:29,880 --> 00:13:35,220 so a naive way to do filtering would be 336 00:13:33,480 --> 00:13:37,320 something like this where you basically 337 00:13:35,220 --> 00:13:39,240 build up filters but the trouble is 338 00:13:37,320 --> 00:13:41,639 we're often running something 339 00:13:39,240 --> 00:13:44,160 like 50 to 100 filters and this gets 340 00:13:41,639 --> 00:13:47,279 unwieldy very fast 341 00:13:44,160 --> 00:13:49,139 so basically I have a mate who works um 342 00:13:47,279 --> 00:13:53,579 in video compositing and he showed me 343 00:13:49,139 --> 00:13:55,800 one day a directed acyclic graph often 344 00:13:53,579 --> 00:13:57,360 these are sort of tilted sideways and 345 00:13:55,800 --> 00:13:58,620 basically the output of one node goes 346 00:13:57,360 --> 00:14:02,220 into the other one some kind of 347 00:13:58,620 --> 00:14:04,680 filtering or thing happens and that's 348 
00:14:02,220 --> 00:14:08,000 how it works and that allows people with 349 00:14:04,680 --> 00:14:10,920 domain knowledge to apply computational 350 00:14:08,000 --> 00:14:14,180 work which is exactly what we want to do 351 00:14:10,920 --> 00:14:16,680 so I ripped it off and did it for genomics 352 00:14:14,180 --> 00:14:19,220 so the way it works under the hood is 353 00:14:16,680 --> 00:14:21,779 each node returns a Django Q object 354 00:14:19,220 --> 00:14:24,839 which is basically Django's way of 355 00:14:21,779 --> 00:14:27,120 filtering a queryset or an SQL query so 356 00:14:24,839 --> 00:14:31,820 you can take these Q objects do logical 357 00:14:27,120 --> 00:14:35,639 operations on them and make a SQL query 358 00:14:31,820 --> 00:14:38,339 so here's an example so I have 359 00:14:35,639 --> 00:14:39,779 um here um like I've 360 00:14:38,339 --> 00:14:41,940 got a whole bunch of Q objects one 361 00:14:39,779 --> 00:14:44,820 of them is population frequency greater 362 00:14:41,940 --> 00:14:46,800 than or equal to 0.01 one of them is 363 00:14:44,820 --> 00:14:49,440 gene symbol in this gene list and the 364 00:14:46,800 --> 00:14:52,440 other one is damage greater than 0.0.5 365 00:14:49,440 --> 00:14:53,940 and then you can run 366 00:14:52,440 --> 00:14:58,440 um you know you could write this by 367 00:14:53,940 --> 00:15:02,459 saying you know Q1 and Q2 and Q3 but you 368 00:14:58,440 --> 00:15:05,220 can also just reduce it down applying 369 00:15:02,459 --> 00:15:07,380 AND like that and you can if you want to 370 00:15:05,220 --> 00:15:09,360 have basically you know apply this 371 00:15:07,380 --> 00:15:11,339 filter or this filter or this filter and 372 00:15:09,360 --> 00:15:14,519 this filter you can build all kinds of 373 00:15:11,339 --> 00:15:16,079 logical operations in this way 374 00:15:14,519 --> 00:15:19,620 and this is what the GUI looks like 375 00:15:16,079 --> 00:15:22,139 basically on the left is the nodes the 
376 00:15:19,620 --> 00:15:23,639 filters and you click on 377 00:15:22,139 --> 00:15:26,519 one so we've clicked on the red one in 378 00:15:23,639 --> 00:15:29,040 the middle and that loads the population 379 00:15:26,519 --> 00:15:30,839 frequency and you can choose what 380 00:15:29,040 --> 00:15:33,120 databases you want to use in your 381 00:15:30,839 --> 00:15:36,320 population frequency and below that is 382 00:15:33,120 --> 00:15:39,899 the grid of the variants like the 31 383 00:15:36,320 --> 00:15:42,000 that appear in that node and it allows 384 00:15:39,899 --> 00:15:44,040 the sort of the medical scientist to 385 00:15:42,000 --> 00:15:46,199 sort of jump around and like apply more 386 00:15:44,040 --> 00:15:47,820 or less stringent filters if 387 00:15:46,199 --> 00:15:49,800 something comes through or not they can 388 00:15:47,820 --> 00:15:52,380 adjust the population frequency if it's 389 00:15:49,800 --> 00:15:54,540 not quite known how common the disease 390 00:15:52,380 --> 00:15:56,040 is things like that so it allows them to 391 00:15:54,540 --> 00:15:58,740 do their own filtering 392 00:15:56,040 --> 00:16:00,779 uh and yeah we try and allow the GUI to 393 00:15:58,740 --> 00:16:02,339 work nicely for 394 00:16:00,779 --> 00:16:03,240 um the particular filters they want to 395 00:16:02,339 --> 00:16:05,880 run 396 00:16:03,240 --> 00:16:08,279 and if you click on a variant behind 397 00:16:05,880 --> 00:16:10,980 this loads a page per variant and we 398 00:16:08,279 --> 00:16:13,339 have all the annotation details um that 399 00:16:10,980 --> 00:16:13,339 you can view 400 00:16:13,440 --> 00:16:17,040 um yeah because of the data size we have 401 00:16:15,120 --> 00:16:19,680 something like 30,000 samples each with 402 00:16:17,040 --> 00:16:23,880 a million rows we store everything in 403 00:16:19,680 --> 00:16:25,860 per-patient uh Postgres partitions 404 00:16:23,880 --> 00:16:28,440 and we also filter those we 
might have 405 00:16:25,860 --> 00:16:30,839 multiple partitions per sample so 95% of 406 00:16:28,440 --> 00:16:33,300 a patient's variants are common but a lot 407 00:16:30,839 --> 00:16:34,740 of the time if it's a rare disease you 408 00:16:33,300 --> 00:16:36,899 know it happens one in ten thousand 409 00:16:34,740 --> 00:16:39,660 people you just want to throw away those 410 00:16:36,899 --> 00:16:42,060 really common variants so by putting 411 00:16:39,660 --> 00:16:43,980 them in separate partitions we can 412 00:16:42,060 --> 00:16:45,779 choose to jump to just the partition 413 00:16:43,980 --> 00:16:49,740 we're interested in which makes it 414 00:16:45,779 --> 00:16:52,259 really fast so here's an example where 415 00:16:49,740 --> 00:16:54,660 at the start we have basically 416 00:16:52,259 --> 00:16:56,699 collections is the name of the 417 00:16:54,660 --> 00:16:59,339 partitions that we're after so we always 418 00:16:56,699 --> 00:17:01,339 apply the rare one but then if someone 419 00:16:59,339 --> 00:17:04,380 asks for common variants which means 420 00:17:01,339 --> 00:17:05,780 they haven't applied the you know 421 00:17:04,380 --> 00:17:08,160 population filter 422 00:17:05,780 --> 00:17:10,679 then we also include the common 423 00:17:08,160 --> 00:17:14,040 collection which is like uh 20 times as 424 00:17:10,679 --> 00:17:16,439 big and then we do a FilteredRelation 425 00:17:14,040 --> 00:17:18,540 which is Django's way of joining to a 426 00:17:16,439 --> 00:17:21,799 table with a constraint and because of 427 00:17:18,540 --> 00:17:24,839 that constraint like we say just this 428 00:17:21,799 --> 00:17:27,780 parent table just this one and then 429 00:17:24,839 --> 00:17:29,340 Postgres knows okay well because of that 430 00:17:27,780 --> 00:17:31,200 constraint I only have to look in this 431 00:17:29,340 --> 00:17:33,240 partition and that makes the query 432 00:17:31,200 --> 00:17:35,100 really fast because yeah if we 433 
00:17:33,240 --> 00:17:37,140 didn't jump to the partitions 434 00:17:35,100 --> 00:17:39,120 we're talking about billions and 435 00:17:37,140 --> 00:17:41,520 billions of rows and that's not going to 436 00:17:39,120 --> 00:17:43,320 perform in real time 437 00:17:41,520 --> 00:17:45,240 so once we've found these variants we 438 00:17:43,320 --> 00:17:48,720 have classification and this is the 439 00:17:45,240 --> 00:17:50,220 classification uh part of the project so 440 00:17:48,720 --> 00:17:53,160 basically we auto-populate it but the 441 00:17:50,220 --> 00:17:54,960 medical scientists can also tweak it go 442 00:17:53,160 --> 00:17:56,580 grab information from various other 443 00:17:54,960 --> 00:17:59,280 sources some things you can't automate 444 00:17:56,580 --> 00:18:01,020 like interpreting literature uh well not 445 00:17:59,280 --> 00:18:03,299 yet 446 00:18:01,020 --> 00:18:05,640 um and yeah so ultimately what they do 447 00:18:03,299 --> 00:18:07,860 is they follow a framework called the 448 00:18:05,640 --> 00:18:10,020 ACMG guidelines and they finally 449 00:18:07,860 --> 00:18:12,480 classify a variant in one of five ways 450 00:18:10,020 --> 00:18:15,840 it's benign likely benign 451 00:18:12,480 --> 00:18:18,240 um uncertain and finally likely pathogenic and then 452 00:18:15,840 --> 00:18:21,000 pathogenic that means definitely disease-causing 453 00:18:18,240 --> 00:18:23,820 and then we send that back to 454 00:18:21,000 --> 00:18:26,400 the clinicians and they can look up 455 00:18:23,820 --> 00:18:30,120 drugs and stuff to use 456 00:18:26,400 --> 00:18:33,419 so one of the troubles with this is that 457 00:18:30,120 --> 00:18:35,640 um uh health in Australia 458 00:18:33,419 --> 00:18:37,140 is basically run by states and so 459 00:18:35,640 --> 00:18:39,480 everyone sort of works by themselves and 460 00:18:37,140 --> 00:18:41,039 don't talk to each other so if someone 461 00:18:39,480 --> 00:18:43,020 in Victoria came in and there 
was a 462 00:18:41,039 --> 00:18:45,419 variant that was pathogenic and someone 463 00:18:43,020 --> 00:18:47,880 in South Australia and they had the same 464 00:18:45,419 --> 00:18:49,740 one but they said it was unknown 465 00:18:47,880 --> 00:18:52,140 um then they would never we would never 466 00:18:49,740 --> 00:18:54,480 know and one of them got a bad 467 00:18:52,140 --> 00:18:56,640 diagnosis 468 00:18:54,480 --> 00:18:58,799 um so ideally we'd like to share but 469 00:18:56,640 --> 00:19:01,320 it's actually really difficult because 470 00:18:58,799 --> 00:19:03,480 um it's patient data and there's all 471 00:19:01,320 --> 00:19:05,640 kinds of ethical and security issues to 472 00:19:03,480 --> 00:19:07,559 deal with 473 00:19:05,640 --> 00:19:09,840 um so there's a group called Australian 474 00:19:07,559 --> 00:19:12,360 Genomics and they sort of sit as a 475 00:19:09,840 --> 00:19:14,820 national group um that sort of works 476 00:19:12,360 --> 00:19:18,000 above the states to try and help the 477 00:19:14,820 --> 00:19:21,059 individual states work better especially 478 00:19:18,000 --> 00:19:24,000 better together so they came up with the 479 00:19:21,059 --> 00:19:26,000 plan of Shariant which is 480 00:19:24,000 --> 00:19:28,860 basically they have a central system 481 00:19:26,000 --> 00:19:31,140 where you take the states' individual 482 00:19:28,860 --> 00:19:33,320 classifications and send them up 483 00:19:31,140 --> 00:19:36,179 to a central server with all of the 484 00:19:33,320 --> 00:19:39,240 anonymity and all kinds of stuff that 485 00:19:36,179 --> 00:19:40,980 you have to deal with and if there's a 486 00:19:39,240 --> 00:19:42,720 discrepancy so two different states have 487 00:19:40,980 --> 00:19:45,240 different results then it basically 488 00:19:42,720 --> 00:19:46,380 sends an email and says look sort this 489 00:19:45,240 --> 00:19:48,299 out 490 00:19:46,380 --> 00:19:50,760 and then 
there's a little like workflow 491 00:19:48,299 --> 00:19:52,260 that we send them down 492 00:19:50,760 --> 00:19:54,179 um so yeah this um project's been 493 00:19:52,260 --> 00:19:55,919 running since 2019 and so far we've 494 00:19:54,179 --> 00:19:59,039 linked up basically every public 495 00:19:55,919 --> 00:20:01,620 pathology provider in Australia and 496 00:19:59,039 --> 00:20:03,960 um also just recently New Zealand we're 497 00:20:01,620 --> 00:20:07,980 now starting to go after 498 00:20:03,960 --> 00:20:10,460 um the uh private labs pathology labs 499 00:20:07,980 --> 00:20:13,260 and we're also starting to go into 500 00:20:10,460 --> 00:20:15,720 collecting the cancer ones which are a 501 00:20:13,260 --> 00:20:17,520 little bit different and require a bit 502 00:20:15,720 --> 00:20:19,679 so the ones we've traditionally captured 503 00:20:17,520 --> 00:20:21,299 are the inherited ones and we're 504 00:20:19,679 --> 00:20:22,919 starting to capture the cancer ones 505 00:20:21,299 --> 00:20:24,780 which are have a bit of different 506 00:20:22,919 --> 00:20:27,480 information 507 00:20:24,780 --> 00:20:28,919 so the way it works is basically so 508 00:20:27,480 --> 00:20:32,640 SA Pathology you know they use 509 00:20:28,919 --> 00:20:35,460 VariantGrid and so we can do whatever 510 00:20:32,640 --> 00:20:37,620 we like there but in other states they 511 00:20:35,460 --> 00:20:39,720 use different software and so we sort of 512 00:20:37,620 --> 00:20:41,580 have to deal like write integration 513 00:20:39,720 --> 00:20:45,179 software with them and talk to their 514 00:20:41,580 --> 00:20:47,700 system and then translate the data from 515 00:20:45,179 --> 00:20:52,080 their format into the common format and 516 00:20:47,700 --> 00:20:54,419 then send it up um via API 517 00:20:52,080 --> 00:20:55,980 um and then if two different labs get 518 00:20:54,419 --> 00:20:57,660 their different results then this is 519 00:20:55,980 --> 00:20:58,799 what
we show them a diff 520 00:20:57,660 --> 00:21:00,780 um and then they have them basically 521 00:20:58,799 --> 00:21:02,760 they usually have a meeting bring in the 522 00:21:00,780 --> 00:21:05,880 clinicians and the big guns of their 523 00:21:02,760 --> 00:21:07,799 labs to decide what happens 524 00:21:05,880 --> 00:21:09,419 um and then they sort it out go through 525 00:21:07,799 --> 00:21:11,100 a workflow one of them change well 526 00:21:09,419 --> 00:21:13,080 ideally one of them changes to agree 527 00:21:11,100 --> 00:21:14,640 with the other one and then uh it 528 00:21:13,080 --> 00:21:16,679 eventually cascades through their system 529 00:21:14,640 --> 00:21:19,140 through the API again and then we mark 530 00:21:16,679 --> 00:21:21,059 it as resolved 531 00:21:19,140 --> 00:21:22,919 so yeah one of the troubles 532 00:21:21,059 --> 00:21:26,340 um is how to handle data from many labs 533 00:21:22,919 --> 00:21:28,860 every state does their own thing and so 534 00:21:26,340 --> 00:21:31,440 we had to write custom converters to put 535 00:21:28,860 --> 00:21:34,940 into JSON so sometimes we run it on 536 00:21:31,440 --> 00:21:37,559 their systems sometimes they just dump 537 00:21:34,940 --> 00:21:40,679 data on an S3 bucket and then we run the 538 00:21:37,559 --> 00:21:42,900 code on our cloud system and send it 539 00:21:40,679 --> 00:21:47,520 from our cloud systems back to another 540 00:21:42,900 --> 00:21:50,039 through the API and yeah so we we allow 541 00:21:47,520 --> 00:21:51,780 basically everything we allow people to 542 00:21:50,039 --> 00:21:53,400 to give every bit of information they 543 00:21:51,780 --> 00:21:54,960 have but after but after we've 544 00:21:53,400 --> 00:21:57,179 investigated and asked for them what it 545 00:21:54,960 --> 00:22:00,179 is then we can sort of add a bit of type 546 00:21:57,179 --> 00:22:01,919 information such as float or int and we 547 00:22:00,179 --> 00:22:04,080 try and get them to use a we use
a 548 00:22:01,919 --> 00:22:06,900 controlled vocabulary like an ontology 549 00:22:04,080 --> 00:22:09,059 so if a lab calls it population 550 00:22:06,900 --> 00:22:12,120 frequency another lab calls it pop freq 551 00:22:09,059 --> 00:22:13,679 or something we will convert that and 552 00:22:12,120 --> 00:22:15,720 send it all up on that you know with the 553 00:22:13,679 --> 00:22:18,120 consistency on the API so we try and 554 00:22:15,720 --> 00:22:19,679 hide the um the differences in those 555 00:22:18,120 --> 00:22:22,620 connector programs 556 00:22:19,679 --> 00:22:24,780 so here's what the JSON looks like and 557 00:22:22,620 --> 00:22:27,240 the advantage of this is we have I don't 558 00:22:24,780 --> 00:22:29,580 know maybe 100 150 different fields that 559 00:22:27,240 --> 00:22:33,480 we collect from all the different labs 560 00:22:29,580 --> 00:22:35,940 um and uh what what we do is we've like 561 00:22:33,480 --> 00:22:37,740 um so here there's gnomAD allele number 562 00:22:35,940 --> 00:22:39,840 which is how many 563 00:22:37,740 --> 00:22:41,720 um counts this has been seen in in a 564 00:22:39,840 --> 00:22:44,400 hundred thousand person population 565 00:22:41,720 --> 00:22:48,059 survey we know that's going to be an 566 00:22:44,400 --> 00:22:50,580 integer so we have a basically a key 567 00:22:48,059 --> 00:22:52,140 with an integer on it and we then when 568 00:22:50,580 --> 00:22:55,080 we get the data we can run a validation 569 00:22:52,140 --> 00:22:58,140 and say this is supposed to be an int 570 00:22:55,080 --> 00:22:59,640 um you know uh it's a it's a string and 571 00:22:58,140 --> 00:23:01,260 we can like raise a little warning and 572 00:22:59,640 --> 00:23:03,260 stuff like that which just helps keep 573 00:23:01,260 --> 00:23:07,140 the data pretty clean 574 00:23:03,260 --> 00:23:08,640 so um yeah here's the project timeline 575 00:23:07,140 --> 00:23:11,340 um so basically I started this so I work 576 00:23:08,640 -->
00:23:14,700 as a I work as a researcher on genomics 577 00:23:11,340 --> 00:23:17,820 and I people kept coming to me asking 578 00:23:14,700 --> 00:23:19,320 for me to do filtering and I said why 579 00:23:17,820 --> 00:23:22,080 don't I write some software so that 580 00:23:19,320 --> 00:23:25,919 people don't bother me anymore and then 581 00:23:22,080 --> 00:23:28,020 it got used and then the researchers 582 00:23:25,919 --> 00:23:29,580 started using it then the diagnostic 583 00:23:28,020 --> 00:23:33,780 people started using it we ended up 584 00:23:29,580 --> 00:23:36,960 putting basically we won a um a project 585 00:23:33,780 --> 00:23:39,659 to sort of people have this very rare 586 00:23:36,960 --> 00:23:42,299 disease and a um they basically have 587 00:23:39,659 --> 00:23:43,740 this online site where all researchers 588 00:23:42,299 --> 00:23:45,960 from around the world can upload their 589 00:23:43,740 --> 00:23:47,820 data and share it and we got we're 590 00:23:45,960 --> 00:23:51,120 getting external funding from that from 591 00:23:47,820 --> 00:23:54,240 the states and that's still ongoing that 592 00:23:51,120 --> 00:23:56,760 was um 2018 the first funding and then 593 00:23:54,240 --> 00:24:00,120 um we hired then we won the Shariant uh so 594 00:23:56,760 --> 00:24:01,559 Shariant was a project that we had that 595 00:24:00,120 --> 00:24:04,380 Australian Genomics wanted to do and 596 00:24:01,559 --> 00:24:06,659 VariantGrid won the tender for that and 597 00:24:04,380 --> 00:24:07,500 we used thanks um yeah the tender for 598 00:24:06,659 --> 00:24:10,140 that 599 00:24:07,500 --> 00:24:12,000 um and use VariantGrid technology as 600 00:24:10,140 --> 00:24:14,100 the base for Shariant so we've sort of 601 00:24:12,000 --> 00:24:16,440 have the ability to skin it and turn on 602 00:24:14,100 --> 00:24:19,620 and off things through settings to make 603 00:24:16,440 --> 00:24:22,200 the site work however the client wants 604 00:24:19,620 -->
00:24:23,340 it and you know swap the CSS out and 605 00:24:22,200 --> 00:24:25,260 stuff 606 00:24:23,340 --> 00:24:27,720 um yeah so in 2019 we've got a second 607 00:24:25,260 --> 00:24:30,299 developer which was um amazing 608 00:24:27,720 --> 00:24:31,980 um and 2020 we went open source and you 609 00:24:30,299 --> 00:24:34,500 can go to the GitHub page and check it 610 00:24:31,980 --> 00:24:36,480 out I've um in science 611 00:24:34,500 --> 00:24:39,679 um you're supposed to write a paper to 612 00:24:36,480 --> 00:24:42,059 advertise your project and I've been um 613 00:24:39,679 --> 00:24:43,980 I don't know I'm a programmer not a 614 00:24:42,059 --> 00:24:45,720 writer and I've my boss keeps on my back 615 00:24:43,980 --> 00:24:46,919 to write a paper and one day I'll get to 616 00:24:45,720 --> 00:24:49,500 it 617 00:24:46,919 --> 00:24:51,059 um but yeah and then so it got 618 00:24:49,500 --> 00:24:52,260 diagnostic use and it's been used by 619 00:24:51,059 --> 00:24:55,440 SA Path 620 00:24:52,260 --> 00:24:57,000 um and yeah so that's how it's going so 621 00:24:55,440 --> 00:24:58,260 thanks a lot 622 00:24:57,000 --> 00:24:58,660 um cheers 623 00:24:58,260 --> 00:24:59,700 [Applause] 624 00:24:58,660 --> 00:25:06,890 [Music] 625 00:24:59,700 --> 00:25:06,890 [Applause] 626 00:25:10,500 --> 00:25:16,679 all right thank you very much David what 627 00:25:13,320 --> 00:25:18,220 a great uh sequence of slides can we 628 00:25:16,679 --> 00:25:18,540 have a round of applause for David 629 00:25:18,220 --> 00:25:21,650 [Music] 630 00:25:18,540 --> 00:25:21,650 [Applause] 631 00:25:23,340 --> 00:25:29,039 now we do have a few minutes for 632 00:25:26,640 --> 00:25:30,900 questions so if anybody in the room 633 00:25:29,039 --> 00:25:32,580 wants to put their hand up or if anybody 634 00:25:30,900 --> 00:25:35,580 wants to type their questions into 635 00:25:32,580 --> 00:25:37,740 Discord and we can read them out here 636 00:25:35,580 --> 00:25:39,480
please do so 637 00:25:37,740 --> 00:25:41,159 I'll give you a moment to type things 638 00:25:39,480 --> 00:25:43,860 into Discord because I know it can be 639 00:25:41,159 --> 00:25:46,860 hard to raise your hands sometimes and 640 00:25:43,860 --> 00:25:48,779 I'll check and answer them later if um 641 00:25:46,860 --> 00:25:52,640 anyone else say anything else so but 642 00:25:48,779 --> 00:25:52,640 yeah feel free to ask any questions 643 00:25:52,679 --> 00:25:58,100 all right we have a question from down 644 00:25:54,960 --> 00:25:58,100 here near the front 645 00:25:59,520 --> 00:26:04,080 um yeah I guess I was just wondering how 646 00:26:01,440 --> 00:26:06,559 kind of easy and usable you find the 647 00:26:04,080 --> 00:26:10,919 kind of ecosystem of 648 00:26:06,559 --> 00:26:12,900 data APIs like that NCBI provides for 649 00:26:10,919 --> 00:26:15,539 looking up your sequences and stuff 650 00:26:12,900 --> 00:26:16,980 because I've made my way slowly into 651 00:26:15,539 --> 00:26:20,039 this world and yeah I found it a 652 00:26:16,980 --> 00:26:20,940 struggle yeah um okay that's a great 653 00:26:20,039 --> 00:26:22,620 question 654 00:26:20,940 --> 00:26:24,659 um there are tools there's like um 655 00:26:22,620 --> 00:26:27,240 Entrez um there's a Python library 656 00:26:24,659 --> 00:26:29,460 Biopython and some that sort of manages 657 00:26:27,240 --> 00:26:31,559 that we use that for things like um 658 00:26:29,460 --> 00:26:34,140 querying say how many 659 00:26:31,559 --> 00:26:36,480 um publications there are for a gene or 660 00:26:34,140 --> 00:26:39,600 you know when was the last time or or 661 00:26:36,480 --> 00:26:42,059 other things but basically the 662 00:26:39,600 --> 00:26:43,260 um yeah the tools aren't 663 00:26:42,059 --> 00:26:46,620 um awesome 664 00:26:43,260 --> 00:26:48,000 um we write yeah basically we um we 665 00:26:46,620 --> 00:26:48,659 always wrap our 666 00:26:48,000 --> 00:26:51,659 um 667 00:26:48,659 -->
00:26:53,640 API calls in uh with timeouts and 668 00:26:51,659 --> 00:26:55,980 exceptions because the services are up 669 00:26:53,640 --> 00:26:57,840 and down uh I don't know it's um but 670 00:26:55,980 --> 00:26:58,500 yeah it's a bit of a weird 671 00:26:57,840 --> 00:27:01,260 um 672 00:26:58,500 --> 00:27:03,240 yeah basically a lot of work was done a 673 00:27:01,260 --> 00:27:05,220 long time ago um by 674 00:27:03,240 --> 00:27:06,659 um people that yeah there's a lot of 675 00:27:05,220 --> 00:27:08,220 some somewhat questionable decisions 676 00:27:06,659 --> 00:27:11,820 that you wouldn't necessarily do it this 677 00:27:08,220 --> 00:27:13,380 way uh now and um but yeah it's um it's 678 00:27:11,820 --> 00:27:15,179 a big yeah that's why we need more 679 00:27:13,380 --> 00:27:17,100 programmers in the field 680 00:27:15,179 --> 00:27:20,100 um get get our sort our APIs out and 681 00:27:17,100 --> 00:27:23,520 yeah there's not there's a lot of um uh 682 00:27:20,100 --> 00:27:25,620 custom file formats and wacky old um you 683 00:27:23,520 --> 00:27:27,179 know really inefficient old file formats 684 00:27:25,620 --> 00:27:29,700 that are there instead of you know in my 685 00:27:27,179 --> 00:27:32,880 in my ideal world every service would 686 00:27:29,700 --> 00:27:34,679 have a REST API and it would be trivial 687 00:27:32,880 --> 00:27:36,960 to get everything but that's not a world 688 00:27:34,679 --> 00:27:38,340 we live in yet so but yeah more 689 00:27:36,960 --> 00:27:40,260 programmers would would make that world 690 00:27:38,340 --> 00:27:42,539 a reality so 691 00:27:40,260 --> 00:27:43,620 consider bioinformatics 692 00:27:42,539 --> 00:27:45,960 yeah 693 00:27:43,620 --> 00:27:49,559 all right we have one question from the 694 00:27:45,960 --> 00:27:52,140 Discord uh do the findings push to or 695 00:27:49,559 --> 00:27:55,440 pull from Open Targets Genetics 696 00:27:52,140 --> 00:27:58,200 Open Targets Genetics 697 00:27:55,440 -->
00:27:59,820 um I'm not sure so we go to ClinVar as 698 00:27:58,200 --> 00:28:01,799 well which is a very common oh that's 699 00:27:59,820 --> 00:28:04,620 the most popular as far as I know 700 00:28:01,799 --> 00:28:07,559 um variant classification sharing uh 701 00:28:04,620 --> 00:28:10,340 platform and we collect a lot more data 702 00:28:07,559 --> 00:28:12,480 than they do and we also 703 00:28:10,340 --> 00:28:14,640 collect stuff that hasn't sort of been 704 00:28:12,480 --> 00:28:17,100 approved for public use so it's behind a 705 00:28:14,640 --> 00:28:20,760 like a private server and allows the um 706 00:28:17,100 --> 00:28:23,640 uh sort of labs across the country to 707 00:28:20,760 --> 00:28:27,120 sort out any problems out of 708 00:28:23,640 --> 00:28:28,799 um you know internally and then after a 709 00:28:27,120 --> 00:28:31,020 while they've sorted everything out and 710 00:28:28,799 --> 00:28:32,640 then they share it and then we do once 711 00:28:31,020 --> 00:28:34,620 we sort of set things to the public 712 00:28:32,640 --> 00:28:36,720 share level then we go through and 713 00:28:34,620 --> 00:28:39,779 collect everything and send it to the 714 00:28:36,720 --> 00:28:41,640 external databases for everyone else to 715 00:28:39,779 --> 00:28:44,520 be able to look at so we do talk to 716 00:28:41,640 --> 00:28:45,900 other databases and push it out and we 717 00:28:44,520 --> 00:28:47,640 also pull those in so we can do 718 00:28:45,900 --> 00:28:49,980 comparisons if if ours are different 719 00:28:47,640 --> 00:28:52,320 from theirs so we both push and pull to 720 00:28:49,980 --> 00:28:54,360 external databases 721 00:28:52,320 --> 00:28:56,400 all right that's all the question all 722 00:28:54,360 --> 00:28:59,100 the time we have for questions uh I 723 00:28:56,400 --> 00:29:00,720 have the traditional speaker's gift for 724 00:28:59,100 --> 00:29:02,820 David 725 00:29:00,720 --> 00:29:06,299 the wonderful 726 00:29:02,820 -->
00:29:09,260 PyCon AU speaker's mug thank you one 727 00:29:06,299 --> 00:29:09,260 more round of applause for David
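
The connector behaviour described in the talk (mapping each lab's field names onto one controlled vocabulary, then checking known keys against an expected type and raising a warning on a mismatch) could be sketched roughly as below. This is a minimal illustration and not the project's actual code: the key names, the synonym table and the function names `normalise_keys` and `validate` are all assumptions made for the example.

```python
# Hypothetical synonym table: each lab's spelling maps to one canonical key.
KEY_SYNONYMS = {
    "population frequency": "population_frequency",
    "pop freq": "population_frequency",
    "gnomad allele number": "gnomad_allele_number",
}

# Hypothetical type information, added only for keys that have been investigated.
KEY_TYPES = {
    "population_frequency": float,
    "gnomad_allele_number": int,
}


def normalise_keys(record):
    """Rename lab-specific keys to canonical ones; unknown keys pass through."""
    normalised = {}
    for key, value in record.items():
        canonical = KEY_SYNONYMS.get(key.strip().lower(), key)
        normalised[canonical] = value
    return normalised


def validate(record):
    """Return a list of warnings for values whose type does not match the schema."""
    warnings = []
    for key, value in record.items():
        expected = KEY_TYPES.get(key)
        if expected is None:
            continue  # untyped keys are accepted as-is
        is_bool = isinstance(value, bool)  # bool is a subclass of int, so exclude it
        ok = isinstance(value, expected) and not is_bool
        if expected is float and isinstance(value, int) and not is_bool:
            ok = True  # a whole number is acceptable where a float is expected
        if not ok:
            warnings.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return warnings


if __name__ == "__main__":
    lab_record = {"Pop Freq": 0.01, "gnomAD allele number": "152000"}
    record = normalise_keys(lab_record)
    for warning in validate(record):
        print("warning:", warning)
```

Returning warnings rather than raising an exception matches the "accept everything, then flag it" approach described in the talk: a record with an unexpected string is still stored, and the lab can be asked about it afterwards.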