2 00:00:08,400 --> 00:00:12,120 thank you um I'd like to thank the 3 00:00:10,620 --> 00:00:13,679 organizers for giving me the opportunity 4 00:00:12,120 --> 00:00:15,299 to talk to you today 5 00:00:13,679 --> 00:00:17,340 um you know this is my first time 6 00:00:15,299 --> 00:00:19,140 presenting at PyCon I usually present to 7 00:00:17,340 --> 00:00:20,460 rooms of scientific researchers so this is a 8 00:00:19,140 --> 00:00:22,500 very different style of talk than what 9 00:00:20,460 --> 00:00:24,000 I'm used to but I'm really excited to 10 00:00:22,500 --> 00:00:25,980 tell you the story of an ongoing 11 00:00:24,000 --> 00:00:28,140 academic open source project to build a 12 00:00:25,980 --> 00:00:29,820 biological database for experimental 13 00:00:28,140 --> 00:00:31,619 data that's generated by researchers in 14 00:00:29,820 --> 00:00:33,120 genomics and I'm really excited to share 15 00:00:31,619 --> 00:00:34,860 it with this community because Django 16 00:00:33,120 --> 00:00:36,120 really plays a starring role and while 17 00:00:34,860 --> 00:00:37,640 there's a bit of technical stuff and 18 00:00:36,120 --> 00:00:40,140 I'll explain a little bit of the biology 19 00:00:37,640 --> 00:00:42,120 to kind of motivate things this is 20 00:00:40,140 --> 00:00:44,700 mostly a talk about how we tried to match 21 00:00:42,120 --> 00:00:46,620 our team's shifting expertise and variable 22 00:00:44,700 --> 00:00:49,140 level of resourcing to available open 23 00:00:46,620 --> 00:00:50,820 source tools in order to achieve you 24 00:00:49,140 --> 00:00:52,260 know the goals and requirements of this 25 00:00:50,820 --> 00:00:54,180 project that really changed a lot over 26 00:00:52,260 --> 00:00:55,559 time and I hope it also maybe gives you 27 00:00:54,180 --> 00:00:57,300 a little bit of insight into what it's 28 00:00:55,559 --> 00:00:59,280 like to do software development in an 29 00:00:57,300 --> 00:01:01,320 academic research 
context which I think 30 00:00:59,280 --> 00:01:03,239 is both in some ways similar and very 31 00:01:01,320 --> 00:01:04,440 different to how many people do software 32 00:01:03,239 --> 00:01:07,500 development 33 00:01:04,440 --> 00:01:10,080 so a little bit about me so I am a 34 00:01:07,500 --> 00:01:11,760 computational biologist I'm a researcher 35 00:01:10,080 --> 00:01:13,680 at the Walter and Eliza Hall Institute of 36 00:01:11,760 --> 00:01:15,720 Medical Research or WEHI in Melbourne 37 00:01:13,680 --> 00:01:17,340 Australia this work has all been 38 00:01:15,720 --> 00:01:18,479 done in close collaboration with 39 00:01:17,340 --> 00:01:19,680 colleagues at the University of 40 00:01:18,479 --> 00:01:21,900 Washington and the Brotman Baty 41 00:01:19,680 --> 00:01:24,000 Institute for Precision Medicine in 42 00:01:21,900 --> 00:01:26,759 Seattle and the work that I'm talking 43 00:01:24,000 --> 00:01:28,140 about has been supported by competitive 44 00:01:26,759 --> 00:01:30,659 grants from the Medical Research Future 45 00:01:28,140 --> 00:01:32,280 Fund here in Australia and also from the 46 00:01:30,659 --> 00:01:33,840 National Human Genome Research Institute 47 00:01:32,280 --> 00:01:36,180 which is part of the National Institutes 48 00:01:33,840 --> 00:01:37,860 of Health in the United States 49 00:01:36,180 --> 00:01:40,020 and so with that I'd like to just tell 50 00:01:37,860 --> 00:01:43,640 you a little bit about what kind of data 51 00:01:40,020 --> 00:01:43,640 we're storing and why we're doing that 52 00:01:44,400 --> 00:01:48,420 so fundamentally the data that we're 53 00:01:46,259 --> 00:01:50,640 storing is about proteins so proteins are 54 00:01:48,420 --> 00:01:52,200 a type of biological molecule they have 55 00:01:50,640 --> 00:01:53,280 a lot of different functions you 56 00:01:52,200 --> 00:01:55,799 can think of them sort of as little 57 00:01:53,280 --> 00:01:58,560 machines inside your cells that perform 58 
00:01:55,799 --> 00:02:01,500 all of the functions that your cells 59 00:01:58,560 --> 00:02:02,820 need to perform so antibodies are a kind 60 00:02:01,500 --> 00:02:04,439 of protein that I think everyone's 61 00:02:02,820 --> 00:02:06,860 learned a lot about over the last few 62 00:02:04,439 --> 00:02:09,179 years that are involved in immunity 63 00:02:06,860 --> 00:02:10,319 enzymes are a class of proteins that 64 00:02:09,179 --> 00:02:11,700 assist in all types of different 65 00:02:10,319 --> 00:02:14,099 chemical reactions whether that's 66 00:02:11,700 --> 00:02:15,660 metabolism or synthesizing the various 67 00:02:14,099 --> 00:02:17,580 different chemicals that your body needs 68 00:02:15,660 --> 00:02:19,620 they have a lot of different 69 00:02:17,580 --> 00:02:21,660 roles and then there are a lot of other 70 00:02:19,620 --> 00:02:23,520 sort of 71 00:02:21,660 --> 00:02:25,739 different functions that proteins have 72 00:02:23,520 --> 00:02:27,660 from forming the physical structure of 73 00:02:25,739 --> 00:02:30,180 your cells to things like this chaperone 74 00:02:27,660 --> 00:02:32,640 protein here which actually forms like a 75 00:02:30,180 --> 00:02:34,680 little canister that can open up and 76 00:02:32,640 --> 00:02:37,500 contain other proteins and help them 77 00:02:34,680 --> 00:02:39,599 stabilize and I'll just mention 78 00:02:37,500 --> 00:02:41,400 that all of these images of these 79 00:02:39,599 --> 00:02:42,900 protein structures that I'm showing are 80 00:02:41,400 --> 00:02:44,819 from the excellent Molecule of the Month 81 00:02:42,900 --> 00:02:47,180 series which is run by the Protein Data 82 00:02:44,819 --> 00:02:49,560 Bank sort of a kind of a public science 83 00:02:47,180 --> 00:02:51,120 sort of series which I highly recommend 84 00:02:49,560 --> 00:02:52,620 if you want to learn more about the 85 00:02:51,120 --> 00:02:56,420 interesting things that proteins can do 86 00:02:52,620 --> 
00:02:56,420 and see more of these cool images 87 00:02:57,599 --> 00:03:02,160 now decades of work in molecular biology 88 00:03:00,060 --> 00:03:05,640 has taught us this relationship which 89 00:03:02,160 --> 00:03:08,459 is that DNA gets transcribed into RNA 90 00:03:05,640 --> 00:03:10,379 and RNA gets translated into protein and 91 00:03:08,459 --> 00:03:11,819 it's helped us work out the mechanics of 92 00:03:10,379 --> 00:03:13,319 this and the grammar and I'm not going 93 00:03:11,819 --> 00:03:15,060 to go into this because I don't want to 94 00:03:13,319 --> 00:03:16,500 spend all of this time giving you a 95 00:03:15,060 --> 00:03:17,640 public science lecture although if you 96 00:03:16,500 --> 00:03:19,140 want to learn more about any of this 97 00:03:17,640 --> 00:03:20,760 stuff you can come corner me 98 00:03:19,140 --> 00:03:22,920 at the conference later and I'll answer 99 00:03:20,760 --> 00:03:24,420 any of your questions so we're not going 100 00:03:22,920 --> 00:03:26,400 to focus on this but the takeaway here 101 00:03:24,420 --> 00:03:28,440 is that the DNA sequence which is stored 102 00:03:26,400 --> 00:03:30,599 in your genome inside the nucleus of 103 00:03:28,440 --> 00:03:32,940 your cells contains all of the 104 00:03:30,599 --> 00:03:35,760 information that is needed to determine 105 00:03:32,940 --> 00:03:37,680 the sequence of the protein so the 106 00:03:35,760 --> 00:03:39,840 information flows largely in this way 107 00:03:37,680 --> 00:03:42,120 and that's kind of the take-home 108 00:03:39,840 --> 00:03:45,299 message here 109 00:03:42,120 --> 00:03:47,580 now proteins are chains of amino acids 110 00:03:45,299 --> 00:03:49,440 it's a biological polymer the amino 111 00:03:47,580 --> 00:03:51,780 acids are the subunits and it's just a 112 00:03:49,440 --> 00:03:53,700 single chain of amino acids there are 20 113 00:03:51,780 --> 00:03:55,739 canonical amino acids so we can give 114 00:03:53,700 --> 00:03:57,060 
each of them a single letter and we can 115 00:03:55,739 --> 00:03:59,099 just write the sequence as a string 116 00:03:57,060 --> 00:04:03,319 which is what we're doing here so this 117 00:03:59,099 --> 00:04:06,200 is the amino acid sequence 118 00:04:03,319 --> 00:04:09,120 of this protein that I showed you before 119 00:04:06,200 --> 00:04:11,700 and so the important thing that I want 120 00:04:09,120 --> 00:04:14,519 to tell you about this is that the 121 00:04:11,700 --> 00:04:16,260 sequence of amino acids determines the 122 00:04:14,519 --> 00:04:19,019 structure of the protein so this 123 00:04:16,260 --> 00:04:21,060 chain of amino acids is going to fold up 124 00:04:19,019 --> 00:04:22,500 and it's going to fold up into some 125 00:04:21,060 --> 00:04:24,900 shape based on the biophysical 126 00:04:22,500 --> 00:04:26,699 properties of those amino acids and the 127 00:04:24,900 --> 00:04:28,800 shape and the position of the atoms in 128 00:04:26,699 --> 00:04:30,540 that protein structure are what allow 129 00:04:28,800 --> 00:04:32,699 the protein to perform its function or 130 00:04:30,540 --> 00:04:35,340 to do its job so that's what's important 131 00:04:32,699 --> 00:04:37,979 and some of you who you know follow this 132 00:04:35,340 --> 00:04:40,020 uh kind of aspect of science may 133 00:04:37,979 --> 00:04:41,460 recognize this as the protein folding 134 00:04:40,020 --> 00:04:42,660 problem and it's something that a lot of 135 00:04:41,460 --> 00:04:44,820 groups have spent a lot of time trying 136 00:04:42,660 --> 00:04:45,800 to predict how do you go from a 137 00:04:44,820 --> 00:04:49,199 sequence 138 00:04:45,800 --> 00:04:50,699 to the structure 139 00:04:49,199 --> 00:04:52,560 um and so the structure allows the 140 00:04:50,699 --> 00:04:53,880 protein to do its function so the 141 00:04:52,560 --> 00:04:55,620 function of this protein which is a 142 00:04:53,880 --> 00:04:57,600 caspase protein it's involved in programmed 143 
00:04:55,620 --> 00:04:59,940 cell death is that it kills the cell 144 00:04:57,600 --> 00:05:01,320 when it's time to kill the cell and 145 00:04:59,940 --> 00:05:03,180 there's a lot of contexts in which that's 146 00:05:01,320 --> 00:05:04,639 important including cancer and also 147 00:05:03,180 --> 00:05:07,500 normal development 148 00:05:04,639 --> 00:05:09,300 but the take-home here is that the amino 149 00:05:07,500 --> 00:05:10,979 acid sequence determines whether or not 150 00:05:09,300 --> 00:05:14,520 the protein is going to be able to do 151 00:05:10,979 --> 00:05:16,860 its function based on its structure 152 00:05:14,520 --> 00:05:18,479 now I told you before that we can get to 153 00:05:16,860 --> 00:05:20,100 the amino acid sequence from the DNA 154 00:05:18,479 --> 00:05:21,780 sequence so if I just swap over to the 155 00:05:20,100 --> 00:05:23,759 DNA sequence here we can say that the 156 00:05:21,780 --> 00:05:25,440 DNA sequence is what determines 157 00:05:23,759 --> 00:05:27,539 whether the protein can do 158 00:05:25,440 --> 00:05:30,840 its function and so if 159 00:05:27,539 --> 00:05:32,520 we change this A to a T which is there 160 00:05:30,840 --> 00:05:34,080 in red kind of in the bottom 161 00:05:32,520 --> 00:05:34,800 middle there 162 00:05:34,080 --> 00:05:37,380 um 163 00:05:34,800 --> 00:05:39,539 you know 164 00:05:37,380 --> 00:05:41,280 that DNA change may cause a change in 165 00:05:39,539 --> 00:05:43,020 the amino acid sequence that change in 166 00:05:41,280 --> 00:05:44,699 the amino acid sequence may make a 167 00:05:43,020 --> 00:05:46,740 meaningful change to the structure and 168 00:05:44,699 --> 00:05:48,419 that change to the structure may have a 169 00:05:46,740 --> 00:05:50,639 meaningful impact on whether or not the 170 00:05:48,419 --> 00:05:52,680 protein can do its function and despite 171 00:05:50,639 --> 00:05:55,259 the success of tools 
like AlphaFold 2 172 00:05:52,680 --> 00:05:57,180 and PrimateAI and other sort of large 173 00:05:55,259 --> 00:05:59,160 predictive models the effect of a 174 00:05:57,180 --> 00:06:01,080 specific amino acid change on a protein 175 00:05:59,160 --> 00:06:02,880 structure and therefore function remains 176 00:06:01,080 --> 00:06:04,320 very difficult to predict and there's 177 00:06:02,880 --> 00:06:06,180 still really no substitute for doing 178 00:06:04,320 --> 00:06:08,160 experiments and advances in genomic 179 00:06:06,180 --> 00:06:10,800 technology particularly high throughput 180 00:06:08,160 --> 00:06:12,840 DNA sequencing and high throughput DNA 181 00:06:10,800 --> 00:06:14,759 synthesis allow us to look at these 182 00:06:12,840 --> 00:06:16,320 changes at scale and that's the type of 183 00:06:14,759 --> 00:06:19,740 experimental data that we're trying to 184 00:06:16,320 --> 00:06:22,139 store in this database 185 00:06:19,740 --> 00:06:24,419 um and so what researchers can do now is 186 00:06:22,139 --> 00:06:26,639 we can measure every change in a single 187 00:06:24,419 --> 00:06:28,560 protein at once so we make a population 188 00:06:26,639 --> 00:06:31,740 of cells where each one has a different 189 00:06:28,560 --> 00:06:33,479 variant of the protein and then we 190 00:06:31,740 --> 00:06:35,400 compete them against each other in some 191 00:06:33,479 --> 00:06:37,979 sort of functional assay and we see 192 00:06:35,400 --> 00:06:40,560 whether or not the protein works under 193 00:06:37,979 --> 00:06:43,319 that assay condition and then 194 00:06:40,560 --> 00:06:44,340 through the application of high 195 00:06:43,319 --> 00:06:46,199 throughput sequencing and then 196 00:06:44,340 --> 00:06:47,759 bioinformatics what we can do is we can 197 00:06:46,199 --> 00:06:50,699 take the data that comes out of this 198 00:06:47,759 --> 00:06:52,500 assay and we can assign a functional 199 00:06:50,699 --> 00:06:54,840 score for each one of 
those changes that 200 00:06:52,500 --> 00:06:56,460 we're interested in and we can plot it 201 00:06:54,840 --> 00:06:58,020 you know using a heat 202 00:06:56,460 --> 00:07:00,300 map like this and we get all of this 203 00:06:58,020 --> 00:07:02,340 information about which positions in the 204 00:07:00,300 --> 00:07:03,720 protein are important what changes are 205 00:07:02,340 --> 00:07:05,220 tolerated and there's all kinds of 206 00:07:03,720 --> 00:07:06,780 interesting applications for this data 207 00:07:05,220 --> 00:07:08,580 if you want to learn more about the 208 00:07:06,780 --> 00:07:10,500 bioinformatics behind calculating these 209 00:07:08,580 --> 00:07:13,380 scores and how all of this works and 210 00:07:10,500 --> 00:07:15,120 hear this stuff explained by a software 211 00:07:13,380 --> 00:07:16,380 engineer rather than by me which I think 212 00:07:15,120 --> 00:07:18,419 is going to give you some interesting 213 00:07:16,380 --> 00:07:22,080 additional context please go see Nick 214 00:07:18,419 --> 00:07:23,759 Morse talk tomorrow which I think is at 215 00:07:22,080 --> 00:07:25,800 noon in this room unless the schedule 216 00:07:23,759 --> 00:07:27,539 has changed 217 00:07:25,800 --> 00:07:29,220 um so that's the background that I 218 00:07:27,539 --> 00:07:31,560 wanted to give you about 219 00:07:29,220 --> 00:07:34,440 um the science I hope that it was 220 00:07:31,560 --> 00:07:35,880 you know somewhat clear and uh and that 221 00:07:34,440 --> 00:07:38,099 gives you some idea of kind of what 222 00:07:35,880 --> 00:07:40,020 we're dealing with and so we wanted to 223 00:07:38,099 --> 00:07:41,819 build a database to store all this 224 00:07:40,020 --> 00:07:43,740 information and so why do we want a 225 00:07:41,819 --> 00:07:46,020 database when we talk about research 226 00:07:43,740 --> 00:07:48,060 data and um 227 00:07:46,020 --> 00:07:48,840 and what we want to do with it and 228 00:07:48,060 --> 
00:07:51,120 and 229 00:07:48,840 --> 00:07:52,500 how to have it have the most impact a 230 00:07:51,120 --> 00:07:55,199 lot of times we talk about how we want 231 00:07:52,500 --> 00:07:57,180 data to be FAIR and FAIR data is an 232 00:07:55,199 --> 00:07:59,639 acronym for findable accessible 233 00:07:57,180 --> 00:08:01,319 interoperable and reusable and so we 234 00:07:59,639 --> 00:08:03,419 want all the public data that's being 235 00:08:01,319 --> 00:08:05,099 you know generated especially the stuff 236 00:08:03,419 --> 00:08:06,599 that's being funded by government grants 237 00:08:05,099 --> 00:08:08,880 and is generated with public funds we 238 00:08:06,599 --> 00:08:10,560 want all that data to be FAIR we 239 00:08:08,880 --> 00:08:12,240 want it to be open we want everybody to 240 00:08:10,560 --> 00:08:14,819 be able to use it and so we need to 241 00:08:12,240 --> 00:08:16,919 organize it in some way 242 00:08:14,819 --> 00:08:18,780 and uh one of the best ways to do this 243 00:08:16,919 --> 00:08:19,919 is through applying data standards so 244 00:08:18,780 --> 00:08:21,780 I'm just going to show you some quick 245 00:08:19,919 --> 00:08:24,060 examples of what actual data looks like 246 00:08:21,780 --> 00:08:25,979 from actual papers chosen basically at 247 00:08:24,060 --> 00:08:27,660 random by what I just had downloaded on 248 00:08:25,979 --> 00:08:29,940 my laptop and that I opened up in a 249 00:08:27,660 --> 00:08:31,500 spreadsheet program so here's one where 250 00:08:29,940 --> 00:08:33,120 you actually have the amino acid 251 00:08:31,500 --> 00:08:35,039 sequences for all the different variants 252 00:08:33,120 --> 00:08:36,839 listed there are some quantities that 253 00:08:35,039 --> 00:08:39,120 were measured here's another one that 254 00:08:36,839 --> 00:08:42,779 uses a very sort of terse single letter 255 00:08:39,120 --> 00:08:44,760 position single letter code to describe 256 00:08:42,779 --> 00:08:47,640 the 
variations on a sequence which is 257 00:08:44,760 --> 00:08:49,019 not depicted in the spreadsheet here's 258 00:08:47,640 --> 00:08:50,160 another one that does something similar 259 00:08:49,019 --> 00:08:52,140 they measured a whole bunch of 260 00:08:50,160 --> 00:08:53,880 additional different things 261 00:08:52,140 --> 00:08:56,040 during their assay it's not 262 00:08:53,880 --> 00:08:58,320 necessarily clear from this screenshot 263 00:08:56,040 --> 00:09:00,060 what the most important value is if you 264 00:08:58,320 --> 00:09:01,860 just want to know what does this protein 265 00:09:00,060 --> 00:09:02,519 variant do 266 00:09:01,860 --> 00:09:04,620 um 267 00:09:02,519 --> 00:09:06,420 and here's another different one 268 00:09:04,620 --> 00:09:08,459 that is actually 269 00:09:06,420 --> 00:09:09,899 represented as a matrix as you would 270 00:09:08,459 --> 00:09:11,459 make if you wanted to build one of those 271 00:09:09,899 --> 00:09:13,140 heat maps that I showed you briefly on 272 00:09:11,459 --> 00:09:15,660 the previous slide so this is kind of 273 00:09:13,140 --> 00:09:17,640 all over the place right and if what you 274 00:09:15,660 --> 00:09:19,620 want to do is download a data set 275 00:09:17,640 --> 00:09:21,779 because you're interested in some 276 00:09:19,620 --> 00:09:23,399 protein and what it does 277 00:09:21,779 --> 00:09:25,440 um you're going to maybe have a lot of 278 00:09:23,399 --> 00:09:27,360 kind of munging and data cleaning and 279 00:09:25,440 --> 00:09:29,160 you know parsing that you have to do for 280 00:09:27,360 --> 00:09:30,839 this and even worse if you want to do 281 00:09:29,160 --> 00:09:32,220 something like you know do a big 282 00:09:30,839 --> 00:09:34,140 meta-analysis or build a machine 283 00:09:32,220 --> 00:09:37,800 learning model that is a huge amount of 284 00:09:34,140 --> 00:09:40,080 work for you or probably your student to 285 00:09:37,800 --> 00:09:42,240 kind of get 
all of this stuff into shape 286 00:09:40,080 --> 00:09:45,240 so that you can actually start doing the 287 00:09:42,240 --> 00:09:47,700 work so if we can build a database and 288 00:09:45,240 --> 00:09:49,800 store everything in a standard format and 289 00:09:47,700 --> 00:09:52,140 make that available it's really going to 290 00:09:49,800 --> 00:09:54,660 save a lot of people a lot of time 291 00:09:52,140 --> 00:09:56,220 so I'm going to keep coming back to this 292 00:09:54,660 --> 00:09:58,140 project timeline where I'm going to kind 293 00:09:56,220 --> 00:09:59,880 of walk you through how 294 00:09:58,140 --> 00:10:01,860 our project went 295 00:09:59,880 --> 00:10:03,420 um and so faced with this increasingly 296 00:10:01,860 --> 00:10:05,760 unmanageable situation in the literature 297 00:10:03,420 --> 00:10:07,019 that I just gave you a sneak peek of 298 00:10:05,760 --> 00:10:08,459 um my colleagues at the University of 299 00:10:07,019 --> 00:10:10,080 Washington met with some of the other 300 00:10:08,459 --> 00:10:11,519 pioneers in this field and they decided 301 00:10:10,080 --> 00:10:12,720 that our field needed its own database 302 00:10:11,519 --> 00:10:15,540 because the data is sufficiently 303 00:10:12,720 --> 00:10:16,860 different from other types of data to 304 00:10:15,540 --> 00:10:19,100 not be well served by anything that 305 00:10:16,860 --> 00:10:21,240 already existed and at the time 306 00:10:19,100 --> 00:10:23,160 databases were really either kind of 307 00:10:21,240 --> 00:10:25,140 tiny boutique efforts like you know 308 00:10:23,160 --> 00:10:27,839 information just about a very specific 309 00:10:25,140 --> 00:10:29,279 gene or specific biological process or 310 00:10:27,839 --> 00:10:30,660 they were run by centrally funded 311 00:10:29,279 --> 00:10:32,700 government research organizations so 312 00:10:30,660 --> 00:10:34,440 building a database that was sort of you 313 00:10:32,700 --> 00:10:37,560 know 
mid-sized to serve this growing 314 00:10:34,440 --> 00:10:39,660 field seemed really fanciful 315 00:10:37,560 --> 00:10:41,820 and then I attended PyCon AU for the 316 00:10:39,660 --> 00:10:43,440 first time in Melbourne in 2016 and I 317 00:10:41,820 --> 00:10:44,940 learned about Django and after some 318 00:10:43,440 --> 00:10:46,920 conversations with some people who are 319 00:10:44,940 --> 00:10:48,420 here in this room today I was filled 320 00:10:46,920 --> 00:10:50,519 with hubris and I contacted my 321 00:10:48,420 --> 00:10:51,959 collaborators in Seattle and I told them 322 00:10:50,519 --> 00:10:54,380 that I could build the database that 323 00:10:51,959 --> 00:10:54,380 they needed 324 00:10:55,190 --> 00:10:58,380 [Applause] 325 00:10:56,399 --> 00:11:00,000 so what did we want 326 00:10:58,380 --> 00:11:01,740 to do like what did success look like to 327 00:11:00,000 --> 00:11:03,779 us when we started this project 328 00:11:01,740 --> 00:11:06,180 well we wanted to build a database where 329 00:11:03,779 --> 00:11:08,880 users could upload their data sets 330 00:11:06,180 --> 00:11:10,620 and we wanted to make 331 00:11:08,880 --> 00:11:11,839 it so that users could find and download 332 00:11:10,620 --> 00:11:13,980 data sets that were 333 00:11:11,839 --> 00:11:15,360 interesting to them and this is entirely 334 00:11:13,980 --> 00:11:16,500 targeted at researchers who are 335 00:11:15,360 --> 00:11:18,720 generating the data sets that we're 336 00:11:16,500 --> 00:11:22,380 talking about and we needed to make sure 337 00:11:18,720 --> 00:11:24,240 that it could be managed by a very small 338 00:11:22,380 --> 00:11:26,279 team with limited expertise because 339 00:11:24,240 --> 00:11:28,440 that's what we had 340 00:11:26,279 --> 00:11:29,940 so when we got started we got a very 341 00:11:28,440 --> 00:11:31,380 small amount of money from a 342 00:11:29,940 --> 00:11:33,660 collaborator 
who had some flexible 343 00:11:31,380 --> 00:11:36,079 funding for which this work was in scope 344 00:11:33,660 --> 00:11:39,360 and we got started and it was just me 345 00:11:36,079 --> 00:11:40,800 and a student who had just finished his 346 00:11:39,360 --> 00:11:42,959 master's in bioinformatics he wasn't 347 00:11:40,800 --> 00:11:44,399 sure if he wanted to do a PhD and he 348 00:11:42,959 --> 00:11:46,380 decided he wanted to do a little bit of 349 00:11:44,399 --> 00:11:48,360 work first as like you know doing some 350 00:11:46,380 --> 00:11:51,240 software stuff and so he'd done a little 351 00:11:48,360 --> 00:11:52,980 bit of work with like websites and we'd 352 00:11:51,240 --> 00:11:55,500 both heard of Django so we figured that 353 00:11:52,980 --> 00:11:57,420 was good enough to get started and in 354 00:11:55,500 --> 00:11:59,100 the beginning we basically followed the 355 00:11:57,420 --> 00:12:01,380 Django Girls tutorial but instead of 356 00:11:59,100 --> 00:12:04,339 making a blog we built a database for 357 00:12:01,380 --> 00:12:04,339 genomic researchers 358 00:12:05,160 --> 00:12:09,300 and one of the really great benefits of 359 00:12:07,620 --> 00:12:10,440 working with Django you know this worked 360 00:12:09,300 --> 00:12:11,760 one of the benefits of working with 361 00:12:10,440 --> 00:12:12,779 Django especially at the beginning of 362 00:12:11,760 --> 00:12:15,959 the project when we didn't know anything 363 00:12:12,779 --> 00:12:17,519 was that it let us really focus on the 364 00:12:15,959 --> 00:12:19,680 higher level concepts like getting the 365 00:12:17,519 --> 00:12:21,060 organization of the data right 366 00:12:19,680 --> 00:12:23,160 um so we settled on this hierarchical 367 00:12:21,060 --> 00:12:24,779 organization with these nested records 368 00:12:23,160 --> 00:12:27,300 of different types where an experiment 369 00:12:24,779 --> 00:12:29,820 set encapsulates all of the studies in a 370 00:12:27,300 --> 
00:12:32,399 single paper an experiment describes all 371 00:12:29,820 --> 00:12:34,140 the wet lab steps and the sequencing and 372 00:12:32,399 --> 00:12:35,880 then these score set records store the 373 00:12:34,140 --> 00:12:38,519 actual data and how the data analysis 374 00:12:35,880 --> 00:12:40,320 was performed so we kind of 375 00:12:38,519 --> 00:12:41,579 have this hierarchy which becomes 376 00:12:40,320 --> 00:12:43,620 important when we have a more 377 00:12:41,579 --> 00:12:46,560 complicated experimental design so this 378 00:12:43,620 --> 00:12:47,339 is a real example where we had one set 379 00:12:46,560 --> 00:12:49,260 of 380 00:12:47,339 --> 00:12:51,899 protein variants that was 381 00:12:49,260 --> 00:12:54,480 assayed in a bacteriophage model system 382 00:12:51,899 --> 00:12:56,579 and in a yeast model system and then 383 00:12:54,480 --> 00:12:58,620 they were each analyzed to produce two 384 00:12:56,579 --> 00:13:01,440 different data sets 385 00:12:58,620 --> 00:13:03,120 um and uh and the important thing for us 386 00:13:01,440 --> 00:13:04,800 here was that if somebody 387 00:13:03,120 --> 00:13:07,620 grabbed one of these data sets they 388 00:13:04,800 --> 00:13:09,180 would now be able to easily find all of 389 00:13:07,620 --> 00:13:11,399 the other data sets that were related in 390 00:13:09,180 --> 00:13:14,220 the database and they would be able to 391 00:13:11,399 --> 00:13:15,720 see if the raw data had been analyzed in 392 00:13:14,220 --> 00:13:17,940 a different way or there were alternate 393 00:13:15,720 --> 00:13:19,079 representations of the data and so these 394 00:13:17,940 --> 00:13:20,700 were the kinds of things that we could 395 00:13:19,079 --> 00:13:22,500 spend our time thinking about rather 396 00:13:20,700 --> 00:13:25,560 than thinking about how to just get 397 00:13:22,500 --> 00:13:27,060 words to appear in the browser 398 00:13:25,560 --> 00:13:28,800 um 399 00:13:27,060 --> 
00:13:30,300 so when we kicked off 400 00:13:28,800 --> 00:13:31,920 um this is kind of the tech stack that 401 00:13:30,300 --> 00:13:34,760 we were working with so the 402 00:13:31,920 --> 00:13:37,079 database was hosted in a university 403 00:13:34,760 --> 00:13:39,600 department within a medical school 404 00:13:37,079 --> 00:13:42,540 they're very concerned about making sure 405 00:13:39,600 --> 00:13:43,440 that everything is working 406 00:13:42,540 --> 00:13:46,560 um 407 00:13:43,440 --> 00:13:49,019 and so you know they were still running 408 00:13:46,560 --> 00:13:50,820 on CentOS 6 so we actually had to get 409 00:13:49,019 --> 00:13:52,620 set up with an AWS virtual machine 410 00:13:50,820 --> 00:13:54,720 running CentOS 6 so that we could get 411 00:13:52,620 --> 00:13:56,339 that environment set up and then we 412 00:13:54,720 --> 00:13:58,019 could provide an installation procedure 413 00:13:56,339 --> 00:14:00,480 to the IT department so that they could 414 00:13:58,019 --> 00:14:03,959 get all of our stuff working we were 415 00:14:00,480 --> 00:14:05,519 running uh Postgres 9.6 for the back 416 00:14:03,959 --> 00:14:07,560 end because we ended up using JSON 417 00:14:05,519 --> 00:14:09,420 fields a lot ask me about how that 418 00:14:07,560 --> 00:14:12,839 went it's actually it's awesome it's a 419 00:14:09,420 --> 00:14:14,339 success story the IT department asked us 420 00:14:12,839 --> 00:14:16,019 if we could please just use system 421 00:14:14,339 --> 00:14:18,720 Python rather than installing something 422 00:14:16,019 --> 00:14:21,360 so that meant Python 3.4 and we were 423 00:14:18,720 --> 00:14:23,700 using Django one we were also running 424 00:14:21,360 --> 00:14:25,740 RabbitMQ and Celery so we had a task 425 00:14:23,700 --> 00:14:27,779 queue to offload the longer running data 426 00:14:25,740 --> 00:14:29,220 processing and validation tasks when we 427 00:14:27,779 --> 00:14:31,200 were accepting the user data 
when we were 428 00:14:29,220 --> 00:14:32,399 accepting the uploads 429 00:14:31,200 --> 00:14:33,720 um and I don't want to dwell on this too 430 00:14:32,399 --> 00:14:35,220 much but this was really hard and it was 431 00:14:33,720 --> 00:14:36,660 a lot of work like I have a PhD in 432 00:14:35,220 --> 00:14:39,240 genomics like I didn't even really know 433 00:14:36,660 --> 00:14:41,760 what DevOps was and here we are like 434 00:14:39,240 --> 00:14:44,040 trying to do this but you know we did 435 00:14:41,760 --> 00:14:46,500 it we did the thing and so this is what 436 00:14:44,040 --> 00:14:49,199 version 1.0 looked like 437 00:14:46,500 --> 00:14:51,240 um so here we have uh you know a data 438 00:14:49,199 --> 00:14:54,180 table that was based on data that was 439 00:14:51,240 --> 00:14:56,160 uploaded by the user 440 00:14:54,180 --> 00:14:57,779 um we had automatically generated 441 00:14:56,160 --> 00:14:59,100 accession numbers you know we have like 442 00:14:57,779 --> 00:15:00,480 a title and abstract and some other 443 00:14:59,100 --> 00:15:03,420 stuff that you can see some other you 444 00:15:00,480 --> 00:15:05,579 know textual metadata uh we had linkouts 445 00:15:03,420 --> 00:15:07,139 to uh an external visualization tool 446 00:15:05,579 --> 00:15:08,639 that would make those heat maps like I 447 00:15:07,139 --> 00:15:11,279 showed you with some extra annotations 448 00:15:08,639 --> 00:15:13,560 and that could download the data from 449 00:15:11,279 --> 00:15:15,180 the server produce those images and then 450 00:15:13,560 --> 00:15:17,639 give you the sort of interactive plot 451 00:15:15,180 --> 00:15:20,459 and that was hosted um by 452 00:15:17,639 --> 00:15:22,380 another collaborating institution 453 00:15:20,459 --> 00:15:24,300 um we had download links so that people 454 00:15:22,380 --> 00:15:26,579 could get the data tables out 455 00:15:24,300 --> 00:15:28,199 and we had you know data licensing that 456 
00:15:26,579 --> 00:15:29,940 was selectable by the user so this kind 457 00:15:28,199 --> 00:15:32,279 of had all of the pieces that we needed 458 00:15:29,940 --> 00:15:34,620 like this is what we wanted to build and 459 00:15:32,279 --> 00:15:36,959 we were able to do it and it took us 460 00:15:34,620 --> 00:15:41,579 about a year so we went from that first 461 00:15:36,959 --> 00:15:44,160 commit uh in June of 2017 to uh the 462 00:15:41,579 --> 00:15:46,019 first deployment of the full thing we 463 00:15:44,160 --> 00:15:47,639 got a couple more little pots of money 464 00:15:46,019 --> 00:15:49,680 to keep things going because this is an 465 00:15:47,639 --> 00:15:51,779 academic project we wrote a paper we put 466 00:15:49,680 --> 00:15:55,560 it on the preprint server and then the 467 00:15:51,779 --> 00:15:59,579 paper got published in November of 2019 468 00:15:55,560 --> 00:16:01,019 and everybody lives happily ever after 469 00:15:59,579 --> 00:16:02,220 um so here's what we thought was going 470 00:16:01,019 --> 00:16:04,820 to happen so we thought we'd put the 471 00:16:02,220 --> 00:16:04,820 database up 472 00:16:05,639 --> 00:16:08,160 um 473 00:16:06,180 --> 00:16:09,600 and then users would come and they would 474 00:16:08,160 --> 00:16:12,839 upload the data 475 00:16:09,600 --> 00:16:14,220 and then we would you know maybe do some 476 00:16:12,839 --> 00:16:15,120 maintenance and like maybe spend 477 00:16:14,220 --> 00:16:18,180 some time building some cool 478 00:16:15,120 --> 00:16:20,880 visualizations and then we would sort of 479 00:16:18,180 --> 00:16:23,339 move on to other things 480 00:16:20,880 --> 00:16:24,899 and uh what actually happened was we 481 00:16:23,339 --> 00:16:27,180 launched the database and then we got 482 00:16:24,899 --> 00:16:29,220 this overwhelming and unexpected 483 00:16:27,180 --> 00:16:32,160 response from the clinical community 484 00:16:29,220 --> 00:16:34,139 because they are in the business of 
485 00:16:32,160 --> 00:16:35,639 trying to figure out what variants what 486 00:16:34,139 --> 00:16:37,920 genetic variants that are observed in 487 00:16:35,639 --> 00:16:40,320 patients do and we had a whole bunch of 488 00:16:37,920 --> 00:16:41,639 experimental data about all of these 489 00:16:40,320 --> 00:16:43,139 different changes that happen in 490 00:16:41,639 --> 00:16:44,399 proteins many of which are clinically 491 00:16:43,139 --> 00:16:46,199 relevant and they're like well we want 492 00:16:44,399 --> 00:16:48,540 to use that to interpret what's 493 00:16:46,199 --> 00:16:49,860 happening in patients and I'm not going 494 00:16:48,540 --> 00:16:52,139 to tell you anything else about how 495 00:16:49,860 --> 00:16:54,420 clinical geneticists use the 496 00:16:52,139 --> 00:16:55,920 data or how clinical genetics works but 497 00:16:54,420 --> 00:16:58,440 if you're interested please go check out 498 00:16:55,920 --> 00:17:01,079 David Warren's talk which I think is 499 00:16:58,440 --> 00:17:02,040 tomorrow at 11:20 a.m. in Hall B and 500 00:17:01,079 --> 00:17:03,600 you'll learn more about the 501 00:17:02,040 --> 00:17:07,140 bioinformatics of variant classification 502 00:17:03,600 --> 00:17:08,160 and see some more Django stuff 503 00:17:07,140 --> 00:17:09,900 um 504 00:17:08,160 --> 00:17:11,280 so we also didn't get as much of a 505 00:17:09,900 --> 00:17:12,240 response from the researchers as we 506 00:17:11,280 --> 00:17:14,400 thought that we were going to and I 507 00:17:12,240 --> 00:17:16,079 think a lot of you saw that coming so we 508 00:17:14,400 --> 00:17:18,780 ended up having to take on some major 509 00:17:16,079 --> 00:17:21,179 curation tasks to get all the 510 00:17:18,780 --> 00:17:23,160 old papers in in particular and people 511 00:17:21,179 --> 00:17:25,799 are much more keen to deposit their new 512 00:17:23,160 --> 00:17:27,179 stuff rather than their old stuff and so 513 00:17:25,799 --> 00:17:29,040
this meant that we had to keep working 514 00:17:27,179 --> 00:17:30,900 on the project and invest a lot of time 515 00:17:29,040 --> 00:17:32,460 in developing new features and things 516 00:17:30,900 --> 00:17:34,380 like that 517 00:17:32,460 --> 00:17:36,740 and so all of these things that I told 518 00:17:34,380 --> 00:17:39,660 you before were still true 519 00:17:36,740 --> 00:17:41,460 but what became clear from 520 00:17:39,660 --> 00:17:43,980 starting to work with the clinical 521 00:17:41,460 --> 00:17:46,799 folks was that we needed really really 522 00:17:43,980 --> 00:17:48,720 good APIs because we needed to make sure 523 00:17:46,799 --> 00:17:50,520 that once we had all this data in the 524 00:17:48,720 --> 00:17:52,799 central repository people did not want 525 00:17:50,520 --> 00:17:55,140 to come to our website to get it they 526 00:17:52,799 --> 00:17:57,720 wanted us to push it out to other 527 00:17:55,140 --> 00:18:00,299 biological data resources 528 00:17:57,720 --> 00:18:02,820 other front ends clinical information 529 00:18:00,299 --> 00:18:05,600 systems other places they wanted to be 530 00:18:02,820 --> 00:18:08,340 able to get it wholesale and this meant 531 00:18:05,600 --> 00:18:09,660 that we needed to have better API 532 00:18:08,340 --> 00:18:11,240 support than what we got from Django 533 00:18:09,660 --> 00:18:13,620 out of the box because we needed to 534 00:18:11,240 --> 00:18:16,380 customize the data models combine 535 00:18:13,620 --> 00:18:17,520 internal data models in various ways and 536 00:18:16,380 --> 00:18:19,620 we also needed to do things like 537 00:18:17,520 --> 00:18:21,179 maintain different API endpoints for 538 00:18:19,620 --> 00:18:24,000 various collaborators based on their 539 00:18:21,179 --> 00:18:25,799 specific needs 540 00:18:24,000 --> 00:18:27,360 and meanwhile there was sort of trouble 541 00:18:25,799 --> 00:18:28,740 brewing with the project that those of 542
00:18:27,360 --> 00:18:30,299 you who are thinking about dates and 543 00:18:28,740 --> 00:18:32,039 have been around Django for 544 00:18:30,299 --> 00:18:34,200 a long time probably saw coming which 545 00:18:32,039 --> 00:18:37,020 was that uh you know a few months after 546 00:18:34,200 --> 00:18:39,000 we published the paper Django 1.11 547 00:18:37,020 --> 00:18:41,460 was no longer getting any security 548 00:18:39,000 --> 00:18:43,200 updates at all and then a little bit 549 00:18:41,460 --> 00:18:45,179 after that you know we ran out of money 550 00:18:43,200 --> 00:18:47,100 and the developer who had done a lot of 551 00:18:45,179 --> 00:18:48,299 really great stuff had also learned 552 00:18:47,100 --> 00:18:49,559 everything that he could learn on this 553 00:18:48,299 --> 00:18:50,880 project and got a really great 554 00:18:49,559 --> 00:18:53,220 opportunity he decided he didn't want to 555 00:18:50,880 --> 00:18:55,080 do a PhD that he wanted to be a research 556 00:18:53,220 --> 00:18:56,940 software engineer and he got a great 557 00:18:55,080 --> 00:18:58,799 opportunity to go join a team and do 558 00:18:56,940 --> 00:19:01,679 some really cool stuff and so he left 559 00:18:58,799 --> 00:19:04,380 the project so now it's just me 560 00:19:01,679 --> 00:19:06,539 and Django version 1.11 and a live 561 00:19:04,380 --> 00:19:07,919 website that's running in production the 562 00:19:06,539 --> 00:19:09,720 guy who knew how everything worked is 563 00:19:07,919 --> 00:19:11,100 gone we could hire him back as a 564 00:19:09,720 --> 00:19:13,679 consultant a little bit but that's a 565 00:19:11,100 --> 00:19:16,080 very finite resource the GitHub security 566 00:19:13,679 --> 00:19:17,940 notifications kept getting spicier 567 00:19:16,080 --> 00:19:20,039 and we were trying to figure out what to 568 00:19:17,940 --> 00:19:21,240 do and so obviously this is not going to 569 00:19:20,039 --> 00:19:24,960 work so we need to do something
that's 570 00:19:21,240 --> 00:19:27,240 not just maintaining the old Django site 571 00:19:24,960 --> 00:19:28,980 so we could migrate to Django 2.2 572 00:19:27,240 --> 00:19:30,840 long-term support but then we would just 573 00:19:28,980 --> 00:19:31,740 have this problem again later on before 574 00:19:30,840 --> 00:19:33,900 too long 575 00:19:31,740 --> 00:19:36,120 we could move to Django 3 but all the 576 00:19:33,900 --> 00:19:37,559 information we could find said this 577 00:19:36,120 --> 00:19:39,059 would be a lot of work and it was kind 578 00:19:37,559 --> 00:19:40,679 of new and scary and it also wasn't 579 00:19:39,059 --> 00:19:42,840 clear that we would get the kind of API 580 00:19:40,679 --> 00:19:45,059 support that we needed which again was a 581 00:19:42,840 --> 00:19:47,700 requirement that we didn't know that we 582 00:19:45,059 --> 00:19:50,760 had when we started the project but now 583 00:19:47,700 --> 00:19:53,280 was kind of our major requirement or we 584 00:19:50,760 --> 00:19:54,900 could try something new whatever that is 585 00:19:53,280 --> 00:19:56,520 now fortunately things worked out 586 00:19:54,900 --> 00:19:58,860 because while we were deciding what to 587 00:19:56,520 --> 00:20:00,840 do and doing a lot of this 588 00:19:58,860 --> 00:20:02,640 additional data curation work we ended 589 00:20:00,840 --> 00:20:05,039 up being successful in some major grant 590 00:20:02,640 --> 00:20:06,179 funding rounds in the United States this 591 00:20:05,039 --> 00:20:07,500 allowed me to hire a new junior 592 00:20:06,179 --> 00:20:08,820 developer in Melbourne so I wasn't 593 00:20:07,500 --> 00:20:11,039 trying to do it all on my own anymore 594 00:20:08,820 --> 00:20:13,020 and also the host institution that was 595 00:20:11,039 --> 00:20:15,059 running the server finished their 596 00:20:13,020 --> 00:20:17,940 upgrades to CentOS 7 just a little bit 597 00:20:15,059 --> 00:20:19,799 after the end of CentOS 6 and we
598 00:20:17,940 --> 00:20:22,140 managed to get permission to run Docker 599 00:20:19,799 --> 00:20:23,640 Compose which really simplified our 600 00:20:22,140 --> 00:20:25,620 DevOps challenges and made it much 601 00:20:23,640 --> 00:20:28,440 easier to kind of get all of these 602 00:20:25,620 --> 00:20:30,299 things to play well together 603 00:20:28,440 --> 00:20:32,460 [Applause] 604 00:20:30,299 --> 00:20:34,140 uh and then we got some huge help when a 605 00:20:32,460 --> 00:20:35,460 senior developer in Seattle who was 606 00:20:34,140 --> 00:20:37,740 working with one of the collaborating 607 00:20:35,460 --> 00:20:40,260 labs was able to contribute some time to 608 00:20:37,740 --> 00:20:41,640 the project and help us figure out a 609 00:20:40,260 --> 00:20:44,580 path forward 610 00:20:41,640 --> 00:20:46,500 so now that we had this experience and 611 00:20:44,580 --> 00:20:48,600 we had some senior people who had their 612 00:20:46,500 --> 00:20:50,580 own preferences we were able to take a 613 00:20:48,600 --> 00:20:53,580 look at this again and what we decided 614 00:20:50,580 --> 00:20:54,900 to do was move from Django 1.11 to 615 00:20:53,580 --> 00:20:56,700 FastAPI 616 00:20:54,900 --> 00:20:58,620 and even though FastAPI is obviously 617 00:20:56,700 --> 00:21:00,720 very different from Django the fact that 618 00:20:58,620 --> 00:21:02,220 all the application-specific code that 619 00:21:00,720 --> 00:21:04,020 we wrote for data set validation 620 00:21:02,220 --> 00:21:06,179 handling genomic information all of 621 00:21:04,020 --> 00:21:08,880 these other tasks was written in Python 622 00:21:06,179 --> 00:21:10,860 and using pandas and other libraries 623 00:21:08,880 --> 00:21:12,780 meant that most of what we had done we 624 00:21:10,860 --> 00:21:14,700 could just bring straight over and we 625 00:21:12,780 --> 00:21:16,679 were also even able to use the same 626 00:21:14,700 --> 00:21:18,660 Postgres database that we had we didn't
627 00:21:16,679 --> 00:21:20,580 have to really I mean obviously we had 628 00:21:18,660 --> 00:21:22,500 to do some migrations but that 629 00:21:20,580 --> 00:21:26,160 really pretty much got to stay the same 630 00:21:22,500 --> 00:21:28,500 we just shifted what was on top of it 631 00:21:26,160 --> 00:21:30,059 so now as of this sort of new 632 00:21:28,500 --> 00:21:31,620 version here's what the technology stack 633 00:21:30,059 --> 00:21:33,059 looks like we've gone from you know 634 00:21:31,620 --> 00:21:34,740 having to figure out how to get 635 00:21:33,059 --> 00:21:37,140 everything working under a specific 636 00:21:34,740 --> 00:21:39,419 version of CentOS to doing things in 637 00:21:37,140 --> 00:21:41,340 Docker Compose which is way nicer both 638 00:21:39,419 --> 00:21:43,320 for you know deploying it and also for 639 00:21:41,340 --> 00:21:44,580 development it's much easier to set up a 640 00:21:43,320 --> 00:21:46,559 local development 641 00:21:44,580 --> 00:21:48,480 environment now we upgraded the version 642 00:21:46,559 --> 00:21:50,760 of PostgreSQL that we were using which 643 00:21:48,480 --> 00:21:52,320 is great because now our awesome JSON 644 00:21:50,760 --> 00:21:54,179 fields are even more awesome and 645 00:21:52,320 --> 00:21:57,000 more performant we got to move from 646 00:21:54,179 --> 00:22:00,419 Python 3.4 to 3.9 which made me very 647 00:21:57,000 --> 00:22:02,880 happy and then we also switched from 648 00:22:00,419 --> 00:22:07,460 using Django to using a combination of 649 00:22:02,880 --> 00:22:07,460 FastAPI and Vue 650 00:22:07,919 --> 00:22:12,059 and so this really took a lot of work 651 00:22:09,840 --> 00:22:13,980 but we're now basically back to where 652 00:22:12,059 --> 00:22:15,840 the Django site was but we rebuilt it in 653 00:22:13,980 --> 00:22:18,659 the new framework and our hope is that 654 00:22:15,840 --> 00:22:21,720 we've sort of built kind of a new floor 655
00:22:18,659 --> 00:22:23,220 um where we were 656 00:22:21,720 --> 00:22:25,140 before and now we can start adding some 657 00:22:23,220 --> 00:22:27,299 new features so time will tell and this 658 00:22:25,140 --> 00:22:28,799 took about a year so we started in you 659 00:22:27,299 --> 00:22:31,140 know the first commit in the new version 660 00:22:28,799 --> 00:22:33,240 was in April of 2022 and then in April 661 00:22:31,140 --> 00:22:36,659 of this year we switched over from the 662 00:22:33,240 --> 00:22:37,980 Django version to the new version 663 00:22:36,659 --> 00:22:40,080 um 664 00:22:37,980 --> 00:22:42,059 and so I think the moral of the story is 665 00:22:40,080 --> 00:22:43,620 this which has been said by a lot 666 00:22:42,059 --> 00:22:45,059 of other folks in different contexts 667 00:22:43,620 --> 00:22:46,320 which is that it's impossible to know 668 00:22:45,059 --> 00:22:48,179 what to build until you've tried to 669 00:22:46,320 --> 00:22:49,980 build something or build something else 670 00:22:48,179 --> 00:22:51,480 first and it's really held true for all 671 00:22:49,980 --> 00:22:53,520 of the software projects that I've done 672 00:22:51,480 --> 00:22:54,299 as a researcher 673 00:22:53,520 --> 00:22:57,360 um 674 00:22:54,299 --> 00:22:59,400 and I can confidently say that uh 675 00:22:57,360 --> 00:23:01,380 you know having a framework like Django 676 00:22:59,400 --> 00:23:03,120 available that we could pick up and use 677 00:23:01,380 --> 00:23:04,980 to get something built was absolutely 678 00:23:03,120 --> 00:23:06,600 essential for us to succeed in this 679 00:23:04,980 --> 00:23:08,820 project and while we were getting 680 00:23:06,600 --> 00:23:10,679 started we absolutely did not have the 681 00:23:08,820 --> 00:23:12,179 capacity or the funding to take on a 682 00:23:10,679 --> 00:23:14,760 project that required expertise in two 683 00:23:12,179 --> 00:23:17,940 programming languages 684
00:23:14,760 --> 00:23:19,679 um we didn't have anywhere near a clear 685 00:23:17,940 --> 00:23:21,840 enough idea of what we needed to build 686 00:23:19,679 --> 00:23:24,780 in order to take advantage of a sort of 687 00:23:21,840 --> 00:23:27,240 more flexible and open-ended kind 688 00:23:24,780 --> 00:23:29,400 of development environment like 689 00:23:27,240 --> 00:23:30,840 we're getting with FastAPI and 690 00:23:29,400 --> 00:23:32,880 Vue 691 00:23:30,840 --> 00:23:34,620 um and most importantly I think no 692 00:23:32,880 --> 00:23:36,360 amount of us thinking hard about the 693 00:23:34,620 --> 00:23:38,940 problem would have gotten us to the 694 00:23:36,360 --> 00:23:40,799 point that we needed to get to because 695 00:23:38,940 --> 00:23:43,200 we needed the engagement from external 696 00:23:40,799 --> 00:23:44,940 parties with different needs and 697 00:23:43,200 --> 00:23:48,140 different backgrounds and we could only 698 00:23:44,940 --> 00:23:50,880 get that by deciding to build something 699 00:23:48,140 --> 00:23:53,700 building it and putting it out into the 700 00:23:50,880 --> 00:23:56,520 world and then seeing what happened 701 00:23:53,700 --> 00:23:58,260 um and so you know we would not have had 702 00:23:56,520 --> 00:24:01,740 success if we had started with the 703 00:23:58,260 --> 00:24:03,480 framework that we're using now and so we 704 00:24:01,740 --> 00:24:05,159 think that this project despite the fact 705 00:24:03,480 --> 00:24:06,659 that we're not currently using Django is 706 00:24:05,159 --> 00:24:08,059 a Django success story and I hope that 707 00:24:06,659 --> 00:24:11,159 you'll agree that that's the case 708 00:24:08,059 --> 00:24:15,299 because we never would have gotten to 709 00:24:11,159 --> 00:24:18,299 where we are without Django and I would 710 00:24:15,299 --> 00:24:20,159 really like to thank DjangoCon and 711 00:24:18,299 --> 00:24:22,679 this audience for giving me the 712
00:24:20,159 --> 00:24:25,020 opportunity to share it with you 713 00:24:22,679 --> 00:24:26,700 um and with that here are some links 714 00:24:25,020 --> 00:24:28,559 and references if you want the slides 715 00:24:26,700 --> 00:24:29,940 I'll figure out how to give you the 716 00:24:28,559 --> 00:24:32,760 slides 717 00:24:29,940 --> 00:24:34,440 um there's some links to the 718 00:24:32,760 --> 00:24:36,840 papers that were linked in there and 719 00:24:34,440 --> 00:24:38,159 then also the website itself and our 720 00:24:36,840 --> 00:24:38,820 GitHub 721 00:24:38,159 --> 00:24:41,400 um 722 00:24:38,820 --> 00:24:44,280 and I'll also add that if anybody wants 723 00:24:41,400 --> 00:24:46,080 papers you can just email the 724 00:24:44,280 --> 00:24:47,700 author that's listed and we can send it 725 00:24:46,080 --> 00:24:50,700 to you for free and we love to do that 726 00:24:47,700 --> 00:24:53,100 so if you want biological literature 727 00:24:50,700 --> 00:24:54,780 just ask the scientists and with 728 00:24:53,100 --> 00:24:57,620 that I would be very happy to take any 729 00:24:54,780 --> 00:24:57,620 questions if there's time 730 00:25:04,020 --> 00:25:08,760 thank you so much for that Alan we have the 731 00:25:06,840 --> 00:25:10,980 Discord where we will take questions I'm 732 00:25:08,760 --> 00:25:13,380 happy to read those out you can do it on 733 00:25:10,980 --> 00:25:15,900 mobile and stuff but it's been a couple 734 00:25:13,380 --> 00:25:18,840 of years so now I need to remember how 735 00:25:15,900 --> 00:25:22,260 fast I can run to get Russell the first 736 00:25:18,840 --> 00:25:24,659 question because of course it is 737 00:25:22,260 --> 00:25:26,700 hey we're back 738 00:25:24,659 --> 00:25:28,380 is it reporting you 739 00:25:26,700 --> 00:25:30,539 um thanks for the talk this is fantastic 740 00:25:28,380 --> 00:25:31,919 and I'd love to hear success stories of 741 00:25:30,539 --> 00:25:34,740 Django anytime 742
00:25:31,919 --> 00:25:39,179 um could you speak to what features of 743 00:25:34,740 --> 00:25:41,400 FastAPI led to that transition over and 744 00:25:39,179 --> 00:25:44,460 if they are things that Django as a 745 00:25:41,400 --> 00:25:46,200 project as a community could adopt to 746 00:25:44,460 --> 00:25:49,260 like not that you made the wrong choice 747 00:25:46,200 --> 00:25:51,000 but to prevent other people making 748 00:25:49,260 --> 00:25:52,620 that choice in the future to 749 00:25:51,000 --> 00:25:53,820 make Django more appealing to the 750 00:25:52,620 --> 00:25:55,919 demographic that we're not currently 751 00:25:53,820 --> 00:25:57,059 satisfying sure 752 00:25:55,919 --> 00:26:00,360 um 753 00:25:57,059 --> 00:26:03,299 so I think that our 754 00:26:00,360 --> 00:26:05,940 experience in Django 1 was that 755 00:26:03,299 --> 00:26:07,679 we got a REST API out of the box but 756 00:26:05,940 --> 00:26:10,860 that was essentially exposing the 757 00:26:07,679 --> 00:26:13,440 internal data models and so 758 00:26:10,860 --> 00:26:15,120 changing the API behavior then would 759 00:26:13,440 --> 00:26:17,279 either require us to write a whole bunch 760 00:26:15,120 --> 00:26:19,320 of code to kind of build a custom API 761 00:26:17,279 --> 00:26:23,820 which we didn't really want to do 762 00:26:19,320 --> 00:26:26,100 or it would require us to be able to 763 00:26:23,820 --> 00:26:28,500 um go back and refactor those basic 764 00:26:26,100 --> 00:26:31,039 models which I think would have been 765 00:26:28,500 --> 00:26:35,220 also not feasible 766 00:26:31,039 --> 00:26:37,020 so by using FastAPI and Pydantic and 767 00:26:35,220 --> 00:26:39,659 the sort of you know models versus view 768 00:26:37,020 --> 00:26:41,700 models we can build view models which 769 00:26:39,659 --> 00:26:43,020 are relatively easy to consume we can 770 00:26:41,700 --> 00:26:46,580 have a lot of different flavors of
view 771 00:26:43,020 --> 00:26:50,600 models and then build the API using that 772 00:26:46,580 --> 00:26:52,980 and that was really kind of the 773 00:26:50,600 --> 00:26:54,720 main technical advantage that 774 00:26:52,980 --> 00:26:56,520 we were getting was that ability to 775 00:26:54,720 --> 00:26:59,279 put sort of whatever it is that we 776 00:26:56,520 --> 00:27:01,799 needed in those view models and have it 777 00:26:59,279 --> 00:27:03,659 pretty much just work and even build in 778 00:27:01,799 --> 00:27:05,940 some behavior and other stuff in those 779 00:27:03,659 --> 00:27:07,200 it's also nice I mean one of the 780 00:27:05,940 --> 00:27:09,840 things that I didn't mention is we built 781 00:27:07,200 --> 00:27:11,760 like sort of an SDK kind of thing so 782 00:27:09,840 --> 00:27:13,440 that people can just write Python code 783 00:27:11,760 --> 00:27:15,720 and be able to upload and download data 784 00:27:13,440 --> 00:27:17,460 sets so people who are you know 785 00:27:15,720 --> 00:27:19,860 sort of like I did a Python boot camp 786 00:27:17,460 --> 00:27:21,360 and you know data scientists like wet 787 00:27:19,860 --> 00:27:22,380 lab people 788 00:27:21,360 --> 00:27:24,419 um 789 00:27:22,380 --> 00:27:25,740 and so it's nice 790 00:27:24,419 --> 00:27:28,260 we actually import the view 791 00:27:25,740 --> 00:27:29,700 models into that to do local 792 00:27:28,260 --> 00:27:31,440 validation and so there's some kind of 793 00:27:29,700 --> 00:27:33,720 other nice features there so 794 00:27:31,440 --> 00:27:37,159 that's I think the main thing 795 00:27:33,720 --> 00:27:37,159 you have time for one more question 796 00:27:43,500 --> 00:27:48,179 um is the experimental data 797 00:27:46,440 --> 00:27:49,860 consistent like is there like a 798 00:27:48,179 --> 00:27:53,820 well-established file format for the 799 00:27:49,860 --> 00:27:54,960 data etc. not at
all not even close uh 800 00:27:53,820 --> 00:27:58,140 one of the cool things about this 801 00:27:54,960 --> 00:28:00,360 technology is that it's a very 802 00:27:58,140 --> 00:28:02,159 versatile technology and anyone who's 803 00:28:00,360 --> 00:28:03,179 interested in studying proteins whether 804 00:28:02,159 --> 00:28:04,860 that's someone who's interested in 805 00:28:03,179 --> 00:28:06,299 protein engineering or someone who's 806 00:28:04,860 --> 00:28:07,440 interested in clinical genetics or 807 00:28:06,299 --> 00:28:09,840 someone who's interested in evolution 808 00:28:07,440 --> 00:28:11,100 they're all converging on using this 809 00:28:09,840 --> 00:28:12,840 technology and then they're publishing 810 00:28:11,100 --> 00:28:14,760 their papers and so they're all coming 811 00:28:12,840 --> 00:28:17,820 from different fields with different 812 00:28:14,760 --> 00:28:19,559 backgrounds and different sort of data 813 00:28:17,820 --> 00:28:22,159 conventions and data sharing conventions 814 00:28:19,559 --> 00:28:24,840 and so it is a hot mess 815 00:28:22,159 --> 00:28:27,000 and that's why it's even more important 816 00:28:24,840 --> 00:28:28,980 to have a database that can put all this 817 00:28:27,000 --> 00:28:31,440 stuff in one place so that then the 818 00:28:28,980 --> 00:28:33,600 computer scientists can go there and not 819 00:28:31,440 --> 00:28:36,779 have to learn all this different biology 820 00:28:33,600 --> 00:28:40,320 to you know get what they want 821 00:28:36,779 --> 00:28:43,820 so and with that we're out of time we 822 00:28:40,320 --> 00:28:43,820 would like to thank Alan again 823 00:28:45,490 --> 00:28:48,869 [Applause]
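
[Editor's sketch, not part of the talk] The "models versus view models" idea from the first answer can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions: the talk names FastAPI and Pydantic, but to keep the example dependency-free it uses standard-library dataclasses as a stand-in, and every class, field, and function name here (DatasetRecord, DatasetSummaryView, DatasetFullView, to_json) is hypothetical rather than taken from the project's code.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class DatasetRecord:
    """Hypothetical internal model, standing in for a database row."""
    id: int
    title: str
    target_gene: str
    uploader_email: str  # internal detail that should never reach API clients
    scores: dict         # variant name -> measured score


@dataclass
class DatasetSummaryView:
    """Small view model, e.g. for listings or lightweight front ends."""
    title: str
    target_gene: str
    variant_count: int

    @classmethod
    def from_record(cls, record: DatasetRecord) -> "DatasetSummaryView":
        return cls(
            title=record.title,
            target_gene=record.target_gene,
            variant_count=len(record.scores),
        )


@dataclass
class DatasetFullView:
    """Larger view model, e.g. for a collaborator who wants the data wholesale."""
    title: str
    target_gene: str
    scores: dict

    @classmethod
    def from_record(cls, record: DatasetRecord) -> "DatasetFullView":
        # Copy the scores so the view cannot mutate the internal record.
        return cls(
            title=record.title,
            target_gene=record.target_gene,
            scores=dict(record.scores),
        )


def to_json(view) -> str:
    """Serialize any view model; the internal record itself is never serialized."""
    return json.dumps(asdict(view), sort_keys=True)


if __name__ == "__main__":
    record = DatasetRecord(
        id=1,
        title="Example dataset",
        target_gene="GENE1",
        uploader_email="someone@example.org",
        scores={"p.Ala1Gly": 0.5, "p.Ala1Val": -1.2},
    )
    print(to_json(DatasetSummaryView.from_record(record)))
    print(to_json(DatasetFullView.from_record(record)))
```

The point of the pattern, as described in the answer, is that each API flavor gets its own outward-facing shape while the internal model stays untouched. In a FastAPI application the view classes would presumably be Pydantic models returned from endpoints, and the same classes could be imported by a client library for local validation, but that wiring is assumed here, not shown in the talk.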