Logs for #trilema-mod6


This is the log of #trilema-mod6 : Contained within are the daily on-goings of The Bitcoin Foundation's (http://thebitcoin.foundation) Reference Implementation (trb) development. Do not, under any circumstances, use any code/vpatch/logic/configurations/ideas or otherwise found in this forum. In fact, you should NOT read this at all. Failure to heed these warnings may result in death or dismemberment! ALL ANNOUNCEMENTS, BUSINESS, AND/OR QUESTIONS INVOLVING THE BITCOIN FOUNDATION AND/OR THE BITCOIN REFERENCE IMPLEMENTATION (TRB) WILL TAKE PLACE IN #TRILEMA: YOU HAVE BEEN WARNED.

2018-2-28 | 2018-4-1

mod6: wb!
ben_vulpes: ohai
mod6: hey!
mod6: just a bit of a recap here...
trinque: something about the invite seems to kick me when I reconnect
trinque: hola!
mod6: huh, ok. sorry bout that.
mod6: I just wanted to get the four of us in here for a minute, and it's all logged if others wanna read. but wanted to get the take on this from you all...
mod6: so for about 5 solid weeks now, my node has been all sync'd up and i've been checking its height & connections on an hourly basis...
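A minimal sketch of that hourly check, assuming a trb bitcoind that answers `getinfo` over RPC (the loop and grep here are illustrative, not the actual script used):

    # poll block height and connection count once an hour
    while true; do
        LC_ALL=C ./bitcoind getinfo | grep -E '"(blocks|connections)"'
        sleep 3600
    done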
mod6: and then let's say 2-3 days ago, it fell about 50 blocks behind and refused to catch back up after 24 hours. so I shut it down, then started it back up in -connect mode to just ben's mimi.
mod6: got sync'd up again very fast.
mod6: then restarted it again in -addnode mode, adding all of the republican IPs.
mod6: everything was dandy for about a day. maybe 30 hours.
mod6: which leads me to this morning.
mod6: again i found it like 40 blocks behind, and not only behind, but not even responding to RPC commands issued to it.
mod6: when i'd do a `LC_ALL=C ./bitcoind getinfo` it would just hang there -- i would see it ack'd in the debug.log, but nothin.
mod6: I let it run all day like that, just sitting there.
mod6: came back, still nothing. wouldn't even respond to a `LC_ALL=C ./bitcoind exit`
mod6: for the first time maybe ever, I had to actually kill the damn thing.
mod6: i will say, lemme go back for a moment; I've only had this problem while in -addnode mode. when i'm sync'ing up in -connect, never seen this problem.
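The two modes contrasted here differ in how peers are chosen: -connect restricts the node to exactly the listed peer(s), while -addnode merely adds the listed peers on top of ordinary peer selection. Roughly (IPs are placeholders):

    # sync strictly from one trusted peer, no other connections attempted
    LC_ALL=C ./bitcoind -connect=<mimi_ip> &

    # normal operation, with the republican nodes added to the usual peering
    LC_ALL=C ./bitcoind -addnode=<ip1> -addnode=<ip2> -addnode=<ip3> &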
mod6: here is my debug log from today: http://p.bvulpes.com/pastes/FO8dJ/?raw=true
mod6: I'd like to say that I'm running v0.5.4 Release + the DB timings + Aggressive PushGetBlocks (asciilifeform's)
mod6: So i'm trying to figure out what might be happening here; it does seem to get into this weird state where it's only handling connections to trb, doing naught else.
mod6: asciilifeform: any thoughts here?
mod6: ive already spoken to ben_vulpes briefly about this, he said to try his add-ons for 'max aggression', which i'm fine to add in.
mod6: just wondering if any of you guys notice anything weird in the log posted above. you'll see in there, when it was started up, it was 2 blocks behind, got those blocks, then promptly ate shit.
mod6: (for the time being, my node is down - it hath been killed.)
asciilifeform reads..
asciilifeform: mod6: you do not have timestamps in the log... how long of a time did this log span ?
mod6: sorry, it was from 7am CDT until about an hour ago... figure ~10.5-11 hours
asciilifeform: mod6: and for how long did you wait for your rpc command , before deciding that the thing is perma-wedged ?
mod6: 10 hours or so
asciilifeform: i.e. it sat overnight , and in the morning was still hanging ?
mod6: if you search for the 'getinfo' command, you can see it's somewhere near the first 1/3 of the file.
mod6: ok.
mod6: so this is what happened.
mod6: woke up earlier than usual this morning @ 6am. checked my node, was ~40 blocks behind (or so)
mod6: stopped the node.
mod6: started re-syncing up in -connect=<mimi_ip> mode
mod6: once it was all the way up. stopped it again.
mod6: while it was stopping 2 more blocks came in.
mod6: put it back in -addnode mode, and then let it go. it got 2 more blocks (catching all the way up) 513344 was best height at the time.
asciilifeform: mod6: describe plz the machine. in particular , the disk.
mod6: then as soon as it was caught up (best of my memory) i issued the 'getinfo' command, which just hung there.
mod6: and i let it hang there like that all day while I was at the salt mine.
mod6: finally killed it off when i got back.
mod6: I couldn't even shut down the node.
mod6: asciilifeform: ok the machine is a dedicated box in a dc -- iirc it's 16 CPU / 48 GB RAM; the disk is a 1 TB SSD.
mod6: I can pull that info from dmesg for you, just a sec.
asciilifeform: this is interesting
asciilifeform: and also interesting is the fact that the log does not have appearance of wedgedness around the time of either 'getinfo'
asciilifeform: each should have returned
asciilifeform: thing was happily processing mempool crapola
mod6: http://p.bvulpes.com/pastes/2YvW8/?raw=true
asciilifeform: mod6: when you worked on the box at the time of the wedge/restart, did you notice any palpable delay in the runtime of commands which make use of the disk ?
mod6: no, none at all.
mod6: was highly responsive. box load is ~1.0 when trb is running, otherwise ~nil. that's all this box does
asciilifeform: i ask because the flavour is reminiscent of what i saw on a box with failing ssd, just last week
mod6: ahh.
mod6: i've got those timings in there, and they seem about par
asciilifeform: did not affect load metrics at all. but sata packets were taking 10x as long as they ought to. without , again, generating any written eggog anywhere. at first.
mod6 looks for SetBestChain
mod6: 6k ms
mod6: 5k ms
asciilifeform: this is where ?
mod6: on ProcessBlock. search for 'SetBestChain'.
asciilifeform: aa
asciilifeform: i see it
mod6: oh, well, ProcessBlock took 5218ms; db write wait: 856ms
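The timing lines being read off here can be pulled out of the pasted debug.log with a plain grep; a small sketch (the exact line format is whatever the DB-timings patch emits):

    # most recent best-chain updates and block-processing timings
    grep -n 'SetBestChain' debug.log | tail -5
    grep -n 'ProcessBlock' debug.log | tail -5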
mod6: they used remote hands to put the drive in for me... about 6 weeks ago. as far as I know, the drive is new.
asciilifeform: the oddity so far is : 1) neither getinfo is surrounded by anything resembling a typical wedge log ( dropped connections etc )
asciilifeform: 2) end of log shows 'schoolbook' wedge, but no block reception trigger
mod6: yeah, and #2 is what i was seeing in the AM before I restarted it. I have those logs too, I just didn't post everything because it was 54MB or something.
mod6: so I just cut out the part from startup->shutdown today as a conversation starter.
asciilifeform: it is a quite interesting log, mod6 , and i dearly wish that this would happen here, to me, where i can take process dump and reverse ; because it is impossible to say anything else from the log . but as it happens my node has been running continuously since i last patched it !
mod6: so this is the 3rd time. counting: 3-4 days ago, 24 hours ago, and this am.
mod6: *nod*
asciilifeform: ( and not fallen behind for more than a few hrs )
mod6: im willing to try stuff to find out. also willing to add in other patches to see if that helps.
asciilifeform: mod6: start with timestamps
asciilifeform: and ben_vulpes's remover of string truncation
mod6: ben_vulpes just posted that patch recently right?
asciilifeform: the latter imho really ought to go in the mainline asap. it is demonstrably harmless, and at the same time increases the usefulness of the log considerably
mod6: or was there another timestamp patch im forgetting? maybe there was one by mats or nubbins` or something
asciilifeform: now that i think about it, afaik there is not a timestamp patch. timestamps were introduced by trinque , but his method involves piping the log into a log daemon
mod6: ahh.
ben_vulpes: tis what i do as well
asciilifeform: however ben_vulpes most certainly did make a truncation remover. there is no reason we ought to ever see a 'received getdata for: block 00000000000000000958' ever again.
asciilifeform: it is an atrocity.
asciilifeform: relic from the days of the tard and his winblowz box
mod6: totally agree. ok i'll add ben's patch in first, recompile, then get set up with a log daemon to help with timestamps.
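Trinque's actual log daemon isn't shown here; a minimal stand-in for the timestamping part, assuming the default ~/.bitcoin/debug.log location, would be to follow the log and prefix each line:

    # follow debug.log and write a timestamped copy alongside it
    tail -F ~/.bitcoin/debug.log | while IFS= read -r line; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> ~/.bitcoin/debug.stamped.log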
mod6: might be a day or so before it's all back going again and i get more logs.
asciilifeform: do you know how to use gdb, mod6 ?
mod6: anyway, thanks a lot for looking through that.
mod6: yeah.
mod6: want me to set some watch points?
asciilifeform: see if you can run the thing inside a gdb; and dump the whole thing if/when it hangs.
asciilifeform: that way there is something to work with.
mod6: just do an 'st' on it or ?
mod6: i can always grab ya to for real-time help, but don't wanna bug ya too much -- extra instructions here are good.
asciilifeform: i'd like a process dump.
mod6: *too
asciilifeform: ( see if you can build with -g , i.e. with symbols, prior to this )
mod6: i've done that before, it shouldn't be an issue.
asciilifeform: ok
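A quick way to confirm the rebuilt bitcoind actually carries the -g symbols, whatever the exact trb make invocation used to pass them in:

    file ./bitcoind                                 # should report 'not stripped'
    gdb -batch -ex 'info line main' ./bitcoind      # resolves main to a source file:line only if debug info is present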
mod6: all i gotta do is just run the thing right, and then when it does that weird shit at the end:
mod6: `gdb --pid=<PID>`
mod6: then
mod6: (gdb) gcore
mod6: (gdb) detach
mod6: right?
mod6: i think the gdb 'gcore' command allows me to enter a filename. ok. will do just that. should have a corpus for you to review.
mod6: want me to also : 'set use-coredump-filter off' ?
mod6: and
mod6: 'set dump-excluded-mappings off' ?
mod6: "If 'on' is specified, GDB will dump memory mappings marked with the VM_DONTDUMP flag. This flag is represented in the file /proc/pid/smaps with the acronym 'dd'.
mod6: "
asciilifeform: btw you can dump core without having run under gdb, see for instance https://moythreads.com/wordpress/2011/10/06/linux-core-dumps/
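The non-gdb route from that link boils down to either using the standalone gcore tool that ships with gdb, or letting the kernel write the core itself:

    # one-shot dump of the live process, no interactive session; the process keeps running
    gcore -o bitcoind_wedged $(pidof bitcoind)   # writes bitcoind_wedged.<pid>

    # or, for kernel-written cores on crash/abort, unblock the size limit first
    ulimit -c unlimited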
asciilifeform: my current suspicion, mod6 , is that you've hit upon a rare locking bug in classical trb
asciilifeform: ( these will exist, in potentially unlimited supply , anywhere where a tard was allowed to write a multithreading-with-locks proggy )
mod6: *nod* yeah, that stuff is gnarly in there.
mod6: thanks for the link asciilifeform
asciilifeform: np
mod6: i'll start it up in gdb, then take a gcore from there -- at least will try that first.
asciilifeform: lemme know if it happens again. and post the dump.
mod6: hopefully it'll yield something that makes some sense.
mod6: ok, i've got my homework. i'll ping you fellas when i've got something to look at :]
mod6: Got ben's patch in, removing all of the substr truncations in the logging. Built bitcoind with debugging symbols.
mod6: Last night, I was running it in -connect mode so I could get caught up.
mod6: And I did that, fully syncd. Then I stopped the node.
mod6: And restarted with -addnode for all of the republican ips. I also have connected gdb to the PID of trb.
mod6: It ran overnight, and is still sync'd up. if/when we see more trouble like I did before, I'll grab a dump of what is happening via `gcore`.
mod6: Anyway, just an update. Thanks!
mod6: So far, since last night, it's been running & keeping up just fine. Will keep an eye on it.
mod6: Still all sync'd up, no problems yet.
mod6: Keeping an eye on it.
mod6: it's still sync'd up over here.
mod6: keeping an eye on it...
mod6: Mornin' ok, I fell 100 blocks behind. Looks like the same problem. I've gathered some info, as well as a core file.
mod6: I have a pressing chore here that I need to do, so I'm gonna do that, and when i'm done here, I'll be back to post this somewhere for viewing.
mod6: bbs
mod6: ok this core file is 270M. I think I'm gonna encrypt the thing to you guys.
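A sketch of that encryption step, assuming the intended readers' keys are already in the local keyring (key ids are placeholders):

    # encrypt the core to the intended readers, ascii-armored so it can be hosted as text
    gpg --armor --encrypt \
        --recipient <asciilifeform_key> --recipient <ben_vulpes_key> --recipient <trinque_key> \
        --output bitcoind_20180318_1.core.asc bitcoind_20180318_1.core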
mod6: http://www.mod6.net/trb-debug/20180318/bitcoind_20180318_1.core.asc
mod6: asciilifeform: later on, see history ^ http://logs.bvulpes.com/trilema-mod6
mod6: we're obv. busy with pizarro things atm, can wait.
asciilifeform: still catching up with log
mod6: np, take your time, pizarro stuff comes first.
asciilifeform: and still not home, 250M log will have to wait for tomorrow for me
asciilifeform: i'm on a shitmodem
mod6: oh, ok.
asciilifeform: promises to be interesting tho.
mod6: ummm..
mod6: whatever fuckery was going on in there seems to have figured itself out.
mod6: because I just checked now, and it's fully caught up again:
mod6: "blocks" : 514309,
mod6: I never did restart it. It's still running, even after i did the dump with gcore.
mod6: "connections" : 22,
mod6: I think it's still worth looking at the core file anyway. Just to see what was going on in there.
mod6: Either way, I'll be keeping a close eye on it.
asciilifeform: hey mod6 -- how long has the ssd in this box been in service ?
asciilifeform: ( or, what's the total sectors written, if you can get this number ? )
asciilifeform: e.g., smartctl -i /dev/sda
mod6 tries
mod6: http://p.bvulpes.com/pastes/9EdRN/?raw=true
mod6: oh, hang on. didn't know there was the -a flag.
mod6: one sec.
mod6: Here you go: http://p.bvulpes.com/pastes/fPp0P/?raw=true
asciilifeform looks
asciilifeform: hm this type dun report sectors-written
mod6: yeah, i see that.
asciilifeform: your reallocated sector count is nonzero tho, which for a modern disk generally means it is not long for this world
asciilifeform: and hence slow reads
asciilifeform: i saw a similar thing recently on a 4 y.o. 1TB ssd
mod6: Yeah, that really sucks. They threw this thing in my box ... think it was like end of January, or early Feb.
asciilifeform: right, 2916 hours
mod6: Must have been late Jan, I'd have to check my emails, but my guess is that they gave me a used drive.
asciilifeform: what can i say, apparently non-samsung ssd suxxx.
mod6: see that's 121 days.
mod6: my box only has 50 days of uptime since they threw it in there.
mod6: 18:43:38 up 50 days, 22:15, 7 users, load average: 0.02, 0.04, 0.00
mod6: fuckers fuckers fuckers fuckers
mod6: alright, well, is it worth taking a look at the core file anyway? I may learn something...
asciilifeform: for comparison , http://p.bvulpes.com/pastes/ocrjL/?raw=true << is a samsung ssd i installed on zoolag in august
mod6: if not, that's ok. i guess i'll let it ride for the time being... i may have to figure out a different plan for a node, or talk to that DC and see if they'll let me ship my own SSD to them.
mod6 looks
mod6: ah Total_LBAs_Written
mod6: ok.
asciilifeform: right
asciilifeform: the better ones, show this
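For reference, the checks discussed over the last few lines, using smartctl from smartmontools (which attributes a drive actually reports varies by vendor):

    smartctl -i /dev/sda        # identity: model, serial, firmware
    smartctl -a /dev/sda        # full SMART attribute table; the lines of interest here:
    #   Reallocated_Sector_Ct   nonzero generally means the drive is remapping failing sectors
    #   Power_On_Hours          2916 hours is roughly 121 days of powered-on time
    #   Total_LBAs_Written      total write volume, on drives that report it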
mod6: now, what about Western Digital Blue series ssds?
asciilifeform: can't say, haven't ever yet tried one
mod6: (I've never used the samsung ones before, read some bad things about them on amazon)
asciilifeform: there was at least 1 famously terrible batch
mod6: I typically have been buying the Kingston ones at work for my perf hardware, those have been solid.
mod6: But may not be exercising them as hard as trb.
mod6: so *shrug*
asciilifeform: i've had premature (i.e. <2yrs of service) failures from every type other than samsung, so far, of all tried to date (iirc 'crucial', 'kingston', 'sandisk', buncha lesser-known chinesium ones)
mod6: ah. ok. good to know.
asciilifeform: ( i don't buy intel ssd out of sheer principle )
mod6: i'll talk to them and see if they'll let me ship 'em a new samsung one and throw it in for me.
asciilifeform: oh forgot to include 1 that hasn't died yet, but not samsung : toshiba
mod6: probably won't, but if they don't wanna lose my business, they'll probably play ball.
asciilifeform: can't comment on how it stands up to trb load tho
asciilifeform: it's on a graphics box
mod6: ah toshiba. ok.
mod6: let me know if/when it flakes on you
asciilifeform: will do
mod6: cheers, I appreciate your help here.
asciilifeform: np
asciilifeform: sadly i dun have these in industrial quantities, or in anything like scientific conditions
asciilifeform: so nfi how much use is this info. but here it is.
mod6: yeah me either. thanks.
mod6: but hey, learn something new every day.
mod6: Hi guys
mod6: So seeing this issue again with TRB, I just did a `gcore <PID>` on it while the problem is active.
mod6: I'm letting it run to see if it works itself out or what happens here -- strangely enough, last time, it just unfucked itself and then kept going - and was caught up since the last time,
mod6: which was on 2018.03.18
mod6: this core file is 5.8Gb.
mod6: if you wanna look at it, let me know and I'll gpg gram it and throw it to whoever wants to see it.
mod6: speaking of which, the core file from last time is still posted, but I'm taking it down now.
trinque: hola mod6!
