Chia-blockchain: [BUG] Farmer faulted 2 times

0

Describe the bug
Farmer has been unstable and closes abruptly.

To Reproduce
Steps to reproduce the behavior:

  1. npm run electron &
  2. cat ~/.chia/.../log/debug.log

Expected behavior
Run indefinitely

Screenshots

Screenshot from 2020-10-29 19-37-44
Screenshot from 2020-10-30 14-18-27

Desktop (please complete the following information):

  • OS: Ubuntu Desktop
  • OS Version/Flavor: 18.04.2
  • CPU: TR 1950x


etemiz  ·  30 Oct 2020

Most helpful comment

2

I have been running farm-crash-test for 5 days straight with no seg fault on the harvester. Peak memory usage is 983MB for me, and current usage is 928MB. What is everyone seeing as peak and current memory usage for their harvester?

eFishCent  ·  12 Nov 2020

All comments

0

I want to confirm two things:

  1. This is Beta 17?
  2. You're using both a remote and local harvester?
hoffmang9  ·  30 Oct 2020
0

Yes, beta 17. Yes, I have 1 local harvester, 5 remote harvesters, and 1 local farmer.

etemiz  ·  30 Oct 2020
0

I hit a similar issue with my farmer crashing unexpectedly. I have two nodes, each running its own full node, farmer, and harvester. The full node and harvester continued to run, but the farmer on both machines died after several days of running fine.

The farmer seems to have died a while back and I didn't immediately notice. All messages from the farmer disappeared after the crash:

$ grep -c farmer .chia/beta-1.0b17/log/debug.log.*
.chia/beta-1.0b17/log/debug.log.1:0
.chia/beta-1.0b17/log/debug.log.2:0
.chia/beta-1.0b17/log/debug.log.3:0
.chia/beta-1.0b17/log/debug.log.4:14575
.chia/beta-1.0b17/log/debug.log.5:15057
.chia/beta-1.0b17/log/debug.log.6:11195

And here are the last mentions of the string farmer in debug.log.4 on the first machine:

05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('119.253.42.49', 60355)
05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('108.211.109.173', 8444)
05:43:48.511 farmer farmer_server              : INFO     -> header_signature to peer ('127.0.0.1', 8444)
05:43:48.512 farmer farmer_server              : INFO     <- respond_signature from peer ('127.0.0.1', 45136)
05:43:48.671 full_node full_node_server        : INFO     <- header_signature from peer ('127.0.0.1', 38466)
05:43:48.673 full_node full_node_server        : INFO     <- respond_unfinished_block from peer ('24.186.76.230', 8444)
05:43:48.676 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.6.133.56', 8444)
05:43:48.677 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.149.90.80', 8444)
05:43:48.678 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('173.28.194.176', 55229)
05:43:48.678 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('12.181.169.147', 39878)
05:43:50.118 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('12.181.169.147', 39878)
05:43:50.119 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('108.211.109.173', 8444)
05:43:50.120 full_node full_node_server        : INFO     <- ping from peer ('108.211.109.173', 8444)
05:43:50.121 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('78.183.106.140', 52854)
05:43:50.122 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('85.6.133.56', 8444)
05:43:50.123 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('24.186.76.230', 8444)
05:43:50.124 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('119.253.42.49', 60355)
05:43:50.125 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('108.211.109.173', 8444)
05:43:50.126 full_node full_node_server        : INFO     -> pong to peer ('108.211.109.173', 8444)
05:43:51.576 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:51.581 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.6.133.56', 8444)
05:43:51.582 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.149.90.80', 8444)
05:43:51.582 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('119.253.42.49', 60355)
05:43:51.583 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('78.183.106.140', 52854)
05:43:51.584 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('173.28.194.176', 55229)
05:43:51.586 full_node full_node_server        : INFO     -> request_unfinished_block to peer ('174.89.217.214', 8444)
05:43:52.074 harvester harvester_server        : INFO     Received EOF from ('127.0.0.1', 8447), closing connection.
05:43:52.078 daemon src.daemon.server          : INFO     Websocket exception. Closing websocket with chia_farmer code = 1006 (connection closed abnormally [internal]), no reason Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 827, in transfer_data
    message = await self.read_message()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 895, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 971, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 1047, in read_frame
    frame = await Frame.read(
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/framing.py", line 105, in read
    data = await reader(2)
  File "/usr/lib/python3.8/asyncio/streams.py", line 721, in readexactly
    raise exceptions.IncompleteReadError(incomplete, n)
asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of 2 expected bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/src/daemon/server.py", line 132, in safe_handle
    async for message in websocket:
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 439, in __aiter__
    yield await self.recv()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 509, in recv
    await self.ensure_open()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 803, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: code = 1006 (connection closed abnormally [internal]), no reason

05:43:52.280 full_node src.full_node.full_node : INFO     Block is slow, expected 10717 seconds, waiting
05:43:52.280 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:52.281 full_node src.full_node.full_node : INFO     VDF will finish too late 47324 seconds, so don't propagate
05:43:52.282 full_node src.full_node.full_node : INFO     VDF will finish too late 9565 seconds, so don't propagate
05:43:52.283 full_node full_node_server        : INFO     Received EOF from ('127.0.0.1', 38466), closing connection.

Same error message on the second machine, but it looks like it happened a couple of hours earlier there, at 03:33:42.197, alongside New best unfinished block at height 16125.

mmulsow  ·  31 Oct 2020
0

Found this in dmesg on both machines:

Machine 1:

[309239.876342] show_signal_msg: 18 callbacks suppressed
[309239.876345] chia_farmer[45874]: segfault at 0 ip 00007ffb795f3293 sp 00007ffd5307ab00 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7ffb795a1000+106000]
[309239.876354] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128625] chia_farmer[4407]: segfault at 0 ip 00007f9903ec58b8 sp 00007ffcabd63190 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f9903e73000+106000]
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00
mmulsow  ·  1 Nov 2020
0

I've had the same farming failure described in this issue. I had 3 machines on b15: Machine A ran the full stack, Machine B ran the full stack, and Machine C (a Pi) ran just a harvester pointed at Machine B. This setup ran perfectly for a long time. I upgraded all 3 machines to b17 and wanted to test the full stack on the Pi, so I ran the full stack there; the Pi's config, however, was still pointing at Machine B. Since then I've had several farming failures. I've shut down the Pi farmer, now have 2 machines running the full stack, and will monitor for failures.

ToovaTech  ·  1 Nov 2020
0

Same issue with my setup.
1 Farmer (running GUI)
Multiple Harvesters
All on B17

These are the last few dmesg entries from the farmer:
[182826.347693] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[193835.893863] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[212038.981592] perf: interrupt took too long (4901 > 4900), lowering kernel.perf_event_max_sample_rate to 40750
[239036.993424] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

lamiaceae2  ·  2 Nov 2020
0

For the latest run I started the farmer with `chia start farmer` instead of using the GUI. Same issue.

Not sure if it has any value, but these are the chia processes still running (plus the defunct farmer):

(venv) [email protected]:~$ ps -ef |grep chia
user1 1162542 1 0 13:15 pts/2 00:00:06 /home/user1/chia-blockchain/venv/bin/python3.8 /home/user1/chia-blockchain/venv/bin/chia run_daemon
user1 1162665 1162542 0 13:15 pts/2 00:01:02 chia_harvester
user1 1162666 1162542 10 13:15 pts/2 00:28:08 [chia_farmer] <defunct>
user1 1162667 1162542 6 13:15 pts/2 00:18:13 chia_full_node
user1 1162668 1162542 0 13:15 pts/2 00:00:43 chia_wallet

lamiaceae2  ·  2 Nov 2020
0

nm -D -n blspy.cpython-38-x86_64-linux-gnu.so with offset 0x52293 says it's in

00000000000520f0 T _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj
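(The offset is just the fault ip minus the blspy.so load address from the dmesg lines posted earlier; both machines land in this function:)

```python
# fault ip minus module base, from the dmesg output posted earlier
offset1 = 0x00007ffb795f3293 - 0x00007ffb795a1000   # machine 1
offset2 = 0x00007f9903ec58b8 - 0x00007f9903e73000   # machine 2

print(hex(offset1))   # 0x52293 = DeriveChildSk (0x520f0) + 419
print(hex(offset2))   # 0x528b8 = DeriveChildSk (0x520f0) + 1992
```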

gdb blspy.cpython-38-x86_64-linux-gnu.so

Dump of assembler code for function _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj:
DeriveChildSk 0x00000000000520f0 <+0>: push %r15
0x00000000000520f2 <+2>: push %r14
0x00000000000520f4 <+4>: push %r13
0x00000000000520f6 <+6>: push %r12
0x00000000000520f8 <+8>: mov %edx,%r12d
0x00000000000520fb <+11>: push %rbp
0x00000000000520fc <+12>: mov %rsi,%rbp
0x00000000000520ff <+15>: bswap %r12d
0x0000000000052102 <+18>: push %rbx
0x0000000000052103 <+19>: sub $0x1a8,%rsp
0x000000000005210a <+26>: mov 0x2b4dd7(%rip),%rbx # 0x306ee8
0x0000000000052111 <+33>: mov %rdi,0x38(%rsp)
0x0000000000052116 <+38>: mov $0x20,%edi
0x000000000005211b <+43>: callq *(%rbx)
0x000000000005211d <+45>: mov $0x4,%edi
0x0000000000052122 <+50>: mov %rax,0x40(%rsp)
0x0000000000052127 <+55>: callq *(%rbx)
0x0000000000052129 <+57>: mov $0x20,%edi
0x000000000005212e <+62>: mov %rax,%r15
0x0000000000052131 <+65>: mov %rax,(%rsp)
0x0000000000052135 <+69>: callq *(%rbx)
0x0000000000052137 <+71>: mov $0x20,%edi
0x000000000005213c <+76>: mov %rax,%r13
0x000000000005213f <+79>: callq *(%rbx)
0x0000000000052141 <+81>: mov $0x1fe0,%edi
0x0000000000052146 <+86>: mov %rax,%r14
0x0000000000052149 <+89>: mov %rax,0x10(%rsp)
0x000000000005214e <+94>: callq *(%rbx)
0x0000000000052150 <+96>: mov $0x1fe0,%edi
0x0000000000052155 <+101>: mov %rax,0x18(%rsp)
0x000000000005215a <+106>: callq *(%rbx)
0x000000000005215c <+108>: mov %r12d,(%r15)
0x000000000005215f <+111>: mov %r13,%rsi
0x0000000000052162 <+114>: mov %rbp,%rdi
0x0000000000052165 <+117>: mov %rax,0x20(%rsp)
0x000000000005216a <+122>: callq 0x25850 <[email protected]>
0x000000000005216f <+127>: lea 0x10(%r14),%rax
0x0000000000052173 <+131>: cmp %rax,%r13
0x0000000000052176 <+134>: jae 0x52185 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+149>
0x0000000000052178 <+136>: lea 0x10(%r13),%rax
0x000000000005217c <+140>: cmp %rax,%r14
0x000000000005217f <+143>: jb 0x52a88 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2456>
0x0000000000052185 <+149>: movdqu 0x0(%r13),%xmm1
0x000000000005218b <+155>: pcmpeqd %xmm0,%xmm0
0x000000000005218f <+159>: mov 0x10(%rsp),%rax
0x0000000000052194 <+164>: pxor %xmm0,%xmm1
0x0000000000052198 <+168>: movups %xmm1,(%rax)
0x000000000005219b <+171>: movdqu 0x10(%r13),%xmm3
0x00000000000521a1 <+177>: pxor %xmm3,%xmm0
0x00000000000521a5 <+181>: movups %xmm0,0x10(%rax)
0x00000000000521a9 <+185>: mov $0x20,%edi
0x00000000000521ae <+190>: mov $0x1,%ebp
0x00000000000521b3 <+195>: callq *(%rbx)
0x00000000000521b5 <+197>: mov (%rsp),%rcx
0x00000000000521b9 <+201>: mov $0x20,%edx
0x00000000000521be <+206>: mov %r13,%rsi
0x00000000000521c1 <+209>: mov $0x4,%r8d
0x00000000000521c7 <+215>: mov %rax,%rdi
0x00000000000521ca <+218>: mov %rax,%r14
0x00000000000521cd <+221>: callq 0x26310
0x00000000000521d2 <+226>: mov $0x20,%edi
0x00000000000521d7 <+231>: callq *(%rbx)
0x00000000000521d9 <+233>: mov $0x1,%edi
0x00000000000521de <+238>: mov %rax,%r15
0x00000000000521e1 <+241>: callq *(%rbx)
0x00000000000521e3 <+243>: mov $0x21,%edi
0x00000000000521e8 <+248>: mov %rax,0x30(%rsp)
0x00000000000521ed <+253>: callq *(%rbx)
0x00000000000521ef <+255>: mov %r13,0x50(%rsp)
0x00000000000521f4 <+260>: mov %rax,%r12
0x00000000000521f7 <+263>: xor %eax,%eax
0x00000000000521f9 <+265>: mov %r14,0x28(%rsp)
0x00000000000521fe <+270>: mov %rbp,%r14
0x0000000000052201 <+273>: mov %r12,%r13
0x0000000000052204 <+276>: mov %rax,%rdx
0x0000000000052207 <+279>: mov %r15,%r12
0x000000000005220a <+282>: jmp 0x52245 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+341>
0x000000000005220c <+284>: nopl 0x0(%rax)
0x0000000000052210 <+288>: and $0x4,%r15d
0x0000000000052214 <+292>: jne 0x52a70 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2432>
0x000000000005221a <+298>: test %edx,%edx
0x000000000005221c <+300>: je 0x5222f <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+319>
0x000000000005221e <+302>: movzbl (%r12),%ecx
0x0000000000052223 <+307>: mov %cl,0x0(%rbp)
0x0000000000052226 <+310>: test $0x2,%dl
0x0000000000052229 <+313>: jne 0x52ac8 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2520>
0x000000000005222f <+319>: cmp $0xff,%r14
0x0000000000052236 <+326>: je 0x52315 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+549>
0x000000000005223c <+332>: mov 0x48(%rsp),%rdx
0x0000000000052241 <+337>: add $0x1,%r14
0x0000000000052245 <+341>: mov $0x1fe0,%r15d
0x000000000005224b <+347>: mov $0x20,%ecx
0x0000000000052250 <+352>: sub %rdx,%r15
0x0000000000052253 <+355>: cmp $0x20,%r15
0x0000000000052257 <+359>: cmova %rcx,%r15
0x000000000005225b <+363>: lea (%r15,%rdx,1),%rax
0x000000000005225f <+367>: mov %rax,0x48(%rsp)
0x0000000000052264 <+372>: mov 0x18(%rsp),%rax
0x0000000000052269 <+377>: lea (%rax,%rdx,1),%rbp
0x000000000005226d <+381>: cmp $0x1,%r14
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968>
0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi
0x0000000000052284 <+404>: mov %r12,%rdi
0x0000000000052287 <+407>: movdqu (%r12),%xmm2
0x000000000005228d <+413>: mov $0x20,%r8d
SegFault 0x0000000000052293 <+419>: movups %xmm2,0x0(%r13)
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13)
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13)
md_hmac 0x00000000000522a8 <+440>: callq 0x26310
0x00000000000522ad <+445>: mov %r15d,%edx
0x00000000000522b0 <+448>: cmp $0x8,%r15d
0x00000000000522b4 <+452>: jb 0x52210 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+288>
0x00000000000522ba <+458>: mov (%r12),%rdx

Guessed possible call stack, based on md_hmac call?

DeriveChildSk
ParentSkToLamportPK
IKMToLamportSk
ExtractExpand
Extract or Expand

static void Extract(uint8_t* prk_output, const uint8_t* salt, const size_t saltLen, const uint8_t* ikm, const size_t ikm_len) {
    // assert(saltLen == 4);  // Used for EIP2333 key derivation
    // assert(ikm_len == 32);  // Used for EIP2333 key derivation
    // Hash256 used as the hash function (sha256)
    // PRK Output is 32 bytes (HashLen)
    md_hmac(prk_output, ikm, ikm_len, salt, saltLen);
}

static void Expand(uint8_t* okm, size_t L, const uint8_t* prk, const uint8_t* info, const size_t infoLen) {
    assert(L <= 255 * HASH_LEN); // L <= 255 * HashLen
    assert(infoLen >= 0);
    size_t N = ceil((double)L / HASH_LEN);
    size_t bytesWritten = 0;

    uint8_t* T = Util::SecAlloc<uint8_t>(HASH_LEN);
    uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);
    uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

    assert(N >= 1 && N <= 255);

    for (size_t i = 1; i <= N; i++) {
        if (i == 1) {
            memcpy(hmacInput1, info, infoLen);
            hmacInput1[infoLen] = i;
            md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
        } else {
            memcpy(hmacInput, T, HASH_LEN);
            memcpy(hmacInput + HASH_LEN, info, infoLen);
            hmacInput[HASH_LEN + infoLen] = i;
            md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
        }
        size_t to_write = L - bytesWritten;
        if (to_write > HASH_LEN) {
            to_write = HASH_LEN;
        }
        assert (to_write > 0 && to_write <= HASH_LEN);
        memcpy(okm + bytesWritten, T, to_write);
        bytesWritten += to_write;
    }
    Util::SecFree(T);
    Util::SecFree(hmacInput1);
    Util::SecFree(hmacInput);
    assert(bytesWritten == L);
}


static void ExtractExpand(uint8_t* output, size_t outputLen,
                          const uint8_t* key, size_t keyLen,
                          const uint8_t* salt, size_t saltLen,
                          const uint8_t* info, size_t infoLen) {
    uint8_t* prk = Util::SecAlloc<uint8_t>(HASH_LEN);
    HKDF256::Extract(prk, salt, saltLen, key, keyLen);
    HKDF256::Expand(output, outputLen, prk, info, infoLen);
    Util::SecFree(prk);
}
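The Extract/Expand pair above is HKDF-SHA256 as in RFC 5869, which makes it easy to model. A rough Python sketch for reasoning about the accounting (this is a model, not the bls implementation; `extract_expand` loosely mirrors the C++ parameter order):

```python
import hmac
import hashlib

HASH_LEN = 32  # SHA-256 output size

def extract(salt: bytes, ikm: bytes) -> bytes:
    # PRK = HMAC-SHA256(salt, IKM)
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def expand(prk: bytes, info: bytes, L: int) -> bytes:
    # N = ceil(L / HashLen); the C++ computes this via double + ceil
    N = -(-L // HASH_LEN)
    assert 1 <= N <= 255
    okm, T = b"", b""
    for i in range(1, N + 1):
        # i == 1: T = HMAC(PRK, info || 0x01)    (the hmacInput1 branch)
        # i  > 1: T = HMAC(PRK, T || info || i)  (the hmacInput branch)
        T = hmac.new(prk, T + info + bytes([i]), hashlib.sha256).digest()
        okm += T
    return okm[:L]

def extract_expand(L: int, key: bytes, salt: bytes, info: bytes) -> bytes:
    return expand(extract(salt, key), info, L)
```

In this model the DeriveChildSk-style call with L = 32 * 255 and empty info returns exactly 8160 bytes, so the bytesWritten bookkeeping itself looks fine.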

Could that N double + ceil be causing mischief? Or maybe a sodium_malloc issue?
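On the double + ceil question specifically: for this path the division is exact, so the rounding can't go wrong (quick check, using the L = 32 * 255 from the call below):

```python
import math

HASH_LEN = 32
L = 32 * 255                    # outputLamportSk length passed to ExtractExpand
assert L / HASH_LEN == 255.0    # exact in a double: the divisor is a power of two
N = math.ceil(L / HASH_LEN)
print(N)                        # 255, so the 1 <= N <= 255 assert holds
```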

mov %r13,%rsi means movups %xmm2,0x0(%r13) is crashing on a reference to the second parameter to md_hmac, which is ikm (passed in as key) or hmacInput1/hmacInput (which could have problems if N were wrong).

Maybe memcpy is having issues with the zero length info from here

    const uint8_t info[1] = {0};
    HKDF256::ExtractExpand(outputLamportSk, 32 * 255, ikm, ikmLen, salt, saltLen, info, 0);

All just guesses

wjblanke  ·  3 Nov 2020
0

0x000000000005226d <+381>: cmp $0x1,%r14 // if (i == 1)
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968> if (i == 1)
so i != 1:
} else {
memcpy(hmacInput, T, HASH_LEN);
memcpy(hmacInput + HASH_LEN, info, infoLen);
hmacInput[HASH_LEN + infoLen] = i;
md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
}
0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx // prk (md_hmac param - RCX)
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi // hmacInput (md_hmac param - RSI)
0x0000000000052284 <+404>: mov %r12,%rdi // T (md_hmac param - RDI)
0x0000000000052287 <+407>: movdqu (%r12),%xmm2 // Move first 16 bytes of T to xmm2
0x000000000005228d <+413>: mov $0x20,%r8d // 32 (HASHLEN md_hmac param = R8D)
SegFault 0x0000000000052293 <+419>: movups %xmm2,0x0(%r13) // Write first 16 bytes of T to hmacInput
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3 // Move second 16 bytes of T to xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13) // hmacInput[HASH_LEN + infoLen] = i;
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13) // Write second 16 bytes of T to hmacInput
md_hmac 0x00000000000522a8 <+440>: callq 0x26310 [email protected]

It isn't happy copying T to hmacInput and fails on the first write. Since hmacInput allocates 33 bytes using sodium_malloc, the memory will be unaligned. memcpy shouldn't care about that, though (it is using unaligned movups). infoLen = 0 has been optimized out.
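The unaligned part follows from how sodium_malloc lays memory out: the user region ends flush against the trailing guard page, so a 33-byte request can't be 16-byte aligned. Arithmetic sketch (the guard-page address here is made up; only the modular arithmetic matters):

```python
PAGE = 4096

def sodium_user_ptr(size: int, guard_page: int = 0x7f0000042000) -> int:
    # sodium_malloc returns guard_page - size, so the user region
    # ends exactly where the trailing guard page begins
    return guard_page - size

print(sodium_user_ptr(33) % 16)   # 15: a 33-byte SecAlloc is unaligned
print(sodium_user_ptr(32) % 16)   # 0: multiples of the alignment are fine
```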

uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

Hmm, this must be what's failing. Maybe we are running into this:

https://github.com/jedisct1/libsodium/issues/901

It might be worth trying setrlimit.
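If anyone wants to experiment with the setrlimit idea, a minimal sketch, assuming the relevant limit is RLIMIT_MEMLOCK since sodium_malloc mlocks its pages (whether that is the limit actually being hit here is a guess):

```python
import resource

# Raise the soft mlock limit to the hard limit; an unprivileged process
# may move its soft limit anywhere in [0, hard].
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
```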

wjblanke  ·  3 Nov 2020
0

OK, a different spot from above: this one crashed at 0x528B8.

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00

The code above branched to <+1968> because i == 1:
{
memcpy(hmacInput1, info, infoLen); // optimized out
hmacInput1[infoLen] = i;
md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
}
0x00000000000528a0 <+1968>: mov 0x30(%rsp),%rax
0x00000000000528a5 <+1973>: mov 0x28(%rsp),%rcx
0x00000000000528aa <+1978>: mov $0x1,%edx
0x00000000000528af <+1983>: mov %r12,%rdi
0x00000000000528b2 <+1986>: mov $0x20,%r8d
SegFault 0x00000000000528b8 <+1992>: movb $0x1,(%rax) // hmacInput1[infoLen] = i;
0x00000000000528bb <+1995>: mov %rax,%rsi
0x00000000000528be <+1998>: callq 0x26310

This is the i == 1 case; here it is trying to write into hmacInput1, which is much the same situation. So this allocation fails too?

uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);
wjblanke  ·  3 Nov 2020
0

Here are a few observations, just from lightly browsing the code at: https://github.com/Chia-Network/bls-signatures/blob/d1e8f892d1941ff38da08a85cf17fa2e40f4ea2a/src/hkdf.hpp#L46

  • Util::SecAlloc() should probably throw an exception if it fails to allocate memory. There's no checking for null right now.
  • md_hmac() can throw an exception if it fails to allocate stack space (here). If it does, we may leak all allocations made with SecAlloc(). So, those allocations should really be held by a std::unique_ptr<>. (it's a bit hard to say exactly what happens actually, RLC_THROW is defined here)
  • each call to SecAlloc() (when configured to use libsodium) is a call to sodium_malloc(), which allocates 3 or 4 pages; that seems quite excessive. Given that these allocations are only used in this stack frame, they could presumably be merged into a single allocation, saving time, physical memory, and virtual address space.
arvidn  ·  3 Nov 2020
0

I am now thinking this is caused by a sodium_malloc leak. Stock Ubuntu will only tolerate about 15000 of these leaked sodium allocations, and it looks like BNWrapper may have a leak.
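The ~15000 figure lines up with the kernel's default mapping limit: each sodium_malloc costs several VMAs (guard pages plus the data region split out by mprotect), and Linux defaults vm.max_map_count to 65530. Back-of-the-envelope, with the 4-mappings-per-allocation cost as an estimate:

```python
MAX_MAP_COUNT = 65530   # Linux default vm.max_map_count
MAPS_PER_ALLOC = 4      # estimated VMAs per sodium_malloc (guards + data)

print(MAX_MAP_COUNT // MAPS_PER_ALLOC)   # 16382 leaked allocations until mmap fails
```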

wjblanke  ·  4 Nov 2020
0

@wjblanke has a potential fix for the crashing farmer. Can folks try the farm-crash-test https://github.com/Chia-Network/chia-blockchain/tree/farm-crash-test branch?

`git fetch; git checkout farm-crash-test; git pull; sh install.sh` etc...

I can make a Windows installer available too if anyone has seen this problem on Windows.

hoffmang9  ·  4 Nov 2020
0

I ran the farm-crash-test branch on the farmer overnight; it still crashed (ran from 20:50 to 05:49).
I only updated the farmer. Do I need to get the harvesters on the same branch?

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

lamiaceae2  ·  4 Nov 2020
0

You actually need to update the bls library (blspy is its Python name). The likely-fixed version is 0.2.5 - https://pypi.org/project/blspy/#history

I modified setup.py in the branch above to pull that in when you run install.sh. If you wait about 20 minutes, I'll have one more possible fix in version 0.2.6 of blspy.

hoffmang9  ·  4 Nov 2020
0

Ok, I've updated the farm-crash-test branch with the newest blspy, 0.2.6. Please pull and run that one.

hoffmang9  ·  4 Nov 2020
0

I have been running the new version since 11:30 this morning (6.5 hours ago) with no crash yet. I will let it run through the night and report back tomorrow morning.

lamiaceae2  ·  5 Nov 2020
0

Unfortunately it crashed again. Let me know if I need to post any logs/dmesg output.

lamiaceae2  ·  5 Nov 2020
0

Plots are winding down now, so I will try out farm-crash-test today. One farmer did fail on regular b17 again, this time with just two plotters running the full stack.

ToovaTech  ·  5 Nov 2020
0

beta-1.0b18.dev19 (blspy 0.2.5?) crashed after 24 hours of running.

etemiz  ·  5 Nov 2020
0

Yes, please post the dmesg crash info.

wjblanke  ·  5 Nov 2020
0

This is the dmesg of my latest crash:

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[225548.021245] perf: interrupt took too long (4951 > 4948), lowering kernel.perf_event_max_sample_rate to 40250
[258089.360173] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

lamiaceae2  ·  5 Nov 2020
0

Hmm, same spot. We've found another leak; more info soon. I wonder why it picks this particular spot (well, and one other we know about) to crash, though.

wjblanke  ·  6 Nov 2020
0

I also get the same kind of crash with `chia start harvester`:

[141362.172271] chia_harvester[89830]: segfault at 0 ip 00007f4e20523293 sp 00007ffde54b62d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f4e204d1000+106000]
[141362.172281] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
eFishCent  ·  7 Nov 2020
0

I've just pushed a potentially fixed blspy. Please git pull the farm-crash-test branch and run it as you can.

hoffmang9  ·  7 Nov 2020
0

Apologies for the delay in reporting back; I ended up reinstalling Linux on three of the harvesters to sort out the Python issues (these were upgrades from Ubuntu 19.x to 20.04).

I have had the farm-crash-test branch running on my farmer and all the harvesters for the last 9 hours, still going without any errors (it typically crashed around 6 hours in). I will keep an eye on it during our day.

lamiaceae2  ·  8 Nov 2020
0

I've got two machines running strong on dev20 farm-crash-test for 3+ days. I've just installed dev21 on a third machine that had only been plotting.

ToovaTech  ·  9 Nov 2020
0

Nearly 24 hours on the latest farm-crash-test branch and everything is still working fine. I ran the farmer initially from the command line and then via the GUI; both ran without issue.

lamiaceae2  ·  9 Nov 2020
2

1.0b18.dev21 blspy 0.2.7 running for 47 hours without a problem.

etemiz  ·  9 Nov 2020
0

I merged this into the dev branch. If anyone comes to this issue fresh, that's the branch you now need, not the farm-crash-test branch. Feel free to keep running farm-crash-test if you have the most recent version, however.

I'm going to leave this open for another couple of days to make sure we don't see any more of this over longer time periods.

hoffmang9  ·  9 Nov 2020
2

I have been running farm-crash-test for 5 days straight with no seg fault on the harvester. Peak memory usage is 983MB for me, and current usage is 928MB. What is everyone seeing as peak and current memory usage for their harvester?
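For comparing numbers: on Linux, peak and current resident memory can be read from /proc. A quick hypothetical helper (`mem_usage` is mine, not a chia tool); point it at the chia_harvester pid from ps:

```python
import os

def mem_usage(pid: int):
    """Return (VmHWM, VmRSS) strings from /proc/<pid>/status (Linux only)."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields[key] = rest.strip()
    # VmHWM is the peak resident set size, VmRSS the current one
    return fields.get("VmHWM"), fields.get("VmRSS")

peak, current = mem_usage(os.getpid())
```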

eFishCent  ·  12 Nov 2020
0

Having heard no further reports of failure, I'm closing this as fixed for the next release.

hoffmang9  ·  23 Nov 2020